| Title: | Biclustering via Simplivariate Component Analysis |
|---|---|
| Description: | Identifies constant, additive, multiplicative, and user-defined simplivariate components in numeric data matrices using a genetic algorithm. Supports flexible pattern definitions and provides visualization for general biclustering applications across diverse domains. The method builds on simplivariate models as introduced in Hageman et al. (2008) <doi:10.1371/journal.pone.0003259> and is related to biclustering frameworks as reviewed by Madeira and Oliveira (2004) <doi:10.1109/TCBB.2004.2>. |
| Authors: | Jos Hageman [aut, cre] |
| Maintainer: | Jos Hageman <[email protected]> |
| License: | GPL-3 |
| Version: | 1.0.0 |
| Built: | 2026-06-01 11:29:49 UTC |
| Source: | https://github.com/joshageman/simplica |
Constructs a matrix with an additive structure based on the row means, column means, and the grand mean of the original matrix.
additiveMatrix(mat)additiveMatrix(mat)
mat |
A numeric matrix or data frame with numeric entries. |
The result is an approximation of the original matrix, where:
This model captures main additive effects of rows and columns, commonly used in exploratory data analysis or baseline modeling.
A numeric matrix of the same dimension as mat, approximating it
using an additive model.
m <- matrix(c(1, 2, 3, 4), nrow = 2) additiveMatrix(m)m <- matrix(c(1, 2, 3, 4), nrow = 2) additiveMatrix(m)
Fits an additive pattern to matrix data using training data specified by a mask. The pattern is based on row and column main effects, creating an additive model where each cell value is the sum of a row effect and column effect minus the overall mean.
additiveMatrixFitter(mat, trainMask)additiveMatrixFitter(mat, trainMask)
mat |
Numeric matrix containing the data to fit |
trainMask |
Logical matrix of same dimensions as mat, indicating which cells to use for training |
The function creates an additive pattern by:
Masking non-training cells as NA to compute statistics only on training data
Computing overall mean of training data
Computing row means and column means from training data
Replacing non-finite means with overall mean as fallback
Creating additive pattern as outer sum of row and column effects minus overall mean
Numeric matrix of same dimensions as mat containing the fitted additive pattern
Performs pure cross-validation over specified patterns with mandatory fitters. This function evaluates different pattern fitting models using cross-validation to determine the best model for a given data subset.
componentCVPatterns( df, rows, cols, patternFunctions, patternFitters, preferenceOrder = names(patternFunctions), nRepeats = 40, testFraction = 0.2, minCellsForModels = 25, parsimonyMargin = 0.05, requireFitters = TRUE, verbose = FALSE )componentCVPatterns( df, rows, cols, patternFunctions, patternFitters, preferenceOrder = names(patternFunctions), nRepeats = 40, testFraction = 0.2, minCellsForModels = 25, parsimonyMargin = 0.05, requireFitters = TRUE, verbose = FALSE )
df |
A matrix or data frame containing the data |
rows |
Row indices to subset from df |
cols |
Column indices to subset from df |
patternFunctions |
A named list of pattern functions to evaluate |
patternFitters |
A named list of fitter functions corresponding to each pattern |
preferenceOrder |
Character vector specifying the preference order of patterns (default: names of patternFunctions) |
nRepeats |
Integer, number of cross-validation repeats (default: 40) |
testFraction |
Numeric, fraction of data to use for testing in each CV fold (default: 0.2) |
minCellsForModels |
Integer, minimum number of cells required for reliable CV (default: 25) |
parsimonyMargin |
Numeric, margin for parsimony selection as fraction (default: 0.05) |
requireFitters |
Logical, whether to require fitters for all patterns (default: TRUE) |
verbose |
Logical, whether to print progress messages (default: FALSE) |
A list containing:
decision |
Character, the selected best pattern name |
reason |
Character, explanation of the selection reasoning |
cv |
Data frame with CV summary statistics for each model |
repeats |
Data frame with detailed results from each CV repeat |
meta |
List with metadata about the CV procedure |
Returns a list of default pattern fitting functions used in SIMPLICA. These fitters estimate different types of patterns (constant, additive, multiplicative) from matrix data using specified training cells.
defaultPatternFitters()defaultPatternFitters()
Each fitter function takes two arguments:
mat: Numeric matrix containing the data to fit
trainMask: Logical matrix indicating which cells to use for training
All fitters return a fitted matrix of the same dimensions as the input.
A named list containing pattern fitting functions:
constant: Fits a constant value (mean of training data)
additive: Fits an additive pattern using additiveMatrixFitter
multiplicative: Fits a multiplicative pattern using multiplicativeMatrixFitter
# Retrieve default pattern fitters fitters <- defaultPatternFitters() # Add a custom diagonal pattern fitter diagonalFitter <- function(mat, trainMask) { # Extract diagonal values from training data minDim <- min(nrow(mat), ncol(mat)) diagIndices <- cbind(1:minDim, 1:minDim) # Only use diagonal elements that are in the training mask validDiag <- trainMask[diagIndices] if (any(validDiag)) { diagVal <- mean(mat[diagIndices][validDiag]) } else { diagVal <- mean(mat[trainMask]) # fallback to overall mean } matrix(diagVal, nrow = nrow(mat), ncol = ncol(mat)) } # Extend the list with your own pattern fitters$diagonal <- diagonalFitter# Retrieve default pattern fitters fitters <- defaultPatternFitters() # Add a custom diagonal pattern fitter diagonalFitter <- function(mat, trainMask) { # Extract diagonal values from training data minDim <- min(nrow(mat), ncol(mat)) diagIndices <- cbind(1:minDim, 1:minDim) # Only use diagonal elements that are in the training mask validDiag <- trainMask[diagIndices] if (any(validDiag)) { diagVal <- mean(mat[diagIndices][validDiag]) } else { diagVal <- mean(mat[trainMask]) # fallback to overall mean } matrix(diagVal, nrow = nrow(mat), ncol = ncol(mat)) } # Extend the list with your own pattern fitters$diagonal <- diagonalFitter
Returns a named list of default matrix approximation functions used to score component patterns. Each function must take a numeric matrix (i.e., a component: subset of rows and columns) and return a matrix of the same dimensions that approximates the original matrix according to a specific structural pattern (e.g., constant, additive, etc.).
defaultPatternFunctions()defaultPatternFunctions()
This list can be passed to fitness2() via the patternFunctions argument.
Users can extend or override the default patterns by modifying the returned list.
Custom pattern functions must:
Take a numeric matrix as input.
Return a numeric matrix of the same dimensions.
Be compatible with sum(abs(...)) and sum((...)^2) operations for fitness scoring.
A named list of functions, each representing a matrix approximation method.
# Retrieve default pattern functions patterns <- defaultPatternFunctions() # Add a custom pattern based on diagonal structure diagonalPattern <- function(m) { diagVal <- mean(diag(as.matrix(m))) matrix(diagVal, nrow = nrow(m), ncol = ncol(m)) } # Extend the list with your own pattern patterns$diagonal <- diagonalPattern# Retrieve default pattern functions patterns <- defaultPatternFunctions() # Add a custom pattern based on diagonal structure diagonalPattern <- function(m) { diagVal <- mean(diag(as.matrix(m))) matrix(diagVal, nrow = nrow(m), ncol = ncol(m)) } # Extend the list with your own pattern patterns$diagonal <- diagonalPattern
Fitness function with automatic pattern selection per Simplivariate Component
fitness( string, df, dfMean, penalty, patternFunctions = defaultPatternFunctions(), returnPatterns = FALSE, ... )fitness( string, df, dfMean, penalty, patternFunctions = defaultPatternFunctions(), returnPatterns = FALSE, ... )
string |
Vector with length nrow(df) + ncol(df): component labels for rows and columns (in this order). |
df |
Numeric matrix: full data. |
dfMean |
Scalar: global mean of df. |
penalty |
Named vector with penalty weights per pattern. |
patternFunctions |
Named list of functions returning pattern-based approximations. |
returnPatterns |
Logical: if TRUE, also returns chosen pattern per component. |
... |
Additional arguments passed to the GA functions |
Either total fitness (numeric), or list(fitness, componentPatterns) if returnPatterns = TRUE.
Compute best pattern-based fitness for a single Simplivariate Component
fitnessForOneComponent(mat, dfMean, patternFunctions, penalty)fitnessForOneComponent(mat, dfMean, patternFunctions, penalty)
mat |
A numeric matrix (the component) |
dfMean |
Overall mean of the full data matrix |
patternFunctions |
Named list of functions for structure types |
penalty |
Named numeric vector of penalties per pattern type |
Numeric fitness value (higher is better)
m <- matrix(rnorm(100, mean = 10), nrow = 10) f <- fitnessForOneComponent(m, mean(m), defaultPatternFunctions(), c(constant = 0, additive = 1.0, multiplicative = 0))m <- matrix(rnorm(100, mean = 10), nrow = 10) f <- fitnessForOneComponent(m, mean(m), defaultPatternFunctions(), c(constant = 0, additive = 1.0, multiplicative = 0))
Applies mutation to a selected parent vector by replacing each gene with a random value (within bounds) with a given mutation probability. Used in integer-encoded GAs.
gaintegerMutation(object, parent, ...)gaintegerMutation(object, parent, ...)
object |
A GA object containing at least the slots |
parent |
An integer index indicating which individual in the population to mutate. |
... |
Further arguments (unused, included for compatibility). |
A numeric vector representing the mutated individual.
Performs one-point crossover on two parent individuals. A single crossover point is selected, and all genes before (and including) that point are exchanged between the parents.
gaintegerOnePointCrossover(object, parents, ...)gaintegerOnePointCrossover(object, parents, ...)
object |
A GA object with a |
parents |
A 2-row matrix of values indexing the parents from the current population. |
... |
Further arguments (unused, included for compatibility). |
A list with two elements:
A 2-row matrix of the resulting offspring.
A numeric vector of NA values to be replaced by fitness evaluation.
Creates a factory function for generating initial populations for genetic algorithms with integer chromosomes, where a specified fraction of variables are set to zero.
gaintegerPopulationFactory(zeroFraction, verbose = FALSE)gaintegerPopulationFactory(zeroFraction, verbose = FALSE)
zeroFraction |
Numeric value between 0 and 1 specifying the fraction of variables to set to zero in each individual |
verbose |
Logical indicating whether to print information about zero fraction |
A function that takes a GA object and returns an initial population matrix
This file declares global variables used in the SIMPLICA package to avoid R CMD check notes about "no visible binding for global variable". These variables are typically used within ggplot2 aesthetics and data manipulation functions where they refer to column names in data frames created during execution.
Approximates a data matrix using a low-rank multiplicative model based on a fixed outer product of centered row and column effects.
multiplicativeMatrix(mat)multiplicativeMatrix(mat)
mat |
A numeric matrix with values to approximate. |
The model assumes:
where mu is the overall mean of the input matrix, and
rowEffect and colEffect are centered integer sequences.
A numeric matrix of the same size as mat, containing the fitted values.
Fits a multiplicative pattern to matrix data using training data specified by a mask. The pattern is based on the outer product of centered row and column effects, creating a bilinear surface that can capture multiplicative interactions between rows and columns.
multiplicativeMatrixFitter(mat, trainMask)multiplicativeMatrixFitter(mat, trainMask)
mat |
Numeric matrix containing the data to fit |
trainMask |
Logical matrix of same dimensions as mat, indicating which cells to use for training |
The function creates a multiplicative pattern by:
Computing the mean of training data as baseline
Creating centered row effects (0 to nRows-1, mean-centered)
Creating centered column effects (0 to nCols-1, mean-centered)
Taking outer product to form bilinear pattern
Fitting scaling coefficient using least squares on training data
Returning baseline plus scaled pattern
Numeric matrix of same dimensions as mat containing the fitted multiplicative pattern
Visualizes GA-detected simplivariate components on the original matrix as outlined cells, colored by pattern type.
plotComponentResult( df, string, componentPatterns, componentScores, scoreCutoff = 0, showAxisLabels = TRUE, showComponentLabels = TRUE, title = "Detected Components", rearrange = FALSE, grayscale = TRUE )plotComponentResult( df, string, componentPatterns, componentScores, scoreCutoff = 0, showAxisLabels = TRUE, showComponentLabels = TRUE, title = "Detected Components", rearrange = FALSE, grayscale = TRUE )
df |
Original data matrix |
string |
Best GA string (vector of length nrow(df) + ncol(df); rows first, then cols) |
componentPatterns |
Vector of component types (from fitness(..., returnPatterns = TRUE)) |
componentScores |
Vector of fitness scores per component |
scoreCutoff |
Minimum score a component must have to be shown (default: 0 = show all) |
showAxisLabels |
Logical: show axis tick labels (default: TRUE) |
showComponentLabels |
Logical: show component labels inside clusters (default: TRUE) |
title |
Title for the plot (default: "Detected Components") |
rearrange |
Logical: reorder rows and columns to group components (default: FALSE) |
grayscale |
Logical: use grayscale for heatmap background (default: TRUE) |
ggplot object
Print method for summaryComponents
## S3 method for class 'summaryComponents' print(x, showDetails = FALSE, maxLinesDetails = 200L, ...)## S3 method for class 'summaryComponents' print(x, showDetails = FALSE, maxLinesDetails = 200L, ...)
x |
An object produced by |
showDetails |
Logical. If TRUE, also print row/column indices per component. |
maxLinesDetails |
Integer. Max number of indices to print per dimension (rows/cols) before truncation (default 200). |
... |
Further arguments passed to or from other methods (not used here). |
The input object x, invisibly. Called for its side effect of
printing a formatted component summary to the console.
Creates a monitoring function that prints the current generation and the best fitness score
to the console at specified intervals. Intended for use as a monitor function in GA runs.
SCAMonitorFactory(interval = 100)SCAMonitorFactory(interval = 100)
interval |
An integer specifying the interval for printing progress updates. Default is 100 (prints every 100 generations). |
A monitoring function that can be used with GA. The returned function takes a GA object and prints progress information at the specified interval.
# Create monitor that prints every 100 generations (default) monitor <- SCAMonitorFactory() # ga(..., monitor = monitor) # Create monitor that prints every 50 generations monitor <- SCAMonitorFactory(50) # ga(..., monitor = monitor)# Create monitor that prints every 100 generations (default) monitor <- SCAMonitorFactory() # ga(..., monitor = monitor) # Create monitor that prints every 50 generations monitor <- SCAMonitorFactory(50) # ga(..., monitor = monitor)
Implements the SIMPLICA algorithm to identify Simplivariate Components in data matrices using a genetic algorithm. These components are related to clusters or biclusters, but defined here in terms of specific structural patterns (constant, additive, multiplicative, or user-defined).
simplica( df, maxIter = 2000, popSize = 300, pCrossover = 0.6, pMutation = 0.03, zeroFraction = 0.9, elitism = 100, numSimComp = 5, verbose = FALSE, mySeeds = 1:5, interval = 100, penalty = c(constant = 0, additive = 1, multiplicative = 0), patternFunctions = defaultPatternFunctions(), doSimplicaCV = TRUE, cvControl = NULL )simplica( df, maxIter = 2000, popSize = 300, pCrossover = 0.6, pMutation = 0.03, zeroFraction = 0.9, elitism = 100, numSimComp = 5, verbose = FALSE, mySeeds = 1:5, interval = 100, penalty = c(constant = 0, additive = 1, multiplicative = 0), patternFunctions = defaultPatternFunctions(), doSimplicaCV = TRUE, cvControl = NULL )
df |
A numeric data matrix to analyze |
maxIter |
Maximum number of generations for the genetic algorithm (default: 2000) |
popSize |
Population size for the genetic algorithm (default: 300) |
pCrossover |
Crossover probability for genetic algorithm (default: 0.6) |
pMutation |
Mutation probability for genetic algorithm (default: 0.03) |
zeroFraction |
Fraction of population initialized with zeros (default: 0.9) |
elitism |
Number of best individuals preserved between generations (default: 100) |
numSimComp |
Number of Simplivariate Components simultaneously optimized (default: 5) |
verbose |
Logical, whether to print SIMPLICA progress information (default: FALSE) |
mySeeds |
Vector of random seeds for replicate runs (default: 1:5) |
interval |
Interval for monitoring GA progress (default: 100) |
penalty |
Named vector of penalty values for each pattern type (default: c(constant = 0, additive = 1, multiplicative = 0)) |
patternFunctions |
List of pattern functions used for fitness evaluation (default: defaultPatternFunctions()) |
doSimplicaCV |
Logical, run cross-validated relabeling with simplicaCV() after GA (default: TRUE) |
cvControl |
Optional list to tune simplicaCV; fields passed to simplicaCV via do.call. Defaults if omitted:
|
A list with:
best: simplica object (includes original GA result; if doSimplicaCV=TRUE, also componentPatternsUpdated and componentAudit)
raw: list of "ga" objects (one per seed, from the GA package)
Hageman, J. A., Wehrens, R., & Buydens, L. M. C. (2008). "Simplivariate Models: Ideas and First Examples." PLoS ONE, 3(9), e3259. doi:10.1371/journal.pone.0003259
Madeira, S. C., & Oliveira, A. L. (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey." IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 24–45. doi:10.1109/TCBB.2004.2
data("simplicaToy") # Minimal run just to demonstrate function usage, run with default GA parameters fit <- simplica(df = simplicaToy$data, maxIter = 200, popSize = 50, mySeeds = 1, elitism = 1, verbose = TRUE) plotComponentResult(df = simplicaToy$data, string = fit$best$string, componentPatterns = fit$best$componentPatternsUpdated, componentScores = fit$best$componentScores, showAxisLabels = FALSE, title = "SIMPLICA on simplicaToy", scoreCutoff = 25000)data("simplicaToy") # Minimal run just to demonstrate function usage, run with default GA parameters fit <- simplica(df = simplicaToy$data, maxIter = 200, popSize = 50, mySeeds = 1, elitism = 1, verbose = TRUE) plotComponentResult(df = simplicaToy$data, string = fit$best$string, componentPatterns = fit$best$componentPatternsUpdated, componentScores = fit$best$componentScores, showAxisLabels = FALSE, title = "SIMPLICA on simplicaToy", scoreCutoff = 25000)
This function performs cross-validation-based pattern testing for Simplivariate Components in a SIMPLICA object. It evaluates different pattern functions using cross-validation and selects the best performing pattern for each component. Fitters are required for all patterns with no fallback options.
simplicaCV( foundObject, df, patternFunctions = defaultPatternFunctions(), patternFitters = defaultPatternFitters(), preferenceOrder = names(patternFunctions), nRepeats = 40, testFraction = 0.2, minCellsForModels = 25, parsimonyMargin = 0.05, requireFitters = TRUE, updateObject = TRUE, verbose = FALSE, ignoreNaComponents = TRUE )simplicaCV( foundObject, df, patternFunctions = defaultPatternFunctions(), patternFitters = defaultPatternFitters(), preferenceOrder = names(patternFunctions), nRepeats = 40, testFraction = 0.2, minCellsForModels = 25, parsimonyMargin = 0.05, requireFitters = TRUE, updateObject = TRUE, verbose = FALSE, ignoreNaComponents = TRUE )
foundObject |
A simplica object containing Simplivariate Components |
df |
Data frame or matrix with the original data |
patternFunctions |
List of pattern functions to evaluate (default: defaultPatternFunctions()) |
patternFitters |
List of pattern fitting functions (default: defaultPatternFitters()) |
preferenceOrder |
Character vector specifying preference order for pattern selection (default: names(patternFunctions)) |
nRepeats |
Integer, number of cross-validation repeats (default: 40) |
testFraction |
Numeric, fraction of data to use for testing (default: 0.2) |
minCellsForModels |
Integer, minimum number of cells required for model fitting (default: 25) |
parsimonyMargin |
Numeric, margin for parsimony-based model selection (default: 0.05) |
requireFitters |
Logical, whether fitters are required for all patterns (default: TRUE) |
updateObject |
Logical, whether to update and return the input object (default: TRUE) |
verbose |
Logical, whether to print progress messages (default: FALSE) |
ignoreNaComponents |
Logical, whether to skip components with NA patterns (default: TRUE) |
The function performs the following steps:
Validates the input simplica object and data dimensions
Checks that all pattern functions have corresponding fitters
For each simplivariate component, performs cross-validation pattern evaluation
Selects the best performing pattern based on RMSE and parsimony
Updates component patterns and provides detailed test information
If updateObject = TRUE, returns the input simplica object
with two new fields:
componentPatternsUpdatedCharacter vector with the selected
pattern per component after cross-validation. If a component is skipped or empty,
the entry is NA.
componentAuditData frame containing detailed cross-validation results for each component, with the following columns:
componentIdNumeric ID of the component.
originalPatternPattern label originally assigned.
selectedPatternPattern chosen after CV-based evaluation.
reasonExplanation of why a pattern was selected or skipped.
nRows, nCols, nCells
Dimensions of the component.
nRepeats, testFraction, parsimonyMargin
CV settings used.
cvMean_<pattern>Mean RMSE over CV folds for each tested pattern.
cvSd_<pattern>Standard deviation of RMSE across CV folds.
winFrac_<pattern>Fraction of CV repeats where the pattern was the best performer.
If updateObject = FALSE, returns a list with the same two elements
(componentPatternsUpdated, componentAudit).
A small 30×60 matrix to demonstrate SIMPLICA in a controlled setting. Contains one multiplicative and one additive simplivariate component (non-overlapping).
data(simplicaToy)data(simplicaToy)
A list with three elements:
numeric matrix of dimension
list of length 2 with type, rows, cols
character string
data("simplicaToy") str(simplicaToy) image(t(simplicaToy$data))data("simplicaToy") str(simplicaToy) image(t(simplicaToy$data))
Compute a tidy summary of simplivariate Components found by SIMPLICA.
Returns a data.frame with class "summaryComponents" and an attribute holding
row/column index lists for printing details.
summaryComponents(results, scoreCutoff = 0)summaryComponents(results, scoreCutoff = 0)
results |
A list or 'simplica' object with fields:
|
scoreCutoff |
Numeric. Minimum score to include a component (default 0). |
A data.frame with columns:
componentId, pattern, score, rows, cols, size;
class is c("summaryComponents","data.frame"). The attribute
"indices" stores a list with per-component rowIdx and colIdx.