Title: | Kernel Learning Integrative Clustering |
---|---|
Description: | Kernel Learning Integrative Clustering (KLIC) is an algorithm that combines multiple kernels, each representing a different measure of the similarity between a set of observations. The contribution of each kernel to the final clustering is weighted according to the amount of information it carries. As well as providing the functions required to perform the kernel-based clustering, this package also lets the user simply give the data as input: the kernels are then built using consensus clustering. Different strategies for choosing the best number of clusters are also available. For further details, please see Cabassi and Kirk (2020) <doi:10.1093/bioinformatics/btaa593>. |
Authors: | Alessandra Cabassi [aut, cre] , Paul DW Kirk [ths] , Mehmet Gonen [ctb] |
Maintainer: | Alessandra Cabassi |
License: | MIT + file LICENSE |
Version: | 1.0.4 |
Built: | 2024-11-03 04:05:58 UTC |
Source: | https://github.com/acabassi/klic |
Compute the cophenetic correlation coefficient of a kernel matrix, which is a measure of how faithfully hierarchical clustering would preserve the pairwise distances between the original data points.
copheneticCorrelation(kernelMatrix)
kernelMatrix |
kernel matrix. |
This function returns the cophenetic correlation coefficient of the kernel matrix provided as input.
Alessandra Cabassi
Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.
Sokal, R.R. and Rohlf, F.J., 1962. The comparison of dendrograms by objective methods. Taxon, 11(2), pp.33-40.
# Load kernel matrix
consensus_matrix <- as.matrix(read.csv(system.file('extdata',
  'consensus_matrix1.csv', package = 'klic'), row.names = 1))
# Compute cophenetic correlation
coph_corr_coeff <- copheneticCorrelation(consensus_matrix)
cat(coph_corr_coeff)
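To make the quantity more concrete, the following base R sketch computes a cophenetic correlation for a toy kernel matrix. It assumes that similarities are turned into dissimilarities as 1 - K and that average linkage is used; both choices, the toy data, and the helper name `coph_corr_sketch` are assumptions made for illustration and may differ from what `copheneticCorrelation` does internally.

```r
# Toy data: two well-separated groups of 20 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))

# Gaussian kernel used as a toy similarity matrix with entries in (0, 1]
toy_kernel <- exp(-as.matrix(dist(x))^2 / 2)

# Hypothetical helper (assumptions: dissimilarity = 1 - K, average linkage)
coph_corr_sketch <- function(K) {
  d <- as.dist(1 - K)                   # similarities -> dissimilarities
  hc <- hclust(d, method = "average")   # hierarchical clustering
  # Correlate the original dissimilarities with the dendrogram distances
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

ccc <- coph_corr_sketch(toy_kernel)
print(ccc)
```

Values close to 1 indicate that the dendrogram reproduces the pairwise dissimilarities well.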
Perform the training step of kernel k-means.
kkmeans(K, parameters, seed = NULL)
K |
Kernel matrix. |
parameters |
A list containing the number of clusters
|
seed |
The seed used inside the |
This function returns a list containing:
clustering |
the cluster labels for each element (i.e. row/column) of the kernel matrix. |
objective |
the value of the objective function for the given clustering. |
parameters |
same parameters as in the input. |
Mehmet Gonen
Gonen, M. and Margolin, A.A., 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In Advances in Neural Information Processing Systems (pp. 1305-1313).
# Load one dataset with 100 observations, 2 variables, 4 clusters
data <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
  package = "klic"), row.names = 1))
# Compute consensus clustering with K=4 clusters
cm <- coca::consensusCluster(data, 4)
# Shift eigenvalues of the matrix by a constant: (min eigenvalue) * (coeff)
km <- spectrumShift(cm, coeff = 1.05)
# Initialise the parameters of the algorithm
parameters <- list()
# Set the number of clusters
parameters$cluster_count <- 4
# Perform training
state <- kkmeans(km, parameters)
# Display the clustering
print(state$clustering)
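One common way to train kernel k-means is a spectral relaxation: embed the observations using the leading eigenvectors of the kernel matrix, normalise each row, and run ordinary k-means on the embedding. The base R sketch below illustrates that idea on toy data. The helper `kkmeans_sketch` and the toy Gaussian kernel are hypothetical, and this page does not confirm that `kkmeans` follows exactly these steps.

```r
# Hypothetical helper: spectral relaxation of kernel k-means
# (an assumption for illustration, not necessarily the kkmeans() internals)
kkmeans_sketch <- function(K, cluster_count, seed = NULL) {
  # Leading eigenvectors of the symmetric kernel matrix
  H <- eigen(K, symmetric = TRUE)$vectors[, seq_len(cluster_count), drop = FALSE]
  # Normalise each row of the embedding to unit length
  H <- H / sqrt(rowSums(H^2))
  if (!is.null(seed)) set.seed(seed)
  # Ordinary k-means on the embedded observations
  kmeans(H, centers = cluster_count, nstart = 10)$cluster
}

# Toy data: two well-separated groups of 20 points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))
toy_kernel <- exp(-as.matrix(dist(x))^2 / 2)

labels <- kkmeans_sketch(toy_kernel, cluster_count = 2, seed = 42)
print(table(labels))
```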
This function performs Kernel Learning Integrative Clustering on M data sets relating to the same observations. The similarities between the observations in each data set are summarised into M different kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes into account all the available data types, together with a set of weights that sum to one, indicating how much each data set contributed to the kernel k-means clustering.
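As a rough illustration of how one data set can be summarised into a kernel, the sketch below builds a consensus matrix by repeatedly clustering random subsamples and recording how often two observations end up in the same cluster. The 80% subsampling proportion, the use of `kmeans`, and the helper name `consensus_matrix_sketch` are assumptions made for this example; the package examples rely on `coca::consensusCluster`, whose details may differ.

```r
# Hypothetical helper: consensus matrix from repeated subsampling + k-means
consensus_matrix_sketch <- function(data, K, B = 50, prop = 0.8) {
  N <- nrow(data)
  together <- matrix(0, N, N)  # times i and j were put in the same cluster
  sampled  <- matrix(0, N, N)  # times i and j were both in the subsample
  for (b in seq_len(B)) {
    idx <- sample(N, size = floor(prop * N))
    cl <- kmeans(data[idx, , drop = FALSE], centers = K)$cluster
    same <- outer(cl, cl, "==") * 1
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx] <- sampled[idx, idx] + 1
  }
  cm <- together / pmax(sampled, 1)  # pmax avoids division by zero
  diag(cm) <- 1
  cm
}

# Toy data: two groups of 20 observations
set.seed(1)
toy_data <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
                  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2))
cm <- consensus_matrix_sketch(toy_data, K = 2, B = 30)
print(round(cm[c(1, 2, 21), c(1, 2, 21)], 2))
```

Each entry lies between 0 and 1 and can be read as the proportion of subsamples in which two observations were clustered together.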
klic( data, M, individualK = NULL, individualMaxK = 6, individualClAlgorithm = "kkmeans", globalK = NULL, globalMaxK = 6, B = 1000, C = 100, scale = FALSE, savePNG = FALSE, fileName = "klic", verbose = TRUE, annotations = NULL, ccClMethods = "kmeans", ccDistHCs = "euclidean", widestGap = FALSE, dunns = FALSE, dunn2s = FALSE )
data |
List of M datasets, each of size N X P_m, m = 1, ..., M. |
M |
number of datasets. |
individualK |
Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette. |
individualMaxK |
Maximum number of clusters considered for the individual data. Default is 6. |
individualClAlgorithm |
Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans". |
globalK |
Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette. |
globalMaxK |
Maximum number of clusters considered for the final clustering. Default is 6. |
B |
Number of iterations for consensus clustering. Default is 1000. |
C |
Maximum number of iterations for localised kernel k-means. Default is 100. |
scale |
Boolean. If TRUE, each dataset is scaled such that each column has zero mean and unit variance. Default is FALSE. |
savePNG |
Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE. |
fileName |
If |
verbose |
Boolean. Default is TRUE. |
annotations |
Data frame containing annotations for final plot. |
ccClMethods |
The i-th element of this vector goes into the
|
ccDistHCs |
The i-th element of this vector goes into the |
widestGap |
Boolean. If TRUE, also compute the widest gap index to choose the best number of clusters. Default is FALSE. |
dunns |
Boolean. If TRUE, also compute Dunn's index to choose the best number of clusters. Default is FALSE. |
dunn2s |
Boolean. If TRUE, also compute the alternative Dunn's index to choose the best number of clusters. Default is FALSE. |
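The arguments above describe choosing the number of clusters by trying every value from 2 up to a maximum and keeping the one with the largest silhouette. A minimal base R sketch of that selection rule follows. It assumes Euclidean distances and plain `kmeans` as the clustering step, and `mean_silhouette_sketch` is a hypothetical helper; `klic` itself works with kernels and its own clustering algorithms.

```r
# Hypothetical helper: mean silhouette width from distances and labels
mean_silhouette_sketch <- function(d, labels) {
  d <- as.matrix(d)
  s <- numeric(length(labels))
  for (i in seq_along(labels)) {
    own <- labels == labels[i]
    if (sum(own) == 1) next                  # singleton cluster: width 0
    a <- sum(d[i, own]) / (sum(own) - 1)     # mean distance within own cluster
    b <- min(tapply(d[i, !own], labels[!own], mean))  # nearest other cluster
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Toy data: three well-separated groups of 20 observations
set.seed(1)
toy_data <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
                  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
                  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2))
d <- dist(toy_data)

# Try every candidate number of clusters between 2 and max_k
max_k <- 6
widths <- sapply(2:max_k, function(k) {
  labels <- kmeans(toy_data, centers = k, nstart = 20)$cluster
  mean_silhouette_sketch(d, labels)
})
best_k <- (2:max_k)[which.max(widths)]
print(best_k)
```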
The function returns a list containing:
consensusMatrices |
an array containing one consensus matrix per data set. |
weights |
a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix. |
weightedKM |
the weighted kernel matrix obtained by taking a weighted sum of all kernels, where the weights are those specified in the |
globalClusterLabels |
a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices. |
bestK |
a vector containing the best number of clusters between 2 and |
globalK |
the best number of clusters for the final (global) clustering. This is chosen so as to maximise the silhouette and only returned if the final number of clusters |
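Taking the description of `weightedKM` at face value, a weighted sum of kernels can be sketched in a few lines of base R. The kernel matrices and the weights below are made up for illustration; in practice the weights would come from the `weights` element of the output.

```r
# Three toy 4 x 4 kernel matrices stored in an N x N x M array
N <- 4
M <- 3
kernels <- array(0, c(N, N, M))
kernels[, , 1] <- diag(N)              # identity kernel
kernels[, , 2] <- matrix(1, N, N)      # all observations fully similar
kernels[, , 3] <- 0.5 * diag(N) + 0.5  # 1 on the diagonal, 0.5 elsewhere

# Hypothetical weights, one per kernel, summing to one
weights <- c(0.5, 0.3, 0.2)

# Weighted sum of the kernel matrices
weighted_km <- matrix(0, N, N)
for (m in seq_len(M)) {
  weighted_km <- weighted_km + weights[m] * kernels[, , m]
}
print(weighted_km)
```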
Alessandra Cabassi
Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.
if(requireNamespace("Rmosek", quietly = TRUE) &&
   (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))){

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata', 'dataset1.csv',
    package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata', 'dataset2.csv',
    package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata', 'dataset3.csv',
    package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC, assuming that the number of clusters
  # in each individual dataset and in the final clustering is known
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4),
    globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  # Load the true cluster labels
  cluster_labels <- as.matrix(read.csv(system.file('extdata',
    'cluster_labels.csv', package = 'klic'), row.names = 1))

  # Compute ARI
  ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}
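The example above scores the result with the adjusted Rand index (ARI) from `mclust`. For readers without that package, the standard ARI formula can be sketched in base R as below. The helper name `adjusted_rand_index_sketch` and the toy label vectors are hypothetical, and the sketch does not handle the degenerate case where both partitions consist of a single cluster.

```r
# Hypothetical helper: adjusted Rand index from two label vectors
adjusted_rand_index_sketch <- function(a, b) {
  tab <- table(a, b)                        # contingency table
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))          # pairs together in both partitions
  sum_rows <- sum(choose(rowSums(tab), 2))  # pairs together in a
  sum_cols <- sum(choose(colSums(tab), 2))  # pairs together in b
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Same partition with swapped label names: perfect agreement
ari_same <- adjusted_rand_index_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))
# Partitions that disagree on every pair
ari_diff <- adjusted_rand_index_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(ari_same, ari_diff))
```

The index equals 1 for identical partitions (up to relabelling) and is close to 0, or negative, when the agreement is no better than chance.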
Perform the training step of the localised multiple kernel k-means.
lmkkmeans(Km, parameters, verbose = FALSE)
Km |
An array of size N x N x M containing M different N x N kernel matrices. |
parameters |
A list of parameters containing the desired number of clusters, |
verbose |
Boolean flag. If TRUE, at each iteration the iteration number is printed. Default is FALSE. |
This function returns a list containing:
clustering |
the cluster labels for each element (i.e. row/column) of the kernel matrix. |
objective |
the value of the objective function for the given clustering. |
parameters |
same parameters as in the input. |
Theta |
N x M matrix of weights; each row corresponds to an observation and each column to one of the kernels. |
Mehmet Gonen
Gonen, M. and Margolin, A.A., 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In Advances in Neural Information Processing Systems (pp. 1305-1313).
if(requireNamespace("Rmosek", quietly = TRUE) &&
   (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))){

  # Initialise 100 x 100 x 3 array containing M kernel matrices
  # representing three different types of similarities between 100 data points
  km <- array(NA, c(100, 100, 3))

  # Load kernel matrices
  km[,,1] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix1.csv', package = 'klic'), row.names = 1))
  km[,,2] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix2.csv', package = 'klic'), row.names = 1))
  km[,,3] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix3.csv', package = 'klic'), row.names = 1))

  # Initialise the parameters of the algorithm
  parameters <- list()
  # Set the number of clusters
  parameters$cluster_count <- 4
  # Set the number of iterations
  parameters$iteration_count <- 10

  # Perform training
  state <- lmkkmeans(km, parameters)

  # Display the clustering
  print(state$clustering)
  # Display the kernel weights
  print(state$Theta)
}
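In the localised formulation of the cited Gonen and Margolin (2014) paper, each observation has its own kernel weights, and the combined similarity between observations i and j is the sum over kernels of Theta[i, m] * Km[i, j, m] * Theta[j, m]. The sketch below shows how an N x M weight matrix such as `Theta` could be turned into a combined kernel under that formulation. The toy kernels and weights are hypothetical, and this is an illustration rather than the code used inside `lmkkmeans`.

```r
# Two toy 4 x 4 kernel matrices stored in an N x N x M array
N <- 4
M <- 2
Km <- array(0, c(N, N, M))
Km[, , 1] <- diag(N)           # identity kernel
Km[, , 2] <- matrix(1, N, N)   # all-ones kernel

# Hypothetical observation-specific weights: one row per observation,
# one column per kernel, each row summing to one
Theta <- cbind(c(1, 1, 0.5, 0),
               c(0, 0, 0.5, 1))

# Combined kernel: sum over m of (theta_m theta_m^T) * K_m, element-wise
K_theta <- matrix(0, N, N)
for (m in seq_len(M)) {
  K_theta <- K_theta + outer(Theta[, m], Theta[, m]) * Km[, , m]
}
print(K_theta)
```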
Perform the training step of the localised multiple kernel k-means.
lmkkmeans_missingData(Km, parameters, missing = NULL, verbose = FALSE)
Km |
Array of size N x N x M containing M different N x N kernel matrices. |
parameters |
A list of parameters containing the desired number of clusters, |
missing |
Matrix of size N x M containing missingness indicators, i.e. missing[i,j] = 1 (or = TRUE) if observation |
verbose |
Boolean flag. If TRUE, at each iteration the iteration number is printed. Defaults to FALSE. |
This function returns a list containing:
clustering |
the cluster labels for each element (i.e. row/column) of the kernel matrix. |
objective |
the value of the objective function for the given clustering. |
parameters |
same parameters as in the input. |
Theta |
N x M matrix of weights; each row corresponds to an observation and each column to one of the kernels. |
Mehmet Gonen, Alessandra Cabassi
Gonen, M. and Margolin, A.A., 2014. Localized data fusion for kernel k-means clustering with application to cancer biology. In Advances in Neural Information Processing Systems (pp. 1305-1313).
if(requireNamespace("Rmosek", quietly = TRUE) &&
   (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))){

  # Initialise 100 x 100 x 3 array containing M kernel matrices
  # representing three different types of similarities between 100 data points
  km <- array(NA, c(100, 100, 3))

  # Load kernel matrices
  km[,,1] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix1.csv', package = 'klic'), row.names = 1))
  km[,,2] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix2.csv', package = 'klic'), row.names = 1))
  km[,,3] <- as.matrix(read.csv(system.file('extdata',
    'kernel_matrix3.csv', package = 'klic'), row.names = 1))

  # Introduce some missing data
  km[76:80, , 1] <- NA
  km[, 76:80, 1] <- NA

  # Define missingness indicators
  missing <- matrix(FALSE, 100, 3)
  missing[76:80,1] <- TRUE

  # Initialise the parameters of the algorithm
  parameters <- list()
  # Set the number of clusters
  parameters$cluster_count <- 4
  # Set the number of iterations
  parameters$iteration_count <- 10

  # Perform training
  state <- lmkkmeans_missingData(km, parameters, missing)

  # Display the clustering
  print(state$clustering)
  # Display the kernel weights
  print(state$Theta)
}
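In the example above the missingness indicators are written by hand. When the kernel array already contains NA rows and columns, the indicators could instead be derived from the array, as in this small sketch. It assumes that an observation counts as missing from a kernel when its entire row in that kernel is NA; the array here is random toy data and the variable names are hypothetical.

```r
# Toy N x N x M kernel array with random entries
N <- 6
M <- 2
set.seed(1)
Km <- array(runif(N * N * M), c(N, N, M))

# Observations 2 and 3 are missing from the first kernel
Km[2:3, , 1] <- NA
Km[, 2:3, 1] <- NA

# Assumption: observation i is missing from kernel m when the whole
# i-th row of that kernel is NA. apply() over the third dimension
# returns one logical column per kernel, giving an N x M matrix.
missing <- apply(Km, 3, function(K) rowSums(!is.na(K)) == 0)
print(missing)
```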
Plot a similarity matrix using pheatmap.
plotSimilarityMatrix( X, y = NULL, clusLabels = NULL, colX = NULL, colY = NULL, myLegend = NULL, fileName = "posteriorSimilarityMatrix", savePNG = FALSE, semiSupervised = FALSE, showObsNames = FALSE, clr = FALSE, clc = FALSE, plotWidth = 500, plotHeight = 450 )
X |
Similarity matrix. |
y |
Vector |
clusLabels |
Cluster labels |
colX |
Colours for the matrix |
colY |
Colours for the response |
myLegend |
Vector of strings with the names of the variables |
fileName |
If |
savePNG |
Boolean: if TRUE, the plot is saved as a png file. Default is FALSE. |
semiSupervised |
Boolean flag: if TRUE, the response is plotted next to the matrix. |
showObsNames |
Boolean. If TRUE, observation names are shown in the plot. Default is FALSE. |
clr |
Boolean. If TRUE, rows are ordered by hierarchical clustering. Default is FALSE. |
clc |
Boolean. If TRUE, columns are ordered by hierarchical clustering. Default is FALSE. |
plotWidth |
Plot width. Default is 500. |
plotHeight |
Plot height. Default is 450. |
No return value. This function plots the similarity matrix either to screen or to a png file.
Alessandra Cabassi
# Load one dataset with 100 observations, 2 variables, 4 clusters
data <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
  package = "klic"), row.names = 1))
# Load cluster labels
cluster_labels <- as.matrix(read.csv(system.file("extdata",
  "cluster_labels.csv", package = "klic"), row.names = 1))
# Compute consensus clustering with K=4 clusters
cm <- coca::consensusCluster(data, 4)
# Plot consensus (similarity) matrix
plotSimilarityMatrix(cm)
# Plot consensus (similarity) matrix with response
names(cluster_labels) <- as.character(1:100)
rownames(cm) <- names(cluster_labels)
plotSimilarityMatrix(cm, y = cluster_labels)
Make a symmetric matrix positive semi-definite.
spectrumShift(kernelMatrix, coeff = 1.2, shift = NULL, verbose = FALSE)
kernelMatrix |
symmetric matrix |
coeff |
Coefficient by which the minimum eigenvalue is multiplied when shifting the eigenvalues, in order to avoid numeric problems. Default is 1.2. |
shift |
Value of the constant added to the diagonal, if known a priori. Default is NULL. |
verbose |
Boolean flag: if TRUE, information about the shift is printed to screen. Default is FALSE. |
This function returns the matrix kernelMatrix after applying the required spectrum shift.
Alessandra Cabassi
# Load one dataset with 100 observations, 2 variables, 4 clusters
data <- as.matrix(read.csv(system.file("extdata", "dataset1.csv",
  package = "klic"), row.names = 1))
# Compute consensus clustering with K=4 clusters
cm <- coca::consensusCluster(data, 4)
# Shift eigenvalues of the matrix by a constant: (min eigenvalue) * (coeff)
km <- spectrumShift(cm, coeff = 1.05)
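The idea behind the shift can be illustrated in base R: if the smallest eigenvalue of a symmetric matrix is negative, adding a constant slightly larger than its absolute value to the diagonal moves every eigenvalue above zero. The helper `spectrum_shift_sketch` below is a hypothetical sketch under that reading of the `coeff` argument, not the package's implementation.

```r
# Hypothetical helper: add |min eigenvalue| * coeff to the diagonal
# whenever the smallest eigenvalue is negative
spectrum_shift_sketch <- function(K, coeff = 1.2) {
  min_eig <- min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)
  if (min_eig < 0) {
    # Adding a constant to the diagonal shifts all eigenvalues by it
    K <- K + diag(abs(min_eig) * coeff, nrow(K))
  }
  K
}

# Symmetric but indefinite toy matrix: eigenvalues are 3 and -1
S <- matrix(c(1, 2,
              2, 1), 2, 2)
shifted <- spectrum_shift_sketch(S)
new_eigenvalues <- eigen(shifted, symmetric = TRUE, only.values = TRUE)$values
print(new_eigenvalues)
```

Using a coefficient above 1 leaves a small margin, so that the smallest eigenvalue ends up strictly positive rather than at zero up to rounding error.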