An R package for non-negative and sparse canonical correlation analysis (CCA).
CCA is a method for finding associations between paired data sets. For example, a health study might record gene expression levels and a number of physiological parameters for a patient cohort. If one conjectures that the cause of the physiological symptoms has a genetic component, one could expect to find a correlation between the expression of certain genes and the strength of certain symptoms. CCA finds a pair of linear projections (called canonical vectors), one for each data modality, such that the projected values (called canonical variables) have maximum correlation. The next pair of canonical variables is found by again maximizing their correlation, under the additional constraint that they have to be uncorrelated with all previous ones, and so on.
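As a point of reference, the unconstrained problem described above can be tried with `cancor` from the stats package that ships with R. The following is a minimal sketch on simulated data; the variable names and the data-generating setup are assumptions made for illustration, and it does not use this package.

```r
# Simulated paired data: both sets share one latent signal (illustrative only).
set.seed(1)
n <- 200
z <- rnorm(n)                                  # shared latent factor
x <- cbind(z + rnorm(n), rnorm(n), rnorm(n))   # first data set, 3 features
y <- cbind(rnorm(n), z + rnorm(n))             # second data set, 2 features

# Classical (unconstrained) CCA from the stats package.
cc <- cancor(x, y)

# Canonical variables: centered data projected onto the canonical vectors.
xs <- scale(x, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
ys <- scale(y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef

# Cross-correlations of the first two pairs of canonical variables.
# The diagonal reproduces cc$cor; the off-diagonal entries are zero
# up to rounding, because different pairs are uncorrelated.
r <- cor(xs[, 1:2], ys)
print(cc$cor)
print(round(r, 3))
```

The columns of `cc$xcoef` and `cc$ycoef` are the canonical vectors, and `cc$cor` holds the canonical correlations in decreasing order.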
CCA was first introduced by Hotelling in 1936, and has many similarities to principal component analysis (PCA). Where the PCA solution is computed from the eigenvalue decomposition (EVD) of the covariance matrix of a single data set, the CCA solution is computed from the EVD of a matrix derived from the cross-covariance matrix of the two data sets, normalized by their within-set covariance matrices. This approach is very efficient, but one sometimes encounters the following problems during an analysis. First, if at least one of the data sets contains more features than samples (a common case for gene expression data), there exists an infinite number of trivial projections that achieve perfect correlation. Regularizing the canonical vectors is necessary to make the problem well-posed again. Second, the projections are typically linear combinations with non-zero weights for all features, which makes the weights difficult to interpret. A sparse solution which only includes a small number of important features is often desirable.
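The first problem is easy to reproduce. The sketch below uses only base R and made-up random data (all names are illustrative): with more features than samples, an unregularized least-squares fit finds weights whose projection correlates perfectly with a target that is unrelated to the features.

```r
set.seed(2)
n <- 10                            # samples
p <- 30                            # features, more than samples
x <- matrix(rnorm(n * p), n, p)    # pure noise features
y <- rnorm(n)                      # target generated independently of x

# Unregularized least squares with an intercept. The design matrix is
# rank deficient, so lm.fit marks the aliased coefficients as NA.
fit <- lm.fit(cbind(1, x), y)
w <- fit$coefficients[-1]          # drop the intercept
w[is.na(w)] <- 0                   # aliased features get weight zero

# The projection x %*% w reproduces y up to a constant shift, so the
# in-sample correlation is 1 although x and y are unrelated.
trivial_cor <- cor(drop(x %*% w), y)
print(trivial_cor)
```

Any weight vector that differs from `w` by an element of the null space of the centered `x` gives the same projection, which is why infinitely many such trivial solutions exist.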
This package implements a CCA algorithm called nscancor, which can
enforce appropriate constraints on the canonical vectors to address both
aforementioned problems. Enforcing a bound on the Euclidean norm (also
called the L2 norm) of the projections avoids trivial correlations.
Enforcing a bound on the L1 norm leads to sparse solutions, where many
of the weights are exactly zero. Enforcing non-negativity of the
projection weights is useful for analyzing data where only a positive
influence of features is deemed appropriate. The algorithm executes
iterated regression steps, and the constraints enter via the regression
functions. nscancor is therefore modular, and builds on the many
regression methods that are available, e.g. ridge regression or the
elastic net. By using two different regression functions, the proper
constraints can be enforced for each domain.
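To make the idea of iterated regression steps concrete, here is a minimal base R sketch for the first pair of canonical vectors only. It is an illustration under stated assumptions, not the package's implementation: `alt_cca`, `ridge` and `unit` are hypothetical helpers, ridge regression stands in for whichever constrained regression one would actually choose, and the data are simulated.

```r
# Hypothetical helper: ridge regression coefficients of `target` on design `a`.
# With lambda = 0 this is ordinary least squares.
ridge <- function(a, target, lambda = 0) {
  solve(crossprod(a) + lambda * diag(ncol(a)), crossprod(a, target))
}

# Rescale a vector to unit Euclidean norm.
unit <- function(u) u / sqrt(sum(u^2))

# Alternate between the two domains: regress the current canonical
# variable of one data set onto the other data set, and repeat.
# The constraints enter only through fit_w and fit_v.
alt_cca <- function(x, y, fit_w, fit_v, iter = 100) {
  x <- scale(x, scale = FALSE)
  y <- scale(y, scale = FALSE)
  v <- rnorm(ncol(y))                  # random starting vector
  for (i in seq_len(iter)) {
    w <- fit_w(x, unit(y %*% v))       # update canonical vector for x
    v <- fit_v(y, unit(x %*% w))       # update canonical vector for y
  }
  list(w = drop(w), v = drop(v), cor = drop(cor(x %*% w, y %*% v)))
}

set.seed(1)
n <- 200
z <- rnorm(n)
x <- cbind(z + 0.5 * rnorm(n), rnorm(n), rnorm(n))
y <- cbind(rnorm(n), z + 0.5 * rnorm(n))

# Unpenalized regressions on both sides recover the classical solution.
plain <- alt_cca(x, y, ridge, ridge)

# A different regression function per domain: ridge penalty on x only.
mixed <- alt_cca(x, y, function(a, target) ridge(a, target, lambda = 50), ridge)

print(c(plain = plain$cor, classical = cancor(x, y)$cor[1], mixed = mixed$cor))
```

Swapping `ridge` for a sparse or non-negative regression routine would change the constraint without touching the loop, which is the sense in which such a scheme is modular.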
The package also provides a generalization of constrained CCA for
analyzing more than two data sets. The mcancor algorithm is structurally
analogous to nscancor, but it maximizes the sum of all pairwise
correlations of canonical variables. As with nscancor, specifying the
regression function for each domain makes it possible to enforce
appropriate constraints on each canonical vector.
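The objective for more than two data sets can be written down in a few lines. The helper below is a hypothetical illustration in base R, not part of the package: given a list of data sets and one candidate canonical vector per data set, it returns the sum of all pairwise correlations of the resulting canonical variables.

```r
# Hypothetical helper: sum of pairwise correlations of canonical variables.
# xs is a list of data matrices with the same number of rows,
# ws is a list of weight vectors, one per data set.
sum_pairwise_cor <- function(xs, ws) {
  # One column per data set: the centered data projected onto its vector.
  vars <- mapply(function(x, w) drop(scale(x, scale = FALSE) %*% w), xs, ws)
  r <- cor(vars)
  sum(r[upper.tri(r)])                 # count each pair once
}

# Toy check with three single-feature data sets (made-up data):
# the pairs correlate with 1, -1 and -1, so the objective is -1.
set.seed(3)
a <- rnorm(20)
objective <- sum_pairwise_cor(list(cbind(a), cbind(a), cbind(-a)),
                              list(1, 1, 1))
print(objective)
```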
This blog post explains how to use the package and demonstrates its benefits.