# Multivariate Analysis Toolbox for Matlab®

## written by: Liran Carmel

We broke down a typical multivariate data analysis process into its key components, and then built a fully object-oriented toolbox with one object for each of those components.

Data objects. We have identified three entities that are the building blocks of any multivariate data process. The sampleset object carries information about the different samples, also called observations, conditions, or experiments. The grouping object carries information about the labeling of the samples, i.e., their association with specific clusters. The measurements themselves are held in a data matrix.

The datamatrix object is the general framework of a data matrix, from which more specialized data matrices are derived by object-oriented inheritance. These more specialized data matrices encompass most of the data organization forms one may encounter. The vsmatrix object describes a rectangular two-way matrix of variables-by-samples. For example, the result of a gene array experiment in the form of genes-by-conditions is represented in our toolbox by a vsmatrix object. The ssmatrix object describes relationships between samples. For example, a distance matrix is represented in our toolbox as an ssmatrix. The vvmatrix object describes relationships between variables. For example, a correlation matrix is represented in our toolbox as a vvmatrix.
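To make the three matrix shapes concrete, here is a minimal sketch in plain MATLAB. It uses only built-in functions and ordinary numeric arrays, not the toolbox objects themselves; the toy values and the variable names `X`, `D`, and `R` are assumptions made for illustration.

```matlab
% Toy data: 3 variables (rows) measured on 4 samples (columns),
% i.e. the variables-by-samples layout.
X = [1 2 3 4;
     2 4 6 8;
     5 3 1 0];

nSamples = size(X, 2);

% Samples-by-samples: Euclidean distances between the columns of X.
D = zeros(nSamples);
for i = 1:nSamples
    for j = 1:nSamples
        D(i, j) = norm(X(:, i) - X(:, j));
    end
end

% Variables-by-variables: correlations between the rows of X.
% corrcoef treats columns as variables, hence the transpose.
R = corrcoef(X');

disp(size(D));
disp(size(R));
```

The same raw measurements give a 4-by-4 matrix when samples are compared and a 3-by-3 matrix when variables are compared, which is the distinction the ssmatrix and vvmatrix objects capture.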

Graph theory. The graph object describes a general mathematical graph. This toolbox includes more specific graphs (such as digraphs and trees) that are derived from this general object.

Pairwise data objects. These objects describe specific forms of pairwise data, and are all derived from either ssmatrix or vvmatrix (see above). The covmatrix object describes a covariance or a correlation matrix. The distmatrix object describes a distance matrix. The dissimatrix and simatrix objects describe pairwise dissimilarity or similarity information, respectively.

Dimensionality reduction algorithms. Each object in this group stands for a particular dimensionality reduction technique. Currently available are pcatrans, which performs principal component analysis (PCA); wpcatrans, which performs weighted principal component analysis (wPCA); and fishtrans, which identifies discriminant directions according to Fisher linear discriminant analysis.
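As an illustration of what PCA computes, the following sketch runs the textbook eigen-decomposition of the covariance matrix in base MATLAB. It is not the pcatrans implementation; the toy data and all variable names are assumptions.

```matlab
% Toy variables-by-samples matrix: 2 variables, 5 samples.
X = [2 4 6 8 10;
     1 2 3 4  5];

nSamples = size(X, 2);

% Center each variable (row) around its mean.
Xc = X - mean(X, 2) * ones(1, nSamples);

% Sample covariance between the variables.
C = (Xc * Xc') / (nSamples - 1);

% Eigen-decomposition; order the components by decreasing variance.
[V, L] = eig(C);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);

% Scores: projections of the samples onto the principal directions.
scores = V' * Xc;

% Fraction of the total variance carried by each component.
explained = lambda / sum(lambda);
disp(explained');
```

Because the second variable here is exactly half the first, essentially all of the variance falls on the first component.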

Statistics. This portion of the toolbox includes general statistical functions, mainly various hypothesis testing procedures, as well as the object ctable that describes a contingency table.

#### Core objects:

grouping
labeling of the data according to a classification scheme
sampleset
information about samples (observations, conditions, experiments)
variable

#### Core data objects:

datamatrix
a two-way matrix object
ssmatrix
samples-by-samples two-way datamatrix (e.g., distance matrix)
vsmatrix
variable-by-samples two-way datamatrix (typical data matrix)
vvmatrix
variables-by-variables two-way datamatrix (e.g., correlation matrix)
dataset
repository of information regarding a certain dataset
graph
general undirected graph
digraph
general directed graph
tree
binary tree
bintree
binary tree

#### Pairwise data objects:

covmatrix
covariance matrix (inherits vvmatrix)
dissimatrix
dissimilarity matrix (inherits ssmatrix)
distmatrix
distance matrix (inherits ssmatrix)
simatrix
similarity matrix (inherits ssmatrix)

#### Dimensionality reduction algorithms:

lintrans
general dimensionality reduction by linear transformation
fishtrans
Fisher linear discriminant analysis
pcatrans
principal component analysis (PCA)
wpcatrans
weighted principal component analysis (wPCA)

#### Statistics:

ctable
contingency table

#### Combinatorics:

multinom
computes the multinomial coefficient.
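For reference, the multinomial coefficient for group sizes k1, ..., km with n = k1 + ... + km is n! / (k1! ... km!). The sketch below evaluates it in base MATLAB through log-factorials; it is an independent illustration rather than the code of multinom, and the group sizes are hypothetical.

```matlab
% Hypothetical group sizes.
k = [2 1 1];
n = sum(k);

% gammaln(x + 1) equals log(x!), which avoids overflow for large counts.
logCoef = gammaln(n + 1) - sum(gammaln(k + 1));

% Round to undo floating-point error from exp/log.
coef = round(exp(logCoef));
disp(coef);
```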

#### Data manipulations:

lineup
ranks a vector in increasing order.
majority
finds the most frequent entry.
subs_incomp_data
substitutes given data into an incomplete data array.
subsample
picks up at random a subsample of a vector.
substitute
substitutes values in a list with a different set of values.
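Several of these operations have close analogues in base MATLAB. The sketch below shows one plausible reading of majority, subsample, and substitute using built-in functions only; the exact behavior of the toolbox functions (for instance, whether subsampling is done without replacement) is an assumption here.

```matlab
v = [3 1 2 3 5 3 2];

% Most frequent entry.
mostFrequent = mode(v);

% Random subsample of size m, drawn here without replacement (assumed).
m = 3;
perm = randperm(numel(v));
sub = v(perm(1:m));

% Substitute one set of values with another through a lookup table.
oldVals = [1 2 3 5];
newVals = [10 20 30 50];
[~, loc] = ismember(v, oldVals);   % position of each entry in oldVals
w = newVals(loc);

disp(mostFrequent);
disp(w);
```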

#### Graph Theory:

chowliu
applies the Chow-Liu algorithm.
code2dag
finds the DAG associated with a DAG-code.
code2rank
finds the rank of a DAG-code.
dispdagcode
displays a DAG code to the screen.
enumdagcodes
enumerates all DAG codes for a fixed number of nodes.
enummarkovclasses
enumerates all Markov equivalence classes for a fixed number of nodes.
nodags
computes the number of DAGs with fixed number of nodes.
rank2code
finds the DAG-code whose rank is {r}.
thd2wgt
computes, given THD, a default weight matrix.
wgt2thd
computes, given weights, a default THD matrix.
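The number of labeled DAGs on a fixed number of nodes can be obtained from Robinson's recurrence. The sketch below implements that recurrence in base MATLAB as an illustration; whether nodags uses this recurrence is not stated in this document, and the variable names are assumptions.

```matlab
% Robinson's recurrence for the number a(n) of labeled DAGs on n nodes:
%   a(0) = 1
%   a(n) = sum_{k=1}^{n} (-1)^(k+1) * nchoosek(n,k) * 2^(k*(n-k)) * a(n-k)
nMax = 5;
a = zeros(1, nMax + 1);      % a(n + 1) stores the count for n nodes
a(1) = 1;

for n = 1:nMax
    total = 0;
    for k = 1:n
        total = total + (-1)^(k + 1) * nchoosek(n, k) ...
                        * 2^(k * (n - k)) * a(n - k + 1);
    end
    a(n + 1) = total;
end

disp(a);
```

The counts grow quickly (1, 1, 3, 25, 543, 29281 for 0 to 5 nodes), which is why exhaustive enumeration is only practical for small graphs.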

#### Grouping:

group
turns a list into assignment vector and naming cell array.

#### Hypothesis testing:

testbinom
computes the p-value of testing a binomial parameter.
testchi2hist
uses the chi2 test to compare a histogram to a standard.
testchi2hists
uses the chi2 test to compare two histograms.
testchi2independence
computes the p-value of independence hypothesis.
testfisherexact
computes the p-value of Fisher's exact test (© A. Trujillo-Ortiz et al.).
testfisheromnibus
computes p-value for the Fisher Omnibus test.
testkshist
uses the KS test to compare a histogram to a standard.
testkshists
uses the KS test to compare two histograms.
testmultinom
computes the p-value of testing multinomial parameters.
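To show the kind of computation behind an independence test, here is a minimal sketch of the standard Pearson chi-square test on a contingency table, written in base MATLAB. It is not the code of testchi2independence, and the counts are hypothetical.

```matlab
% Hypothetical 2-by-3 contingency table of observed counts.
O = [10 20 30;
     20 20 20];

N = sum(O(:));

% Expected counts under independence: (row total * column total) / N.
E = sum(O, 2) * sum(O, 1) / N;

% Pearson chi-square statistic and its degrees of freedom.
chi2 = sum((O(:) - E(:)).^2 ./ E(:));
dof  = (size(O, 1) - 1) * (size(O, 2) - 1);

% Upper tail of the chi-square distribution through the regularized
% incomplete gamma function, so no extra toolbox is needed.
p = gammainc(chi2 / 2, dof / 2, 'upper');

fprintf('chi2 = %.4f, dof = %d, p = %.4f\n', chi2, dof, p);
```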

#### Information Theory:

centropy
computes the conditional entropy between two variables.
emutualentropy
estimates pairwise mutual entropy
entropy
computes the entropy of a distribution.
kdiv
computes the K-divergence between distributions p and q.
kl
computes the relative entropy.
ldiv
computes the L-divergence between distributions p and q.
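The two most familiar quantities in this group, entropy and relative entropy, can be written in a few lines of base MATLAB. This sketch follows the standard definitions; the choice of base-2 logarithms (bits) and the example distributions are assumptions, and the toolbox functions may use a different convention.

```matlab
p = [0.5 0.25 0.25];
q = [1 1 1] / 3;

% Shannon entropy in bits; entries with p = 0 contribute 0 by convention.
nz = p > 0;
H = -sum(p(nz) .* log2(p(nz)));

% Relative entropy (Kullback-Leibler divergence) D(p || q) in bits.
% Assumes q > 0 wherever p > 0.
KL = sum(p(nz) .* log2(p(nz) ./ q(nz)));

fprintf('H(p) = %.4f bits, D(p||q) = %.4f bits\n', H, KL);
```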

#### Linear transformations:

fa_engin
performs factor analysis on the data.
factorscores
estimates the scores after factor analysis.
fish_engin
performs Fisher transformation of a grouped dataset.
pca_engin
performs PCA on the data.
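For the two-group case, Fisher's discriminant direction is proportional to the inverse within-group scatter matrix applied to the difference of the group means. The sketch below computes it in base MATLAB on hypothetical data; it illustrates the textbook formula and is not the fish_engin implementation.

```matlab
% Two hypothetical groups, each stored as variables-by-samples.
X1 = [1 2 3;
      1 2 1];
X2 = [5 6 7;
      4 5 4];

% Group means and centered data.
m1 = mean(X1, 2);
m2 = mean(X2, 2);
X1c = X1 - m1 * ones(1, size(X1, 2));
X2c = X2 - m2 * ones(1, size(X2, 2));

% Pooled within-group scatter matrix.
Sw = X1c * X1c' + X2c * X2c';

% Fisher direction: Sw^{-1} * (m1 - m2), normalized to unit length.
w = Sw \ (m1 - m2);
w = w / norm(w);

% One-dimensional projections of each group onto the direction.
z1 = w' * X1;
z2 = w' * X2;
disp([z1; z2]);
```

On this toy data the projected groups do not overlap, which is what the direction is designed to achieve.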

#### Pairwise Relationships:

distmat
calculates distance matrix.

#### Regression analysis:

regress1d
linearly regresses one variable on another.
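A one-dimensional linear regression can be reproduced in base MATLAB with `polyfit`. The sketch below is only an illustration of the underlying least-squares fit; the signature and outputs of regress1d are not described here, so none are assumed.

```matlab
% Hypothetical data lying exactly on the line y = 2x + 1.
x = [0 1 2 3];
y = [1 3 5 7];

% Least-squares fit of a first-degree polynomial.
coeffs = polyfit(x, y, 1);
slope = coeffs(1);
intercept = coeffs(2);

% Residuals of the fit.
residuals = y - (slope * x + intercept);

fprintf('slope = %.4f, intercept = %.4f\n', slope, intercept);
```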

#### Statistics:

allstats
computes all common statistics (© D.C. Hanselman).
fdr
calculates
kendall
computes the Kendall rank correlation matrix.
pearson
computes the Pearson (linear) correlation matrix.
spearman
computes the Spearman rank correlation matrix.
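The relationship between the Pearson and Spearman coefficients can be seen in a short base-MATLAB sketch: Spearman's coefficient is Pearson's coefficient computed on ranks. The example assumes data without ties, so that ranks can be obtained directly from `sort`; handling of ties in the toolbox functions is not covered here.

```matlab
% Hypothetical data: monotone but not linear, no ties.
x = [1 2 3 4 5];
y = [1 4 9 16 26];

% Pearson (linear) correlation.
Rp = corrcoef(x, y);
r = Rp(1, 2);

% Ranks of each vector (valid only when there are no ties).
rx = zeros(size(x));
ry = zeros(size(y));
[~, ix] = sort(x);
[~, iy] = sort(y);
rx(ix) = 1:numel(x);
ry(iy) = 1:numel(y);

% Spearman rank correlation: Pearson correlation of the ranks.
Rs = corrcoef(rx, ry);
rho = Rs(1, 2);

fprintf('Pearson r = %.4f, Spearman rho = %.4f\n', r, rho);
```

Since y increases monotonically with x, the rank correlation is exactly 1 while the linear correlation is slightly lower.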

#### Visualization:

scatter2d_engin
the engine used for 2D scatter plots.
scatter3d_engin
the engine used for 3D scatter plots.

The toolbox is freely available from this site. The latest release is MVA_13Sep2010.
Prerequisite: the toolbox General Utilities.