Concepts Identification

Concepts (alternatively called aspects or latent variables) are factors that are hidden when using normal projections. These variables can be uncovered through analysis of the data. Three types of concept identification are provided, all of which are found under the Tools->Find Concepts menu:

Cluster Transformed
Co-variant concepts (uncentered or centered)
Co-occurrence concepts

Once the concepts have been calculated, they are placed in the projection table of the experiment lists (this is the second table in the bottom left of the main screen). The information shown for each projection includes its name and, if appropriate, its significance value (its singular value, its probability or its eigenvalue). To use the projections when viewing the data, select the 'use projections' check box.

Cluster Transformed

A cluster transformation is provided to visually explore the validity of, and the similarities between, clusters that have been generated in SeqExpress. To project the data, the distance between the 'average' expression profile for a cluster (or its model) and each gene is calculated, and each gene is then transformed by this amount. Transforming a series of expression experiments for 10,000 genes using 10 clusters therefore results in a 10x10,000 matrix. The more similar a gene profile is to the model that defines a cluster, the smaller the resulting value. The transformations can then be viewed using any of the standard visualisation tools available.
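
The transformation described above can be sketched in a few lines of Python. This is an illustration only, not the SeqExpress implementation: the function name cluster_transform is hypothetical, the cluster model is assumed to be the mean profile of the cluster's members, and Euclidean distance is assumed because the text does not name the distance measure used.

```python
import math

def cluster_transform(profiles, labels):
    """Return one row per cluster holding the distance from that cluster's
    mean profile to every gene (assumed Euclidean, see lead-in)."""
    clusters = sorted(set(labels))
    models = []
    for cluster in clusters:
        # Collect the profiles of the genes assigned to this cluster.
        members = [p for p, lab in zip(profiles, labels) if lab == cluster]
        # The 'average' expression profile: the mean of each experiment column.
        models.append([sum(col) / len(members) for col in zip(*members)])
    # Result shape: (number of clusters) x (number of genes).
    return [[math.dist(model, p) for p in profiles] for model in models]

# Four genes measured in two experiments, grouped into two clusters.
profiles = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
labels = [0, 0, 1, 1]
transformed = cluster_transform(profiles, labels)
```

Each gene ends up with one value per cluster, and genes close to a cluster's model receive small values in that cluster's row, which is the property the visualisation tools then display.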

Co-variant concepts

The Singular Value Decomposition (SVD) of the matrix is calculated. The left-hand side matrix (U) is used; it gives each gene a value for each underlying singular value. This technique is a type of Principal Component Analysis.
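
As a rough sketch of this step, the following Python uses NumPy's numpy.linalg.svd to obtain U and the singular values. The function name covariant_concepts and its parameters are hypothetical, and the centred variant is assumed to subtract the mean of each experiment column before decomposition; the text does not say how SeqExpress centres the data.

```python
import numpy as np

def covariant_concepts(expression, n_concepts, center=False):
    """expression: genes x experiments matrix.
    Returns the first n_concepts columns of U and their singular values."""
    data = np.asarray(expression, dtype=float)
    if center:
        # Assumption: centring removes each experiment's mean over all genes.
        data = data - data.mean(axis=0)
    # Thin SVD: data = U @ diag(s) @ Vt, with s sorted in descending order.
    u, s, _vt = np.linalg.svd(data, full_matrices=False)
    return u[:, :n_concepts], s[:n_concepts]

# A rank-one example: every gene follows the same pattern at a different scale.
example = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 2.0])
u, s = covariant_concepts(example, n_concepts=2)
```

Because the example matrix has rank one, only the first singular value is meaningfully non-zero, which mirrors how the significance value reported for each concept can be used to judge which concepts matter.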

Co-occurrence concepts

The (probabilistic) aspects of the gene/expression matrix are found. This is done by maximising the log-likelihood function: tempered expectation maximisation is applied iteratively to one half of the data set (training), and the resulting log-likelihood is then compared against the second half (held out).

This technique relies strongly on the correct choice of the two data sets (so that co-occurrence is observed). It is only useful in large-scale repeated sampling (or possibly time course) experiments, as it relies upon classifying the genes using one subset and verifying against another. The number of latent variables that can be found therefore depends on there being a representative sample of their behaviour in both the training and held-out sets.
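
The training and held-out procedure can be illustrated with a minimal Python sketch using NumPy. Several things here are assumptions rather than a description of SeqExpress: the two halves are taken to be two count matrices of identical shape (genes x experiments), the aspect model is the usual P(g,e) = sum over z of P(z) P(g|z) P(e|z), the tempered E-step raises the conditional terms to a power beta, and beta is multiplied by a hypothetical factor eta whenever the held-out log-likelihood fails to improve. The names tempered_em, log_likelihood, eta and n_iter are illustrative.

```python
import numpy as np

def log_likelihood(counts, pz, pg_z, pe_z):
    # Joint model P(g, e) = sum_z P(z) P(g|z) P(e|z), shape genes x experiments.
    joint = np.einsum("z,gz,ez->ge", pz, pg_z, pe_z)
    return float((counts * np.log(joint + 1e-12)).sum())

def tempered_em(train, held_out, n_aspects, n_iter=50, eta=0.9, seed=0):
    """Fit on `train`, track the log-likelihood on `held_out`, and return the
    parameters that scored best on the held-out half."""
    rng = np.random.default_rng(seed)
    n_genes, n_exps = train.shape
    pz = np.full(n_aspects, 1.0 / n_aspects)
    pg_z = rng.random((n_genes, n_aspects))
    pg_z /= pg_z.sum(axis=0)
    pe_z = rng.random((n_exps, n_aspects))
    pe_z /= pe_z.sum(axis=0)
    beta = 1.0
    best = (log_likelihood(held_out, pz, pg_z, pe_z), pz, pg_z, pe_z)
    for _ in range(n_iter):
        # Tempered E-step: P(z|g,e) proportional to P(z) [P(g|z) P(e|z)]^beta.
        post = pz[None, None, :] * (pg_z[:, None, :] * pe_z[None, :, :]) ** beta
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the distributions from the weighted training counts.
        weighted = train[:, :, None] * post
        totals = weighted.sum(axis=(0, 1))
        pg_z = weighted.sum(axis=1) / (totals + 1e-12)
        pe_z = weighted.sum(axis=0) / (totals + 1e-12)
        pz = totals / totals.sum()
        held = log_likelihood(held_out, pz, pg_z, pe_z)
        if held > best[0]:
            best = (held, pz, pg_z, pe_z)
        else:
            # No improvement on the held-out half: lower the temperature.
            beta *= eta
    return best

# Two replicate count matrices with the same two-block structure.
train = np.array([[8, 7, 0, 0], [9, 6, 1, 0], [0, 1, 7, 9], [0, 0, 8, 6]], dtype=float)
held_out = np.array([[7, 8, 0, 1], [8, 7, 0, 0], [1, 0, 8, 7], [0, 0, 9, 8]], dtype=float)
ll, pz, pg_z, pe_z = tempered_em(train, held_out, n_aspects=2)
```

If the held-out matrix does not show the same co-occurrence structure as the training matrix, the held-out log-likelihood stops improving almost immediately, which is why the choice of the two data sets matters so much.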