Distance Measures

SeqExpress provides a number of algorithms for calculating the distance between two genes. These distance measures can be used to calculate clusters, construct and partition graphs, generate hierarchies, find mixtures of models and validate clusters. The following distance measures are available:

Manhattan distance: this is the sum of the absolute differences between the expression values of two genes (e.g. sum(|A-B|)). Unlike the Euclidean distance, it does not overemphasise large differences. It should generally be considered the default distance measure for most operations.
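As a minimal sketch of the calculation described above (not SeqExpress's own code; the function name and the representation of a gene as a list of expression values are assumptions made for illustration):

```python
def manhattan_distance(a, b):
    """Sum of absolute differences between two expression profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of values")
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical genes measured under the same three conditions.
gene_a = [1, 2, 3]
gene_b = [2, 4, 6]
print(manhattan_distance(gene_a, gene_b))  # prints 6
```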

Euclidean distance: this is the square root of the sum of the squared differences between two genes (e.g. sqrt(sum((A-B)^2))), which is the straight-line distance in n dimensions. Because the differences are squared, large differences are emphasised. This is a good distance measure when looking for the more extreme differences in expression profiles.
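The emphasis on large differences can be illustrated with a small sketch. The function names and the example profiles are hypothetical and are not taken from SeqExpress; the sum-of-absolute-differences measure is included only for comparison.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute differences, shown for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))


reference = [0, 0, 0, 0]
spread = [1, 1, 1, 1]   # four small differences
spike = [4, 0, 0, 0]    # one large difference

# Both profiles have the same total absolute difference from the reference...
print(manhattan_distance(reference, spread), manhattan_distance(reference, spike))  # prints 4 4
# ...but the single large difference yields the larger Euclidean distance.
print(euclidean_distance(reference, spread), euclidean_distance(reference, spike))  # prints 2.0 4.0
```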

Pearson Measure: this measures whether the two genes differ from their mean values in a similar manner (e.g. ((A-mean(A))/std(A)) * ((B-mean(B))/std(B))). It is useful when looking for genes which have similar responses but not necessarily similar values.
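A minimal sketch of the underlying correlation, assuming the standardised products are averaged over all values (the usual Pearson coefficient). How SeqExpress turns this coefficient into a distance is not stated here, so the sketch stops at the coefficient; the function name is hypothetical.

```python
import math


def pearson_correlation(a, b):
    """Average product of the standardised values of two profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n) to match the average below.
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(((x - mean_a) / std_a) * ((y - mean_b) / std_b)
               for x, y in zip(a, b)) / n


# Same response pattern at very different expression levels.
low = [1, 2, 3, 4]
high = [10, 20, 30, 40]
r_same = pearson_correlation(low, high)          # close to 1
r_opposite = pearson_correlation(low, high[::-1])  # close to -1
```

Although the two profiles differ greatly in absolute value, their correlation is at its maximum, which is the behaviour described above.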

Cosine Distance: this is a type of Pearson measure which considers the relative differences (e.g. A.B/(|A|*|B|)), assuming that the scale is uniform (that is, that values are measured relative to zero rather than relative to each gene's mean). In some cases this can give better results, particularly where the data is not normally distributed.
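The formula above can be sketched as follows. The function name is hypothetical, and the sketch returns the cosine value itself, since the exact conversion to a distance used by SeqExpress is not given here.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two profiles divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine is undefined for an all-zero profile")
    return dot / (norm_a * norm_b)


base = [1, 2, 3]
scaled = [2, 4, 6]      # same profile multiplied by 2
shifted = [11, 12, 13]  # same profile with 10 added to every value

cos_scaled = cosine_similarity(base, scaled)    # close to 1
cos_shifted = cosine_similarity(base, shifted)  # noticeably below 1
```

Scaling a profile leaves the cosine unchanged, whereas adding a constant does not, because the values are not centred on their mean first.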

Gradient Change: this is an order-dependent measure, and is useful for finding trends within the data. It was designed to work with time-series data, where similar distances equate to similar trends (e.g. decreases over time).

Ontology Similarity: this is a measure of how similar two genes are based on their ontology terms (this presumes that the terms are arranged in a hierarchy). Every term of one gene is compared against every term of the other, and the shortest distance between any pair of terms is returned (e.g. 2*N3/(N1+N2+2*N3), where N1 is the distance between the first ontology term and the common parent, N2 is the distance between the second term and the common parent, and N3 is the distance between the common parent and the root ontology term). This measure allows genes to be compared numerically based on their Gene Ontology annotations.
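The formula above can be sketched for a simple case. The sketch assumes that every term has a single parent (a tree rather than a general graph), that the hierarchy is stored as a child-to-parent dictionary, and that the best-scoring pair of terms corresponds to the "shortest distance"; the term names and function names are hypothetical.

```python
def path_to_root(term, parent):
    """Return the list of terms from `term` up to the root, inclusive."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def term_similarity(term1, term2, parent):
    """2*N3 / (N1 + N2 + 2*N3) for the nearest common parent of two terms."""
    path1 = path_to_root(term1, parent)
    path2 = path_to_root(term2, parent)
    steps_in_path2 = {t: i for i, t in enumerate(path2)}
    for n1, t in enumerate(path1):
        if t in steps_in_path2:
            n2 = steps_in_path2[t]
            n3 = len(path1) - 1 - n1  # steps from the common parent to the root
            total = n1 + n2 + 2 * n3
            # Both terms are the root itself: treat them as identical.
            return 1.0 if total == 0 else 2 * n3 / total
    return 0.0  # no common parent at all


def gene_similarity(terms_a, terms_b, parent):
    """Compare all terms against all terms and keep the best-matching pair."""
    return max(term_similarity(a, b, parent) for a in terms_a for b in terms_b)


# Hypothetical hierarchy: child -> parent.
PARENT = {
    "metabolism": "root",
    "signalling": "root",
    "glycolysis": "metabolism",
    "lipid_metabolism": "metabolism",
}

sibling_score = term_similarity("glycolysis", "lipid_metabolism", PARENT)  # N1=1, N2=1, N3=1
distant_score = term_similarity("glycolysis", "signalling", PARENT)        # common parent is the root, N3=0
best_score = gene_similarity(["glycolysis", "signalling"], ["lipid_metabolism"], PARENT)
```

Terms that share a parent deep in the hierarchy score higher than terms whose only common parent is the root.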

Paired t-test: this is an experimental distance measure based on the pairwise t-test (e.g. (mean(A) - mean(B))/sqrt(std(A)/n + std(B)/n)). The actual distance is based on the probability of the observed differences being significant. The measure is designed to highlight genes that have significantly different expression profiles (assuming a normal distribution).
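The statistic in the formula above can be sketched as follows. This is an illustration only: it places the sample variance under the square root, as in the textbook two-sample t statistic, whereas the formula above is written with std, and which of the two SeqExpress uses is not stated here. The function name is hypothetical, and the conversion of the statistic to a probability is left out.

```python
import math
import statistics


def t_statistic(a, b):
    """(mean(A) - mean(B)) / sqrt(var(A)/n + var(B)/n) for equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of values")
    n = len(a)
    var_a = statistics.variance(a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a / n + var_b / n)


gene_a = [1, 2, 3, 4]
gene_b = [2, 3, 4, 5]
t_value = t_statistic(gene_a, gene_b)  # about -1.095
```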