Clustering and Classification methods for Biologists



Principal Components Analysis


Name Description
algorithm A computational procedure which can be applied to obtain a solution to a problem.
canonical This term implies that something has been reduced to its simplest form (note that it also has a different religious meaning).
classification A systematic arrangement of objects (of any type) into groups or categories according to a set of established criteria.
coefficient of determination R2 The coefficient of determination is the proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X. If expressed as a percentage it lies in the range 0 - 100%. It is the square of the correlation coefficient (multiplied by 100 when expressed as a percentage).
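As a rough illustration, the short Python sketch below (using invented data) obtains R2 by squaring the correlation coefficient:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values of the independent variable X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical values of the dependent variable Y
r = np.corrcoef(x, y)[0, 1]               # correlation coefficient r
r_squared = r ** 2                        # coefficient of determination
print(f"R2 = {r_squared:.3f} ({100 * r_squared:.1f}% of the variation in Y explained)")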
correlation The correlation coefficient (r estimates rho) provides an index of the degree to which paired measures (X and Y) co-vary in a linear fashion. Its value is constrained to lie between -1 and +1. r is positive (> 0) when cases with large values of X also tend to have large values of Y, while cases with small values of X tend to have small values of Y. r is negative (< 0) when cases with large values of X tend to have small values of Y and vice versa. Correlation coefficients give no information about cause and effect. They also provide misleading information if the relationship between X and Y is non-linear.
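A minimal sketch (again with invented data) of how r behaves for positively and negatively related pairs of variables:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = np.array([1.2, 1.9, 3.2, 3.8, 5.1])    # tends to rise as x rises
y_neg = np.array([5.0, 4.1, 2.9, 2.2, 0.8])    # tends to fall as x rises
print(np.corrcoef(x, y_pos)[0, 1])             # close to +1
print(np.corrcoef(x, y_neg)[0, 1])             # close to -1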
covariance The covariance is similar to the correlation coefficient in that it measures the relationship between a pair of variables. However, unlike the correlation coefficient it is unstandardised (in a correlation coefficient the covariance is divided by the standard deviations of x and y). Because the covariance is unstandardised there is no limit to its possible values and it is difficult to compare covariances.
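The sketch below (invented data) shows the link between the two measures: dividing the covariance by the two standard deviations recovers the correlation coefficient.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
cov_xy = np.cov(x, y)[0, 1]                            # unstandardised covariance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # standardising gives r
print(cov_xy, r)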
dendrogram A dendrogram is a 'tree-like' diagram that summarises the process of clustering. Similar cases are joined by links whose position in the diagram is determined by the level of similarity between the cases.
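One possible way of producing a dendrogram, assuming the SciPy and Matplotlib libraries are available, is sketched below with a small invented data set:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# six hypothetical cases (rows) measured on two variables (columns)
data = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2],
                 [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]])
links = linkage(data, method="average")   # agglomerative (hierarchical) clustering
dendrogram(links)                         # tree-like summary of the successive merges
plt.show()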
distance The distance between two objects is a measure of the interval between them. It is important to understand that distances are not always measured by rulers, or their equivalent. Measurements of distance are also related to measures of similarity and dissimilarity.
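For example, the Euclidean (straight-line) distance between two objects measured on three variables can be sketched as:

import numpy as np

a = np.array([2.0, 3.0, 5.0])      # measurements for object a (invented)
b = np.array([4.0, 1.0, 6.0])      # measurements for object b (invented)
print(np.linalg.norm(a - b))       # Euclidean distance between a and b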
eigen value Eigen values can be found for square symmetric matrices. There are as many eigen values as there are rows (or columns) in the matrix. A formal description of an eigen value demands a sound knowledge of linear algebra. However, conceptually they can be considered to measure the strength (relative length) of an axis (derived from the square symmetric matrix). Eigen values are also known as latent roots.
eigen vector Each eigen value has an associated eigen vector. Where the eigen value measures the (relative) length of an axis, the eigen vector determines its orientation in space. The values in an eigen vector are not unique because any set of coordinates that describes the same orientation would be acceptable. Usually they are standardised in some way, e.g. so that their squared values sum to one. The eigen vectors are normally used to aid the interpretation of a multivariate analysis.
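The following NumPy sketch, using an invented covariance matrix, illustrates both of these terms: there are as many eigen values as rows, and each column of the eigen vector matrix gives the orientation of one axis.

import numpy as np

# hypothetical 3 x 3 covariance matrix (square and symmetric)
cov = np.array([[2.0, 0.8, 0.3],
                [0.8, 1.5, 0.4],
                [0.3, 0.4, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)                        # relative 'lengths' of the three axes
print(eigenvectors)                       # columns give the orientation of each axis
print(np.sum(eigenvectors[:, 0] ** 2))    # squared values of a vector sum to one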
hierarchy A system by which objects can be arranged in a graded order, typically represented by a series of ordered groupings such as those used for plant and animal classifications (e.g. classes, orders, families).
homoscedasticity A condition under which the response variable (y) has a constant variance for all values of x. It is a necessary condition for regression and analysis of variance.
independence Two events or variables are independent if knowledge of one provides no information about the value of the other. Thus, the probability of either is unaffected by the value of the other variable or event.
interaction An interaction occurs when the effects of two or more variables (in a regression analysis) or two or more factors (in an analysis of variance) are not independent of each other. For example, you may find that the effect of a treatment is not the same in the two sexes.
intercept The constant (a or b0) in a regression equation. It is the point where the regression line crosses the y axis, i.e. the value of y when x = 0.
linear combination A linear combination is a sum of two or more variables, for example
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk.
If any of the variables are linked by an operator other than + (or - ) the combination is no longer linear.
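As a small worked sketch (with invented coefficients and values), a linear combination is simply evaluated as a weighted sum:

import numpy as np

b0 = 3.0                          # intercept
b = np.array([1.5, 0.2, -0.7])    # hypothetical coefficients b1, b2, b3
x = np.array([10.0, 4.0, 2.5])    # values of X1, X2, X3 for one case
y = b0 + np.dot(b, x)             # b0 + b1*X1 + b2*X2 + b3*X3
print(y)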
linear relationship A relationship between two variables that can be described by a straight line. Non-linear relationships can often be linearised by the application of a transformation (e.g. a logarithmic transformation).
matrix A matrix is a tabular representation of a set of data. It is characterised by its dimensionality measured by the number of rows (r) and columns (c). If r = c the matrix is said to be square, and if the upper triangle of values is identical to the lower triangle of values it is said to be symmetric. If matrix algebra methods are employed the matrix is normally symbolised by a bold capital letter, e.g. M.
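A brief NumPy sketch of these properties, using a small invented matrix:

import numpy as np

M = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.6],
              [0.2, 0.6, 1.0]])
rows, cols = M.shape
print(rows == cols)          # True: the matrix is square
print(np.allclose(M, M.T))   # True: upper and lower triangles match, so it is symmetric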
normal distribution The normal or Gaussian distribution is one of the most important probability density functions, not least because many measurement variables have distributions that at least approximate to a normal distribution. It is usually described as bell shaped, although its exact characteristics are determined by the mean and standard deviation. It arises when the value of a variable is determined by a large number of independent processes. For example, weight is a function of many processes, both genetic and environmental. Many statistical tests assume that the data come from a normal distribution.
ordination Ordination is a term that is normally applied to a particular class of multivariate techniques when they are applied to ecological data. Generally they are geometrical methods that attempt to present multivariate data in fewer dimensions.
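A minimal sketch of this idea, using randomly generated data and the eigen analysis outlined above, projects four variables onto the two axes with the largest eigen values (the approach used by principal components analysis):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                 # hypothetical data: 30 cases, 4 variables
Xc = X - X.mean(axis=0)                      # centre each variable on its mean
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(vals)[::-1]               # largest eigen values first
scores = Xc @ vecs[:, order[:2]]             # project onto the first two axes
print(scores.shape)                          # (30, 2): the data in fewer dimensions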
residual The difference between an observed and an expected or predicted value.
significance test A significance test allows us to determine the probability of obtaining a value of a test statistic (e.g. t, r, F, chi-square) at least as extreme as that observed, given that the null hypothesis is true.
similarity The similarity between two objects is a measure of how closely they resemble each other. Dissimilarity is the inverse of similarity, and is closely related to the concept of distance.
slope The slope of a line is its gradient, i.e. the change in the dependent or response variable (y) per unit change in the independent or predictor variable (x).
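The invented example below fits a least-squares line and reports its slope and intercept together with the residuals, so it also illustrates the intercept and residual entries above:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)    # least-squares straight line
print(slope)                              # change in y per unit change in x
print(intercept)                          # value of y where the line crosses the y axis
print(y - (intercept + slope * x))        # residuals: observed minus predicted values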
standardization Raw data may be recorded on different scales, e.g. height in m and head circumference in mm. Inevitably this can result in data that have quite different ranges. It is possible to remove these scale effects if the data are standardised, i.e. forced onto the same scale. There are many ways of achieving this, including converting values to z scores (a mean of 0 and a standard deviation of 1) or scaling each variable to give a minimum of 0 and a maximum of 1.
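Both approaches are sketched below with invented height and head-circumference measurements:

import numpy as np

height_m = np.array([1.55, 1.62, 1.70, 1.81, 1.95])       # metres
head_mm = np.array([540.0, 552.0, 560.0, 575.0, 590.0])   # millimetres

def z_scores(v):
    return (v - v.mean()) / v.std(ddof=1)        # rescaled to mean 0, standard deviation 1

def min_max(v):
    return (v - v.min()) / (v.max() - v.min())   # rescaled to run from 0 to 1

print(z_scores(height_m))
print(min_max(head_mm))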
taxonomy Taxonomy is the biological discipline that is concerned with the classification of living organisms into groups based on the shared possession of characteristics.
unsupervised Most cluster analysis algorithms are unsupervised: the analyst does not impose any structure onto the classification; instead the classification 'emerges' from the data. Later, we may wish to investigate whether the classification matches some other grouping criterion (e.g. gender or species).
variability Most measured variables show some variation, i.e. their values are not constant between, or even within, experimental units.
variance The variance (or mean square) may be thought of as the average squared difference between observations and their mean. It gives an indication of the amount of variability in a set of data.
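As a final small sketch (invented observations), the variance is the sum of the squared deviations from the mean divided by one less than the number of observations:

import numpy as np

values = np.array([4.0, 7.0, 6.0, 5.0, 8.0])             # hypothetical observations
deviations = values - values.mean()                      # differences from the mean
variance = np.sum(deviations ** 2) / (len(values) - 1)   # 'average' squared deviation
print(variance, np.var(values, ddof=1))                  # the same value from numpy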