Clustering and Classification methods for Biologists

Outline of analyses


Outline of multivariate methods


The following two documents provide a broad context for the unit. The other links introduce the topics covered in the unit and link to more detailed pages.


Cluster Analysis

Detailed description of Cluster Analysis methods

Cluster analysis is a broad collection of methods used to group data into classes that share similar characteristics. Formal significance tests are generally not used; instead, the analysis is judged by the 'quality' of the outcome, i.e. how useful you find the results.
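As an illustration of one member of this collection, here is a minimal sketch of agglomerative clustering with single linkage, written in Python. The function name single_linkage, the choice of Euclidean distance and the example measurements are assumptions made for this sketch; the detailed page may use different methods and software.

```python
import math

def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # start with every object in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Made-up measurements (two variables per specimen), for illustration only.
specimens = [(1.0, 1.2), (1.1, 0.9), (5.0, 5.1), (5.2, 4.8), (0.9, 1.0)]
for group in single_linkage(specimens, 2):
    print(group)
```

Note that nothing in the code says whether two clusters is the 'right' answer; choosing k, and deciding whether the groups are useful, is left to the analyst.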


Principal Components Analysis (PCA)

Detailed description of PCA

PCA is a dimension reduction technique that exploits the correlations between variables to derive a smaller set of components (composite variables) that retain a large proportion of the original information in fewer dimensions.
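The idea can be sketched for the simplest case of two correlated variables, where the components come from the eigenvalues of a 2 x 2 covariance matrix. This is only a sketch under stated assumptions: the function name pca_2d and the wing and tail measurements are invented for the example, and real analyses would normally use a statistics package and often the correlation matrix rather than the covariance matrix.

```python
import math

def pca_2d(xs, ys):
    """PCA for two variables via the eigenvalues of the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] from the characteristic equation.
    mean_var = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = mean_var + spread, mean_var - spread
    # Proportion of the total variance retained by the first component.
    return lam1, lam2, lam1 / (lam1 + lam2)

# Made-up wing and tail lengths, for illustration only.
wing = [10.1, 11.3, 12.0, 12.8, 14.2]
tail = [5.0, 5.9, 6.1, 6.8, 7.4]
lam1, lam2, retained = pca_2d(wing, tail)
print("Proportion of variance on component 1:", round(retained, 3))
```

The stronger the correlation between the two variables, the closer the retained proportion gets to 1, which is why a single component can then stand in for both.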


Discriminant Analysis

Detailed description of Discriminant Analysis

Discriminant analysis is a technique that can be used to (a) find how two or more classes differ with respect to a set of predictor variables and/or (b) predict the class of an object from the values of its predictor variables. The outcome, but not the algorithm, is similar to logistic regression.


Logistic Regression

Detailed description of Logistic Regression

Logistic regression is a type of generalised linear model that is typically used to model the relationship between a binary (0/1) response variable and one or more predictor variables. In many, but not all, analyses it is equivalent to using discriminant analysis.
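A minimal sketch of the model with a single predictor is given below. It assumes plain gradient ascent on the log-likelihood, which is a swapped-in fitting method chosen for brevity (statistical packages typically use iteratively reweighted least squares). The function names, the learning rate and the presence/absence data are all invented for the example.

```python
import math

def predict(b0, b1, x):
    """Predicted probability that the response is 1 for predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Estimate intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - predict(b0, b1, x)  # observed minus predicted
            g0 += residual
            g1 += residual * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Made-up data: a habitat score (x) and species presence (1) or absence (0).
score = [1, 2, 3, 4, 5, 6, 7, 8]
present = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(score, present)
print("P(present) at score 2:", round(predict(b0, b1, 2), 3))
print("P(present) at score 7:", round(predict(b0, b1, 7), 3))
```

The two classes overlap in this example on purpose: if they were perfectly separated by the predictor, the estimated slope would keep growing and the fit would not settle on finite values.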


Generalised additive models

Description of, and example analysis using, a Generalised Additive Model.

Generalised Additive Models (GAMs) are related to the generalised linear model (e.g. logistic regression). However, they are not fully parametric models because the regression coefficients are replaced by non-parametric smoothing functions, which model, to a user-defined level of complexity, the relationships between the class variable and the predictors.


Decision Trees

Detailed description of Decision Trees

Decision trees predict the class of an object by a series of (usually binary) decisions. In many respects they are similar to the familiar species identification keys. The decisions identify thresholds that maximally separate groups. A more recent, and more robust, approach based on decision trees is the random forest, which combines the predictions of many trees (randomForest is the name of a widely used R implementation).
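The search for a threshold that best separates the groups can be sketched for a single variable as follows. The sketch assumes Gini impurity as the measure of separation, which is one common choice among several; the function names and the leaf measurements are invented for the example. A full tree would repeat this search within each resulting subset.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return the cut point on one variable that minimises weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no cut point exists between tied values
        cut = (v1 + v2) / 2  # candidate threshold midway between neighbours
        left = [c for v, c in pairs if v <= cut]
        right = [c for v, c in pairs if v > cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Made-up leaf lengths (mm) for two hypothetical species, for illustration only.
length = [21, 24, 26, 30, 33, 35, 38]
species = ["A", "A", "A", "B", "A", "B", "B"]
cut, score = best_threshold(length, species)
print("Split at", cut, "with weighted impurity", round(score, 3))
```

Read as a key, the result is a single couplet of the form 'leaf length at or below the threshold: go left; otherwise: go right'.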


Artificial neural networks

Description of Artificial Neural Networks

Artificial neural networks belong to a class of methods variously known as parallel distributed processing or connectionist techniques. They are an attempt to simulate a real neural network, which is composed of a large number of interconnected, but independent, neurones. However, in most artificial neural networks this parallel structure is only simulated, since they are implemented in software on a single CPU. It is generally considered that neural networks do well when data structures are not well understood. This is because they are able to combine and transform raw data without any user input. One disadvantage of most artificial neural networks is that the learned relationships are distributed amongst the connections; this makes them potentially difficult to interpret.


Other methods

Link to other multivariate methods

This page contains links to a variety of other methods, covered in less detail. They are not core to this resource but may be useful in your studies.


Measuring accuracy

Measuring the accuracy of predictions

This page contains a description of problems and solutions associated with the measurement of prediction accuracy in a technique such as logistic regression (or any method that makes binary predictions).
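One widely recognised difficulty, shown here as a sketch and not necessarily the one the linked page concentrates on, is that overall accuracy can look good when one class is rare. The example below assumes 0/1 labels; the function names and the data are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1  # predicted 0 but actually 1 (labels assumed to be 0/1)
    return tp, fp, tn, fn

def accuracy_measures(actual, predicted):
    """Overall accuracy plus the class-specific rates it can hide."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # proportion of real 1s predicted as 1
        "specificity": tn / (tn + fp),  # proportion of real 0s predicted as 0
    }

# Made-up example: a species present at only 5 of 100 sites,
# and a useless classifier that always predicts absence.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100
print(accuracy_measures(actual, predicted))
```

In this example 95 of the 100 predictions are correct, yet none of the five presences is detected, so sensitivity is zero. Reporting sensitivity and specificity alongside accuracy exposes this.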