What is Redundancy Analysis?
Redundancy analysis (RDA) is a multivariate statistical method used to explain the variation in one set of variables (the response matrix, Y) using a second set of variables (the explanatory matrix, X). Conceived by Van den Wollenberg in 1977 as an alternative to Canonical Correlation Analysis (CCorA), RDA combines principles from multiple linear regression and principal component analysis (PCA). It is a "constrained" form of ordination, meaning the axes of the analysis are limited by the explanatory variables provided. This differs from an unconstrained ordination like PCA, which simply finds the axes of maximum variation without any external constraints.
Unlike CCorA, which is a symmetric method, RDA is non-symmetric. It specifically seeks linear combinations of the explanatory variables (X) that best explain the variation in the response variables (Y). This makes it a powerful tool for testing hypotheses about how environmental or experimental factors might influence species composition, gene expression, or other multivariate response data.
How Does RDA Work?
An RDA conceptually involves two main steps. First, a series of multiple linear regressions is performed, with each response variable regressed against all explanatory variables. This produces a matrix of fitted values. Second, a PCA is run on this matrix of fitted values. The resulting principal components are the constrained, or canonical, axes of the RDA, and the variation they account for is the portion of the response data explained by the explanatory variables. The remaining, unexplained variation is captured by unconstrained axes, which are obtained from a PCA of the regression residuals.
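The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rda_two_step`, the returned dictionary keys, and the simulated data are all assumptions made for the example, and dedicated software applies additional scaling and bookkeeping that is omitted here.

```python
import numpy as np

def rda_two_step(Y, X):
    """Sketch of RDA as multivariate regression followed by PCA."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = Y.shape[0]

    # Center both matrices so that no intercept column is needed.
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)

    # Step 1: regress every response column on all explanatory columns.
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    Y_fit = Xc @ B       # matrix of fitted values
    Y_res = Yc - Y_fit   # matrix of residuals

    # Step 2: PCA of a centered matrix via SVD; eigenvalues are s^2 / (n - 1).
    def pca_eigenvalues(M):
        s = np.linalg.svd(M, compute_uv=False)
        return s ** 2 / (n - 1)

    constrained = pca_eigenvalues(Y_fit)     # canonical axes
    unconstrained = pca_eigenvalues(Y_res)   # residual axes
    total_variance = Yc.var(axis=0, ddof=1).sum()
    return {
        "constrained_eigenvalues": constrained,
        "unconstrained_eigenvalues": unconstrained,
        "total_variance": total_variance,
        "r_squared": constrained.sum() / total_variance,
    }

# Simulated data (illustrative only): 30 sites, 2 predictors, 4 responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = X @ rng.normal(size=(2, 4)) + rng.normal(scale=0.5, size=(30, 4))

result = rda_two_step(Y, X)
print("Proportion of variance explained:", result["r_squared"])
```

Because the fitted values and residuals are orthogonal, the constrained and unconstrained eigenvalues together add up to the total variance of Y. With two explanatory variables, at most two constrained eigenvalues are meaningfully greater than zero.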
Key Components of RDA
- Response Variables (Y): This data matrix contains the variables you wish to explain or predict. Examples include species abundance in different sites or gene expression levels in different individuals.
- Explanatory Variables (X): This matrix contains the variables that are hypothesized to influence the response variables. These can be environmental factors like pH, temperature, or land use type.
- Constrained (Canonical) Axes: These axes represent the variation in the response data that is statistically explained by the explanatory variables. They are linear combinations of the explanatory variables.
- Unconstrained (Residual) Axes: These represent the variation in the response data that remains unexplained by the explanatory variables included in the model.
Interpreting the RDA Biplot
The most common way to visualize RDA results is through a biplot. This plot displays the relationships among the response variables, explanatory variables, and observations (e.g., samples, sites) in a reduced-dimensional space.
Visual interpretation guidelines:
- Arrow Length: The length of an explanatory variable's arrow indicates the strength of its correlation with the constrained ordination axes. Longer arrows indicate a greater influence.
- Arrow Direction: The direction of an arrow shows the direction in which that explanatory variable increases. Response variables whose arrows point in a similar direction to an explanatory variable's arrow are positively correlated with it.
- Angle Between Arrows: The angle between two variable arrows reflects their correlation. A small angle indicates a strong positive correlation, a 90-degree angle indicates no correlation, and a 180-degree angle indicates a strong negative correlation.
- Observation Points: Points representing individual samples or sites are positioned based on their scores on the ordination axes. Sites that plot closer together are more similar in their response variable composition.
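The angle rule above can be made concrete with a small helper that measures the angle between two arrows and reports its cosine, which is the correlation that the angle suggests. This is only a sketch: the function name and the arrow coordinates are hypothetical, and in a real two-dimensional biplot the reading is approximate and depends on the scaling the software used for the plot.

```python
import numpy as np

def arrow_angle_degrees(u, v):
    """Angle in degrees between two biplot arrows given as coordinate pairs."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# Hypothetical arrow tips read off a biplot (illustrative values only).
arrows = {
    "pH": (1.0, 0.2),
    "temperature": (0.9, 0.4),
    "moisture": (-0.2, 1.0),
}

for a, b in [("pH", "temperature"), ("pH", "moisture")]:
    angle = arrow_angle_degrees(arrows[a], arrows[b])
    implied = np.cos(np.radians(angle))
    print(f"{a} vs {b}: angle = {angle:.1f} degrees, implied correlation = {implied:.2f}")
```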
When to Use RDA: Applications and Examples
RDA is a versatile tool applicable in any field dealing with multivariate data where a directional or constrained analysis is needed. Its primary use case is when you assume that the relationships between your response and explanatory variables are linear.
Some common applications include:
- Ecology: Analyzing how environmental variables like soil properties, temperature, or precipitation influence community composition (e.g., microbes, plants, animals).
- Genomics: Explaining patterns in genetic data (e.g., SNP data) using environmental factors to assess genotype-environment associations.
- Bioinformatics: Understanding how different treatments or conditions affect gene expression profiles by using treatment labels as explanatory variables.
- Soil Science: Investigating how management practices influence a suite of soil properties simultaneously.
RDA vs. CCA: Choosing the Right Analysis
The choice between RDA and Canonical Correspondence Analysis (CCA) is crucial and depends on the underlying data structure, specifically the expected nature of the relationships between variables.
| Feature | Redundancy Analysis (RDA) | Canonical Correspondence Analysis (CCA) | 
|---|---|---|
| Model Assumption | Assumes linear relationships between response and explanatory variables. | Assumes unimodal (curved, bell-shaped) relationships between response and explanatory variables. | 
| Distance Measure | Based on Euclidean distances. Appropriate for shorter gradients. | Based on chi-square distances. Appropriate for longer gradients. | 
| Focus | Maximizes the explained variance of the response variables. | Maximizes the correlation between site and species scores. | 
| Data Type | Handles quantitative and qualitative variables. Can be sensitive to variables with many zeros. | Preferred for data with many double-zeros, like species abundance data across long gradients. | 
| Application | Ecological studies with linear environmental responses, genomics. | Ecological niche modeling, long environmental gradients. | 
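The distance row of the table can be illustrated with a toy abundance table. The sketch below compares Euclidean distances on raw abundances with chi-square distances, the two measures the table associates with RDA and CCA respectively. The three-site table and the function names are assumptions made for this example; the point is only to show how shared absences and differences in total abundance affect the two measures.

```python
import numpy as np

# Toy site-by-species table (rows: sites, columns: species), illustrative only.
# Sites 1 and 3 share the same species in the same proportions;
# site 2 has no species in common with either of them.
Y = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 4.0, 4.0],
])

def euclidean_distances(Y):
    """Pairwise Euclidean distances between rows."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def chi_square_distances(Y):
    """Pairwise chi-square distances between row profiles."""
    row_profiles = Y / Y.sum(axis=1, keepdims=True)   # relative abundances per site
    col_weights = Y.sum(axis=0) / Y.sum()             # relative weight of each species
    diff = row_profiles[:, None, :] - row_profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_weights).sum(axis=2))

d_euc = euclidean_distances(Y)
d_chi = chi_square_distances(Y)

print("Euclidean  site1-site2:", d_euc[0, 1], " site1-site3:", d_euc[0, 2])
print("Chi-square site1-site2:", d_chi[0, 1], " site1-site3:", d_chi[0, 2])
```

On raw abundances, the Euclidean distance places sites 1 and 2 closer together than sites 1 and 3, even though sites 1 and 2 share no species. The chi-square distance treats sites 1 and 3 as identical because their species proportions match. This is one reason transformations such as Hellinger are often applied before running RDA on abundance data.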
The Step-by-Step Process for Conducting RDA
- Data Preparation: Organize your data into a response matrix (Y) and an explanatory matrix (X). Each matrix should have the same number of observations (rows). Consider data transformations or scaling if necessary, for example, using Hellinger transformation for species abundance data.
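- Multicollinearity Check: Evaluate the explanatory variables for high correlations among themselves, for example with variance inflation factors (VIFs). Highly correlated variables inflate the variance of the estimated regression coefficients, which makes them unstable and hard to interpret. If necessary, remove or combine some variables.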
- Model Fitting: Use statistical software like R (with the vegan package) or XLSTAT to fit the RDA model. The software performs the regressions and constrained PCA internally.
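- Significance Testing: Perform permutation tests (e.g., the anova function in vegan, which runs a permutation test when applied to a fitted RDA model) to determine whether the relationship between the explanatory and response variables is statistically significant. A significant result indicates that the explanatory variables explain more variation than would be expected by chance.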
- Result Interpretation: Analyze the model summary, including eigenvalues and inertia percentages, to understand the amount of variance explained. Focus on the constrained axes.
- Visualization: Generate and interpret a biplot to visualize the relationships between all variables and observations. Assess arrow directions, lengths, and angles to draw conclusions.
- Refine and Report: Based on the results, you can perform further analyses like partial RDA to account for nuisance variables or build a more refined model. The final report should include the amount of explained variance and an interpretation of the biplot.
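Several of the steps above (Hellinger transformation, multicollinearity check, and permutation testing) can be sketched together in NumPy. Everything here is illustrative: the function names, the simulated count data, and the simple scheme of permuting the rows of Y are assumptions for this example and are not necessarily what any particular package does internally. The pseudo-F statistic assumes that X has full column rank.

```python
import numpy as np

def hellinger(counts):
    """Hellinger transformation: square root of row-wise relative abundances."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with a zero total are left as zeros to avoid dividing by zero.
    rel = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return np.sqrt(rel)

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

def explained_fraction(Y, X):
    """Share of the total variation in Y captured by a linear fit on X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    return ((Xc @ B) ** 2).sum() / (Yc ** 2).sum()

def permutation_test(Y, X, n_perm=999, seed=0):
    """Permutation test of the overall model, shuffling the rows of Y."""
    rng = np.random.default_rng(seed)
    n, m = X.shape

    def pseudo_f(r2):
        return (r2 / m) / ((1.0 - r2) / (n - m - 1))

    f_obs = pseudo_f(explained_fraction(Y, X))
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        if pseudo_f(explained_fraction(Y[perm], X)) >= f_obs:
            hits += 1
    # The observed statistic counts as one of the permutations.
    return f_obs, (hits + 1) / (n_perm + 1)

# Simulated data (illustrative only): 40 sites, 2 predictors, 5 species counts.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
coef = np.array([[0.8, -0.8, 0.5, -0.5, 0.0],
                 [0.3, 0.3, -0.6, 0.0, 0.6]])
counts = rng.poisson(np.exp(2.0 + X @ coef))

Y = hellinger(counts)                       # data preparation
vifs = vif(X)                               # multicollinearity check
r2 = explained_fraction(Y, X)               # model fitting
f_obs, p_value = permutation_test(Y, X)     # significance testing

print("VIFs:", vifs)
print("Explained fraction:", r2)
print("Pseudo-F:", f_obs, "permutation p-value:", p_value)
```

A common rule of thumb treats large VIF values as a sign that a variable is nearly redundant with the others, although the cutoff used varies between authors. With 999 permutations, the smallest attainable p-value is 0.001.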
Conclusion
Redundancy analysis provides a structured, hypothesis-driven approach to exploring complex relationships within multivariate data. By merging elements of regression and PCA, RDA quantifies how much of the variation in a set of response variables can be explained by a set of predictor variables. Its visual output, the biplot, is an intuitive way to represent these multivariate relationships, making it a valuable tool for researchers across ecology, genetics, and other scientific domains. Although it assumes linear relationships, its practical utility and interpretability have made it a widely used method, particularly within the ecological community.
Visit the R vegan package documentation for more details on performing RDA.