What is Redundancy Analysis (RDA)?
Redundancy Analysis, or RDA, is a multivariate statistical technique used to summarize the linear relationship between two matrices of variables: a set of response variables (e.g., species abundance data) and a set of explanatory or predictor variables (e.g., environmental data). Introduced by van den Wollenberg in 1977, RDA was created as a non-symmetrical alternative to Canonical Correlation Analysis (CCorA): one set of variables is treated as explaining the other, which makes the method well suited to testing hypotheses about the influence of predictors on responses.
Unlike an unconstrained ordination technique such as Principal Component Analysis (PCA), which simply reveals underlying structure within a single dataset, RDA forces the ordination axes to be linear combinations of the explanatory variables. This constrained approach allows researchers to directly assess how much of the variation in the response matrix can be explained by the explanatory matrix. Because RDA is a correlational method, it cannot establish causation on its own, but it is a key tool for testing hypothesized relationships and exploring complex ecological patterns.
The Core Mechanism of RDA
Conceptually, RDA works in two main steps, combining two familiar statistical methods: multiple linear regression and PCA.
- Multiple Regression: Each response variable in the matrix (Y) is regressed on all explanatory variables in the matrix (X). This creates a new matrix of "fitted values" which represents the portion of the response data that can be linearly predicted by the explanatory variables.
- Principal Component Analysis: A standard PCA is then performed on this matrix of fitted values. The resulting principal components, also known as canonical axes, are constrained to be linear combinations of the explanatory variables. A separate, independent PCA is also performed on the residuals (the part of the data not explained by the model).
This two-step process partitions the total variation in the response matrix into two parts: a constrained part that is explained by the explanatory variables, and an unconstrained part that is not explained by the model.
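The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rda`, the returned dictionary keys, and the simulated data are all assumptions made for the example, and it omits the scaling options and score conventions that a full package such as vegan provides.

```python
import numpy as np

def rda(Y, X):
    """Minimal RDA sketch: regress Y on X, then run PCA on fitted values and residuals."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    n = Y.shape[0]

    # Center both matrices so no intercept term is needed.
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)

    # Step 1: multiple regression of every response column on all explanatory columns.
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    Y_fit = Xc @ B          # part of Y linearly predicted by X
    Y_res = Yc - Y_fit      # part of Y left unexplained

    # Step 2: PCA of the fitted values (via SVD) gives the canonical axes.
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    constrained_eig = s**2 / (n - 1)

    # A separate PCA of the residuals gives the unconstrained axes.
    s_res = np.linalg.svd(Y_res, compute_uv=False)
    unconstrained_eig = s_res**2 / (n - 1)

    total_variance = Yc.var(axis=0, ddof=1).sum()
    return {
        "constrained_eig": constrained_eig,
        "unconstrained_eig": unconstrained_eig,
        "total_variance": total_variance,
        "r2": constrained_eig.sum() / total_variance,  # proportion explained by X
        "site_scores": U * s,      # observation scores as linear combinations of X
        "species_scores": Vt.T,    # response-variable loadings on the canonical axes
    }

# Hypothetical simulated data: 30 sites, 2 explanatory and 5 response variables.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 2))
Y_demo = X_demo @ rng.normal(size=(2, 5)) + rng.normal(size=(30, 5))
result = rda(Y_demo, X_demo)
print(f"Proportion of variance constrained: {result['r2']:.3f}")
```

Because least-squares fitted values and residuals are orthogonal, the constrained and unconstrained eigenvalues in this sketch sum to the total variance of the centered response matrix, which is the partition described above.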
RDA vs. PCA: A Comparative Overview
Understanding RDA is often easiest when contrasted with PCA. While both are powerful dimension-reduction techniques for visualizing complex multivariate data, their core objectives differ significantly.
| Feature | Principal Component Analysis (PCA) | Redundancy Analysis (RDA) | 
|---|---|---|
| Purpose | To identify underlying structure and capture maximum variance within a single dataset for data reduction and visualization. | To model and explain the variation in one dataset (response variables) using a second dataset (explanatory variables). | 
| Variables | Operates on a single matrix of variables, treating all variables symmetrically. | Considers two distinct matrices: response variables (dependent) and explanatory variables (independent). | 
| Approach | An unconstrained ordination method, meaning its axes are determined solely by the internal variability of the data. | A constrained (or canonical) ordination method, meaning its axes are constrained by and are linear combinations of the explanatory variables. | 
| Visualization | Generates an ordination plot that shows the relationships between observations and variables based on their correlations. | Produces a triplot showing the relationships between observations, response variables, and explanatory variables. | 
| Output | Provides eigenvalues indicating the total variance captured by each principal component. | Partitions total variance into explained (constrained) and unexplained (unconstrained) components. | 
Applications of RDA
RDA is particularly valuable in fields where understanding the relationship between multiple dependent variables and multiple independent variables is crucial. Its primary applications include:
- Ecological and Environmental Science: This is the most common application of RDA. Ecologists use RDA to study how environmental factors, such as soil composition, pH, or climate, influence the composition of species communities in different areas.
- Landscape Genomics: RDA can be used to identify associations between genetic data (response variables) and environmental variables (explanatory variables). This helps researchers understand how environmental pressures drive genetic adaptation.
- Soil Science: Researchers apply RDA to determine how management practices (e.g., farming techniques) or environmental conditions influence a set of soil properties.
- Bioinformatics: In microbiome studies, RDA can explore how microbial community composition is influenced by different host or environmental factors.
Steps for Performing RDA
Performing an RDA involves a sequence of steps, often implemented using statistical software like R with the vegan package.
- Data Preparation: Assemble your two data matrices: one for the response variables (Y) and one for the explanatory variables (X). Data scaling and transformations (e.g., Hellinger transformation for ecological count data) are often necessary.
- Model Specification: Define the RDA model using a formula that specifies the response and explanatory variables.
- Model Fitting: Run the RDA function to fit the model to your data.
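- Model Assessment: Evaluate the model's significance using permutation tests, which compare the observed test statistic (typically a pseudo-F ratio) with the values obtained after repeatedly shuffling the rows of one matrix to break the relationship between the two datasets.
- Interpretation and Visualization: Analyze the results by examining the triplot. The triplot typically displays observations as points, response variables as points or arrows, and quantitative explanatory variables as arrows, allowing you to visually interpret the relationships.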
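As a rough illustration of the preparation and assessment steps, the sketch below applies a Hellinger transformation to count data and runs a simple permutation test on the pseudo-F statistic. It uses NumPy rather than R's vegan package; the function names, the simulated data, and the choice to permute rows of the response matrix are assumptions made for this example, and it assumes the explanatory matrix has full column rank.

```python
import numpy as np

def hellinger(counts):
    """Square root of row-wise relative abundances."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: sites with no individuals are left as zeros (often they are dropped instead).
    props = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return np.sqrt(props)

def r_squared(Y, X):
    """Proportion of the total variation in Y explained by a linear model on X."""
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ B
    return (fitted**2).sum() / (Yc**2).sum()

def pseudo_f(r2, n, m):
    """Pseudo-F for n observations and m explanatory variables."""
    return (r2 / m) / ((1.0 - r2) / (n - m - 1))

def permutation_test(Y, X, n_perm=999, seed=0):
    """Permute rows of Y to build a null distribution of the pseudo-F statistic."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    f_obs = pseudo_f(r_squared(Y, X), n, m)
    hits = 0
    for _ in range(n_perm):
        Y_perm = Y[rng.permutation(n)]
        if pseudo_f(r_squared(Y_perm, X), n, m) >= f_obs:
            hits += 1
    # The observed statistic counts as one member of the null distribution.
    p_value = (hits + 1) / (n_perm + 1)
    return f_obs, p_value

# Hypothetical data: 30 sites, 2 environmental variables, 4 species counts.
rng = np.random.default_rng(1)
X_env = rng.normal(size=(30, 2))
lam = np.exp(2.0 + 0.5 * X_env @ rng.normal(size=(2, 4)))
Y_counts = rng.poisson(lam)
f_obs, p_value = permutation_test(hellinger(Y_counts), X_env, n_perm=199)
print(f"pseudo-F = {f_obs:.2f}, permutation p = {p_value:.3f}")
```

The p-value can never fall below 1 / (n_perm + 1), so the number of permutations sets the finest significance level the test can report.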
Assumptions and Limitations of RDA
While powerful, RDA relies on several key assumptions, and it is important to be aware of its limitations:
- Linear Relationships: RDA assumes that the relationships between the response and explanatory variables are linear. If relationships are non-linear (e.g., unimodal), Canonical Correspondence Analysis (CCA) may be more appropriate.
- Euclidean Distance: Standard RDA preserves Euclidean distances among observations, which can be unsuitable for certain types of data, such as species count data with many zeros. Transforming the data first (e.g., with the Hellinger transformation) or using distance-based RDA (db-RDA), which allows other distance measures, addresses this.
- Quantitative Data: RDA is primarily designed for use with quantitative variables, though versions can incorporate qualitative variables.
- No Multicollinearity: The explanatory variables should not be highly correlated with each other to avoid misinterpreting their individual effects.
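- Constraints on Variables: The number of explanatory variables should be well below the number of observations. With n observations, a model with n − 1 or more linearly independent explanatory variables fits the response data perfectly and leaves no residual variation to test against.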
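Two of the points above, multicollinearity and the limit on the number of explanatory variables, can be checked before fitting. The sketch below computes variance inflation factors (VIFs) with NumPy; the function name and the simulated data are assumptions for illustration, and the commonly quoted cutoff of about 10 for a problematic VIF is a rule of thumb rather than a fixed standard.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    if m >= n - 1:
        raise ValueError("Too many explanatory variables for the number of observations.")
    if m == 1:
        return np.array([1.0])  # a single predictor cannot be collinear with others

    Xc = X - X.mean(axis=0)
    vifs = []
    for j in range(m):
        target = Xc[:, j]
        others = np.delete(Xc, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / (target @ target)
        vifs.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical predictors: the second is almost a copy of the first.
rng = np.random.default_rng(7)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)
x3 = rng.normal(size=50)
print(variance_inflation_factors(np.column_stack([x1, x2, x3])))
```

In this simulated example the first two columns should show very large VIFs and the third a value near 1, which suggests dropping or combining one of the near-duplicate predictors before running the RDA.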
Conclusion
Redundancy Analysis is a well-established statistical method for exploring and modeling the relationships between two sets of multivariate data. By combining multiple regression and PCA, it quantifies how much of the variation in a response dataset can be explained by a predictor dataset. This capability makes it a widely used tool for researchers in ecology, environmental science, and genetics seeking to understand complex, interacting systems. Its assumptions regarding linearity and data type must be met, but alternatives such as db-RDA offer flexibility when they are not. For more details on the practical application of RDA in R, see the thorough documentation available through UW Pressbooks.