Module 8 Planning your analysis (what test?)

One of the main questions that biologists have after collecting data is, “What statistical test do I use?” Ignoring the fact that they should have thought about that before collecting data, the answer to that question depends on the answers to two other questions:

  1. What question are you trying to answer?
  2. What kind of data do you have?

After exploring the data, checking data distributions and gaining an understanding of the patterns and problems in a dataset, it is time to run the main analysis. This guide is designed to help you use what you learned in exploring the data to choose an appropriate analytical strategy.

8.1 How to use this guide

The sections below list some common analytical strategies that you might use for common types of biological questions and guide you through the selection process. This document is meant to serve as a primer and rough guide; it is not meant to be exhaustive, and it is very much a work in progress. Not every situation is covered, and there might be issues in your dataset that make the recommendations below problematic. Each section is structured as a dichotomous key, because biologists love dichotomous keys. At each step, choose the option that best applies to your situation. If you don’t know the answer to a question or which option to choose, you may need to go back and do some more exploratory data analysis. R functions are provided for most methods.

8.2 What question are you trying to answer?

The first thing to nail down is what statistical question you are trying to answer. This is not the same as the biological question you are investigating, but the two are related. An example biological question might be:

How do small mammals (rodents) adapt to live in urban environments?

This is a good biological question, but it is not a question that you can answer with statistics. To answer the overarching biological question, we usually need to break it down into smaller chunks: hypotheses that can either be true or false, and predictions (patterns in the data) that would be observed if the hypothesis is true and not observed if it is false. For the question above, we might have a working hypothesis like:

Rodents living in urban environments shift their diet to take advantage of anthropogenic food sources.

This is a good scientific hypothesis, because it is falsifiable. For this or any other hypothesis we need to be able to define:

  1. What would we observe if the hypothesis is not true?
  2. What would we observe if the hypothesis is true?

For this statement, these are relatively straightforward. We are closer to a statistical hypothesis, but we’re not quite there yet. What does it mean to “shift their diet”? Compared to what? How do we quantify a “shift”?

To make a statistical hypothesis we need to think of specific quantitative or categorical comparisons that would support or contradict the hypothesis. For example, here are a few of the potential predictions implied by the hypothesis above:

  • Rodents in urban environments will have a greater proportion of artificial foods in their feces than conspecific rodents in rural or natural environments.
  • Rodents in urban environments will consume a smaller number of different kinds of food than rodents in rural environments because of the availability of calorie-dense foods associated with humans.
  • Rodents in urban environments will spend less time foraging and make shorter foraging movements than rodents in rural or natural environments.

These three predictions are statements of the kinds of patterns that we should observe if the hypothesis were true. We can also define their logical negations: what would we observe if the hypothesis were not true? Critically, these predictions include comparison phrases like “greater than,” “smaller number of,” and “less time than.” Those comparisons are the sorts of patterns that a statistical model can actually evaluate. Knowing what kind of pattern you are testing is a key step in planning your data analysis.

Broadly speaking, statistical hypotheses in biology tend to fall into just a few categories:

  1. Testing for a difference in mean or location between groups.
  2. Testing for a continuous relationship between a response variable (or multiple variables) and one or more predictor variables. Also referred to as “predicting” or “modeling” a continuous variable.
  3. Testing for (lack of) independence between categories of observations.
  4. Classifying observations based on predictor variables.

These categories are not mutually exclusive. Many questions phrased as one category can be reframed as another. How you frame your question depends on many factors and at some point will involve some discretion on your part. There is often more than one right way to analyze a dataset, but the right ways are a vanishingly small subset of the possible ways to analyze your data.

8.3 Testing for a difference in mean or location

Dichotomous key for testing for a difference in mean or location:

1. Single response variable (or, one variable tested at a time): 5
1′. Multiple response variables tested at once: 2

  2. Data multivariate normal: 3
  2′. Data not multivariate normal: 4

3. All conditions for MANOVA met: try MANOVA. R function manova().
3′. MANOVA conditions not all met: 4.

  4. Try permutational MANOVA (PERMANOVA) (R function vegan::adonis2()) or:
  4′. Ordinate using NMDS, then try MRPP in NMDS-space. R functions vegan::metaMDS() followed by vegan::mrpp().

5. Comparing between 2 groups (i.e., a single factor with 2 levels): 6
5′. Comparing between >2 groups: 8

  6. Data meet assumptions of t-test: use a t-test. R function t.test().
  6′. Data do not meet assumptions of t-test: 7.

7. Need to predict values of response variable: use GLM. R function glm().
7′. Need only to test for difference in location: use Mann-Whitney test (aka Wilcoxon rank-sum test). R function wilcox.test().

  8. Response variable normal, variance homogeneous: use ANOVA. R function lm() followed by anova() and/or aov().
  8′. Response variable nonnormal OR variance heterogeneous: 9.

9. Need to predict values of response variable: use GLM. R function glm().
9′. Need to test for difference in location only: use Kruskal-Wallis test. R function kruskal.test().
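To make the endpoints of this key concrete, here is a minimal R sketch on simulated data; the group names, sample sizes, and effect sizes are all hypothetical.

```r
# Hypothetical data: body masses (g) of rodents at two site types
set.seed(42)
urban <- rnorm(20, mean = 35, sd = 5)
rural <- rnorm(20, mean = 30, sd = 5)

# Key step 6: two groups, t-test assumptions met
t.test(urban, rural)

# Key step 7′: rank-based alternative when assumptions fail
wilcox.test(urban, rural)

# Key steps 8 and 9′: more than two groups
site <- factor(rep(c("urban", "suburban", "rural"), each = 20))
mass <- rnorm(60, mean = rep(c(35, 32, 30), each = 20), sd = 5)
anova(lm(mass ~ site))      # ANOVA (normal, homoscedastic)
kruskal.test(mass ~ site)   # Kruskal-Wallis (rank-based)
```

Note that all four tests answer the same broad question (“do the groups differ in location?”) but make different assumptions about the data.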

8.3.1 Additional considerations

Once you reach a decision from the key above, there are a few more options to consider:

  • Are there any interactions between your predictor variables? This means that the effects of two predictors are not additive; i.e., the effect of one variable affects the effect of another variable. Most of the methods above can accommodate interactions. You can use R function interaction.plot() to explore potential interactions visually.
  • Is there a hierarchical structure to your data, with the higher-level groups representing variation that you are interested in explaining? This is often called “blocking” and needs to be included in your model. Usually the blocking factor can be included as a predictor in the model.
  • Is there a hierarchical structure to your data, with the higher-level groups representing variance you are not interested in but still need to account for? If so, you may want to consider adding random effects to your model. This creates a mixed model (see Module 12).
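For the first consideration, interaction.plot() gives a quick visual check. Here is a minimal sketch on simulated data (all names and effect sizes are hypothetical); the interaction term itself is written with * in the model formula:

```r
# Hypothetical data: mass depends on site, season, and their interaction
set.seed(1)
site   <- factor(rep(c("urban", "rural"), each = 20))
season <- factor(rep(c("summer", "winter"), times = 20))
mass <- 30 + 3 * (site == "urban") + 2 * (season == "winter") +
        4 * (site == "urban") * (season == "winter") + rnorm(40, sd = 2)

# Non-parallel lines in this plot suggest an interaction
interaction.plot(x.factor = site, trace.factor = season, response = mass)

# site * season expands to site + season + site:season
anova(lm(mass ~ site * season))
```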

8.4 Testing for a continuous relationship between two or more variables

Dichotomous key for testing continuous relationships:

1. Need to test whether two variables vary together, not predict actual values: 2
1′. Need to predict values of response variables: 3

  2. Both variables normally distributed, suspected relationship linear: use Pearson’s product-moment correlation r. R functions cor() or cor.test().
  2′. One or both variables nonnormal and/or suspected relationship nonlinear: use Spearman’s rank correlation coefficient ρ (“rho”). R functions cor() or cor.test().

3. Suspected relationship linear, or can be made linear by transformation: 4
3′. Suspected relationship nonlinear and cannot be made linear: 5.

  4. Response variable normally distributed or can be transformed to be normal: use linear regression or multiple linear regression (on the transformed scale, if needed). R function lm().
  4′. Response variable nonnormal and cannot be transformed to normality: use GLM. R function glm().

5. Response variable and residuals normal: use nonlinear models. R function nls().
5′. Response variable and residuals nonnormal: try GAM, GNM, or regression trees (in that order). R functions mgcv::gam(), gnm::gnm(), and rpart::rpart(), respectively.
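As a sketch of endpoints 2, 2′, 4, and 4′ of this key in R, using simulated data (the variable names and effect sizes are hypothetical):

```r
# Hypothetical data: foraging distance vs. distance to nearest building
set.seed(7)
building_dist <- runif(30, 0, 500)
forage_dist   <- 20 + 0.3 * building_dist + rnorm(30, sd = 15)

# Key step 2: both variables roughly normal, relationship linear
cor.test(building_dist, forage_dist, method = "pearson")

# Key step 2′: rank-based alternative
cor.test(building_dist, forage_dist, method = "spearman")

# Key step 4: predict values with linear regression
fit_lm <- lm(forage_dist ~ building_dist)
summary(fit_lm)

# Key step 4′: nonnormal response (here, counts) -> GLM
food_items <- rpois(30, lambda = exp(1 + 0.002 * building_dist))
fit_glm <- glm(food_items ~ building_dist, family = poisson)
```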

8.4.1 Additional considerations

Once you reach a decision from the key above, there are a few more options to consider:

  • Is there autocorrelation in your data? If so, you may want to consider including an autocorrelation structure.
  • Is there a hierarchical structure to your data, with the higher-level groups representing variance you are not interested in but still need to account for? If so, you may want to consider adding random effects to your model. This creates a mixed model (see Module 12).
  • Do your data meet most of the assumptions for linear regression, but with heteroscedasticity (non-constant variance in residuals)? You might consider using generalized least squares (GLS) with a different variance structure. R function nlme::gls().
  • Is there severe multicollinearity among your predictor variables? If so, try dropping some of the collinear predictors or try a regularization technique such as ridge or lasso regression. Various R packages.
  • Do you not have a clear idea of the form of the relationship between your continuous response variable and the predictors? Or, do you not want to assume a data model and just let the data speak for themselves? Try a machine learning technique like boosted regression trees. R function gbm::gbm().
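For the heteroscedasticity case, here is a minimal GLS sketch that estimates a separate residual variance per group, assuming the nlme package is installed (the data and names are hypothetical):

```r
library(nlme)

# Hypothetical data: residual variance is larger at urban sites
set.seed(3)
site <- factor(rep(c("urban", "rural"), each = 25))
x    <- runif(50)
y    <- 2 + 3 * x + rnorm(50, sd = ifelse(site == "urban", 4, 1))

# varIdent() allows a different residual variance for each level of site
fit_gls <- gls(y ~ x, weights = varIdent(form = ~ 1 | site))
summary(fit_gls)
```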

8.5 Independence between categories

When all variables are categorical, the question being asked is likely something like, “Do the observed frequencies differ between groups?” Such data are usually summarized in a contingency table (R function ftable(); see Module 4) and analyzed with a χ² test (“chi-squared test”; R function chisq.test()).
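For example, with hypothetical counts of rodents whose feces did or did not contain anthropogenic food at two site types:

```r
# Hypothetical contingency table (counts)
diet <- matrix(c(34, 12,    # urban: anthropogenic food present, absent
                 15, 29),   # rural: anthropogenic food present, absent
               nrow = 2, byrow = TRUE,
               dimnames = list(site = c("urban", "rural"),
                               food = c("present", "absent")))

# Null hypothesis: diet composition is independent of site type
chisq.test(diet)
```

A small p-value here means the observed frequencies are unlikely under independence of site and diet.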

8.6 Classifying observations

Dichotomous key for classification problems:

1. Outcome binary: 2
1′. Outcome not binary (>2 options): 3.

  2. Logistic regression works: use logistic regression. R function glm().
  2′. Logistic regression doesn’t work: use classification trees. R function rpart::rpart().

3. Try multinomial logistic regression. Several R packages, including nnet and mlogit.
3′. Try classification trees. R function rpart::rpart().
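A minimal sketch of key endpoint 2 (logistic regression) on simulated data; the variable names and effect sizes are hypothetical:

```r
# Hypothetical data: does a rodent's diet contain anthropogenic food (1/0)?
set.seed(11)
building_dist <- runif(100, 0, 500)
anthro <- rbinom(100, size = 1, prob = plogis(3 - 0.015 * building_dist))

# Key step 2: logistic regression is a GLM with a binomial family
fit <- glm(anthro ~ building_dist, family = binomial)
summary(fit)

# Predicted probability of an anthropogenic diet 100 m from a building
predict(fit, newdata = data.frame(building_dist = 100), type = "response")
```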

8.7 Conclusions

This guide was designed to cover many of the possible pathways for biological data analysis, but of course some situations aren’t covered. By the time you have deconstructed your problem enough to answer the diagnostic questions above, you should be well on your way to figuring out which analysis to use. Whatever you do, make sure that you document your decision and its justification; this will be very helpful later!