Applied Biological Data Analysis
Statistics and R for Biologists
January 14, 2022
This website accompanies a course I developed in Fall 2021: Applied Biological Data Analysis. The course is currently a “special topics” course (i.e., a seminar), but I hope it will someday evolve into a regular course. The target audience is biology graduate students and advanced undergraduates who need to analyze data for their research projects. If you find the material here helpful, please let me know! Likewise, if you have suggestions for improvement, I’d love to hear from you.
License and permissions
This work and its content are released under the Creative Commons Attribution-ShareAlike 4.0 license. This means that you are allowed to share the material and adapt it for your purposes, under the following conditions:
- Give appropriate credit to the author. This includes citing the original sources of any third-party datasets used in some of the tutorials. I’ve done my best to provide appropriate citations alongside these datasets.
- If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
This site is for educational purposes only. Inclusion of third-party data falls under guidelines of fair use as defined in section 107 of the US Copyright Act of 1976.
This course is a survey of data analysis skills and statistical methods that are essential for modern biology. The course takes a holistic approach to the data analysis workflow in biology using the open-source environment R, including data management, exploratory data analysis, data modeling, and reproducible science practices. Statistical topics covered include generalized linear models, mixed effects models, non-linear models, and ordination. Students are required to apply techniques learned in class to real or simulated biological datasets as a course project.
- Explain the role of statistics in the biological sciences and the ways in which analytical results are communicated.
- Manipulate, summarize, display, and analyze data using the open-source environment and language R.
- Use probability distributions to model and think about biological phenomena. Students should come away from the course able to translate biological hypotheses and ideas into statistical statements and vice versa.
- Conduct exploratory data analyses in support of scientific investigations, particularly to detect and diagnose common problems with biological datasets.
- Employ modern statistical methods to answer biological questions.
- Communicate data and analytical results to audiences who may or may not have statistical backgrounds.
- Access to a computer or laptop capable of running R version \(\ge\) 4.0.
- Having a recent version of RStudio is helpful but not essential.
- A 64-bit system is highly recommended for speed and stability.
- A separate program in which to write and edit code.
- Most people prefer to use an “integrated development environment” (IDE) or dedicated code editor for coding.
- IDEs and code editors are often subject to irrationally strong personal and organizational preferences, so just pick the one that works for you. Or, use the one your supervisor tells you to use.
- My usual workflow includes two windows: the base R GUI, and Notepad++.
- It would be helpful to have a dataset of your own to work with. Each course module uses numerous examples with simulated or published datasets. But, at the end of the day, you’re probably here because you need to apply these methods to your own work.
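If you are not sure whether your setup meets the software requirements listed above, a quick check from the R console might look like the following. This is only a minimal sketch: the variable names `r_ok` and `is_64bit` are made up for illustration, and the 64-bit test assumes that a pointer size of 8 bytes indicates a 64-bit build of R.

```r
# Check that the installed R meets the course requirement (version >= 4.0).
r_ok <- getRversion() >= "4.0.0"

# Assumption: a pointer size of 8 bytes indicates a 64-bit build of R.
is_64bit <- .Machine$sizeof.pointer == 8

# Report the results in the console.
cat("R version:", as.character(getRversion()), "\n")
cat("Meets the >= 4.0 requirement:", r_ok, "\n")
cat("64-bit build:", is_64bit, "\n")
```

If either check returns `FALSE`, updating R before starting the modules is probably the simplest fix.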
The primary textbook for the course is Bolker (2008). This is a general statistics textbook for biologists. The applications and examples are focused on ecology, but the statistical exposition is relevant to any area of biology.
Some other general statistics and R books you may find helpful include Dalgaard (2002) and Zuur et al. (2009b). A more advanced statistics book that also introduces machine learning is James et al. (2013), which is also available for free online. Parts of the course also lean heavily on Zuur et al. (2007), McCune et al. (2002), Legendre and Legendre (2012), and Kéry (2010).
If you want a PDF version of this website, you can download it here.
This course is organized into modules that each focus on one of the objectives of the course. The links below will take you straight to the start of each module.
Module 1 is an introduction to the use of statistics in biology, and explores some fundamental issues related to how biologists make inferences based on data.
Module 2 is an introduction to the R program and programming language.
Module 3 takes a deep dive into manipulating data in R. This includes data import and export, working with special data types such as dates and times, and data frame management.
Module 4 is a survey of common techniques for exploratory data analysis. This is a vital step in the data analysis workflow because it can reveal unexpected patterns and problems in the data.
Module 5 introduces generalized linear models (GLM), an extremely general and powerful framework for analyzing biological data. This module presents worked examples of some of the most common GLMs in biology.
Module 6 explores nonlinear models for relationships that don’t follow a straight line. These models include biological growth curves and dose response curves.
Module 7 introduces mixed models, which extend GLM and other models to include random effects. Mixed models allow biologists to account for variability from unknown sources.
Module 8 explores methods for multivariate data analysis, including distance measures and ordination.
Module 9 is a guide to selecting statistics for your research. This module contains dichotomous keys to help you decide which statistical methods are most appropriate for the question you are trying to answer.