Applied Biological Data Analysis: Statistics and R for Biologists
Fall 2024 Course Notes
August 20, 2024
Preface
This website accompanies a course I have developed at Kennesaw State University: Applied Biological Data Analysis. The course is currently a “special topics” course (BIOL 6490), but next year (Fall 2025) it will become a regular course as BIOL 7410. The target audience is biologists of any level, but mainly graduate students and advanced undergraduates, who need to analyze data for their research projects. If you find the material here helpful, please let me know! Likewise, if you have suggestions for improvement, I’d love to hear from you: send all comments to ngreen62@kennesaw.edu!
The Fall 2024 course syllabus can be downloaded here.
Work in progress warning!
I am actively updating this site throughout the Fall 2024 semester. This means that many cross-references may be broken, pictures missing, etc. Please be patient with me as I get things back into working order. I decided to push updates as I make them, rather than update the site all at once, so that (1) my current students have the site as a resource; and (2) people who have bookmarked the site previously don’t lose their place.
License and permissions
This work and its content are released under the Creative Commons Attribution-ShareAlike 4.0 license. This means that you are allowed to share the material and adapt it for your purposes, under the following conditions:
- Give appropriate credit to the author. This includes citing the original sources of any third-party datasets used in some of the tutorials. I’ve done my best to provide appropriate citations alongside these datasets.
- If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
This site is for educational purposes only. Inclusion of third-party data for educational purposes falls under guidelines of fair use as defined in section 107 of the US Copyright Act of 1976.
Course description
This course is a survey of data analysis skills and statistical methods that are essential for modern biology. The course takes a holistic approach to the data analysis workflow in biology using the open-source environment R, including data management, exploratory data analysis, data modeling, and reproducible science practices. Statistical topics covered include generalized linear models, mixed effects models, non-linear models, and ordination. Students are required to apply techniques learned in class to real or simulated biological datasets as a course project.
Course objectives
- Explain the role of statistics in the biological sciences and the ways in which analytical results are communicated.
- Manipulate, summarize, display, and analyze data using the open-source environment and language R.
- Use probability distributions to model and think about biological phenomena. Students should come away from the course able to translate biological hypotheses and ideas into statistical statements and vice versa.
- Conduct exploratory data analyses in support of scientific investigations, particularly to detect and diagnose common problems with biological datasets.
- Employ modern statistical methods to answer biological questions.
- Communicate data and analytical results to audiences who may or may not have statistical backgrounds.
Course requirements
- Access to a computer or laptop capable of running R version 4.0 or above. This site was rendered using R version 4.2.1 and the most recent version of any package mentioned (as of August 16, 2022).
  - Having a recent version of RStudio is helpful but not essential.
  - A 64-bit system is highly recommended for speed and stability.
- A separate program in which to write and edit code.
  - Most people prefer to use an “integrated development environment” (IDE) or dedicated code editor for coding. I recommend some in
  - IDEs and code editors are often subject to irrationally strong personal and organizational preferences, so just pick the one that works for you. Or, use the one your supervisor tells you to use.
  - My usual workflow includes two windows: the base R GUI, and Notepad++.
- It would be helpful to have a dataset of your own to work with. Each course module uses numerous examples with simulated or published datasets. But, at the end of the day, you’re probably here because you need to apply these methods to your own work.
Recommended reading
This course has no required textbook. If you want a textbook, then I recommend one of the following:
Ecological models and data in R by Ben Bolker (Bolker 2008). This is a general statistics textbook for biologists. The applications and examples are focused on ecology, but the statistical exposition is relevant to any area of biology.
Introductory statistics with R by Peter Dalgaard (Dalgaard 2002). This is a general statistics textbook and introduction to R programming.
A beginner’s guide to R by Alain Zuur and others (Zuur et al. 2009b). This book introduces R in a different way than Dalgaard (2002).
An introduction to statistical learning: with applications in R by Gareth James and others (James et al. 2013). This is a more advanced statistics book that also introduces machine learning. The second edition is available for free online.
Parts of the course also lean heavily on some more specialized textbooks:
Analysing ecological data (Zuur et al. 2007) and Mixed effects models and extensions in ecology with R (Zuur et al. 2009a) by Alain Zuur and others. These are advanced textbooks aimed at ecologists which introduce GLMs, mixed models, and other topics we will explore in this course.
Analysis of ecological communities by Bruce McCune and James Grace (McCune et al. 2002) and Numerical ecology by Pierre and Louis Legendre (Legendre and Legendre 2012) are introductions to ordination and other multivariate methods. McCune et al. (2002) is a gentler introduction but depends on proprietary software. Legendre and Legendre (2012) is a little more technical (but still relatively easy to understand) and uses R throughout.
An introduction to WinBUGS for ecologists by Marc Kéry (Kéry 2010). This book introduces Bayesian inference using the free program WinBUGS, which can be accessed from R.[1]
If you want a PDF version of this website, please be patient. For some reason (as of August 20, 2024) I can’t get RStudio to recognize the various \(\LaTeX\) packages on my machine, so it won’t render a PDF. I’ll put a PDF version here once I can get it to exist.
Course organization
This course is organized into modules that each focus on one of the objectives of the course. The links below will take you straight to the start of each module.
Module 1 is an introduction to the use of statistics in biology, and explores some fundamental issues related to how biologists make inferences based on data.
Module 2 is an introduction to the R program and programming language.
Module 3 takes a deep dive into manipulating data in R. This includes data import and export, working with special data types such as dates and times, and data frame management.
Modules 4, 5, and 6 are a survey of exploratory data analysis techniques for biologists. Module 4 explores ways to describe, summarize, and display single variables. Module 5 dives into probability distributions and how to find the right distribution for your data. Module 6 presents methods for exploring relations between variables and introduces statistical graphics.
Module 7 covers common statistical problems in biology. These are problems, in addition to those discussed in Module 1, that usually become apparent only after data are collected.
Module 8 is a guide to selecting statistics for your research. This module contains dichotomous keys to help you decide which statistical methods are most appropriate for the question you are trying to answer.
Module 9 is a review of introductory statistics, such as you might encounter in an undergraduate biostatistics course. This module covers the basics of null hypothesis significance testing, and three classes of tests: tests comparing groups (t-tests, ANOVA, Wilcoxon tests), tests for continuous patterns (regression, ANCOVA, correlations), and tests for contingency (Fisher’s exact test, \(\chi^2\) tests, and G tests). This module also does a deep dive into linear models and their fundamental importance for understanding more advanced methods.
Module 10 introduces generalized linear models (GLMs), an extremely general and powerful framework for analyzing biological data. This module presents worked examples of some of the most common GLMs in biology.
Module 11 explores nonlinear models for relationships that don’t follow a straight line. These models include biological growth curves and dose response curves.
Module 12 introduces mixed models, which extend GLMs and other models to include random effects. Mixed models allow biologists to account for variability from unknown sources.
Module 13 explores methods for multivariate data analysis, including distance measures and ordination. Major update coming September 2024!
Literature Cited
[1] WinBUGS has been superseded by a new version, OpenBUGS, which is fully open source. WinBUGS code should work just fine in OpenBUGS or JAGS, another Bayesian inference engine accessible from R.