Preface

This website accompanies a course I have developed at Kennesaw State University: Applied Biological Data Analysis. The course is currently a “special topics” course (BIOL 6490), but will next year (Fall 2025) become a regular course as BIOL 7410. The target audience is biologists of any level–but mainly graduate students and advanced undergraduates–who need to analyze data for their research projects. If you find the material here helpful, please let me know! Likewise, if you have suggestions for improvement, I’d love to hear from you: send all comments to ngreen62@kennesaw.edu!

The Fall 2024 course syllabus can be downloaded here.

Work in progress warning!

I am actively updating this site throughout the Fall 2024 semester. This means that many cross-references may be broken, pictures missing, etc. Please be patient with me as I get things back into working order. I decided to do things this way–push updates as I make them–rather than update the site all at once so that (1) my current students have the site as a resource; and (2) people who have bookmarked the site previously don’t lose their place.

License and permissions

This work and its content are released under the Creative Commons Attribution-ShareAlike 4.0 license. This means that you are allowed to share the material and adapt it for your purposes, under the following conditions:

  1. Give appropriate credit to the author. This includes citing the original sources of any third-party datasets used in some of the tutorials. I’ve done my best to provide appropriate citations alongside these datasets.
  2. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

This site is for educational purposes only. Inclusion of third-party data for educational purposes falls under guidelines of fair use as defined in section 107 of the US Copyright Act of 1976.

Course description

This course is a survey of data analysis skills and statistical methods that are essential for modern biology. The course takes a holistic approach to the data analysis workflow in biology using the open-source environment R, including data management, exploratory data analysis, data modeling, and reproducible science practices. Statistical topics covered include generalized linear models, mixed effects models, non-linear models, and ordination. Students are required to apply techniques learned in class to real or simulated biological datasets as a course project.

Course objectives

  1. Explain the role of statistics in the biological sciences and the ways in which analytical results are communicated.
  2. Manipulate, summarize, display, and analyze data using the open-source environment and language R.
  3. Use probability distributions to model and think about biological phenomena. Students should come away from the course able to translate biological hypotheses and ideas into statistical statements and vice versa.
  4. Conduct exploratory data analyses in support of scientific investigations, particularly to detect and diagnose common problems with biological datasets.
  5. Employ modern statistical methods to answer biological questions.
  6. Communicate data and analytical results to audiences who may or may not have statistical backgrounds.

Course requirements

  • Access to a computer or laptop capable of running R version 4.0 or above. This site was rendered using R version 4.2.1 and the most recent version of any package mentioned (as of August 16, 2022).
  • Having a recent version of RStudio is helpful but not essential.
  • A 64-bit system is highly recommended for speed and stability.
  • A separate program in which to write and edit code.
    • Most people prefer to use an “integrated development environment” (IDE) or dedicated code editor for coding. I recommend some in
    • IDEs and code editors are often subject to irrationally strong personal and organizational preferences, so just pick the one that works for you. Or, use the one your supervisor tells you to use.
    • My usual workflow includes two windows: the base R GUI, and Notepad++.
  • It would be helpful to have a dataset of your own to work with. Each course module uses numerous examples with simulated or published datasets. But, at the end of the day, you’re probably here because you need to apply these methods to your own work.
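
A quick way to check the R version and 64-bit requirements above from within R is sketched below. This is a minimal sketch rather than part of the course materials: the object names (min_version, r_ok, is_64bit) are made up for illustration, and the 64-bit check assumes that a pointer size of 8 bytes indicates a 64-bit build of R.

```r
# Minimum R version stated in the course requirements
min_version <- "4.0.0"

# getRversion() returns the running R version in a form that can be
# compared directly against a version string
r_ok <- getRversion() >= min_version

# Assumption: pointers are 8 bytes on a 64-bit build of R and 4 bytes
# on a 32-bit build
is_64bit <- .Machine$sizeof.pointer == 8

cat("R version:", as.character(getRversion()), "\n")
cat("Meets the 4.0 minimum:", r_ok, "\n")
cat("64-bit build:", is_64bit, "\n")
```

If either check comes back FALSE, updating R before starting Module 2 will probably save some trouble later.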

Course organization

This course is organized into modules that each focus on one of the objectives of the course. The links below will take you straight to the start of each module.

Module 1 is an introduction to the use of statistics in biology, and explores some fundamental issues related to how biologists make inferences based on data.

Module 2 is an introduction to the R program and programming language.

Module 3 takes a deep dive into manipulating data in R. This includes data import and export, working with special data types such as dates and times, and data frame management.

Modules 4, 5, and 6 are a survey of exploratory data analysis techniques for biologists. Module 4 explores ways to describe, summarize, and display single variables. Module 5 dives into probability distributions and how to find the right distribution for your data. Module 6 presents methods for exploring relations between variables and introduces statistical graphics.

Module 7 covers common statistical problems in biology. These are problems, in addition to those discussed in Module 1, that usually become apparent only after data are collected.

Module 8 is a guide to selecting statistics for your research. This module contains dichotomous keys to help you decide what statistical methods are most appropriate for the question you are trying to analyze.

Module 9 is a review of introductory statistics, such as you might encounter in an undergraduate biostatistics course. This module covers the basics of null hypothesis significance testing, and three classes of tests: tests comparing groups (t-tests, ANOVA, Wilcoxon tests), tests for continuous patterns (regression, ANCOVA, correlations), and tests for contingency (Fisher’s exact test, \(\chi^2\) tests, and G tests). This module also does a deep dive into linear models and their fundamental importance for understanding more advanced methods.

Module 10 introduces generalized linear models (GLM), an extremely general and powerful framework for analyzing biological data. This module presents worked examples of some of the most common GLMs in biology.

Module 11 explores nonlinear models for relationships that don’t follow a straight line. These models include biological growth curves and dose response curves.

Module 12 introduces mixed models, which extend GLM and other models to include random effects. Mixed models allow biologists to account for variability from unknown sources.

Module 13 explores methods for multivariate data analysis, including distance measures and ordination. Major update coming September 2024!

About the author

I’m a quantitative ecologist at Kennesaw State University who studies human impacts on animal populations. Since earning my PhD at Baylor University in 2012, I’ve been fortunate to have worked in a variety of government and industry roles. Those experiences have shaped my perspective on biology and education. In my courses and advising I do my best to impress upon students the variety of biological experiences and perspectives that different domains bring to the table…and the value that they all bring. Most of my current research focuses on understanding human impacts on animal communities and how animal populations adapt to rapidly changing environments.

Why am I making this guide? I have been using R since about 2008, when I was a graduate student trying to make a multi-panel plot for a manuscript. I had seen R mentioned in publications and thought that using R would be more effective than kludging something together with Excel and PowerPoint. Needless to say, jumping straight to making plots with the lattice package was not the easiest way to learn R. The final result was a little clunky (Kirchner et al. 2011) but I was really proud of it. Eventually I took some courses and did a lot of self-teaching, and am finally pretty decent at R. I have used R in my work pretty much continuously since 2008 for everything from Bayesian data analysis (e.g., Green et al. 2020) to simulation modeling (e.g., Green et al. 2019).

This guide is meant to be a primer for self-starting with R. In many sections the text is a bit lengthy, but that is because I wanted to provide explanations that might save someone some of the trouble I had getting started. R documentation is notoriously terse and compact–this is great for professionals who need a reminder, but not so much for newcomers who are still trying to learn. This guide also evolved out of course notes, and will continue to evolve based on feedback from students. If you have any feedback or suggestions for improvement, please send me a message!

Literature Cited

Bolker, B. M. 2008. Ecological models and data in R. Princeton University Press, Princeton, New Jersey, USA.
Dalgaard, P. 2002. Introductory statistics with R. Springer Science+Business Media.
Green, N. S., M. L. Wildhaber, and J. L. Albers. 2019. Potential responses of the Lower Missouri River shovelnose sturgeon (Scaphirhynchus platorynchus) population to a commercial fishing ban. Journal of Applied Ichthyology 35:370–377.
Green, N. S., M. L. Wildhaber, J. L. Albers, T. W. Pettit, and M. J. Hooper. 2020. Efficient mammal biodiversity surveys for ecological restoration monitoring. Integrated Environmental Assessment and Management.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An introduction to statistical learning. 7th edition. Springer.
Kéry, M. 2010. Introduction to WinBUGS for ecologists: Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.
Kirchner, B. N., N. S. Green, D. A. Sergeant, J. N. Mink, and K. T. Wilkins. 2011. Responses of small mammals and vegetation to a prescribed burn in a tallgrass blackland prairie. The American Midland Naturalist 166:112–125.
Legendre, P., and L. Legendre. 2012. Numerical ecology. Elsevier, Amsterdam.
McCune, B., J. B. Grace, and D. L. Urban. 2002. Analysis of ecological communities. MjM Software Design, Gleneden Beach, Oregon, USA.
Zuur, A. F., E. N. Ieno, and G. M. Smith. 2007. Analysing ecological data. Springer Science+Business Media, New York.
Zuur, A. F., E. N. Ieno, N. J. Walker, A. A. Saveliev, G. M. Smith, and others. 2009a. Mixed effects models and extensions in ecology with R. Springer, New York.
Zuur, A., E. N. Ieno, and E. Meesters. 2009b. A beginner’s guide to R. Springer Science+Business Media, New York.

  1. WinBUGS has been superseded by a new version, OpenBUGS, which is fully open source. WinBUGS code should work just fine in OpenBUGS or JAGS, another Bayesian inference engine accessible from R.