Preface

This website accompanies a course I developed in Fall 2021: Applied Biological Data Analysis. The course is currently offered as a “special topics” course (i.e., a seminar), but will hopefully evolve into a regular course someday. The target audience is biology graduate students and advanced undergraduates who need to analyze data for their research projects. If you find the material here helpful, please let me know! Likewise, if you have suggestions for improvement, I’d love to hear from you.

License and permissions

This work and its content are released under the Creative Commons Attribution-ShareAlike 4.0 license. This means that you are allowed to share the material and adapt it for your purposes, under the following conditions:

  1. Give appropriate credit to the author. This includes citing the original sources of any third-party datasets used in some of the tutorials. I’ve done my best to provide appropriate citations alongside these datasets.
  2. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

This site is for educational purposes only. Inclusion of third-party data falls under guidelines of fair use as defined in section 107 of the US Copyright Act of 1976.

Course description

This course is a survey of data analysis skills and statistical methods that are essential for modern biology. The course takes a holistic approach to the data analysis workflow in biology using the open-source environment R, including data management, exploratory data analysis, data modeling, and reproducible science practices. Statistical topics covered include generalized linear models, mixed effects models, non-linear models, and ordination. Students are required to apply techniques learned in class to real or simulated biological datasets as a course project.

Course objectives

  1. Explain the role of statistics in the biological sciences and the ways in which analytical results are communicated.
  2. Manipulate, summarize, display, and analyze data using the open-source environment and language R.
  3. Use probability distributions to model and think about biological phenomena. Students should come away from the course able to translate biological hypotheses and ideas into statistical statements and vice versa.
  4. Conduct exploratory data analyses in support of scientific investigations, particularly to detect and diagnose common problems with biological datasets.
  5. Employ modern statistical methods to answer biological questions.
  6. Communicate data and analytical results to audiences who may or may not have statistical backgrounds.

Course requirements

  • Access to a computer or laptop capable of running R version 4.0 or later.
  • A recent version of RStudio is helpful but not essential.
  • A 64-bit system is highly recommended for speed and stability.
  • A separate program in which to write and edit code.
    • Most people prefer to use an “integrated development environment” (IDE) or dedicated code editor for coding.
    • IDEs and code editors are often subject to irrationally strong personal and organizational preferences, so just pick the one that works for you. Or, use the one your supervisor tells you to use.
    • My usual workflow includes two windows: the base R GUI, and Notepad++.
  • It would be helpful to have a dataset of your own to work with. Each course module uses numerous examples with simulated or published datasets, but at the end of the day, you’re probably here because you need to apply these methods to your own work.

Course organization

This course is organized into modules that each focus on one of the objectives of the course. The links below will take you straight to the start of each module.

Module 1 is an introduction to the use of statistics in biology, and explores some fundamental issues related to how biologists make inferences based on data.

Module 2 is an introduction to the R program and programming language.

Module 3 takes a deep dive into manipulating data in R. This includes data import and export, working with special data types such as dates and times, and data frame management.

Module 4 is a survey of common techniques for exploratory data analysis. This is a vital step in the data analysis workflow because it can reveal unexpected patterns and problems in the data.

Module 5 introduces generalized linear models (GLMs), an extremely general and powerful framework for analyzing biological data. This module presents worked examples of some of the most common GLMs in biology.

Module 6 explores nonlinear models for relationships that don’t follow a straight line. These models include biological growth curves and dose response curves.

Module 7 introduces mixed models, which extend GLMs and other models to include random effects. Mixed models allow biologists to account for variability from unknown sources.

Module 8 explores methods for multivariate data analysis, including distance measures and ordination.

Module 9 is a guide to selecting statistics for your research. This module contains dichotomous keys to help you decide which statistical methods are most appropriate for the question you are trying to answer.

About the author

I’m a quantitative ecologist at Kennesaw State University who studies human impacts on animal populations. Since earning my PhD at Baylor University in 2012, I’ve had the good fortune to work in a variety of government and industry roles. Those experiences have shaped my perspective on biology and education. In my courses and advising I do my best to impress upon students the variety of biological experiences and perspectives that different domains bring to the table, and the value that they all bring. Most of my current research focuses on understanding human impacts on animal communities and how animal populations adapt to rapidly changing environments.

Why am I making this R guide? I have been using R since about 2008, when I was a graduate student trying to make a multi-panel plot for a manuscript. I had seen R mentioned in publications and thought that using R would be more effective than kludging something together with Excel and PowerPoint. Needless to say, jumping straight to making plots with the lattice package was not the easiest way to learn a new programming language. The final result was a little clunky (Kirchner et al. 2011), but I was really proud of it. Eventually I took some courses and did a lot of self-teaching, and am finally (14 years later) mostly decent at R. I have used R in my work pretty much continuously since 2008 for everything from Bayesian data analysis (e.g., Green et al. 2020) to simulation modeling (e.g., Green et al. 2019). This guide is meant to be a primer for getting started with R on your own. Many of the explanations are included in the hope of saving someone some of the trouble I had getting started.

Literature Cited

Bolker, B. M. 2008. Ecological models and data in R. Princeton University Press, Princeton, New Jersey, USA.
Dalgaard, P. 2002. Introductory statistics with R. Springer Science+Business Media.
Green, N. S., M. L. Wildhaber, and J. L. Albers. 2019. Potential responses of the Lower Missouri River shovelnose sturgeon (Scaphirhynchus platorynchus) population to a commercial fishing ban. Journal of Applied Ichthyology 35:370–377.
Green, N. S., M. L. Wildhaber, J. L. Albers, T. W. Pettit, and M. J. Hooper. 2020. Efficient mammal biodiversity surveys for ecological restoration monitoring. Integrated Environmental Assessment and Management.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An introduction to statistical learning. 7th edition. Springer.
Kéry, M. 2010. Introduction to WinBUGS for ecologists: Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.
Kirchner, B. N., N. S. Green, D. A. Sergeant, J. N. Mink, and K. T. Wilkins. 2011. Responses of small mammals and vegetation to a prescribed burn in a tallgrass blackland prairie. The American Midland Naturalist 166:112–125.
Legendre, P., and L. Legendre. 2012. Numerical ecology. Elsevier, Amsterdam.
McCune, B., J. B. Grace, and D. L. Urban. 2002. Analysis of ecological communities. MjM Software Design, Gleneden Beach, Oregon, USA.
Zuur, A. F., E. N. Ieno, and G. M. Smith. 2007. Analysing ecological data. Springer Science+Business Media, New York.
Zuur, A., E. N. Ieno, and E. Meesters. 2009b. A beginner’s guide to R. Springer Science+Business Media, New York.