This post has a few goals. First, you’ll see how to simulate a nested data set using the assumptions of a linear mixed-effects model. Then, you’ll learn about R packages that can help you summarize and visualize hierarchical data in fast, reproducible, and easily generalizable ways. Finally, you can change the simulation parameters to see the emergent effects of variability at different hierarchical scales.
Data Generation
Working with hierarchical data can be a pain, but there is a suite of R packages in the tidy family that are unbelievably helpful. What I mean is that these packages will change your scientific life forever. For real.
Hyperbole aside, let’s start by generating some realistic, nested data. We’ll be testing Bergmann’s rule that body size increases with elevation (as a proxy for temperature) by sampling along different mountains. We’ll sample multiple individuals of many species, representing many genera, sampled across multiple mountains. Lots of nestedness here.
There’s a lot of non-independence going on with this type of nested sampling. You’d expect body size to vary more among genera than among species within a given genus, and there is probably variation among mountains. This all affects the intercept of body size (i.e. the average body size within groups). There might be similar non-independence in the relationship between body size and elevation (i.e. the slope). We’ll code this type of non-independence as random effects on the intercept and slope, below.
This blog is focused on managing and visualizing hierarchical data. In a later post, I’ll use a very similar generative example to illustrate proper statistical models to handle all of this non-independence.
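Here is a minimal sketch of how the nested sampling and random effects described above could be simulated. Every number in it (group counts, means, standard deviations, the elevation range) is an illustrative assumption, and so are the column names `mountain`, `genus`, `species`, `elevation`, and `weight`. Genera get the largest random-slope standard deviation, in line with the pattern discussed later in the post.

```r
library(dplyr)
library(tidyr)

set.seed(42)

# All design sizes and parameter values are illustrative assumptions.
n_mtn <- 5; n_gen <- 8; n_sp <- 3; n_ind <- 10
b0 <- 50        # overall mean weight at average elevation (intercept)
b1 <- 5         # overall change in weight per 1 SD of elevation (slope)
sd_int   <- c(mountain = 2,   genus = 10, species = 3)  # random-intercept SDs
sd_slope <- c(mountain = 0.5, genus = 6,  species = 1)  # random-slope SDs
sd_resid <- 2   # individual-level noise

# One row per individual: species nested in genera, sampled on every mountain
dat <- expand_grid(
  mountain = paste0("mtn", 1:n_mtn),
  genus    = paste0("gen", 1:n_gen),
  sp       = 1:n_sp,
  ind      = 1:n_ind
) %>%
  mutate(species   = paste0(genus, "_sp", sp),
         elevation = runif(n(), 0, 3000),          # arbitrary units
         elev_z    = as.numeric(scale(elevation))) # centered and scaled

# Draw one normal deviate per group level, named by level for lookup
draw <- function(ids, sigma) setNames(rnorm(length(ids), 0, sigma), ids)
mtn <- unique(dat$mountain); gen <- unique(dat$genus); spp <- unique(dat$species)

int_re   <- list(m = draw(mtn, sd_int["mountain"]),
                 g = draw(gen, sd_int["genus"]),
                 s = draw(spp, sd_int["species"]))
slope_re <- list(m = draw(mtn, sd_slope["mountain"]),
                 g = draw(gen, sd_slope["genus"]),
                 s = draw(spp, sd_slope["species"]))

# Each individual's expected weight uses its own group-specific intercept and slope
dat <- dat %>%
  mutate(intercept = b0 + unname(int_re$m[mountain] + int_re$g[genus] + int_re$s[species]),
         slope     = b1 + unname(slope_re$m[mountain] + slope_re$g[genus] + slope_re$s[species]),
         weight    = rnorm(n(), intercept + slope * elev_z, sd_resid))
```

Changing the entries of `sd_int` and `sd_slope` is the quickest way to shift variation from one hierarchical level to another.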
Very crudely, let’s look at the overall pattern in the data.
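A crude first look might be a single scatterplot with one pooled trend line. The sketch below uses a small stand-in data frame so that it runs on its own; the column names `elevation` and `weight` are assumptions, and you would swap in your simulated data.

```r
library(ggplot2)

set.seed(1)
# Stand-in data; replace with the simulated data frame
dat <- data.frame(elevation = runif(200, 0, 3000))
dat$weight <- rnorm(200, mean = 50 + 0.002 * dat$elevation, sd = 10)

# All individuals pooled together, ignoring mountain, genus, and species
p <- ggplot(dat, aes(x = elevation, y = weight)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Elevation", y = "Body weight")
p
```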
Well, that’s a lot of variation, which makes it hard to see a clear pattern. Does the inherent hierarchy in the data obscure the body size - elevation relationship? What are the patterns among mountains? Among genera? Among species?
Data Management and Synthesis
I’m going to focus on the tidy family of packages that can help us dissect and summarize the raw data in simple and reproducible ways, and we’ll use the ggplot2 package for visualization.
Brad Boehmke does a great job of highlighting the functions of the dplyr and tidyr packages in his R publication, and you should definitely read it. I’ll just highlight a few useful functions relevant to this dataset.
Here’s a simple question: What are the average body weights and their standard deviations on each mountain? We could write a for-loop that partitions the data frame and then calculates and stores these summary statistics, but that is pretty inefficient and prone to errors. Instead, use dplyr and the summarize() function.
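As a sketch of that approach, `group_by()` splits the data by mountain and `summarize()` collapses each group to one row. The tiny data frame here is a toy stand-in with assumed column names, chosen so the answer is easy to check by eye.

```r
library(dplyr)

# Toy stand-in for the simulated data; column names are assumptions
dat <- tibble(
  mountain = rep(c("A", "B"), each = 3),
  weight   = c(10, 20, 30, 40, 50, 60)
)

# One row per mountain: mean, standard deviation, and sample size
mtn_summary <- dat %>%
  group_by(mountain) %>%
  summarize(mean_weight = mean(weight),
            sd_weight   = sd(weight),
            n           = n())
mtn_summary
```

No loop, no manual bookkeeping, and the same three lines work no matter how many mountains are in the data.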
What about the mean weights of genera on different mountains?
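One way to sketch this is to group by two variables at once. The data frame is again a toy stand-in with assumed column names; the optional `pivot_wider()` step from tidyr just reshapes the result into a mountain-by-genus table.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the simulated data; column names are assumptions
dat <- tibble(
  mountain = rep(c("A", "B"), each = 4),
  genus    = rep(c("g1", "g1", "g2", "g2"), times = 2),
  weight   = c(10, 20, 30, 50, 5, 15, 40, 60)
)

# One row per mountain-genus combination
genus_by_mtn <- dat %>%
  group_by(mountain, genus) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

# Optional: one row per mountain, one column per genus
genus_wide <- genus_by_mtn %>%
  pivot_wider(names_from = genus, values_from = mean_weight)
genus_wide
```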
Perhaps simpler, we want to know how many species were found on each mountain.
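A short sketch using `n_distinct()` from dplyr, which counts unique values within each group. The data frame is a toy stand-in with assumed column names.

```r
library(dplyr)

# Toy stand-in: several individuals per species, two mountains
dat <- tibble(
  mountain = c("A", "A", "A", "B", "B"),
  species  = c("sp1", "sp1", "sp2", "sp3", "sp3")
)

# Count unique species, not rows, on each mountain
species_per_mtn <- dat %>%
  group_by(mountain) %>%
  summarize(n_species = n_distinct(species))
species_per_mtn
```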
Data Visualization
We can even pipe these data frames right into ggplot2! Let’s take a look at the variability in average weights across the different mountains. Here we’ll average to the species level, because we have multiple individuals sampled from each species, and we’ll summarize these species-level averages using boxplots.
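Here is one way that pipeline might look: average individuals to the species level within each mountain, then hand the result straight to `ggplot()`. The stand-in data and its column names are assumptions so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(2)
# Stand-in data: 3 mountains x 6 species x 5 individuals
dat <- expand_grid(mountain = c("A", "B", "C"),
                   species  = paste0("sp", 1:6),
                   ind      = 1:5) %>%
  mutate(weight = rnorm(n(), 50, 10))

# Summarize to species means, then pipe directly into the plot
p <- dat %>%
  group_by(mountain, species) %>%
  summarize(mean_weight = mean(weight), .groups = "drop") %>%
  ggplot(aes(x = mountain, y = mean_weight)) +
  geom_boxplot() +
  labs(x = "Mountain", y = "Species mean weight")
p
```

Note the switch from `%>%` to `+` once the data reach `ggplot()`.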
Now, more relevant to the question at hand, what is the relationship between body size and elevation? First, is it consistent across mountains? Let’s look at this visually. We’ll use the facet_wrap() function from the ggplot2 package.
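A sketch of the faceted plot, with one panel and one fitted line per mountain. The stand-in data and column names are assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(3)
# Stand-in data: 30 individuals on each of 3 mountains
dat <- expand_grid(mountain = c("A", "B", "C"), ind = 1:30) %>%
  mutate(elevation = runif(n(), 0, 3000),
         weight    = rnorm(n(), 50, 10))

# facet_wrap() draws the same plot separately for each mountain
p <- ggplot(dat, aes(x = elevation, y = weight)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ mountain) +
  labs(x = "Elevation", y = "Body weight")
p
```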
It would appear that Bergmann’s rule does not apply to these mountains. What’s going on?
Let’s use facet_wrap() to look at each genus separately, and let’s highlight the species with different colors.
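The sketch below facets by genus and maps species to color, so each species gets its own fitted line within its genus panel. The four genera and their slopes are hypothetical stand-ins, picked to differ in sign and size.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(4)
# Hypothetical genus-level slopes (weight change per 1000 elevation units)
genus_slope <- c(gen1 = -8, gen2 = -2, gen3 = 4, gen4 = 10)

# Stand-in data: 4 genera x 3 species x 20 individuals
dat <- expand_grid(genus = names(genus_slope), sp = 1:3, ind = 1:20) %>%
  mutate(species   = paste0(genus, "_sp", sp),
         elevation = runif(n(), 0, 3000),
         weight    = 50 + unname(genus_slope[genus]) * (elevation - 1500) / 1000 +
                     rnorm(n(), 0, 3))

# One panel per genus, one color and trend line per species
p <- ggplot(dat, aes(x = elevation, y = weight, color = species)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ genus) +
  theme(legend.position = "none") +   # 12 species would make a crowded legend
  labs(x = "Elevation", y = "Body weight")
p
```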
Now it becomes clear why Bergmann’s rule was being obscured when we looked across the whole data set and when we looked at mountains individually. There is a lot of variation among genera in the body size - elevation relationship. This makes sense, because we coded a large random variation in slope among genera.
Go back and change the random variation in slopes and intercepts at the different hierarchical levels. How does this affect the overall pattern among mountains? This type of simulation can help you understand how much data you might need to collect in order to find statistically meaningful patterns.
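To get a feel for this without re-running everything by hand, here is a rough sketch that repeats a stripped-down simulation many times and records the pooled slope from a plain `lm()` fit. This is not a proper mixed-effects analysis, just a crude gauge of how noisy the overall pattern becomes. The function `pooled_slope()` and all of its default values are hypothetical.

```r
set.seed(7)

# Simulate genera with random slopes, then fit one pooled regression
pooled_slope <- function(sd_slope_genus, n_gen = 8, n_ind = 30, b1 = 5) {
  slopes <- b1 + rnorm(n_gen, 0, sd_slope_genus)   # one slope per genus
  genus  <- rep(seq_len(n_gen), each = n_ind)
  elev_z <- rnorm(n_gen * n_ind)                   # scaled elevation
  weight <- 50 + slopes[genus] * elev_z + rnorm(n_gen * n_ind, 0, 2)
  unname(coef(lm(weight ~ elev_z))[2])
}

# Spread of the pooled slope across 200 simulated data sets,
# for small, medium, and large among-genus variation in slope
sd_values <- c(0, 2, 6)
spread <- sapply(sd_values, function(s) sd(replicate(200, pooled_slope(s))))
data.frame(sd_slope_genus = sd_values, sd_of_pooled_slope = spread)
```

The larger the among-genus variation in slope, the more the pooled estimate bounces around from one simulated data set to the next, which is one way to think about how much sampling you might need.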