Wrapping my head around circular means

13 minute read

For my first long-form blogpost, I thought I’d share a recent statistical problem I’ve had as part of my Masters work.

One of my chapters is about comparing patterns of growth and flowering phenology (the timing of these sorts of events during the year) across a few dozen species of sedge (Cyperaceae)—gramminoid plants common here in the fynbos of the Cape Floristic Region.

For one analysis, I am comparing flowering data between six clades. The biological details are not the focus of this post. I am more interested (for the moment) in sharing the analytical approach I’ve learnt.

Normally in ecology and related disciplines, when we wish to compare two or more distributions of values, we would employ a (t)-test or ANOVA. This is all very well if the values we wish to compare fall on a number line.

This is not the case with phenological or seasonal data.

A small introduction to circular data

A variable describing the time of year, time of day or season of an event is an example of circular data. A simple example of could involve one trying to estimate when, on average, people sleep:

# Make up some data
sleeping <- c(
  24, 23, 24, 23, 22, 24, 23, 21, 22, 24, 23, 20, 22, 23, 24, 24,
  1, 2, 2, 1, 3, 2, 2, 1, 1, 3, 4, 2, 3, 3, 4, 5, 5, 6, 3, 1, 1
)
# Plot its distribution
hist(sleeping,
  breaks = 24,
  xlab   = "Time of day (hour)", ylab = "No. people sleeping",
  main   = ""
)

Histogram of sleeping
times.

From this made-up data it is clear that people sleep at night. But our go-to statistic, the mean, fails at describing this:

mean(sleeping)

## [1] 11.37838

Nobody was asleep at 11h30 in the afternoon! This statistic is not very helpful. How can we better describe the average time of asleepness?

Part of this issue is that time-of-day is not a linear measurement, but a circular one. It does not fall on a number line, as 24h00 is the same “thing” as 00h00. Our sleeping data would be better visualised as follows (using the R-package ggplot2) (Wickham 2009):

library(ggplot2)
# Set theme to something simpler, cleaner
theme_set(theme_minimal() + theme(panel.grid.minor = element_blank()))
ggplot(as.data.frame(sleeping), aes(sleeping)) +
  geom_histogram(bins = 24) +  # because 24hrs in a day
  scale_x_continuous(
    breaks = seq(0, 24, 2), 
    labels = paste0(seq(0, 24, 2), "h00")  # 01h00, 02h00 etc.
  ) +
  # Make the plot circular
  coord_polar() +
  theme(
    axis.title.x = element_blank(),
    # Remove non-sensical y-axis details now that plot is circular
    axis.title.y = element_blank(),
    axis.text.y  = element_blank()
  )

Circular histogram of sleeping
times.

Here, it is much clearer that people tend to be asleep around 00h00 to 01h00. The R-package circular (Agostinelli and Lund 2017) provides an interface to this common form of data in the social sciences and ecology. It enables us to calculate statistics such as the circular mean, which accounts for the fact that our data live on a circle and not a number line.

To convert data into explicitly circular data, we use circular::as.circular(), which takes a vector as input:

library(circular)

## 
## Attaching package: 'circular'

## The following objects are masked from 'package:stats':
## 
##     sd, var

sleeping <- as.circular(sleeping, 
  type     = "directions",
  units    = "hours",
  template = "clock24"
)
sleeping
plot(sleeping)

The default plot produced by the `plot.circular()`
method.

## Circular Data: 
## Type = directions 
## Units = hours 
## Template = clock24 
## Modulo = asis 
## Zero = 1.570796 
## Rotation = clock 
##  [1] 24 23 24 23 22 24 23 21 22 24 23 20 22 23 24 24  1  2  2  1  3  2  2
## [24]  1  1  3  4  2  3  3  4  5  5  6  3  1  1

Now, the mean makes much more sense:

mean(sleeping)
plot(sleeping)
points(mean(sleeping), col = "red")

The mean is now more sensible, thanks to
`circular::`.

## Circular Data: 
## Type = directions 
## Units = hours 
## Template = clock24 
## Modulo = asis 
## Zero = 1.570796 
## Rotation = clock 
## [1] 1.975252

My data: flowering times in Cape sedges

I began this post by saying that I wanted to share the application of circular statistics to some data for my Masters (albeit a small example thereof).

As above, I am comparing flowering data between six clades of gramminoid fynbos plants (Cyperaceae, Tribe Schoeneae), namely the genera Schoenus L. and Tetraria Beauv. Tetraria is known to be polyphyletic, and until recently the only Schoenus species in southern Africa was the cosmopolitain S. nigricans L. The rest, used below, have been transferred from Tetraria Beauv. to Schoenus L. by Elliott and Muasya (2017).

The last large-scale conspectus of the Cape flora was by Manning and Goldblatt (2012). The flowering data I use here are from that flora, with species names ammended to match then current taxonomy (as in Elliott and Muasya 2017). Each species has flowering times as follows:

Species	Start	End
Schoenus adnatus	Feb	Aug
Schoenus complanatus	Dec	Mar
Schoenus dregeanus	Dec	Apr
Schoenus gracilimus	Dec	Mar
Schoenus lucidus	Dec	Dec
Schoenus quadrangularis	Nov	May
Schoenus neovillosus	Jan	Feb
Schoenus arenicola	Jul	Nov
Schoenus compar	Apr	Jun
Schoenus pictus	Dec	May
Schoenus bolussi	Jul	Aug
Schoenus compactus	Aug	Nov
Schoenus crassus	Apr	Jun
Schoenus cuspidatus	Aug	Nov
Schoenus exilis	Apr	Jun
Schoenus graminifolius	Jul	Nov
Schoenus laticulmis	May	Dec
Schoenus variabilis	Apr	June
Tetraria capillacea	Oct	Nov
Tetraria crinifolia	Aug	Oct
Tetraria fasciata	Nov	May
Tetraria ferruginea	Mar	Jul
Tetraria flexuosa	Jan	May
Tetraria fourcadei	Jan	May
Tetraria maculata	Dec	Feb
Tetraria ustulata	Jan	May
Tetraria cernua	Oct	Nov
Tetraria burmanii	Nov	Apr
Tetraria fimbriolata	Jan	Apr
Tetraria microstachys	Dec	Feb
Tetraria nigrovaginata	Jan	Apr
Tetraria pillansii	Jan	Feb
Tetraria pubescens	Oct	Apr
Tetraria pygmaea	Feb	Apr
Tetraria vaginata	Sep	Apr
Tetraria bromoides	Oct	Feb
Tetraria eximia	May	Aug
Tetraria involucrata	Jan	Apr
Tetraria robusta	Feb	Mar
Tetraria secans	Apr	May
Tetraria thermalis	Jun	Oct
Tetraria triangularis	Feb	Apr

Flowering times (start and end) for Schoenus L. and Tetraria Beauv. (Cyperaceae, Tribe Schoeneae), from Manning and Goldblatt (2012).

Along with these data, taxonomic research groups these species into three clades in each genus (Schoenus and Tetraria). That merely explain the colour-coding used in the figures below.

For now, let’s unpack how to take "Jan", "Feb", "Mar" and turn it into meaningful numerical data, and treat that data circularly such that December and January are adjecent.

Flowering months as numerical data

First, we need to convert the monthly abbreviations into numbers. Simple enough: "Jan" = 1, "Feb" = 2, etc. I apply this process for each rows’ flowering_start and flowering_end values, row-by-row, using purrr–a functional programming package for R.

get_month_int <- function(month_abb) {
  # Converts "Jan", "Feb", etc. -> 1, 2, etc.
  month_abb %>%
    map_int(~ ifelse(.x %in% month.abb,
      which(month.abb == .x),
      NA
    ))
}
# Apply get_month_int() to the data
flowering_times <- flowering_times %>%
  mutate(
    flowering_start = get_month_int(flowering_start),
    flowering_end   = get_month_int(flowering_end)
  )
flowering_times

## # A tibble: 42 x 4
##    clade               species                flowering_start flowering_end
##    <chr>               <chr>                            <int>         <int>
##  1 Epischoenus spp.    Schoenus adnatus                     2             8
##  2 Epischoenus spp.    Schoenus complanatus                12             3
##  3 Epischoenus spp.    Schoenus dregeanus                  12             4
##  4 Epischoenus spp.    Schoenus gracilimus                 12             3
##  5 Epischoenus spp.    Schoenus lucidus                    12            12
##  6 Epischoenus spp.    Schoenus quadrangular…              11             5
##  7 Epischoenus spp.    Schoenus neovillosus                 1             2
##  8 S. compar-S. pictus Schoenus arenicola                   7            11
##  9 S. compar-S. pictus Schoenus compar                      4             6
## 10 S. compar-S. pictus Schoenus pictus                     12             5
## # … with 32 more rows

Now, let’s create sequences of months that a species is in flower for, starting at flowering_start, ending at flowering_end. This will be embedded within the dataframe as a list-column–where each entry in the column is itself a list, nested in the dataframe. I apply this process for each pair of flowering_start-flowering_end values, again, using purrr–a functional programming package for R.

get_month_seq <- function(start, end) {
  # Creates seq from start to end, unless end <= start
  # (e.g. Nov:Feb = 11, 12, 1, 2)
  map2(start, end, function(start, end) {
    month_seq <- 
      if (any(is.na(c(start, end)))) NA
      else if (start > end)          c(start:12, 1:end)
      else if (start <= end)         start:end
    month_seq
  })
}
flowering_times <- flowering_times %>%
  mutate(flowering_months = get_month_seq(flowering_start, flowering_end))
flowering_times

## # A tibble: 42 x 5
##    clade       species       flowering_start flowering_end flowering_months
##    <chr>       <chr>                   <int>         <int> <list>          
##  1 Epischoenu… Schoenus adn…               2             8 <int [7]>       
##  2 Epischoenu… Schoenus com…              12             3 <int [4]>       
##  3 Epischoenu… Schoenus dre…              12             4 <int [5]>       
##  4 Epischoenu… Schoenus gra…              12             3 <int [4]>       
##  5 Epischoenu… Schoenus luc…              12            12 <int [1]>       
##  6 Epischoenu… Schoenus qua…              11             5 <int [7]>       
##  7 Epischoenu… Schoenus neo…               1             2 <int [2]>       
##  8 S. compar-… Schoenus are…               7            11 <int [5]>       
##  9 S. compar-… Schoenus com…               4             6 <int [3]>       
## 10 S. compar-… Schoenus pic…              12             5 <int [6]>       
## # … with 32 more rows

Lastly, we just need to reshape this data into a form ggplot2 can understand: long-form data.

flowering_times <- flowering_times %>%
  unnest() %>% # unnest the list-columns so ggplot can understand
  # Arrange species by sub-generic clade
  arrange(clade) %>%
  mutate(species = factor(species, levels = unique(species)))
flowering_times

## # A tibble: 173 x 5
##    clade       species       flowering_start flowering_end flowering_months
##    <chr>       <fct>                   <int>         <int>            <int>
##  1 Epischoenu… Schoenus adn…               2             8                2
##  2 Epischoenu… Schoenus adn…               2             8                3
##  3 Epischoenu… Schoenus adn…               2             8                4
##  4 Epischoenu… Schoenus adn…               2             8                5
##  5 Epischoenu… Schoenus adn…               2             8                6
##  6 Epischoenu… Schoenus adn…               2             8                7
##  7 Epischoenu… Schoenus adn…               2             8                8
##  8 Epischoenu… Schoenus com…              12             3               12
##  9 Epischoenu… Schoenus com…              12             3                1
## 10 Epischoenu… Schoenus com…              12             3                2
## # … with 163 more rows

Plotting flowering times

Now we can really plot this data:

flowering_plot <- 
  ggplot(flowering_times, aes(flowering_months, species, fill = clade)) +
    geom_tile() +
    scale_x_continuous(breaks = 1:12, labels = month.abb) +
    theme(
      axis.title.x = element_blank(),  # no need for x-axis title
      axis.text.y  = element_text(hjust = 0)  # left-justify species names
    )
flowering_plot

It’s clear that all the (austral) summer flowering species ("Dec"–"Feb") could really benefit from a circular plot. Thus:

flowering_plot +
  # Make the plot circular
  coord_polar() +
  # Remove non-sensical y-axis details now that plot is circular
  theme(
    axis.text.y  = element_blank(),
    axis.title.y = element_blank()
  )

This is much more realistic. Problem is, it squashes up all the species in the centre of the circle.

Perhaps a better summary would be a count of the number of species in flower in a given month, for each clade. Like so:

flowering_plot2 <-
  ggplot(flowering_times, aes(flowering_months, fill = clade)) +
    geom_bar() +
    scale_x_continuous(breaks = 1:12, labels = month.abb) +
    coord_polar() +
    facet_wrap(~clade) +
    theme(
      axis.title.x    = element_blank(),
      axis.text.y     = element_blank(),
      axis.title.y    = element_blank(),
      strip.text      = element_blank()  # legend duplicates facet labels
    )
flowering_plot2

Now that we’ve figured out how to plot the data in some useful ways, let’s now try and analyses with the appropriate statistics.

Statistics for flowering times

First, let’s just simplify the dataset for the purposes of this exercise. I’m going to just group months where any species is in flower, grouped by clade:

flowering_summary <- flowering_times %>%
  select(clade, flowering_months) %>%
  # Split the dataframe into a list according to clade
  split(.$clade) %>%
  # For each item in the list, return only the vector of flowering months
  map(pull, flowering_months) %>%
  # Tidy up the list's names
  {set_names(., names(.) %>%
    # Replace hyphens ("-") & spaces (" ") with underscores ("_")
    str_replace_all("[-\ ]", "_") %>%
    # Remove fullstops (".")
    str_remove_all("\\.")
  )}
flowering_summary

## $Epischoenus_spp
##  [1]  2  3  4  5  6  7  8 12  1  2  3 12  1  2  3  4 12  1  2  3 12 11 12
## [24]  1  2  3  4  5  1  2
## 
## $S_compar_S_pictus
##  [1]  7  8  9 10 11  4  5  6 12  1  2  3  4  5
## 
## $S_cuspidatus_et_al
##  [1]  7  8  8  9 10 11  4  5  6  8  9 10 11  4  5  6  7  8  9 10 11  5  6
## [24]  7  8  9 10 11 12 NA
## 
## $T_fasciata_T_flexuosa
##  [1] 10 11  8  9 10 11 12  1  2  3  4  5  3  4  5  6  7  1  2  3  4  5  1
## [24]  2  3  4  5 12  1  2  1  2  3  4  5 10 11
## 
## $T_microstachys_T_burmannii
##  [1] 11 12  1  2  3  4  1  2  3  4 12  1  2  1  2  3  4  1  2 10 11 12  1
## [24]  2  3  4  2  3  4  9 10 11 12  1  2  3  4
## 
## $T_thermalis_T_bromoides
##  [1] 10 11 12  1  2  5  6  7  8  1  2  3  4  2  3  4  5  6  7  8  9 10  2
## [24]  3  4

Now let’s circular::-ise and mean the data (I also use the package astroFns (Harris 2012) to help convert my data into radians, which seems to be the simplest way to work with circular data):

library(tidyverse)
library(magrittr)  # ::divide_by()

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(circular)
library(astroFns)  # ::hms2rad()

month2rad <- function(month) {
  stopifnot(month %in% 1:12)
  hms2rad(2 * month)  # use 24hr time as astroFns:: does
}
rad2month <- function(radian) {
  radian %>%
    rad2hms() %>%
    substr(1, 2) %>%  # b/c astroFns:: returns an hms string
    as.numeric() %>%
    divide_by(2)
}
my_circular_mean <- function(x) {
  x %>%
    month2rad() %>%
    circular(rotation = "clock", template = "geographics") %>%
    mean() %>%
    rad2month()
}

flowering_circular_means <- flowering_summary %>%
  map(~.x[!is.na(.x)]) %>%  # FIXME/TODO: track down this NA for my MSc...
  map(my_circular_mean)
flowering_circular_means

## $Epischoenus_spp
## [1] 2
## 
## $S_compar_S_pictus
## [1] 4.5
## 
## $S_cuspidatus_et_al
## [1] 8
## 
## $T_fasciata_T_flexuosa
## [1] 2
## 
## $T_microstachys_T_burmannii
## [1] 1.5
## 
## $T_thermalis_T_bromoides
## [1] 3.5

Let’s add these to our plot:

# Put the means in a data.frame so ggplot2 can understand
mean_flowering_times <- flowering_circular_means %>%
  imap_dfr(~tibble(clade = .y, mean_flowering_time = .x)) %>%
  arrange(clade) %>%
  mutate(clade = unique(sort(flowering_times$clade)))
flowering_plot2 +
  geom_vline(data = mean_flowering_times, aes(xintercept = mean_flowering_time))

Those means seem to make sense—they are at the intuitive centre of each clade’s flowering distribution! Hooray!

Session info

I present here my machine’s state at time of writing, for reproducibility’s sake.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] astroFns_4.1-0  magrittr_1.5    forcats_0.4.0   stringr_1.4.0  
##  [5] dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
##  [9] tibble_2.1.1    tidyverse_1.2.1 here_0.1        circular_0.4-93
## [13] ggplot2_3.1.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.6         haven_2.1.0      lattice_0.20-38 
##  [5] vctrs_0.1.0      colorspace_1.4-1 generics_0.0.2   htmltools_0.3.6 
##  [9] yaml_2.2.0       utf8_1.1.4       rlang_0.3.4      pillar_1.4.0    
## [13] glue_1.3.1       withr_2.1.2      modelr_0.1.4     readxl_1.3.1    
## [17] plyr_1.8.4       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
## [21] rvest_0.3.3      mvtnorm_1.0-10   evaluate_0.13    labeling_0.3    
## [25] knitr_1.22       fansi_0.4.0      highr_0.8        broom_0.5.2     
## [29] Rcpp_1.0.1       scales_1.0.0     backports_1.1.4  jsonlite_1.6    
## [33] hms_0.4.2        digest_0.6.19    stringi_1.4.3    grid_3.5.0      
## [37] rprojroot_1.3-2  cli_1.1.0        tools_3.5.0      lazyeval_0.2.2  
## [41] zeallot_0.1.0    crayon_1.3.4     pkgconfig_2.0.2  xml2_1.2.0      
## [45] lubridate_1.7.4  rstudioapi_0.10  assertthat_0.2.1 rmarkdown_1.12  
## [49] httr_1.4.0       R6_2.4.0         boot_1.3-22      nlme_3.1-137    
## [53] compiler_3.5.0

References

Agostinelli, C., and U. Lund. 2017. R Package circular: Circular Statistics (Version 0.4-93). CA: Department of Environmental Sciences, Informatics; Statistics, Ca’ Foscari University, Venice, Italy. UL: Department of Statistics, California Polytechnic State University, San Luis Obispo, California, USA. https://r-forge.r-project.org/projects/circular/.

Elliott, T.L., and A.M. Muasya. 2017. “Taxonomic Realignment in the Southern African Tetraria (Cyperaceae, Tribe Schoeneae; Schoenus Clade).” South African Journal of Botany 112: 354–60. https://doi.org/https://doi.org/10.1016/j.sajb.2017.06.011.

Harris, Andrew. 2012. AstroFns: Astronomy: Time and Position Functions, Misc. Utilities. https://CRAN.R-project.org/package=astroFns.

Manning, John, and Peter Goldblatt. 2012. Plants of the Greater Cape Floristic Region. 1: The Core Cape Flora. South African National Biodiversity Institute.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Twitter Facebook LinkedIn

Wrapping my head around circular means

A small introduction to circular data

My data: flowering times in Cape sedges

Flowering months as numerical data

Plotting flowering times

Statistics for flowering times

Session info

References

You May Also Enjoy

Running a workshop on data wrangling in R with the `tidyverse`

Hello, world

Identifying and thinking about a weed

What determines a species range? Thoughts about SDMs and niches

A small introduction to circular data

My data: flowering times in Cape sedges

Flowering months as numerical data

Plotting flowering times

Statistics for flowering times

Session info

References

You May Also Enjoy

Running a workshop on data wrangling in R with the tidyverse

Hello, world

Identifying and thinking about a weed

What determines a species range? Thoughts about SDMs and niches

Running a workshop on data wrangling in R with the `tidyverse`