In Her Mind's Eye

Explorations in history data

Height Data for Women in the Digital Panopticon

Introduction

The Digital Panopticon has brought together massive amounts of data about British prisoners and convicts in the long 19th century, including heights for many thousands of individuals. Adult height is strongly influenced by environmental factors in childhood, one of the most important being nutrition. So, if you have enough of it, height data for past populations is extremely informative about standards of living.

I blogged about this data for female prisoners for Women’s History Month in March, and you can read that post for more of the historical discussion.

I decided to re-post a shorter version over here to share the code and focus on the use of two types of particularly important statistical visualisations: box plots and histograms.

The data

The four datasets are subsets of Digital Panopticon datasets.

  • HCR, Home Office Criminal Registers 1790-1801, prisoners held in Newgate awaiting trial (1226 heights total, 1061 aged over 19)
  • CIN, Convict Indents 1820-1853, convicts transported to Australia (17183 heights, 14181 over 19)
  • PLF, Female prison licences 1853-1884, female convicts sentenced to penal servitude (571 heights, 535 over 19)
  • RHC, Registers of Habitual Criminals 1881-1925, recidivists who were under police supervision following release from prison (12599 heights, 12118 over 19)

For each dataset, I included only female prisoners with a year of birth as well as a height, and then filtered out children and teenagers so we have adult heights only. It can’t quite be assumed that the datasets contain only unique individuals, so the results here are very much provisional (don’t cite them!). My main interest is in exploration of the data.

# packages ####

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'forcats' was built under R version 3.4.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# get data ####

# in each case this is summary data derived from the full DP datasets, including only height, age/yob, event year. HCR, PLF and RHC are (or shortly will be) downloadable as open data; there's more info about this in the links above. CIN is not yet open.

# in the original sources heights were recorded in feet and inches (to quarters of an inch), which have been converted to decimal inches (e.g. 5 ft 1 1/4 in = 61.25).

# HCR: home office criminal registers 1790-1801 - defendants awaiting trial
# 1226 rows, 1061 over 19 
# possible that a few women appear more than once

hcr <- read_csv("../data/bp/hcr_heights_20180316.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   h_year = col_integer(),
##   age = col_integer(),
##   year_birth = col_integer(),
##   height = col_double()
## )
hcra <- hcr %>% 
  mutate(
          decade = h_year - (h_year %% 10), 
          decade_birth = year_birth - (year_birth %% 10),
          dataset="hcr" 
          ) %>% 
  filter(age > 19)

# CIN: Oxley convict indents 1820-1853 - transported convicts
# 17183 rows total, 14181 over 19 
# unlikely to be more than a very small handful of repeat appearances - it was very rare to be transported more than once!

cin <- read_csv("../data/bp/cin_heights_20180314.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   h_year = col_integer(),
##   age = col_integer(),
##   year_birth = col_integer(),
##   height = col_double()
## )
cina <- cin %>% 
  mutate(
    decade = h_year - (h_year %% 10), 
    decade_birth = year_birth - (year_birth %% 10), 
    dataset = "cin" 
  ) %>% 
  filter(age > 19)

# PLF: prison licences 1853-1884
# 571 rows, 535 over 19
# repeat offenders shouldn't be an issue as files had already been amalgamated.

plf <- read_csv("../data/bp/plf_heights_20180314.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   h_year = col_integer(),
##   age = col_integer(),
##   year_birth = col_integer(),
##   height = col_double(),
##   daterc = col_date(format = ""),
##   source = col_character(),
##   ref_id = col_character()
## )
plfa <- plf %>% 
  mutate(
    decade = h_year - (h_year %% 10), 
    decade_birth = year_birth - (year_birth %% 10), 
    dataset="plf"  
  ) %>% 
  select(id, h_year,age,year_birth,height,decade,decade_birth, dataset) %>% 
  filter(age > 19)

# RHC: register of habitual criminals 1881-1925
# 12599, 12118 over 19
# likely to be some repeat offenders

rhc <- read_csv("../data/bp/rhc_heights_20180316.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   h_year = col_integer(),
##   age = col_integer(),
##   year_birth = col_integer(),
##   height = col_double()
## )
rhca <- rhc %>% 
  mutate(
    decade = h_year - (h_year %% 10), 
    decade_birth = year_birth - (year_birth %% 10), 
    dataset="rhc" 
  ) %>% 
  filter(age > 19)


# stack 'em up (bind_rows = sql union)

hcr_cin_plf_rhc <- bind_rows(hcra, cina, plfa, rhca)
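The four preparation blocks above repeat the same steps, so they could be wrapped in a small function. This is only a sketch, not the code used for this post: `prep_heights` is a hypothetical helper name, and the `toy` rows are made up purely to show what the decade arithmetic does.

```r
library(dplyr)

# Hypothetical helper: add decade columns and a dataset label, keep adults only
prep_heights <- function(df, name) {
  df %>%
    mutate(
      decade = h_year - (h_year %% 10),               # e.g. 1799 -> 1790
      decade_birth = year_birth - (year_birth %% 10),
      dataset = name
    ) %>%
    filter(age > 19)
}

# Made-up rows for illustration only (not Digital Panopticon data)
toy <- data.frame(
  id = 1:3,
  h_year = c(1795L, 1799L, 1801L),
  age = c(18L, 25L, 40L),
  year_birth = c(1777L, 1774L, 1761L),
  height = c(59.5, 61.25, 62)
)

toy_adults <- prep_heights(toy, "toy")
toy_adults
```

With the real files, a call such as `prep_heights(hcr, "hcr")` should give the same result as the `hcra` block above, although PLF would still need its extra `select()` step to drop the additional columns.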

The visualisations

A box plot, or box and whisker plot, is a really concentrated way of visualising what statisticians call the “five-number summary” of a dataset:

  1. the median;
  2. upper quartile (the value below which three quarters of the data fall);
  3. lower quartile (the value below which one quarter of the data falls);
  4. minimum value;
  5. maximum value.

Here’s a diagram:

The thick green middle bar marks the median value. The two blue lines parallel to that (aka ‘hinges’) show the upper and lower quartiles. The pink horizontal lines extending from the box are the whiskers. In this version of a box plot, the whiskers don’t necessarily extend right to the minimum and maximum values. Instead, they’re calculated to exclude outliers which are then plotted as individual dots beyond the end of the whiskers.
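To make the whisker rule concrete, here is a small sketch using base R's `boxplot.stats()` on made-up numbers. By default it treats any value more than 1.5 times the interquartile range beyond a hinge as an outlier. ggplot2's `geom_boxplot()` uses the same 1.5 rule for its whiskers but calculates the hinges slightly differently, so treat this as an illustration rather than an exact reproduction of the plots in this post.

```r
# Made-up values, with one deliberately extreme number at the end
x <- c(1, 3, 3, 4, 4, 4, 6, 7, 12)

bs <- boxplot.stats(x)  # default coef = 1.5

# lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# values beyond the whiskers, which would be plotted as individual dots
bs$out
```

Here the hinges are 3 and 6, so the interquartile range is 3 and anything above 6 + 4.5 = 10.5 counts as an outlier. The upper whisker therefore stops at 7 and the value 12 is reported separately.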

So what’s the point of all this? Imagine two datasets: one contains the values 4,4,4,4,4,4,4,4 and the other 1,3,3,4,4,4,6,7. The two datasets have the same mean and the same median (both 4), but the distribution of the values is very different. A boxplot is useful for looking more closely at such variations within a dataset, or for comparing different datasets, which might look pretty much the same if you only considered averages.
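The two example datasets are easy to check in base R. This sketch uses `fivenum()`, which returns Tukey's five-number summary; the variable names `a` and `b` are just for illustration.

```r
a <- c(4, 4, 4, 4, 4, 4, 4, 4)
b <- c(1, 3, 3, 4, 4, 4, 6, 7)

# Same averages...
mean(a); mean(b)      # both 4
median(a); median(b)  # both 4

# ...but very different spreads
# (minimum, lower hinge, median, upper hinge, maximum)
fivenum(a)  # 4 4 4 4 4
fivenum(b)  # 1 3 4 5 7
```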

Histograms are less complex; they’re a type of bar chart that’s particularly useful for visualising the distribution of a dataset.

First views of the datasets

The first thing I look for is incongruities and impossible numbers that might suggest problems with the data; spotting these is one of the great benefits of exploratory data visualisation. If we see people over 7 feet or under 3 feet tall, or born in 1650, that’s very unlikely to be correct. Minor data errors are par for the course, and if they affect only a tiny number of records, I’ll just filter them out in subsequent analysis. If there are a lot of issues, though, that can suggest bigger problems with the reliability of the data.

A trickier problem is values that might or might not be errors. In the 19th century women over the height of 6 feet, for example, are very rare indeed, but they do exist so it can’t be assumed that’s wrong.
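The same kind of check can be done without a plot. A minimal sketch on made-up rows: the height limits (36 and 84 inches) come from the 3 feet and 7 feet mentioned above, while the birth-year cutoff of 1700 is an arbitrary assumption for the example, not a rule applied in this post.

```r
library(dplyr)

# Made-up rows for illustration only (not Digital Panopticon data)
toy <- data.frame(
  id = 1:5,
  year_birth = c(1775L, 1650L, 1802L, 1811L, 1790L),
  height = c(61.5, 60, 86, 30.25, 72.5)
)

# Flag implausible values: under 3 feet, over 7 feet, or born before 1700
suspect <- toy %>%
  filter(height < 36 | height > 84 | year_birth < 1700)

suspect

# What share of the rows are affected?
nrow(suspect) / nrow(toy)
```

Row 5 (72.5 inches, just over 6 feet) is deliberately not flagged: it is rare but possible, so it stays in.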

HCR

# nb use of varwidth to vary width of box according to relative size of group within dataset
ggplot(hcra, aes(factor(decade_birth), height)) +
  geom_boxplot(varwidth = TRUE, fill="#D55E00", alpha=0.5, outlier.size = 0.7, outlier.alpha = 1) +
  scale_y_continuous(breaks=seq(36,84,by=3)) +
  labs(y="height (inches)", x="birth decade", title="Women's heights in HCR by birth decade")

If you compare this HCR plot to the following ones, you’ll see that it has more extreme outliers, and overall the boxes are more asymmetrical (this appears to a lesser extent with PLF). We’ll come back to this.

CIN

ggplot(cina, aes(factor(decade_birth), height)) +
  geom_boxplot(varwidth = TRUE, fill="#D55E00", alpha=0.5, outlier.size = 0.7, outlier.alpha = 1) +
  scale_y_continuous(breaks=seq(48, 74,by=2)) +
  labs(y="height (inches)", x="birth decade", title="Women's heights in CIN by birth decade")

PLF

ggplot(plfa, aes(factor(decade_birth), height)) +
  geom_boxplot(varwidth = TRUE, fill="#D55E00", alpha=0.5, outlier.size = 0.7, outlier.alpha = 1) +
  scale_y_continuous(breaks=seq(48,68,by=2))  +
  labs(y="height (inches)", x="birth decade", title="Women's heights in PLF by birth decade")

RHC

ggplot(rhca, aes(factor(decade_birth), height)) +
  geom_boxplot(varwidth = TRUE, fill="#D55E00", alpha=0.5, outlier.size = 0.7, outlier.alpha = 1) +
  scale_y_continuous(breaks=seq(48,74,by=2)) +
  labs(y="height (inches)", x="birth decade", title="Women's heights in RHC by birth decade")

Put them all together!

The combined plot filters out women born before 1750 or after 1899, because the numbers were very small, and also drops extreme outliers (heights of 40 inches or less, or 80 inches or more). I added a guideline at 61 inches, the median for the 1820s (the mid-point), which I think makes it easier to see the trends (I discuss what I think those mean at more length in the earlier blog post).

Ta da! Not going to lie, I was pretty pleased with this.

ggplot(hcr_cin_plf_rhc %>% 
    filter(year_birth > 1749, year_birth < 1900, height > 40, height < 80)
       , aes(factor(decade_birth), height)) +
  geom_boxplot(varwidth = TRUE, fill="#D55E00", alpha=0.5, outlier.size = 0.5, outlier.alpha = 1) +
  labs(y="height (inches)", x="birth decade", title="Women's heights by birth decade, 1750-1899") +
  scale_y_continuous(breaks=seq(40,76,by=2)) +
  # constant guideline: set yintercept directly rather than mapping it inside aes()
  geom_hline(yintercept=61, colour="#990000", linetype=2)

Problems

But it’s time to come back and take another look at the HCR data. I’m going to switch to histograms, and these show more clearly that there’s something up. A ‘normal’ height distribution in a population should look like a “bell curve” - quite tightly and symmetrically clustered around the average. (In fact, height data seems to be quite often used in examples of typical normal distributions.) CIN and RHC come close to this.

ggplot(hcr_cin_plf_rhc %>% filter(dataset %in% c("cin", "rhc"))
       , aes(height)) +
  geom_histogram(binwidth=1) +
  facet_grid(~dataset) +
  scale_x_continuous(breaks=seq(48,72,by=2)) +
  labs(y="", title="Heights distribution for women in CIN and RHC")

PLF isn’t quite as good, though the outliers are minimal.

ggplot(plfa, aes(height)) +
  geom_histogram(binwidth=1) +
  labs(y="", title="Heights distribution for women in PLF") +
  scale_x_continuous(breaks=seq(40,76,by=2))

But HCR is extremely problematic.

ggplot(hcra, aes(height)) +
  geom_histogram(binwidth=1) +
  labs(y="", title="Heights distribution for women in HCR") +
  scale_x_continuous(breaks=seq(36,84,by=2))

Now we can see it’s not just the outliers that are the problem: the distribution is the wrong shape. It is multimodal - that’s to say, it has several peaks instead of a single central one.

What could be causing this? The first thing I’ll need to do is go back to the data and see if there are problems with the transcriptions or with the extraction of heights from the transcriptions - the spike at 60 inches (5 feet) could suggest that data has got truncated. But if that seems OK, it might mean that the heights were inaccurately recorded in the first place. Then we really have a problem…
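One quick way to start checking is to look at how often heights land on a whole inch. This is a sketch on made-up numbers, not a result from HCR. It relies on heights being stored as decimal inches in quarter-inch steps, so that `height %% 1` gives the fractional part exactly.

```r
# Made-up heights in decimal inches (illustration only)
heights <- c(60, 60, 60, 60, 61.25, 59.5, 60, 62, 63.75, 60, 58.5, 60)

# Fractional part of each height: 0, 0.25, 0.5 or 0.75
frac <- heights %% 1

# Share of heights recorded as a whole number of inches
whole_share <- mean(frac == 0)
whole_share

# Counts by fractional part and by whole inch
table(frac)
table(floor(heights))
```

If quarter inches had been recorded consistently, you might expect the four fractions to turn up in broadly similar numbers. A large excess of whole inches, or a pile-up on a single value such as 60, would point towards rounding or truncation somewhere between the original register and the dataset.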

Data on Github

The data used in posts can normally be found here (there may be more detail in the body of the post). Data and code files may also be available on Github.

Re-use: Unless otherwise stated, all data, code and images on this site are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.