Burials in 18th-century London

Exploring missing data with naniar

R
missing data
death
Author

Sharon Howard

Published

1 January 2023

The dataset

** need to update the data, as i’ve redone the deduplication **

the dataset is from UKDS, by CAMPOP - open data (CC BY NC licence)

don’t recommend the excel version… i think it only has the “daybook” data. then again!

https://doi.org/10.5255/UKDA-SN-8642-2

This data collection comprises individual burial account book burial records and parish register burial records filling in gaps where there were no extant/accessible burial accounts books for St Dunstan Stepney, London between 1733 and 1843. This produces a continuous series of burials except for a gap beginning July 1808 and ending January 1812. Transcriptions include full names, burial fees itemised under various headings as a potential indicator of social status, and hamlet address. Also in some periods ages and causes of death.

also see https://www.campop.geog.cam.ac.uk/research/projects/migrationmortalitymedicalisation/

I selected only a small set of the variables in the original data for exploration, in particular age at burial and cause of death.

The {naniar} package

Data cleaning

Dates were in separate day, month and year fields; I converted these to a single date field.

For cause of death, age and address, I standardised varying values for missing information (eg “blank” or “-”) to NA (R’s symbol for missing values). I did the same with values transcribed as “illegible” (or similar), though technically speaking there’s a difference between values that are missing because they were never written down in the first place, and values that are missing because they couldn’t be transcribed. But the numbers of illegibles are small (about 1% of NAs for each variable). did i clean occupation, title, status as well? they have plenty of NAs and only the odd (illegible)/blank/unknown ? check, but i think i might assume i did, they look far too clean otherwise.

I also standardised some variations in spelling/wording of common causes of death using OpenRefine, but this was not comprehensive.

There appeared to be some duplicate entries (caused presumably by overlaps between the two sources used in the dataset), which I removed where possible.

I added gender on the basis of first names (using historical names data), relationships (eg “daughter”=female) or status/title (eg “Mr”, “Lady”), but it’ll need to be borne in mind that it’s not part of the original records.

I trimmed the dataset slightly to a century 1740-1839, purely for convenience so that it would be easier to split it into decades and 25 year periods. This resulted in a dataset of 57068 burials.

Code
## make a gist of the original cleaning to get burials.csv from burials_cln.txt
## packages

library(readxl)
library(janitor)
library(lubridate)
library(scales)
#library(debkeepr)

library(knitr)
library(kableExtra)

library(patchwork)

library(tidytext)
library(tidyverse)

library(ggthemes)
theme_set(theme_minimal())


### devtools::install_github("hrbrmstr/waffle")
library(waffle)
# this generates a warning about fonts, do i really need it?
#library(hrbrthemes)

#library(skimr)

# deprecated warnings with current version of naniar, try github update?
#remotes::install_github("njtierney/naniar")
library(naniar)

## data
##preprocessing in preparation/processing

burials <-
  read_csv(here::here("site_data/stepney_burials/stepney_burials_20230102.csv"))

ons_deaths_2019 <-
  read_excel(here::here("site_data/stepney_burials/ons_weekly_deaths_2019.xlsx")) %>%
  clean_names("snake")

# i don't remember if there were any undated before processing? probably not given nature of source?
burials_1740 <-
  burials  %>%
  mutate(year = year(date)) %>% 
  filter(between(year, 1740, 1839)) %>%
  mutate(decade = year - (year %% 10))  %>%
  mutate(period = case_when(
    between(year, 1740, 1764) ~ "1740-1764",
    between(year, 1765, 1789) ~ "1765-1789",
    between(year, 1790, 1814) ~ "1790-1814",
    between(year, 1815, 1839) ~ "1815-1839"
  ))  %>%
  mutate(month_num = month(date),
         month_name = month(date, label=TRUE), # will this do what i think it does...
         month = str_pad(month_num, 2, pad="0"),
         year_month=paste(year, month, sep="-"))  %>%
  # tidy up causes a bit
  mutate(cause_std = case_when(
    str_detect(cause_std, "teeth") ~ "teething",
    str_detect(cause_std, "\\bfits") ~ "convulsions",
    TRUE ~ cause_std
  )) %>%
  mutate(age_group_n = case_when(
    is.na(age_burial) ~ NA_real_,
    TRUE ~ age_burial - (age_burial %% 10)
  )) %>% 
  mutate(age_group = case_when(
    is.na(age_group_n) ~"n/a",
    age_group_n==0 ~ "0-9",
    age_group_n==10 ~ "10-19",
    age_group_n==20 ~ "20-29",
    age_group_n==30 ~ "30-39",
    age_group_n==40 ~ "40-49",
    age_group_n==50 ~ "50-59",
    age_group_n==60 ~ "60-69",
    age_group_n==70 ~ "70-79",
    age_group_n==80 ~ "80-89",
    age_group_n==90 ~ "90+",
    age_group_n==100 ~ "90+",
  ))  %>%
  mutate(address = case_when(
    str_detect(address, "blank\\]|\\[__+\\]|\\[St Vitus|\\(blank\\)") ~ NA_character_,
    str_detect(address, "illegible|too faded|\\[M|ells\\]|____\\[\\?") ~  NA_character_, # "-illegible",
    address=="=" ~ NA_character_,
    str_detect(address, "Mid'x") ~ str_replace(address, "Mid'x", "Middlesex"),
    str_detect(address, "___... by a Dray") ~ str_remove(address, "___... by a Dray"),
    str_detect(address, "Stratf'd") ~ str_replace(address, "Stratf'd", "Stratford"),
    str_detect(address, "Southw'k") ~ str_replace(address, "Southw'k", "Southwark"),
    str_detect(address, "Mag'n") ~ str_replace(address, "Mag'n", "Magdalen"),
    # Kath'  # Sop'd ?? # An'd ?
    TRUE ~ address
  )) %>%
  select(-agebur_int, -age_years)

# why had i commented this out?
burials_1740_gender <-
burials_1740 %>%
  filter(!is.na(gender))


# probably skip fees for this
# fees <-
#   read_csv(here::here("site_data/stepney_burials/step_burials_fees_long.csv"))
# fyi semi join fees to burials loses some rows because you've filtered some rows of the original data
# fees_1740 <-
# fees %>%
#   semi_join(burials_1740_gender, by="burial_id")


# so what is the difference here...
# 
# agebur / age_burial / age_years / agebur_int ?
# the original was agebur; 
# agebur_int came from the age dict data. 
# age_years appears to be the last step of age dict cleaning
# age_burial = final clean up
# can't see any need to use age_years or agebur_int
# 
# cause, cause_std, cause_cln ? looks like this was done in OR
# cause = original
# cause_n has been lowercased and std spellings
# plus cause_std has NA'ed a few in cause_n that aren't causes or uninterpretable. so it is final version
# 
# include gender but may not do anything with it

# for naniar selection of stuff you really want
# not much point in keeping original agebur or cause i think - agebur is misleading because there are 21.2k "-".

sb_burials_na <-
burials_1740 %>%
  select(
    person_id, burial_id, source, 
    date, year, decade, period,
    relship, occupation, status, title,
    address,  cause_std,
    age_burial, age_group_n,
    gender
  )

# now bind_shadow. am i actually using this anywhere? nope. drop it.
# sb_shadow <-
# sb_burials_na %>%
#   bind_shadow(only_miss = TRUE) 

Statisticians divide missing data into three general types, which are not all equally difficult to deal with:

Missing completely at random, or MCAR, is missingness that has no association with any data you have observed, or not observed. In other words, the cause of the missingness can be considered truly random, and unrelated to observed or unobserved variables meaningful to the data and your analyses.

It may be difficult to diagnose the cause of this type of missingness, precisely because it is random. But it’s likely to look random in visualisations. As a result you can probably quite safely drop the values from an analysis without worrying too much about skewing the results (except that you might end up with a very small sample), or impute calculated values such as averages.

Missing at random (MAR) occurs when missingness depends on data you have observed, but not on unobserved data.

If missingness within a variable is related to unobserved data (including values of the missing variable itself), the missingness is missing not at random (MNAR). [Aka Missing Not at Random (MNAR)!]

Here, data isn’t missing at random - the missingness is related to the nature of the data variable itself - but we don’t have any information about the reason why.

?An example is that very elderly people may be less likely to know (or remember) how old they are and so their ages are more likely to be unrecorded or recorded as estimates.?

Exploring missing values

An overview

There are no missing dates; the original sources are chronological registers so the date was always recorded. (Which is often not the case for historical data!) The source (daybook or parish register) is also always present; this may be a significant variable when considering variations in missings.

Gender needs to be considered separately, but my efforts seem to have been successful on first glance.

After that, there are massive variation in missingness.

  • address (which I think is generally parish level) is very well covered, with only 7% missing
  • age and cause of death come next, around 30% missing
  • the other variables were only occasionally recorded
Code
burials_1740_missing_summary <-
burials_1740 %>%
 select(status, occupation, gender, relship, age_burial, cause_std, address, title) %>%
  miss_var_summary()

burials_1740_missing_summary %>%
  kable(digits = 2) %>% 
  kable_styling()
variable n_miss pct_miss
title 55418 97.11
occupation 53871 94.40
status 51186 89.69
relship 50407 88.33
age_burial 16891 29.60
cause_std 15658 27.44
address 3783 6.63
gender 2318 4.06

And here is the same data visualised, which shows up clusters more clearly (though less precisely) than a table.

Code
burials_1740  %>%
 select(status, occupation, gender, relship, age_burial, cause_std, address, title) %>%
  gg_miss_var(show_pct = TRUE)

Code
#warning seems to have gone with github version of naniar.
#Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in ggplot2 3.3.4. Please use "none" instead.

More detailed analysis

Since I know date is always present, I can use that to explore if or how much missing data might vary over time. I’ve split the dataset into four 25-year periods. I also want to know if there are significant differences between the two sources used.

Code
burials_1740_periods_missing_summary <-
burials_1740  %>%
 select(status, occupation, gender, relship, age_burial, cause_std, address, period, title)  %>%
  group_by(period) %>%
  miss_var_summary() %>%
  ungroup()

burials_1740_source_missing_summary <-
burials_1740  %>%
 select(status, occupation, gender, relship, age_burial, cause_std, address, source, title)  %>%
  group_by(source) %>%
  miss_var_summary() %>%
  ungroup()

#i think higher na gender in later decades is probably related to lack of status/relship data which you used to supplement names.
burials_1740_periods_missing_summary  %>%
 # select(status, occupation, gender, relship, age_burial, cause_std, address, period, title)  %>%
 #  group_by(period) %>%
 #  miss_var_summary() %>%
 #  ungroup() %>%
  ggplot(aes(variable, pct_miss)) +
  geom_col(width = 0.001, colour = "#6e016b", fill = "#6e016b") +
  geom_point(colour = "#6e016b", fill = "#6e016b") +
  coord_flip() +
  facet_wrap(~period, nrow = 1) +
  scale_color_discrete(guide = "none")  +
  labs(y="% missing", x=NULL)

Code
#  gg_miss_var(show_pct = TRUE, facet=county)

Now it’s possible to see that cause of death and age were recorded less often in the first decades of the dataset. In particular, in the 1740-64 period age was recorded only about 67% of the time but progressively improved until it was almost always recorded in 1815-39. Recording cause of death didn’t change quite as much, going from about 55% in 1740-64 to about 80% in 1790-1814 before getting slightly worse again in the last period.

With the gender data, significant variations might suggest, for example, that my data doesn’t reflect shifts in naming or spelling accurately. And it turns out there is some variation, from only 1.5% missing in 1740-64 but 7% in 1790-1814; although it’s all pretty good I’d want to work out why that’s happened and whether I can make it any better.

Code
burials_1740_periods_missing_summary %>%
  filter(variable=="gender") %>%
  kable() %>%
  kable_styling()
period variable n_miss pct_miss
1740-1764 gender 254 1.526718
1765-1789 gender 418 2.955735
1790-1814 gender 683 6.959446
1815-1839 gender 963 5.845220

The naniar function gg_miss_fct creates a heatmap, and I think a breakdown by decade is particularly revealing. Lighter squares represent higher % missing within the decade, so (ignoring gender) we get a nice picture of shifts in recording information. Although as already noted, age and cause of death were less likely to be recorded in the early decades, occupation, status and relationship were all more likely to be recorded then than later on. Address was slightly less often recorded in the middle decades. There was also a noticeable decline in recording cause of death in the very last decade.

Code
burials_1740 %>%
  select(decade, gender, age_burial, cause_std, occupation, relship, status, address, title) %>%
  gg_miss_fct(fct=decade)

However.

There are big differences in missing rates for the key data of cause of death and age between daybooks and parish registers. Cause of death is hardly ever recorded in parish registers (99.9% missing) and age is missing about 70% of the time, compared to 11.9% and 21% missing in daybooks.

(On the other hand, I’ve had a better success rate with gender in the parish registers - only 1.5% missing cf 4.6% in daybooks; maybe first names are recorded more often in registers.)

Code
burials_1740_source_missing_summary  %>%
 # select(status, occupation, gender, relship, age_burial, cause_std, address, period, title)  %>%
 #  group_by(period) %>%
 #  miss_var_summary() %>%
 #  ungroup() %>%
  ggplot(aes(variable, pct_miss)) +
  geom_col(width = 0.001, colour = "#6e016b", fill = "#6e016b") +
  geom_point(colour = "#6e016b", fill = "#6e016b") +
  coord_flip() +
  facet_wrap(~source, nrow = 1) +
  scale_color_discrete(guide = "none")  +
  labs(y="% missing", x=NULL)

They’re not evenly spread; in fact, the dataset documentation makes it clear that daybooks were the main source and parish registers were used to fill in gaps - in other words, the dataset creators had a missing data problem, which they solved by using a different source.

Code
##parish register is clustered in two periods or 4 decades - 1740s-60s and 1830s (about half). [probably not enough dups to worry about which i preferred but presumably daybook since that's better on cause/age]
burials_1740 %>%
  count(decade, source) %>%
  group_by(decade) %>%
  mutate(s=sum(n), p=n/s*100) %>%
  ungroup() %>%
  ggplot(aes(decade, n, fill=source)) +
  geom_col(position = "fill")

For some analysis (eg general changes over time, seasonal patterns), that won’t be a problem. It might not be a problem for the categories of cause of death and age that I’m interested in either, if the missing daybooks were MCAR (missing completely at random). Can I work out whether that’s the case?

To start with, it’s a promising sign that several other original variables - title, status, relship and occupation - have pretty similar missing data profiles overall and even when broken down by decade. This is the heatmap again for each source, showing only the decades in which parish registers have been used.

Code
burials_1740 %>%
  filter(decade %in% c(1740, 1750, 1760, 1830), source=="parish register") %>%
  select(decade, occupation, relship, status, address, title) %>%
  gg_miss_fct(fct=decade)  + ggtitle("parish registers") +

burials_1740 %>%
  filter(source=="daybook", decade %in% c(1740, 1750, 1760, 1830)) %>%
  select(decade, occupation, relship, status, address, title) %>%
  gg_miss_fct(fct=decade) + ggtitle("daybooks") +

plot_layout(ncol = 1, guides = "collect")

if you convert all parish register stuff (except dates!) to NA, what happens.

Code
burials_par <-
burials_1740 %>%
   filter(decade %in% c(1740, 1750, 1760, 1830))

#???? not sure if this will do anything useful
burials_par_na <-
burials_1740 %>%
  mutate(across(
    c(title, forename, given, surname, gender, relship, relname, cause_std, address, occupation, status, age_group),
    ~case_when(
      source=="parish register" ~ NA_character_,
      TRUE ~ .
    )
  )) %>%
  mutate(across(
    c(age_burial, age_group_n),
    ~case_when(
      source=="parish register" ~ NA_real_,
      TRUE ~ .
    )
  ))

address? this is probably not very std though and recording habits might differ? but could do something like the top 20 huh, all the illegibles are in the daybook i think

Code
burials_1740 %>%
  mutate(address_lc = str_to_lower(address)) %>%
  count(address_lc, source, sort = T) %>%
  group_by(source) %>%
  top_n(50,n) %>%
  mutate(rn = row_number()) %>%
  mutate(p = n/sum(n)) %>%
  ungroup() %>%
  arrange(address_lc)
# A tibble: 136 × 5
   address_lc              source              n    rn        p
   <chr>                   <chr>           <int> <int>    <dbl>
 1 aldgate                 daybook            20    26 0.000435
 2 aldgate                 parish register    15    21 0.00150 
 3 bermondsey              daybook            16    31 0.000348
 4 bethnal green           daybook          2437     5 0.0530  
 5 bethnal green           parish register  1359     3 0.136   
 6 blackwall               parish register   107     7 0.0107  
 7 bromley                 parish register     2    49 0.000200
 8 camberwell              daybook            12    43 0.000261
 9 camberwell              parish register     3    40 0.000301
10 christ church middlesex daybook            84    13 0.00183 
# … with 126 more rows
Code
burials_1740 %>%
  count(source) %>%
  mutate(p = round(n/sum(n)*100,1))
# A tibble: 2 × 3
  source              n     p
  <chr>           <int> <dbl>
1 daybook         46971  82.3
2 parish register 10097  17.7
Code
burials_1740 %>%
  count(source, month) %>%
  group_by(month) %>%
  mutate(p = round(n/sum(n)*100,1)) %>%
  ungroup() %>%
  filter(source=="parish register")
# A tibble: 12 × 4
   source          month     n     p
   <chr>           <chr> <int> <dbl>
 1 parish register 01      993  18.5
 2 parish register 02      835  17.5
 3 parish register 03      929  18.1
 4 parish register 04      912  19  
 5 parish register 05      969  20.4
 6 parish register 06      781  19  
 7 parish register 07      792  18.6
 8 parish register 08      808  17.4
 9 parish register 09      703  15.5
10 parish register 10      744  16.1
11 parish register 11      768  15.9
12 parish register 12      863  16.6

A potentially important factor is seasonality. It’s likely that there tend to be different causes of death at different times of year (and ages might vary seasonally too), so I hope that the daybook and parish register data have similar monthly patterns. Overall, parish registers make up 17.7% of the data, which varies by month between 15.5% (September) and 20.4% (May). In fact, lower % parish registers clearly cluster between the months of August and December.

Code
burials_1740 %>%
  count(month, source) %>%
    ggplot(aes(fill=source, values=n)) +
  geom_waffle(colour="white", n_rows = 10, make_proportional = TRUE) +
  # can't remember how to fix ordering in waffle...
  coord_equal() +
  facet_wrap(~month) +
  theme_minimal() +
  theme_enhance_waffle() +
  scale_fill_tableau() +
  labs(fill="source",  caption="1 square=1%")

It might, however, be more appropriate to make the comparison only for decades in which parish registers have been used, rather than overall?

Code
burials_par %>%
  count(month, source) %>%
    ggplot(aes(fill=source, values=n)) +
  geom_waffle(colour="white", n_rows = 10, make_proportional = TRUE) +
  # can't remember how to fix ordering in waffle...
  coord_equal() +
  facet_wrap(~month) +
  theme_minimal() +
  theme_enhance_waffle() +
  scale_fill_tableau() +
  labs(fill="source",  caption="1 square=1%")

hmm, they’ve actually been filled in big chunks haven’t they, not interspersed.

Code
burials_1740 %>%
  add_count(source, name="src_n") %>%
  count(source, year_month, month, src_n) %>%
  ggplot(aes(year_month, n, fill=source)) +
  geom_col(position = "fill")

then again some months have burials from both sources. those could be dups… there are quite a few dups in here ffs. they haven’t deduped because of age in a lot of cases. (a handful because of spelling variations in name)

Code
# burials_1740 %>%
#   count(source, year_month) %>%
#   arrange(year_month, source) %>%
#   count(year_month, name="nn") %>% filter(nn>1) %>%
#   left_join(burials_1740, by="year_month") %>%
#   arrange(person_id) %>%  View()

Once we have a broad overview of missingness in the data, the next step is to explore how missingness exists at finer resolution within variables and cases. You might also refer to variables and cases as “columns” and “rows”, but for consistency, we will use “variables” and “cases”.

Gender

Code
sb_burials_na %>%
  count(decade, gender) %>%
  mutate(gender = replace_na(gender, "NA")) %>%
  ggplot(aes(decade, n, fill=gender)) +
  geom_col(position = "fill") +
  scale_fill_tableau() +
  labs(y="number of burials")

cause of death - 27.5% [25% all] have no cause of death

Code
burials_1740 %>%
  count(is.na(cause_std), decade) |>
  rename(grp = `is.na(cause_std)`) |>
  ggplot(aes(x=decade, y=n, fill=grp)) +
  geom_col(position = "fill")

Code
burials_1740 %>%
  count(cause_std, sort = T) %>%
  mutate(p = n/sum(n)*100)
# A tibble: 170 × 3
   cause_std        n     p
   <chr>        <int> <dbl>
 1 <NA>         15658 27.4 
 2 convulsions  10954 19.2 
 3 consumption   5263  9.22
 4 old age       4918  8.62
 5 fever         3303  5.79
 6 smallpox      2964  5.19
 7 decline       2937  5.15
 8 dropsy        1732  3.03
 9 stillborn     1469  2.57
10 inflammation  1204  2.11
# … with 160 more rows

age at burial: 30% [30% all] have no age info, 3x size of the next largest group

plus obvious clumping at 0/5…

Code
# burials_1740 %>%
#   count(age_burial, sort = T) %>%
#   mutate(p = n/sum(n)*100)
Code
## NA cause
# all data and 1740 are almost identical; does na cause skew towards early records?
# slightly lower % f 46.5 (overall 47.8) - m 49.4 (48) but it's not far out.
# burials_1740 %>%
#   filter(is.na(cause_std)) %>%
#   count(gender) %>%
#   mutate(p = n/sum(n)*100)
Code
## NA age
# a bit more difference cf gender overall. f 45.4 and m 50.65
# burials_1740 %>%
#   filter(is.na(age_burial)) %>%
#   count(gender) %>%
#   mutate(p = n/sum(n)*100)
Code
#by decade
# burials_1740 %>%
#   #filter(is.na(age_burial)) %>%
#   count(decade) %>%
#   mutate(p =n/sum(n)*100)
Code
burials_1740 %>%
  mutate(has_age = !is.na(age_burial)) %>%
  count(decade, has_age) %>%
  ggplot(aes(decade, n, fill=has_age)) +
  geom_col(position = "fill") +
  scale_fill_tableau() +
  scale_y_continuous(labels = percent_format()) +
  labs(y="% of burials")

Code
burials_1740 %>%
  mutate(has_cause = !is.na(cause)) %>%
  count(decade, has_cause) %>%
  ggplot(aes(decade, n, fill=has_cause)) +
  geom_col(position = "fill") +
  scale_fill_tableau() +
  scale_y_continuous(labels = percent_format()) +
  labs(y="% of burials")

pretend it doesn’t matter…

Code
burials_1740_gender %>%
  count(gender) %>%
  mutate(p = n/sum(n)*100)
# A tibble: 2 × 3
  gender     n     p
  <chr>  <int> <dbl>
1 female 27641  50.5
2 male   27109  49.5

Age and gender

  • huge difference between 18th c and today… but is there any significant gender diff?
Code
# diverging thingy is quite nice but hard to see smaller categories
# 19 = 100 yrs +
burials_1740_gender %>%
  filter(!is.na(age_burial)) %>%
    mutate(age_group2 = case_when(
    age_group_n==-1 ~"n/a",
    age_burial<5 ~ "<5",
    age_group_n==0 ~ "05-09",
    age_group_n==10 ~ "10-19",
    age_group_n== 20 ~ "20-29",
    age_group_n==30 ~ "30-39",
    age_group_n==40 ~ "40-49",
    age_group_n==50 ~ "50-59",
    age_group_n==60 ~ "60-69",
    age_group_n==70 ~ "70-79",
    age_group_n==80 ~ "80-89",
    age_group_n==90 ~ "90+",
    age_group_n==100 ~ "90+",
  )) %>% 
  add_count(gender, name="n_gen") %>%
  add_count(age_group2, name="n_age") %>%
  count(gender, n_gen, n_age, age_group2, sort = T)  %>%
  mutate(div_g = ifelse(gender=="female", n*-1, n)) %>%
  mutate(age_group = fct_rev(age_group2)) %>%
  ggplot(aes(x=age_group, fill=gender, y=div_g)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept=0) +
  scale_y_continuous(labels=abs, breaks = seq(-8000, 8000, 1000)) +
  scale_fill_tableau() +
  theme(legend.position = "bottom") +
  labs(x=NULL, y="number of burials", fill=NULL, title="Gender and age at burial")

Code
# https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales
ons_deaths_2019 %>%
  filter(gender!="persons") %>%
  pivot_longer(x1:x52) %>%
  group_by(gender, age_group) %>%
  summarise(total = sum(value), .groups = "drop_last") %>%
  ungroup() %>% 
  add_count(gender, name="n_gen") %>%
  add_count(age_group, name="n_age") %>%
 # count(gender, n_gen, n_age, age_group, sort = T)  %>%
  mutate(div_g = ifelse(gender=="female", total*-1, total)) %>%
  mutate(age_group = fct_rev(age_group)) %>%
  ggplot(aes(x=age_group, fill=gender, y=div_g)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept=0) +
  scale_y_continuous(labels=abs, breaks = seq(-130000, 90000, 20000)) +
  scale_fill_tableau() +
  theme(legend.position = "bottom") +
  labs(x=NULL, y="number of deaths", fill=NULL, title="Gender and age at death, England and Wales 2019", caption="source: ONS")

Code
burials_top_causes <-
burials_1740_gender %>%
  select(burial_id, decade, gender, age_burial, cause_std, age_group_n, age_group) %>%
  filter(!is.na(cause_std), !is.na(age_burial)) %>%
  add_count(cause_std, sort = T, name="n_cause") %>%
  filter(n_cause>=140)  # 20 why can't i get the same using top_n?
  #count(cause_std, n_cause, age_group)
Code
### Heatmap
# sb_records_with_excl_total %>%
#   filter(excl_total==FALSE, !is.na(year), between(year, 1765, 1835), 
#          comm_short %in% c("Clothing", "Food", "Relief", "Labour", "Fees", "Goods", "Medical")
#          ) %>%
#   count(year, comm_short) %>%
burials_top_causes %>%
  count(age_group, cause_std, gender) %>%
   mutate(cause_std = fct_rev(cause_std)) %>%
  ggplot(aes(x=age_group, y=cause_std, fill=n)) +
  geom_raster() +
  #theme(legend.position = "bottom") +
  labs(y=NULL, x=NULL, fill="burials", title="Heatmap of causes of death by age and gender") +
  scale_fill_viridis_c(option = "A", direction = -1) +
  facet_wrap(~gender) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))

Causes of death

Code
# burials_1740 %>%
#   filter(!is.na(gender)) %>%
#   #mutate(cause_std = replace_na(cause_std, "n/a")) %>%
#   add_count(gender, name="n_gen") %>%
#   add_count(cause_std, name="n_cause") %>%
#   filter(n_cause>100) %>%
#   count(gender, n_gen, n_cause, cause_std, sort = T) %>%
#   ggplot(aes(n, cause_std, fill=gender)) +
#   geom_col(position = "dodge")
Code
# i appear to have some male died in childbirth... 6, and 12 NA...
# 3 do have male names, but 3 have title "Mrs" - fixed. grrrrrr. but what other titles have you screwed up?
# now 10 NA. (401 total childbirth)
# burials_1740 %>%
#   filter(str_detect(cause_std, "childbirth")) %>%
#   filter(is.na(gender))
#   filter(gender=="male") 
#   #filter(title=="Mrs") %>% count(gender)
Code
# some iffy ones here but I don't think there are any other systematic issues.
# burials_1740 %>%
#   count(gender, title)
Code
# diverging thingy is quite nice but hard to see smaller categories
  
burials_1740_gender %>%
  filter(!is.na(cause_std)) %>%
  #mutate(cause_std = replace_na(cause_std, "n/a")) %>%
  add_count(gender, name="n_gen") %>%
  add_count(cause_std, name="n_cause") %>%
  filter(n_cause>100) %>%
  count(gender, n_gen, n_cause, cause_std, sort = T)  %>%
  mutate(div_g = ifelse(gender=="female", n*-1, n)) %>%
  mutate(cause_std = fct_rev(cause_std)) %>%
  ggplot(aes(x=cause_std, fill=gender, y=div_g)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept=0) +
  scale_y_continuous(labels=abs) +
  scale_fill_tableau() +
  theme(legend.position = "bottom") +
  labs(x=NULL, y="number of burials", fill=NULL)

Code
# wonder if this is a good one for a waffle
burials_1740_gender %>%
  filter(!is.na(cause_std)) %>%
  #mutate(cause_std = replace_na(cause_std, "n/a")) %>%
  add_count(gender, name="n_gen") %>%
  add_count(cause_std, name="n_cause") %>%
  filter(n_cause>=50) %>%
  count(gender, n_gen, n_cause, cause_std, sort = T) %>%
  mutate(cause_std = fct_rev(cause_std)) %>%
  ggplot(aes(n, cause_std, fill=gender)) +
  geom_col(position = "fill") +
  scale_fill_tableau() +
  labs(y=NULL, x="% of burials", title = "Gender of common causes of death", caption = "minimum 50 burials")

Most gendered causes of death

Code
burials_1740_gender %>%
  filter(!is.na(cause_std)) %>%
  #mutate(cause_std = replace_na(cause_std, "n/a")) %>%
  #add_count(gender, name="n_gen") %>%
  add_count(cause_std, name="n_cause") %>%
  filter(n_cause>4) %>%
  count(gender, n_cause, cause_std, sort = T) %>%
  pivot_wider(names_from = gender, values_from = n, values_fill = 0) %>%
  mutate(female_p= round(female/n_cause*100, 2)) %>%
  filter(female_p>65 | female_p<35) %>%
  mutate(fem_grp = if_else(female_p>65, ">65%", "<35%")) %>%
  mutate(cause_std = fct_reorder(cause_std, female_p)) %>%
  ggplot(aes(y=female_p, x=cause_std, group=fem_grp)) +
  geom_point() +
  #i've done lollipops with a line before. looks like it was geom_segment and kinda complicated... so ok.
  #ggalt::geom_lollipop(colour="lightgrey", point.colour = "black", horizontal=TRUE) + # this seems to force axis to start at 0 which is not helpful...
  coord_flip() +
  facet_wrap(~fem_grp, scales = "free") +
  labs(y="female % of burials", x=NULL, title = "Most gendered causes of death", caption = "minimum 5 burials")

Code
## stuff that isn't working...
Code
## Burial costs
# no big difference here, which seems remarkable... well you can see some.
# fees_1740 %>%
#   group_by(burial_id) %>%
#   summarise(fees_d = sum(fee_d)) %>%
#   ungroup() %>%
#   inner_join(burials_1740_gender, by="burial_id") %>%
#   filter(fees_d>0) %>% # there are 5 with 0 fees (=specified no fee) - causes a warning if using log scale
#   ggplot(aes(x=gender, y=fees_d, fill=gender)) +
#   lvplot::geom_lv(show.legend = F) +
#   #geom_boxplot() +
#   #ggbeeswarm::geom_beeswarm() +
#   #geom_violin() +
#   scale_y_log10() +
#   scale_fill_tableau() +
#   labs(y="total fees (d)", title = "Comparison of burial fees")
Code
# don't forget dplyr::summarise is masked by vcdExtra
#library(vcd)
#library(vcdExtra)
Code
# 2-way tables speech + gender
#vcd_gender_sig_tab <- with(all_qs_sigs_gender %>% filter(!is.na(gender), sig_type!="in"), table(gender, sig_type))

# age_cause_tab <- 
#   with(burials_top_causes, table(age_group, cause_std))
Code
# mosaic(age_cause_tab, 
#        legend=F, shade=T,
#             #  main="QS petitions: gender ~ sub type" #,
#        rot_labels = c(left = 0, top=90), 
#        offset_varnames = c(left = 1, top=2.5), # space categories-varname, i think
#        offset_labels = c(left = 1, top=1.5),
#       margins = c(4.5, 0, 2, 1),
#   #margin = unit(4, "lines"),
#        #spacing = spacing_increase(start = unit(0.2, "lines"), rate = 2) , 
#        labeling_args = list(
#          abbreviate = c(cause_std = 11),
#          tl_labels = TRUE, tl_varnames = FALSE,
#          gp_labels = gpar(fontsize=9)
#         )
#        )
Code
# struc_age_gender_cause <-
#   structable(cause_std+age_group~gender, burials_top_causes)
# 
# # mosaic is too much, just not working...
# mosaic( 
#   ~gender +cause_std + age_group,
#   struc_age_gender_cause, 
#   shade = T, legend=F,
#             #  main="QS petitions: gender ~ sub type" #,
#       rot_labels = c(left = 0, top=90, right=0), 
#       #  offset_varnames = c(left = 1, top=2.5), # space categories-varname, i think
#       #  offset_labels = c(left = 1, top=1.5),
#       # margins = c(4.5, 0, 2, 1),
#   #margin = unit(4, "lines"),
#        #spacing = spacing_increase(start = unit(0.2, "lines"), rate = 2) , 
#        labeling_args = list(
#          abbreviate = c(cause_std = 7),
#          #tl_labels = TRUE, 
#          #tl_varnames = FALSE,
#          gp_labels = gpar(fontsize=9)
#         )
#   )
Code
#nope
# fees_1740 %>%
#   group_by(burial_id) %>%
#   summarise(fees = sum(fee_d)) %>%
#   filter(fees <= 2400) %>%
#   inner_join(burials_1740_gender, by="burial_id") %>%
#   ggplot(aes(fees, group=gender)) +
#   geom_histogram(binwidth = 10) +
#   facet_wrap(~gender, ncol=1)
Code
# burials_1740 %>%
#   filter(!is.na(gender)) %>%
#   mutate(cause_std = replace_na(cause_std, "n/a")) %>%
#   add_count(gender, name="n_gen") %>%
#   add_count(cause_std, name="n_cause") %>%
#   filter(n_cause>100) %>%
#   count(gender, n_gen, n_cause, cause_std, sort = T) %>%
#   mutate(cause_std = reorder_within(cause_std, n, gender)) %>%
#   ggplot(aes(cause_std, n, fill=gender, group=gender)) +
#   geom_col(show.legend = F) +
#   coord_flip() +
#   scale_x_reordered() +
#   facet_wrap(~gender, ncol=2, scales="free_y") +
#   labs(x=NULL)
Code
# burials_1740 %>%
#   filter(!is.na(gender), !is.na(age_burial) ) %>%
#   #mutate(cause_std = replace_na(cause_std, "n/a")) %>%
#   add_count(gender, name="n_gen") %>%
#   add_count(age_burial, name="n_age") %>%
#  # filter(n_cause>100) %>%
#   #count(gender, n_gen, n_age, age_burial) %>% 
#   ggplot(aes(age_burial, colour=gender, group=gender)) +
#   geom_density()
#   # geom_histogram(binwidth = 2) +
#   # facet_wrap(~gender, ncol = 2, scales = "free_y")
Code
# ??don't really get this. but nb it can also be faceted.
# sb_burials_na %>%
#  select(status, occupation, gender, relship, age_burial, cause_std, address) %>%
#   gg_miss_case(order_cases = F)
Code
# burials_1740 %>%
#   select(status, occupation, gender, relship, age_burial, cause_std, address)  %>%
#   #group_by(gender) %>%
#   miss_var_summary() %>%
#   #ungroup() %>%
#   mutate(variable=fct_reorder(variable, pct_miss)) %>%
#   ggplot(aes(variable, pct_miss)) +
#   geom_col(width = 0.001, colour = "#484878", fill = "#484878") +
#   geom_point(colour = "#484878", fill = "#484878") +
#   coord_flip() +
#   #facet_wrap(~gender) +
#   scale_color_discrete(guide = "none")  +
#   #scale_x_reordered() +
#   labs(y="% missing", x=NULL)
#  gg_miss_var(show_pct = TRUE, facet=county)
Code
# don't think you have any facet options like ncol if you do it with gg_miss_var
# burials_1740 %>%
#   mutate(p = case_when(
#     between(year, 1740, 1764) ~ "1740-1764",
#     between(year, 1765, 1789) ~ "1765-1789",
#     between(year, 1790, 1814) ~ "1790-1814",
#     between(year, 1815, 1839) ~ "1815-1839"
#   )) %>%
#   select(p, title, status, occupation, relship, age_burial, cause_std, address) %>%
#   gg_miss_var(facet = p)
Code
# Naniar has a function to create **upset plots**, which show how often missing variables *co-occur* (or correlate). They are cute but not necessarily easy to read and on this occasion I'm not sure it says anything very compelling, I think partly because missings tend to be either very high or low and not much in between. I've only selected some variables to make it easier to read.
# burials_1740 %>%
#   select(age_burial, cause_std, address, occupation) %>%
#   gg_miss_upset()