Easy_case_conversion

R
TIL
strings
Author

Sharon Howard

Published

1 Apr 2023

Sometimes you want to convert variable names or text strings to, say, snake_case or CamelCase. This is a common issue when you’re importing spreadsheets or mixed data sources into R.

Variable names

I have a dataset about apprenticeship disputes in early modern London which I’d like to do something with. It’s an Excel spreadsheet from the UK Data Archive.

The column names have been written in ordinary language with spaces between words (“sentence case”), apostrophes and possibly other punctuation. A few columns don’t seem to have names at all.

The first 20 columns:

library(tidyverse)
library(readxl)

apprentices_xlsx <- read_excel(here::here("site_data/apprentices_disputes/apprenticeship_disputes.xlsx"), sheet="Main index")

apprentices_xlsx |>
  select(1:20) |>
  names()
 [1] "ID"                                       
 [2] "number of petition"                       
 [3] "catalogue  order number"                  
 [4] "forename of apprentice"                   
 [5] "surname of apprentice"                    
 [6] "forename of father"                       
 [7] "father's occupation or status"            
 [8] "whether the father is a citizen of london"
 [9] "parish"                                   
[10] "county"                                   
[11] "whether the father is dead"               
[12] "forename of master"                       
[13] "surname of master"                        
[14] "whether the master is a citizen of london"
[15] "company or trade of master"               
[16] "whether the master is dead"               
[17] "other notes about the master"             
[18] "the defendant if not the master"          
[19] "date of the apprentice's binding"         
[20] "...20"                                    

These can be inconvenient to work with in R because you have to remember to use `backslashes` whenever you name them, and there are a lot of columns in this spreadsheet. (Also, the spacing is inconsistent, which is hard to spot.) And they just bug me.

apprentices_xlsx |>
  select(`number of petition`, `catalogue  order number`) |>
  head()
# A tibble: 6 × 2
  `number of petition` `catalogue  order number`
                 <dbl> <chr>                    
1                  210 222                      
2                  195 14                       
3                  305 45                       
4                  243 78                       
5                   29 71                       
6                   59 9                        

Solution: the {janitor} package clean_names() function.

The default for clean_names() is snake_case.

library(janitor)

apprentices_snake <-
apprentices_xlsx |>
  clean_names() 

apprentices_snake |>
  select(1:20) |>
  names()
 [1] "id"                                       
 [2] "number_of_petition"                       
 [3] "catalogue_order_number"                   
 [4] "forename_of_apprentice"                   
 [5] "surname_of_apprentice"                    
 [6] "forename_of_father"                       
 [7] "fathers_occupation_or_status"             
 [8] "whether_the_father_is_a_citizen_of_london"
 [9] "parish"                                   
[10] "county"                                   
[11] "whether_the_father_is_dead"               
[12] "forename_of_master"                       
[13] "surname_of_master"                        
[14] "whether_the_master_is_a_citizen_of_london"
[15] "company_or_trade_of_master"               
[16] "whether_the_master_is_dead"               
[17] "other_notes_about_the_master"             
[18] "the_defendant_if_not_the_master"          
[19] "date_of_the_apprentices_binding"          
[20] "x20"                                      

If you want camelCase (or one of various other options), you need to specify it.

apprentices_xlsx |>
  clean_names("lower_camel") |>
  select(1:20) |>
  names()
 [1] "id"                                 "numberOfPetition"                  
 [3] "catalogueOrderNumber"               "forenameOfApprentice"              
 [5] "surnameOfApprentice"                "forenameOfFather"                  
 [7] "fathersOccupationOrStatus"          "whetherTheFatherIsACitizenOfLondon"
 [9] "parish"                             "county"                            
[11] "whetherTheFatherIsDead"             "forenameOfMaster"                  
[13] "surnameOfMaster"                    "whetherTheMasterIsACitizenOfLondon"
[15] "companyOrTradeOfMaster"             "whetherTheMasterIsDead"            
[17] "otherNotesAboutTheMaster"           "theDefendantIfNotTheMaster"        
[19] "dateOfTheApprenticesBinding"        "x20"                               

Other text strings

The clean_names() function is only intended for variable names, but sometimes you want to convert the contents of a variable instead. This might be so that you can use it to construct an ID, or if you’re joining data from different sources where they might be in slightly different formats, you want a quick way to ensure both sides of the join match.

Solution: the {snakecase} package. Again, the default to_any_case() function converts to snake_case, but other variations are available.

library(snakecase)

apprentices_snake |>
  head(10) |>
  select(trade_in_which_the_apprentice_was_trained) |>
  mutate(trade_snake = to_any_case(trade_in_which_the_apprentice_was_trained), 
         trade_camel = to_upper_camel_case(trade_in_which_the_apprentice_was_trained))
# A tibble: 10 × 3
   trade_in_which_the_apprentice_was_trained trade_snake      trade_camel   
   <chr>                                     <chr>            <chr>         
 1 Hosier                                    hosier           Hosier        
 2 Mercer                                    mercer           Mercer        
 3 Ship master                               ship_master      ShipMaster    
 4 Oilman                                    oilman           Oilman        
 5 Merchant taylor                           merchant_taylor  MerchantTaylor
 6 Gold wire drawer                          gold_wire_drawer GoldWireDrawer
 7 Mariner                                   mariner          Mariner       
 8 Grocer                                    grocer           Grocer        
 9 Carpenter                                 carpenter        Carpenter     
10 Tailor                                    tailor           Tailor        

In one project I needed to convert text strings to what’s sometimes called kebab-case (like snake_case but hyphens instead of underscores). {snakecase} doesn’t have this as a ready-made function, but it does have a sep_out argument so it’s easy to change the separator.

apprentices_snake |>
  head(10) |>
  select(trade_in_which_the_apprentice_was_trained) |>
  mutate(trade_kebab = to_snake_case(trade_in_which_the_apprentice_was_trained, sep_out="-"))
# A tibble: 10 × 2
   trade_in_which_the_apprentice_was_trained trade_kebab     
   <chr>                                     <chr>           
 1 Hosier                                    hosier          
 2 Mercer                                    mercer          
 3 Ship master                               ship-master     
 4 Oilman                                    oilman          
 5 Merchant taylor                           merchant-taylor 
 6 Gold wire drawer                          gold-wire-drawer
 7 Mariner                                   mariner         
 8 Grocer                                    grocer          
 9 Carpenter                                 carpenter       
10 Tailor                                    tailor