Grouping On Variables with Spaces in the Column Name Using dplyr

I recently had a variable with a thousand observations that could be one of four classes. I wanted to figure out how many of each class was listed within a variable. Easy enough with group_by() in normal circumstances. Unfortunately, the purveyors of this dataset decided to name their columns with two words. So something like tasty fruit rather than tasty-fruit or tasty_fruit. Unfortunately, just passing the multi-word name through group_by() as you normally would a standard column name won’t work for reasons I detail below.

TLDR: There are two solutions. Click here for solution 1 and here for solution 2.

Now, there are a lot of different conventions for naming columns in a dataset: camelCase, TitleCase, snake_case, mEmEcAsE1—but one thing that is pretty universally avoided among data analytics professionals (aside from using the same name twice), it’s using multiple words for one column name. The reasons are part historical, part technological—but many languages (R included) don’t play nice with column names containing spaces. But, it’s still an issue that can crop-up from time-to-time, so I thought I would share my solution to it.

(Note: I’m generally understanding if the source isn’t an analyst or is someone who, like, only works with smaller excel files and (some) SQL variants. But I’m convinced that there’s only one way to deal with the “pros” who do this: Ritual sacrifice to the coding gods.)

The only logical path.

Here’s how you group by variables with spaces in them using dplyr.

I’m going to be using the penguins dataset to demonstrate.

library(palmerpenguins) #For Data
library(magrittr) #For pipes
library(dplyr) #For data manipulation

head(penguins)
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge~           39.1          18.7              181        3750 male 
## 2 Adelie  Torge~           39.5          17.4              186        3800 fema~
## 3 Adelie  Torge~           40.3          18                195        3250 fema~
## 4 Adelie  Torge~           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge~           36.7          19.3              193        3450 fema~
## 6 Adelie  Torge~           39.3          20.6              190        3650 male 
## # ... with 1 more variable: year <int>

Let’s say I want to know how many different kinds of penguins there are. Normally that can be accomplished like this:

penguin_count <- penguins %>%
  group_by(species) %>%
  summarise(n_birds = n())
## `summarise()` ungrouping output (override with `.groups` argument)
penguin_count
## # A tibble: 3 x 2
##   species   n_birds
##   <fct>       <int>
## 1 Adelie        152
## 2 Chinstrap      68
## 3 Gentoo        124

We have 152 Adelie penguins, 68 Chinstrap penguins, and 124 Gentoo penguins.

But what if the column wasn’t titled “species” but something like “ground birbs.” (I’m a millenial; don’t judge me.)

Unfortunately for my demonstration (although fortunately for its many users!), it doesn’t have any columns with spaces. It instead uses snake_case. So I have to do a bit of renaming.

pengs <- penguins %>%
  rename("ground birbs" = species)

head(pengs)
## # A tibble: 6 x 8
##   `ground birbs` island bill_length_mm bill_depth_mm flipper_length_~
##   <fct>          <fct>           <dbl>         <dbl>            <int>
## 1 Adelie         Torge~           39.1          18.7              181
## 2 Adelie         Torge~           39.5          17.4              186
## 3 Adelie         Torge~           40.3          18                195
## 4 Adelie         Torge~           NA            NA                 NA
## 5 Adelie         Torge~           36.7          19.3              193
## 6 Adelie         Torge~           39.3          20.6              190
## # ... with 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Demonstrating the Problem

To demonstrate the issue, let’s see what happens if you try to group_by() the renamed “species” column in the penguins dataset, ground birbs:

pengs_count <- pengs %>%
  group_by("ground birbs") %>%
  summarise(n_birds = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(pengs_count)
## # A tibble: 1 x 2
##   `"ground birbs"` n_birds
##   <chr>              <int>
## 1 ground birbs         344

Ok, cool, we know that there are 344 ground birbs? Not very helpful.

Maybe we could use the column index?

pengs_count <- pengs %>%
  group_by(1) %>%
  summarise(n_birds = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(pengs_count)
## # A tibble: 1 x 2
##     `1` n_birds
##   <dbl>   <int>
## 1     1     344

Yes, they are number 1 in cuteness—but that’s somehow even less informative. Could we avoid the quotes?

pengs_count <- pengs %>%
  group_by(ground birbs) %>%
  summarise(n_birds = n())

head(pengs_count)
## Error: <text>:2:19: unexpected symbol
## 1: pengs_count <- pengs %>%
## 2:   group_by(ground birbs
##                      ^

That’s even worse! We just get an error!

There are two ways that we can access this misbegotten column.

Solution 1: Use ` (backquotes)

If you encase the the offending name in backquotes (`) you’ll be able to access it normally. So instead of "ground birbs" it’d be `ground birb`

pengs_count <- pengs %>%
  group_by(`ground birbs`) %>%
  summarise(n_birds = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(pengs_count)
## # A tibble: 3 x 2
##   `ground birbs` n_birds
##   <fct>            <int>
## 1 Adelie             152
## 2 Chinstrap           68
## 3 Gentoo             124

Solution 2: Use the across() function

New in dplyr 1.0.0 is the across() function, which allows you to apply functions across (get it) rows in your data. But, importantly in our case, across() lets you access variables by their column index.

pengs_count <- pengs %>%
  group_by(across(1)) %>%
  summarise(n_birds = n())
## `summarise()` ungrouping output (override with `.groups` argument)
head(pengs_count)
## # A tibble: 3 x 2
##   `ground birbs` n_birds
##   <fct>            <int>
## 1 Adelie             152
## 2 Chinstrap           68
## 3 Gentoo             124

So if you want to group_by() a column with a space in its name, use backquotes or across()! I hope this helps!!


  1. I’m only kidding with this one. But even this is preferable to spaces in the column name!↩︎