How to split a string column by length

Last updated on 2021-04-03 | 2 min read | Data Science

Intro

This post documents how I split a string column into fixed-length pieces and then joined the pieces into a directory-style path (a step I needed in order to check whether each directory existed in my analysis).
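For context, the directory check that motivates this post could look roughly like the sketch below once the paths exist as strings. This is only an illustration in base R: the root folder (here tempdir()) and the variable names are assumptions, not part of my actual analysis.

```r
# Hypothetical root folder; replace with wherever the directories should live.
root <- file.path(tempdir(), "split_string_demo")

# Directory-style strings of the kind produced later in this post.
dirs <- c("12/34/56", "98/76/54")

# Create only the first one so the check has something to find.
dir.create(file.path(root, dirs[1]), recursive = TRUE, showWarnings = FALSE)

# dir.exists() is vectorised: one TRUE/FALSE per path.
found <- dir.exists(file.path(root, dirs))
print(found)
```

Only the first path was created in the sketch, so the second one is reported as missing.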

library(tidyverse)
data <- tibble(string = c("123456", "987654"))
print(data)

## # A tibble: 2 x 1
##   string
##   <chr> 
## 1 123456
## 2 987654

Step 1

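strsplit splits each string into a character vector and returns a list of those vectors, so in a tibble the result shows up as a list column. The pattern (?<=.{2}) is a Perl-style lookbehind (hence perl = TRUE); combined with the way strsplit consumes the string, it splits after every two characters.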

split_data <- 
  data %>% 
  mutate(split_str = strsplit(string, "(?<=.{2})", perl = TRUE)) 

print(split_data)

## # A tibble: 2 x 2
##   string split_str
##   <chr>  <list>   
## 1 123456 <chr [3]>
## 2 987654 <chr [3]>
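If the lookbehind pattern feels opaque, the same fixed-width split can be sketched with base R's substring. The helper name split_by_width and its width argument are made up for this illustration; this is an alternative, not what the pipeline above uses.

```r
# Hypothetical helper: cut one string into consecutive chunks of `width` characters.
split_by_width <- function(x, width = 2) {
  # Start positions 1, 1 + width, 1 + 2 * width, ... up to the string length.
  starts <- seq(1, max(nchar(x), 1), by = width)
  # substring() is vectorised over the start and end positions.
  substring(x, starts, starts + width - 1)
}

even <- split_by_width("123456")   # "12" "34" "56"
odd  <- split_by_width("1234567")  # "12" "34" "56" "7"

# Applied to a whole vector, lapply() gives the same list shape as strsplit().
pieces <- lapply(c("123456", "987654"), split_by_width)
print(pieces)
```

The leftover character in an odd-length string simply ends up as a shorter last chunk.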

Step 2

First method: combine string + unnest

split_data %>% 
  mutate(split_str_dir = map(split_str, ~ str_c(., collapse = "/"))) %>% 
  unnest(split_str_dir)

## # A tibble: 2 x 3
##   string split_str split_str_dir
##   <chr>  <list>    <chr>        
## 1 123456 <chr [3]> 12/34/56     
## 2 987654 <chr [3]> 98/76/54
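A possible variation on the first method: since each collapsed result is a single string, purrr's map_chr can return a character column directly, which makes the unnest step unnecessary. The sketch below rebuilds split_data so that it runs on its own.

```r
library(tidyverse)

# Same setup as in Step 1.
split_data <- tibble(string = c("123456", "987654")) %>%
  mutate(split_str = strsplit(string, "(?<=.{2})", perl = TRUE))

# map_chr() expects exactly one string per element and returns a character vector,
# so split_str_dir is a plain <chr> column rather than a list column.
result <- split_data %>%
  mutate(split_str_dir = map_chr(split_str, ~ str_c(.x, collapse = "/")))

print(result$split_str_dir)
```

Whether this is clearer than map plus unnest is a matter of taste; it does fail loudly if an element does not collapse to exactly one string.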

Second method: unnest (wider) + unite

split_data %>% 
  unnest_wider(split_str, names_sep = "_") %>% 
  unite(split_str_dir, starts_with("split_str"), sep = "/")

## # A tibble: 2 x 2
##   string split_str_dir
##   <chr>  <chr>        
## 1 123456 12/34/56     
## 2 987654 98/76/54
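To see why starts_with("split_str") works in the unite call, it can help to look at the intermediate table. The sketch below stops after unnest_wider and inspects the column names; the variable name wide is only for illustration.

```r
library(tidyverse)

# Same setup as in Step 1.
split_data <- tibble(string = c("123456", "987654")) %>%
  mutate(split_str = strsplit(string, "(?<=.{2})", perl = TRUE))

# Each element of the list column becomes its own column. The character
# vectors are unnamed, so names_sep = "_" names the new columns by position.
wide <- split_data %>%
  unnest_wider(split_str, names_sep = "_")

print(names(wide))
# "string" "split_str_1" "split_str_2" "split_str_3"
```

All strings in this example have the same length. With strings of different lengths, I would expect the shorter rows to get NA in the trailing columns, and unite has an na.rm argument for that situation, but that case is not covered here.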

Outro

In my opinion, the second method has the more straightforward syntax, but it requires knowing that unnest_wider exists (how many problems in programming are due to unknown unknowns?).

The first method requires some familiarity with functional programming syntax, i.e. map and ~. It also requires understanding the difference between two of str_c’s parameters: sep and collapse.
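A small sketch of that difference, using a made-up vector parts: sep is placed between the separate arguments passed to str_c, while collapse joins the elements of the resulting vector into one string.

```r
library(stringr)

parts <- c("12", "34", "56")

# collapse: join the elements of one vector into a single string.
joined <- str_c(parts, collapse = "/")      # "12/34/56"

# sep: goes between arguments, element by element; the result keeps length 3.
paired <- str_c(parts, "x", sep = "/")      # "12/x" "34/x" "56/x"

# With a single argument there is nothing for sep to separate.
unchanged <- str_c(parts, sep = "/")        # "12" "34" "56"

print(joined)
print(paired)
print(unchanged)
```

This is why the first method uses collapse: each list element is a single vector that needs to become one path string.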

R programming tidyverse

Tim

Personalizing medicine