Many functions in base R have faded from my daily use of R because of the tidyverse and its paradigm of doing as many operations as possible inside a data.frame.
Get the variable name
deparse(substitute(variable))
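This idiom is mostly useful inside a function, where substitute() captures the unevaluated argument and deparse() turns it into a string. A minimal sketch; the helper name get_var_name and the object my_data are made up for illustration:

```r
# Hypothetical helper: returns the name of whatever was passed as `variable`
get_var_name <- function(variable) {
  # substitute() captures the expression supplied by the caller,
  # deparse() converts that expression into a character string
  deparse(substitute(variable))
}

my_data <- c(1, 2, 3)
get_var_name(my_data)
## [1] "my_data"
```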
Indexing and subsetting
which() returns the integer positions of the TRUE elements of a logical vector; these positions can be used in [] for subsetting.
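A small sketch of how which() relates to logical subsetting. The vectors x and y are made-up examples. The one practical difference I am sure of is the handling of NA: which() drops it, while a logical index keeps it.

```r
x <- c(3, 8, 1, 10)
is_large <- x > 5          # logical vector: FALSE TRUE FALSE TRUE

which(is_large)            # integer positions of the TRUE values
## [1] 2 4

# Both forms select the same elements when there are no missing values
x[is_large]
## [1]  8 10
x[which(is_large)]
## [1]  8 10

# With NA in the data, the two forms differ
y <- c(3, NA, 8)
y[y > 5]                   # the NA comparison is carried into the result
## [1] NA  8
y[which(y > 5)]            # which() ignores NA
## [1] 8
```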
Tidyverse alternative (notes for myself)
Imagine that I have a list of data.frames (group_split() splits a data frame into a list of data frames by the values of the specified column):
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
list_df <- iris %>% group_split(Species)
list_df[[1]]
## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 40 more rows
To keep the data frames in which the sum of the large sepal lengths (Sepal.Length > 5) exceeds 200, I can either pull out the values greater than 5, calculate their sum for each data frame, compare the sums with 200 to obtain a logical vector, and subset the list by the positions of its TRUE values, in this way:
subset_vector <- list_df %>% map_dbl(
~ filter(., Sepal.Length > 5) %>%
pull(Sepal.Length) %>%
as.double() %>%
sum(na.rm = TRUE)
)
list_df[which(subset_vector>200)]
## <list_of<
## tbl_df<
## Sepal.Length: double
## Sepal.Width : double
## Petal.Length: double
## Petal.Width : double
## Species : factor<fb977>
## >
## >[2]>
## [[1]]
## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 7 3.2 4.7 1.4 versicolor
## 2 6.4 3.2 4.5 1.5 versicolor
## 3 6.9 3.1 4.9 1.5 versicolor
## 4 5.5 2.3 4 1.3 versicolor
## 5 6.5 2.8 4.6 1.5 versicolor
## 6 5.7 2.8 4.5 1.3 versicolor
## 7 6.3 3.3 4.7 1.6 versicolor
## 8 4.9 2.4 3.3 1 versicolor
## 9 6.6 2.9 4.6 1.3 versicolor
## 10 5.2 2.7 3.9 1.4 versicolor
## # ... with 40 more rows
##
## [[2]]
## # A tibble: 50 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6.3 3.3 6 2.5 virginica
## 2 5.8 2.7 5.1 1.9 virginica
## 3 7.1 3 5.9 2.1 virginica
## 4 6.3 2.9 5.6 1.8 virginica
## 5 6.5 3 5.8 2.2 virginica
## 6 7.6 3 6.6 2.1 virginica
## 7 4.9 2.5 4.5 1.7 virginica
## 8 7.3 2.9 6.3 1.8 virginica
## 9 6.7 2.5 5.8 1.8 virginica
## 10 7.2 3.6 6.1 2.5 virginica
## # ... with 40 more rows
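For comparison, the same selection can be written without the intermediate vector. This is only a sketch, assuming purrr's keep() fits the workflow; the helper name sum_large_sepals is hypothetical and not part of any package:

```r
library(dplyr)
library(purrr)

list_df <- iris %>% group_split(Species)

# Hypothetical helper: sum of the Sepal.Length values above 5 in one data frame
sum_large_sepals <- function(df) {
  sum(df$Sepal.Length[df$Sepal.Length > 5], na.rm = TRUE)
}

# keep() retains the list elements for which the predicate returns TRUE
kept <- keep(list_df, ~ sum_large_sepals(.x) > 200)

# Species of the data frames that survive the filter
kept_species <- map_chr(kept, ~ as.character(.x$Species[1]))
kept_species
```

Like the which() version, this returns the complete data frames, not just the rows with Sepal.Length > 5.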
Or, I can do everything within the data.frame, in this way:
list_df %>%
map(
~ filter(., Sepal.Length > 5) %>%
mutate(length_sum = sum(as.double(Sepal.Length), na.rm = TRUE)) %>%
filter(length_sum > 200)
)
## [[1]]
## # A tibble: 0 x 6
## # ... with 6 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## # Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>, length_sum <dbl>
##
## [[2]]
## # A tibble: 47 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species length_sum
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 7 3.2 4.7 1.4 versicolor 282.
## 2 6.4 3.2 4.5 1.5 versicolor 282.
## 3 6.9 3.1 4.9 1.5 versicolor 282.
## 4 5.5 2.3 4 1.3 versicolor 282.
## 5 6.5 2.8 4.6 1.5 versicolor 282.
## 6 5.7 2.8 4.5 1.3 versicolor 282.
## 7 6.3 3.3 4.7 1.6 versicolor 282.
## 8 6.6 2.9 4.6 1.3 versicolor 282.
## 9 5.2 2.7 3.9 1.4 versicolor 282.
## 10 5.9 3 4.2 1.5 versicolor 282.
## # ... with 37 more rows
##
## [[3]]
## # A tibble: 49 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species length_sum
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 6.3 3.3 6 2.5 virginica 324.
## 2 5.8 2.7 5.1 1.9 virginica 324.
## 3 7.1 3 5.9 2.1 virginica 324.
## 4 6.3 2.9 5.6 1.8 virginica 324.
## 5 6.5 3 5.8 2.2 virginica 324.
## 6 7.6 3 6.6 2.1 virginica 324.
## 7 7.3 2.9 6.3 1.8 virginica 324.
## 8 6.7 2.5 5.8 1.8 virginica 324.
## 9 7.2 3.6 6.1 2.5 virginica 324.
## 10 6.5 3.2 5.1 2 virginica 324.
## # ... with 39 more rows
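Note that this version keeps an empty tibble for setosa and only the rows with Sepal.Length > 5 for the other species, so the result is not identical to the which() approach. If the empty elements are unwanted, one option is to drop them afterwards. A minimal sketch, assuming purrr's discard() is acceptable here:

```r
library(dplyr)
library(purrr)

list_df <- iris %>% group_split(Species)

filtered <- list_df %>%
  map(
    ~ filter(.x, Sepal.Length > 5) %>%
      mutate(length_sum = sum(Sepal.Length, na.rm = TRUE)) %>%
      filter(length_sum > 200)
  )

# Drop the list elements that ended up with zero rows
non_empty <- discard(filtered, ~ nrow(.x) == 0)

# Number of data frames left, and the rows in each of them
length(non_empty)
map_int(non_empty, nrow)
```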
The advantage of doing it the data.frame way is that I can keep using the rich vocabulary that the tidyverse provides. For example, I can use group_by to compute the sum separately for each Sepal.Width category within each species:
list_df_advanced <- list_df %>%
map(
~ mutate(., width_category = if_else(Sepal.Width > 3, "wide", "narrow")
)
)
list_df_advanced %>%
map(
~ filter(., Sepal.Length > 5) %>%
group_by(width_category) %>%
mutate(length_sum = sum(as.double(Sepal.Length), na.rm = TRUE)) %>%
ungroup() %>%
filter(length_sum > 200)
)
## [[1]]
## # A tibble: 0 x 7
## # ... with 7 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## # Petal.Length <dbl>, Petal.Width <dbl>, Species <fct>, width_category <chr>,
## # length_sum <dbl>
##
## [[2]]
## # A tibble: 39 x 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species width_category
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.5 2.3 4 1.3 versicolor narrow
## 2 6.5 2.8 4.6 1.5 versicolor narrow
## 3 5.7 2.8 4.5 1.3 versicolor narrow
## 4 6.6 2.9 4.6 1.3 versicolor narrow
## 5 5.2 2.7 3.9 1.4 versicolor narrow
## 6 5.9 3 4.2 1.5 versicolor narrow
## 7 6 2.2 4 1 versicolor narrow
## 8 6.1 2.9 4.7 1.4 versicolor narrow
## 9 5.6 2.9 3.6 1.3 versicolor narrow
## 10 5.6 3 4.5 1.5 versicolor narrow
## # ... with 29 more rows, and 1 more variable: length_sum <dbl>
##
## [[3]]
## # A tibble: 32 x 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species width_category
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.8 2.7 5.1 1.9 virginica narrow
## 2 7.1 3 5.9 2.1 virginica narrow
## 3 6.3 2.9 5.6 1.8 virginica narrow
## 4 6.5 3 5.8 2.2 virginica narrow
## 5 7.6 3 6.6 2.1 virginica narrow
## 6 7.3 2.9 6.3 1.8 virginica narrow
## 7 6.7 2.5 5.8 1.8 virginica narrow
## 8 6.4 2.7 5.3 1.9 virginica narrow
## 9 6.8 3 5.5 2.1 virginica narrow
## 10 5.7 2.5 5 2 virginica narrow
## # ... with 22 more rows, and 1 more variable: length_sum <dbl>
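To see which groups pass the threshold, it can help to look at the per-group sums directly. The following is a sketch on the unsplit iris data, using summarise() where the code above uses mutate(); the object name group_sums is made up for illustration:

```r
library(dplyr)

group_sums <- iris %>%
  filter(Sepal.Length > 5) %>%
  mutate(width_category = if_else(Sepal.Width > 3, "wide", "narrow")) %>%
  group_by(Species, width_category) %>%
  # One row per species and width category, with the sum used in the filter
  summarise(
    length_sum = sum(Sepal.Length, na.rm = TRUE),
    n_rows = n(),
    .groups = "drop"
  )

# Groups whose sum exceeds the threshold of 200
group_sums %>% filter(length_sum > 200)
```

Only groups with length_sum above 200 survive the final filter() in the list version, which is why only "narrow" rows appear in the output above.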
I can’t think of a straightforward way to achieve this in base R without many loops…