Notes on Bioconductor packages

Last updated on 2021-04-04

This is not intended to be a comprehensive review of Bioconductor packages - there are too many of them. These are my personal notes.

First of all, I must declare a love-hate relationship with many Bioconductor packages. On one hand, they are very useful for specific purposes. On the other hand, there is often less underlying logic to these packages than to the tidyverse ecosystem. Even the authors admit that they sometimes forget which functions exist in their own packages (I should link here to a Bioconductor support page, but I am not in the mood to do so). However, I must acknowledge that the data types Bioconductor packages are designed to work with are often not as straightforward and generalizable as tables in the tidyverse: a lot of the data have biological contexts and are more complicated than basic R types. It is thus more difficult to have a generalizable approach to all data analysis steps.

Naming conventions

Many Bioconductor functions and classes are spelled the same as their base R counterparts, differing only in capitalization (DataFrame versus data.frame, for example). It can thus be painful to find the relevant documentation via a search engine.

Solution: add Bioconductor as one of the search terms.

Conflict with tidyverse

There are many functions that clash with the tidyverse. Notable examples include rename and slice. The issue with rename is the trickier of the two: slice usually returns an error when used the tidyverse way to slice rows, whereas rename differs from dplyr::rename in the order of the names. In dplyr, the new name comes first: rename("new_name" = "old_name"). In Bioconductor it is the other way around, with the old name first.

dplyr’s logic is probably: Charles is now the King.

Bioconductor’s logic is: the King is now Charles.

Both are perfectly logical, thus perfectly confusing.
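The difference in argument order can be sketched as follows. This is a minimal example that assumes dplyr, tibble and S4Vectors are installed; the toy data and the column names are made up for illustration. Calling each function with its namespace prefix sidesteps the masking problem entirely.

```r
# Both packages export rename(), so call each one with its namespace prefix.
# Toy data; the column names are placeholders.
tbl <- tibble::tibble(old_name = 1:3)
df  <- S4Vectors::DataFrame(old_name = 1:3)

# dplyr: new name on the left, old name on the right
tbl_renamed <- dplyr::rename(tbl, new_name = old_name)

# S4Vectors: old name on the left, new name (as a string) on the right
df_renamed <- S4Vectors::rename(df, old_name = "new_name")

names(tbl_renamed)
names(df_renamed)
```

If both packages are attached with library(), whichever was loaded last wins for a bare rename() call, so the prefix is the safer habit.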

DataFrame

DataFrame extends the ordinary data.frame so that its columns can hold more data types. When you convert one to a data.frame or a tibble, the column names can be quite unexpected and require a bit of fixing.
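As a small illustration of both points, the sketch below stores an IRanges object in a DataFrame column and then converts the result to a base data.frame. It assumes S4Vectors and IRanges are installed; the column names id and rng are invented for the example.

```r
library(S4Vectors)
library(IRanges)

# A DataFrame column can hold an S4 object such as IRanges
df <- DataFrame(id = c("a", "b"),
                rng = IRanges(start = c(1, 10), end = c(5, 20)))
ncol(df)      # the ranges occupy a single column

# Converting to a base data.frame expands that column into several
# plain columns, so the names no longer match the original ones
plain <- as.data.frame(df)
names(plain)
```

Printing names(plain) right after the conversion is a cheap way to see which columns need renaming before handing the table to tidyverse code.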

Subsetting

It has the same subsetting syntax as a base data.frame and does not work with tidyverse syntax. When using a pipe, you can subset with DataFrame %>% .[x, y].
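A minimal sketch of that pipe idiom, assuming S4Vectors and magrittr are installed. The columns chr, pos and score are toy data, not from any real dataset.

```r
library(S4Vectors)
library(magrittr)

df <- DataFrame(chr = c("1", "1", "2"),
                pos = c(100L, 200L, 300L),
                score = c(0.1, 0.5, 0.9))

# Base-style subsetting: rows by logical vector, columns by name
high <- df[df$score > 0.2, c("chr", "pos")]

# The same subset inside a pipe, using the magrittr dot placeholder
high_piped <- df %>% .[.$score > 0.2, c("chr", "pos")]
```

Because the dot appears as the first argument of the bracket call, magrittr does not insert the left-hand side a second time, and the result is still a DataFrame.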

GenomicRanges

The function that I use most often is makeGRangesFromDataFrame, given that I am mostly in the tidyverse. In the tibble bioinformatics world, I use chr and pos to denote SNP positions. However, GenomicRanges forces the names to become seqnames, start and end because, well, it stores range data.

I have two related strategies to avoid constantly fixing the column names when working with a bunch of tibbles + GenomicRanges.

First, the variable name should denote the type of the object. I usually just add _GR if the object is a GenomicRanges object.

Second, if the variable is a tibble, it must use the standard column names chr and pos. I write custom wrappers for functions like makeGRangesFromDataFrame so that they always convert the same columns to GenomicRanges.
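Such a wrapper might look like the sketch below. It assumes GenomicRanges is installed and uses its exported converter, makeGRangesFromDataFrame. The helper name snps_to_GR, the rsid column and the example positions are hypothetical, and each SNP is stored as a range of width 1 by pointing both the start and the end field at pos.

```r
library(GenomicRanges)

# Hypothetical helper: convert a table with the standard `chr` and `pos`
# columns into a GRanges object, keeping any other columns as metadata.
snps_to_GR <- function(tbl) {
  stopifnot(all(c("chr", "pos") %in% names(tbl)))
  makeGRangesFromDataFrame(
    as.data.frame(tbl),         # accepts a tibble or a data.frame
    seqnames.field = "chr",
    start.field = "pos",        # a SNP is a range of width 1,
    end.field = "pos",          # so start and end share one column
    keep.extra.columns = TRUE
  )
}

# Toy input; the _GR suffix marks the converted object
snps <- data.frame(chr = c("chr1", "chr2"),
                   pos = c(100L, 250L),
                   rsid = c("rs1", "rs2"))
snps_GR <- snps_to_GR(snps)
```

Keeping the column mapping in one place means the chr and pos convention only has to be translated once, rather than at every call site.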

Others

Not just for Bioconductor packages, but I generally find packages with PDF documentation a bit difficult to manage. I make notes on them, but it is time-consuming to keep and retrieve a catalog of PDFs (insert some ad music here). Zotero has helped a lot with this.

Tim