This is not intended to be a comprehensive review of Bioconductor packages - there are too many of them. These are my personal notes.
First of all, I must declare a love-hate relationship with many Bioconductor packages. On one hand, they are very useful for specific purposes. On the other hand, these packages often have less consistent underlying logic than the tidyverse ecosystem. Even the authors admit that they sometimes forget which functions are in their own packages (I should link to a Bioconductor support page here, but I am not in the mood to do so). However, I must acknowledge that the data types Bioconductor packages are designed to work with are often not as straightforward and generalizable as tables in the tidyverse: a lot of the data have biological context and are more complicated than basic R types. It is thus more difficult to have a generalizable approach to all data analysis steps.
Naming conventions
Many functions and classes in Bioconductor packages share the same spelling as their base R counterparts, differing only in capitalization (for example, DataFrame versus data.frame). It can thus be painful to find the relevant documentation via a search engine.
Solution: add Bioconductor as one of the search terms.
Conflict with tidyverse
There are many functions that clash with the tidyverse. Notable examples include rename and slice. The issue with rename is the trickier one: while slice usually returns an error when used in the tidyverse way to slice rows, rename differs from dplyr::rename in the order of the names. In dplyr, the new name comes first: rename(new_name = old_name). In Bioconductor it is the other way around: the existing name comes first.
The dplyr logic is probably: Charles is now the King.
The Bioconductor logic is: the King is now Charles.
Both are perfectly logical, thus perfectly confusing.
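A minimal sketch of the two argument orders, assuming the clashing Bioconductor function is S4Vectors::rename (the toy column names old_name and new_name are made up for illustration). Calling each function with an explicit namespace avoids depending on which package was attached last.

```r
library(S4Vectors)
library(dplyr)

# Toy objects with a single column called old_name (illustrative only)
tbl <- data.frame(old_name = 1:3)
DF <- DataFrame(old_name = 1:3)

# dplyr: new name on the left, existing column on the right
tbl_renamed <- dplyr::rename(tbl, new_name = old_name)

# S4Vectors: existing name on the left, new name (as a string) on the right
DF_renamed <- S4Vectors::rename(DF, old_name = "new_name")

names(tbl_renamed)
names(DF_renamed)
```

Both calls should end up with a column called new_name; only the direction of the mapping differs.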
DataFrame
It allows columns to hold more data types than an ordinary data.frame does. When you convert it to a data.frame or a tibble, the column names can be quite unexpected and require a bit of fixing.
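As a rough illustration of the column-name surprise, here is a sketch that assumes the DataFrame in question is the S4Vectors class and uses a made-up column rng holding an IRanges object. I would expect the range column to be expanded into several prefixed columns on conversion, but it is worth checking names() on your own data.

```r
library(S4Vectors)
library(IRanges)

# Toy DataFrame; the column names id and rng are illustrative only
DF <- DataFrame(id = c("a", "b"))
DF$rng <- IRanges(start = c(1, 10), end = c(5, 20))  # one column holding ranges

# Convert to an ordinary data.frame
df <- as.data.frame(DF)

names(DF)  # column names before conversion
names(df)  # column names after conversion
```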
Subsetting
It has the same subsetting syntax as a base data.frame and does not work with tidyverse syntax. When using a pipe, you can subset with DataFrame %>% .[x, y].
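A small sketch of both styles, assuming the magrittr pipe and a toy DataFrame with made-up chr and pos columns:

```r
library(S4Vectors)
library(magrittr)

# Toy DataFrame (values are illustrative only)
DF <- DataFrame(chr = c("chr1", "chr1", "chr2"), pos = c(100L, 200L, 300L))

# Base-style subsetting: rows on chr1, keep only the pos column
chr1_base <- DF[DF$chr == "chr1", "pos", drop = FALSE]

# The same subset inside a pipe, using the magrittr dot placeholder
chr1_piped <- DF %>% .[.$chr == "chr1", "pos", drop = FALSE]
```

The drop = FALSE argument keeps the result a DataFrame even though only one column is selected.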
GenomicRanges
The function that I use most often is makeGRangesFromDataFrame, given that I am mostly in the tidyverse. In the tibble bioinformatics world, I use chr and pos to denote SNP positions. However, GenomicRanges forces the names to become seqnames, start and end because, well, it stores range data.
I have two related strategies to avoid constantly fixing the column names when working with a bunch of tibbles + GenomicRanges.
First, the variable name should denote the type of the object. I usually just add _GR if the object is a GenomicRanges object.
Second, if the variable is a tibble, it must have standard column names, namely chr and pos. I write custom wrappers for functions like makeGRangesFromDataFrame so that they always convert the same columns to GenomicRanges.
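The two strategies above could look something like the sketch below. The helper name snp_tbl_to_GR and the example columns are hypothetical, not the actual wrappers used for these notes; it assumes each SNP is a one-base range, so pos serves as both start and end.

```r
library(GenomicRanges)

# Hypothetical wrapper: always read the same chr/pos columns
snp_tbl_to_GR <- function(tbl, keep_extra = TRUE) {
  # Fail early if the table does not follow the chr/pos convention
  stopifnot(all(c("chr", "pos") %in% names(tbl)))
  makeGRangesFromDataFrame(
    as.data.frame(tbl),               # also accepts a tibble
    seqnames.field = "chr",
    start.field = "pos",              # a SNP is a one-base range,
    end.field = "pos",                # so start and end are the same column
    keep.extra.columns = keep_extra,  # carry other columns along as metadata
    ignore.strand = TRUE
  )
}

# Usage: the _GR suffix marks the object as a GenomicRanges object
snps <- data.frame(chr = c("chr1", "chr2"),
                   pos = c(100L, 200L),
                   rsid = c("rs1", "rs2"))
snps_GR <- snp_tbl_to_GR(snps)
```

Keeping the column names inside one wrapper means a change in convention only has to be made in one place.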
Others
This is not specific to Bioconductor, but I generally find packages with PDF documentation a bit difficult to manage. I make notes on them, but it is time-consuming to keep and retrieve a catalog of PDFs (cue the advertisement music). Zotero has helped a lot with this.