This post documents common pitfalls when working with bioinformatics data and how to prevent them.
Headers
Case: use janitor::clean_names to standardize column names to snake_case.
Names: use one standardized name per concept, e.g. chr for chromosome instead of chrom, seqnames, etc. Sometimes you have to change a name to fit a particular package (e.g. GenomicRanges). In that case, convert the name only within the call to that function and change it back immediately afterwards. Never propagate the renamed column to the next function, because every downstream function then depends on which name the previous one happened to use, and those dependencies become a headache to track.
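The header conventions above can be sketched in a few lines. This is a minimal, standard-library-only illustration in Python, not the janitor implementation. The helper names (to_snake_case, standardize_headers, call_with_renamed) and the alias table are hypothetical, and the snake_case rules are a rough approximation that will differ from janitor::clean_names on edge cases.

```python
import re

# Hypothetical alias table: every spelling we want to collapse to "chr".
CHR_ALIASES = {"chrom", "chromosome", "seqnames"}


def to_snake_case(name: str) -> str:
    """Rough snake_case cleaning (an approximation, not janitor's exact rules)."""
    # Insert an underscore at lower/digit -> upper boundaries: "geneName" -> "gene_Name"
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse any run of non-alphanumeric characters into one underscore
    s = re.sub(r"[^0-9A-Za-z]+", "_", s)
    return s.strip("_").lower()


def standardize_headers(headers):
    """Clean the case first, then map known chromosome aliases to 'chr'."""
    cleaned = [to_snake_case(h) for h in headers]
    return ["chr" if h in CHR_ALIASES else h for h in cleaned]


def call_with_renamed(record, rename, func):
    """Rename keys only for the duration of func, then map them back.

    `rename` maps our standard name to the name the external function expects.
    """
    reverse = {new: old for old, new in rename.items()}
    renamed = {rename.get(k, k): v for k, v in record.items()}
    result = func(renamed)
    # Restore the standard names so the change never leaks downstream
    return {reverse.get(k, k): v for k, v in result.items()}


def add_width(rec):
    """Stand-in for a function that insists on a 'seqnames' field."""
    assert "seqnames" in rec
    return {**rec, "width": rec["end"] - rec["start"] + 1}


headers = standardize_headers(["Seqnames", "Start.Pos", "geneName", "Sample ID"])
row = call_with_renamed(
    {"chr": "chr1", "start": 100, "end": 199},
    {"chr": "seqnames"},
    add_width,
)
```

The point of the wrapper is that the caller only ever sees chr: the alternative name exists solely inside the call and is mapped back before the result is returned.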
Recently I have been tidying up data for my research projects at NUS. Dealing with a few TBs of data in one day made me slightly paranoid about the integrity of the data: where it should be stored, which archiving and compression protocol should be used, which local/remote file transfer algorithms should be used, and even what kind of medium to use: should the data be transferred via USB or Ethernet?
I am writing this post not as a guideline, but mainly for self-reference and, hopefully, as a prompt for discussion.
The boom in bioinformatics in recent years has come with cheaper technologies and, consequently, a surge in the amount of data available. The rapid development of the field is itself an anti-establishment movement: even the most experienced bioinformaticians must spend a significant amount of time keeping up with new resources and toolkits.