I am writing this post not as a guideline, but mainly for self-reference and, hopefully, as a prompt for discussion.
The boom in bioinformatics in recent years has come with cheaper technologies and, consequently, a surge in the amount of data available. The rapid development of the field is itself an anti-establishment movement: even the most experienced bioinformaticians must spend a significant amount of time keeping up with the resources and toolkits. While preparing my current manuscript, I started to explore more bioinformatics databases (NCBI, UCSC, ClinVar, DECIPHER and dbSNP, just to name a few). I found the amount of data overwhelming at best, and I think it is important to discuss a few of the hallmarks of bioinformatics data in this post.
The first hallmark is the lack of comprehensive guidelines for mining the data. The speed of the field's development has arguably far outpaced the writing of any up-to-date textbook. While general computing, biological and statistical knowledge can still be acquired from tutorials, most skills have to be picked up through frequent use of online search engines.
The second hallmark is the lack of a single authoritative dataset. Arguably the most important data in bioinformatics, the human genome build, is itself a mess. I have to rely on Heng Li’s blog to decide which reference to use. The UCSC and NCBI builds are incompatible, with different names for the reference genomes. For SNPs, there are dbSNP and Kaviar and arguably many more, and they all have their own versions. It is not just a free market, it is the wild west. So, as Heng Li’s post said, welcome to Bioinformatics.
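To make the naming mismatch concrete, here is a minimal sketch in Python. It assumes the commonly used equivalences between UCSC build names and the NCBI/GRC ones (hg18/NCBI36, hg19/GRCh37, hg38/GRCh38), and the usual difference in chromosome labels (`chr1`, `chrM` in UCSC-style files versus `1`, `MT` in many GRC-style files). The function and variable names are my own, hypothetical ones, and only the primary chromosomes are handled.

```python
# Hypothetical helpers: translate between UCSC-style and GRC-style names.
# Only primary chromosomes (1-22, X, Y) and the mitochondrion are covered.

# UCSC build name -> NCBI/GRC build name
UCSC_TO_GRC_BUILD = {"hg18": "NCBI36", "hg19": "GRCh37", "hg38": "GRCh38"}

PRIMARY = [str(i) for i in range(1, 23)] + ["X", "Y"]


def to_ucsc_style(name: str) -> str:
    """Convert '1' / 'MT' style names to 'chr1' / 'chrM' style."""
    if name in ("MT", "chrM"):
        return "chrM"
    bare = name[3:] if name.startswith("chr") else name
    if bare in PRIMARY:
        return "chr" + bare
    raise ValueError(f"unrecognised chromosome name: {name!r}")


def to_grc_style(name: str) -> str:
    """Convert 'chr1' / 'chrM' style names to '1' / 'MT' style."""
    if name in ("MT", "chrM"):
        return "MT"
    bare = name[3:] if name.startswith("chr") else name
    if bare in PRIMARY:
        return bare
    raise ValueError(f"unrecognised chromosome name: {name!r}")


print(to_ucsc_style("X"), to_grc_style("chr7"), UCSC_TO_GRC_BUILD["hg19"])
```

Note that this only renames things. It does not check that the sequences behind two equivalent names are actually identical, which is a separate question worth verifying before mixing files from different sources.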
The specific issue that I have encountered is querying the SNP reference number for the FMR1 gene’s CGG expansion. If one takes a look at the genome browser at the bottom of the page for rs193922936, one will notice that there are dozens of SNPs all within the expansion region. But my question is this: how can one tell one from another? Instead of giving each repeat size its own SNP number, if that is even what is intended, would it not be better to give the whole microsatellite a single SNP reference? The irony is that there is no SNP reference for the whole expansion. Just when I thought NCBI was treating the condition as multiple types of indels instead of a microsatellite, I realised that the microsatellite actually has a ClinVar number. And perhaps because one is not enough to show the condition’s importance (I am kidding here), there are multiple ClinVar entries for the same repeat region, including this one, which has an rs number of its own, and the publications that this additional entry cites are the original papers that discovered the microsatellite. If that is not confusing enough, you need to know that for other genes, the expansion can have a single rs number. So the confusing practice here is apparently not even consistent (and I pass no judgement on whether being more consistent would be good).
I have reported this issue, and there are definitely many more to come in my future research. Frustrating as it may be, I believe that by continuing to contribute to our existing datasets as end-users, and with even more data that bolsters our understanding of genetic features, we can achieve a tidier bioinformatics dataset, just like the periodic table. There are perhaps also better ways to gather data than using a web browser, so learning more data-mining skills like SQL may be helpful.
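As one example of getting data without a web browser, here is a minimal sketch that builds a query URL for NCBI's E-utilities `esearch` endpoint. This is a web API rather than SQL, but the idea of querying programmatically is the same. The choice of the `clinvar` database and the `FMR1[gene]` search term is my own illustration, and the helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL asking for up to `retmax` record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Illustrative query: ClinVar records associated with the FMR1 gene.
url = build_esearch_url("clinvar", "FMR1[gene]")
print(url)
```

The resulting URL can be fetched with `urllib.request.urlopen(url)` and the response parsed with the `json` module. I have left the network call out of the sketch so that it runs offline, and NCBI's usage guidelines on request rates should be checked before scripting many queries.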