Bioinformatics data integrity

Recently I have been tidying up data for my research projects at NUS. Dealing with a few TBs of data in one day made me slightly paranoid about its integrity: where should the data be stored, which archiving and compression format should be used, which local/remote file transfer tools should be used, and even what kind of medium, i.e. should the files be transferred over USB or Ethernet? I believe maintaining good practices for data integrity is one of the few things that may be tedious at the start but beneficial in the long run (just like detailed documentation). Since bioinformatics is a data science, ensuring the integrity of the data is the first step towards reproducibility.

There are many ways in which data can be corrupted, so it is best to check data integrity (e.g. with a checksum) periodically, as well as before and after every transfer. But I have been wondering whether it is ever possible to ensure integrity before and after compression, since the compressed file cannot be compared directly with the original once the data has been transformed (unlike a file that has simply been relocated).
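
To make the checksum idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration rather than a tested workflow: the function names (sha256sum, sha256sum_gzip, compress_and_verify) are hypothetical, and the choice of SHA-256 and gzip is an assumption. For the compression question, one possible approach is a round-trip check: hash the original, compress it, then hash the decompressed stream and compare the two digests.

```python
import gzip
import hashlib
import shutil


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256sum_gzip(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_and_verify(source, target):
    """Gzip source into target and check that decompressing gives the same bytes."""
    before = sha256sum(source)
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Hash the decompressed stream, not the .gz file itself.
    after = sha256sum_gzip(target)
    return before == after
```

The same sha256sum call can be run on a file before and after a transfer and the two digests compared. The round-trip check only shows that the compressed file decompressed correctly at the time it was tested; it does not protect against corruption that happens later, so the digest of the original is still worth storing alongside the archive.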

Timing Liu
Computational Biologist & Medical Student

Personalizing medicine