How Perl Helped the Genome Project

Today I want to talk about how Perl helped the Human Genome Project.

Legend has it that in early February 1996, in Cambridge, England, at the largest DNA sequencing centre in Europe, scientists from Cambridge met with colleagues from the largest DNA sequencing centre in the United States of America to solve a disconcerting problem: although both centres used mostly the same laboratory techniques, databases, and data analysis tools, they could not exchange data or meaningfully compare results.

The scale of the Genome Project is considerable and would make an average DBA run for cover: estimates go up to 10 terabytes of information at project completion, divided into complex and not very intuitive parts that make a lot of sense to molecular biologists, but not as much to a computer scientist.

From a software engineering point of view, managing laboratory activity is a complex, ever-changing problem, and the initial, monolithic implementations drove many development teams off the road. After a while, groups learned that modular, loosely coupled systems were a better solution: each one specializes in one part of the process and can be swapped out as laboratory protocols and techniques improve and change.

These modular systems were used to analyse the DNA data much like a pipeline, and Unix pipes were widely used to pass data between small processing programs.

To ease communication between those small pipelined analysis utilities, the Whitehead Institute and the MIT Centre for Genome Research developed a data interchange format named Boulder. It eventually became the data interchange format used by all the DNA sequencing centres around the world.

The format itself is designed to be easy to implement using Perl's strengths, and the implementation provided is, of course, Perl-based. Nowadays, implementations for other languages are also available.

The basic principle of the Boulder API is that all data is associated with tags. The API gives easy access to the tags the programmer is interested in from standard input, and lets the programmer add more tags, which are passed along with all the other data to the next program in the pipeline through standard output.
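To make the idea concrete, here is a minimal sketch of one pipeline stage, written in Python rather than Perl. It is not the Boulder library and does not reproduce its API. It assumes a simplified flat form of the format, with one TAG=VALUE pair per line and a line holding only "=" closing each record, and it ignores nested records. The SEQUENCE and LENGTH tags and every function name are invented for illustration.

```python
import sys

RECORD_END = "="  # assumption: a line holding only "=" closes a record


def read_records(stream):
    """Yield each record as a list of (tag, value) pairs, in input order."""
    record = []
    for line in stream:
        line = line.rstrip("\n")
        if line == RECORD_END:
            yield record
            record = []
        elif "=" in line:
            # Split on the first "=" only, so values may contain "=".
            tag, value = line.split("=", 1)
            record.append((tag, value))
    if record:  # tolerate a final record with no closing delimiter
        yield record


def write_record(record, stream):
    """Write one record as TAG=VALUE lines followed by the delimiter."""
    for tag, value in record:
        stream.write(f"{tag}={value}\n")
    stream.write(RECORD_END + "\n")


def add_length(record):
    """Hypothetical stage: append a LENGTH tag for every SEQUENCE tag."""
    extra = [("LENGTH", str(len(v))) for t, v in record if t == "SEQUENCE"]
    return record + extra


def run_stage(stage, source, sink):
    """Read records from source, apply the stage, pass everything along."""
    for record in read_records(source):
        write_record(stage(record), sink)


if __name__ == "__main__":
    run_stage(add_length, sys.stdin, sys.stdout)
```

Saved under a hypothetical name such as add_length.py, the stage could sit in a Unix pipeline between any two programs that speak the same format. Tags it does not know about pass through untouched, which is what lets stages be added, removed, or swapped independently.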

With the Boulder library, the cooperation of many other Perl developers, and many other tools, the Genome Project succeeded. I am proud to say it would all have been much more difficult without Perl.
