New genomes of animal and plant species are published in peer-reviewed journals almost daily, but the software used to put these genetic jigsaw puzzles together, known as assemblers, is not as trustworthy as you may think. This strikes at the core of computational biology. A recent study revealed that different genome assemblers produce different results from the same data. The genomics and bioinformatics community has commented on this massive study in a total of 21 blogs, which can be found on the Assemblathon 2 website, and researchers are working on ways to solve the problem.
"Pieces of eight": The genome of the Melopsittacus undulatus (budgie) bird species was used in
the Assemblathon 2 study. (Image courtesy of Benjamint444, Wikimedia Commons.)
As science continues to piece together the genetic codes of animal and plant species, like a jigsaw puzzle of life, breakthroughs abound. One example is the recent sequencing of the barber pole worm (Haemonchus contortus), a parasite that infects livestock worldwide and threatens global food security. Understanding its genome could lead to better-targeted drug therapies and more resilient animals. But the fundamental technology used to create high-quality sequences is not so reliable. Genome assemblers, the computer programs that piece together short reads of DNA sequence into complete genomes, do not give consistent results. This has implications both for scientific accuracy and for the applications that eventually reach the public.
The results of Assemblathon 2, a large study of the quality of genome assemblers, were recently published by 91 co-authors. The conclusions prompted Titus Brown, a researcher at Michigan State University, US, who was not directly involved, to say that, from a bioinformatics perspective, “we're doing it all wrong”.
The Assemblathon 2 results were published on 22 July 2013 in the open-access, open-data journal GigaScience, having appeared earlier on arXiv.org, and are also available on MyScienceWork. A total of 21 teams submitted 43 genome assemblies of a bird, a fish, and a snake, and the study found that different assemblers produced different results from the same data.
Assemblers do not perform consistently and it is hard to know which assembly is the most accurate. Moreover, Brown said on his blog, Living in an Ivory Basement, that high-profile journals celebrate ‘the’ genome of the mouse, or ‘the’ genome of the zebrafish without realising their data may not be so trustworthy.
To help biologists, the Assemblathon 2 researchers suggested using a range of different assemblers and parameters instead of just one, and judging the results on several metrics rather than a single measure. A biologist might, for example, choose the assembler that performs best on the particular output that matters most to them, such as the number of error-free bases. Each genome assembler makes different computational assumptions to ‘speed up’ the steps applied to next-generation sequencing data. Three main factors are at play: processing sequences can be memory intensive, time intensive, and algorithmically challenging (which is important for the quality of the results). These differing assumptions are the reason for the variation in results.
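To make the advice about using several metrics concrete, here is a minimal Python sketch that summarises two assemblies with more than one number. The metric N50 is a widely used contiguity measure, but it is not named in this article, and the function names and toy contigs below are assumptions made purely for illustration. They are not Assemblathon 2 data or code.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarise(contigs):
    """Summarise an assembly (a list of contig strings) with several metrics."""
    lengths = [len(c) for c in contigs]
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Two made-up assemblies of the same total size (140 bases each).
assembly_a = ["ACGT" * 25, "ACGT" * 5, "ACGT" * 5]    # lengths 100, 20, 20
assembly_b = ["ACGT" * 12, "ACGT" * 12, "ACGT" * 11]  # lengths 48, 48, 44

for name, contigs in (("A", assembly_a), ("B", assembly_b)):
    print(name, summarise(contigs))
```

Both toy assemblies contain the same number of contigs and bases, yet their N50 values differ. None of these numbers says anything about how many bases are actually correct, which is one reason a single measure can mislead.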
Genome assembly in action: This image visualises the genome assembly process. DNA is chemically broken down into short sequences, and a computer program (the assembler) recombines those short sequences to computationally reconstruct the original DNA sequence. The entire process is judged by three metrics: quality, cost and time. Assemblathon 2 measured only quality, but the other two factors are equally important for real users. (Image courtesy of Manoj Samanta.)
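The idea of recombining short sequences can be sketched in a few lines of Python. What follows is a toy greedy overlap merger, written only to illustrate the principle. It is not the method used by any assembler in the study, and the function names, the `min_len` parameter and the example reads are all hypothetical. Real assemblers must also cope with sequencing errors, repeats, reverse strands and billions of reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`,
    or 0 if no such overlap is at least `min_len` bases long."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap
    until no pair overlaps by at least `min_len` bases."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge; the result stays fragmented
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up overlapping reads taken from the sequence ATGGCGTGCAATCG.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATCG"]
print(greedy_assemble(reads))
```

Even in this toy, choices such as the minimum overlap length and which of two equally good merges to take first are assumptions built into the program. On real data containing repeats and errors, different choices of this kind lead to different assemblies.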
Calm down, dear.
But not all bioinformaticians share this view. “I do not see any problems in genome assembly programs,” says Manoj Samanta, a bioinformatics researcher and founder of the Systemix Institute. “In 2009-2010, nobody thought it would be possible to assemble large genomes from really tiny reads.”
According to Samanta, great strides have been made in assembler software and the Assemblathon 2 results should not come as a big surprise. The problem is that the uncertainty in assembler output is not communicated clearly to biologists.
Assemblathon 3 and beyond…
The Homologus blog, to which Manoj Samanta contributes, recommends that future studies dismantle various assembler programs to explain why they perform the way they do.
“It would be interesting to try to find out why certain programs work better on a dataset than others,” says Lex Nederbragt, a bioinformatician at the Centre for Ecological and Evolutionary Synthesis in Norway, who was not involved in the study, but commented on Assemblathon 2 on his blog, ‘In between lines of code’.
“This would be a job best done in collaboration between researchers having the datasets and skills to run and analyse assemblies, and the programmers of assembly programs.”
This work has started and will hopefully lead to better genome assembler software.
“Speed of execution is the real issue and was not discussed in Assemblathon. We are working on a hardware-based assembler that will make programs many times faster,” says Samanta.
Better algorithms are continually being developed. According to Samanta, a program called SPAdes is proving to be better than most other tools. Meanwhile, technologies from Pacific Biosciences (PacBio) look to do away with assembling genomes from short DNA reads altogether.
While there does not appear to be an Assemblathon 3 study on the horizon, one idea for a long-term and sustainable solution could be a real-time software collaboration – a hub where assemblers can be submitted and evaluated on agreed standardised metrics to help assess which tool performs best on a certain genome.
“I find the idea really interesting, and we are just starting to explore this on a very small scale,” says Nederbragt. “This would speed things up: when a new version becomes available, everyone can immediately judge its performance, and decide whether it is worth trying on their data sets. Now, we have to wait for groups to do assembly comparisons, write up the results, and get the paper accepted – this can take years.”
A project called GAGE does this to some degree, in that it runs other people's assembly software itself.
“In theory, it would be great to have this sort of framework in place that you propose,” says Keith Bradnam in an email interview. Bradnam is a project scientist at the University of California, Davis Genome Center, US, and helped organise and run Assemblathon 2.
“In practice, there would be barriers. For Assemblathon 2, even the simple task of making available all of the raw sequence data to groups around the world and then collating the resulting assemblies was not straightforward,” says Bradnam. In many cases, teams delivered hard drives by FedEx, a US courier service.
Finding Nemo… the genome: The fish species Maylandia zebra was used in the Assemblathon 2 study. (Image courtesy of Bardrock, Wikimedia Commons.)
Which pizza is right for you?
Bradnam says that choosing the best assembly method is akin to selecting the best pizzeria.
“People want the best quality genomes, but quality means different things to different people. If you asked 100 people to list their three most important factors when buying a pizza, I wonder how many different answers you would get,” says Bradnam. Currently, a standardised list of objective genome assembler metrics is a long way off.
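Bradnam's point can be illustrated with a small, hedged Python sketch: the same scores produce a different ‘best’ assembler depending on what a user values. The assembler names, metric names, scores and weights below are entirely made up for illustration. They are not results from Assemblathon 2 or from any real tool.

```python
# Hypothetical, made-up scores between 0 and 1 (higher is better).
# These are NOT measurements of any real assembler.
scores = {
    "assembler_x": {"contiguity": 0.9, "accuracy": 0.6, "completeness": 0.7},
    "assembler_y": {"contiguity": 0.6, "accuracy": 0.9, "completeness": 0.8},
}


def rank(scores, weights):
    """Order assembler names from best to worst by weighted average score."""
    total = sum(weights.values())

    def weighted(metrics):
        return sum(metrics[name] * w for name, w in weights.items()) / total

    return sorted(scores, key=lambda name: weighted(scores[name]), reverse=True)


# Two hypothetical users who care about different things.
wants_long_contigs = {"contiguity": 3, "accuracy": 1, "completeness": 1}
wants_few_errors = {"contiguity": 1, "accuracy": 3, "completeness": 1}

print(rank(scores, wants_long_contigs))
print(rank(scores, wants_few_errors))
```

With these invented numbers the two users end up with opposite rankings, which is the pizza problem in miniature: until there is agreement on what to measure and how to weight it, there is no single ‘best’ assembler.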
While Bradnam quips that the most efficient way to reduce assembly inconsistency is to stop bad genome assemblies being made, advances are happening.
One cutting-edge technique is the sequencing of individual chromosomes, which, if done routinely for all organisms, could reduce the scale of the assembly problem and therefore simplify the process.
Another genome sequencing problem is heterozygosity, the variation that arises because most human genomes are combinations of two others, one from each parent. Each parental copy gets ‘scrambled’ through recombination.
A lot of genome data has been generated from the sequencing of trios: two parents, plus one child. This information from parent genomes could potentially help untangle heterozygosity of offspring and could generate data to improve assemblers.
About the author:
Adrian Giordani has a Master's in Science Communication from Imperial College London, where he was also the Editor-in-Chief of I, Science magazine. He was a science journalist and Interim Editor-in-Chief at the publication International Science Grid This Week, based at CERN in Geneva, Switzerland. He writes about health, big data, software, supercomputing, and all forms of large-scale computing. You can follow him on Twitter (@Speakster).