Parallel Algorithms and Software for High-Throughput Sequence Assembly

High-throughput next-generation DNA sequencing (NGS) technologies are causing a major revolution in life sciences research by allowing rapid and cost-effective sampling of genomes and transcriptomes (expressed genomic sequences). Assembly of genomes and transcriptomes from billions of such randomly sampled sequences is an important problem in computational biology. While significant strides have been made, much work remains in addressing the diverse and rapidly emerging platforms, improving assembly quality, and scaling to both large data sets and large genomes.

This project will harness the power of high performance computing to develop effective solutions for sequence assembly. It will lead to the development of scalable, efficient parallel algorithms and an integrated parallel software framework for genome and transcriptome assembly. The project seeks to advance the state of the art by targeting important unsolved problems such as hybrid assembly of sequences from multiple NGS platforms, making fundamental algorithmic advances to improve assembly quality, and developing parallel algorithms in depth for the full range of problems that arise in assembly. It will be carried out by an interdisciplinary team of investigators, in partnership with leading NGS manufacturers and academic researchers involved in large plant genome sequencing projects.

The project will release a scalable parallel software package for sequence assembly and make it available to the scientific community. Postdoctoral researchers and graduate students will be trained in computer-science-driven interdisciplinary research and in writing efficient high performance computing software. The project will influence curriculum development and will lead to educational materials in bioinformatics for next-generation sequencing.

Duration: 
05/01/2012 to 04/30/2018
Principal Investigator(s):