Thursday, January 12, 2012

Can GPUs Help with Genomic Data Deluge?

Source: BGI OSS
The last couple of weeks have seen increased coverage of the potential for GPUs to accelerate the analysis of large-scale genetic sequence data, thanks to an announcement from nVidia and BGI. There seems to be little question that alternative processor architectures such as GPUs and FPGAs hold promise for dramatically reducing the time and cost of analyzing genetic data; the key question at this point is exactly which problems can be effectively tackled with these approaches.

BGI's announcement indicates that at least two key parts of the analysis process can be tackled with GPUs. Alignment, where the billions of sequence fragments generated by the sequencing instrument are matched up to a known genome, is typically one of the first steps in analysis and represents a non-trivial portion of the overall workload. Further into analysis there is often a need to identify single-nucleotide variations, or SNPs, and BGI was also successful in applying GPU technology to this problem. But as the Wired article points out, these are only two of the dozens of steps in the analysis process, so it's really too early to make any major claims about the potential for GPUs to revolutionize genomics analysis end-to-end.
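
To make the second of those steps concrete, here is a minimal, purely illustrative Python sketch of SNP identification. It assumes the alignment step has already been done (each read carries its start position on the reference) and simply flags positions where the consensus of the aligned bases disagrees with the reference. The reference sequence, reads, and depth threshold are invented for illustration and have no connection to BGI's GPU implementation.

    from collections import Counter

    def call_snps(reference, aligned_reads, min_depth=3):
        """Toy SNP caller: report positions where the consensus base of the
        aligned reads disagrees with the reference base."""
        # Build a "pileup": the bases each read places on every reference position.
        pileup = [Counter() for _ in reference]
        for start, read in aligned_reads:
            for offset, base in enumerate(read):
                pos = start + offset
                if 0 <= pos < len(reference):
                    pileup[pos][base] += 1

        snps = []
        for pos, counts in enumerate(pileup):
            depth = sum(counts.values())
            if depth < min_depth:
                continue  # too little evidence at this position
            consensus, _ = counts.most_common(1)[0]
            if consensus != reference[pos]:
                snps.append((pos, reference[pos], consensus, depth))
        return snps

    # Made-up data: three short reads already aligned to a toy reference.
    ref = "ACGTACGTACGT"
    reads = [(0, "ACGTACGT"), (2, "GTTCGTAC"), (4, "TCGTACGT")]
    print(call_snps(ref, reads))   # [(4, 'A', 'T', 3)]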

While somewhat overshadowed by the nVidia announcement, FPGA approaches also recently took a step forward with Pico's announcement at PAG XX of their success in applying their technology to the BFAST sequence alignment algorithm. They are reporting orders-of-magnitude improvements in performance, as well as improved alignment accuracy.

These technologies bear careful watching as they hold the promise of improving the performance of genetic analysis and reducing the cost of the technology and data center facilities required to handle the onslaught of genetic sequence data.

UPDATE 1/13: Looks like there was another FPGA announcement at PAG XX, this one from CLC Bio and Sciengines. They are reporting similar orders-of-magnitude performance improvements running BLAST and Smith-Waterman alignment algorithms.
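
For reference, Smith-Waterman is the classic dynamic-programming algorithm for local alignment, and accelerated implementations typically gain their speed by parallelizing the fill of its scoring matrix. Below is a minimal pure-Python sketch of the scoring recurrence only; the match/mismatch/gap values are arbitrary illustrative choices, and a real implementation would also do traceback and run orders of magnitude faster.

    def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
        """Best local-alignment score between sequences a and b using the
        Smith-Waterman dynamic-programming recurrence."""
        rows, cols = len(a) + 1, len(b) + 1
        # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                up = H[i - 1][j] + gap
                left = H[i][j - 1] + gap
                H[i][j] = max(0, diag, up, left)  # the 0 lets a local alignment restart
                best = max(best, H[i][j])
        return best

    # Two short made-up sequences.
    print(smith_waterman_score("ACACACTA", "AGCACACA"))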

Monday, August 29, 2011

Science Fire Hose: Intro to Molecular Biology for IT Professionals

One of the harder things I've done in a while was trying to condense a year of genetics classes, an overview of high-throughput sequencing (HTS), and what it all means for IT into a ninety-minute workshop at Bio-IT World this past spring. Since none of the biologists in my audiences, either here at Jackson or at the conference, threw anything at me, I'm declaring success (such as it is).

The IT folks who have been through the workshop have said it has been very helpful as a high-level introduction to the science and HTS, and in understanding why this work, along with future research into proteomics and the microbiome, will only increase the need for robust, scalable IT infrastructure. It also helps them connect their work in a meaningful way to the biological and medical challenges scientists are probing. With enough notice, I'm happy to give this workshop on request to IT staff looking to better understand these topics.

Tuesday, April 19, 2011

Bio-IT World 2011: Strategic Planning for IT Infrastructure

Bio-IT World took place in Boston last week, and in Track 1: Hardware I gave an update, titled "Strategic Planning for IT Infrastructure in Support of Data-Intensive Science," on Jackson's efforts related to the whitepaper. Jackson has made substantial progress in the past year - we have expanded our infrastructure in support of science, we've started the process of applying for grants for infrastructure in support of sequencing, and we're currently looking at fee-for-service models.

More detailed examination of the HTS activities has raised the estimates for storage associated with those activities, as the instruments continue their impressive gains in output. Refinements of the compute/HPC models are trailing somewhat behind, as compute has not been as financially or technically challenging as storage.

As researchers gain more direct experience with HTS data and the associated tools needed for analysis and data management, it is becoming clear that the time and resources available for scientists to become comfortable and facile with the sequencing tools and pipelines are a major governor on the pace of the science. Installing instruments and providing infrastructure are in many ways easier hurdles to clear than the ongoing technical training and changes in research process needed for HTS-based science.

Thursday, February 24, 2011

How Deep Is It? Next-Gen Sequencing Advances and the Data Deluge

Source: Science
In the previous post I looked at the broader picture for data growth. Closer to home in the world of biological science, sequencing is the current focus of data overload, and the situation has changed from a year ago. Scott Kahn's On the Future of Genomic Data provides an excellent survey of the topic in this month's Science Special Online Collection: Dealing with Data. The biggest single change is the move away from providing access to the raw images generated by NGS instruments such as those from Illumina. Their latest systems, including the HiSeq 2000, no longer provide access to the raw images, which constituted the major component of the storage need. At the same time, however, the instruments generate considerably more sequence data - the HiSeq started out at ~5X the output of its predecessor, the GA IIx.

A key effort for those planning support for these systems is to develop rules of thumb for the storage required per base of sequence. As noted in Next-generation sequencing: adjusting to data overload in Nature Methods, 50 bytes/base was typical for just the raw data without images (which alone were ~170 bytes/base). However, narrowing the definition of raw data to include only FASTQ files can bring the ratio down to under 5 bytes/base (assuming 30X coverage) - clearly a massive improvement in the amount of storage required per base of sequence. And there are efforts to reduce this further through reference-based compression, which encodes only the differences from a reference genome. This approach is applicable to some data, but clearly not to others (such as de novo sequencing).
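
The idea behind reference-based compression is simple enough to sketch: instead of storing every base of an aligned read, store only where it maps and the places where it differs from the reference. The toy Python example below (which ignores quality scores, indels, and the container formats real tools use) is just to illustrate the principle:

    def encode_against_reference(reference, start, read):
        """Reference-based encoding of one aligned read: keep only the mapping
        position, the read length, and the bases that differ from the reference."""
        diffs = [(i, base) for i, base in enumerate(read)
                 if reference[start + i] != base]
        return (start, len(read), diffs)

    def decode_against_reference(reference, record):
        """Reconstruct the original read from the reference and the encoding."""
        start, length, diffs = record
        bases = list(reference[start:start + length])
        for offset, base in diffs:
            bases[offset] = base
        return "".join(bases)

    ref = "ACGTACGTACGTACGT"              # stand-in for a reference genome
    read = "ACGTTCGT"                     # a made-up read aligned at position 4
    rec = encode_against_reference(ref, 4, read)
    print(rec)                            # (4, 8, [(4, 'T')]) -- only one mismatch stored
    assert decode_against_reference(ref, rec) == read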

It is much more difficult to build such a rule for downstream analysis, and there is little published information on this piece of the pipeline. Since the various pipelines associated with DNAseq, RNAseq, ChIPseq, CNV analysis, etc. use different tools and protocols, their output can vary substantially. Furthermore, reproducibility requirements play a key role - Galaxy, for example, is designed to store the details of every step of a pipeline, including software versions, settings, and intermediate files, which has the potential to generate a substantial amount of data. RNAseq may require another 2.5x bytes/base beyond the original raw data, whereas DNAseq may require 7x bytes/base.

How does this translate to the real world? For the sake of discussion, let's say we're running a HiSeq, our raw data requires 5 bytes/base (again at 30X coverage), and the average need of our downstream pipelines is 20 bytes/base, for a total of 25 bytes/base. The HiSeq currently puts out ~250 gigabases per run, and running the system full time (40 runs per year) will generate 10 Tbases of output. This will require roughly a quarter of a petabyte of storage, which is substantial but not as scary as it was a year ago, when storing raw images would have pushed this number to multiple PBs for the same number of bases. However, Illumina's roadmap for the instrument has it generating a Tbase per run within a year, so the curve is still cranking upward. And the cost per base of sequencing continues to plummet, to the point where exome sequencing appears to have a limited lifespan in favor of whole-genome sequencing, which, when combined with increased demand, will also drive up total output.
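
That back-of-the-envelope arithmetic is easy to parameterize so it can be rerun as the rules of thumb change. Here is a quick Python sketch using the same assumed figures (estimates for discussion, not vendor specifications):

    def annual_storage_tb(gbases_per_run, runs_per_year, bytes_per_base):
        """Rough annual storage need, in terabytes, for one sequencer."""
        bases_per_year = gbases_per_run * 1e9 * runs_per_year
        return bases_per_year * bytes_per_base / 1e12

    # ~250 Gbases/run, 40 runs/year, 5 bytes/base raw + 20 bytes/base downstream
    raw = annual_storage_tb(250, 40, 5)
    downstream = annual_storage_tb(250, 40, 20)
    print(raw, downstream, raw + downstream)   # 50.0 200.0 250.0 TB -- about a quarter PB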

Getting rid of the raw images gave a moment of respite to those working to support high-throughput sequencing pipelines, but we need to continue to carefully watch the data growth, and gain a better understanding of downstream analysis requirements.

How Deep Is It? A Check-In on the Data Deluge

There has been no shortage of articles concerning the data deluge that leading scientific research is struggling with. Jackson's own look last year at the next five years of research in its FY2010 Research Information Technology Strategic Position White Paper projected tremendous data growth. But where do things stand almost a year later?

First, information on the net generally points to ongoing exponential data growth, both generally and in science:
Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology is a report issued in December 2010 to the President and Congress from the President's Council of Advisors on Science and Technology. One of five broad crosscutting themes in their report is growth in data:
"Data volumes are growing exponentially. There are many reasons for this growth, including the creation of nearly all data today in digital form, a proliferation of sensors, and new data sources such as high-res- olution imagery and video. The collection, management, and analysis of data is a fast-growing concern of NIT research. Automated analysis techniques such as data mining and machine learning facilitate the transformation of data into knowledge, and of knowledge into action. Every Federal agency needs to have a “big data” strategy."
This echoes sentiments from a 2009 National Academy of Sciences publication from the Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age:
"Huge increases in the quantity of data being generated, combined with the need to move digital data between successive storage media and software environments as technologies evolve, are creating severe challenges in preserving data for long-term use. And these issues are not restricted to large-scale research projects; they can be especially acute for the small-scale projects that continue to constitute the bulk of the research enterprise." 
Most recently, Science published a special online collection this month titled Dealing with Data, which looks at the challenges across a number of fields. There is a good survey of the issues in biology and genetics in Scott Kahn's On the Future of Genomic Data, a topic I'll look at in more detail in the next posting.

UPDATE 3/10/11: there is a new article on big data over at HPC in the Cloud, which is a broad survey from a business perspective that covers much of the existing territory, with a couple of new items. An interesting data point comes from Dhruba Borthakur, a lead Hadoop engineer at Facebook, who reports that Facebook is now storing all user email in a Hadoop-based cloud. But most interesting is a link to an older blog post from Jeff Jonas, IBM Distinguished Engineer and Chief Scientist, that looks at the idea of "context accumulation", or the idea that each new piece of information needs to be evaluated in the context of what is already known. That post is somewhat dated, but a slide deck of Jeff's from last fall covers the territory, and leaves me thinking that this idea could be extremely important in biological research. Jeff uses the analogy of a puzzle to help describe this concept - the puzzle will be assembled much more quickly if each new piece is assessed in the context of what has already been assembled.

Wednesday, February 23, 2011

Getting Science Done: Making HPC Work For Researchers

Full poster version available
Back in October at the Harvard BioMed HPC Summit I facilitated a group discussion on the challenges facing research organizations seeking to utilize leading-edge information technology in support of their scientific mission. The 2+ hour breakout session was well attended by a diverse group of IT staff, bioinformaticians, bench scientists, and institutional leadership. I was also encouraged to discover roughly half a dozen people in the group who hold formal liaison roles of one form or another.

The group identified seven key areas of concern:
  • Solution ownership - including long-term maintenance
  • Interfaces - command line vs GUI, real time vs batch, interactive supercomputing
  • Design & Analysis - ensuring good code/approach
  • IT Support Models - project management, change management, etc
  • User Education
  • Communication - perception management/PR, two-way outreach
  • Security & Standards
We then surveyed the activities we felt could make progress more immediately (less than two years) as well as more strategic approaches (two to five years).  The poster above details each of these areas and approaches. 

The group also reached a number of conclusions:
  • Communication is a core issue (i.e., "data" is plural!).
  • Shift towards data-driven science requires education and training in all groups.
  • Liaison roles are becoming more common, and are located throughout organizations.
I'm happy to say that this group was voted the most productive of the four breakout sessions at the conference! The conversation around the need for better education and cross-training has also spurred me to build a workshop covering basic genetics for IT professionals which will be given in the pre-conference workshops at Bio-IT World 2011 in Boston.

Friday, February 4, 2011

John Koskie Covers The Maine Innovation Cloud

The Jackson Laboratory Computational Sciences HPC Users Group hosted a visit today from John Koskie, Operations/Program Manager at the Target Technology Incubator at the University of Maine in Orono. John gave the group of bioinformaticians and IT professionals an overview of the Maine Innovation Cloud he has put together over the last year.

The impetus behind this effort is to build a cloud service for the Foster Student Innovation Center and the Target Tech Incubator, eventually extending the service to all Maine incubators. The initial focus of the service is providing access to web servers and content management systems such as Drupal and Joomla.

John started with virtually no budget and has put together an offering based on Eucalyptus, the open-source implementation of the Amazon EC2 and S3 APIs.  The overview was fairly technical and indicated that this environment is still best suited to those with some knowledge of systems and systems administration - it is not yet a one-button environment for end-users.
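
Because Eucalyptus exposes the EC2 API, standard Amazon client libraries can simply be pointed at the private endpoint. As a rough sketch of what that looks like with the boto library (the hostname, port, path, and credentials below are placeholders, and the exact connection parameters depend on the Eucalyptus and boto versions deployed):

    import boto
    from boto.ec2.regioninfo import RegionInfo

    # Placeholder endpoint and credentials -- substitute the values issued
    # by the local cloud administrator.
    region = RegionInfo(name="eucalyptus", endpoint="cloud.example.edu")
    conn = boto.connect_ec2(
        aws_access_key_id="YOUR-ACCESS-KEY",
        aws_secret_access_key="YOUR-SECRET-KEY",
        is_secure=False,
        region=region,
        port=8773,
        path="/services/Eucalyptus",
    )

    # The same calls a script would make against Amazon EC2.
    for image in conn.get_all_images():
        print("%s %s" % (image.id, image.location))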

While still in its early stages, the cloud service will see use by students in a course that will be taught by Carol Bult (Jackson) and Keith Hutchinson (UMO) this spring.