The Human Genome Sequencing Center (HGSC) at Baylor College of Medicine is one of the major sequencing centers in the U.S. One of its projects, CHARGE (Cohorts for Heart and Aging Research in Genomic Epidemiology), studies aging and heart disease. CHARGE involves the efforts of 200 scientists from 5 institutions. The consortium receives different types of research data from the National Heart, Lung, and Blood Institute (NHLBI), amounting in all to about 1 PB of raw data from 20 different sequencing machines.
Shipping data on physical disks to scientists scattered across the globe was a logistical nightmare. The data had to be encrypted and distributed across different regions. Baylor consulted DNAnexus, a company specializing in data management, secure sequence analysis, and collaboration across data centers. DNAnexus's PaaS solutions are built on top of AWS services, providing on-demand computation, storage, and security compliance for the bioinformatics field. The output of the sequencing instruments was massive and had to be distributed securely. Moreover, distribution took so long that major technology changes could occur in the meantime. Veeraraghavan says, “In those months, technology can change, protocols can change, and updates to the sequencing platform can mean that sequencers can double their output. So demand has doubled in the time you’ve taken to plan and estimate your hardware needs.” A third problem was cost: the existing solutions were very expensive.
DNAnexus used its PaaS APIs to move all the data into the AWS Cloud. AWS provided DNAnexus with more than 20,000 simultaneous compute cores, 1 PB of storage, millions of core hours of analysis, and hundreds of thousands of compute jobs orchestrated in the AWS Cloud. To run the workloads securely and in compliance, AWS provided customers with a Business Associate Agreement. DNAnexus had a pipeline called Mercury, which transformed the raw data into end results important for clinical research; the results were used to make new findings about genes. Amazon S3 and Amazon Glacier were used for storage and retrieval, holding the full 1 PB of data in the cloud. DNAnexus also developed a command-line tool that automated uploading sequencing data from the instruments directly into the cloud.
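The article does not describe the internals of the upload tool, but automated uploaders of this kind typically split large instrument output files into fixed-size parts and checksum each part, so a multipart cloud upload can be verified and resumed after a failure. The sketch below illustrates that planning step only; the function name, part size, and dictionary layout are illustrative assumptions, not DNAnexus's actual API.

```python
import hashlib

# Illustrative part size; S3 multipart uploads require parts of at least 5 MB
# (except the last part), so 8 MB is a plausible choice.
PART_SIZE = 8 * 1024 * 1024


def plan_upload(path, part_size=PART_SIZE):
    """Split a sequencing output file into parts and checksum each one,
    so an interrupted upload could resume from the last verified part."""
    parts = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            parts.append({
                "part_number": index + 1,       # S3 part numbers are 1-based
                "offset": index * part_size,    # byte offset within the file
                "size": len(chunk),
                "md5": hashlib.md5(chunk).hexdigest(),  # integrity check
            })
            index += 1
    return parts
```

For example, a 20 MB instrument file with 8 MB parts yields three parts, the last one 4 MB; an uploader can compare each part's MD5 against what the storage service reports to confirm the transfer was not corrupted.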
This eliminated the need for on-premises storage and spared customers from large storage tradeoffs. Amazon EC2 carried out all the analysis tasks, with scheduling managed by a custom queuing system. For the website, the customer-facing front-end portal, DNA visualization tools, and other such services, DNAnexus used Reserved Instances, a strategy that cut instance costs. Veeraraghavan says, “By using one pipeline and controlling access to that pipeline, you can structure your environment in such a way as to minimize the risk.” Omar Serang, DNAnexus Chief Cloud Officer, says, “We are able to power ultra large-scale clinical studies that require computational infrastructure in a secure and compliant environment at a scale not previously possible.”
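The custom queuing system is not described in any detail here; as a rough sketch of the idea, a scheduler of this kind pushes independent analysis jobs onto a shared queue and lets a pool of workers pull them off, with worker threads standing in for EC2 instances. Everything below (worker count, the sentinel-based shutdown) is an assumption for illustration, not DNAnexus's actual implementation.

```python
import queue
import threading


def run_jobs(jobs, num_workers=4):
    """Dispatch analysis jobs to a pool of workers via a shared queue,
    mimicking how a scheduler might farm pipeline stages out to compute nodes."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = work.get()
            if job is None:            # sentinel: no more work for this worker
                work.task_done()
                return
            outcome = job()            # run the analysis step
            with lock:                 # results list is shared across threads
                results.append(outcome)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        work.put(job)
    for _ in threads:                  # one sentinel per worker to shut down
        work.put(None)
    work.join()                        # block until every job is processed
    for t in threads:
        t.join()
    return results
```

Because workers pull jobs as they finish, a fast node naturally takes on more work than a slow one; the same property is what lets a cloud scheduler absorb thousands of heterogeneous pipeline jobs without manual assignment.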
The solution met its goals. The team carried out its first experiment in 10 days, five times faster than before, using 21,000 cores. “Any scientist, whether he’s running on a Mac, Linux, or Windows, can run any tool on all the CHARGE data in DNAnexus,” said Veeraraghavan. Andrew Carroll, lead DNAnexus Scientist for CHARGE, adds, “Using the AWS Cloud makes it possible to compare tools, so that you can understand what works for your project and what doesn’t. DNAnexus on the AWS Cloud lets researchers share what they learn with the scientific community.”
In the end, CHARGE scientists were able to cut costs and speed up their work. They were also able to securely distribute large volumes of data across multiple diverse platforms, which in turn helped scientists process and analyze data far more quickly.