New method compresses terabytes of genomic data into gigabytes

Genomic researchers used to be able to store their datasets on a laptop. With so many whole genomes now available to study, the resulting datasets must be stored in the cloud, which makes computations more expensive, slower and more unwieldy.

A new method developed at Cornell provides tools and methodologies to compress hundreds of terabytes of genomic data into gigabytes, once again enabling researchers to store datasets on local computers. The paper, “Enabling Efficient Analysis of Biobank-Scale Data with Genotype Representation Graphs,” was published Dec. 5 in Nature Computational Science.

“Even just a few years ago, the data we were studying usually wasn’t whole genome sequencing data, which meant only a small fraction of the genomes were being measured, rather than the entire genome. And because of that, the size of the data wasn’t so crazy,” said April Wei, assistant professor of computational biology in the College of Arts and Sciences.

Raw data size can now run into the petabytes, said co-author Drew DeHaas, computational genetics programmer in the College of Agriculture and Life Sciences.

Wei had always wanted to develop methods that use biobank-scale data for research because of the richness of the information available, but many of the things she wanted to do weren’t possible because of the computational cost and challenge. This inspired her, she said, to tackle the compression problem. The result was the Genotype Representation Graph (GRG) method, which uses graphs to manage the data.

“Graph-based methods have long been used in computer science and other fields to provide a clear framework for solving challenging problems,” DeHaas said, but prior to GRG they had not been applied to data compression in genomics at biobank scale.

Wei, trained as a population geneticist, had deep familiarity with graphs used in population genetics—although GRG is designed quite differently.

“Unlike conventional matrix-based representations, GRG represents genotypes as a graph, where relationships between individuals are captured through shared mutations in their genomes. The GRG data structure not only encodes genotypic information more intuitively and compactly, but also facilitates efficient graph-based computations for advanced analyses,” said co-author Ziqing Pan, doctoral student in the field of computational biology.
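The contrast Pan describes between a matrix and a graph can be made concrete with a toy example. The Python sketch below is an illustration only, under simplifying assumptions: it is not the authors' construction algorithm or the GRG software's API, and the function name build_toy_graph and the node labels are hypothetical. It groups mutations that are carried by exactly the same samples under one shared node, then compares the number of graph edges with the number of nonzero matrix entries.

```python
# Toy illustration only: not the GRG library's API or the authors' algorithm.
from collections import defaultdict

# Genotype matrix: one row per mutation, one column per haploid sample
# (1 means the sample carries the mutation).
matrix = {
    "m1": [1, 1, 1, 1, 0, 0],
    "m2": [1, 1, 1, 1, 0, 0],
    "m3": [1, 1, 1, 1, 0, 0],
    "m4": [0, 0, 0, 0, 1, 1],
    "m5": [0, 0, 0, 0, 1, 1],
}


def build_toy_graph(matrix):
    """Group mutations with identical carrier sets under one shared node."""
    groups = defaultdict(list)
    for mutation, row in matrix.items():
        carriers = tuple(i for i, bit in enumerate(row) if bit)
        groups[carriers].append(mutation)

    edges = []
    for n, (carriers, mutations) in enumerate(groups.items()):
        node = f"n{n}"
        edges += [(m, node) for m in mutations]       # mutation -> shared node
        edges += [(node, f"s{i}") for i in carriers]  # shared node -> samples
    return edges


edges = build_toy_graph(matrix)
matrix_ones = sum(sum(row) for row in matrix.values())
print(matrix_ones, len(edges))
```

In this toy case the matrix holds 16 nonzero entries while the graph needs 11 edges, because the samples shared by several mutations are listed once rather than once per mutation. Savings on real data depend on how much sharing exists among genomes; the figures here say nothing about the compression ratios reported in the paper.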

GRG compresses the data while focusing on scalability and faithfully representing the data, according to Wei.

“The great benefit of utilizing graphs for compression is that we can do computations with graphs, without the need to decompress the data,” she said. “Also, specific algorithms could be developed to do things that people couldn’t do with older formats, so there are potentially more benefits.”
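Wei's point about computing on the graph without decompressing can also be sketched in a few lines. The Python example below is a hedged illustration, not the paper's method: the graph, the node names and the function carriers_below are all hypothetical, and it assumes that the children of any node cover disjoint sets of samples, so that counts can be added. It counts how many samples carry each mutation by walking the graph and reusing the result for each shared node, without ever rebuilding the full genotype matrix.

```python
# Hypothetical toy graph; names and layout are illustrative, not the GRG file format.
from functools import lru_cache

# Each key points "down" to its children. Leaves s0..s5 are haploid samples.
# Assumption: the children of any node cover disjoint sets of samples,
# so sample counts can simply be added.
children = {
    "m1": ["a"],        # mutation carried by every sample under node a
    "m2": ["b"],        # mutation carried by every sample under node b
    "m3": ["b", "s4"],  # mutation carried under node b plus sample s4
    "a": ["b", "c"],
    "b": ["s0", "s1"],
    "c": ["s2", "s3"],
}
NUM_SAMPLES = 6


@lru_cache(maxsize=None)
def carriers_below(node):
    """Count samples reachable from node; the cache evaluates each node once."""
    kids = children.get(node)
    if kids is None:  # a node with no children is a sample
        return 1
    return sum(carriers_below(kid) for kid in kids)


for mutation in ("m1", "m2", "m3"):
    count = carriers_below(mutation)
    print(mutation, count, count / NUM_SAMPLES)
```

Node b is shared by all three mutations, so its count is computed once and reused. That reuse is the general reason a graph traversal can be cheaper than scanning one matrix row per mutation.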

Because the GRG enables researchers to analyze the same data more efficiently, it also lowers costs.

More information:
Drew DeHaas et al, Enabling efficient analysis of biobank-scale data with genotype representation graphs, Nature Computational Science (2024). DOI: 10.1038/s43588-024-00739-9

Provided by
Cornell University

Citation:
New method compresses terabytes of genomic data into gigabytes (2024, December 5)
retrieved 5 December 2024
from https://medicalxpress.com/news/2024-12-method-compresses-terabytes-genomic-gigabytes.html

