Data Compression

Listed below are tools developed by the iDASH team for data compression.

Click on the name in the left-hand column to download or access the tool.

Name Description

DNA-COMPACT

 

A genome compression technique that uses a two-pass procedure. To improve compression performance, it combines complementary contextual models and introduces a logistic regression method for mixing their predictions.

 

Platform: Web service

 

Datasets that can be used: .fasta file (assembly file)

 

Citation: Li, Pinghao, et al. "DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique." PLoS One. 2013 Nov 25;8(11):e80377. PMID: 24282536. PMCID: PMC3840021
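
The idea of mixing several contextual models in the logistic domain, mentioned in the DNA-COMPACT description, can be illustrated with a small sketch. This is not the published DNA-COMPACT implementation: the class and function names (ContextModel, mix, estimate_bits), the model orders, the learning rate, and the use of log-linear (multinomial logistic) mixing are all assumptions chosen for illustration. The sketch runs in a single pass and only estimates the ideal code length in bits; a real compressor would feed the mixed probabilities to an arithmetic coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k model: counts which base follows each k-base context."""

    def __init__(self, order):
        self.order = order
        # Every context starts with a count of 1 per base (Laplace prior).
        self.counts = defaultdict(lambda: [1.0] * 4)

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1.0


def mix(predictions, weights):
    """Log-linear mix: weighted sum of log-probabilities, then softmax."""
    scores = [
        sum(w * math.log(p[s]) for w, p in zip(weights, predictions))
        for s in range(4)
    ]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def estimate_bits(sequence, orders=(1, 2, 4), lr=0.02):
    """Ideal code length (in bits) of `sequence` under the mixed model."""
    models = [ContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        preds = [m.predict(history) for m in models]
        mixed = mix(preds, weights)
        s = ALPHABET.index(symbol)
        bits -= math.log2(mixed[s])
        # Gradient ascent on the log-likelihood of the observed base.
        for j, p in enumerate(preds):
            expected = sum(mixed[t] * math.log(p[t]) for t in range(4))
            weights[j] += lr * (math.log(p[s]) - expected)
        for m in models:
            m.update(history, symbol)
    return bits


if __name__ == "__main__":
    demo = "ACGT" * 200  # hypothetical, highly repetitive input
    print(f"{estimate_bits(demo) / len(demo):.3f} bits per base")
```

A plain 2-bit encoding spends exactly 2 bits per base, so any value below that on a given input indicates that the mixed model is capturing structure in the sequence.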

HUGO (Hierarchical mUlti-reference Genome cOmpression)

HUGO is a novel compression algorithm for aligned reads in the sorted Sequence Alignment/Map (SAM) format. On the experimental datasets, it produces a compression ratio (compressed size divided by original size) in the range of 0.5–0.65, which corresponds to 35–50% storage savings. It achieves 15% more storage savings than CRAM and a compression ratio comparable to that of Samcomp (CRAM and Samcomp are two state-of-the-art genome compression algorithms).

 

Platform: Linux

 

Datasets that can be used: .BAM files

 

Tutorial: http://jamia.bmj.com/content/21/2/363.full.pdf+html

 

Citation: Li, P., Jiang, X., Wang, S.*, Kim, J., Ohno-Machado, L. “HUGO: Hierarchical mUlti-reference Genome cOmpression for BAM files.” J Am Med Inform Assoc. 2014 Mar-Apr;21(2):363-73. PMID: 24368726. PMCID: PMC3932469
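
The relationship between the compression ratio and the storage savings quoted in the HUGO description is simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes the ratio is defined as compressed size divided by original size, which is the reading consistent with the quoted figures.

```python
def compression_ratio(compressed_bytes, original_bytes):
    """Compressed size divided by original size (smaller is better)."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return compressed_bytes / original_bytes


def storage_savings(ratio):
    """Fraction of storage saved for a given compression ratio."""
    return 1.0 - ratio


if __name__ == "__main__":
    # The two ends of the range quoted in the description.
    for ratio in (0.5, 0.65):
        print(f"ratio {ratio:.2f} -> {storage_savings(ratio):.0%} saved")
```

Under this definition, a lower ratio means better compression, which is why the lower end of the ratio range (0.5) maps to the higher end of the savings range (50%).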

Streamlined Genome Sequence Compression using Distributed Source Coding

 

We aim to develop a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require a heavy client (the encoder side) cannot be applied to such devices. To tackle this challenge, we carefully examined distributed source coding theory and developed a customized reference-based genome compression protocol that meets the low-complexity requirement on the client side.

Platform: Windows and MATLAB

Datasets that can be used: The Arabidopsis Information Resource (http://www.arabidopsis.org/) and The Institute for Genomic Research (TIGR) (http://www.arabidopsis.org/)

Citation: Wang S, Jiang X, Chen F, Cui L, Cheng S. "Streamlined Genome Sequence Compression using Distributed Source Coding." Cancer Inform. 2014 Dec 2;13(Suppl 1):123-31. PMID: 25520552
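
To give a feel for why distributed source coding keeps the encoder light, here is a textbook-style sketch of syndrome coding with side information at the decoder. It is not the protocol from the cited paper: the Hamming(7,4) code, the function names, and the assumption that the source block and the reference block differ in at most one bit per 7-bit block are all illustrative choices. The encoder only computes a 3-bit syndrome and never sees the reference; the decoder, which holds the reference, does the recovery work.

```python
def syndrome(block):
    """3-bit Hamming(7,4) syndrome of a 7-bit block (list of 0/1 values).

    Column i of the parity-check matrix is the binary form of i (1..7),
    so the syndrome is the XOR of the 1-based positions of the set bits.
    """
    if len(block) != 7:
        raise ValueError("block must contain exactly 7 bits")
    s = 0
    for position, bit in enumerate(block, start=1):
        if bit:
            s ^= position
    return s


def encode(block):
    """Client side: send only the 3-bit syndrome instead of 7 bits."""
    return syndrome(block)


def decode(received_syndrome, side_info):
    """Server side: recover the source block from the syndrome and a
    reference block assumed to differ from it in at most one bit."""
    diff = received_syndrome ^ syndrome(side_info)
    recovered = list(side_info)
    if diff:
        # A nonzero difference is the 1-based position of the differing bit.
        recovered[diff - 1] ^= 1
    return recovered


if __name__ == "__main__":
    source = [1, 0, 1, 1, 0, 0, 1]     # hypothetical block on the device
    reference = [1, 0, 1, 1, 1, 0, 1]  # reference block, one bit differs
    assert decode(encode(source), reference) == source
```

The encoder here is a handful of XOR operations per block, while the decoder carries the cost of comparing against the reference. That asymmetry is the property the description above relies on for devices with limited computation power.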