(BigData 2020) Scalable reference genome assembly from compressed pan-genome index with Spark

About This Webinar

Abstract: High-throughput sequencing (HTS) technologies have enabled rapid sequencing of genomes and large-scale genome analytics with massive data sets. Traditionally, genetic variation analyses have been based on the human reference genome assembled from a relatively small human population. However, genetic variation could be discovered more comprehensively by using a collection of genomes, i.e., a pan-genome, as a reference. Pan-genomic references can be assembled from larger populations or from a specific population under study. Moreover, exploiting pan-genomic references with current bioinformatics tools requires efficient compression and indexing methods. To leverage the accumulating genomic data, the power of distributed and parallel computing has to be harnessed for new genome analysis pipelines. We propose a scalable distributed pipeline, PanGenSpark, for compressing and indexing pan-genomes and assembling a reference genome from the pan-genomic index. We experimentally show the scalability of PanGenSpark with human pan-genomes in a distributed Spark cluster comprising 448 cores distributed across 26 computing nodes. Assembling a consensus genome from a pan-genome of 50 human individuals took 215 minutes, and from a pan-genome of 500 human individuals, 1468 minutes. The index of a 1.41 TB pan-genome was compressed to 164.5 GB in our experiments.

Authors: Altti Ilari Maarala (University of Helsinki, Finland); Ossi Arasalo (Aalto University, Finland); Daniel Valenzuela (University of Helsinki, Finland); Keijo Heljanko (University of Helsinki & HIIT, Finland); Veli Mäkinen (University of Helsinki, Finland)

Email: ilari.maarala@helsinki.fi, ossi.arasalo@vtt.fi, daniel.valenzuela.serra@gmail.com, keijo.heljanko@helsinki.fi, veli.makinen@helsinki.fi

Who can view: Everyone
Webinar Price: Free
Featured Presenters
Altti Ilari Maarala is currently pursuing a Ph.D. degree at the University of Helsinki, Finland. Since 2019, he has been working as a researcher in the Department of Computer Science, University of Helsinki. He started his Ph.D. studies in 2016 in the Department of Computer Science, Aalto University, Finland. After receiving his M.Sc. in computer science from the University of Oulu in 2014, he joined the Interactive Spaces research group as a researcher. He has also worked as a research assistant in the Machine Vision research group at the University of Oulu. His current research interests include distributed and parallel algorithms, computational genomics, and big data.

Hosted By
Services Society webinar platform hosts (BigData 2020) Scalable reference genome assembly from compressed pan-genome index with Spark