(CLOUD 2020) Bioinf-PHP: Bioinformatics Pipeline for Protein Homology and Phylogeny

About This Webinar

Abstract: Catalases are enzymes that play a critical role in regulating the level of harmful hydrogen peroxide in cells. There are three main families of these proteins: Typical Catalases, Catalase-Peroxidases (katG), and Manganese Catalases. To uncover potential evolutionary relationships between these enzymes, we developed a bioinformatics pipeline named Bioinf-PHP that searches for protein homology and phylogeny and compares the three catalase families at the functional level based on sequence similarity. Protein motif analysis of the sequences featured in the pipeline was conducted using the MEME algorithm, and the top three significant motifs were reported for all of the catalase sequences. The Bioinf-PHP pipeline also runs BLASTP to search for homology between bacterial catalase and yeast protein sequences. As a test example, a neighbor-joining phylogenetic tree was constructed with Saccharomyces cerevisiae to infer evolutionary relationships. The structural similarities between orthologous sequences provided further evidence of functional similarity.

Authors: Michael Zhou (Skyline High School, USA); Yongsheng Bai (Next-Gen Intelligent Science Training, USA)

Email: 2021zhoumichaelh@aaps.k12.mi.us, yongshengbaicool@gmail.com

Who can view: Everyone

Webinar Price: Free

Featured Presenters

Michael Zhou

Michael Zhou is a senior at Skyline High School in Ann Arbor, Michigan. He is always looking for patterns in the world around him, which is exactly what draws him to research. Michael enjoys finding creative and practical solutions to everything from math problems to bioinformatics research. He is very excited to present as a first-time CLOUD 2020 attendee!
________________

Abstract:

Bioinformatics Pipeline for Homology Sequence Analysis using Python

Michael Zhou, Yongsheng Bai

The immense amount of publicly available bioinformatics data can be intimidating and hard to navigate. Specifically, I found the process of gathering protein sequence data and running sequence analysis through multiple bioinformatics tools needlessly time-consuming and error-prone.

Python is one of the most popular languages for data processing, and for good reason. Using Python, I created a pipeline to streamline multi-stage protein analysis. This pipeline automates many procedures that are useful to bioinformatics researchers, such as:
* sequence download
* sequence database search
* multiple sequence alignment
* phylogenetic tree construction
* 3D protein structure extraction
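
As a rough illustration of what happens right after the sequence download step, the sketch below parses FASTA text into a dictionary of sequences. It is not taken from the Bioinf-PHP code base: the function name `parse_fasta`, the record IDs, and the residues are all made up for this example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id] += line
    return records


# Hypothetical two-record input; IDs and residues are invented.
example = """>cat_A example typical catalase
MSKKLTTAAG
APVVDNNNV
>cat_B example manganese catalase
MFKHTRKLQY
"""

records = parse_fasta(example.splitlines())
for record_id, sequence in records.items():
    print(record_id, len(sequence))
```

A dictionary keyed by record ID keeps the later stages (database search, alignment, tree construction) simple, since each stage can look sequences up by name.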

We will give a live demo of the software during the poster session by conducting a homology sequence analysis between bacterial catalases and yeast proteins. The demo will consist of the following steps:

* the user provides basic sequence information
* the program downloads catalase protein sequences in FASTA format from the online catalase database (RedoxiBase) via API calls
* the program performs protein motif analysis with the MEME suite API calls
* the program performs Protein-Protein BLAST (BLASTP) sequence alignment to search for homologous sequences between baker’s yeast proteins and bacterial catalases
* the program gets protein structure information by searching against annotated 3D protein structure databases (SWISS-MODEL and ModBase)
* the program conducts Multiple Sequence Alignment (MSA) and constructs a phylogenetic tree to visualize the evolutionary relationship between bacterial catalases and yeast proteins
* the program generates a final report in PDF format.
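
The tree-construction step in the list above can be illustrated with a small, pure-Python neighbor-joining sketch. This is only a teaching example under stated assumptions: it computes simple p-distances from sequences that are already aligned, the toy sequences and their labels are invented, and it does not claim to reproduce the tools or parameters the pipeline actually uses.

```python
from itertools import combinations
from typing import Dict


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def neighbor_joining(dist: Dict[str, Dict[str, float]]) -> str:
    """Cluster taxa by neighbor joining and return the tree as Newick text."""
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    while len(d) > 2:
        n = len(d)
        totals = {a: sum(d[a][b] for b in d if b != a) for a in d}
        # Pick the pair that minimises the Q criterion.
        i, j = min(
            combinations(d, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - totals[p[0]] - totals[p[1]],
        )
        # Branch lengths from the new internal node to i and to j.
        len_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        node = f"({i}:{len_i:.4f},{j}:{len_j:.4f})"
        # Distances from the new internal node to every remaining node.
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2 for k in d if k not in (i, j)}
        del d[i], d[j]
        for row in d.values():
            del row[i], row[j]
        for k, dist_k in new_row.items():
            d[k][node] = dist_k
        d[node] = new_row
    # Two nodes remain; split the last branch at its midpoint (an arbitrary root).
    a, b = d
    half = d[a][b] / 2
    return f"({a}:{half:.4f},{b}:{half:.4f});"


# Hypothetical, already-aligned toy sequences (equal length, '-' marks a gap).
aligned = {
    "bact_1": "MKTALLVAGG",
    "bact_2": "MKTALIVAGG",
    "yeast_1": "MRSALIVSGA",
    "yeast_2": "MRS-LIVSGG",
}
matrix = {
    a: {b: p_distance(sa, sb) for b, sb in aligned.items() if b != a}
    for a, sa in aligned.items()
}
print(neighbor_joining(matrix))
```

The printed Newick string can be loaded into most tree viewers. In practice a dedicated alignment and phylogeny tool would replace both helper functions, and neighbor joining can produce negative branch lengths on noisy data, which this sketch does not correct.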

This automation greatly increases the efficiency and accuracy of protein sequence analysis. It can serve as an ideal tool for introducing bioinformatics methods to high school students and other young researchers. We’ve uploaded the program to GitHub as an open-source project for adoption and future enhancements.

----------------------------------
NOT NEEDED for PyCon poster proposal

In the age of big data, there are numerous publicly accessible online repositories that provide comprehensive protein sequence and function information. There are also various algorithms and tools that have been developed for protein sequence alignment and evolutionary studies.

Many publicly available data repositories and resources have been developed to support protein-related information management, data-driven hypothesis generation, and biological knowledge discovery. To help researchers quickly find the appropriate protein-related informatics resources, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases in this chapter. We also discuss the challenges and opportunities for developing next-generation protein bioinformatics databases and resources to support data integration and data analytics in the Big Data era.

Abstract for BigData conf paper:

Catalases are some of the most important enzymes because of their role in regulating harmful hydrogen peroxide levels in cells. There are three main families of these proteins: Typical Catalases, Catalase-Peroxidases, and Manganese Catalases. To uncover potential evolutionary links in the development and function of these catalases, we compared the three families at the functional level. In this study, catalase protein sequence data for the Bacteroidetes/Chlorobi taxonomic groups was obtained from RedoxiBase. Then, homology and protein motif analysis of these sequences was conducted using the MEME suite. Finally, the sequences were aligned against baker’s yeast proteins using the BLASTP algorithm, searching the UniProt database for homologous sequences between yeast and bacteria. All of the catalase sequences contained at least three significant motifs, suggesting that such motifs serve critical functions for these enzymes.
We have obtained [] sequences and noticed that ----- Furthermore, over half of the peroxide catalase proteins had their motifs conserved. We also checked between the families to see that there was no overlap between motifs.
We also examined the structural similarities between orthologous sequences for further evidence of functional similarity.
Finally, we ran the sequences through BLASTP and found several interesting results. All of the catalases mapped to one of two sequences. All of the catalase-peroxidases mapped to a single sequence. The other groups mapped to several different sequences. These results indicate

Different catalase classes play similar roles in their biological context, although they do not share much similarity at the sequence level. Moreover, their biological functions are similar and conserved in yeast homologs.
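
The mapping of catalases to yeast sequences described above can be tabulated along the following lines. This sketch assumes the BLASTP results are in the standard 12-column tabular layout (BLAST+ `-outfmt 6` with default fields). The helper names, the E-value cutoff of 1e-5, and every ID and number in the example rows are hypothetical, not results from the study.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, str]:
    """Return {query_id: subject_id}, keeping the highest-bitscore hit per query."""
    best: Dict[str, Tuple[float, str]] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(FIELDS, line.split("\t")))
        if float(row["evalue"]) > max_evalue:
            continue  # assumed significance cutoff; adjust as needed
        score = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or score > best[query][0]:
            best[query] = (score, row["sseqid"])
    return {query: subject for query, (_, subject) in best.items()}


def group_by_subject(hits: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert {query: subject} into {subject: [queries]}."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for query, subject in hits.items():
        groups[subject].append(query)
    return dict(groups)


# Invented example rows; none of these values come from the study.
rows = [
    ("cat_A", "yeast_1", "45.2", "480", "250", "6", "10", "489", "12", "491", "1e-120", "360"),
    ("cat_A", "yeast_2", "41.0", "470", "265", "8", "12", "481", "20", "489", "1e-100", "310"),
    ("cat_B", "yeast_2", "43.5", "475", "255", "7", "8", "482", "15", "489", "1e-110", "335"),
    ("cat_C", "yeast_3", "25.0", "60", "45", "0", "1", "60", "5", "64", "0.5", "22.1"),
]
example = "\n".join("\t".join(r) for r in rows)

hits = best_hits(example)
groups = group_by_subject(hits)
for subject, queries in groups.items():
    print(subject, queries)
```

Grouping queries by their best subject makes it easy to see how many bacterial sequences from each catalase family map to the same yeast protein.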

Hosted By

Services Society
