Machine learning-based protein annotation tool predicts protein function

Snekmer is an application for building and searching protein family models and novel sequence clusters. Credit: Jason McDermott, Pacific Northwest National Laboratory

Microbes drive key processes of life on Earth. They affect global elemental cycles (the movement of carbon, nitrogen, and other elements), promote plant growth, and influence the development of diseases. These roles are essential in every ecosystem. Research constantly adds to the databases of microbial DNA sequences, but the sequences alone do not reveal everything about the biology of the proteins they encode.

To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a fuller understanding of the function of proteins and other molecules. Scientists infer the function of a protein by comparing it with reference databases of already characterized proteins.

However, these comparisons are difficult and do not scale to massive databases. To address this challenge, scientists have applied machine learning to build models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model families of proteins.

Studying proteins in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is incorporated into the DOE KBase framework as a new application that will allow users to annotate their genome and metagenome sequences.

This will help scientists to better model the effects of engineering microbes. That includes these microbes’ effect on the climate and their benefits for crop health and bioproduction. Snekmer will also help scientists study the evolution of microbes and patterns in microbiomes.

Current methods cannot predict a function for 30 to 50% of bacterial protein sequences, which is a significant barrier to understanding complex systems such as soil microbiomes. Most protocols rely on pairwise alignments, which are becoming computationally intractable and harder to interpret as databases expand.

For alignment-based models of protein families, sensitivity and accuracy depend on the initial training sets, which risk obsolescence as additional sequence diversity is discovered. Many bacterial proteins either have no functional assignment or are assigned only a general function based solely on taxonomic understanding.

To address this need, researchers at Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer. The software tool exploits the redundancy of amino acid residue properties to reduce sequence space, and it uses short protein subsequences (kmers) as features for machine learning to generate protein family models.
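The recoding idea can be illustrated with a minimal Python sketch. The three-group alphabet, the function names, and the example sequence below are assumptions made for illustration and are not taken from Snekmer. Residues with broadly similar properties are collapsed into one letter, and overlapping kmers are then counted on the recoded string.

```python
from collections import Counter

# Hypothetical three-letter reduced alphabet (illustrative only;
# not one of Snekmer's own recoding schemes).
GROUPS = {
    "h": "AVLIMFWC",  # treated here as hydrophobic
    "p": "STNQYGP",   # treated here as polar or small
    "c": "DEKRH",     # treated here as charged
}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Map each residue to its group letter; unknown residues become 'x'."""
    return "".join(RECODE.get(aa, "x") for aa in sequence.upper())


def kmer_counts(sequence, k=3):
    """Count overlapping kmers of length k in the recoded sequence."""
    reduced = recode(sequence)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


# Made-up example sequence, not a real protein.
counts = kmer_counts("MKVLAAGIDE", k=3)
```

With three letters and k = 3 there are only 27 possible kmers, compared with 8,000 for the full 20-letter amino acid alphabet, which is the sense in which recoding reduces sequence space.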

Snekmer users can recode protein sequences into reduced-alphabet kmer vectors, build supervised classification models trained on input protein families, or classify protein function using existing Snekmer models.
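As a rough illustration of supervised classification on kmer vectors, the sketch below trains one kmer profile per family and assigns a new sequence to the most similar profile by cosine similarity. This nearest-profile classifier is a simple stand-in chosen for brevity, not a description of Snekmer's own models. The reduced alphabet, function names, family labels, and sequences are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical reduced alphabet (illustrative only, not from Snekmer).
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGP", "c": "DEKRH"}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def kmer_vector(sequence, k=2):
    """Recode a sequence and count its overlapping kmers."""
    reduced = "".join(RECODE.get(aa, "x") for aa in sequence.upper())
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)  # missing keys count as 0
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(families, k=2):
    """Sum kmer counts over each family's sequences to get one profile per family."""
    profiles = {}
    for name, sequences in families.items():
        total = Counter()
        for seq in sequences:
            total.update(kmer_vector(seq, k))
        profiles[name] = total
    return profiles


def classify(sequence, profiles, k=2):
    """Return the family whose profile is most similar to the sequence."""
    vec = kmer_vector(sequence, k)
    return max(profiles, key=lambda name: cosine(vec, profiles[name]))


# Toy, made-up training data: labels and sequences are invented.
families = {
    "family_a": ["AVLIVLAAVL", "LLVAIVLLAV"],
    "family_b": ["DEKRDEKKRE", "KKDERDEEKR"],
}
profiles = train(families)
label = classify("VVLAALIVAL", profiles)
```

Because cosine similarity ignores vector length, summing counts per family gives the same ranking as averaging them. A real workflow would use curated families and a model evaluated on held-out sequences.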

The study is published in the journal Bioinformatics Advances.

More information:
Christine H Chang et al, Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding, Bioinformatics Advances (2023). DOI: 10.1093/bioadv/vbad005

Provided by
US Department of Energy


Citation:
Machine learning-based protein annotation tool predicts protein function (2023, June 1)
retrieved 1 June 2023
from https://phys.org/news/2023-06-machine-learning-based-protein-annotation-tool.html

