Sequence Region Specific Prediction of Protein Functions

OUR TEAM

Assoc. Prof. Dr.
Tunca DOĞAN

Ender SARIBAY

Sevinç EKİN

Büşra ZİLE

INTRODUCTION

Proteins play essential roles in all organisms, since they carry out virtually all of the functions performed within an organism's body. Therefore, knowing the functions carried out by proteins is essential for understanding biological organisms and for accelerating research such as drug development against diseases. Gene Ontology (GO) terms are widely used for annotating protein functions. GO terms are organized in a directed acyclic graph (DAG) data structure in which parent-child relationships exist among them: if a protein is associated with a GO term, it is also associated with all parents of that term. The problem of protein function prediction can be approached as a natural language processing (NLP) problem, where the input language is the language of proteins, consisting of letters that each uniquely correspond to a particular amino acid. For the purposes of this research, the target language consists of GO terms that describe the functions of proteins. While each amino acid letter corresponds to a token of the input (source) language, each GO term, e.g., GO:0008150, corresponds to a token of the output (target) language. Viewing the problem from this perspective enables state-of-the-art machine learning and deep learning techniques to be applied to protein function prediction. Although proteins may have many functions associated with them, the functions they carry out are usually due to particular regions of the protein rather than the whole sequence. Existing research in the literature usually focuses on predicting protein functions at the protein level, meaning that the predicted GO terms are associated with the whole protein, lacking information about which positions of the sequence are responsible for the predicted function. The aim of this research is to perform protein function prediction at both the protein level and the amino acid level; predicting functions at the amino acid level makes regional prediction of protein functions possible.
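To make this parent-child propagation concrete, the sketch below expands a protein-level annotation to include all ancestor terms. The toy DAG keeps only a single edge (GO:0006412 as a child of the BP root GO:0008150) and is illustrative only; it is not taken from the datasets used in this research.

# Minimal sketch of GO ancestor propagation (toy DAG, not real GO data).
# If a protein is annotated with a term, it is implicitly annotated with
# every ancestor of that term.

# child -> list of direct parents (a tiny, made-up slice of the GO DAG)
parents = {
    "GO:0006412": ["GO:0008150"],  # hypothetical single edge to the BP root
    "GO:0008150": [],              # biological_process root has no parent
}

def ancestors(term, parents):
    """Collect all ancestors of `term` by walking the DAG upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A protein annotated with GO:0006412 is therefore also annotated with GO:0008150.
annotation = {"GO:0006412"}
expanded = set(annotation)
for term in annotation:
    expanded |= ancestors(term, parents)
print(expanded)  # {'GO:0006412', 'GO:0008150'}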

METHODOLOGY

Within the scope of this research, two models have been implemented and tested. For regional protein function prediction, a transformer model has been utilized. The schema below shows the structure of the transformer model used in this research.

First, the input protein sequence, in which each amino acid is separated by a space character, is given to the pretrained tokenizer of the Prot T5-XL-UniRef50 model. The tokenizer has a vocabulary of 128 tokens, consisting of tokens for all amino acids plus tokens reserved for special purposes. The tokenized protein sequence is given as input to the encoder part of the model, which utilizes the pretrained Prot T5-XL-UniRef50 encoder to produce protein embeddings of size 1024. The produced embeddings are given as input to the decoder part of the model, where output probabilities are produced for each token of the target language, i.e., GO terms. Since the regional dataset on which the model has been trained consists of data points in which a GO term is provided for each amino acid, the model is propagated forward L times, where L is the length of the protein sequence, i.e., the number of amino acids in the input. At each of the L propagations, the token with the highest probability is selected and appended to the prediction. After L propagations, regional predictions of the GO terms associated with the input protein sequence are produced.
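A minimal sketch of this pipeline is given below, assuming the publicly available Rostlab/prot_t5_xl_uniref50 checkpoint from the Hugging Face transformers library. The trained decoder of this research is not reproduced here; decoder_head is a hypothetical stand-in (a single linear layer with an assumed vocabulary size) that only serves to make the greedy per-residue loop executable.

# Sketch of the encoding step and greedy per-residue decoding described above.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

sequence = "M K T A Y I A K Q R"            # amino acids separated by spaces
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = encoder(**inputs).last_hidden_state  # shape: (1, L+1, 1024)

# Stand-in decoder head: the actual decoder in this research is a trained
# transformer decoder; this placeholder only makes the loop below executable.
num_go_tokens = 1000                         # assumed target vocabulary size
decoder_head = torch.nn.Linear(1024, num_go_tokens)

L = len(sequence.split())
prediction = []
for i in range(L):
    logits = decoder_head(embeddings[0, i])  # GO-token logits for residue i
    prediction.append(int(logits.argmax()))  # greedy: highest-probability token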

For non-regional protein function prediction, a classification head model that uses a linear classification layer on top of the Prot T5-XL-UniRef50 encoder has been implemented. The schema below shows the structure of the model.

As in the case of the transformer model, the input protein sequence is fed into the pretrained Prot T5-XL-UniRef50 tokenizer, and the resulting tokenized sequence is given as input to the pretrained Prot T5-XL-UniRef50 encoder. The protein embedding produced by the encoder is fed into a linear layer. The input size of the linear layer is 1024, since the encoder generates protein embeddings of size 1024, and its output size is equal to the number of GO terms. With this structure, the classification head model computes a score for each GO term. Each score produced by the linear layer is fed into the sigmoid activation function, which maps it to a probability in the range (0, 1). A threshold is then applied to the output probability of each GO term: if the probability is greater than or equal to the threshold value, e.g., 0.5, the corresponding GO term is labeled as positive.
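The sketch below illustrates the classification head under stated assumptions: the residue embeddings are replaced by a random stand-in tensor, mean pooling over residues is assumed (the pooling operation is not specified above), and the output size is set to the MF vocabulary size reported in the results section.

# Minimal sketch of the classification head: pooled embedding -> linear -> sigmoid -> threshold.
import torch

# Stand-in for the 1024-dimensional Prot T5 encoder output of one protein of length L.
L = 120
embeddings = torch.randn(1, L, 1024)

num_go_terms = 3566                          # e.g., the MF vocabulary size reported below
classifier = torch.nn.Linear(1024, num_go_terms)

# Mean pooling over residues into one protein-level embedding is assumed here.
protein_embedding = embeddings.mean(dim=1)                    # (1, 1024)
probabilities = torch.sigmoid(classifier(protein_embedding))  # each in (0, 1)

threshold = 0.5
predicted = (probabilities >= threshold).nonzero(as_tuple=True)[1]  # positive GO term indices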

These two models have also been merged to produce both non-regional and regional function predictions, and the non-regional predictions of the merged model have also been evaluated using several metrics. The schema below shows the structure of the merged model designed within the scope of this research. As in the case of the transformer and classification head models, the input protein sequence is given to the pretrained tokenizer of the Prot T5-XL-UniRef50 model. The tokenized sequence is fed into both the classification head model, which predicts GO terms at the protein level, and the transformer model, which predicts GO terms at the amino acid level. The regional prediction produced by the transformer model is converted to a protein-level prediction by extracting the unique GO terms it contains. The extracted GO terms are then merged with the classification head predictions via a union operation, constituting the merged protein-level GO term predictions. The final output consists of these merged protein-level predictions together with the regional prediction produced by the transformer model.
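A minimal sketch of the merge step is given below; the GO terms are toy values chosen for illustration (the three GO root terms), not outputs of the trained models.

# Sketch of the merge step: collapse the per-residue (regional) prediction to a set
# of unique GO terms, then union it with the classification head's protein-level set.
regional_prediction = ["GO:0008150", "GO:0008150", "GO:0003674", "GO:0008150"]  # one term per residue (toy)
classification_prediction = {"GO:0005575", "GO:0008150"}                        # thresholded head output (toy)

merged_protein_level = set(regional_prediction) | classification_prediction
# merged_protein_level == {'GO:0008150', 'GO:0003674', 'GO:0005575'}
# The final output pairs this merged protein-level set with the per-residue
# regional prediction kept from the transformer model.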

OUR RESEARCH RESULTS

The Tests Conducted with the Classification Head Model on Non-regional Datasets


Results of Tests Conducted with the Datasets with Sum of Protein Sequence Length and the Number of Associated GO Terms Being at Most 500

The table below shows the results of the tests performed with the classification head model on the validation sets of non-regional datasets with the sum of protein sequence length and the number of associated GO terms being at most 500. The resulting datasets contain 11430 GO terms for the BP category, 3566 for the MF category, and 1793 for the CC category.

As seen in the table above, different thresholds have been used to evaluate model performance. The cell colors reflect the values: lower values are shaded more intensely red, while higher values are shaded more intensely green. For each category, the row containing the best-performing threshold is shown in bold. For the BP and MF categories, the best-performing threshold is 0.2, whereas for the CC category it is 0.25. It is worth noting that, for all categories, as the threshold decreases, precision usually decreases and recall increases. This is natural, since the threshold determines how many GO terms the classification head model predicts. As explained in the Models section, the model produces a probability for each possible GO term, and the selected threshold decides whether a GO term is associated with the input protein. As the threshold decreases, more GO terms pass the association test, so the model produces more GO terms; this increases its ability to capture the GO terms truly associated with the protein, raising recall. However, producing more GO terms also increases the risk of false positives, lowering precision. The aim is therefore to find a threshold at which neither precision nor recall is too small, maximizing the F1 score.
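The sketch below reproduces the threshold sweep logic behind the table using micro-averaged precision, recall, and F1 over all (protein, GO term) pairs; y_true and y_prob are random toy stand-ins for the validation labels and the sigmoid outputs of the classification head.

# Threshold sweep with micro-averaged precision/recall/F1 (toy data).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 50))  # toy binary labels: 100 proteins, 50 GO terms
y_prob = rng.random(size=(100, 50))          # toy predicted probabilities

for threshold in [0.15, 0.2, 0.25, 0.3, 0.4, 0.5]:
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Lower thresholds accept more terms: recall rises, precision tends to fall.
    print(f"t={threshold:.2f}  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")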

The bar graph below shows the performance of the classification head model in terms of the F1 score on the datasets with the sum of protein sequence length and the number of associated GO terms at most 500. In the blue line, the points from left to right correspond to threshold values of 0.15, 0.2, 0.25, 0.3, 0.4, and 0.5, respectively; the same threshold values apply, in the same order, to the bar graphs given below in this section and the following sections. The next bar graph shows the performance of the classification head model in terms of the recall metric with the sum of protein sequence length and the number of associated GO terms at most 500.

The bar graph below shows the performance of the classification head model in terms of the precision metric with the sum of protein sequence length and the number of associated GO terms at most 500.


The Tests Conducted with the Datasets with Sum of Protein Sequence Length and the Number of Associated GO Terms Being at Most 1000

The table below shows the results of the tests performed with the classification head model on the validation sets of non-regional datasets with the sum of protein sequence length and the number of associated GO terms being at most 1000. The resulting datasets contain 13352 GO terms for the BP category, 3946 for the MF category, and 2019 for the CC category.

As seen in the table above, different thresholds have been used to evaluate model performance. The cell colors reflect the values: lower values are shaded more intensely red, while higher values are shaded more intensely green. For each category, the row containing the best-performing threshold is shown in bold. For the MF and CC categories, the best-performing threshold is 0.2, whereas for the BP category it is 0.15. The table shows that the model performs best on the MF category and worst on the BP category. Notably, even though the MF dataset contains far more GO terms than the CC dataset, the model still performs better on the MF category, although a model would normally be expected to perform better on a category with fewer GO terms. A possible reason for this deviation is that molecular function GO terms are more straightforward; in other words, they have less complex associations with protein sequences. Therefore, due to the nature of proteins and the categories of their functions, the model performs best on the MF category. The bar graph below shows the performance of the classification head model in terms of the F1 score on the datasets with the sum of protein sequence length and the number of associated GO terms at most 1000.

The bar graph below shows the performance of the classification head model in terms of the recall metric on the datasets with the sum of protein sequence length and the number of associated GO terms at most 1000.

The bar graph below shows the performance of the classification head model in terms of the precision metric on the datasets with the sum of protein sequence length and the number of associated GO terms at most 1000.

The Tests Conducted with the Transformer Model on Regional Datasets


The Tests Conducted with the Datasets with Protein Sequence Length Being at Most 250

The bar graph below shows the metric results obtained for all three GO categories, namely BP, MF, and CC, on the datasets with protein sequence length being at most 250. The resulting datasets contain 875 GO terms for the BP category, 723 for the MF category, and 298 for the CC category.


The Tests Conducted with the Datasets with Protein Sequence Length Being at Most 1000

The bar graph below shows the metric results obtained for all three GO categories, namely BP, MF, and CC, on the datasets with protein sequence length being at most 1000. The resulting datasets contain 1341 GO terms for the BP category, 1422 for the MF category, and 498 for the CC category.

The Tests Conducted with the Merged Model on Non-regional Datasets


The Tests Conducted with the Datasets with Sum of Protein Sequence Length and the Number of Associated GO Terms at Most 500

The table below shows the results of the tests conducted with the merged model on the datasets with the sum of protein sequence length and the number of associated GO terms being at most 500. For each category, the best-performing threshold, determined from the tests conducted with the classification head model on the non-regional datasets, has been used.

When the results above are compared to those obtained with the classification head model on the non-regional datasets where the sum of protein sequence length and the number of associated GO terms per data point is at most 500, the performance for the MF and CC categories significantly surpasses that of the classification head model, which is expected, since the merged model combines two models to produce better results. However, the performance for the BP category is slightly lower. A likely reason is that the biological process category is the most complex of the three: the biological processes a protein participates in depend on many more factors than the functions in the other categories, since the BP category encompasses a much broader context in defining the scope of a protein's functions.


The Tests Conducted with the Datasets with Sum of Protein Sequence Length and the Number of Associated GO Terms at Most 1000

The table below shows the results of the tests conducted with the merged model on the datasets with the sum of protein sequence length and the number of associated GO terms being at most 1000. For each category, the best-performing threshold, determined from the tests conducted with the classification head model on the non-regional datasets, has been used.

When the results above are compared to those obtained with the classification head model on the non-regional datasets where the sum of protein sequence length and the number of associated GO terms per data point is at most 1000, the performance for the BP, MF, and CC categories significantly surpasses that of the classification head model, which is expected, since the merged model combines two models to produce better results. It is worth noting that even though the metrics are expected to decrease when the sum limit is increased, the performance of the model on the BP category remains nearly the same, especially in terms of F1 score, compared to the previous result table. This suggests that the merged model resists performance degradation up to a certain limit, indicating its robustness.