FunSiNN: Predicting Functional Similarity of Protein Pairs

Guintu, Frederic (Spring 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/b2773x15w?locale=pt-BR
Published

Abstract

The rapid growth of unannotated protein sequence data presents a major challenge for functional annotation. This thesis proposes a scalable approach for predicting functional similarity between protein pairs using ProtT5 language model embeddings and a Siamese Neural Network (SNN). We first generate a similarity network by thresholding cosine similarity scores between embeddings, followed by subsampled Louvain clustering to produce functionally similar groups. A refinement step further improves cluster granularity. From these clusters, we sample labeled protein pairs to train the SNN, which learns to classify whether two proteins share function. Experimental results show that refined clusters yield stronger labels and improved predictive performance (F1 = 0.7670, AUC = 0.8472). Our findings demonstrate the potential of combining pLM embeddings, unsupervised clustering, and deep learning to enable large-scale protein function prediction in the absence of curated annotations.
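The label-generation steps named in the abstract (threshold cosine similarity, group proteins, sample labeled pairs) can be illustrated with a minimal sketch. This is not the thesis code. It makes several assumptions: toy 3-dimensional vectors stand in for ProtT5 embeddings, the threshold of 0.9 is arbitrary, plain connected components stand in for subsampled Louvain clustering to keep the example dependency-free, and all function names are hypothetical.

```python
import itertools
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(embeddings, threshold):
    """Return pairs (i, j) whose cosine similarity is at least `threshold`."""
    ids = sorted(embeddings)
    return [
        (i, j)
        for i, j in itertools.combinations(ids, 2)
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold
    ]


def connected_components(nodes, edges):
    """Group nodes with union-find (assumed stand-in for Louvain clustering)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, j in edges:
        parent[find(i)] = find(j)
    return {n: find(n) for n in nodes}


def labeled_pairs(cluster_of, n_pairs, seed=0):
    """Sample pairs labeled 1 (same cluster) or 0 (different clusters)."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(sorted(cluster_of), 2))
    sample = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    return [(a, b, int(cluster_of[a] == cluster_of[b])) for a, b in sample]


# Toy "embeddings" (assumption); real ProtT5 embeddings are much larger.
toy_embeddings = {
    "P1": [1.0, 0.1, 0.0],
    "P2": [0.9, 0.2, 0.0],
    "P3": [0.0, 1.0, 0.1],
    "P4": [0.1, 0.9, 0.0],
}
edges = similarity_edges(toy_embeddings, threshold=0.9)
clusters = connected_components(toy_embeddings, edges)
pairs = labeled_pairs(clusters, n_pairs=6)
```

In this toy setup, pairs drawn from the same group would serve as positive training examples for a Siamese network and pairs drawn across groups as negatives. A community-detection method such as Louvain can split a connected component into several groups, which simple connected components cannot do.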

Table of Contents

1 Introduction
2 Background
2.1 Protein Function Prediction Methods
2.2 Protein Database
2.3 Transfer Learning
3 Approach
3.1 Protein Sequence Dataset
3.2 ProtT5 Embeddings
3.3 Similarity Metric
3.3.1 Cosine Similarity Threshold Calculation
3.3.2 Threshold Search
3.3.3 Clustering
3.3.4 Subsampled Clustering Method
3.4 SNN Architecture
3.4.1 Evaluation Metrics
4 Experiments and Analysis
4.1 Working Environment
4.2 Results and Analysis
4.2.1 Threshold for Cosine Similarity
4.2.2 Clustering Results
4.2.3 Assessing Clustering Quality
4.2.4 SNN Results
5 Analysis
5.1 Limitations and Future Directions
5.1.1 Overfitting and Model Stability
5.1.2 Clustering Label Improvements
5.1.3 Architectural Improvements for Prediction
6 Conclusion
A Appendix
A.1 Cosine Similarity Threshold Experiments
A.2 Clustering Experiments
A.3 SNN Experiments
A.3.1 SNN Experiment 1
A.3.2 SNN Experiment 2
Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
