FunSiNN: Predicting Functional Similarity of Protein Pairs
Open Access
Guintu, Frederic (Spring 2025)
Abstract
The rapid growth of unannotated protein sequence data presents a major challenge
for functional annotation. This thesis proposes a scalable approach for predicting
functional similarity between protein pairs using ProtT5 protein language model (pLM) embeddings
and a Siamese Neural Network (SNN). We first generate a similarity network by
thresholding cosine similarity scores between embeddings, followed by subsampled
Louvain clustering to produce functionally similar groups. A refinement step further
improves cluster granularity. From these clusters, we sample labeled protein pairs to
train the SNN, which learns to classify whether two proteins share function.
Experimental results show that refined clusters yield stronger labels and improved
predictive performance (F1 = 0.7670, AUC = 0.8472). Our findings demonstrate the
potential of combining pLM embeddings, unsupervised clustering, and deep learning
to enable large-scale protein function prediction in the absence of curated annotations.
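
The first and last labeling steps described in the abstract (thresholding cosine similarity to build a similarity network, then sampling labeled pairs from clusters) might be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the function names, the toy 3-dimensional vectors standing in for ProtT5 embeddings, the threshold of 0.9, and the hand-written cluster assignment standing in for the Louvain clustering output are all assumptions made for the example.

```python
import itertools
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(embeddings, threshold):
    """Return pairs of protein IDs whose cosine similarity meets the threshold."""
    edges = []
    items = sorted(embeddings.items())
    for (id_a, vec_a), (id_b, vec_b) in itertools.combinations(items, 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            edges.append((id_a, id_b))
    return edges


def sample_labeled_pairs(clusters, n_pairs, seed=0):
    """Sample (id_a, id_b, label) triples; label 1 = same cluster, 0 = different."""
    rng = random.Random(seed)
    cluster_of = {pid: cid for cid, members in clusters.items() for pid in members}
    ids = sorted(cluster_of)
    pairs = []
    for _ in range(n_pairs):
        id_a, id_b = rng.sample(ids, 2)  # two distinct proteins
        label = int(cluster_of[id_a] == cluster_of[id_b])
        pairs.append((id_a, id_b, label))
    return pairs


# Toy vectors standing in for ProtT5 embeddings (made-up values).
toy_embeddings = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.0, 0.9, 0.1],
}
edges = similarity_edges(toy_embeddings, threshold=0.9)

# Hypothetical cluster assignment, standing in for the clustering step's output.
toy_clusters = {0: ["P1", "P2"], 1: ["P3", "P4"]}
pairs = sample_labeled_pairs(toy_clusters, n_pairs=5)
```

In this sketch the edge list would feed a community detection step, and the sampled triples would serve as training labels for a pairwise classifier such as the SNN. The all-pairs loop is quadratic in the number of proteins, so a large dataset would need a more scalable similarity search.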
Table of Contents
1 Introduction
2 Background
2.1 Protein Function Prediction Methods
2.2 Protein Database
2.3 Transfer Learning
3 Approach
3.1 Protein Sequence Dataset
3.2 ProtT5 Embeddings
3.3 Similarity Metric
3.3.1 Cosine Similarity Threshold Calculation
3.3.2 Threshold Search
3.3.3 Clustering
3.3.4 Subsampled Clustering Method
3.4 SNN Architecture
3.4.1 Evaluation Metrics
4 Experiments and Analysis
4.1 Working Environment
4.2 Results and Analysis
4.2.1 Threshold For Cosine Similarity
4.2.2 Clustering Results
4.2.3 Assessing Clustering Quality
4.2.4 SNN Results
5 Analysis
5.1 Limitations and Future Directions
5.1.1 Overfitting and Model Stability
5.1.2 Clustering Label Improvements
5.1.3 Architectural Improvements for Prediction
6 Conclusion
A Appendix
A.1 Cosine Similarity Threshold Experiments
A.2 Clustering Experiments
A.3 SNN Experiments
A.3.1 SNN Experiment 1
A.3.2 SNN Experiment 2
Bibliography
About this Honors Thesis
| School | |
|---|---|
| Department | |
| Degree | |
| Submission | |
| Language | |
| Research Field | |
| Keyword | |
| Committee Chair / Thesis Advisor | |
| Committee Members | |
Primary PDF
| Title | Date Uploaded |
|---|---|
| FunSiNN: Predicting Functional Similarity of Protein Pairs | 2025-04-08 14:35:44 -0400 |