Finetuned DNA Language Model Based- Classifiers Captures Significant Enzymatic Activity from Metagenomic Datasets Open Access
Zheng, Weiyang (Spring 2025)
Abstract
The surge of metagenomic sequencing data demands functional annotation methods that move beyond traditional homology-based approaches. In this study, we used REBEAN (Read Embedding-Based Enzyme Annotator), a fine-tuned DNA language model designed to predict enzymatic activity directly from raw nucleotide sequences, and developed two classifiers, REBEAN-Halo and REBEAN-Nitro, targeting halogenase and nitrogenase functions, respectively. REBEAN-Halo identified functionally important regions within known halogenases and detected 92 candidate novel halogenases in marine metagenomes. REBEAN-Nitro, though undertrained, predicted higher nitrogenase activity in unfertilized agricultural soils than in fertilized ones, in line with ecological expectations. Both models highlight REBEAN's potential to uncover functionally relevant but sequence-divergent enzymes in complex metagenomic datasets, offering a tool for enzyme discovery and microbiome functional profiling.
Table of Contents
Introduction
Results and Discussion
Methods
Bibliography
About this Honors Thesis
School | |
---|---|
Department | |
Degree | |
Submission | |
Language | |
Research Field | |
Keyword | |
Committee Chair / Thesis Advisor | |
Committee Members | |
Primary PDF
Title | Date Uploaded |
---|---|
Finetuned DNA Language Model Based- Classifiers Captures Significant Enzymatic Activity from Metagenomic Datasets | 2025-04-21 15:42:53 -0400 |