Zero/Few-Shot Dark Kinase-Phosphosite Association Prediction with Biologically Grounded Data Augmentation
Mert Pekey
Computer Science and Engineering, MSc. Thesis, 2025
Thesis Jury
Assoc. Prof. Öznur Taştan Okan (Thesis Advisor),
Asst. Prof. Dilara Keküllüoğlu, Assoc. Prof. A. Ercüment Çiçek
Date & Time: 1st July, 2025 – 1.40 PM
Place: FENS G029
Keywords: Protein Sequence Classification, Zero/Few-Shot Learning, Phosphorylation, Dark Kinases, Post-translational Modifications, Conditional Generative Models, Data Augmentation
Abstract
Protein phosphorylation, a fundamental cellular process mediated by kinases, is crucial for signaling, and its dysregulation is implicated in numerous human diseases. A significant challenge persists in identifying substrate phosphosites for the vast number of understudied 'dark' kinases, for which conventional supervised machine learning methods are ineffective due to data scarcity. To address this gap, this thesis develops a zero- and few-shot learning framework and introduces biologically grounded data augmentation strategies, all evaluated on the DARKIN benchmark.
We introduce two novel deep learning architectures: DARKIN-FT, a compatibility-based model that enhances performance through end-to-end fine-tuning of the phosphosite encoder, and DARKIN-Interact, a binary classification model that directly captures kinase–substrate interactions via joint attention over sequence pairs. The central contribution is a systematic investigation into biologically grounded data augmentation, evaluating three distinct strategies: (i) kinase-conditional phosphosite generation via a fine-tuned ProGen2 model, (ii) weak supervision using predictions from the Kinase Substrate Specificity Atlas (KSSA), and (iii) augmentation with homologous sequences.
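As a rough illustration of what "compatibility-based" means in a zero-shot setting, the sketch below scores a phosphosite embedding against every candidate kinase embedding and ranks the kinases by score. The bilinear form, the function names, and the toy embeddings are assumptions made for illustration only; the abstract does not specify how DARKIN-FT computes compatibility.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def compatibility(site_emb: Vector, W: Matrix, kinase_emb: Vector) -> float:
    """Bilinear score site_emb^T W kinase_emb (assumed form, not taken from the thesis)."""
    return sum(
        s * W[i][j] * k
        for i, s in enumerate(site_emb)
        for j, k in enumerate(kinase_emb)
    )


def rank_kinases(site_emb: Vector, W: Matrix, kinase_embs: Dict[str, Vector]) -> List[str]:
    """Return kinase names sorted from most to least compatible with the phosphosite."""
    scores = {name: compatibility(site_emb, W, emb) for name, emb in kinase_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: hypothetical 2-d embeddings and an identity compatibility matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
site = [1.0, 0.0]
kinases = {"KIN_A": [0.9, 0.1], "KIN_B": [0.1, 0.9]}
ranking = rank_kinases(site, W, kinases)
```

Because the score depends only on the kinase embedding, a kinase that has no training phosphosites can still be ranked at test time, which is what makes this style of model suitable for the zero-shot regime.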
Our results demonstrate that DARKIN-FT and DARKIN-Interact significantly outperform existing baselines on the DARKIN benchmark. The investigation into data augmentation yielded mixed results: kinase-conditional generation with ProGen2 and weak labeling with KSSA degraded performance, whereas augmentation with homologous sequences improved the Macro Average Precision of the DARKIN-Interact model. These results are promising, but challenges persist in disambiguating kinases with high sequence similarity.
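For readers unfamiliar with the metric named above, here is a minimal sketch of macro-averaged average precision in its common definition: average precision is computed per kinase class over the ranked phosphosites, then averaged with equal weight per class. The function names and toy data are hypothetical, and the exact evaluation protocol of the DARKIN benchmark (for example, how classes without positives are handled) may differ.

```python
from typing import List, Optional


def average_precision(labels: List[int], scores: List[float]) -> Optional[float]:
    """Mean of precision@k taken at each rank k where a positive occurs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Undefined when the class has no positive example.
    return sum(precisions) / len(precisions) if precisions else None


def macro_average_precision(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-class AP; classes without positives are skipped (an assumption)."""
    n_classes = len(y_true[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in y_true], [row[c] for row in y_score])
        if ap is not None:
            aps.append(ap)
    if not aps:
        raise ValueError("no class has a positive example")
    return sum(aps) / len(aps)


# Toy data: 3 phosphosites (rows) x 2 kinase classes (columns).
y_true = [[1, 0], [0, 1], [1, 0]]
y_score = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.1]]
macro_ap = macro_average_precision(y_true, y_score)
```

Macro averaging gives each kinase the same weight regardless of how many phosphosites it has, so rarely observed kinases influence the score as much as well-studied ones.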
Overall, this thesis establishes a framework for kinase–phosphosite interaction prediction in low-data regimes and provides valuable insights into the strengths and limitations of data augmentation in the dark kinase-phosphosite association task.