Jay | publications

working papers

2025

Have protein-ligand co-folding methods moved beyond memorisation?

Peter Škrinjar, Jérôme Eberhardt, Janani Durairaj, and Torsten Schwede

Deep learning has driven major breakthroughs in protein structure prediction; however, the next critical advance is accurately predicting how proteins interact with other molecules, especially small-molecule ligands, to enable real-world applications such as drug discovery and design. Recent deep learning all-atom methods have been built to address this challenge, but evaluating their performance on the prediction of protein-ligand complexes has been inconclusive due to the lack of relevant benchmarking datasets. Here we present a comprehensive evaluation of four leading all-atom co-folding deep learning methods using our newly introduced benchmark dataset Runs N’ Poses, which comprises 2,600 high-resolution protein-ligand systems released after the training cutoff used by these methods. We demonstrate that current co-folding approaches largely memorise ligand poses from their training data, hindering their use for de novo drug design. This limitation is especially pronounced for ligands that have only been seen binding in one pocket, whereas more promiscuous ligands such as cofactors show moderately improved performance. With this work and benchmark dataset, we aim to accelerate progress in the field by allowing for a more realistic assessment of the current state-of-the-art deep learning methods for predicting protein-ligand interactions.

journal articles

2024

Bioinformatics
Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone

Lorenzo Pantolini, Gabriel Studer, Joana Pereira, Janani Durairaj, Gerardo Tauriello, and Torsten Schwede

In Bioinformatics

Motivation: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a “semantic meaning” of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. Results: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods and other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone. Availability and implementation: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
@article{pantolini2024embedding, title = {Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone}, author = {Pantolini, Lorenzo and Studer, Gabriel and Pereira, Joana and Durairaj, Janani and Tauriello, Gerardo and Schwede, Torsten}, journal = {Bioinformatics}, pages = {btad786}, year = {2024}, publisher = {Oxford University Press}, month = jan, doi = {10.1093/bioinformatics/btad786}, }
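The general idea of pairing per-residue embeddings with dynamic programming can be illustrated with a minimal sketch. This is not the EBA implementation: cosine similarity, the linear gap penalty of -0.5, and the function names `similarity_matrix` and `global_align` are all assumptions chosen for illustration, and the published method may score residue pairs differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(emb_a, emb_b):
    """Residue-by-residue similarity (rows: protein A, columns: protein B)."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]


def global_align(sim, gap=-0.5):
    """Needleman-Wunsch over a precomputed similarity matrix.

    Returns the alignment score and the aligned index pairs (i, j).
    The gap penalty is an illustrative assumption, not a tuned value.
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = gap * i
    for j in range(1, m + 1):
        score[0][j] = gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair residues i, j
                score[i - 1][j] + gap,                    # gap in protein B
                score[i][j - 1] + gap,                    # gap in protein A
            )
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + sim[i - 1][j - 1]
        if math.isclose(score[i][j], diag, abs_tol=1e-9):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(score[i][j], score[i - 1][j] + gap, abs_tol=1e-9):
            i -= 1
        else:
            j -= 1
    return score[n][m], pairs[::-1]
```

In practice the embeddings would come from a protein language model; here any list of equal-length numeric vectors per residue works as input.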
mSphere
Structural implications of BK polyomavirus sequence variations in the major viral capsid protein Vp1 and large T-antigen: a computational study

Janani Durairaj, Océane M Follonier, Karoline Leuzinger, Leila T Alexander, Maud Wilhelm, Joana Pereira, Caroline A Hillenbrand, Fabian H Weissbach, Torsten Schwede, and Hans H Hirsch

In mSphere

BK polyomavirus (BKPyV) is a double-stranded DNA virus causing nephropathy, hemorrhagic cystitis, and urothelial cancer in transplant patients. The BKPyV-encoded capsid protein Vp1 and large T-antigen (LTag) are key targets of neutralizing antibodies and cytotoxic T-cells, respectively. Our single-center data suggested that variability in Vp1 and LTag may contribute to failing BKPyV-specific immune control, and impact vaccine design. We therefore analyzed all available entries in GenBank (1516 VP1; 742 LTAG) and explored potential structural effects using computational approaches. BKPyV-genotype (gt)1 was found in 71.18% of entries, followed by BKPyV-gt4 (19.26%), BKPyV-gt2 (8.11%) and BKPyV-gt3 (1.45%), but rates differed according to country and specimen type. Vp1-mutations matched a serotype different from the assigned one or were serotype-independent in 43%; 18% affected more than one amino acid. Notable Vp1-mutations altered antibody-binding domains, interactions with sialic acid receptors, or were predicted to change conformation. LTag-sequences were more conserved, with only 16 mutations detectable in more than one entry and without significant effects on LTag-structure or interaction domains. However, LTag changes were predicted to affect HLA-class I presentation of immunodominant 9mers to cytotoxic T-cells. These global data strengthen single-center observations and specifically our earlier findings revealing mutant 9mer epitopes conferring immune escape from HLA-I cytotoxic T-cells. We conclude that variability of BKPyV-Vp1 and LTag may have important implications for diagnostic assays assessing BKPyV-specific immune control and for vaccine design.
@article{durairaj2024structural, title = {Structural implications of BK polyomavirus sequence variations in the major viral capsid protein Vp1 and large T-antigen: a computational study}, author = {Durairaj, Janani and Follonier, Oc{\'e}ane M and Leuzinger, Karoline and Alexander, Leila T and Wilhelm, Maud and Pereira, Joana and Hillenbrand, Caroline A and Weissbach, Fabian H and Schwede, Torsten and Hirsch, Hans H}, journal = {{mSphere}}, volume = {9}, number = {4}, pages = {e00799--23}, year = {2024}, publisher = {Am Soc Microbiol}, doi = {10.1128/msphere.00799-23}, month = mar }

2023

Nature
Uncovering new families and folds in the natural protein universe

Janani Durairaj, Andrew M Waterhouse, Toomas Mets, Tetiana Brodiazhenko, Minhal Abdullah, Gabriel Studer, Gerardo Tauriello, Mehmet Akdel, Antonina Andreeva, Alex Bateman, Tanel Tenson, Vasili Hauryliuk, Torsten Schwede, and Joana Pereira

In Nature

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this "dark matter" of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network. By searching for novelties from sequence, structure, and semantic perspectives, we uncovered the beta-flower fold, added multiple protein families to the Pfam database, and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating, and prioritising novel protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
@article{durairaj2023uncovering, title = {Uncovering new families and folds in the natural protein universe}, author = {Durairaj, Janani and Waterhouse, Andrew M and Mets, Toomas and Brodiazhenko, Tetiana and Abdullah, Minhal and Studer, Gabriel and Tauriello, Gerardo and Akdel, Mehmet and Andreeva, Antonina and Bateman, Alex and Tenson, Tanel and Hauryliuk, Vasili and Schwede, Torsten and Pereira, Joana}, journal = {Nature}, pages = {1--3}, year = {2023}, publisher = {Nature Publishing Group UK London}, doi = {10.1038/s41586-023-06622-3}, }
NRDD
Artificial Intelligence for Natural Product Drug Discovery

Michael W. Mullowney, Katherine R. Duncan, Somayah S. Elsayed, Neha Garg, Justin J. J. van der Hooft, Nathaniel I. Martin, David Meijer, Barbara R. Terlouw, Friederike Biermann, Kai Blin, Janani Durairaj, Marina Gorostiola González, Eric J. N. Helfrich, Florian Huber, Stefan Leopold-Messer, Kohulan Rajan, Tristan de Rond, Jeffrey A. van Santen, Maria Sorokina, Marcy J. Balunas, Mehdi A. Beniddir, Doris A. van Bergeijk, Laura M. Carroll, Chase M. Clark, Djork-Arné Clevert, Chris A. Dejong, Chao Du, Scarlet Ferrinho, Francesca Grisoni, Albert Hofstetter, Willem Jespers, Olga V. Kalinina, Satria A. Kautsar, Hyunwoo Kim, Tiago F. Leao, Joleen Masschelein, Evan R. Rees, Raphael Reher, Daniel Reker, Philippe Schwaller, Marwin Segler, Michael A. Skinnider, Allison S. Walker, Egon L. Willighagen, Barbara Zdrazil, Nadine Ziemert, Rebecca J. M. Goss, Pierre Guyomard, Andrea Volkamer, William H. Gerwick, Hyun Uk Kim, Rolf Müller, Gilles P. van Wezel, Gerard J. P. van Westen, Anna K. H. Hirsch, Roger G. Linington, Serina L. Robinson, and Marnix H. Medema

In Nature Reviews Drug Discovery

Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.
@article{mullowneyArtificialIntelligenceNatural2023, title = {Artificial Intelligence for Natural Product Drug Discovery}, author = {Mullowney, Michael W. and Duncan, Katherine R. and Elsayed, Somayah S. and Garg, Neha and {van der Hooft}, Justin J. J. and Martin, Nathaniel I. and Meijer, David and Terlouw, Barbara R. and Biermann, Friederike and Blin, Kai and Durairaj, Janani and Gorostiola Gonz{\'a}lez, Marina and Helfrich, Eric J. N. and Huber, Florian and {Leopold-Messer}, Stefan and Rajan, Kohulan and {de Rond}, Tristan and {van Santen}, Jeffrey A. and Sorokina, Maria and Balunas, Marcy J. and Beniddir, Mehdi A. and {van Bergeijk}, Doris A. and Carroll, Laura M. and Clark, Chase M. and Clevert, Djork-Arn{\'e} and Dejong, Chris A. and Du, Chao and Ferrinho, Scarlet and Grisoni, Francesca and Hofstetter, Albert and Jespers, Willem and Kalinina, Olga V. and Kautsar, Satria A. and Kim, Hyunwoo and Leao, Tiago F. and Masschelein, Joleen and Rees, Evan R. and Reher, Raphael and Reker, Daniel and Schwaller, Philippe and Segler, Marwin and Skinnider, Michael A. and Walker, Allison S. and Willighagen, Egon L. and Zdrazil, Barbara and Ziemert, Nadine and Goss, Rebecca J. M. and Guyomard, Pierre and Volkamer, Andrea and Gerwick, William H. and Kim, Hyun Uk and M{\"u}ller, Rolf and {van Wezel}, Gilles P. and {van Westen}, Gerard J. P. and Hirsch, Anna K. H. and Linington, Roger G. and Robinson, Serina L. and Medema, Marnix H.}, year = {2023}, month = sep, journal = {Nature Reviews Drug Discovery}, issn = {1474-1784}, doi = {10.1038/s41573-023-00774-7}, }
Proteins
Protein target highlights in CASP15: Analysis of models by structure providers

Leila T. Alexander*, Janani Durairaj*, Andriy Kryshtafovych, Luciano A. Abriata, Yusupha Bayo, Gira Bhabha, Cécile Breyton, Simon G. Caulton, James Chen, Séraphine Degroux, Damian C. Ekiert, Benedikte S. Erlandsen, Peter L. Freddolino, Dominic Gilzer, Chris Greening, Jonathan M. Grimes, Rhys Grinter, Manickam Gurusaran, Marcus D. Hartmann, Charlie J. Hitchman, Jeremy R. Keown, Ashleigh Kropp, Petri Kursula, Andrew L. Lovering, Bruno Lemaitre, Andrea Lia, Shiheng Liu, Maria Logotheti, Shuze Lu, Sigurbjörn Markússon, Mitchell D. Miller, George Minasov, Hartmut H. Niemann, Felipe Opazo, George N. Phillips Jr, Owen R. Davies, Samuel Rommelaere, Monica Rosas-Lemus, Pietro Roversi, Karla Satchell, Nathan Smith, Mark A. Wilson, Kuan-Lin Wu, Xian Xia, Han Xiao, Wenhua Zhang, Z. Hong Zhou, Krzysztof Fidelis, Maya Topf, John Moult, and Torsten Schwede

In Proteins: Structure, Function, and Bioinformatics

We present an in-depth analysis of selected CASP15 targets, focusing on their biological and functional significance. The authors of the structures identify and discuss key protein features and evaluate how effectively these aspects were captured in the submitted predictions. While the overall ability to predict three-dimensional protein structures continues to impress, reproducing uncommon features not previously observed in experimental structures is still a challenge. Furthermore, instances with conformational flexibility and large multimeric complexes highlight the need for novel scoring strategies to better emphasize biologically relevant structural regions. Looking ahead, closer integration of computational and experimental techniques will play a key role in determining the next challenges to be unraveled in the field of structural molecular biology.
@article{alexanderProteinTargetHighlights, title = {Protein target highlights in {CASP15}: {Analysis} of models by structure providers}, volume = {n/a}, copyright = {© 2023 The Authors. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC.}, issn = {1097-0134}, shorttitle = {Protein target highlights in {CASP15}}, doi = {10.1002/prot.26545}, language = {en}, number = {n/a}, urldate = {2023-10-10}, year = {2023}, journal = {Proteins: Structure, Function, and Bioinformatics}, author = {Alexander*, Leila T. and Durairaj*, Janani and Kryshtafovych, Andriy and Abriata, Luciano A. and Bayo, Yusupha and Bhabha, Gira and Breyton, Cécile and Caulton, Simon G. and Chen, James and Degroux, Séraphine and Ekiert, Damian C. and Erlandsen, Benedikte S. and Freddolino, Peter L. and Gilzer, Dominic and Greening, Chris and Grimes, Jonathan M. and Grinter, Rhys and Gurusaran, Manickam and Hartmann, Marcus D. and Hitchman, Charlie J. and Keown, Jeremy R. and Kropp, Ashleigh and Kursula, Petri and Lovering, Andrew L. and Lemaitre, Bruno and Lia, Andrea and Liu, Shiheng and Logotheti, Maria and Lu, Shuze and Markússon, Sigurbjörn and Miller, Mitchell D. and Minasov, George and Niemann, Hartmut H. and Opazo, Felipe and Phillips Jr, George N. and Davies, Owen R. and Rommelaere, Samuel and Rosas-Lemus, Monica and Roversi, Pietro and Satchell, Karla and Smith, Nathan and Wilson, Mark A. and Wu, Kuan-Lin and Xia, Xian and Xiao, Han and Zhang, Wenhua and Zhou, Z. Hong and Fidelis, Krzysztof and Topf, Maya and Moult, John and Schwede, Torsten} }
Proteins
Automated benchmarking of combined protein structure and ligand conformation prediction

Michèle Leemann, Ander Sagasta, Jerome Eberhardt, Torsten Schwede, Xavier Robin, and Janani Durairaj

In Proteins: Structure, Function, and Bioinformatics

The prediction of protein-ligand complexes (PLC), using both experimental and predicted structures, is an active and important area of research, underscored by the inclusion of the Protein-Ligand Interaction category in the latest round of the Critical Assessment of Protein Structure Prediction experiment CASP15. The prediction task in CASP15 consisted of predicting both the three-dimensional structure of the receptor protein as well as the position and conformation of the ligand. This paper addresses the challenges and proposed solutions for devising automated benchmarking techniques for PLC prediction. The reliability of experimentally solved PLC as ground truth reference structures is assessed using various validation criteria. Similarity of PLC to previously released complexes is employed to judge PLC diversity and the difficulty of a PLC as a prediction target. We show that the commonly used PDBBind time-split test-set is inappropriate for comprehensive PLC evaluation, with state-of-the-art tools showing conflicting results on a more representative and high quality dataset constructed for benchmarking purposes. We also show that redocking on crystal structures is a much simpler task than docking into predicted protein models, demonstrated by the two PLC-prediction-specific scoring metrics created. Finally, we introduce a fully automated pipeline that predicts PLC and evaluates the accuracy of the protein structure, ligand pose, and protein–ligand interactions.
@article{leemannAutomatedBenchmarkingCombined, title = {Automated benchmarking of combined protein structure and ligand conformation prediction}, volume = {n/a}, copyright = {© 2023 The Authors. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC.}, issn = {1097-0134}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.26605}, doi = {10.1002/prot.26605}, language = {en}, number = {n/a}, urldate = {2023-10-27}, journal = {Proteins: Structure, Function, and Bioinformatics}, author = {Leemann, Michèle and Sagasta, Ander and Eberhardt, Jerome and Schwede, Torsten and Robin, Xavier and Durairaj, Janani}, note = {\_eprint: https://onlinelibrary.wiley.com/doi/}, keywords = {3D structure prediction, CASP15, protein structure, protein-ligand complexes}, year = {2023}, }
Proteins
Assessment of protein–ligand complexes in CASP15

Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Torsten Schwede, and W. Patrick Walters

In Proteins: Structure, Function, and Bioinformatics

CASP15 introduced a new category, ligand prediction, where participants were provided with a protein or nucleic acid sequence, SMILES line notation, and stoichiometry for ligands and tasked with generating computational models for the three-dimensional structure of the corresponding protein–ligand complex. These models were subsequently compared with experimental structures determined by x-ray crystallography or cryoEM. To assess these predictions, two novel scores were developed. The Binding-Site Superposed, Symmetry-Corrected Pose Root Mean Square Deviation (BiSyRMSD) evaluated the absolute deviations of the models from the experimental structures. At the same time, the Local Distance Difference Test for Protein–Ligand Interactions (lDDT-PLI) assessed the ability of models to reproduce the protein–ligand interactions in the experimental structures. The ligands evaluated in this challenge range from single-atom ions to large flexible organic molecules. More than 1800 submissions were evaluated for their ability to predict 23 different protein–ligand complexes. Overall, the best models could faithfully reproduce the geometries of more than half of the prediction targets. The ligands’ size and flexibility were the primary factors influencing the predictions’ quality. Small ions and organic molecules with limited flexibility were predicted with high fidelity, while reproducing the binding poses of larger, flexible ligands proved more challenging.
@article{robinAssessmentProteinLigand, title = {Assessment of protein–ligand complexes in {CASP15}}, volume = {n/a}, copyright = {© 2023 Wiley Periodicals LLC.}, issn = {1097-0134}, doi = {10.1002/prot.26601}, language = {en}, number = {n/a}, urldate = {2023-10-10}, journal = {Proteins: Structure, Function, and Bioinformatics}, year = {2023}, author = {Robin, Xavier and Studer, Gabriel and Durairaj, Janani and Eberhardt, Jerome and Schwede, Torsten and Walters, W. Patrick} }
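The symmetry-correction step behind a score such as BiSyRMSD can be sketched in a few lines. This is a simplified illustration, not the CASP15 assessment code: it assumes the model ligand has already been placed in the reference frame by binding-site superposition, and that the list of symmetry-equivalent atom orderings (`permutations`, a hypothetical input) has been enumerated elsewhere, for example by graph automorphism on the ligand.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate lists."""
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def symmetry_corrected_rmsd(model, reference, permutations):
    """Minimum RMSD over all symmetry-equivalent atom orderings of the model.

    Each permutation lists, for every reference atom position, the index of
    the model atom mapped onto it.
    """
    return min(
        rmsd([model[i] for i in perm], reference) for perm in permutations
    )


# Two chemically equivalent atoms (indices 1 and 2) are swapped in the model.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
permutations = [(0, 1, 2), (0, 2, 1)]
```

Without the correction, the swapped atoms would be penalised even though the pose is chemically identical; taking the minimum over equivalent orderings removes that artefact.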
Proteins
New prediction categories in CASP15

Andriy Kryshtafovych, Maciej Antczak, Marta Szachniuk, Tomasz Zok, Rachael C. Kretsch, Ramya Rangan, Phillip Pham, Rhiju Das, Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Aaron Sweeney, Maya Topf, Torsten Schwede, Krzysztof Fidelis, and John Moult

In Proteins: Structure, Function, and Bioinformatics

Prediction categories in the Critical Assessment of Structure Prediction (CASP) experiments change with the need to address specific problems in structure modeling. In CASP15, four new prediction categories were introduced: RNA structure, ligand-protein complexes, accuracy of oligomeric structures and their interfaces, and ensembles of alternative conformations. This paper lists technical specifications for these categories and describes their integration in the CASP data management system.
@article{kryshtafovychNewPredictionCategories, title = {New prediction categories in {CASP15}}, volume = {n/a}, copyright = {© 2023 The Authors. Proteins: Structure, Function, and Bioinformatics published by Wiley Periodicals LLC.}, issn = {1097-0134}, doi = {10.1002/prot.26515}, language = {en}, number = {n/a}, urldate = {2023-10-10}, journal = {Proteins: Structure, Function, and Bioinformatics}, year = {2023}, author = {Kryshtafovych, Andriy and Antczak, Maciej and Szachniuk, Marta and Zok, Tomasz and Kretsch, Rachael C. and Rangan, Ramya and Pham, Phillip and Das, Rhiju and Robin, Xavier and Studer, Gabriel and Durairaj, Janani and Eberhardt, Jerome and Sweeney, Aaron and Topf, Maya and Schwede, Torsten and Fidelis, Krzysztof and Moult, John} }
Science
Maize resistance to witchweed through changes in strigolactone biosynthesis

C. Li, L. Dong, J. Durairaj, J.-C. Guan, M. Yoshimura, P. Quinodoz, R. Horber, K. Gaus, J. Li, Y. B. Setotaw, J. Qi, H. De Groote, Y. Wang, B. Thiombiano, K. Floková, A. Walmsley, T. V. Charnikhova, A. Chojnacka, S. Correia de Lemos, Y. Ding, D. Skibbe, K. Hermann, C. Screpanti, A. De Mesmaeker, E. A. Schmelz, A. Menkir, M. Medema, A. D. J. Van Dijk, J. Wu, K. E. Koch, and H. J. Bouwmeester

In Science

Maize (Zea mays) is a major staple crop in Africa, where its yield and the livelihood of millions are compromised by the parasitic witchweed Striga. Germination of Striga is induced by strigolactones exuded from maize roots into the rhizosphere. In a maize germplasm collection, we identified two strigolactones, zealactol and zealactonoic acid, which stimulate less Striga germination than the major maize strigolactone, zealactone. We then showed that a single cytochrome P450, ZmCYP706C37, catalyzes a series of oxidative steps in the maize-strigolactone biosynthetic pathway. Reduction in activity of this enzyme and two others involved in the pathway, ZmMAX1b and ZmCLAMT1, can change strigolactone composition and reduce Striga germination and infection. These results offer prospects for breeding Striga-resistant maize.
@article{liMaizeResistanceWitchweed2023, title = {Maize resistance to witchweed through changes in strigolactone biosynthesis}, volume = {379}, doi = {10.1126/science.abq4775}, number = {6627}, urldate = {2023-10-10}, journal = {Science}, author = {Li, C. and Dong, L. and Durairaj, J. and Guan, J.-C. and Yoshimura, M. and Quinodoz, P. and Horber, R. and Gaus, K. and Li, J. and Setotaw, Y. B. and Qi, J. and De Groote, H. and Wang, Y. and Thiombiano, B. and Floková, K. and Walmsley, A. and Charnikhova, T. V. and Chojnacka, A. and Correia de Lemos, S. and Ding, Y. and Skibbe, D. and Hermann, K. and Screpanti, C. and De Mesmaeker, A. and Schmelz, E. A. and Menkir, A. and Medema, M. and Van Dijk, A. D. J. and Wu, J. and Koch, K. E. and Bouwmeester, H. J.}, month = jan, year = {2023}, note = {Publisher: American Association for the Advancement of Science}, pages = {94--99}, }
CSBJ
Beyond sequence: Structure-based machine learning

Janani Durairaj, Dick de Ridder, and Aalt D. J. van Dijk

In Computational and Structural Biotechnology Journal

Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
@article{durairajSequenceStructurebasedMachine2023, title = {Beyond sequence: {Structure}-based machine learning}, volume = {21}, issn = {2001-0370}, shorttitle = {Beyond sequence}, doi = {10.1016/j.csbj.2022.12.039}, urldate = {2023-10-10}, journal = {Computational and Structural Biotechnology Journal}, author = {Durairaj, Janani and de Ridder, Dick and van Dijk, Aalt D. J.}, month = jan, year = {2023}, keywords = {Deep learning, Machine learning, Protein structures}, pages = {630--643}, }

2022

NSMB
A structural biology community assessment of AlphaFold2 applications

Mehmet Akdel*, Douglas E V Pires*, Eduard Porta Pardo*, Jürgen Jänes*, Arthur O Zalevsky*, Bálint Mészáros*, Patrick Bryant*, Lydia L. Good*, Roman A Laskowski*, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H M Rodrigues, Alistair S Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Kresten Lindorff-Larsen, Alfonso Valencia#, Sergey Ovchinnikov#, Janani Durairaj#, David B Ascher#, Janet M Thornton#, Norman E Davey#, Amelie Stein#, Arne Elofsson#, Tristan I Croll#, and Pedro Beltrao#

In Nature Structural & Molecular Biology

Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
@article{akdelStructuralBiologyCommunity2022b, title = {A structural biology community assessment of {AlphaFold2} applications}, volume = {29}, copyright = {2022 The Author(s)}, issn = {1545-9985}, doi = {10.1038/s41594-022-00849-w}, language = {en}, number = {11}, urldate = {2023-10-10}, journal = {Nature Structural \& Molecular Biology}, author = {Akdel*, Mehmet and Pires*, Douglas E V and Porta Pardo*, Eduard and J{\"a}nes*, J{\"u}rgen and Zalevsky*, Arthur O and M{\'e}sz{\'a}ros*, B{\'a}lint and Bryant*, Patrick and Good*, Lydia L. and Laskowski*, Roman A and Pozzati, Gabriele and Shenoy, Aditi and Zhu, Wensi and Kundrotas, Petras and Ruiz Serra, Victoria and Rodrigues, Carlos H M and Dunham, Alistair S and Burke, David and Borkakoti, Neera and Velankar, Sameer and Frost, Adam and Lindorff-Larsen, Kresten and Valencia#, Alfonso and Ovchinnikov#, Sergey and Durairaj#, Janani and Ascher#, David B and Thornton#, Janet M and Davey#, Norman E and Stein#, Amelie and Elofsson#, Arne and Croll#, Tristan I and Beltrao#, Pedro}, month = nov, year = {2022}, note = {Number: 11 Publisher: Nature Publishing Group}, keywords = {Protein folding, Research data, Structural biology}, pages = {1056--1067} }
New Phytologist
The tomato cytochrome P450 CYP712G1 catalyzes the double oxidation of orobanchol en route to the rhizosphere signaling strigolactone, solanacol

Yanting Wang, Janani Durairaj, Hernando G Suárez Duran, Robin van Velzen, Kristyna Flokova, Che-Yang Liao, Aleksandra Chojnacka, Stuart MacFarlane, M Eric Schranz, Marnix H Medema, Aalt DJ van Dijk, Lemeng Dong, and Harro Bouwmeester

In New Phytologist

Strigolactones (SLs) are rhizosphere signalling molecules and phytohormones. The biosynthetic pathway of SLs in tomato has been partially elucidated, but the structural diversity in tomato SLs predicts that additional biosynthetic steps are required. Here, root RNA-seq data and co-expression analysis were used for SL biosynthetic gene discovery. This strategy resulted in a candidate gene list containing several cytochrome P450s. Heterologous expression in Nicotiana benthamiana and yeast showed that one of these, CYP712G1, can catalyse the double oxidation of orobanchol, resulting in the formation of three didehydro-orobanchol (DDH) isomers. Virus-induced gene silencing and heterologous expression in yeast showed that one of these DDH isomers is converted to solanacol, one of the most abundant SLs in tomato root exudate. Protein modelling and substrate docking analysis suggest that hydroxy-orobanchol is the likely intermediate in the conversion from orobanchol to the DDH isomers. Phylogenetic analysis demonstrated the occurrence of CYP712G1 homologues in the Eudicots only, which fits with the reports on DDH isomers in that clade. Protein modelling and orobanchol docking of the putative tobacco CYP712G1 homologue suggest that it can convert orobanchol to similar DDH isomers as tomato.
@article{wang2022tomato, title = {The tomato cytochrome P450 CYP712G1 catalyzes the double oxidation of orobanchol en route to the rhizosphere signaling strigolactone, solanacol}, author = {Wang, Yanting and Durairaj, Janani and Su{\'a}rez Duran, Hernando G and van Velzen, Robin and Flokova, Kristyna and Liao, Che-Yang and Chojnacka, Aleksandra and MacFarlane, Stuart and Schranz, M Eric and Medema, Marnix H and van Dijk, Aalt DJ and Dong, Lemeng and Bouwmeester, Harro}, journal = {New Phytologist}, year = {2022}, publisher = {Wiley Online Library}, }

2021

PLOS Comp. Bio.
Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases

Janani Durairaj, Elena Melillo, Harro J Bouwmeester, Jules Beekwilder, Dick de Ridder, and Aalt DJ van Dijk

In PLOS Computational Biology

Sesquiterpene synthases (STSs) catalyze the formation of a large class of plant volatiles called sesquiterpenes. While thousands of putative STS sequences from diverse plant species are available, only a small number of them have been functionally characterized. Sequence identity-based screening for desired enzymes, often used in biotechnological applications, is difficult to apply here as STS sequence similarity is strongly affected by species. This calls for more sophisticated computational methods for functionality prediction. We investigate the specificity of precursor cation formation in these elusive enzymes. By inspecting multi-product STSs, we demonstrate that STSs have a strong selectivity towards one precursor cation. We use a machine learning approach combining sequence and structure information to accurately predict precursor cation specificity for STSs across all plant species. We combine this with a co-evolutionary analysis on the wealth of uncharacterized putative STS sequences, to pinpoint residues and distant functional contacts influencing cation formation and reaction pathway selection. These structural factors can be used to predict and engineer enzymes with specific functions, as we demonstrate by predicting and characterizing two novel STSs from Citrus bergamia.
@article{durairaj2021integrating,
  title = {Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases},
  author = {Durairaj, Janani and Melillo, Elena and Bouwmeester, Harro J and Beekwilder, Jules and de Ridder, Dick and van Dijk, Aalt DJ},
  journal = {PLOS Computational Biology},
  volume = {17},
  number = {3},
  pages = {e1008197},
  year = {2021},
  publisher = {Public Library of Science San Francisco, CA USA},
}

2020

CSBJ
Caretta–A multiple protein structure alignment and feature extraction suite

Mehmet Akdel*, Janani Durairaj*, Dick de Ridder, and Aalt DJ van Dijk

In Computational and Structural Biotechnology Journal

The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta’s performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.
@article{akdel2020caretta,
  title = {Caretta--A multiple protein structure alignment and feature extraction suite},
  author = {Akdel*, Mehmet and Durairaj*, Janani and de Ridder, Dick and van Dijk, Aalt DJ},
  journal = {Computational and Structural Biotechnology Journal},
  volume = {18},
  pages = {981--992},
  year = {2020},
  publisher = {Elsevier},
}
Bioinformatics
Geometricus represents protein structures as shape-mers derived from moment invariants

Janani Durairaj, Mehmet Akdel, Dick de Ridder, and Aalt DJ van Dijk

In Bioinformatics

As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search, unsupervised clustering and structure classification across proteins from different superfamilies as well as within the same family.
@article{durairaj2020geometricus,
  title = {Geometricus represents protein structures as shape-mers derived from moment invariants},
  author = {Durairaj, Janani and Akdel, Mehmet and de Ridder, Dick and van Dijk, Aalt DJ},
  journal = {Bioinformatics},
  volume = {36},
  number = {Supplement\_2},
  pages = {i718--i725},
  year = {2020},
  publisher = {Oxford University Press},
}
Arch. Biochem. Biophys.
The santalene synthase from Cinnamomum camphora: Reconstruction of a sesquiterpene synthase from a monoterpene synthase

Alice Di Girolamo*, Janani Durairaj*, Adèle van Houwelingen, Francel Verstappen, Dirk Bosch, Katarina Cankar, Harro Bouwmeester, Dick de Ridder, Aalt DJ van Dijk, and Jules Beekwilder

In Archives of Biochemistry and Biophysics

Plant terpene synthases (TPSs) can mediate formation of a large variety of terpenes, and their diversification contributes to the specific chemical profiles of different plant species and chemotypes. Plant genomes often encode a number of related terpene synthases, which can produce very different terpenes. The relationship between TPS sequence and resulting terpene product is not completely understood. In this work we describe two TPSs from the Camphor tree Cinnamomum camphora (L.) Presl. One of these, CiCaMS, acts as a monoterpene synthase (monoTPS), and mediates the production of myrcene, while the other, CiCaSSy, acts as a sesquiterpene synthase (sesquiTPS), and catalyses the production of α-santalene, β-santalene and trans-α-bergamotene. Interestingly, these enzymes share 97% DNA sequence identity and differ only in 22 amino acid residues out of 553. To understand which residues are essential for the catalysis of monoterpenes resp. sesquiterpenes, a number of hybrid synthases were prepared, and supplemented by a set of single-residue variants. These were tested for their ability to produce monoterpenes and sesquiterpenes by in vivo production of sesquiterpenes in E. coli, and by in vitro enzyme assays. This analysis pinpointed three residues in the sequence which could mediate the change in product specificity from a monoterpene synthase to a sesquiterpene synthase. Another set of three residues defined the sesquiterpene product profile, including the ratios between sesquiterpene products.
@article{di2020santalene,
  title = {The santalene synthase from Cinnamomum camphora: Reconstruction of a sesquiterpene synthase from a monoterpene synthase},
  author = {Di Girolamo*, Alice and Durairaj*, Janani and van Houwelingen, Ad{\`e}le and Verstappen, Francel and Bosch, Dirk and Cankar, Katarina and Bouwmeester, Harro and de Ridder, Dick and van Dijk, Aalt DJ and Beekwilder, Jules},
  journal = {Archives of Biochemistry and Biophysics},
  volume = {695},
  pages = {108647},
  year = {2020},
  publisher = {Academic Press},
}

2019

Phytochemistry
An analysis of characterized plant sesquiterpene synthases

Janani Durairaj*, Alice Di Girolamo*, Harro J Bouwmeester, Dick de Ridder, Jules Beekwilder, and Aalt DJ van Dijk

In Phytochemistry

Plants exhibit a vast array of sesquiterpenes, C15 hydrocarbons which often function as herbivore-repellents or pollinator-attractants. These in turn are produced by a diverse range of sesquiterpene synthases. A comprehensive analysis of these enzymes in terms of product specificity has been hampered by the lack of a centralized resource of sufficient functionally annotated sequence data. To address this, we have gathered 262 plant sesquiterpene synthase sequences with experimentally characterized products. The annotated enzyme sequences allowed for an analysis of terpene synthase motifs, leading to the extension of one motif and recognition of a variant of another. In addition, putative terpene synthase sequences were obtained from various resources and compared with the annotated sesquiterpene synthases. This analysis indicated regions of terpene synthase sequence space which so far are unexplored experimentally. Finally, we present a case describing mutational studies on residues altering product specificity, for which we analyzed conservation in our database. This demonstrates an application of our database in choosing likely-functional residues for mutagenesis studies aimed at understanding or changing sesquiterpene synthase product specificity.
@article{durairaj2019analysis,
  title = {An analysis of characterized plant sesquiterpene synthases},
  author = {Durairaj*, Janani and Di Girolamo*, Alice and Bouwmeester, Harro J and de Ridder, Dick and Beekwilder, Jules and van Dijk, Aalt DJ},
  journal = {Phytochemistry},
  volume = {158},
  pages = {157--165},
  year = {2019},
  publisher = {Elsevier},
}

conference articles & others

2024

ICML

PLINDER: The Protein-Ligand Interactions Dataset and Evaluation Resource

Janani Durairaj*, Yusuf Adeshina*, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, Emanuele Rossi, Guoqing Zhou, Srimukh Veccham, Clemens Isert, Yuxing Peng, Prabindh Sundareson, Mehmet Akdel, Gabriele Corso, Hannes Stärk, Gerardo Tauriello, Zachary Carpenter, Michael Bronstein, Emine Kucukbenli, Torsten Schwede, and Luca Naef

In ICML’24 Workshop ML for Life and Material Science: From Theory to Industry Applications

Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities. To address these shortcomings, we present PLINDER, the largest and most annotated dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction and ligand levels, and paired unbound (apo) and predicted structures. We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when retrained with different kinds of splits.

2023

Springer Book
From Genomes to Variant Interpretations Through Protein Structures

Janani Durairaj*, Leila Tamara Alexander*, Gabriel Studer, Gerardo Tauriello, Ingrid Guarnetti Prandi, Rosalba Lepore, Giovanni Chillemi, and Torsten Schwede

2023

The large amount of genetic, phenotypic, and structural data from diverse conditions and environments offers opportunities for new groundbreaking research. Today, the major scientific task is to interpret the vast number of genetic variants within these data. As described in this chapter, identifying relevant variants and connecting them with the associated protein structural and environmental information is a powerful approach to biological discoveries. The unified view of the data brings us a step closer to understanding genetic variation, which is also fundamental for achieving the goals of personalized medicine and the planet’s environment.
@inproceedings{durairajGenomesVariantInterpretations2023,
  address = {Cham},
  series = {{SpringerBriefs} in {Applied} {Sciences} and {Technology}},
  title = {From {Genomes} to {Variant} {Interpretations} {Through} {Protein} {Structures}},
  isbn = {978-3-031-30691-4},
  language = {en},
  urldate = {2023-10-10},
  booktitle = {{Exscalate4CoV}: {High}-{Performance} {Computing} for {COVID} {Drug} {Discovery}},
  publisher = {Springer International Publishing},
  author = {Durairaj*, Janani and Alexander*, Leila Tamara and Studer, Gabriel and Tauriello, Gerardo and Prandi, Ingrid Guarnetti and Lepore, Rosalba and Chillemi, Giovanni and Schwede, Torsten},
  editor = {Coletti, Silvano and Bernardi, Gabriella},
  year = {2023},
  doi = {10.1007/978-3-031-30691-4_6},
  pages = {41--50},
}
ACM Computing Frontiers
Tunable and Portable Extreme-Scale Drug Discovery Platform at Exascale: the LIGATE Approach

Gianluca Palermo, Gianmarco Accordi, Davide Gadioli, Emanuele Vitali, Cristina Silvano, Bruno Guindani, Danilo Ardagna, Andrea R. Beccari, Domenico Bonanni, Carmine Talarico, Filippo Lunghini, Jan Martinovic, Paulo Silva, Ada Bohm, Jakub Beranek, Jan Krenek, Branislav Jansik, Luigi Crisci, Biagio Cosenza, Peter Thoman, Philip Salzmann, Thomas Fahringer, Leila Alexander, Gerardo Tauriello, Torsten Schwede, Janani Durairaj, Andrew Emerson, Federico Ficarelli, Sebastian Wingbermuhle, Eric Lindahl, Daniele Gregori, Emanuele Sana, Silvano Coletti, and Philip Gschwandtner

In 20th ACM International Conference on Computing Frontiers (CF’23)

Today’s digital revolution is having a dramatic impact on the pharmaceutical industry and the entire healthcare system. The implementation of machine learning, extreme-scale computer simulations, and big data analytics in the drug design and development process offers an excellent opportunity to lower the risk of investment and reduce the time to the patient. Within the LIGATE project, we aim to integrate, extend, and co-design best-in-class European components to design Computer-Aided Drug Design (CADD) solutions exploiting today’s high-end supercomputers and tomorrow’s Exascale resources, fostering European competitiveness in the field. The proposed LIGATE solution is a fully integrated workflow that delivers the results of a virtual screening campaign for drug discovery with the highest speed and the highest accuracy. The full automation of the solution and the possibility of running it on multiple supercomputing centers at once make it possible to run an extreme-scale in silico drug discovery campaign in a few days, for example to respond promptly to a worldwide pandemic crisis.
@inproceedings{palermoTunablePortableExtremeScale2023,
  title = {Tunable and {Portable} {Extreme}-{Scale} {Drug} {Discovery} {Platform} at {Exascale}: the {LIGATE} {Approach}},
  shorttitle = {Tunable and {Portable} {Extreme}-{Scale} {Drug} {Discovery} {Platform} at {Exascale}},
  language = {en},
  urldate = {2023-10-10},
  journal = {arXiv.org},
  author = {Palermo, Gianluca and Accordi, Gianmarco and Gadioli, Davide and Vitali, Emanuele and Silvano, Cristina and Guindani, Bruno and Ardagna, Danilo and Beccari, Andrea R. and Bonanni, Domenico and Talarico, Carmine and Lunghini, Filippo and Martinovic, Jan and Silva, Paulo and Bohm, Ada and Beranek, Jakub and Krenek, Jan and Jansik, Branislav and Crisci, Luigi and Cosenza, Biagio and Thoman, Peter and Salzmann, Philip and Fahringer, Thomas and Alexander, Leila and Tauriello, Gerardo and Schwede, Torsten and Durairaj, Janani and Emerson, Andrew and Ficarelli, Federico and Wingbermuhle, Sebastian and Lindahl, Eric and Gregori, Daniele and Sana, Emanuele and Coletti, Silvano and Gschwandtner, Philip},
  month = apr,
  year = {2023},
  conference = {true},
  eventtitle = {20th ACM International Conference on Computing Frontiers (CF'23)},
  doi = {10.1145/3587135.3592172},
}

2021

MLSB NeurIPS
Fast and adaptive protein structure representations for machine learning

Janani Durairaj*, Mehmet Akdel*, Dick de Ridder, and Aalt DJ van Dijk

In Machine Learning for Structural Biology Workshop, NeurIPS 2020

The growing prevalence and popularity of protein structure data, both experimental and computationally modelled, necessitates fast tools and algorithms to enable exploratory and interpretable structure-based machine learning. Alignment-free approaches have been developed for divergent proteins, but proteins sharing functional and structural similarity are often better understood via structural alignment, which has typically been too computationally expensive for larger datasets. Here, we introduce the concept of rotation-invariant shape-mers to multiple structure alignment, creating a structure aligner that scales well with the number of proteins and allows for aligning over a thousand structures in 20 minutes. We demonstrate how alignment-free shape-mer counts and aligned structural features, when used in machine learning tasks, can adapt to different levels of functional hierarchy in protein kinases, pinpointing residues and structural fragments that play a role in catalytic activity.
@inproceedings{durairaj2021fast,
  title = {Fast and adaptive protein structure representations for machine learning},
  author = {Durairaj*, Janani and Akdel*, Mehmet and de Ridder, Dick and van Dijk, Aalt DJ},
  journal = {bioRxiv},
  year = {2021},
  eventtitle = {Machine Learning for Structural Biology Workshop, NeurIPS 2020},
  conference = {true},
}

2020

Course
Crash Course Machine Learning

Geert W Kootstra, Aalt DJ van Dijk, David Rapado Rincon, and Janani Durairaj

2020

@inproceedings{kootstra2020crash,
  title = {Crash Course Machine Learning},
  author = {Kootstra, Geert W and van Dijk, Aalt DJ and Rincon, David Rapado and Durairaj, Janani},
  year = {2020},
}

theses

2021

PhD Thesis

Computational approaches to discover novel enzymes for fragrance and flavour

Janani Durairaj

@phdthesis{durairaj2021computational,
  title = {Computational approaches to discover novel enzymes for fragrance and flavour},
  author = {Durairaj, Janani},
  year = {2021},
  school = {Wageningen University},
}