- Preprint
Embedding-based alignment: combining protein language models and alignment approaches to detect structural similarities in the twilight-zone
Lorenzo Pantolini, Gabriel Studer, Joana Pereira, Janani Durairaj, and Torsten Schwede
Language models are now routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful tools in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode the “semantic meaning” of each individual amino acid in the context of the full protein sequence. Multiple works use these representations as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA), and show how these capture structural similarities even in the twilight zone, outperforming both classical sequence-based scores and other approaches based on protein language models. The method shows excellent accuracy despite the absence of training and parameter optimization. We expect that the association of pLMs and alignment methods will soon rise in popularity, helping the detection of relationships between proteins in the twilight zone.
- Nature
Uncovering new families and folds in the natural protein universe
Janani Durairaj, Andrew M Waterhouse, Toomas Mets, Tetiana Brodiazhenko, Minhal Abdullah, Gabriel Studer, Gerardo Tauriello, Mehmet Akdel, Antonina Andreeva, Alex Bateman, Tanel Tenson, Vasili Hauryliuk, Torsten Schwede, and Joana Pereira
We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover nearly all proteins that are known, including those challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this "dark matter" of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network. By searching for novelties from sequence, structure, and semantic perspectives, we uncovered the beta-flower fold, added multiple protein families to the Pfam database, and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating, and prioritising novel protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.
- NRDD
Artificial Intelligence for Natural Product Drug Discovery
Michael W. Mullowney, Katherine R. Duncan, Somayah S. Elsayed, Neha Garg, Justin J. J. van der Hooft, Nathaniel I. Martin, David Meijer, Barbara R. Terlouw, Friederike Biermann, Kai Blin, Janani Durairaj, Marina Gorostiola González, Eric J. N. Helfrich, Florian Huber, Stefan Leopold-Messer, Kohulan Rajan, Tristan de Rond, Jeffrey A. van Santen, Maria Sorokina, Marcy J. Balunas, Mehdi A. Beniddir, Doris A. van Bergeijk, Laura M. Carroll, Chase M. Clark, Djork-Arné Clevert, Chris A. Dejong, Chao Du, Scarlet Ferrinho, Francesca Grisoni, Albert Hofstetter, Willem Jespers, Olga V. Kalinina, Satria A. Kautsar, Hyunwoo Kim, Tiago F. Leao, Joleen Masschelein, Evan R. Rees, Raphael Reher, Daniel Reker, Philippe Schwaller, Marwin Segler, Michael A. Skinnider, Allison S. Walker, Egon L. Willighagen, Barbara Zdrazil, Nadine Ziemert, Rebecca J. M. Goss, Pierre Guyomard, Andrea Volkamer, William H. Gerwick, Hyun Uk Kim, Rolf Müller, Gilles P. van Wezel, Gerard J. P. van Westen, Anna K. H. Hirsch, Roger G. Linington, Serina L. Robinson, and Marnix H. Medema
In Nature Reviews Drug Discovery
Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.
- Proteins
Protein target highlights in CASP15: Analysis of models by structure providers
Leila T. Alexander *, Janani Durairaj *, Andriy Kryshtafovych, Luciano A. Abriata, Yusupha Bayo, Gira Bhabha, Cécile Breyton, Simon G. Caulton, James Chen, Séraphine Degroux, Damian C. Ekiert, Benedikte S. Erlandsen, Peter L. Freddolino, Dominic Gilzer, Chris Greening, Jonathan M. Grimes, Rhys Grinter, Manickam Gurusaran, Marcus D. Hartmann, Charlie J. Hitchman, Jeremy R. Keown, Ashleigh Kropp, Petri Kursula, Andrew L. Lovering, Bruno Lemaitre, Andrea Lia, Shiheng Liu, Maria Logotheti, Shuze Lu, Sigurbjörn Markússon, Mitchell D. Miller, George Minasov, Hartmut H. Niemann, Felipe Opazo, George N. Phillips Jr, Owen R. Davies, Samuel Rommelaere, Monica Rosas-Lemus, Pietro Roversi, Karla Satchell, Nathan Smith, Mark A. Wilson, Kuan-Lin Wu, Xian Xia, Han Xiao, Wenhua Zhang, Z. Hong Zhou, Krzysztof Fidelis, Maya Topf, John Moult, and Torsten Schwede
In Proteins: Structure, Function, and Bioinformatics
We present an in-depth analysis of selected CASP15 targets, focusing on their biological and functional significance. The authors of the structures identify and discuss key protein features and evaluate how effectively these aspects were captured in the submitted predictions. While the overall ability to predict three-dimensional protein structures continues to impress, reproducing uncommon features not previously observed in experimental structures is still a challenge. Furthermore, instances with conformational flexibility and large multimeric complexes highlight the need for novel scoring strategies to better emphasize biologically relevant structural regions. Looking ahead, closer integration of computational and experimental techniques will play a key role in determining the next challenges to be unraveled in the field of structural molecular biology.
- Proteins
Automated benchmarking of combined protein structure and ligand conformation prediction
Michèle Leemann, Ander Sagasta, Jerome Eberhardt, Torsten Schwede, Xavier Robin, and Janani Durairaj
In Proteins: Structure, Function, and Bioinformatics
The prediction of protein-ligand complexes (PLC), using both experimental and predicted structures, is an active and important area of research, underscored by the inclusion of the Protein-Ligand Interaction category in the latest round of the Critical Assessment of Protein Structure Prediction experiment CASP15. The prediction task in CASP15 consisted of predicting both the three-dimensional structure of the receptor protein and the position and conformation of the ligand. This paper addresses the challenges and proposed solutions for devising automated benchmarking techniques for PLC prediction. The reliability of experimentally solved PLC as ground truth reference structures is assessed using various validation criteria. Similarity of PLC to previously released complexes is employed to judge PLC diversity and the difficulty of a PLC as a prediction target. We show that the commonly used PDBBind time-split test-set is inappropriate for comprehensive PLC evaluation, with state-of-the-art tools showing conflicting results on a more representative and high-quality dataset constructed for benchmarking purposes. We also show that redocking on crystal structures is a much simpler task than docking into predicted protein models, demonstrated by the two PLC-prediction-specific scoring metrics created. Finally, we introduce a fully automated pipeline that predicts PLC and evaluates the accuracy of the protein structure, ligand pose, and protein–ligand interactions.
- Proteins
Assessment of protein–ligand complexes in CASP15
Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Torsten Schwede, and W. Patrick Walters
In Proteins: Structure, Function, and Bioinformatics
CASP15 introduced a new category, ligand prediction, where participants were provided with a protein or nucleic acid sequence, SMILES line notation, and stoichiometry for ligands and tasked with generating computational models for the three-dimensional structure of the corresponding protein–ligand complex. These models were subsequently compared with experimental structures determined by X-ray crystallography or cryo-EM. To assess these predictions, two novel scores were developed. The Binding-Site Superposed, Symmetry-Corrected Pose Root Mean Square Deviation (BiSyRMSD) evaluated the absolute deviations of the models from the experimental structures. At the same time, the Local Distance Difference Test for Protein–Ligand Interactions (lDDT-PLI) assessed the ability of models to reproduce the protein–ligand interactions in the experimental structures. The ligands evaluated in this challenge ranged from single-atom ions to large flexible organic molecules. More than 1800 submissions were evaluated for their ability to predict 23 different protein–ligand complexes. Overall, the best models could faithfully reproduce the geometries of more than half of the prediction targets. The ligands’ size and flexibility were the primary factors influencing the predictions’ quality. Small ions and organic molecules with limited flexibility were predicted with high fidelity, while reproducing the binding poses of larger, flexible ligands proved more challenging.
- Proteins
New prediction categories in CASP15
Andriy Kryshtafovych, Maciej Antczak, Marta Szachniuk, Tomasz Zok, Rachael C. Kretsch, Ramya Rangan, Phillip Pham, Rhiju Das, Xavier Robin, Gabriel Studer, Janani Durairaj, Jerome Eberhardt, Aaron Sweeney, Maya Topf, Torsten Schwede, Krzysztof Fidelis, and John Moult
In Proteins: Structure, Function, and Bioinformatics
Prediction categories in the Critical Assessment of Structure Prediction (CASP) experiments change with the need to address specific problems in structure modeling. In CASP15, four new prediction categories were introduced: RNA structure, ligand-protein complexes, accuracy of oligomeric structures and their interfaces, and ensembles of alternative conformations. This paper lists technical specifications for these categories and describes their integration in the CASP data management system.
- Science
Maize resistance to witchweed through changes in strigolactone biosynthesis
C. Li, L. Dong, J. Durairaj, J.-C. Guan, M. Yoshimura, P. Quinodoz, R. Horber, K. Gaus, J. Li, Y. B. Setotaw, J. Qi, H. De Groote, Y. Wang, B. Thiombiano, K. Floková, A. Walmsley, T. V. Charnikhova, A. Chojnacka, S. Correia de Lemos, Y. Ding, D. Skibbe, K. Hermann, C. Screpanti, A. De Mesmaeker, E. A. Schmelz, A. Menkir, M. Medema, A. D. J. Van Dijk, J. Wu, K. E. Koch, and H. J. Bouwmeester
Maize (Zea mays) is a major staple crop in Africa, where its yield and the livelihood of millions are compromised by the parasitic witchweed Striga. Germination of Striga is induced by strigolactones exuded from maize roots into the rhizosphere. In a maize germplasm collection, we identified two strigolactones, zealactol and zealactonoic acid, which stimulate less Striga germination than the major maize strigolactone, zealactone. We then showed that a single cytochrome P450, ZmCYP706C37, catalyzes a series of oxidative steps in the maize-strigolactone biosynthetic pathway. Reduction in activity of this enzyme and two others involved in the pathway, ZmMAX1b and ZmCLAMT1, can change strigolactone composition and reduce Striga germination and infection. These results offer prospects for breeding Striga-resistant maize.
- CSBJ
Beyond sequence: Structure-based machine learning
Janani Durairaj, Dick de Ridder, and Aalt D. J. van Dijk
In Computational and Structural Biotechnology Journal
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across the literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
- NSMB
A structural biology community assessment of AlphaFold2 applications
Mehmet Akdel *, Douglas E V Pires *, Eduard Porta Pardo *, Jürgen Jänes *, Arthur O Zalevsky *, Bálint Mészáros *, Patrick Bryant *, Lydia L. Good *, Roman A Laskowski *, Gabriele Pozzati, Aditi Shenoy, Wensi Zhu, Petras Kundrotas, Victoria Ruiz Serra, Carlos H M Rodrigues, Alistair S Dunham, David Burke, Neera Borkakoti, Sameer Velankar, Adam Frost, Kresten Lindorff-Larsen, Alfonso Valencia #, Sergey Ovchinnikov #, Janani Durairaj #, David B Ascher #, Janet M Thornton #, Norman E Davey #, Amelie Stein #, Arne Elofsson #, Tristan I Croll #, and Pedro Beltrao #
In Nature Structural & Molecular Biology
Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research.
- New Phytologist
The tomato cytochrome P450 CYP712G1 catalyzes the double oxidation of orobanchol en route to the rhizosphere signaling strigolactone, solanacol
Yanting Wang, Janani Durairaj, Hernando G Suárez Duran, Robin van Velzen, Kristyna Flokova, Che-Yang Liao, Aleksandra Chojnacka, Stuart MacFarlane, M Eric Schranz, Marnix H Medema, Aalt DJ van Dijk, Lemeng Dong, and Harro Bouwmeester
In New Phytologist
Strigolactones (SLs) are rhizosphere signalling molecules and phytohormones. The biosynthetic pathway of SLs in tomato has been partially elucidated, but the structural diversity in tomato SLs predicts that additional biosynthetic steps are required. Here, root RNA-seq data and co-expression analysis were used for SL biosynthetic gene discovery. This strategy resulted in a candidate gene list containing several cytochrome P450s. Heterologous expression in Nicotiana benthamiana and yeast showed that one of these, CYP712G1, can catalyse the double oxidation of orobanchol, resulting in the formation of three didehydro-orobanchol (DDH) isomers. Virus-induced gene silencing and heterologous expression in yeast showed that one of these DDH isomers is converted to solanacol, one of the most abundant SLs in tomato root exudate. Protein modelling and substrate docking analysis suggest that hydroxy-orobanchol is the likely intermediate in the conversion from orobanchol to the DDH isomers. Phylogenetic analysis demonstrated the occurrence of CYP712G1 homologues in the Eudicots only, which fits with the reports on DDH isomers in that clade. Protein modelling and orobanchol docking of the putative tobacco CYP712G1 homologue suggest that it can convert orobanchol to similar DDH isomers as tomato.
- PLOS Comp. Bio.
Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases
Janani Durairaj, Elena Melillo, Harro J Bouwmeester, Jules Beekwilder, Dick de Ridder, and Aalt DJ van Dijk
In PLOS Computational Biology
Sesquiterpene synthases (STSs) catalyze the formation of a large class of plant volatiles called sesquiterpenes. While thousands of putative STS sequences from diverse plant species are available, only a small number of them have been functionally characterized. Sequence identity-based screening for desired enzymes, often used in biotechnological applications, is difficult to apply here as STS sequence similarity is strongly affected by species. This calls for more sophisticated computational methods for functionality prediction. We investigate the specificity of precursor cation formation in these elusive enzymes. By inspecting multi-product STSs, we demonstrate that STSs have a strong selectivity towards one precursor cation. We use a machine learning approach combining sequence and structure information to accurately predict precursor cation specificity for STSs across all plant species. We combine this with a co-evolutionary analysis on the wealth of uncharacterized putative STS sequences, to pinpoint residues and distant functional contacts influencing cation formation and reaction pathway selection. These structural factors can be used to predict and engineer enzymes with specific functions, as we demonstrate by predicting and characterizing two novel STSs from Citrus bergamia.
- CSBJ
Caretta – A multiple protein structure alignment and feature extraction suite
Mehmet Akdel *, Janani Durairaj *, Dick de Ridder, and Aalt DJ van Dijk
In Computational and Structural Biotechnology Journal
The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta’s performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.
- Bioinformatics
Geometricus represents protein structures as shape-mers derived from moment invariants
Janani Durairaj, Mehmet Akdel, Dick de Ridder, and Aalt DJ van Dijk
As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search to unsupervised clustering and structure classification, across proteins from different superfamilies as well as within the same family.
- Biophys.
The santalene synthase from Cinnamomum camphora: Reconstruction of a sesquiterpene synthase from a monoterpene synthase
Alice Di Girolamo *, Janani Durairaj *, Adèle van Houwelingen, Francel Verstappen, Dirk Bosch, Katarina Cankar, Harro Bouwmeester, Dick de Ridder, Aalt DJ van Dijk, and Jules Beekwilder
In Archives of Biochemistry and Biophysics
Plant terpene synthases (TPSs) can mediate formation of a large variety of terpenes, and their diversification contributes to the specific chemical profiles of different plant species and chemotypes. Plant genomes often encode a number of related terpene synthases, which can produce very different terpenes. The relationship between TPS sequence and resulting terpene product is not completely understood. In this work we describe two TPSs from the camphor tree Cinnamomum camphora (L.) Presl. One of these, CiCaMS, acts as a monoterpene synthase (monoTPS), and mediates the production of myrcene, while the other, CiCaSSy, acts as a sesquiterpene synthase (sesquiTPS), and catalyses the production of α-santalene, β-santalene and trans-α-bergamotene. Interestingly, these enzymes share 97% DNA sequence identity and differ in only 22 amino acid residues out of 553. To understand which residues are essential for the catalysis of monoterpenes or sesquiterpenes, respectively, a number of hybrid synthases were prepared, and supplemented by a set of single-residue variants. These were tested for their ability to produce monoterpenes and sesquiterpenes by in vivo production of sesquiterpenes in E. coli, and by in vitro enzyme assays. This analysis pinpointed three residues in the sequence which could mediate the change in product specificity from a monoterpene synthase to a sesquiterpene synthase. Another set of three residues defined the sesquiterpene product profile, including the ratios between sesquiterpene products.
- Phytochemistry
An analysis of characterized plant sesquiterpene synthases
Janani Durairaj *, Alice Di Girolamo *, Harro J Bouwmeester, Dick de Ridder, Jules Beekwilder, and Aalt DJ van Dijk
Plants exhibit a vast array of sesquiterpenes, C15 hydrocarbons which often function as herbivore-repellents or pollinator-attractants. These in turn are produced by a diverse range of sesquiterpene synthases. A comprehensive analysis of these enzymes in terms of product specificity has been hampered by the lack of a centralized resource of sufficient functionally annotated sequence data. To address this, we have gathered 262 plant sesquiterpene synthase sequences with experimentally characterized products. The annotated enzyme sequences allowed for an analysis of terpene synthase motifs, leading to the extension of one motif and recognition of a variant of another. In addition, putative terpene synthase sequences were obtained from various resources and compared with the annotated sesquiterpene synthases. This analysis indicated regions of terpene synthase sequence space which so far are unexplored experimentally. Finally, we present a case describing mutational studies on residues altering product specificity, for which we analyzed conservation in our database. This demonstrates an application of our database in choosing likely-functional residues for mutagenesis studies aimed at understanding or changing sesquiterpene synthase product specificity.
conference articles & others
- Springer Book
From Genomes to Variant Interpretations Through Protein Structures
Janani Durairaj *, Leila Tamara Alexander *, Gabriel Studer, Gerardo Tauriello, Ingrid Guarnetti Prandi, Rosalba Lepore, Giovanni Chillemi, and Torsten Schwede
The large amount of genetic, phenotypic, and structural data from diverse conditions and environments offers opportunities for new groundbreaking research. Today, the major scientific task is to interpret the vast number of genetic variants within these data. As described in this chapter, identifying relevant variants and connecting them with the associated protein structural and environmental information is a powerful approach to biological discoveries. The unified view of the data brings us a step closer to understanding genetic variation, which is also fundamental for achieving the goals of personalized medicine and the planet’s environment.
- Computing Frontiers
Tunable and Portable Extreme-Scale Drug Discovery Platform at Exascale: the LIGATE Approach
Gianluca Palermo, Gianmarco Accordi, Davide Gadioli, Emanuele Vitali, Cristina Silvano, Bruno Guindani, Danilo Ardagna, Andrea R. Beccari, Domenico Bonanni, Carmine Talarico, Filippo Lunghini, Jan Martinovic, Paulo Silva, Ada Bohm, Jakub Beranek, Jan Krenek, Branislav Jansik, Luigi Crisci, Biagio Cosenza, Peter Thoman, Philip Salzmann, Thomas Fahringer, Leila Alexander, Gerardo Tauriello, Torsten Schwede, Janani Durairaj, Andrew Emerson, Federico Ficarelli, Sebastian Wingbermuhle, Eric Lindahl, Daniele Gregori, Emanuele Sana, Silvano Coletti, and Philip Gschwandtner
In 20th ACM International Conference on Computing Frontiers (CF’23)
Today, the digital revolution is having a dramatic impact on the pharmaceutical industry and the entire healthcare system. The implementation of machine learning, extreme-scale computer simulations, and big data analytics in the drug design and development process offers an excellent opportunity to lower the risk of investment and reduce the time to the patient. Within the LIGATE project, we aim to integrate, extend, and co-design best-in-class European components to design Computer-Aided Drug Design (CADD) solutions exploiting today’s high-end supercomputers and tomorrow’s Exascale resources, fostering European competitiveness in the field. The proposed LIGATE solution is a fully integrated workflow that delivers the results of a virtual screening campaign for drug discovery with the highest speed along with the highest accuracy. The full automation of the solution and the possibility of running it on multiple supercomputing centers at once make it possible to run an extreme-scale in silico drug discovery campaign in a few days, for example to respond promptly to a worldwide pandemic crisis.
- MLSB NeurIPS
Fast and adaptive protein structure representations for machine learning
Janani Durairaj *, Mehmet Akdel *, Dick de Ridder, and Aalt DJ van Dijk
In Machine Learning for Structural Biology Workshop, NeurIPS 2020
The growing prevalence and popularity of protein structure data, both experimental and computationally modelled, necessitates fast tools and algorithms to enable exploratory and interpretable structure-based machine learning. Alignment-free approaches have been developed for divergent proteins, but proteins sharing functional and structural similarity are often better understood via structural alignment, which has typically been too computationally expensive for larger datasets. Here, we introduce the concept of rotation-invariant shape-mers to multiple structure alignment, creating a structure aligner that scales well with the number of proteins and allows for aligning over a thousand structures in 20 minutes. We demonstrate how alignment-free shape-mer counts and aligned structural features, when used in machine learning tasks, can adapt to different levels of functional hierarchy in protein kinases, pinpointing residues and structural fragments that play a role in catalytic activity.