
We are happy to share our recent publication:

Guiding questions to avoid data leakage in biological machine learning applications

In biological research, artificial intelligence has opened up countless opportunities and has become indispensable for understanding complex biological systems. By applying machine learning (ML) methods to biomolecular data, researchers can, for example, identify patterns and relationships in DNA, RNA and protein sequences. This has led to significant advances in many areas of biological research, such as the prediction of 3D protein structures.

In practical applications, however, researchers repeatedly find that the reported results of ML-based predictors are too optimistic and cannot be reproduced with independent data. One major reason for this is so-called ‘data leakage’, i.e. the unintended transfer of information between training and test data. A model evaluated on such test data appears to perform better than it actually does on genuinely unseen data.

A team of researchers from the Technical University of Munich (TUM), the Friedrich-Alexander-University Erlangen-Nuremberg (FAU), the University of Applied Sciences Weihenstephan-Triesdorf (HSWT), the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the University of Saarland (UdS) has therefore addressed the question of how to avoid the pitfalls that can quickly lead to data leakage, and thus to over-optimistic results, when ML-based approaches are applied, especially in biology.
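
One common way to reduce this kind of leakage is to split data by group rather than by individual sample, so that closely related samples (for example, sequences from the same protein family) never end up in both the training and the test set. The following Python sketch illustrates the idea. It is not taken from the publication; the function name `group_split`, the toy records and the family labels are hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict


def group_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples so that no group occurs in both train and test.

    `group_of` maps a sample to its group label (an assumption: related
    samples, e.g. homologous sequences, share the same label).
    """
    # Collect the samples belonging to each group.
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle whole groups, not individual samples.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Assign complete groups to the test set (at least one group).
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = group_ids[:n_test]
    train_ids = group_ids[n_test:]

    train = [s for g in train_ids for s in groups[g]]
    test = [s for g in test_ids for s in groups[g]]
    return train, test


# Hypothetical toy data: (sequence id, protein family)
records = [
    ("seq1", "famA"), ("seq2", "famA"),
    ("seq3", "famB"), ("seq4", "famB"),
    ("seq5", "famC"), ("seq6", "famC"),
    ("seq7", "famD"), ("seq8", "famD"),
]

train, test = group_split(records, group_of=lambda r: r[1], test_fraction=0.25)
```

Because whole families are assigned to one side of the split, a model cannot score well on the test set merely by recognising near-duplicates of its training data. Whether a family label is the right grouping criterion depends on the dependencies in the data at hand.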

‘Especially in biological and medical applications, data leakage can lead to unrealistic assessments of the performance of ML approaches,’ says Prof. Olga Kalinina from HIPS/UdS, ‘and can potentially even endanger patient safety.’

With this in mind, the researchers present seven questions to help avoid data leakage when constructing machine learning models in biology. By applying these questions to specific examples, the researchers demonstrate their usefulness and provide a guide to robust and reproducible machine learning research in biology. ‘Our goal is to raise awareness of potential issues with data leakage and to contribute to the development of reliable machine learning models. We hope that our questions will help researchers to identify complex and hidden dependencies in biological data and thus avoid data leakage,’ says Prof. Grimm, head of the Professorship of Bioinformatics at the TUM Campus Straubing and the HSWT.

‘Nowadays, it has become easier to ensure a valid ML workflow thanks to popular software and programming frameworks. In practice, however, their ease of use increases the risk of scientifically incorrect applications and false results,’ notes Prof. David Blumenthal from the Department of Artificial Intelligence in Biomedical Engineering at FAU.

‘Conversely, the complexity of biological data can lead to data leakage if it is overlooked by data scientists without sufficient expertise in the respective application domain. For these reasons, we strongly recommend interdisciplinary collaboration between experts from both fields,’ says Prof. Markus List, Professor of Data Science in Systems Biology at TUM in Freising.

In summary, Prof. Haselbeck, Professor of Smart Farming at HSWT, explains: ‘I would particularly like to highlight the excellent inter-institutional cooperation. We hope that our work will improve the quality and reliability of future machine learning models for biological applications.’

Original Publication (Open Access)
Bernett, J., Blumenthal, D.B., Grimm, D.G., Haselbeck, F. et al. Guiding questions to avoid data leakage in biological machine learning applications. Nat Methods 21, 1444–1453 (2024). https://doi.org/10.1038/s41592-024-02362-y