Dominik Grimm, Author at BIT - TUM Campus Straubing

New Article in Nature Methods: Guiding questions to avoid data leakage in biological machine learning applications

Artificial intelligence (AI) has become indispensable in biological research and is driving major advances. In some cases, however, real-world applications fail to reproduce the reported predictive performance. One of the main reasons is data leakage, i.e. the unintended transfer of information between training and test data.

In this Nature Methods Perspective, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. By applying these questions to real examples in biology, we aim to make researchers aware of the complex latent interdependencies and the many routes to data leakage in biological applications. We strongly encourage researchers to engage in an interdisciplinary dialogue and to consult experts from both machine learning and biology to ensure robust, reliable, and reproducible ML research in biology.
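One common way such leakage arises is when closely related samples (for example, sequences from the same protein family) land on both sides of a train/test split. The following is a minimal sketch of a group-aware split, not taken from the Perspective itself. The function name `group_split`, the `families` labels, and the split fraction are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test.

    `groups` holds one label per sample (hypothetical here, e.g. a protein
    family or a patient ID); samples sharing a label are kept together.
    """
    # Collect the sample indices belonging to each group label.
    by_group = defaultdict(list)
    for index, label in enumerate(groups):
        by_group[label].append(index)

    # Shuffle the group labels (not the samples) reproducibly.
    labels = sorted(by_group)
    random.Random(seed).shuffle(labels)

    # Reserve whole groups for the test set; at least one group is held out.
    n_test_groups = max(1, round(len(labels) * test_fraction))
    test_labels = set(labels[:n_test_groups])

    train = [i for label in labels if label not in test_labels for i in by_group[label]]
    test = [i for label in labels if label in test_labels for i in by_group[label]]
    return sorted(train), sorted(test)


# Hypothetical example: eight sequences from four families.
families = ["A", "A", "B", "B", "B", "C", "D", "D"]
train_idx, test_idx = group_split(families)

train_families = {families[i] for i in train_idx}
test_families = {families[i] for i in test_idx}
assert train_families.isdisjoint(test_families)
```

Because whole groups are assigned to one side, the test fraction is only approximate at the sample level. Which grouping is appropriate (family, patient, experiment batch, time point) depends on the biological question and is exactly the kind of decision that benefits from domain expertise.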

TMLR Paper Awarded “Featured” Certification

Our latest paper, “Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement”, has been published in Transactions on Machine Learning Research (TMLR). This impressive work by Jonathan Pirnay was awarded a “Featured” Certification.

Keynote Talk @ ECML, Machine Learning for Chemistry and Chemical Engineering (ML4CCE)

Dominik gave an invited keynote talk at the Machine Learning for Chemistry and Chemical Engineering (ML4CCE) Workshop, held at the European Conference on Machine Learning and Data Mining, on “Automated flowsheet synthesis with deep reinforcement learning”.

Jasmin joins the team as Team Assistant

We welcome Jasmin, our new Team Assistant, to the team.

Ashima joins the team as Research Assistant

Ashima joins the team as a research assistant. She will work on novel machine learning methods for synthetic protein design and the in silico evaluation of generated artificial sequences.