Why did you see the need for the guideline you published with researchers from FAU Erlangen, the Helmholtz Institute for Pharmaceutical Research Saarland, and Saarland University?
Dominik Grimm: There is a lot of activity in this area, which is good because many questions can no longer be answered with purely human analytical capabilities. At the same time, there is a discrepancy between the results obtained in studies and those obtained in real-world applications. Results are often not reproducible. This poses a significant risk, for example, when these models are used in clinical diagnostics.
Markus List: Many publications present models with very high predictive accuracy. This creates a false sense of security, as the model initially appears to reliably solve the required task. However, it is often impossible to understand how the model arrived at its predictions. Methodological problems in machine learning and hidden dependencies in the data can lead to unrealistically high accuracy. The latter can only be identified with expertise in both machine learning and the life sciences. Therefore, we advocate for more collaboration between the different disciplines to combine their competencies. This way, they can identify problems that are caused by hidden dependencies.
What do you mean by hidden dependencies?
List: Often, data from a single study is used to develop models. It is rarely tested whether models also work in practice with data collected in a different location or with other measuring devices. For example, imagine we create a dataset describing the microbiome of 500 people from Munich. We split this data and use 400 samples as training data for the model. We initially hold back 100 samples to measure how well the model applies to unseen data. These are our test data. The model then learns to recognize patterns present at the molecular level in patients living in Munich. It works very well with the 100 held-back samples. However, when applied to people in Hamburg, the results suddenly differ. One cause could be hidden dependencies, such as people living in Munich having a different microbiome than the population of Hamburg.
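The difference between the two kinds of test set can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the guideline: the records are empty placeholders, the Hamburg cohort of 100 samples is invented for the example, and the function names `random_split` and `split_by_site` are hypothetical.

```python
import random

def random_split(samples, n_test, seed=0):
    """Shuffle a copy of the samples and hold back n_test of them as test data."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def split_by_site(samples, held_out_site):
    """Hold back every sample from one site, so the test data come from an unseen location."""
    train = [s for s in samples if s["site"] != held_out_site]
    test = [s for s in samples if s["site"] == held_out_site]
    return train, test

# Placeholder records: 500 samples from Munich (as in the interview) and,
# as an assumption for this sketch, 100 samples from Hamburg.
munich = [{"id": f"M{i}", "site": "Munich"} for i in range(500)]
hamburg = [{"id": f"H{i}", "site": "Hamburg"} for i in range(100)]

# Split within the Munich study: 400 training samples, 100 test samples.
train, test = random_split(munich, n_test=100)
sites_in_random_test = {s["site"] for s in test}

# Split by location: train on Munich, test on Hamburg only.
train_ext, test_ext = split_by_site(munich + hamburg, held_out_site="Hamburg")
sites_in_external_test = {s["site"] for s in test_ext}

print(len(train), len(test), sites_in_random_test)
print(len(train_ext), len(test_ext), sites_in_external_test)
```

In the first split, every test sample still comes from Munich, so a good test score says nothing about Hamburg. Only the second split measures performance on a location the model has not seen.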
A problem also arises when the model is trained with information that is unavailable later. For example, if you want the model to predict whether someone will develop high blood pressure, you use clinical data from people with high blood pressure as training data. The model then looks for indicators of high blood pressure and finds that patients take antihypertensive drugs. However, if you use it for a person with undiagnosed high blood pressure, you will not see this feature in the clinical data because the person is not yet taking medication.
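A simple screening step for this kind of leak might look as follows. It is a minimal sketch with invented toy records: the field names (`takes_antihypertensives`, `smoker`, `hypertension`), the helper functions, and the 0.95 threshold are all assumptions made for illustration. A flagged feature is only a candidate. Deciding whether it is really unavailable at prediction time still requires clinical knowledge.

```python
def agreement_with_label(records, feature, label):
    """Fraction of records in which a binary feature has the same value as the label."""
    matches = sum(1 for r in records if r[feature] == r[label])
    return matches / len(records)

def suspicious_features(records, features, label, threshold=0.95):
    """Return features that agree with the label almost perfectly (possible leakage)."""
    return [f for f in features
            if agreement_with_label(records, f, label) >= threshold]

# Invented toy records, not real clinical data.
records = [
    {"takes_antihypertensives": 1, "smoker": 1, "hypertension": 1},
    {"takes_antihypertensives": 1, "smoker": 0, "hypertension": 1},
    {"takes_antihypertensives": 0, "smoker": 1, "hypertension": 0},
    {"takes_antihypertensives": 0, "smoker": 0, "hypertension": 0},
    {"takes_antihypertensives": 1, "smoker": 0, "hypertension": 1},
    {"takes_antihypertensives": 0, "smoker": 1, "hypertension": 0},
]

flagged = suspicious_features(
    records, ["takes_antihypertensives", "smoker"], "hypertension"
)
print(flagged)
```

In these toy records the medication field mirrors the diagnosis exactly, so it is flagged. A model trained on it would learn the consequence of the diagnosis rather than its early indicators.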
So parts of the training data end up in the test data, but they shouldn’t be there?
Grimm: Yes, that’s correct. We call this data leakage, which can be described as the illicit spillover of information from the training data to the test data. Hidden correlations then arise from measurements that are irrelevant or misleading in the actual application. Our guidelines aim to raise awareness of this problem and, more importantly, to improve the understanding of data and applications. This way, hidden dependencies can be identified early, and data leakage can be avoided when developing and training new models.
List: Ultimately, it’s a matter of carefully considering the application for which the models are being developed. When training, you must ensure that you have the appropriate data for the specific application. However, independent data is often not available for testing. For models to be robust, they must be designed to avoid taking shortcuts or incorporating biases.
Can you briefly explain what you mean by that?
List: Models are often trained on data that represent certain aspects in a one-sided way. In the earlier microbiome example, this was the geographic component, which was not sufficiently taken into account. In practice, we frequently encounter the problem that well-researched diseases are overrepresented in databases compared with those for which little reliable knowledge exists. Such biases then sometimes lead to incorrect model predictions.
And what happens if these problems are not addressed?
Grimm: Data collected over decades of research is stored in databases and can be used for subsequent research projects. If errors creep in, they perpetuate themselves in subsequent studies. Ultimately, this could affect medical treatment and, in the worst case, even jeopardize patient safety.
List: This problem is exacerbated as we collect more data and the methods become more complex. With simple models, it is still possible to understand how a result comes about. With highly complex neural networks, this eventually becomes impossible. We must break open the black box, critically examine possible biases, and test models for practical applicability. Many researchers are also developing new methods that allow us to look into the black box and understand decision-making processes.
Grimm: Researchers need to understand the complexity of the data and dependencies, and what they are feeding the algorithms. They also need to be clear about the questions they want the models to answer. Used wisely, models can help us narrow down search spaces and find clues to solutions. It is now essential to steer the work with the models in the right direction to achieve this.