New Article in Nature Methods: Guiding questions to avoid data leakage in biological machine learning applications

Artificial intelligence (AI) has become indispensable in biological research and is driving major advances. However, in some cases, reported predictive performance does not hold up in real-world applications. One of the main reasons is data leakage, i.e., the unintended transfer of information between training and test data.

In this Nature Methods Perspective, we present seven questions that should be asked to prevent data leakage when constructing machine learning (ML) models in biological domains. By applying these questions to real examples from biology, we aim to make researchers aware of the complex latent interdependencies and the resulting opportunities for data leakage in biological applications. We strongly encourage researchers to engage in an interdisciplinary dialogue and to consult experts from both fields to ensure robust, reliable, and reproducible ML research in biology.
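One well-known way leakage arises is when closely related samples end up on both sides of the train/test split. The following is a minimal sketch of a group-aware split, not a method taken from the article: the function `group_split`, the toy sequence names, and the family labels are all hypothetical and serve only as an illustration.

```python
import random
from collections import defaultdict


def group_split(samples, group_of, test_fraction=0.2, seed=0):
    """Split samples so that no group appears in both train and test.

    `group_of` maps a sample to its group label (hypothetical example:
    the protein family a sequence belongs to).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    # Shuffle whole groups, not individual samples.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = group_ids[:n_test]
    train_ids = group_ids[n_test:]

    train = [s for g in train_ids for s in groups[g]]
    test = [s for g in test_ids for s in groups[g]]
    return train, test


# Toy data (assumed): (sequence name, family label)
samples = [
    ("seqA1", "famA"), ("seqA2", "famA"),
    ("seqB1", "famB"),
    ("seqC1", "famC"), ("seqC2", "famC"),
    ("seqD1", "famD"),
    ("seqE1", "famE"),
]
train, test = group_split(samples, group_of=lambda s: s[1])

train_groups = {s[1] for s in train}
test_groups = {s[1] for s in test}
print("shared groups:", train_groups & test_groups)
```

Because entire groups are assigned to one side, related samples can never straddle the split. Which grouping is appropriate (family, patient, experiment batch, and so on) depends on the biological question and is exactly the kind of decision that benefits from domain expertise.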

TMLR Paper awarded “Featured” certification

Our latest paper “Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement” has been published in Transactions on Machine Learning Research (TMLR). This impressive work by Jonathan Pirnay was awarded a “Featured” certification.

Ashima joins the Team as Research Assistant

Ashima joins the team as a research assistant. She will work on novel machine learning methods for synthetic protein design and the in silico evaluation of generated artificial sequences.

New Paper: Forecasting seasonally fluctuating sales of perishable products in the horticultural industry

Our latest paper on horticultural demand forecasting has been published in Expert Systems with Applications: “Forecasting seasonally fluctuating sales of perishable products in the horticultural industry”. Accurate demand forecasting is a potential competitive advantage, especially for perishable products such as those in the horticultural industry, where the disposal of unsold items causes environmental and financial damage. Although avoiding out-of-stock and overstock situations requires challenging operational decisions, horticultural businesses have received limited attention in forecasting research. In addition, horticultural sales are typically highly seasonal, and their sales cycles are characterized by sudden changes in both directions, rising and falling.

In our study, we examine three research questions: the applicability of general versus dataset-specific predictors, the impact of external information, and online model update schemes. Using a diverse set of real-world horticultural data, we applied three classical and twelve machine learning-based forecasting approaches.

💡 Key Findings:

Multivariate machine learning models dominate: Our results show the superiority of multivariate machine learning methods over classical forecasting approaches, with the ensemble learner XGBoost emerging as a standout performer.

External factors play a critical role: The inclusion of statistical, calendrical, and weather-related features in the feature set is critical for robust performance.

Firm-specific predictors outperform general cross-firm models: We find that a generalized model, which would be advantageous in terms of computational resources, maintenance, and transferability to other datasets, falls short in capturing the heterogeneity of horticultural data, highlighting the need for firm-specific predictors.

Impact of frequent model updates is negligible: Surprisingly, frequent model updates have a negligible impact on forecast quality, allowing long-term forecasting without significant performance degradation.
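To make the notion of calendrical and statistical features more concrete, here is a small sketch of how such features could be derived from a daily sales series. It is not the feature set used in the paper: the function `make_features`, the feature names, the 7-day window, and the toy sales figures are all assumptions chosen for illustration, and weather-related features are omitted because they would require an external data source.

```python
from datetime import date, timedelta


def make_features(dates, sales, lag=7):
    """Build one feature row per day that has a full lag window behind it."""
    rows = []
    for i in range(lag, len(sales)):
        d = dates[i]
        window = sales[i - lag:i]  # past values only, so the target is never included
        rows.append({
            # calendrical features
            "weekday": d.weekday(),              # 0 = Monday
            "month": d.month,
            "week_of_year": d.isocalendar()[1],
            # statistical features computed from past sales
            "lag_1": sales[i - 1],
            "rolling_mean": sum(window) / lag,
            # value to be predicted
            "target": sales[i],
        })
    return rows


# Toy daily sales (assumed values), starting on Monday, 2024-01-01
start = date(2024, 1, 1)
dates = [start + timedelta(days=k) for k in range(10)]
sales = [10, 12, 11, 15, 14, 30, 28, 9, 13, 12]

rows = make_features(dates, sales)
for row in rows:
    print(row)
```

Each row could then be fed to a multivariate learner. Note that the statistical features look only backwards in time, which keeps information about the target out of its own predictors.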