News
New Article in Nature Methods: Guiding questions to avoid data leakage in biological machine learning applications
Artificial intelligence (AI) has become indispensable in biological research and is driving major advances. However, in some cases, reported predictive performance does not hold up in real-world applications. One of the main reasons for this is data leakage, i.e. the unintended transfer of information between training and test data.
In this Nature Methods Perspective, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. By applying these questions to real examples in biology, we aim to make researchers aware of the complex latent interdependencies and the potential for data leakage in biological applications. We strongly encourage researchers to engage in an interdisciplinary dialogue and to consult experts from both domains to ensure robust, reliable, and reproducible ML research in biology.
TMLR Paper awarded “Featured” certification
Our latest paper, “Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement”, has been published in Transactions on Machine Learning Research (TMLR). This impressive work by Jonathan Pirnay was awarded a “Featured” certification.
Keynote Talk @ ECML, Machine Learning for Chemistry and Chemical Engineering (ML4CCE)
Dominik gave an invited keynote talk on “Automated flowsheet synthesis with deep reinforcement learning” at the Machine Learning for Chemistry and Chemical Engineering (ML4CCE) Workshop of the European Conference on Machine Learning and Data Mining.
Jasmin joins the team as Team Assistant
We welcome Jasmin, our new Team Assistant, to the team.
Ashima joins the team as Research Assistant
Ashima joins the team as a research assistant. She will work on novel machine learning methods for synthetic protein design and the in silico evaluation of generated artificial sequences.
New Paper: Forecasting seasonally fluctuating sales of perishable products in the horticultural industry
Our latest paper on horticultural demand forecasting has been published in Expert Systems with Applications: “Forecasting seasonally fluctuating sales of perishable products in the horticultural industry”.
Accurately forecasting demand is a potential competitive advantage, especially for perishable products such as those in the horticultural industry, where the disposal of unsold items results in environmental and financial damage. Despite the challenging operational decisions needed to avoid out-of-stock and overstock situations, horticultural businesses have received limited attention in forecasting research. In addition, horticultural sales are typically highly seasonal, and their sales cycles are characterized by sudden rises and falls.
In our study, we examine the applicability of general versus dataset-specific predictors, the impact of external information, and online model update schemes. Using a diverse set of real-world horticultural data, we applied three classical and twelve machine learning-based forecasting approaches.
💡 Key Findings:
Multivariate machine learning models dominate: Our results show the superiority of multivariate machine learning methods over classical forecasting approaches, with the ensemble learner XGBoost emerging as a standout performer.
External factors play a critical role: The inclusion of statistical, calendrical, and weather-related features in the feature set is critical for robust performance.
Firm-specific predictors outperform general cross-firm models: We find that a generalized model, which would be advantageous in terms of computational resources, maintenance, and transferability to other datasets, falls short in capturing the heterogeneity of horticultural data, highlighting the need for firm-specific predictors.
Impact of frequent model updates is negligible: Surprisingly, frequent model updates have a negligible impact on forecast quality, allowing long-term forecasting without significant performance degradation.
New Paper: Manually annotated and curated Dataset of diverse Weed Species in Maize and Sorghum for Computer Vision
New paper about an impressive manually annotated and curated dataset of diverse weed species in maize and sorghum for computer vision. Here we present a dataset, the Moving Fields Weed Dataset (MFWD), which captures the growth of 28 weed species commonly found in sorghum and maize fields in Germany. A total of 94,321 images were acquired in a fully automated, high-throughput phenotyping facility to track over 5,000 individual plants at high spatial and temporal resolution. A rich set of manually curated ground truth information is also provided, which can be used not only for plant species classification, object detection and instance segmentation tasks, but also for multiple object tracking.
New Paper: Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings
New paper about a Protein Language model-based Thermophilicity predictor (ProLaTherm). ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics.
Florian successfully defended his PhD
The first Grimm lab member successfully defended his PhD. Congratulations, Dr. Florian Haselbeck, on this great achievement!
New Paper: Improved Weed Segmentation in UAV Imagery of Sorghum Fields with a Combined Deblurring Segmentation Model
New paper about a combined deblurring and segmentation model for weed and crop segmentation in motion-blurred images. Our combined deblurring and segmentation model, DeBlurWeedSeg, is able to accurately segment weeds from sorghum and background in both sharp and motion-blurred drone captures. This has significant practical implications, as lower error rates in weed and crop segmentation could lead to better weed control, e.g. when using robots for mechanical weed removal.