Volume 2
This is the second volume.
-
- The Matrix Reloaded: Towards Counterfactual Group Fairness in Machine Learning
- Mariana Pinto, Andre V Carreiro, Pedro Madeira, Alberto Lopez, Hugo Gamboa, (1):1−55, 2024.
Abstract
In today’s data-driven world, addressing bias is essential to minimize discriminatory outcomes and work toward fairness in machine learning models. This paper presents a novel data-centric framework for bias analysis, harnessing the power of counterfactual reasoning. We detail a process for generating plausible counterfactuals suited for group evaluation, using probabilistic distributions and optionally incorporating domain knowledge, as a more efficient alternative to computationally intensive generative models. Additionally, we introduce the Counterfactual Confusion Matrix, from which we derive a suite of metrics that provide a comprehensive view of a model’s behaviour under counterfactual conditions. These metrics offer unique insights into the model’s resilience and susceptibility to changes in sensitive attributes, such as sex or race. We demonstrate their utility and complementarity with standard group fairness metrics through experiments on real-world datasets. Our results show that domain knowledge is key, and that our metrics can reveal subtle biases that traditional bias evaluation strategies may overlook, providing a more nuanced understanding of potential model bias.
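To make the core construction concrete, here is a minimal sketch of a counterfactual confusion matrix (our illustration, not the authors' implementation): predictions on original samples are cross-tabulated against predictions on counterfactual copies with a flipped binary sensitive attribute, whereas the paper generates plausible counterfactuals from probabilistic distributions and optional domain knowledge.

```python
import numpy as np

def counterfactual_confusion_matrix(model, X, sensitive_col):
    """Cross-tabulate predictions on originals vs. counterfactual copies.

    Sketch only: the counterfactual here is a naive flip of a binary
    sensitive attribute; the paper's counterfactuals are generated from
    probabilistic distributions and domain knowledge.
    """
    X_cf = X.copy()
    X_cf[:, sensitive_col] = 1 - X_cf[:, sensitive_col]  # flip 0/1 attribute
    y_orig = model.predict(X)
    y_cf = model.predict(X_cf)
    # rows = prediction on original, cols = prediction on counterfactual
    cm = np.zeros((2, 2), dtype=int)
    for o, c in zip(y_orig, y_cf):
        cm[int(o), int(c)] += 1
    return cm

def flip_rate(cm):
    """Fraction of predictions that change when the sensitive attribute
    flips; 0 would indicate perfect counterfactual stability."""
    return (cm[0, 1] + cm[1, 0]) / cm.sum()
```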
-
- Properties of Alternative Data for Fairer Credit Risk Predictions
- Jung Youn Lee, Joonhyuk Yang, (2):1−27, 2024.
Abstract
In the consumer lending market, women tend to have lower access to credit than men, despite evidence suggesting that women are better at repaying their debts. This study explores the potential impact of leveraging alternative data, which traditionally has not been used by financial institutions, on credit risk predictions for men and women. By leveraging unique data on individuals’ credit card default behaviors and their purchase behaviors at a supermarket, we simulate a credit card issuer’s credit scoring process. In the absence of supermarket data, the algorithm’s predictive accuracy for women is about 2.3% lower than that for men. We then integrate data from each of the 410 product markets within the supermarket into the algorithm and measure the changes in the gender gap in predictive accuracy. We find wide variation in both the direction and magnitude of the incremental gender gap, ranging from -142% to 70% relative to the baseline. These findings highlight that leveraging alternative data from a non-financial domain can lead to fairer credit outcomes, but only under certain conditions. We characterize these conditions by identifying two data properties: the capacity to proxy gender and the relative amount of creditworthiness signal the data provide for each gender.
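Reading the reported range as a change relative to the baseline gap (our interpretation for illustration; the paper defines the metric precisely), the endpoints work out as follows:

```python
def incremental_gap_pct(baseline_gap, new_gap):
    """Change in the gender gap relative to the baseline, in percent.

    Illustrative reading only: -100 would mean the gap is fully closed,
    values below -100 mean it reverses, positive values mean it widens.
    """
    return 100.0 * (new_gap - baseline_gap) / baseline_gap

# Baseline: women's predictive accuracy about 2.3 points below men's.
print(incremental_gap_pct(2.3, -0.97))  # ~ -142%: gap reverses in women's favour
print(incremental_gap_pct(2.3, 3.91))   # ~  +70%: gap widens
```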
-
- OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection
- Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Yixuan Li, Ziwei Liu, Yiran Chen, Hai Li, (3):1−32, 2024.
Abstract
Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, inconsistencies in evaluation make it difficult to track progress in the field. OpenOOD v1 initiated the unification of OOD detection evaluation but faced limitations in scalability and scope. In response, this paper presents OpenOOD v1.5, a significant improvement over its predecessor that ensures accurate and standardized evaluation of OOD detection methodologies at large scale. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets (ImageNet) and foundation models (e.g., CLIP and DINOv2), and expands its scope to investigate full-spectrum OOD detection, which considers semantic and covariate distribution shifts at the same time. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.
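As an illustration of the kind of evaluation such a benchmark standardizes (a generic sketch, not OpenOOD's API), a classic post-hoc detector scores each sample by its maximum softmax probability, and detectors are compared by how well their scores separate in-distribution from OOD data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def msp_score(logits):
    """Maximum softmax probability: a classic post-hoc OOD score.

    Higher means 'more in-distribution'. This is one of many detectors
    that benchmarks like OpenOOD compare under a common protocol.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def ood_auroc(logits_id, logits_ood):
    """AUROC for separating ID (label 1) from OOD (label 0) samples."""
    scores = np.concatenate([msp_score(logits_id), msp_score(logits_ood)])
    labels = np.concatenate([np.ones(len(logits_id)), np.zeros(len(logits_ood))])
    return roc_auc_score(labels, scores)
```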
-
- Evaluating Durability: Benchmark Insights into Image and Text Watermarking
- Jielin Qiu, William Han, Xuandong Zhao, Shangbang Long, Christos Faloutsos, Lei Li, (4):1−44, 2024.
Abstract
As large models become increasingly prevalent, watermarking has emerged as a crucial technology for copyright protection, authenticity verification, and content tracking. The rise of multimodal applications further amplifies the importance of effective watermarking techniques. While watermark robustness is critical for real-world deployment, the current understanding of watermark robustness against various forms of corruption remains limited. Our study evaluates watermark robustness in both image and text domains, testing against an extensive set of 100 image perturbations and 63 text perturbations. The results reveal significant vulnerabilities in contemporary watermarking approaches: detection accuracy deteriorates by more than 50% under common perturbations, highlighting a critical gap between current capabilities and practical requirements. These findings emphasize the urgent need for more robust watermarking methods that can withstand real-world disturbances. Our project website can be found at https://mmwatermark-robustness.github.io/.
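A hedged sketch of the measurement protocol the abstract implies (function names are placeholders, not the benchmark's code): apply each perturbation to a set of watermarked items and compare detection accuracy against the clean baseline.

```python
def robustness_report(detector, watermarked, perturbations):
    """Detection accuracy before and after each perturbation.

    Sketch only: `detector` maps an item to True/False (watermark found),
    `perturbations` maps a name to a corruption function. A drop of more
    than 50 points under common corruptions is the failure mode the
    benchmark reports.
    """
    clean_acc = sum(detector(x) for x in watermarked) / len(watermarked)
    report = {"clean": clean_acc}
    for name, perturb in perturbations.items():
        acc = sum(detector(perturb(x)) for x in watermarked) / len(watermarked)
        report[name] = acc
    return report
```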
-
- ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications
- Juan Pablo Zuluaga Gomez, Karel Veselý, Igor Szöke, Alexander Blatt, Petr Motlicek, Martin Kocour, Khalid Choukri, Iuliia Nigmatulina, Claudia Cevenini, Allan Tart, Jan Cernocký, Dietrich Klakow, (5):1−45, 2024.
Abstract
Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. In order to incorporate these novel technologies into ATC, large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). However, ATC is considered a low-resource domain. In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research in the challenging ATC field, which has lagged behind due to a lack of annotated data. In addition, we open-source a GitHub repository that contains data preparation and training scripts useful for replicating our ASR and NLU baselines. The ATCO2 corpus covers 1) audio and radar data collection and pre-processing, 2) pseudo-transcriptions of speech audio, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets: (i) the ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold transcriptions for named-entity recognition (callsign, command, value) and speaker role detection; (ii) the ATCO2-test-set-1h corpus is a one-hour open-sourced subset of the 4-hour test set, free to download at https://www.atco2.org/data; (iii) the ATCO2-PL-set corpus consists of 5,281 hours of pseudo-transcribed ATC speech enriched with contextual information (a list of relevant n-gram sequences per utterance), speaker turn information, a signal-to-noise ratio estimate, and an English language detection score per sample. The whole ATCO2 corpus is publicly distributed through the ELDA catalog (https://catalog.elra.info/en-us/repository/browse/ELRA-S0484/). We expect the corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.
-
- Constructing Confidence Intervals for “the” Generalization Error – a Comprehensive Benchmark Study
- Hannah Schulz-Kümpel, Sebastian Felix Fischer, Roman Hornung, Anne-Laure Boulesteix, Thomas Nagler, Bernd Bischl, (6):1−73, 2025.
Abstract
When assessing the quality of prediction models in machine learning, confidence intervals (CIs) for the generalization error, which measures predictive performance, are a crucial tool. Luckily, there exist many methods for computing such CIs and new promising approaches are continuously being proposed. Typically, these methods combine various resampling procedures, most popular among them cross-validation and bootstrapping, with different variance estimation techniques. Unfortunately, however, there is currently no consensus on when any of these combinations may be most reliably employed and how they generally compare. In this work, we conduct a large-scale study comparing CIs for the generalization error, the first of this size, where we empirically evaluate 13 different CI methods on a total of 19 tabular regression and classification problems, using seven different inducers and a total of eight loss functions. We give an overview of the methodological foundations and inherent challenges of constructing CIs for the generalization error and provide a concise review of all 13 methods in a unified framework. Finally, the CI methods are evaluated in terms of their relative coverage frequency, width, and runtime. Based on these findings, we can identify a subset of methods that we would recommend. We also publish the datasets as a benchmarking suite on OpenML and our code on GitHub to serve as a basis for further studies.
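For orientation, one of the simplest recipes in this family, a t-interval over per-fold cross-validation losses, can be sketched as follows (a generic textbook example, not one of the paper's 13 methods specifically; correlations between fold estimates are known to make such intervals under-cover, which is precisely why coverage is benchmarked):

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def naive_cv_ci(model, X, y, loss, k=10, alpha=0.05):
    """t-interval over per-fold losses: the simplest CI recipe.

    Sketch only. `loss(y_true, y_pred)` returns a scalar per fold
    (e.g., mean squared error). Fold estimates are correlated, so this
    interval tends to under-cover; comparing such recipes on coverage,
    width, and runtime is exactly what the benchmark does.
    """
    fold_losses = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model.fit(X[train], y[train])
        fold_losses.append(loss(y[test], model.predict(X[test])))
    m, se = np.mean(fold_losses), stats.sem(fold_losses)
    t = stats.t.ppf(1 - alpha / 2, df=k - 1)
    return m - t * se, m + t * se
```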
-
- Towards impactful challenges: post-challenge paper, benchmarks and other dissemination actions
- David Rousseau, Antoine Marot, Zhen Xu, (7):1−20, 2025.
Abstract
The conclusion of an AI challenge is not the end of its lifecycle; ensuring a long-lasting impact requires meticulous, well-organised post-challenge activities. This chapter covers the various activities after the challenge is formally finished: it identifies target audiences for post-challenge initiatives and outlines how to collect and organize the challenge's many outputs. The central part of the chapter is a template for a typical post-challenge paper, including possible graphs and advice on how to turn the challenge into a long-lasting benchmark.
-
- SuperBench: A Super-Resolution Benchmark Dataset for Scientific Machine Learning
- Pu Ren, N. Benjamin Erichson, Junyi Guo, Shashank Subramanian, Omer San, Zarija Lukic, Michael W. Mahoney, (8):1−45, 2025.
Abstract
Super-resolution (SR) techniques aim to enhance data resolution, enabling the retrieval of finer details and improving the overall quality and fidelity of the data representation. There is growing interest in applying SR methods to complex spatiotemporal systems within the Scientific Machine Learning (SciML) community, with the hope of accelerating numerical simulations and/or improving forecasts in weather, climate, and related areas. However, the lack of standardized benchmark datasets for comparing and validating SR methods hinders progress and adoption in SciML. To address this, we introduce SuperBench (https://github.com/erichson/SuperBench), the first benchmark dataset featuring high-resolution data (up to 2048 × 2048), including data from fluid flows, cosmology, and weather. Here, we focus on validating spatial SR performance from data-centric and physics-preserving perspectives, as well as assessing robustness to data degradation tasks. While deep learning-based SR methods developed in the computer vision community excel on certain tasks despite relatively limited prior physics information, we identify limitations of these methods in accurately capturing intricate fine-scale features and preserving fundamental physical properties and constraints in scientific data. These shortcomings highlight the importance and subtlety of incorporating domain knowledge into ML models. We anticipate that SuperBench will help to advance SR methods for science.
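One concrete way to probe "physics preservation" for incompressible flow (our illustrative check; whether SuperBench uses exactly this diagnostic is an assumption) is the discrete divergence of a super-resolved velocity field, which should stay near zero:

```python
import numpy as np

def divergence_2d(u, v, dx=1.0):
    """Discrete divergence of a 2D velocity field via central differences.

    For incompressible flow, a physically faithful super-resolved field
    should keep this field near zero everywhere. Illustrative check only.
    """
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dx, axis=0)
    return du_dx + dv_dy

def physics_violation(u, v, dx=1.0):
    """Mean absolute divergence: lower is more physically consistent."""
    return float(np.abs(divergence_2d(u, v, dx)).mean())
```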
-
- V-LoL: A Diagnostic Dataset for Visual Logical Learning
- Lukas Helff, Wolfgang Stammer, Hikaru Shindo, Devendra Singh Dhami, Kristian Kersting, (9):1−41, 2025.
Abstract
Despite the successes of recent developments in visual AI, shortcomings remain, from exact logical reasoning to abstract generalization to understanding complex and noisy scenes. Unfortunately, existing benchmarks were not designed to capture more than a few of these aspects. Whereas deep learning datasets focus on visually complex data but simple visual reasoning tasks, inductive logic datasets involve complex logical learning tasks but lack the visual component. To address this, we propose the diagnostic visual logical learning dataset V-LoL, which seamlessly combines visual and logical challenges. Notably, we introduce the first instantiation of V-LoL, V-LoL-Train, a visual rendition of a classic benchmark in symbolic AI: the Michalski train problem. By incorporating intricate visual scenes and flexible logical reasoning tasks within a versatile framework, V-LoL-Train provides a platform for investigating a wide range of visual logical learning challenges. We evaluate a variety of AI systems including traditional symbolic AI, neural AI, as well as neuro-symbolic AI. Our evaluations demonstrate that even SOTA AI faces difficulties in dealing with visual logical learning challenges, highlighting unique advantages and limitations of each methodology. Overall, V-LoL opens up new avenues for understanding and enhancing current abilities in visual logical learning for AI systems.
-
- Challenge design roadmap
- Hugo Jair Escalante, Isabelle Guyon, Addison Howard, Walter Reade, Sebastien Treguer, (10):1−42, 2025.
Abstract
This document serves as a comprehensive guide for designing and organizing effective challenges, particularly within the domains of machine learning and artificial intelligence. It provides detailed guidelines on every phase of the process, from conception and execution to post-challenge analysis. Challenges function as motivational mechanisms that drive participants to address significant tasks. Consequently, organizers must establish rules that fulfill objectives beyond mere participant engagement. These objectives include solving real-world problems, advancing scientific or technical fields, facilitating discoveries, educating the public, providing platforms for skill development, and recruiting new talent. The creation of a challenge is analogous to product development; it requires enthusiasm and rigorous testing, and it aims to attract participants. The process commences with a comprehensive plan, such as a challenge proposal submitted for peer review at an international conference. This document presents guidelines for developing such a robust challenge plan, ensuring it is both engaging and impactful.
-
- Data Acquisition: A New Frontier in Data-centric AI
- Lingjiao Chen, Bilge Acun, Newsha Ardalani, Yifan Sun, Feiyang Kang, Hanrui Lyu, Yongchan Kwon, Ruoxi Jia, Carole-Jean Wu, Matei Zaharia, James Zou, (11):1−19, 2025.
Abstract
As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes imperative. Yet the challenges of data acquisition remain understudied, owing to ad-hoc processes and a lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing a lack of platforms that offer detailed information about datasets, transparent pricing, and standardized data formats. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between data providers and acquirers in a data marketplace. The benchmark was released as part of DataPerf (Mazumder et al., 2022). Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.
-
- Deep Learning for Accurate Diagnosis of Viral Infections through scRNA-seq Analysis: A Comprehensive Benchmark Study
- Ziwei Yang, Xuxi Chen, Biqing Zhu, Tianlong Chen, Zhangyang Wang, (12):1−19, 2025.
Abstract
Infectious disease diagnostics primarily rely on physicians’ clinical expertise and rapid antigen/antibody tests, a subjective approach prone to errors due to various factors including patient history accuracy and physician experience. To address these challenges, we propose a biological evidence-based diagnostic tool using deep learning to analyze patient-derived single-cell RNA sequencing (scRNA-seq) profiles from blood samples. scRNA-seq provides high-resolution gene expression data at the single-cell level, capturing unique transcriptional signatures and immunological responses induced by different viral infections. In this work, we conducted a first-of-its-kind benchmark study to evaluate five computational models, including four deep learning-based methods (contrastiveVI, scVI, SAVER, scGPT) and PCA as a baseline, trained and evaluated on patient-derived scRNA-seq datasets carefully sourced by us. We assess their efficacy in distinguishing scRNA-seq profiles associated with various viral infections, aiming to identify distinct immunological features representative of each infection. The results demonstrate that contrastiveVI outperforms the other models on all key performance metrics as well as in visual clustering performance. Furthermore, our research underscores the substantial influence of batch effects when analyzing scRNA-seq data from multiple sources. Overall, our study successfully demonstrates that deep learning models can accurately identify the type of infection from patient plasma samples based on scRNA-seq profiles, and can improve accuracy and specificity in the diagnosis of infectious diseases. This research contributes to the development of more objective, evidence-based diagnostic methods in the infectious disease domain, potentially reducing diagnostic errors and improving patient outcomes.
-
- Text Quality-Based Pruning for Efficient Training of Language Models
- Vasu Sharma, Karthik Padthe, Newsha Ardalani, Kushal Tirumala, Russell Howes, Hu Xu, Po-Yao Huang, Daniel Li Chen, Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer, (13):1−13, 2025.
Abstract
Training Language Models (LMs) has come to rely on computationally heavy optimization over massive datasets, which makes the training process extremely laborious. In this paper we propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets in a model-agnostic manner, assigning each text instance a quality score. With this text quality metric, we establish a framework to identify and eliminate low-quality text instances, leading to improved training efficiency for LM models. Experimental results over multiple models and datasets demonstrate the efficacy of this approach, showcasing substantial gains in training effectiveness and highlighting the potential for resource-efficient LM training. For example, we observe an absolute accuracy improvement of 0.9% averaged over 14 downstream evaluation tasks for multiple LM models while using 40% less data and training 42% faster on the OpenWebText dataset, and a 0.8% average absolute accuracy improvement while using 20% less data and training 21% faster on the Wikipedia dataset.
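The pruning step itself is simple once a quality score exists; below is a minimal sketch with a deliberately crude stand-in score (the paper's model-agnostic metric is the actual contribution; the 60% keep fraction here simply mirrors the "40% less data" OpenWebText setting):

```python
def prune_by_quality(texts, quality_score, keep_fraction=0.6):
    """Keep the highest-quality fraction of a corpus.

    Sketch of the pruning step only; `quality_score` stands in for the
    paper's model-agnostic text-quality metric.
    """
    scored = sorted(texts, key=quality_score, reverse=True)
    return scored[: int(len(scored) * keep_fraction)]

def toy_quality_score(text):
    """Crude stand-in score: penalize very short or highly
    repetitive documents. Not the paper's metric."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens) * min(len(tokens), 512)
```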
-
- FlowBench: A Large Scale Benchmark for Flow Simulation over Complex Geometries
- Ronak Tali, Ali Rabeh, Cheng-Hau Yang, Mehdi Shadkhah, Samundra Karki, Abhisek Upadhyaya, Suriya Dhakshinamoorthy, Marjan Saadati, Soumik Sarkar, Adarsh Krishnamurthy, Chinmay Hegde, Aditya Balu, Baskar Ganapathysubramanian, (14):1−35, 2025.
Abstract
Simulating fluid flow around arbitrary shapes is key to solving various engineering problems. However, simulating flow physics across complex geometries remains numerically challenging and computationally resource-intensive, particularly when using conventional PDE solvers. Machine learning methods offer attractive opportunities to create fast and adaptable PDE solvers. However, benchmark datasets to measure the performance of such methods are scarce, especially for flow physics across complex geometries. We introduce FlowBench, a dataset for neural simulators with over 10K samples, currently larger than any publicly available flow physics dataset. FlowBench contains flow simulation data across complex geometries (parametric vs. non-parametric), spanning a range of flow conditions (Reynolds number and Grashof number), capturing a diverse array of flow phenomena (steady vs. transient; forced vs. free convection), in both 2D and 3D. Each sample is the outcome of a fully resolved, direct numerical simulation using a well-validated simulator framework designed for modeling transport phenomena in complex geometries. For each sample, we include velocity, pressure, and temperature field data at 3 different resolutions and several summary statistics of engineering relevance (such as coefficients of lift and drag, and Nusselt numbers). We envision that FlowBench will enable evaluating the interplay between complex geometry, coupled flow phenomena, and data sufficiency on the performance of current, and future, neural PDE solvers. We enumerate several evaluation metrics to help rank-order the performance of current (and future) neural PDE solvers. We benchmark the performance of several methods, including Fourier Neural Operators (FNO), Convolutional Neural Operators (CNO), DeepONets, and recent foundational models. This dataset (https://huggingface.co/datasets/BGLab/FlowBench/tree/main) will be a valuable resource for developing and evaluating AI-for-science approaches, specifically neural PDE solvers, that model complex fluid dynamics around 2D and 3D objects.
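Typical metrics for rank-ordering neural PDE solvers on such a dataset can be sketched as follows (standard choices, assumed here rather than taken from FlowBench's metric list): a relative L2 error on the predicted fields plus absolute errors on the engineering summaries.

```python
import numpy as np

def relative_l2_error(pred, true):
    """Relative L2 error of a predicted field vs. the DNS ground truth.

    A standard field-level metric for neural PDE solvers; whether it is
    among FlowBench's exact metrics is an assumption here.
    """
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

def summary_stat_error(pred_coeffs, true_coeffs):
    """Absolute errors on engineering summaries such as lift/drag
    coefficients or Nusselt numbers, keyed by name."""
    return {k: abs(pred_coeffs[k] - true_coeffs[k]) for k in true_coeffs}
```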
-
- MONSTER: Monash Scalable Time Series Evaluation Repository
- Angus Dempster, Navid Mohammadi Foumani, Chang Wei Tan, Lynn Miller, Amish Mishra, Mahsa Salehi, Charlotte Pelletier, Daniel F. Schmidt, Geoffrey I. Webb, (15):1−47, 2025.
Abstract
We introduce MONSTER—the MONash Scalable Time Series Evaluation Repository—a collection of large datasets for time series classification and an associated set of classification tasks that jointly define a new time series classification benchmark. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median training set sizes of 217 and 255 examples, respectively. In consequence, they favour a narrow subspace of models optimised to achieve low classification error on a wide variety of smaller datasets, that is, models that minimise variance and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe that there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.
-
- Chronicling Germany: An Annotated Historical Newspaper Dataset
- Christian Schultze, Niklas Kerkfeld, Kara Kuebart, Princilia Weber, Moritz Wolter, Felix Selgert, (16):1−29, 2025.
Abstract
The correct detection of dense article layouts and the recognition of characters in historical newspaper pages remain challenging requirements for Natural Language Processing (NLP) and machine learning applications in the field of digital history. Digital newspaper portals for historic Germany typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, limiting this rich source’s scope. Our dataset is designed to enable the training of layout and OCR models for historic German-language newspapers. The Chronicling Germany dataset contains 801 annotated historical newspaper pages from the time period between 1617 and 1933. The paper presents a processing pipeline and establishes baseline results on in- and out-of-domain test data using this pipeline. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in the field of digital history and historic German language newspaper processing. Furthermore, it provides the opportunity to study a low-resource task in computer vision.
-
- The FIX Benchmark: Extracting Features Interpretable to eXperts
- Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong, (17):1−43, 2025.
Abstract
Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.
-
- Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs
- Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma, (18):1−36, 2025.
Abstract
Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, few such datasets currently exist. Our work aims to make advances in PDE modeling accessible to the temporal graph machine learning community while addressing the data scarcity problem: we create and use synthetic, PDE-based datasets to support spatio-temporal graph modeling for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how the created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found at github.com/Jostarndt/Synthetic_Datasets_for_Temporal_Graphs.
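As a minimal stand-in for this kind of generator (the paper's equations for epidemics, atmospheric particles, and tsunami waves are richer), one can integrate a diffusion PDE on a graph and record node-feature snapshots over time:

```python
import numpy as np

def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A of an undirected graph."""
    return np.diag(adj.sum(axis=1)) - adj

def simulate_diffusion(adj, u0, steps=100, dt=0.01, kappa=1.0):
    """Explicit-Euler integration of du/dt = -kappa * L u on a graph.

    Toy generator for spatio-temporal graph data; each row of the
    returned array is one time snapshot of the node features. dt must
    be small relative to the Laplacian's spectrum for stability.
    """
    L = graph_laplacian(adj)
    u, traj = u0.astype(float), [u0.astype(float)]
    for _ in range(steps):
        u = u - dt * kappa * (L @ u)
        traj.append(u.copy())
    return np.stack(traj)
```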
-
- MolTextQA: A Question-Answering Dataset and Benchmark for Evaluating Multimodal Architectures and LLMs on Molecular Structure-Text Understanding
- Siddhartha Laghuvarapu, Namkyeong Lee, Chufan Gao, Jimeng Sun, (19):1−37, 2025.
Abstract
Recent advancements in AI have greatly improved molecular representation learning for property prediction and molecule design. However, leveraging the vast textual molecular data from databases and literature remains challenging. While recent research has explored Large Language Models (LLMs) and multi-modal architectures to link text with molecular structures, existing datasets lack evaluation specificity and comprehensive benchmarking. To address this, we introduce a dataset of 500,000 question-answer pairs covering 240,000 molecules from PubChem, designed for structure-directed questions and text-based molecule retrieval. Moreover, we benchmark various architectural classes fine-tuned using this dataset, including multi-modal architectures, large language models, and large reasoning models, uncovering several insights. Among the non-LLM baselines, BioT5 and MoleculeSTM achieved the highest performance on the Molecule QA and Molecule Retrieval tasks, respectively, with accuracies approaching 70%. While traditional LLMs struggled with general molecular understanding, our experiments show that fine-tuning LLMs can significantly improve their performance on molecular tasks. Furthermore, large reasoning models, particularly the GPT-o3 series, outperform their non-reasoning counterparts and multi-modal architectures, highlighting the importance of explicit reasoning for effective structure-text learning. We have made both the dataset and the fine-tuned models publicly available.
-
- TopoBench: A Framework for Benchmarking Topological Deep Learning
- Lev Telyatnikov, Guillermo Bernardez, Marco Montagna, Mustafa Hajij, Martin Carrasco, Pavlo Vasylenko, Mathilde Papillon, Ghada Zamzmi, Michael T Schaub, Jonas Verhellen, Pavel Snopov, Bertran Miquel-Oliver, Manel Gil-Sorribes, Alexis Molina, Victor Guallar, Theodore Long, Julian Suk, Patryk Rygiel, Alexander V Nikitin, Giordan Escalona, Michael Banf, Dominik Filipiak, Liliya Imasheva, Max Schattauer, Alvaro L. Martinez, Halley Fritze, Marissa Masden, Valentina Sánchez, Manuel Lecha, Andrea Cavallo, Claudio Battiloro, Matthew Piekenbrock, Mauricio Tec, George Dasoulas, Nina Miolane, Simone Scardapane, Theodore Papamarkou, (20):1−39, 2025.
Abstract
This work introduces TopoBench, an open-source library designed to standardize benchmarking and accelerate research in topological deep learning (TDL). TopoBench decomposes TDL into a sequence of independent modules for data generation, loading, transformation, and processing, as well as model training, optimization, and evaluation. This modular organization provides flexibility for modifications and facilitates the adaptation and optimization of various TDL pipelines. A key feature of TopoBench is its support for transformations and lifting across topological domains. Mapping the topology and features of a graph to higher-order topological domains, such as simplicial and cell complexes, enables richer data representations and more fine-grained analyses. The applicability of TopoBench is demonstrated by benchmarking several TDL architectures across diverse tasks and datasets.
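To illustrate what a lifting looks like (a toy example; TopoBench supports far more general transforms), the sketch below lifts a graph to the triangles of its clique complex, the 2-cells that simplicial models consume:

```python
from itertools import combinations

def clique_lift_triangles(edges):
    """Lift a graph to the 2-skeleton of its clique complex.

    Sketch of one lifting: every triangle (3-clique) becomes a 2-cell,
    giving the higher-order structure that simplicial/cell models
    operate on.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    return [
        (a, b, c)
        for a, b, c in combinations(nodes, 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    ]

print(clique_lift_triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))  # [(0, 1, 2)]
```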
-
- MPFBench: A Large Scale Dataset for SciML of Multi-Phase-Flows: Droplet and Bubble Dynamics
- Mehdi Shadkhah, Ronak Tali, Ali Rabeh, Cheng-Hau Yang, Ethan Herron, Abhisek Upadhyaya, Adarsh Krishnamurthy, Chinmay Hegde, Aditya Balu, Baskar Ganapathysubramanian, (21):1−35, 2025.
Abstract
Multiphase fluid dynamics, such as falling droplets and rising bubbles, is critical for many industrial applications. However, simulating these phenomena efficiently is challenging due to the complexity of instabilities, wave patterns, and bubble breakup. This paper investigates the potential of scientific machine learning (SciML) to model these dynamics using neural operators and foundation models. We apply sequence-to-sequence learning techniques to a comprehensive dataset of 11,000 simulations, which includes over 1 million time snapshots, generated using a well-validated, CUDA-accelerated Lattice Boltzmann Method (LBM) framework. The results demonstrate the ability of machine learning models to capture transient dynamics and intricate fluid interactions, paving the way for more accurate and computationally efficient SciML-based solvers for multiphase applications.
-
- DecordFace: A Framework for Degraded and Corrupted Face Recognition
- Surbhi Mittal, Rishi Dey Chowdhury, Mayank Vatsa, Richa Singh, (22):1−43, 2025.
Abstract
Face recognition (FR) models have become an integral part of day-to-day activities involving surveillance and biometric verification. While these models perform remarkably well in constrained settings, their performance is limited in the presence of certain challenging covariates. One such covariate is the presence of unforeseen image degradations and corruptions. These degradations, which inevitably occur during image acquisition, transmission, or storage, substantially impact real-world applicability. To analyze the performance of FR systems in these scenarios, we provide the first Degraded and Corrupted Face Recognition (DecordFace) framework for evaluating the robustness of FR models. Corrupted versions of multiple standard datasets are created, and experiments are performed on more than 3.6 million corrupted face images with over 25 recognition models spanning different architectures and backbones, using 16 corruptions at 5 severity levels. For quantitative estimation of the impact of corruption, we introduce two novel evaluation metrics: the error-based mVCE and the embedding-based mCEI. Using these metrics and a cohort of FR models, we conduct a detailed analysis of model robustness under different model and input parameters. We observe a severe drop in performance for unconstrained face recognition, with errors over 20% across different corruptions. The performance of model variants with shallow backbones suffers even more. The code for the DecordFace framework can be accessed at https://github.com/IAB-IITJ/DecordFace.
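The exact definitions of mVCE and mCEI are given in the paper; the sketch below captures the general shape of an error-based corruption metric, averaging verification error over corruption types and severity levels (names and signatures are placeholders, not the framework's code):

```python
import numpy as np

def mean_corruption_error(error, model, pairs, corruptions, severities=range(1, 6)):
    """Average verification error over corruption types and severities.

    Sketch in the spirit of error-based robustness metrics such as the
    paper's mVCE (its exact definition is in the paper): `error(model,
    data)` returns an error rate, and `corruptions[name](pairs, s)`
    returns the corrupted copy of the evaluation set at severity s.
    """
    per_corruption = []
    for name, corrupt in corruptions.items():
        errs = [error(model, corrupt(pairs, s)) for s in severities]
        per_corruption.append(np.mean(errs))
    return float(np.mean(per_corruption))
```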
-
- Towards Causal Relationship in indefinite data: New Datasets and Baseline Model
- Hang Chen, Xinyu Yang, Keqing Du, (23):1−40, 2025.
Abstract
The cross-fertilization of deep learning and causal discovery has given birth to broader causal data forms, involving multi-structured data like the Netsim dataset and complex variables such as those in the RECCON dataset. Interestingly, we observe an absence of research that concurrently addresses data with multiple structures and complex variables, which we name ‘indefinite data.’ In our previous survey, we introduced the concept of this data paradigm, yet exploring indefinite data still faces two substantial gaps: the dataset gap and the model gap. In this paper, we release two high-quality datasets, Causalogue and Causaction, to address the dataset gap; they contain text dialogue samples and video action samples with causal annotations, respectively. Moreover, the model gap arises from the coexistence of multi-structure data and complex variables, which breaks the assumptions of all current methods and renders them infeasible on indefinite data. To this end, we propose a probabilistic framework as a baseline. It enables overcoming the challenges brought by indefinite data and paves the way for the extension of latent confounders. Comprehensive experiments have evaluated baseline results of causal structures, causal representations, and confounding disentanglement. Our codes and datasets are available at https://github.com/Zodiark-ch/master-of-paper-Towards-Causal-Relationship-in-Indefinite-Data-Baseline-Model-and-New-Datasets.
-
- SEESAW: Do Graph Neural Networks Improve Node Representation Learning for All?
- Yushun Dong, William Shiao, Yozen Liu, Jundong Li, Neil Shah, Tong Zhao, (24):1−42, 2025.
Abstract
Graph Neural Networks (GNNs) have garnered increasing attention in recent years, given their significant proficiency in various graph learning tasks. Consequently, there has been a notable transition away from the conventional and prevalent shallow graph embedding methods that predated GNNs. However, in tandem with this transition, which is presupposed in the literature, an imperative question arises: do GNNs always outperform shallow embedding methods in node representation learning? This question remains inadequately explored, as the field of graph machine learning still lacks a systematic understanding of their relative strengths and limitations. To address this gap, we propose a principled framework that unifies the ideologies of representative shallow graph embedding methods and GNNs. Through comparative analysis, we show that GNNs actually bear drawbacks that are typically not shared by shallow embedding methods. These drawbacks are often masked by data characteristics in commonly used benchmarks and thus not well discussed in the literature, leading to potentially suboptimal performance when GNNs are indiscriminately adopted in applications. We further show that our analysis generalizes to GNNs under various learning paradigms, which provides further insights that emphasize the research significance of shallow embedding methods. Finally, with these insights, we conclude with a guide to meet the various needs of researchers and practitioners.
-
- A Model Zoo on Phase Transitions in Neural Networks
- Konstantin Schürholt, Léo Meynent, Yefan Zhou, Haiquan Lu, Yaoqing Yang, Damian Borth, (25):1−34, 2025.
Abstract
Using the weights of trained Neural Network (NN) models as a data modality has recently gained traction as a research field, dubbed Weight Space Learning (WSL). Multiple recent works propose WSL methods to analyze models, evaluate methods, or synthesize weights. Weight space learning methods require populations of trained models as datasets for development and evaluation. However, existing collections of models, called model zoos, are unstructured or follow a rudimentary definition of diversity. In parallel, work rooted in statistical physics has identified phases and phase transitions in NN models. Models are homogeneous within the same phase but qualitatively differ from one phase to another. We combine the idea of model zoos with phase information to create a controlled notion of diversity in populations. We introduce 12 large-scale zoos that systematically cover the known phases and vary over model architecture, size, and dataset. These zoos cover different modalities, such as computer vision, natural language processing, and scientific ML. For every model, we compute loss landscape metrics and validate full coverage of the phases. With this dataset, we provide the community with a resource with a wide range of potential applications for WSL and beyond. Evidence suggests that the loss landscape phase plays a role in applications such as model training, analysis, or sparsification. We demonstrate this in an exploratory study of downstream methods such as transfer learning and model weight averaging.
-
- MM-GEN: Principled and Generalizable Data Curation for Enhancing Task Performance in VLMs
- Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman, (26):1−28, 2025.
Abstract
Vision-language models (VLMs) often struggle on specialized tasks requiring fine-grained image understanding due to inadequate task-specific text annotations in the training data. We introduce MM-Gen, a framework for data curation that improves VLM performance on such tasks guided by four principles: coverage of task subgroups, diversity of examples, quality of annotations, and informational value. Given reference samples from the target task, keywords enumerating task subgroups, and a pool of candidate images, MM-Gen implements a multi-stage process: (1) partitioning data by subgroup to ensure coverage, (2) generating diverse annotations via in-context learning for each subgroup using corresponding reference samples, and (3) applying perplexity-based filtering to ensure high quality annotations while prioritizing examples that provide novel information to the model. When fine-tuning Llava-1.5 (7B) with our generated data, we achieve absolute improvements of 15%, 14%, and 29% on chart understanding, diagram interpretation, and spatial reasoning tasks, respectively. Moreover, our filtering approach enables discarding 50% of the data without performance loss. Our results confirm that task-specific text curation is indeed the critical bottleneck in VLM performance, and MM-Gen provides a principled and generalizable solution that can be applied to any image-understanding task with minimal human intervention. Code available at https://github.com/sjoshi804/MM-Gen.
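A skeleton of the three-stage process described above, with all callables as placeholders rather than MM-Gen's actual API:

```python
def mm_gen_curation(candidates, subgroup_of, references, annotate, perplexity,
                    max_ppl=50.0):
    """Skeleton of the three-stage curation the abstract describes.

    Placeholders only, not MM-Gen's API: `subgroup_of` assigns an image
    to a task subgroup, `annotate(image, refs)` generates a text
    annotation in-context from that subgroup's reference samples, and
    `perplexity` scores the result (lower = more fluent; the threshold
    is an assumed stand-in for the paper's filter).
    """
    # (1) partition candidates by subgroup to ensure coverage
    by_group = {}
    for img in candidates:
        by_group.setdefault(subgroup_of(img), []).append(img)

    curated = []
    for group, images in by_group.items():
        refs = references[group]
        for img in images:
            # (2) generate a diverse annotation via in-context examples
            text = annotate(img, refs)
            # (3) perplexity-based filtering for quality
            if perplexity(text) <= max_ppl:
                curated.append((img, text))
    return curated
```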
-
- The life cycle of challenges and benchmarks
- Gustavo Stolovitzky, Julio Saez-Rodriguez, Julie Bletz, Jake Albrecht, Gaia Andreoletti, James C Costello, Paul C Boutros, (27):1−16, 2025.
Abstract
Data Science research is undergoing a revolution fueled by the transformative power of technology, the Internet, and an ever-increasing computational capacity. The rate at which sophisticated algorithms can be developed is unprecedented, yet they remain outpaced by the massive amounts of data that are increasingly available to researchers. Here we argue for the need to creatively leverage the scientific research and algorithm development community as an axis of robust innovation. Engaging these communities in the scientific discovery enterprise by critical assessments, community experiments, and/or crowdsourcing will multiply opportunities to develop new data-driven, reproducible and well-benchmarked algorithmic solutions to fundamental and applied problems of current interest. Coordinated community engagement in the analysis of highly complex and massive data has emerged as one approach to find robust methodologies that best address these challenges. When community engagement is done in the form of challenges, by which we mean skill-based scientific contests with a limited time duration, ending with a total ranking of participants according to a pre-defined scoring metric and the selection of winners, the validation of the analytical methodology is inherently addressed, establishing performance benchmarks. Finally, challenges foster open innovation across multiple disciplines to create communities that collaborate directly or indirectly to address significant scientific gaps. Together, participants can solve important problems as varied as health research, climate change, and social equity. Ultimately, challenges can catalyze and accelerate the synthesis of complex data into knowledge or actionable information, and should be viewed as a powerful tool to make lasting social and research contributions.
-
- Towards Human-Guided, Data-Centric LLM Co-Pilots
- Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar, (28):1−74, 2025.
Abstract
Machine learning (ML) has the potential to revolutionize various domains and industries, but its adoption is often hindered by the disconnect between the needs of domain experts and the translation of those needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots that democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains intricate issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets, we demonstrate CliMB-DC’s ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines at handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains – healthcare, finance, social sciences and more – to actively participate in driving real-world impact using ML. CliMB-DC is open-sourced at: https://github.com/vanderschaarlab/climb/tree/climb-dc-canonical