This is the third volume.

  • Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark
    Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen, (1):1−48, 2026.
    Abstract

    Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models’ ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our data is available at https://huggingface.co/datasets/MBZUAI-LLM/Mobile-MMLU.

    [PDF] [bib]
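The order-invariant multiple-choice format mentioned in the abstract can be illustrated with a small sketch. The helper names and the toy "model" below are hypothetical, not part of the benchmark's actual harness: a question counts as correct only if the model selects the right option text under every permutation of the candidate answers.

```python
from itertools import permutations

def order_invariant_correct(question, options, answer, predict):
    """Return True only if `predict` chooses the correct option text
    under every ordering of the candidate answers."""
    for perm in permutations(options):
        chosen = predict(question, list(perm))
        if chosen != answer:
            return False
    return True

# Toy deterministic "model": always picks the longest option string.
def longest_option(question, options):
    return max(options, key=len)

ok = order_invariant_correct(
    "Which app category drains the most battery?",
    ["games", "navigation", "email"],
    "navigation",
    longest_option,
)
```

Because the toy model's choice here happens to be order-independent, `ok` is true; a model that keyed on option position (e.g. "always answer B") would fail under some permutation, which is exactly the bias this format penalizes.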

  • Automated Data Preparation for Machine Learning: A Survey
    Sasa Mladenovic, Marius Lindauer, Carola Doerr, (2):1−72, 2026.
    Abstract

    Data preparation is essential for effective machine learning (ML), yet typically remains a manual, time-consuming process. While automated machine learning (AutoML) has successfully addressed the modeling aspects of the ML workflow, data preparation has largely been overlooked, leading to challenges with real-world, imperfect data. Conversely, a rising paradigm in artificial intelligence (AI) and ML is data-centric AI, which shifts the focus from refining models to enhancing data in order to advance performance boundaries. This survey motivates the need for automated data preparation, offering a fundamental understanding of the benefits of data transformations and establishing the complexity of data pipeline optimization, while highlighting the importance of data quality. We provide a comprehensive overview and categorization of existing automation approaches, both within AutoML and as standalone fully or semi-automated systems, and discuss their underlying methodologies, advantages, and limitations. Our work explores the prospects of expanding automation to cover a broader data preparation process, aiming to bridge the gap between data-centric AI and AutoML. It paves the way to a wholly automated pipeline from raw real-world data to quality model predictions, and outlines future research directions towards that goal.

    [PDF] [bib]

  • ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
    Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer, (3):1−41, 2026.
    Abstract

    Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.

    [PDF] [bib]

  • SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
    Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Jonas Loos, Leo Pinetzki, Erik Rodner, Lorenz Hufe, (4):1−30, 2026.
    Abstract

    Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. Existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings indicate that typographic attacks remain effective against state-of-the-art Large Vision-Language Models, especially those employing vision encoders inherently vulnerable to such attacks. However, using larger Large Language Model backbones reduces this vulnerability while simultaneously enhancing typographic understanding. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. Finally, we publicly release the datasets introduced in this paper, along with the code for evaluations, at https://www.bliss.berlin/research/scam

    [PDF] [bib]

  • Hierarchical and Multimodal Data for Daily Activity Understanding
    Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya Patil, (5):1−30, 2026.
    Abstract

    Daily Activity Recordings for artificial intelligence (DARai, pronounced /Dahr-ree/) is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity of human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The unscripted nature of DARai enables the collection of action counterfactuals, defined as observed alternative executions of the same activity under different conditions (e.g., lifting a heavy versus a light object). Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To showcase the shortcomings of individual sensors, we conduct domain-variant experiments that are possible because of DARai’s multi-sensor setup and its inclusion of action counterfactuals. The code, documentation, and dataset are available at the dedicated DARai website.

    [PDF] [bib]

  • IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation
    Thiviyan Thanapalasingam, Emile van Krieken, Peter Bloem, Paul Groth, (6):1−34, 2026.
    Abstract

    Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations, commonly trained to predict missing links between entities. However, Knowledge Graphs are not just sets of links but also have complex semantics underlying their structure. Semantics plays a crucial role in several downstream tasks, such as query answering and reasoning. Recognizing this, our work goes beyond simple link prediction to focus on inferred knowledge that adheres to rich semantics. Specifically, 1) we introduce the subgraph inference task, where a model is required to generate novel subgraphs that are logically consistent with background knowledge; 2) we propose IntelliGraphs, a set of five new datasets that contain subgraphs with logical rules that express complex semantics for evaluating subgraph inference models; and 3) we design four baseline models, three of which are based on traditional KGEs, and show empirically that the KGE-based baselines cannot capture complex semantics. We believe that IntelliGraphs will encourage the development of machine learning models that focus on semantic understanding.

    [PDF] [bib]
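The notion of a generated subgraph being "logically consistent with background knowledge" can be sketched as a constraint check over triples. The rule, relation, and entity types below are invented for illustration; the actual IntelliGraphs datasets define their own logical rules.

```python
# Hypothetical sketch: a generated subgraph is a set of
# (head, relation, tail) triples, and background knowledge is
# a domain/range constraint on a relation.

def satisfies_domain_rule(triples, relation, head_types, tail_types, entity_types):
    """Check that every triple using `relation` connects entities
    whose types fall in the allowed head/tail type sets."""
    for h, r, t in triples:
        if r != relation:
            continue
        if entity_types.get(h) not in head_types:
            return False
        if entity_types.get(t) not in tail_types:
            return False
    return True

entity_types = {"alice": "person", "paris": "city", "42": "number"}

# One subgraph that respects the rule, one that violates it.
ok_good = satisfies_domain_rule(
    [("alice", "born_in", "paris")],
    "born_in", {"person"}, {"city"}, entity_types,
)
ok_bad = satisfies_domain_rule(
    [("alice", "born_in", "42")],
    "born_in", {"person"}, {"city"}, entity_types,
)
```

A generative model would pass such a check only if it has learned the semantics, not just the link statistics, which is the gap the abstract reports for the KGE-based baselines.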

  • FinSurvival: A Suite of Large Scale Survival Modeling Tasks from Finance
    Aaron Green, Zihan Nie, Hanzhen Qin, Oshani Seneviratne, Kristin Bennett, (7):1−32, 2026.
    Abstract

    Survival modeling predicts the time until an event occurs and is widely used in risk analysis; for example, in medicine it is used to predict patient survival from censored data. There is a need for large-scale, realistic, and freely available datasets for benchmarking artificial intelligence (AI) survival models. In this paper, we derive a suite of 16 survival modeling tasks from publicly available transaction data generated by lending of cryptocurrencies in Decentralized Finance (DeFi). Each task was constructed using an automated pipeline based on choices of index and outcome events. For example, the model predicts the time from when a user borrows cryptocurrency coins (index event) until their first repayment (outcome event). Together, these 16 survival-time prediction tasks form the FinSurvival benchmark. We also automatically derive a corresponding classification problem for each task by thresholding the survival time at the restricted mean survival time, yielding 16 classification tasks. With over 7.5 million records, FinSurvival provides a suite of realistic financial modeling tasks that will spur future AI survival modeling research. Our evaluation indicated that these are challenging tasks that are not well addressed by existing methods. FinSurvival enables the evaluation of AI survival models applicable to traditional finance, industry, medicine, and commerce, which is currently hindered by the lack of large public datasets. Our benchmark demonstrates how AI models could assess opportunities and risks in DeFi. In the future, the FinSurvival benchmark pipeline can be used to create new benchmarks by incorporating more DeFi transactions and protocols as the use of cryptocurrency grows.

    [PDF] [bib]
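The thresholding step described above can be sketched in a few lines. This is a simplification that ignores censoring (in which case the restricted mean survival time with horizon tau reduces to the mean of min(T, tau)); the benchmark itself computes the restricted mean on censored data, and the example times are invented.

```python
def restricted_mean(times, tau):
    """Restricted mean survival time without censoring: E[min(T, tau)]."""
    return sum(min(t, tau) for t in times) / len(times)

def to_classification_labels(times, tau):
    """Label each observation 1 if its event occurs at or before the
    restricted-mean threshold, 0 otherwise."""
    rmst = restricted_mean(times, tau)
    return [int(t <= rmst) for t in times], rmst

# Invented durations, e.g. days from borrow (index) to first repayment (outcome).
times = [2.0, 5.0, 30.0, 1.0, 8.0]
labels, threshold = to_classification_labels(times, tau=10.0)
```

Here the 30-day duration is clipped to the 10-day horizon before averaging, so a single extreme repayment time does not dominate the threshold that splits "fast" from "slow" events.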

  • Time Series Machine Learning for Classifying Electroencephalograms
    Aiden Rushbrooke, Saber Sami, Matthew Middlehurst, Tony Bagnall, (8):1−26, 2026.
    Abstract

    Electroencephalography (EEG) is a crucial tool across neuroscience domains, including medical diagnostics, psychological research, and brain-computer interfacing (BCI). Its popularity is due to its non-invasiveness, high temporal resolution, and cost-effectiveness. The task of EEG classification involves learning to predict class labels associated with EEG segments based on previously observed data. It is fundamental yet complex, given the high dimensionality, variability, and subject-specific nuances inherent in EEG data. We systematically evaluate recent advances in general-purpose time series machine learning (TSML) approaches to EEG classification. We present an EEG classification archive of 30 benchmark datasets, spanning diverse applications from clinical diagnostics to cognitive and BCI tasks. Our empirical evaluation compares traditional EEG approaches, deep learning models, Riemannian geometry-based classifiers, and state-of-the-art time series machine learning algorithms on this new benchmark. We find that one algorithm, a meta ensemble called HIVE-COTE v2.0, consistently outperforms alternative classifiers.

    [PDF] [bib]

  • LOCKED: A Dataset of Sociodemographic, Economic, Health and Living Features to Assess Mental Health Impact of the Spanish Lockdown during COVID-19
    Alberto Nogales, Alfredo Guitian, Blanca Mellor-Marsá, Alvaro J. García-Tejedor, (9):1−37, 2026.
    Abstract

    The COVID-19 pandemic, which began in late 2019 in Wuhan, China, quickly escalated into a global crisis that affected nearly every aspect of life. Governments around the world implemented stringent public health measures to control the spread, including quarantines, social distancing, and lockdowns. In Spain, where the first cases emerged in January 2020, a nationwide lockdown was imposed on 14 March after the number of infections exceeded 5,000. Although these interventions were crucial for public health, they also presented significant social challenges, particularly for vulnerable groups. The primary goal of this study is to describe a dataset that combines psychological assessments of nine mental health conditions with sociodemographic, economic, living, and general health features. The data collected could potentially be used to identify the key factors that influenced mental health outcomes during lockdown. By analysing these data, the research seeks to shed light on the broader psychological effects of the pandemic and the factors that can exacerbate or mitigate these impacts. As an added value and to demonstrate the quality and potential of the dataset for mental health research, baseline machine learning models were developed, achieving performance metrics that exceed 80%. LOCKED is publicly available at https://zenodo.org/uploads/14203988.

    [PDF] [bib]

  • A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants
    Peter Samoaa, Marcus Vukojevic, Morteza Haghir Chehreghani, Antonio Longa, (10):1−41, 2026.
    Abstract

    Graph-level regression underpins many real-world applications, yet public benchmarks remain heavily skewed toward molecular graphs and citation networks. This limited diversity hinders progress on models that must generalize across both homogeneous and heterogeneous graph structures. We introduce RelSC, a new graph-regression dataset built from program graphs that combine syntactic and semantic information extracted from source code. Each graph is labelled with the execution-time cost of the corresponding program, providing a continuous target variable that differs markedly from those found in existing benchmarks. RelSC is released in two complementary variants. TypeOne supplies rich node features under a single (homogeneous) edge type, while TypeTwo preserves the original multi-relational structure, connecting nodes through multiple edge types that encode distinct semantic relationships. Together, these variants let researchers probe how representation choice influences model behaviour. We evaluate a diverse set of graph neural network architectures on both variants of RelSC. The results reveal consistent performance differences between the homogeneous and multi-relational settings, emphasising the importance of structural representation. These findings demonstrate RelSC’s value as a challenging and versatile benchmark for advancing graph regression methods.

    [PDF] [bib]

  • Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning
    Juan Claude Formanek, Louise Beyers, Callum Rhys Tilbury, Jonathan Phillip Shock, Arnu Pretorius, (11):1−24, 2026.
    Abstract

    Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

    [PDF] [bib]

  • AI Competitions and Benchmarks book - Practical issues and open problems
    Andrey Ustyuzhanin, Harald Carlens, (12):1−68, 2026.
    Abstract

    The ecosystem of artificial intelligence contests is diverse and multifaceted, encompassing several platforms that each host numerous competitions and challenges annually, alongside many specialized websites dedicated to individual contests. These platforms manage the overarching administrative responsibilities inherent in orchestrating contests, thus allowing organizers to allocate greater attention to other aspects of their contests. Notably, these platforms exhibit considerable variety in their features, economic models, and communities. This chapter conducts an extensive review of the leading services in this space and explores alternative methods facilitating the independent hosting of such contests. The chapter concludes with hints and tips on choosing the right platform for your challenge.

    [PDF] [bib]