This is the third volume.

  • Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark
    Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen, (1):1−48, 2026.
    Abstract

    Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models’ ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our data is available at https://huggingface.co/datasets/MBZUAI-LLM/Mobile-MMLU.

    [PDF] [bib]
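The order-invariant multiple-choice format mentioned in the abstract can be illustrated with a small sketch. The helper names and the toy "model" below are hypothetical, not part of the benchmark's actual harness: a question counts as correct only if the model selects the right option text under every permutation of the candidate answers.

```python
from itertools import permutations

def order_invariant_correct(question, options, answer, predict):
    """Return True only if `predict` chooses the correct option text
    under every ordering of the candidate answers."""
    for perm in permutations(options):
        chosen = predict(question, list(perm))
        if chosen != answer:
            return False
    return True

# Toy deterministic "model": always picks the longest option string.
def longest_option(question, options):
    return max(options, key=len)

ok = order_invariant_correct(
    "Which app category drains the most battery?",
    ["games", "navigation", "email"],
    "navigation",
    longest_option,
)
```

Because the toy model's choice here happens to be order-independent, `ok` is true; a model that keyed on option position (e.g. "always answer B") would fail under some permutation, which is exactly the bias this format penalizes.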

  • Automated Data Preparation for Machine Learning: A Survey
    Sasa Mladenovic, Marius Lindauer, Carola Doerr, (2):1−72, 2026.
    Abstract

    Data preparation is essential for effective machine learning (ML), yet typically remains a manual, time-consuming process. While automated machine learning (AutoML) has successfully addressed the modeling aspects of the ML workflow, data preparation has largely been overlooked, leading to challenges with real-world, imperfect data. Conversely, a rising paradigm in artificial intelligence (AI) and ML is data-centric AI, which shifts the focus from refining models to enhancing data in order to advance performance boundaries. This survey motivates the need for automated data preparation, offering a fundamental understanding of the benefits of data transformations and establishing the complexity of data pipeline optimization, while highlighting the importance of data quality. We provide a comprehensive overview and categorization of existing automation approaches, both within AutoML and as standalone fully or semi-automated systems, and discuss their underlying methodologies, advantages, and limitations. Our work explores the prospects of expanding automation to cover a broader data preparation process, aiming to bridge the gap between data-centric AI and AutoML. It paves the way to a wholly automated pipeline from raw real-world data to quality model predictions, and outlines future research directions towards that goal.

    [PDF] [bib]

  • ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
    Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer, (3):1−41, 2026.
    Abstract

    Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.

    [PDF] [bib]

  • SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
    Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Jonas Loos, Leo Pinetzki, Erik Rodner, Lorenz Hufe, (4):1−30, 2026.
    Abstract

    Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. Existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings indicate that typographic attacks remain effective against state-of-the-art Large Vision-Language Models, especially those employing vision encoders inherently vulnerable to such attacks. However, using larger Large Language Model backbones reduces this vulnerability while simultaneously enhancing typographic understanding. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems. Finally, we publicly release the datasets introduced in this paper, along with the code for evaluations, at https://www.bliss.berlin/research/scam

    [PDF] [bib]

  • Hierarchical and Multimodal Data for Daily Activity Understanding
    Ghazal Kaviani, Yavuz Yarici, Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib, Mashhour Solh, Ameya Patil, (5):1−30, 2026.
    Abstract

    Daily Activity Recordings for artificial intelligence (DARai, pronounced /Dahr-ree/) is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity of human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The unscripted nature of DARai enables the collection of action counterfactuals, defined as observed alternative executions of the same activity under different conditions (e.g., lifting a heavy versus a light object). Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To showcase the shortcomings of individual sensors, we conduct domain-variant experiments that are possible because of DARai’s multi-sensor setup and its inclusion of action counterfactuals. The code, documentation, and dataset are available at the dedicated DARai website.

    [PDF] [bib]

  • IntelliGraphs: Datasets for Benchmarking Knowledge Graph Generation
    Thiviyan Thanapalasingam, Emile van Krieken, Peter Bloem, Paul Groth, (6):1−34, 2026.
    Abstract

    Knowledge Graph Embedding (KGE) models are used to learn continuous representations of entities and relations, commonly trained to predict missing links between entities. However, Knowledge Graphs are not just sets of links but also have complex semantics underlying their structure. Semantics plays a crucial role in several downstream tasks, such as query answering and reasoning. Recognizing this, our work goes beyond simple link prediction to focus on inferred knowledge that adheres to rich semantics. Specifically, 1) we introduce the subgraph inference task, where a model is required to generate novel subgraphs that are logically consistent with background knowledge; 2) we propose IntelliGraphs, a set of five new datasets that contain subgraphs with logical rules that express complex semantics for evaluating subgraph inference models; and 3) we design four baseline models, three of which are based on traditional KGEs, and show empirically that the KGE-based baselines cannot capture complex semantics. We believe that IntelliGraphs will encourage the development of machine learning models that focus on semantic understanding.

    [PDF] [bib]
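The notion of a generated subgraph being "logically consistent with background knowledge" can be sketched as a constraint check over triples. The rule, relation, and entity types below are invented for illustration; the actual IntelliGraphs datasets define their own logical rules.

```python
# Hypothetical sketch: a generated subgraph is a set of
# (head, relation, tail) triples, and background knowledge is
# a domain/range constraint on a relation.

def satisfies_domain_rule(triples, relation, head_types, tail_types, entity_types):
    """Check that every triple using `relation` connects entities
    whose types fall in the allowed head/tail type sets."""
    for h, r, t in triples:
        if r != relation:
            continue
        if entity_types.get(h) not in head_types:
            return False
        if entity_types.get(t) not in tail_types:
            return False
    return True

entity_types = {"alice": "person", "paris": "city", "42": "number"}

# One subgraph that respects the rule, one that violates it.
ok_good = satisfies_domain_rule(
    [("alice", "born_in", "paris")],
    "born_in", {"person"}, {"city"}, entity_types,
)
ok_bad = satisfies_domain_rule(
    [("alice", "born_in", "42")],
    "born_in", {"person"}, {"city"}, entity_types,
)
```

A generative model would pass such a check only if it has learned the semantics, not just the link statistics, which is the gap the abstract reports for the KGE-based baselines.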

  • FinSurvival: A Suite of Large Scale Survival Modeling Tasks from Finance
    Aaron Green, Zihan Nie, Hanzhen Qin, Oshani Seneviratne, Kristin Bennett, (7):1−32, 2026.
    Abstract

    Survival modeling predicts the time until an event occurs and is widely used in risk analysis; for example, in medicine it is used to predict patient survival from censored data. There is a need for large-scale, realistic, and freely available datasets for benchmarking artificial intelligence (AI) survival models. In this paper, we derive a suite of 16 survival modeling tasks from publicly available transaction data generated by lending of cryptocurrencies in Decentralized Finance (DeFi). Each task was constructed using an automated pipeline based on choices of index and outcome events. For example, the model predicts the time from when a user borrows cryptocurrency coins (index event) until their first repayment (outcome event). Together, these 16 survival-time prediction tasks form the FinSurvival benchmark. We also automatically derive a corresponding classification problem for each task by thresholding the survival time at the restricted mean survival time, yielding 16 classification tasks. With over 7.5 million records, FinSurvival provides a suite of realistic financial modeling tasks that will spur future AI survival modeling research. Our evaluation indicated that these are challenging tasks that are not well addressed by existing methods. FinSurvival enables the evaluation of AI survival models applicable to traditional finance, industry, medicine, and commerce, which is currently hindered by the lack of large public datasets. Our benchmark demonstrates how AI models could assess opportunities and risks in DeFi. In the future, the FinSurvival benchmark pipeline can be used to create new benchmarks by incorporating more DeFi transactions and protocols as the use of cryptocurrency grows.

    [PDF] [bib]
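The thresholding step described above can be sketched in a few lines. This is a simplification that ignores censoring (in which case the restricted mean survival time with horizon tau reduces to the mean of min(T, tau)); the benchmark itself computes the restricted mean on censored data, and the example times are invented.

```python
def restricted_mean(times, tau):
    """Restricted mean survival time without censoring: E[min(T, tau)]."""
    return sum(min(t, tau) for t in times) / len(times)

def to_classification_labels(times, tau):
    """Label each observation 1 if its event occurs at or before the
    restricted-mean threshold, 0 otherwise."""
    rmst = restricted_mean(times, tau)
    return [int(t <= rmst) for t in times], rmst

# Invented durations, e.g. days from borrow (index) to first repayment (outcome).
times = [2.0, 5.0, 30.0, 1.0, 8.0]
labels, threshold = to_classification_labels(times, tau=10.0)
```

Here the 30-day duration is clipped to the 10-day horizon before averaging, so a single extreme repayment time does not dominate the threshold that splits "fast" from "slow" events.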

  • Time Series Machine Learning for Classifying Electroencephalograms
    Aiden Rushbrooke, Saber Sami, Matthew Middlehurst, Tony Bagnall, (8):1−26, 2026.
    Abstract

    Electroencephalography (EEG) is a crucial tool across neuroscience domains, including medical diagnostics, psychological research, and brain-computer interfacing (BCI). Its popularity is due to its non-invasiveness, high temporal resolution, and cost-effectiveness. The task of EEG classification involves learning to predict class labels associated with EEG segments based on previously observed data. It is fundamental yet complex, given the high dimensionality, variability, and subject-specific nuances inherent in EEG data. We systematically evaluate recent advances in general-purpose time series machine learning (TSML) approaches to EEG classification. We present an EEG classification archive of 30 benchmark datasets, spanning diverse applications from clinical diagnostics to cognitive and BCI tasks. Our empirical evaluation compares traditional EEG approaches, deep learning models, Riemannian geometry-based classifiers, and state-of-the-art time series machine learning algorithms on this new benchmark. We find that one algorithm, a meta ensemble called HIVE-COTE v2.0, consistently outperforms alternative classifiers.

    [PDF] [bib]

  • LOCKED: A Dataset of Sociodemographic, Economic, Health and Living Features to Assess Mental Health Impact of the Spanish Lockdown during COVID-19
    Alberto Nogales, Alfredo Guitian, Blanca Mellor-Marsá, Alvaro J. García-Tejedor, (9):1−37, 2026.
    Abstract

    The COVID-19 pandemic, which began in late 2019 in Wuhan, China, quickly escalated into a global crisis that affected nearly every aspect of life. Governments around the world implemented stringent public health measures to control the spread, including quarantines, social distancing, and lockdowns. In Spain, where the first cases emerged in January 2020, a nationwide lockdown was imposed on 14 March after the number of infections exceeded 5,000. Although these interventions were crucial for public health, they also presented significant social challenges, particularly for vulnerable groups. The primary goal of this study is to describe a dataset that combines psychological assessments of nine mental health conditions with sociodemographic, economic, living, and general health features. The data collected could potentially be used to identify the key factors that influenced mental health outcomes during lockdown. By analysing these data, the research seeks to shed light on the broader psychological effects of the pandemic and the factors that can exacerbate or mitigate these impacts. As an added value and to demonstrate the quality and potential of the dataset for mental health research, baseline machine learning models were developed, achieving performance metrics that exceed 80%. LOCKED is publicly available at https://zenodo.org/uploads/14203988.

    [PDF] [bib]

  • A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants
    Peter Samoaa, Marcus Vukojevic, Morteza Haghir Chehreghani, Antonio Longa, (10):1−41, 2026.
    Abstract

    Graph-level regression underpins many real-world applications, yet public benchmarks remain heavily skewed toward molecular graphs and citation networks. This limited diversity hinders progress on models that must generalize across both homogeneous and heterogeneous graph structures. We introduce RelSC, a new graph-regression dataset built from program graphs that combine syntactic and semantic information extracted from source code. Each graph is labelled with the execution-time cost of the corresponding program, providing a continuous target variable that differs markedly from those found in existing benchmarks. RelSC is released in two complementary variants. TypeOne supplies rich node features under a single (homogeneous) edge type, while TypeTwo preserves the original multi-relational structure, connecting nodes through multiple edge types that encode distinct semantic relationships. Together, these variants let researchers probe how representation choice influences model behaviour. We evaluate a diverse set of graph neural network architectures on both variants of RelSC. The results reveal consistent performance differences between the homogeneous and multi-relational settings, emphasising the importance of structural representation. These findings demonstrate RelSC’s value as a challenging and versatile benchmark for advancing graph regression methods.

    [PDF] [bib]

  • Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning
    Juan Claude Formanek, Louise Beyers, Callum Rhys Tilbury, Jonathan Phillip Shock, Arnu Pretorius, (11):1−24, 2026.
    Abstract

    Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

    [PDF] [bib]

  • AI Competitions and Benchmarks book - Practical issues and open problems
    Andrey Ustyuzhanin, Harald Carlens, (12):1−68, 2026.
    Abstract

    The ecosystem of artificial intelligence contests is diverse and multifaceted, encompassing several platforms that each host numerous competitions and challenges annually, alongside many specialized websites dedicated to individual contests. These platforms manage the overarching administrative responsibilities inherent in orchestrating contests, thus allowing organizers to allocate greater attention to other aspects of their contests. Notably, these platforms exhibit considerable variety in their features, economic models, and communities. This chapter conducts an extensive review of the leading services in this space and explores alternative methods facilitating the independent hosting of such contests. The chapter concludes with hints and tips on choosing the right platform for your challenge.

    [PDF] [bib]