Data-centric Machine Learning Research

Announcing DMLR, a Top Archival Venue for Data-centric Machine Learning Research

Data availability, computing power, and algorithmic advances have all been essential to the rapid progress of Machine Learning (ML) and Artificial Intelligence (AI) over the last decade. Among these three intertwined components, data plays a crucial role, and the last decade has witnessed explosive developments around it. Influential examples include benchmark datasets such as ImageNet; data consortia and communities such as LAION; research communities such as Data-centric AI/ML and DataPerf; and fundamental research methodologies for data influence, data importance, weak supervision, and data markets (e.g., an example review article). Alongside these exciting developments, new publication opportunities have appeared. One is the establishment of the NeurIPS Datasets and Benchmarks track, which provides a top venue for research that focuses on novel datasets, benchmark tasks, and evaluation protocols. Another is the recurring Data-centric AI workshop, organized over the last 5 years, which provides a forum to discuss the impact of data, data challenges, and the future of AI driven by data.

There remain, however, challenges and opportunities that must be addressed to further unleash the potential of Data-centric Machine Learning Research:

We need another top venue: The NeurIPS Datasets and Benchmarks track takes place only once a year, with an acceptance rate of 36.46% in 2022. This means that more than 60% of submitted papers, even after the authors integrate the reviewer feedback, must wait another year to be resubmitted. This makes it harder to form a larger community, and especially to attract young students to data-centric topics.
We need a journal that serves data-centric research as JMLR serves ML algorithms: a rigorous forum where authors can publish longer, more detailed, archival papers under a journal-style review process.
We need a broader scope: Data-centric machine learning research should look beyond only the end artifacts, i.e., the dataset and benchmark, and produce fundamental scientific methodologies for creating those artifacts.
We need a process to accommodate diverse, focused topics: As the field grows, such topics need a home, for example in the form of special issues hosted on a prestigious platform.

These challenges have brought together several adjacent communities and prompted sustained discussions in various forums over the last few years. Today, we are excited to announce the founding of the Journal of Data-centric Machine Learning Research (DMLR). DMLR is the latest member of the JMLR family, which consists of the well-established Journal of Machine Learning Research (JMLR), the JMLR Machine Learning Open Source Software (MLOSS) track, the Proceedings of Machine Learning Research (PMLR), and the Transactions on Machine Learning Research (TMLR). The key difference between DMLR and its sister journals is its focus on data, broadly defined, covering but not limited to the following topics:

Datasets for machine learning research
Benchmarks for machine learning research (collections of datasets with particular aims)
Benchmarking tools and methods
Methodology and empirical evaluation of data collection processes, data generation, data labeling, data augmentation processes, generalizability of datasets, feature representations, text generation models, and image generation models
Societal and ethical studies around creation and uses of data
Fundamental contributions (theoretical or empirical) on various aspects of data quality, including data bias, variance, uncertainty and their influence on ML
Algorithms for data cleaning, acquisition, quality evaluation, and alignment for ML
Prompt design and creation for generative and foundational models
Experimental design, registered experiments, methodology of empirical evaluations, including design of competitions and benchmarks
Frameworks for responsible dataset development, audits of existing datasets, identifying significant problems with existing datasets and their use
Systematic analyses of existing systems on novel datasets or benchmarks that yield important new insight

DMLR aims to maintain the same high bar for quality as JMLR, with a rigorous review process supported by high-profile, dedicated editorial, advisory, and review boards that bring diverse expertise and representation from all adjacent fields. We will continue to refine and tailor our process to better serve the special nature of data-centric ML research, and we hope that DMLR will become the venue of choice for innovations in this area. In the longer term, DMLR aims to become a beacon of responsible scientific data creation, curation, and management, and of all associated aspects of reproducible science.

DMLR will only be as strong as the community behind it, and we need your help! Please suggest the names of people (including yourself!) whom you think we should reach out to. If you receive such an invitation, we hope that you will accept and help us. If you have suggestions for upcoming special issues, we’d also love to hear them! And if you are passionate about DMLR-related initiatives, please reach out; we would be excited to hear your ideas. To share them, you can join our Discord channel or email any of the editors-in-chief.

Links:

Website: https://data.mlr.press/
Discord: https://discord.gg/Dk2gPvKMPv

Editors-in-chief

Newsha Ardalani (Meta)
Isabelle Guyon (Google)
Neil Lawrence (University of Cambridge)
Joaquin Vanschoren (TU Eindhoven)
Ce Zhang (ETH Zurich)

Executive Editor

Merve Gürel (TU Delft)