ICML 2025 Workshop on


DataWorld: Unifying Data Curation Frameworks Across Domains





Overview

One area that remains relatively underexplored is how data-centric methods perform across different data modalities, domains, and downstream applications. A key question is: which lessons can be shared across these settings, and which are domain-specific? Understanding these nuances is critical for building data-centric methods that are robust, efficient, and adaptable across domains.

This ICML 2025 workshop aims to bring together researchers and practitioners to bridge the gap between domain-specific data-centric approaches and identify generalizable principles. Participants will explore theoretical frameworks, empirical findings, and practical tools that enable effective knowledge transfer across the diverse landscape of data-centric AI.

📅 Workshop Date: Saturday, July 19     📍 Room: Meeting Room 208-209

Call for Papers

We invite submissions from researchers in the field of data-centric ML. Our topics of interest include, but are not limited to:

  • Domain-specific data issues: Challenges and best practices in data curation across modalities and domains.
  • Human-in-the-loop: Standards and trade-offs in annotation quality, crowd-sourcing vs. expert labeling.
  • Data & Society: Ethical sourcing, privacy, fairness, and adversarial risks in real-world datasets.
  • Benchmarks & evaluation: Rigorous evaluation of data pipelines and generalizable data quality metrics.

Non-Archival: This workshop will not have formal proceedings.

Submission URL: Please submit your work via OpenReview. To help maintain the quality of the review process, we kindly ask you to nominate a potential reviewer by providing their email address in the OpenReview submission.

Length and Formatting: Submitted papers must be 4 to 9 pages long, including figures and tables, in PDF format using the ICML 2025 style files. Authors are permitted to upload unlimited supplementary materials and references with their submissions. We will use a double-blind review process.

Important Dates:

  • Paper Submission Deadline: May 31 (extended from May 24)
  • Author Notification: June 9
  • Camera-ready Deadline: July 13

We are sponsored by DatologyAI and will be awarding prizes to the best submissions!

  • 🥇 $1,000 Best Paper Award
  • 🥈 $250 Honorable Mention (×2)

If you have any questions, please email us at data_world@googlegroups.com.


Schedule

Morning Session


Time | Type of Event | Speakers
08:55 - 09:05 | Opening Remarks | Organizers
09:05 - 09:35 | Invited Talk | Pang Wei Koh
09:35 - 10:05 | Invited Talk | Hu Xu
10:05 - 11:20 | Poster Session | Accepted poster presenters
11:20 - 11:30 | Coffee Break |
11:30 - 12:15 | Panel | Invited panelists & moderator
12:15 - 12:30 | Oral Presentation: Leveraging Base Language Models for Few-Shot Synthetic Data Generation | Accepted oral presenter
12:30 - 12:45 | Oral Presentation: DataDecide: How to Predict Best Pretraining Data with Small Experiments | Accepted oral presenter
12:45 - 13:30 | Lunch Break |

Afternoon Session


13:30 - 14:00 | Invited Talk | Aditi Raghunathan
14:00 - 14:30 | Invited Talk | Ari Morcos
14:30 - 14:45 | Oral Presentation: Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights | Accepted oral presenter
14:45 - 15:00 | Oral Presentation: KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Accepted oral presenter
15:00 - 15:10 | Coffee Break |
15:10 - 16:20 | Poster Session | Accepted poster presenters
16:20 - 16:50 | Invited Talk | James Zou
16:50 - 17:00 | Closing Remarks & Awards | Organizers
 

To ensure the accessibility of our workshop for virtual attendees, we will stream all presentations and facilitate questions from online attendees via TBD.

Invited Speakers




Pang Wei Koh

University of Washington

Hu Xu

Facebook AI Research (FAIR)

Ari Morcos

DatologyAI

Aditi Raghunathan

Carnegie Mellon University

James Zou

Stanford University

Panelists




Irene Chen

UC Berkeley

Liam Parker

Polymathic AI

Jason Laska

Instacart

Priya L. Donti

Massachusetts Institute of Technology

Workshop Organizers




Sara Beery

Massachusetts Institute of Technology

Benjamin Feuer

New York University

Neha Hulkund

Massachusetts Institute of Technology

Thao Nguyen

University of Washington, Meta AI Research

Sewoong Oh

University of Washington



Ludwig Schmidt

Stanford University, Anthropic

Serena Yeung-Levy

Stanford University

Yuhui Zhang

Stanford University

Niv Cohen

New York University

Sponsored by

DatologyAI

Accepted Papers

We are pleased to announce the following accepted papers for the DataWorld 2025 workshop. All papers will be presented during the designated poster sessions.


Each entry lists the paper title, its authors, and the serial number with the assigned poster session.
UNREAL: Unlabeled Nodes Retrieval and Labeling for Heavily-imbalanced Node Classification
Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Zengfeng Huang
3 Morning Session
Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights
Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala
6 Morning Session
DataS3: Dataset Subset Selection for Specialization
Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Abigail O'Neill, Timm Haucke, Sandeep Mukherjee, Vikram V. Ramaswamy, Judy Hanwen Shen, Gabriel Tseng, Mike Walmsley, Tsun-Hsuan Wang, Hannah Kerner, Irene Y. Chen, Yogesh Girdhar + 3 more authors
7 Morning Session
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
8 Morning Session
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Elad Levi, Ilan Kadar
9 Morning Session
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang, Yingzhuo Yu, Jin Huang, Xingjian Zhang, Jiaqi W. Ma
10 Morning Session
Towards Cross-Modal Error Detection with Tables and Images
Olga Ovcharenko, Sebastian Schelter
13 Morning Session
Lookahead Bias in Pretrained Language Models
Suproteem K Sarkar, Keyon Vafa
14 Morning Session
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
15 Morning Session
R&B: Breaking the Data Mixing Bottleneck with Just 0.01% Overhead
Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala
19 Morning Session
Leveraging Base Language Models for Few-Shot Synthetic Data Generation
Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
22 Morning Session
Embrace the Diversity: Avoiding Mode Collapse with Polarized Curation in Generative Retraining
Ali Falahati, Mohammad Mohammadi Amiri, Kate Larson, Lukasz Golab
23 Morning Session
Do Data Valuations Make Good Data Prices?
Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy
25 Morning Session
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users
Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
26 Morning Session
Faithful Group Shapley Value
Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang
27 Morning Session
LARP: Learner-Agnostic Robust Data Prefiltering
Kristian Minchev, Dimitar Iliev Dimitrov, Nikola Konstantinov
28 Morning Session
Filter, Augment, Forecast: Online Data Selection for Robust Time Series Forecasting
Ege Onur Taga, Halil Alperen Gozeten, Kutay Tire, Rahul Dalvi, Reinhard Heckel, Samet Oymak
29 Morning Session
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Dongyang Fan, Vinko Sabolčec, Matin Ansaripour, Ayush Kumar Tarun, Martin Jaggi, Antoine Bosselut, Imanol Schlag
32 Morning Session
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment for Code
Elyas Obbad, Brando Miranda, Iddah Mlauzi, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo
33 Morning Session
Aquilon: Towards Building Multimodal Weather LLMs
Sumanth Varambally, Veeramakali Vignesh Manivannan, Yasaman Jafari, Luyu Han, Zachary Novack, Zhirui Xia, Salva Rühling Cachay, Srikar Eranky, Ruijia Niu, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu
34 Morning Session
Robust Reward Modeling via Causal Rubrics and Synthetic Data Curation
Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup
35 Morning Session
Inferring the Invisible: Neuro-Symbolic Rule Discovery for Missing Value Imputation
Wendi Ren, Ke Wan, Junyu Leng, Shuang Li
36 Morning Session
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework
Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
37 Morning Session
What Variables Affect Out-of-Distribution Generalization in Pretrained Models?
Md Yousuf Harun, Kyungbok Lee, Jhair Gallardo, Giri P Krishnan, Christopher Kanan
39 Morning Session
SIEVE: A Scalable and General Purpose Data Filtering System for Large Language Models
Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert D Nowak
40 Morning Session
Multimodal-Guided Dynamic Dataset Pruning for Robust and Generalizable Data-Centric Learning
Suorong Yang, Peijia Li, Yujie Liu, Xu Zhiming, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou
42 Morning Session
Active sample selection with stable reversible graph convolutional networks
Hichem Sahbi
43 Morning Session
f-INE: Influence Estimation using Hypothesis Testing
Subhodip Panda, Shashwat Sourav, Prathosh AP, Sai Praneeth Karimireddy
46 Afternoon Session
EvalX: A Platform for Code LLM Evaluation in the Wild
Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar
47 Afternoon Session
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue
48 Afternoon Session
AutoDavis: Automatic and Dynamic Evaluation Protocol of Large Vision-Language Models on Visual Question-Answering
Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuying Chen, Yue Zhao, Tianyi Zhou, Mohamed Elhoseiny, Xiangliang Zhang
52 Afternoon Session
Daunce: Data Attribution through Uncertainty Estimation
Xingyuan Pan, Chenlu Ye, Joseph Melkonian, Jiaqi W. Ma, Tong Zhang
53 Afternoon Session
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh
54 Afternoon Session
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger
56 Afternoon Session
How to Recommend a Dataset for Model Training Team? Rethinking Proxy-Model-based Technique
Jiachen T. Wang, Tong Wu, Kaifeng Lyu, Dawn Song, Ruoxi Jia, Prateek Mittal
58 Afternoon Session
Domain-Constrained Diffusion Models to Synthesize Tabular Data: A Case Study in Power Systems
Milad Hoseinpour, Vladimir Dvorkin
59 Afternoon Session
HiLWS: A Human-in-the-Loop Weak Supervision Framework for Curating Clinical and Home Video Data for Neurological Assessment
Atefeh Irani, Maryam Mirian, Alexander Lassooij, Reshad Hosseini, Hadi Moradi, Martin J. McKeown
61 Afternoon Session
Core Knowledge Deficits in Multi-Modal Language Models
Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haiyun Lyu, Haoran Sun, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng
62 Afternoon Session
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
63 Afternoon Session
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets
Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck
67 Afternoon Session
General and Estimable Learning Bound Unifying Covariate and Concept Shifts
Hongbo Chen, Li C Xia
68 Afternoon Session
Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings
Lucas Mattioli, Youness Ait Hadichou, Sabrina Chaouche, Martin Gonzalez
69 Afternoon Session
Less is More? Data Specialization for Self-Supervised Remote Sensing Models
Alvard Barseghyan, Ani Vanyan, Hakob Tamazyan, Evan Shelhamer, Hrant Khachatrian
70 Afternoon Session
Quantifying the Importance of Data Alignment in Downstream Model Performance
Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda, Elyas Obbad, Sanmi Koyejo
73 Afternoon Session
Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning
Soham Kulkarni, Raayan Dhar, Yuchen Cui
74 Afternoon Session
SNAC-DB: The Hitchhiker's Guide to Building Better Predictive Models of Antibody & NANOBODY® VHH–Antigen Complexes
Abhinav Gupta, Bryan Munoz Rivero, Jorge Roel-Touris, Ruijiang Li, Norbert Furtmann, Yves Fomekong Nanfack, Maria Wendt, Yu Qiu
75 Afternoon Session
Evaluating Deepfake Detectors in the Wild
Viacheslav Pirogov
76 Afternoon Session
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran
79 Afternoon Session
VISUALSPHINX: Large-Scale Synthetic Vision Logic Puzzles for RL
Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
80 Afternoon Session
The BrainApp Study: Engineering a New Frontier in Brain Tumor Speech Research
N. Aizaan Anwar, Elias Allara, Lucia Specia, Matt Williams
81 Afternoon Session
FAIM: Fair Imputation with Adversarial Training for Mitigating Bias in Missing Data
Rasta Tadayon, Haewon Jeong, Ramtin Pedarsani
85 Afternoon Session
Pearls from Pebbles: Improved Confidence Functions for Auto-labeling
Harit Vishwakarma, Yi Chen, Sui Jiet Tay, Satya Sai Srinath Namburi GNVV, Frederic Sala, Ramya Korlakai Vinayak
87 Afternoon Session
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
88 Afternoon Session
How to Get Your LLM to Generate Challenging Problems for Evaluation
Arkil Patel, Siva Reddy, Dzmitry Bahdanau
91 Afternoon Session
DEETS: Detailed Evaluation of Image Text Specificity
Yasumasa Onoe, Hailey Joren, Cyrus Rashtchian, Su Wang, Olivia Wiles, Yonatan Bitton, Brian Gordon, Keran Rong, Austin Waters, Jason Michael Baldridge, Roopal Garg, Radu Soricut, Jordi Pont-Tuset
94 Afternoon Session

Note: The poster sessions are scheduled as follows:

  • Morning Session: 10:05 - 11:20
  • Afternoon Session: 15:10 - 16:20