One area that remains relatively underexplored is how data-centric methods perform across different data modalities, domains, and downstream applications. A key question is: which lessons can be shared across these settings, and which are domain-specific? Understanding these nuances is critical for building data-centric methods that are robust, efficient, and adaptable across domains.
This ICML 2025 workshop aims to bring together researchers and practitioners to bridge the gap between domain-specific data-centric approaches and identify generalizable principles. Participants will explore theoretical frameworks, empirical findings, and practical tools that enable effective knowledge transfer across the diverse landscape of data-centric AI.
📅 Workshop Date: Saturday, July 19 📍 Room: Meeting Room 208-209
We invite submissions from researchers in the field of data-centric ML. Our topics of interest include, but are not limited to:
Non-Archival: This workshop will not have formal proceedings.
Submission URL: Please submit your work via OpenReview. To help maintain the quality of the review process, we kindly ask you to nominate a potential reviewer by providing their email address in the OpenReview submission.
Length and Formatting: Submitted papers must be 4 to 9 pages long, including figures and tables, and must be in PDF format using the ICML 25 Style Files. Authors may upload unlimited references and supplementary materials with their submissions. We will use a double-blind review process.
Important Dates:
We are sponsored by DatologyAI and will be awarding prizes to the best submissions!
If you have any questions, please email us at data_world@googlegroups.com.
Time | Type of Event | Speakers |
---|---|---|
08:55 - 09:05 | Opening Remarks | Organizers |
09:05 - 09:35 | Invited Talk | Pang Wei Koh |
09:35 - 10:05 | Invited Talk | Hu Xu |
10:05 - 11:20 | Poster Session | Accepted poster presenters |
11:20 - 11:30 | Coffee Break | |
11:30 - 12:15 | Panel | Invited panelists & moderator |
12:15 - 12:30 | Oral Presentation: Leveraging Base Language Models for Few-Shot Synthetic Data Generation | Accepted oral presenter |
12:30 - 12:45 | Oral Presentation: DataDecide: How to Predict Best Pretraining Data with Small Experiments | Accepted oral presenter |
12:45 - 13:30 | Lunch Break | |
13:30 - 14:00 | Invited Talk | Aditi Raghunathan |
14:00 - 14:30 | Invited Talk | Ari Morcos |
14:30 - 14:45 | Oral Presentation: Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights | Accepted oral presenter |
14:45 - 15:00 | Oral Presentation: KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | Accepted oral presenter |
15:00 - 15:10 | Coffee Break | |
15:10 - 16:20 | Poster Session | Accepted poster presenters |
16:20 - 16:50 | Invited Talk | James Zou |
16:50 - 17:00 | Closing Remarks & Awards | Organizers |
To ensure the accessibility of our workshop for virtual attendees, we will stream all presentations and facilitate questions from online attendees via TBD.
We are pleased to announce the following accepted papers for the DataWorld 2025 workshop. All papers will be presented during the designated poster sessions.
Paper Title & Authors | Serial Number | Poster Session |
---|---|---|
UNREAL: Unlabeled Nodes Retrieval and Labeling for Heavily-imbalanced Node Classification Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Zengfeng Huang | 3 | Morning Session |
Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala | 6 | Morning Session |
DataS3: Dataset Subset Selection for Specialization Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Abigail O'Neill, Timm Haucke, Sandeep Mukherjee, Vikram V. Ramaswamy, Judy Hanwen Shen, Gabriel Tseng, Mike Walmsley, Tsun-Hsuan Wang, Hannah Kerner, Irene Y. Chen, Yogesh Girdhar + 3 more authors | 7 | Morning Session |
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao | 8 | Morning Session |
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems Elad Levi, Ilan Kadar | 9 | Morning Session |
DCA-Bench: A Benchmark for Dataset Curation Agents Benhao Huang, Yingzhuo Yu, Jin Huang, Xingjian Zhang, Jiaqi W. Ma | 10 | Morning Session |
Towards Cross-Modal Error Detection with Tables and Images Olga Ovcharenko, Sebastian Schelter | 13 | Morning Session |
Lookahead Bias in Pretrained Language Models Suproteem K Sarkar, Keyon Vafa | 14 | Morning Session |
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee | 15 | Morning Session |
R&B: Breaking the Data Mixing Bottleneck with Just 0.01% Overhead Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala | 19 | Morning Session |
Leveraging Base Language Models for Few-Shot Synthetic Data Generation Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia | 22 | Morning Session |
Embrace the Diversity: Avoiding Mode Collapse with Polarized Curation in Generative Retraining Ali Falahati, Mohammad Mohammadi Amiri, Kate Larson, Lukasz Golab | 23 | Morning Session |
Do Data Valuations Make Good Data Prices? Dongyang Fan, Tyler J. Rotello, Sai Praneeth Karimireddy | 25 | Morning Session |
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn | 26 | Morning Session |
Faithful Group Shapley Value Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang | 27 | Morning Session |
LARP: Learner-Agnostic Robust Data Prefiltering Kristian Minchev, Dimitar Iliev Dimitrov, Nikola Konstantinov | 28 | Morning Session |
Filter, Augment, Forecast: Online Data Selection for Robust Time Series Forecasting Ege Onur Taga, Halil Alperen Gozeten, Kutay Tire, Rahul Dalvi, Reinhard Heckel, Samet Oymak | 29 | Morning Session |
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs Dongyang Fan, Vinko Sabolčec, Matin Ansaripour, Ayush Kumar Tarun, Martin Jaggi, Antoine Bosselut, Imanol Schlag | 32 | Morning Session |
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment for Code Elyas Obbad, Brando Miranda, Iddah Mlauzi, Rylan Schaeffer, Kamal Obbad, Suhana Bedi, Sanmi Koyejo | 33 | Morning Session |
Aquilon: Towards Building Multimodal Weather LLMs Sumanth Varambally, Veeramakali Vignesh Manivannan, Yasaman Jafari, Luyu Han, Zachary Novack, Zhirui Xia, Salva Rühling Cachay, Srikar Eranky, Ruijia Niu, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu | 34 | Morning Session |
Robust Reward Modeling via Causal Rubrics and Synthetic Data Curation Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup | 35 | Morning Session |
Inferring the Invisible: Neuro-Symbolic Rule Discovery for Missing Value Imputation Wendi Ren, Ke Wan, Junyu Leng, Shuang Li | 36 | Morning Session |
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework Can Polat, HASAN KURBAN, Erchin Serpedin, Mustafa Kurban | 37 | Morning Session |
What Variables Affect Out-of-Distribution Generalization in Pretrained Models? Md Yousuf Harun, Kyungbok Lee, Jhair Gallardo, Giri P Krishnan, Christopher Kanan | 39 | Morning Session |
SIEVE: A Scalable and General Purpose Data Filtering System for Large Language Models Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert D Nowak | 40 | Morning Session |
Multimodal-Guided Dynamic Dataset Pruning for Robust and Generalizable Data-Centric Learning Suorong Yang, Peijia Li, Yujie Liu, Xu Zhiming, Peng Ye, Wanli Ouyang, Furao Shen, Dongzhan Zhou | 42 | Morning Session |
Active sample selection with stable reversible graph convolutional networks Hichem Sahbi | 43 | Morning Session |
f-INE: Influence Estimation using Hypothesis Testing Subhodip Panda, Shashwat Sourav, Prathosh AP, Sai Praneeth Karimireddy | 46 | Afternoon Session |
EvalX: A Platform for Code LLM Evaluation in the Wild Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar | 47 | Afternoon Session |
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue | 48 | Afternoon Session |
AutoDavis: Automatic and Dynamic Evaluation Protocol of Large Vision-Language Models on Visual Question-Answering Han Bao, Yue Huang, Yanbo Wang, Jiayi Ye, Xiangqi Wang, Xiuying Chen, Yue Zhao, Tianyi Zhou, Mohamed Elhoseiny, Xiangliang Zhang | 52 | Afternoon Session |
Daunce: Data Attribution through Uncertainty Estimation Xingyuan Pan, Chenlu Ye, Joseph Melkonian, Jiaqi W. Ma, Tong Zhang | 53 | Afternoon Session |
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh | 54 | Afternoon Session |
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Nigel Collier, Dirk Hovy, Paul Röttger | 56 | Afternoon Session |
How to Recommend a Dataset for Model Training Team? Rethinking Proxy-Model-based Technique Jiachen T. Wang, Tong Wu, Kaifeng Lyu, Dawn Song, Ruoxi Jia, Prateek Mittal | 58 | Afternoon Session |
Domain-Constrained Diffusion Models to Synthesize Tabular Data: A Case Study in Power Systems Milad Hoseinpour, Vladimir Dvorkin | 59 | Afternoon Session |
HiLWS: A Human-in-the-Loop Weak Supervision Framework for Curating Clinical and Home Video Data for Neurological Assessment Atefeh Irani, Maryam Mirian, Alexander Lassooij, Reshad Hosseini, Hadi Moradi, Martin J. McKeown | 61 | Afternoon Session |
Core Knowledge Deficits in Multi-Modal Language Models Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haiyun Lyu, Haoran Sun, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, Hokin Deng | 62 | Afternoon Session |
DataDecide: How to Predict Best Pretraining Data with Small Experiments Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge | 63 | Afternoon Session |
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck | 67 | Afternoon Session |
General and Estimable Learning Bound Unifying Covariate and Concept Shifts Hongbo Chen, Li C Xia | 68 | Afternoon Session |
Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings Lucas Mattioli, Youness Ait Hadichou, Sabrina Chaouche, Martin Gonzalez | 69 | Afternoon Session |
Less is More? Data Specialization for Self-Supervised Remote Sensing Models Alvard Barseghyan, Ani Vanyan, Hakob Tamazyan, Evan Shelhamer, Hrant Khachatrian | 70 | Afternoon Session |
Quantifying the Importance of Data Alignment in Downstream Model Performance Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda, Elyas Obbad, Sanmi Koyejo | 73 | Afternoon Session |
Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning Soham Kulkarni, Raayan Dhar, Yuchen Cui | 74 | Afternoon Session |
SNAC-DB: The Hitchhiker's Guide to Building Better Predictive Models of Antibody & NANOBODY® VHH–Antigen Complexes Abhinav Gupta, Bryan Munoz Rivero, Jorge Roel-Touris, Ruijiang Li, Norbert Furtmann, Yves Fomekong Nanfack, Maria Wendt, Yu Qiu | 75 | Afternoon Session |
Evaluating Deepfake Detectors in the Wild Viacheslav Pirogov | 76 | Afternoon Session |
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran | 79 | Afternoon Session |
VISUALSPHINX: Large-Scale Synthetic Vision Logic Puzzles for RL Yichen Feng, Zhangchen Xu, Fengqing Jiang, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran | 80 | Afternoon Session |
The BrainApp Study: Engineering a New Frontier in Brain Tumor Speech Research N. AIZAAN ANWAR, Elias Allara, Lucia Specia, Matt Williams | 81 | Afternoon Session |
FAIM: Fair Imputation with Adversarial Training for Mitigating Bias in Missing Data Rasta Tadayon, Haewon Jeong, Ramtin Pedarsani | 85 | Afternoon Session |
Pearls from Pebbles: Improved Confidence Functions for Auto-labeling Harit Vishwakarma, Yi Chen, Sui Jiet Tay, Satya Sai Srinath Namburi GNVV, Frederic Sala, Ramya Korlakai Vinayak | 87 | Afternoon Session |
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li | 88 | Afternoon Session |
How to Get Your LLM to Generate Challenging Problems for Evaluation Arkil Patel, Siva Reddy, Dzmitry Bahdanau | 91 | Afternoon Session |
DEETS: Detailed Evaluation of Image Text Specificity Yasumasa Onoe, Hailey Joren, Cyrus Rashtchian, Su Wang, Olivia Wiles, Yonatan Bitton, Brian Gordon, Keran Rong, Austin Waters, Jason Michael Baldridge, Roopal Garg, Radu Soricut, Jordi Pont-Tuset | 94 | Afternoon Session |
Note: The poster sessions are scheduled as follows: