KDD Cup 2026 · Official Competition

Data Agents
for Complex
Data Analysis

Build autonomous AI agents that decompose complex analytical questions, orchestrate multi-step reasoning over heterogeneous data sources, and deliver accurate answers.

Prize Pool: ~$30,000 USD
Competition: Apr – Aug 2026

News

News & Updates

Latest announcements and important updates.

Mar 1, 2026 · Announcement

Official Website Launched

The KDD Cup 2026 competition website is now live. Stay tuned for sample data and baselines on March 15.

01 / Overview

Why Data Agents?

Traditional Data+AI systems have made significant strides in optimizing specific tasks, but they still rely heavily on human experts to orchestrate the end-to-end pipeline. This manual orchestration is a major bottleneck, limiting the scalability and adaptability of data analysis.

We define a Data Agent as a holistic architecture designed to orchestrate Data+AI ecosystems by tackling data-related tasks through integrated knowledge comprehension, reasoning, and planning capabilities. This competition challenges you to build truly autonomous data analysis systems that go far beyond single-shot question answering.

Decompose & Plan

Break down high-level analytical questions into multi-step, executable plans autonomously.

Tool Selection & Invocation

Select and invoke appropriate tools — Python scripts, SQL queries, API calls — at each reasoning step.

Heterogeneous Data Reasoning

Reason over structured tables, unstructured documents, charts, and multi-modal data sources.

Result Synthesis

Synthesize intermediate results across multiple steps to arrive at a final, accurate answer.

Broader Impact

Robust Data Agents have the potential to revolutionize how we interact with data. They can democratize data science by enabling non-experts to perform sophisticated analyses through natural language. For enterprises, they can automate the work of data analysts and database administrators, leading to massive efficiency gains. This competition will stimulate new research in agent architectures, planning algorithms, tool use, and self-reflection for AI systems.

02 / Benchmark

DataAgent-Bench

Each task in DataAgent-Bench presents a self-contained data analysis challenge. The agent receives a heterogeneous data package and a high-level natural language question, and must autonomously orchestrate a complex reasoning process to produce the final answer.

Input: Data Package

Heterogeneous, multi-modal data sources

task_001/data/
database.sqlite
regional_report.pdf
product_catalog.json
quarterly_targets.png
business_handbook.docx
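
As a rough illustration of how an agent might take stock of such a package before planning, the sketch below groups the files listed above by modality. The suffix-to-modality mapping and the name `group_by_modality` are assumptions made for this example; they are not part of the official starter kit.

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical mapping from file suffix to a coarse modality label.
MODALITY_BY_SUFFIX = {
    ".sqlite": "database",
    ".pdf": "document",
    ".docx": "document",
    ".json": "structured_file",
    ".png": "image",
}


def group_by_modality(filenames):
    """Group file names by modality; unrecognized suffixes go to 'unknown'."""
    groups = defaultdict(list)
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        groups[MODALITY_BY_SUFFIX.get(suffix, "unknown")].append(name)
    return {modality: sorted(names) for modality, names in groups.items()}


# File names taken from the task_001/data/ listing above.
package = [
    "database.sqlite",
    "regional_report.pdf",
    "product_catalog.json",
    "quarterly_targets.png",
    "business_handbook.docx",
]
groups = group_by_modality(package)
for modality, names in sorted(groups.items()):
    print(f"{modality}: {', '.join(names)}")
```

An inventory like this lets the planner decide early which tools (SQL, document parsing, image analysis) a task is likely to need.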

Non-Linear Reasoning Topology

Multi-step data analysis pipeline with branching and loops

Unlike simple linear chains, real-world data analysis often requires branching (parallel sub-queries), loops (iterative refinement), and convergence (merging results from multiple paths). DataAgent-Bench captures this complexity with DAG-structured reasoning graphs.

Reasoning Topology Patterns

Sequential Chain
A → B → C → D

Each step depends on the previous step's output. Errors propagate downstream.

Branching & Merging
A → (B₁ ∥ B₂) → C (merge)

Parallel sub-queries across different data sources, then merge results.

Iterative Loop
A → B → C (with a loop back to an earlier step)

Iterative refinement where the agent revisits and corrects intermediate results.
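
The three patterns above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's execution engine: it uses the standard-library `graphlib.TopologicalSorter` to group steps into batches that could run in parallel, and a bounded retry helper for the iterative loop. The helper names `ready_batches` and `refine`, and the `max_rounds` budget, are assumptions made for this example.

```python
from graphlib import TopologicalSorter


def ready_batches(deps):
    """Group steps into batches; steps within a batch do not depend on each other.

    `deps` maps each step to the set of steps it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        batches.append(batch)
        sorter.done(*batch)
    return batches


def refine(value, step, accept, max_rounds=3):
    """Iterative loop: re-run `step` until `accept` passes or the budget is spent."""
    for _ in range(max_rounds):
        if accept(value):
            break
        value = step(value)
    return value


# Sequential chain: every step waits for the previous one.
chain = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"C"}}
# Branching and merging: B1 and B2 both follow A, and C merges them.
branch = {"A": set(), "B1": {"A"}, "B2": {"A"}, "C": {"B1", "B2"}}

print(ready_batches(chain))
print(ready_batches(branch))
```

Because a topological sort rejects cycles, the loop pattern is modeled here as a bounded retry inside a single step rather than as a back edge in the graph.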

Example Task

Natural Language Query

"Our Q3 regional market analysis report identifies the region with the strongest year-over-year growth. For that region, pull the total actual sales revenue of all Electronics products from our sales database. Then, compare this figure against the quarterly sales target shown in the performance dashboard chart. Report the percentage difference."

Expected Reasoning Graph

This example demonstrates a branching pattern: after identifying the target region, the agent spawns two parallel sub-tasks (database query and chart analysis), then merges results for the final computation.

A · Document QA
Read PDF report → identify top-growth region
Output: "East Asia"

B₁ · Text-to-SQL
Query sales WHERE region = "East Asia" AND category = "Electronics"
Output: $4,200,000

B₂ · Image Analysis
Read performance dashboard chart → extract Q3 target
Output: $3,800,000

C · Python Computation (merges B₁ and B₂)
Compute percentage difference: (4,200,000 - 3,800,000) / 3,800,000
Output: +10.5%

Final Answer: +10.5%
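
The final merge step of this example can be checked with a short calculation. The sketch below hard-codes the intermediate values shown in the example; a real agent would obtain them from the PDF report, the sales database, and the dashboard chart. The function names and the one-decimal output format are assumptions for illustration.

```python
def percentage_difference(actual: float, target: float) -> float:
    """Return how far `actual` is above or below `target`, in percent."""
    if target == 0:
        raise ValueError("target must be non-zero")
    return (actual - target) / target * 100.0


def format_answer(pct: float) -> str:
    """Format with an explicit sign and one decimal place."""
    return f"{pct:+.1f}%"


# Intermediate results copied from the example task (not computed here).
step_outputs = {
    "A": "East Asia",   # top-growth region from the PDF report
    "B1": 4_200_000,    # actual Electronics revenue from the database
    "B2": 3_800_000,    # Q3 target read from the dashboard chart
}

final_answer = format_answer(
    percentage_difference(step_outputs["B1"], step_outputs["B2"])
)
print(final_answer)  # +10.5%
```

Note that the difference is taken relative to the target, so 400,000 / 3,800,000 ≈ 10.5%.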

Difficulty Levels

Level  | Steps | Data Sources                         | Topology
Easy   | 1–2   | DB + 1 document                      | Linear chain
Medium | 2–3   | DB + 2 documents                     | Linear / branching
Hard   | 4+    | DB + 3 documents (incl. image)       | Branching + merging
PhD    | 5+    | DB + multi-source, cross-referencing | DAG with loops

03 / Evaluation

Scoring & Evaluation

Our evaluation framework combines automated scoring with expert human review. The scoring system penalizes hallucination, encouraging the development of trustworthy and reliable agents. The final score is a macro-average across all questions.

1.0

Perfect

Correctly and completely answers the question with no hallucinated content.

0.5

Acceptable

Provides a useful answer but may contain minor, non-harmful errors.

0

Missing

The agent responds that it does not know the answer.

-1.0

Incorrect

The response contains wrong or irrelevant information. The negative score penalizes hallucination.
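
To make the rubric concrete, the sketch below averages per-question scores using the four values above. It assumes the final score is the plain mean of per-question scores and that each answer has already been assigned one of the four labels; how labels are assigned is not specified here, and the names `SCORES` and `final_score` are hypothetical.

```python
# Score values taken from the rubric above.
SCORES = {
    "perfect": 1.0,
    "acceptable": 0.5,
    "missing": 0.0,
    "incorrect": -1.0,
}


def final_score(labels):
    """Average the per-question scores (assumed reading of the final score)."""
    if not labels:
        raise ValueError("at least one graded question is required")
    return sum(SCORES[label] for label in labels) / len(labels)


# Abstaining scores higher than answering incorrectly.
abstain = final_score(["perfect", "missing"])
guess_wrong = final_score(["perfect", "incorrect"])
print(abstain, guess_wrong)  # 0.5 0.0
```

Under this reading, an agent that says it does not know keeps its average at 0.5 in the example, while a wrong guess drops it to 0.0.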

Two-Stage Evaluation Process

Stage 1: Automated Evaluation

Throughout the competition, a public leaderboard provides real-time feedback based on automated scoring against a hidden test set. This uses the benchmark's automatic correctness checks for immediate, objective assessment.

Stage 2: Human Evaluation

The final ranking of the top 10 teams will be determined by a panel of expert human evaluators. This ensures that the nuances of answer quality and real-world impact are properly assessed beyond what automated metrics can capture.

04 / Timeline

Competition Timeline

The competition runs from April to August 2026, with a preview release in March and two competitive phases designed to identify and challenge the strongest teams.

Preview · Mar 15, 2026

Sample Data & Baselines Release

Partial DataAgent-Bench examples and baseline implementations are publicly released.

Phase 1 · Apr 1, 2026

Competition Officially Opens

Full dataset release, starter kit available, and public leaderboard goes live.

Phase 1 · Apr 1 – May 15, 2026

Phase 1: Open Competition

Open competition with public leaderboard for all registered teams.

Phase 2 · May 20, 2026

Phase 2 Begins

Top teams from Phase 1 are invited to the final round.

Phase 2 · May 31, 2026

Registration Freeze

Team formation and registration deadline.

Phase 2 · Jun 30, 2026

Final Submission Deadline

Phase 2 ends. All final submissions must be completed.

Final · Jul 15, 2026

Winners Notified

Top teams are notified of their results.

Final · Aug 9, 2026

KDD 2026 Announcement

Formal announcement of winners at KDD 2026.

05 / Prizes

Awards & Recognition

We offer a substantial prize pool to encourage broad and enthusiastic participation from the global research community.

Total Prize Pool

~$30,000 USD

The prize distribution among top-performing teams is to be determined. Details will be announced on this page as they are finalized.

Distribution TBD

Beyond Prizes

KDD Cup Workshop Presentation

Winning teams will have the opportunity to present their solutions at the KDD Cup Workshop at KDD 2026, a dedicated half-day session providing significant visibility for their work to the broader data mining and AI community.

Community Recognition

Top-performing teams will be recognized at the formal KDD 2026 Winners Announcement ceremony, gaining visibility among leading researchers and practitioners in the field.

06 / Organizers

Organizing Team

The organizing team is a collaboration of leading researchers from Tsinghua University and HKUST (Guangzhou), with extensive expertise in Data+AI systems, large language models, and agentic AI.

Boyan Li

Primary Contact

PhD Student

HKUST (Guangzhou)

Research focuses on Text-to-SQL and Data Agents. Published 14 papers in top venues including KDD, ICML, NeurIPS, and VLDB.

Guoliang Li

Professor

Tsinghua University

ACM Fellow and IEEE Fellow. Research focuses on learning-based databases and data-centric AI. VLDB 2017 Early Research Contribution Award recipient. Served as SIGMOD 2021 General Co-Chair and ICDE 2027 PC Co-Chair.

Nan Tang

Associate Professor

HKUST (Guangzhou) & HKUST

ACM Distinguished Member. Research interests include AI4DB and data-centric AI. Recipient of the VLDB 2010 Best Paper Award and the SIGMOD 2024 Research Highlight Award. Co-organized the KDD Cup 2024 CRAG Challenge.

Yuyu Luo

Primary Contact

Assistant Professor

HKUST (Guangzhou) & HKUST

Research at the intersection of Data and AI, focusing on Data Agents and Data-centric AI. 50+ publications in top-tier DB and AI venues (SIGMOD, VLDB, KDD, ICML, NeurIPS, ICLR). Best-of-SIGMOD 2023 Papers recipient. Co-organized the LLM+Vector Data Workshop at ICDE 2026, the Agentic Data System Workshop at VLDB 2026, and presented Data Agent tutorials at SIGMOD and VLDB.

07 / FAQ

Frequently Asked Questions

08 / Sponsors

Sponsors & Partners

We are grateful to our sponsors for their generous support. Sponsorship details will be announced soon.

Gold Sponsors

Coming Soon

Silver Sponsors

Coming Soon

Interested in sponsoring KDD Cup 2026?

Contact us for sponsorship opportunities