
MOSTLY AI Launches $100K Synthetic Data Challenge for Open-Source Innovators

Jun 17, 2025


MOSTLY AI has announced a global competition offering a total of $100,000 in cash prizes to participants who can generate the most accurate and privacy-safe synthetic datasets. The Synthetic Data Challenge 2025 is now live and will run until July 3, 2025. The contest includes two independent challenges, each awarding a $50,000 prize to a single winner.

Two Tracks to Compete In

Participants may enter one or both of the following challenges:

  • FLAT DATA Challenge: Requires a CSV submission of 100,000 records across 80 columns (60 numeric, 20 categorical).
  • SEQUENTIAL DATA Challenge: Involves submitting 20,000 grouped records with 10 columns (7 numeric, 3 categorical), where each group contains 5–10 entries.
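
Before uploading, it can help to confirm that a file matches the dimensions quoted above. The following is a minimal sketch using only the Python standard library. The function names and the `group_col` parameter are hypothetical, and the article does not name the grouping column or say whether it counts toward the 10 columns, so the sequential check looks only at group sizes.

```python
import csv
from collections import Counter

# Figures quoted in the article.
FLAT_ROWS, FLAT_COLS = 100_000, 80
SEQ_GROUP_MIN, SEQ_GROUP_MAX = 5, 10


def check_flat(csv_file, rows=FLAT_ROWS, cols=FLAT_COLS):
    """Return a list of problems found in a flat submission (empty list = OK)."""
    reader = csv.reader(csv_file)
    header = next(reader, [])          # first line is assumed to be a header row
    problems = []
    if len(header) != cols:
        problems.append(f"expected {cols} columns, found {len(header)}")
    n_records = sum(1 for _ in reader)  # remaining lines are data records
    if n_records != rows:
        problems.append(f"expected {rows} records, found {n_records}")
    return problems


def check_group_sizes(csv_file, group_col, lo=SEQ_GROUP_MIN, hi=SEQ_GROUP_MAX):
    """Return {group_id: size} for every group whose size is outside [lo, hi].

    `group_col` is a hypothetical column name; use whatever identifies a
    group in the actual training data.
    """
    sizes = Counter(row[group_col] for row in csv.DictReader(csv_file))
    return {group: n for group, n in sizes.items() if not lo <= n <= hi}
```

Both functions take an open file object, for example `check_flat(open("flat_submission.csv", newline=""))`, where the file name is only a placeholder.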

Each task requires generating a novel synthetic dataset that mirrors the statistical patterns of the original training set without overfitting to it: submissions must not be significantly closer to the released training data than to the withheld holdout dataset.

How It Works

Participants can use any open-source tools, including libraries like SynthCity, Reprosyn, or MOSTLY AI’s own Synthetic Data SDK. Submissions must be reproducible, executable within six hours on standard cloud infrastructure, and fully open source.

In Stage 1, participants submit the generated data in CSV format. Each submission is evaluated with the Synthetic Data Quality Assurance toolkit, which compares its distributions against both the training and hidden holdout datasets.

To appear on the leaderboard, entries must meet two privacy thresholds:

  • DCR Share below 52%
  • NNDR Ratio above 0.5
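
To make the first threshold concrete, here is a simplified sketch of a DCR (distance to closest record) share: the fraction of synthetic records whose nearest training record is closer than their nearest holdout record. It assumes purely numeric records, Euclidean distance, and ties counted as one half. The article does not describe how the toolkit encodes mixed-type columns or measures distance, so this illustrates the idea and is not the official scoring code. The NNDR ratio is treated here as a value reported by the toolkit and is not recomputed.

```python
import math

DCR_SHARE_MAX = 0.52   # article: DCR Share below 52%
NNDR_RATIO_MIN = 0.5   # article: NNDR Ratio above 0.5


def nearest_distance(point, reference):
    """Euclidean distance from `point` to its closest record in `reference`."""
    return min(math.dist(point, r) for r in reference)


def dcr_share(synthetic, training, holdout):
    """Fraction of synthetic records closer to training than to holdout.

    Assumption: a tie counts as half, so a generator that treats both
    sets alike scores about 0.5.
    """
    score = 0.0
    for s in synthetic:
        d_trn = nearest_distance(s, training)
        d_hol = nearest_distance(s, holdout)
        if d_trn < d_hol:
            score += 1.0
        elif d_trn == d_hol:
            score += 0.5
    return score / len(synthetic)


def passes_privacy_gate(dcr_share_value, nndr_ratio_value):
    """Apply the two leaderboard thresholds quoted in the article."""
    return dcr_share_value < DCR_SHARE_MAX and nndr_ratio_value > NNDR_RATIO_MIN
```

A share near 50% suggests the synthetic data is no closer to the training records than to unseen ones, while a share well above 50% hints at memorization. The brute-force search is quadratic in the number of records, so a 100,000-row dataset would need an indexed nearest-neighbor search instead.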

Only the top-ranked submission per participant counts toward the leaderboard.

Stage 2 and Evaluation

The top five entries in each challenge will advance to Stage 2 (July 4–5) and must submit complete, executable code under an OSI-approved open-source license. Final scoring will consider five factors: accuracy, privacy, usability, computational efficiency, and generalizability. Winners will be announced on July 9, 2025.

Encouraging Open Data and Safe Collaboration

The competition reflects MOSTLY AI’s mission to promote open and privacy-preserving access to high-value data. As the company notes, “The intelligence of tomorrow won’t be built on tweets and cat pictures alone.” Synthetic data offers a viable way to share sensitive datasets across teams and organizations without compromising individual privacy.

Submission Rules and Bonus Entries

  • Submissions are limited to 3 per challenge per ISO week.
  • Participants can earn up to 5 additional entries by referring new users.
  • Only GitHub accounts created before May 14, 2025, are eligible.
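
Because the weekly limit follows ISO weeks, which run Monday through Sunday, a small helper can track how many base submissions remain. This is a sketch with hypothetical function names. It does not model bonus entries, since the article does not say how referral entries combine with the weekly quota.

```python
from collections import Counter
from datetime import date

WEEKLY_LIMIT = 3                     # article: 3 submissions per challenge per ISO week
ACCOUNT_CUTOFF = date(2025, 5, 14)   # article: accounts created before May 14, 2025


def iso_week(day):
    """Return (ISO year, ISO week number) for a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def remaining_this_week(past_submissions, today, limit=WEEKLY_LIMIT):
    """Base submissions left for one challenge in the ISO week containing `today`."""
    used = Counter(iso_week(d) for d in past_submissions)[iso_week(today)]
    return max(0, limit - used)


def account_is_eligible(created_on, cutoff=ACCOUNT_CUTOFF):
    """True if the GitHub account was created strictly before the cutoff date."""
    return created_on < cutoff
```

For example, a submission made on Sunday, June 15, 2025 falls in a different ISO week than one made on Monday, June 16, so it does not count against the later week's quota.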

Get Started

Data professionals interested in participating can submit their first entry now. Downloads for both training datasets are available on the competition site.

For questions, participants are encouraged to join the discussion on the official GitHub thread.


Written by ODSC - Open Data Science

