Genesis: Largest synthetic datasets for LLM training in STEM domains

Genesis provides the global AI community with the high-quality data needed to level the playing field, accelerating the development of open-source LLMs that compete with leading proprietary models.

Genesis II

The second release of QVAC Genesis expands coverage to 10 new domains, including chemistry, computer science, statistics, machine learning, astronomy, and econometrics, and introduces an improved methodology that produces higher-quality synthetic datasets.

More than a scale increase, this release marks a deliberate shift in how educational AI data should be built. Our research aims to empower the community to develop models that reason and explain, grounding intelligence in understanding, not imitation.

Genesis I

We start with Genesis I, a synthetic dataset purpose-built for education-specific content, offering deep and comprehensive coverage across key STEM domains.

The high-quality dataset has been rigorously validated across multiple educational benchmarks, demonstrating superior performance across school- and college-level subjects such as Logical Deduction, Mathematics, Biology, and Medicine.

FAQ

QVAC Genesis is a family of synthetic datasets developed with the focused goal of improving language models in areas where they struggle to reason, generalize, or solve problems. It captures systematically identified weaknesses and transforms them into high-quality, domain-specific learning instances.

With the release of QVAC Genesis II, the QVAC Genesis family now totals 148 billion tokens across 19 educational domains. It distinguishes itself by moving beyond mere imitation of fluent text and grounding intelligence in understanding. This comes from a deliberate shift in how the data is built, helping models explain why something is true rather than just predict what sounds right.