1. Introduction
Overview
Recent advances in large language model (LLM) pretraining have increased the focus on curating high-quality web-scale datasets. Synthetic data, generated to emulate real-world text distributions, has become critical for LLM development. Microsoft’s Phi model [1] demonstrated the value of large-scale synthetic datasets, generating billions of tokens for pretraining. HuggingFace subsequently developed Cosmopedia [2] to replicate Phi-1.5. Despite these efforts, existing synthetic datasets remain insufficiently refined for training state-of-the-art LLMs that can compete with leading closed-source/proprietary models.
Furthermore, generating high-quality pre-training synthetic datasets is resource-intensive and expensive. As a result, dataset creation has been limited to well-funded corporations and major research institutions, restricting broader participation in the AI research community.
Motivation for Large-Scale Synthetic Data
There is a need for publicly available, large-scale synthetic datasets that are rigorously curated. Such datasets can lower the barrier to entry for academic institutions, small research labs, and public organizations, enabling wider experimentation and innovation. Moreover, synthetic data can be tailored to cover critical educational and scientific domains, supporting specialized training aligned with real-world learning objectives.
Key Contributions and Objectives
To address these challenges, Tether Data, S.A. de C.V. (Tether Data, we, us, our) introduces QVAC Genesis I, a large-scale multi-domain educational synthetic dataset designed to support open, high-quality LLM pretraining. Our contributions include:
- Largest publicly available synthetic dataset. We generated 41 billion text tokens using a pipeline seeded by domain-labelled text from high-quality sources across critical educational domains, including mathematics, medicine, physics, and biology. This is the largest public pre-training synthetic dataset to date.
- Education domain-specific datasets. We created synthetic data covering critical educational topics: college-level general medicine, college-level professional medicine, college-level biology, college-level mathematics, college-level physics, high school-level biology, high school-level mathematics, high school-level physics, and conceptual physics.
- Validated top-quality dataset. Ablation studies on global benchmarks, such as MMLU, show that our datasets outperform existing synthetic datasets, achieving state-of-the-art accuracy across multiple educational topics.
- Open-source contribution. We are making QVAC Genesis I available to you under the CC-BY-NC 4.0 (Creative Commons Attribution–NonCommercial 4.0). By doing so, QVAC Genesis I democratizes access to high-quality pretraining data, enabling participation from public institutions, small research labs, and the academic community, fostering a more inclusive AI research ecosystem.
2. Methodology
Our methodology consists of a four-stage pipeline designed to generate high-quality synthetic educational content through systematic error analysis and correction. The approach leverages state-of-the-art language models to create domain-specific educational materials that address common misconceptions and learning gaps.
Learning From Failures Pipeline Diagram

Figure 1. Diagram of the synthetic data generation pipeline: seed data are passed through the Quality Filter, whose output feeds the Scaling QA phase, in which four questions (each with options and a target answer) are generated per seed. Each question moves on to phase two (Model Answering), where an LLM generates a proposed solution. Finally, only proposed solutions that differ from the target (Compare to Gold Label) proceed to the last phase (Failure Analysis), which produces an analysis of the incorrect answer and the correct solution in four different styles (educational textbook, web article, Q&A, and conversational dialogue).
2.1 Seed Data Acquisition
Web-Based Source Selection Criteria
- Seed corpus: We evaluated several open-source datasets—including DCLM, FineWeb-Edu, and others—but they offered limited control over domain coverage, which was critical for our goals. After further analysis, we chose FineFineWeb [3], built on FineWeb, a state-of-the-art open-source dataset that exposes 60+ curated categories. This let us target domain-specific slices—especially mathematics, physics, medicine, and biology—aligned with our objectives.
- Domain scope: From FineFineWeb, we extract STEM seeds exclusively in the following subdomains:
- Biology
- Medicine
- Physics
- Maths
Curation and Filtering of Seed Content
- Domain extraction: We subset FineFineWeb to the listed STEM subdomains (biology, medicine, physics, maths), which we then used to generate a total of 9 specific subdomains:
- College Biology, High School Biology
- College Medicine, Professional Medicine
- College Mathematics, High School Mathematics
- College Physics, High School Physics, Conceptual Physics
- Quality filtering: We score each document with the Ultra-FineWeb-classifier [4], a lightweight fastText model trained within Ultra-FineWeb’s verification-based filtering pipeline. During the construction of Ultra-FineWeb, the team applied this pipeline to FineWeb datasets. Documents that passed the pipeline’s verification checks became positive training examples; those filtered out served as negatives. The resulting classifier predicts the probability that a page is “high-quality” according to these verified labels and is optimized for throughput at web scale. In our setup, we run the classifier over our candidate subset and retain only seeds whose score exceeds the recommended high-quality threshold.
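As an illustration, the retain-above-threshold step can be sketched as follows. This is a minimal sketch: `score_fn` stands in for the Ultra-FineWeb fastText classifier, and the 0.5 threshold is a placeholder for the recommended high-quality cutoff, not the production value.

```python
def filter_seeds(docs, score_fn, threshold=0.5):
    """Keep only documents whose quality score exceeds the threshold.

    `score_fn` maps a document's text to a probability of being
    high-quality; in the real pipeline this is the Ultra-FineWeb
    fastText classifier, stubbed out here for illustration.
    """
    return [d for d in docs if score_fn(d["text"]) > threshold]

# Toy scorer standing in for the real classifier.
def toy_score(text):
    return 0.9 if "mitochondria" in text else 0.1

docs = [
    {"id": 1, "text": "The mitochondria is the powerhouse of the cell."},
    {"id": 2, "text": "Click here to subscribe!"},
]
kept = filter_seeds(docs, toy_score)
```

Because the classifier is a lightweight fastText model, this filter runs at web scale with negligible cost relative to the LLM stages that follow.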
2.2 Prompt Engineering
Prompt Design Strategies
Objective. From the seed pools, generate multiple-choice questions per domain/level (e.g., college_biology) and then use a small SOTA model to produce an answer. Only incorrect model answers are forwarded to the failure-analysis stage, where a final text is generated in four different styles (educational textbook, question-answer, web article, and conversational dialogue): the incorrect solution is first analysed and then the correct solution is given.
Scaling QA Methodology
Our approach focuses on systematically generating large synthetic question–answer (QA) data from unstructured scientific text. We begin with domain-specific seed passages drawn from medical, biological, physical, and mathematical sciences, ensuring broad conceptual coverage across diverse knowledge areas. Using a scaled prompting strategy, a large-capacity language model is instructed to generate multiple-choice QA pairs inspired by the topics of each seed passage. Each pair consists of a question, four options, and one correct answer.
The prompting process is dynamically adjusted to produce different levels of conceptual complexity, ranging from high-school fundamentals to college-level analytical reasoning. By modifying the prompt design, the same framework can be extended to generate domain-specific data of varying difficulty, enabling rapid expansion of high-quality training material for any scientific discipline. The resulting synthetic QA corpus is employed as annealing data, helping to refine the model during late-stage pretraining or fine-tuning for task alignment. These data are also employed to perform inference-time evaluation and failure analysis across existing language models. By analyzing the types of questions where models consistently underperform, such as reasoning-intensive, multi-concept, or numerically grounded items, we can systematically identify the weaknesses of each model.
This methodology demonstrates how scalable prompting can be leveraged to create domain-balanced, complexity-controlled synthetic QA data that supports both model assessment and future pretraining efforts. It bridges the gap between raw scientific text and structured evaluation resources, helping to reveal capability gaps in large language models across critical scientific domains. For detailed information about the prompt used see Appendix (Prompt Templates).
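The structure of a generated item (question, four options, one gold label) can be illustrated with a small parser. This is a sketch under an assumed `Question:/A)-D)/Answer:` output layout; the actual prompt and output formats are those in the Appendix (Prompt Templates).

```python
import re

def parse_mcq(raw):
    """Parse one generated MCQ of the assumed form:

        Question: ...
        A) ... through D) ..., one option per line
        Answer: <letter>

    Returns a dict holding the question, the four options, and the gold label.
    """
    question = re.search(r"Question:\s*(.+)", raw).group(1).strip()
    options = dict(re.findall(r"^([A-D])\)\s*(.+)$", raw, flags=re.M))
    answer = re.search(r"Answer:\s*([A-D])", raw).group(1)
    assert answer in options, "gold label must point at a real option"
    return {"question": question, "options": options, "answer": answer}

raw = """Question: Which organelle produces most of a cell's ATP?
A) Nucleus
B) Mitochondrion
C) Ribosome
D) Golgi apparatus
Answer: B"""
item = parse_mcq(raw)
```

Keeping the parsed record in this fixed shape is what makes items comparable across domains and difficulty levels downstream.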
Answer Generation and Extraction Methodology
Our answer generation and extraction approach focuses on systematically identifying and analyzing where state-of-the-art models fail, providing valuable insights into model limitations and creating targeted training data. We employ a sophisticated LLM-as-a-Judge framework to extract answers from model responses, enabling comprehensive analysis of model performance across different problem types and complexity levels.
Objective: The primary goal is to observe the output of state-of-the-art models and systematically extract question-answer pairs where they fail, creating a rich dataset of model weaknesses and misconceptions that can be used for targeted training and improvement.
Methodology: We use a three-stage process for answer generation and extraction:
- Model Response Generation: State-of-the-art models generate complete responses to evaluation questions across multiple domains and complexity levels
- Answer Extraction: A specialized LLM judge extracts the final answer from the model’s complete response using our sophisticated extraction framework
- Failure Identification: We systematically identify cases where model responses differ from ground truth, capturing various types of model failures
This methodology enables us to systematically capture model failures across different domains, creating a comprehensive dataset of model weaknesses that can be used for targeted training and improvement. For detailed information about response categories, extraction processes, and the complete LLM-as-a-Judge framework, see Section 4.2, and for evaluation prompt see Appendix (Prompt Templates).
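The failure-identification step above amounts to routing each judged record by comparing the extracted answer to the gold label. A minimal sketch (the field names are hypothetical; real records also carry domain/level metadata):

```python
def route_failures(records):
    """Split judged records into correct answers and failure cases.

    Each record is assumed to hold the problem, the model's full response,
    the extracted answer, and the gold label. Anything that does not match
    the gold label (including NO_ANSWER / MULTIPLE_ANSWERS extractions) is
    forwarded as a (problem, response, gold) tuple to failure analysis.
    """
    correct, failures = [], []
    for r in records:
        if r["extracted"] == r["gold"]:
            correct.append(r)
        else:
            failures.append((r["problem"], r["response"], r["gold"]))
    return correct, failures

records = [
    {"problem": "Q1", "response": "... ANSWER: A", "extracted": "A", "gold": "A"},
    {"problem": "Q2", "response": "... ANSWER: C", "extracted": "C", "gold": "B"},
    {"problem": "Q3", "response": "unsure",        "extracted": "NO_ANSWER", "gold": "D"},
]
correct, failures = route_failures(records)
```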
Failure Analysis Methodology
Our failure analysis approach focuses on creating high-quality educational content by systematically analyzing where state-of-the-art models fail and generating comprehensive explanations that not only provide correct answers but also analyze the reasoning behind model failures. This creates rich, pedagogically valuable content that addresses common misconceptions and learning gaps.
Objective: The primary goal is to create high-quality synthetic data in four different styles where not only the correct answer is provided, but also a thorough analysis of state-of-the-art model failures is included, creating comprehensive educational content that addresses misconceptions and learning gaps.
Methodology: We employ a systematic approach to failure analysis that generates synthetic educational content in four distinct styles:
- Educational Textbook Style: Formal, comprehensive explanations that provide both correct solutions and analysis of common errors
- Question-Answer Format: Structured Q&A content that addresses specific failure patterns and misconceptions
- Web Articles Style: Accessible, engaging content that explains complex concepts through failure analysis
- Conversational Dialogue Style: Natural tutoring sessions that guide learners through error analysis and correct reasoning
All four styles are generated from the MCQ, the model's wrong answer, and the correct label.
This methodology demonstrates how systematic failure analysis can be leveraged to create domain-balanced, pedagogically-rich synthetic data that supports both model assessment and educational content generation. For detailed information about the four-style content generation process and specific prompt templates, see Appendix (Prompt Templates).
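The four-style generation can be sketched as a template lookup keyed by style. The condensed templates below are hypothetical placeholders; the full prompts are in the Appendix (Prompt Templates).

```python
# Condensed, illustrative templates -- not the production prompts.
STYLE_TEMPLATES = {
    "textbook": "Write a textbook passage that analyses why '{wrong}' is an "
                "incorrect answer to: {question}, then derive the correct answer {gold}.",
    "qa": "Q: {question}\nA student answered '{wrong}'. Explain the error and "
          "give the correct answer {gold}.",
    "web_article": "Write an accessible article about: {question}. Address the "
                   "common mistake '{wrong}' and explain why {gold} is correct.",
    "dialogue": "Write a tutor-student dialogue in which the student first "
                "answers '{wrong}' to: {question}, and the tutor guides them to {gold}.",
}

def build_failure_prompt(style, question, wrong, gold):
    """Fill the chosen style template with the MCQ, the model's wrong
    answer, and the correct label -- the three inputs every style shares."""
    return STYLE_TEMPLATES[style].format(question=question, wrong=wrong, gold=gold)

prompt = build_failure_prompt(
    "dialogue",
    "What force keeps planets in orbit?",
    "magnetism",
    "gravity",
)
```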
Diversity and Coverage Optimization
Domain-Level Balance:
- Per-domain/level generation. For each of the domains/levels listed above, we generate items so that every domain/level is represented with equal weight, ensuring comprehensive coverage across all educational domains.
Error Distribution Strategy:
- Balanced error collection. Select only incorrect model answers for the next stage, ensuring that errors are gathered across all domains/levels rather than concentrating on a single area. This approach maximizes learning opportunities by addressing misconceptions across the entire educational spectrum.
Format Standardization:
- MCQ format consistency. Keep the same four-option structure and answer label format across domains/levels to maintain comparable items and ensure consistent evaluation metrics.
Quality Assurance Measures:
- Content validation: Automated checks for answer key consistency, option overlap, and length ratios
- Semantic deduplication: Removal of near-duplicate content to prevent overfitting
- Difficulty calibration: Balanced distribution of question complexity within each domain/level
- Expert review: Manual validation of edge cases and ambiguous content
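The automated content-validation checks can be sketched as follows. The checks mirror the list above (answer-key consistency, option overlap, length ratios), but the threshold value is illustrative, not the production setting.

```python
def validate_item(item, max_len_ratio=3.0):
    """Run automated content checks on one MCQ item.

    Returns a list of failed-check names (an empty list means the item
    passes). Thresholds here are illustrative placeholders.
    """
    problems = []
    opts = item["options"]
    # Answer-key consistency: the gold label must point at a real option.
    if item["answer"] not in opts:
        problems.append("answer_key")
    # Option overlap: duplicated option texts make the item ambiguous.
    if len(set(opts.values())) < len(opts):
        problems.append("option_overlap")
    # Length ratio: one option far longer than the rest can leak the answer.
    lengths = [len(v) for v in opts.values()]
    if max(lengths) > max_len_ratio * max(1, min(lengths)):
        problems.append("length_ratio")
    return problems

item = {
    "question": "2 + 2 = ?",
    "options": {"A": "3", "B": "4", "C": "4", "D": "5"},
    "answer": "E",
}
failed = validate_item(item)
```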
2.3 Synthetic Data Generation
Tooling. We orchestrate the end-to-end pipeline using distilabel [5] running against a vLLM inference server (vLLM Team, 2024).
Pipeline Orchestration:
- distilabel (orchestration & AI feedback). We employ distilabel (Argilla, 2024), a framework for synthetic data and AI feedback designed to build fast, reliable, and scalable pipelines. It models workflows as a DAG of steps (e.g., generate → judge → filter), comes with ready-made tasks for common patterns like LLM-as-a-judge, and integrates smoothly with Argilla for storing datasets and optional human-in-the-loop review. In practice, we use distilabel to define prompt templates, spawn “generator” models, attach “judge” steps to rate outputs (helpfulness, correctness, etc.), and write back structured records plus scores for downstream filtering and evaluation.
- vLLM (serving). We host the LLMs behind vLLM (vLLM Team, 2024), benefiting from its standard-compatible API, streaming responses, continuous batching, and PagedAttention for high-throughput, memory-efficient inference. This lets us scale generation and judging steps without changing our pipeline code.
- Integration. distilabel sends generation and evaluation requests to vLLM; results flow back into the DAG where we apply rubric-based filters and retain only examples that meet target quality thresholds. The same setup lets us reuse judge steps to clean or re-rank data created in earlier runs, keeping the pipeline reproducible and easy to iterate.
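The generate → judge → filter pattern that distilabel expresses as a DAG can be sketched in plain Python. The stub functions below stand in for the vLLM-backed generator and judge models; this shows only the control flow, not the distilabel API.

```python
def run_pipeline(seeds, generate, judge, threshold=0.7):
    """Minimal generate -> judge -> filter flow, mirroring the DAG of
    steps that distilabel orchestrates against a vLLM server.

    `generate` and `judge` stand in for LLM calls; here they are plain
    functions so the flow itself is visible.
    """
    records = []
    for seed in seeds:
        output = generate(seed)           # generator step
        score = judge(output)             # LLM-as-a-judge step
        if score >= threshold:            # rubric-based filter step
            records.append({"seed": seed, "output": output, "score": score})
    return records

# Stub model calls for illustration.
generate = lambda s: f"MCQ derived from: {s}"
judge = lambda out: 0.9 if "photosynthesis" in out else 0.2

records = run_pipeline(["photosynthesis basics", "celebrity gossip"], generate, judge)
```

In the real setup, each step is a distilabel task whose requests are batched by vLLM's continuous batching, so scaling up only changes serving capacity, not pipeline code.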
Model Architecture. We used the following open-source models in the various stages:
- Generation model: QwQ-32B [6] for question generation and failure analysis
- Answer stage model: Qwen3-1.7B-Base [7] for model answering
- Failure-analysis stage: QwQ-32B for generating educational content from incorrect responses
Flow
- Seed → Item generation (QwQ-32B). For each domain/level (e.g., college_biology), using distilabel + vLLM, QwQ-32B generates the question, options A–D, and the gold label from the seed.
- Answer stage (Qwen3-1.7B-Base). The MCQ is posed to Qwen3-1.7B-Base with the fixed template shown above.
- Answer extraction and error routing (QwQ-32B). We use a sophisticated LLM-as-a-Judge framework for answer extraction that can handle various response patterns and edge cases. This approach represents a significant advancement over traditional log-likelihood-based evaluation methods. If the extracted answer ≠ gold label, we pass (problem, model response, correct label) to failure analysis.
- Failure analysis (QwQ-32B). We prompt QwQ-32B to generate the analysis in one of the four styles using the problem, proposed solution, and correct answer. For detailed information about the four-style content generation process and specific prompt templates, see Appendix (Prompt Templates) and Section 4.2.
3. Pre-training Setup
3.1 Model Architecture and Parameters
We pre-train a 1.7B-parameter transformer (Qwen3 family) initialized from scratch with BF16 mixed precision and a context length of 4,096. Tokenization uses the Qwen3 tokenizer; data are stored in HuggingFace Datasets (Arrow). The corpus totals 41B tokens (multi-domain) and is traversed for 1 epoch via a PyTorch DataLoader. To aid stability and throughput, we enable activation checkpointing, fused kernels where available (fused attention/optimizer), FlashAttention2 on H100, and torch.compile (safe mode) once the run is stable.
Optimization follows AdamW (weight decay 0.01), learning rate 2e-4, warmup 600 steps, gradient clipping 1.0, and seed 42. Per-GPU micro-batch is 4 with gradient accumulation 8 across 480 GPUs, yielding an effective global batch of 4×8×480 = 15,360 samples/step. We log train metrics every 50 steps, validate every 500 steps (20 eval iters), checkpoint every 1,000 steps, and support resume with exact optimizer/state restoration. We achieved a total training throughput of 1.5 seconds per step. We note common failure modes and mitigations: BF16 overflow (addressed via dynamic loss scaling), NCCL stalls (timeouts and interface pinning), and memory fragmentation (CUDA max_split_size_mb=512, expandable segments, GC threshold 0.8).
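The effective batch arithmetic works out as follows (the tokens/step figure assumes every sample is packed to the full 4,096-token context):

```python
micro_batch = 4    # samples per GPU per forward pass
grad_accum = 8     # gradient accumulation steps
num_gpus = 480     # 60 nodes x 8 H100s

# Samples consumed per optimizer step across the whole cluster.
effective_batch = micro_batch * grad_accum * num_gpus

# Tokens per optimizer step, assuming fully packed 4,096-token sequences.
context_len = 4096
tokens_per_step = effective_batch * context_len
```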
The resulting pre-trained model is publicly released and available at https://huggingface.co/qvac/genesisI-model
3.2 Multi‑node GPU Setup
We ran multiple training runs on 60 nodes with 8× NVIDIA H100 80GB per node (480 GPUs total), 8 CPUs per task, ~800 GB RAM per node, a Slurm priority partition, exclusive allocation, and a 72-hour time limit. We launch with srun using PyTorch DDP (world size 480), auto-detect the master from Slurm, and bind ranks to GPUs via Slurm's environment. Stdout/stderr are streamed to logs_training/qvac_60node_training_%j.{out,err}; checkpoints are sharded and saved periodically for robust resume.
Networking is NCCL over InfiniBand with UCX transports. We use InfiniBand and set NCCL_IB_DISABLE=0, NCCL_IB_HCA="mlx5", NCCL_SOCKET_IFNAME, and NCCL_BLOCKING_WAIT=1 with a 720-second watchdog to fail fast on fabric issues. UCX is configured for multi-device transport; we also pin file-system threads and enable asynchronous I/O prefetch to keep GPUs fed.
Reliability & observability: W&B captures metrics, system traces, and artifacts; we additionally export structured logs (throughput, TFLOPs/GPU, GPU/host memory, step time). For reproducibility, we fix seeds, log exact launch scripts and environment, and report effective tokens/step and utilization.
4. Evaluation and Results
4.1 Dataset Statistics
Volume, Diversity, and Domain Coverage:
| Domain | Number of Samples | No of Tokens (in B) |
|---|---|---|
| High school biology | 3,818,070 | 4.511 |
| College biology | 3,286,648 | 3.927 |
| Professional medicine | 1,552,474 | 1.884 |
| College medicine | 5,164,247 | 6.218 |
| High school mathematics | 3,244,240 | 4.277 |
| College mathematics | 5,895,052 | 8.243 |
| High school physics | 2,277,880 | 3.061 |
| College physics | 4,281,062 | 5.814 |
| Conceptual physics | 2,354,184 | 2.973 |
| Total | 31,873,857 | 40.906 |
4.2 LLM-as-a-Judge Evaluation

Figure 2. Histogram of the results obtained with the LLM-as-a-Judge method using the OpenCompass framework, with the educational domains of the MMLU dataset on the x-axis and the score on the y-axis. QVAC Genesis I performs better on average than the current largest synthetic dataset, Cosmopedia, and in every individual topic/level domain except college physics.

Figure 3. Different representation of results obtained using LLM as a judge via the OpenCompass framework.
Methodology and Framework
We developed a robust and stable evaluation framework using OpenCompass [8] that leverages LLM-as-a-Judge methodology to extract answers from model outputs. This approach represents a significant advancement over traditional log-likelihood-based evaluation methods commonly used in benchmarking.
Traditional Log-Likelihood Limitations:
- Relies on next-token probability prediction, which may not capture the model’s true reasoning capabilities
- Models may require multiple tokens to arrive at the correct answer or may self-correct during generation
- Cannot handle cases where the model fails to provide a clear answer or provides multiple conflicting responses
- Does not account for the model’s ability to reason through complex problems step-by-step
Our LLM-as-a-Judge Approach: Our evaluation framework addresses these limitations by implementing a three-stage process:
- Response Generation: The model generates a complete response to the evaluation question
- Answer Extraction: A specialized LLM judge extracts the final answer from the model’s complete response
- Exact Matching: The extracted answer is compared against the ground truth using exact string matching
This methodology provides several advantages:
- Captures the model’s complete reasoning process rather than just next-token predictions
- Handles cases where models self-correct or require multiple reasoning steps
- Provides clear evaluation of cases where models cannot provide definitive answers
- Enables more nuanced assessment of model capabilities across different problem types
LLM as a Judge Evaluation Pipeline Diagram

Figure 4. Diagram of our evaluation pipeline. Stage 1: The model to be evaluated generates a complete response to the evaluation question. Stage 2: A specialized LLM judge extracts the final answer from the model’s complete response. Stage 3: The extracted answer is compared against the ground truth using exact string matching. In the end, for each output we will have: Correct, Incorrect, Multiple Answer or No Answer.
Answer Extraction Framework
We implemented a sophisticated answer extraction system that can handle various response patterns and edge cases:
Response Categories:
- Valid Answer: Single, clear choice (A, B, C, or D)
- MULTIPLE_ANSWERS: When the model provides conflicting or multiple different answers
- NO_ANSWER: When no clear answer can be identified in the response
Extraction Process: The LLM judge analyzes the complete model response to identify:
- Explicit answer statements (e.g., “ANSWER: A”, “The answer is B”)
- Boxed format answers (e.g., \boxed{A})
- Standalone letter choices in conclusions
- Self-corrections and final settled answers
- Generated questions vs. original question answers
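A lightweight, regex-only approximation of these extraction patterns is sketched below. The actual system uses an LLM judge precisely because regexes cannot reliably resolve self-corrections or distinguish generated questions from the original one; this sketch covers only the explicit, boxed, and trailing-letter formats.

```python
import re

def extract_answer(response):
    """Heuristic approximation of the extraction patterns: explicit answer
    statements, \\boxed{} answers, and a trailing standalone letter.
    Returns a letter, "MULTIPLE_ANSWERS", or "NO_ANSWER"."""
    patterns = [
        r"ANSWER:\s*([A-D])\b",              # explicit statement
        r"[Tt]he answer is\s*\(?([A-D])\)?", # prose statement
        r"\\boxed\{([A-D])\}",               # boxed format
    ]
    found = []
    for p in patterns:
        found += re.findall(p, response)
    if not found:
        # Fall back to a standalone letter at the very end of the response.
        m = re.search(r"\b([A-D])\b\s*$", response.strip())
        if m:
            found.append(m.group(1))
    if not found:
        return "NO_ANSWER"
    if len(set(found)) > 1:
        return "MULTIPLE_ANSWERS"
    return found[0]
```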
Evaluation Prompt Template
Our evaluation system uses a carefully designed prompt template that ensures consistent and reliable answer extraction. For detailed information about the complete prompt template, see Prompt Templates.
Scoring Criteria and Metrics
Primary Metrics:
- Accuracy: Percentage of correctly answered questions (exact match between extracted answer and ground truth)
- No Answer Rate: Percentage of responses classified as NO_ANSWER
- Multiple Answer Rate: Percentage of responses classified as MULTIPLE_ANSWERS
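Given judged records, the three primary metrics reduce to simple counts. A minimal sketch (field names are illustrative):

```python
def compute_metrics(results):
    """Compute the primary metrics from a list of judged results.

    Each result holds the extracted answer (a letter, "NO_ANSWER", or
    "MULTIPLE_ANSWERS") and the gold label; rates are fractions in [0, 1].
    """
    n = len(results)
    correct = sum(r["extracted"] == r["gold"] for r in results)
    no_answer = sum(r["extracted"] == "NO_ANSWER" for r in results)
    multiple = sum(r["extracted"] == "MULTIPLE_ANSWERS" for r in results)
    return {
        "accuracy": correct / n,
        "no_answer_rate": no_answer / n,
        "multiple_answer_rate": multiple / n,
    }

results = [
    {"extracted": "A", "gold": "A"},
    {"extracted": "C", "gold": "B"},
    {"extracted": "NO_ANSWER", "gold": "D"},
    {"extracted": "MULTIPLE_ANSWERS", "gold": "A"},
]
metrics = compute_metrics(results)
```

Note that accuracy uses exact matching against the gold label, so NO_ANSWER and MULTIPLE_ANSWERS responses always count as incorrect.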
Quality Assurance:
- Inter-annotator agreement on answer extraction
- Manual validation of edge cases
- Consistency checks across different model outputs
- Robustness testing with various response formats
Advantages Over Traditional Methods
- Comprehensive Evaluation: Captures the full reasoning process rather than just next-token predictions
- Handles Edge Cases: Properly categorizes ambiguous, multiple, or missing answers
- Real-world Applicability: Reflects how models actually perform in practical scenarios
- Fair Comparison: Provides consistent evaluation across different model architectures and training approaches
- Interpretability: Clear categorization of model response types enables better understanding of model capabilities and limitations
This evaluation framework provides a more accurate and comprehensive assessment of model performance, particularly for complex reasoning tasks where traditional log-likelihood methods may not capture the full extent of model capabilities.
4.3 Next-Token Prediction Performance
- Benchmark Tasks and Datasets
- Accuracy and Generalization Analysis

Figure 5. Histogram of the results obtained using the log-likelihood method and the LM-Harness framework. Here too, QVAC Genesis I performs better on average than the current largest synthetic dataset, Cosmopedia, and in every individual topic/level domain except college physics.

Figure 6. Alternative representation of the results obtained using the log-likelihood method and the LM-Harness framework.
5. Conclusion
- Summary of Findings: We built the largest public synthetic dataset to date, comprising 41 billion tokens across nine critical educational topics, and we intend to generate and publish more tokens to extend coverage to all remaining domains. We achieved superior performance (i.e., accuracy, quality) compared to Cosmopedia v2, the previous state-of-the-art synthetic dataset, as demonstrated on the MMLU benchmark across the selected critical topics.
- Implications for Future Pre-training: The public, researchers, academics, research institutions, practitioners, and the wider AI community can use the dataset to build SOTA base models. A stronger base model, in turn, provides a better foundation for post-training.
- Limitations and Next Steps: We currently focus on key critical educational content in medicine, mathematics, physics, and biology. We plan to generate synthetic data covering all other STEM domains in FineFineWeb.
The full version of this article, including all associated assets and the appendix, can be found in our HuggingFace blog post.