This is how you get to better foundation models.
Disclosing the full list of datasets used to train IBM's Granite 3.0 LLMs. This is true transparency - no other LLM provider shares such detailed information about its training datasets.

WEB DATA
- FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl (see the loading sketch after this list).
- Webhose: Unstructured web content in English converted into machine-readable data.
- DCLM-Baseline: A 4T-token / 3B-document pretraining dataset that achieves strong performance on language model benchmarks.

CODE
- Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderData.
- FineWeb-Code: Programming/coding-related documents filtered from the FineWeb dataset using annotation.
- CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

DOMAIN
- USPTO: Collection of US patents granted from 1975 to 2023.
- Free Law: Public-domain legal opinions from US federal and state courts.
- PubMed Central: Biomedical and life sciences papers.
- EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

MULTILINGUAL
- Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
- Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
- MADLAD-12: Document-level multilingual dataset covering 12 languages.

INSTRUCTIONS
- Code Instructions Alpaca: Instruction-response pairs about code generation problems.
- Glaive Function Calling: Dataset focused on function calling in real scenarios.

ACADEMIC
- peS2o: A collection of 40M open-access academic papers for pre-training.
- arXiv: Scientific paper pre-prints posted to arXiv. Full author acknowledgement can be found here.
- IEEE: Technical content from IEEE acquired by IBM.

TECHNICAL
- Wikipedia: Technical articles sourced from Wikipedia.
- Library of Congress Public Domain Books: More than 140,000 public-domain English books.
- Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
- Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

MATH
- OpenWebMath: Mathematical text from the internet, filtered from more than 200B HTML files.
- Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
- Stack Exchange: User-contributed content from the Stack Exchange network.
- MetaMathQA: Dataset of rewritten mathematical questions.
- StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
- MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
- TemplateGSM: Collection of over 7 million grade-school math problems with code and natural-language solutions.

BOOM!
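For readers who want to inspect one of these corpora themselves, here is a minimal Python sketch of streaming a few documents from FineWeb. It is not from the post: it assumes the Hugging Face `datasets` library and the public "HuggingFaceFW/fineweb" dataset id, and the other corpora listed above may live under different ids, configs, or licenses.

```python
# Minimal sketch: sample a few documents from FineWeb without downloading
# the full multi-terabyte corpus. Assumes the Hugging Face `datasets`
# library and the public "HuggingFaceFW/fineweb" dataset id.
from datasets import load_dataset

# streaming=True yields documents lazily instead of materializing the dataset
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, doc in enumerate(fineweb):
    print(doc["text"][:200])  # first 200 characters of each document
    if i == 4:                # stop after five sample documents
        break
```

Streaming is the practical way to eyeball web-scale pretraining data like this; the same pattern applies to other publicly hosted datasets in the list, swapping in the appropriate dataset id.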