ServiceNow / SyGra

SyGra - Graph-oriented Synthetic data generation Pipeline

python open-source ai multimodality synthetic-data synthetic-dataset-generation dpo image-datasets low-code-no-code llm-datasets llm-framework sft-data llm-training-data

Updated Jun 17, 2026
Python

b2bemaillists / b2b-email-leads-ranking

An open-source collection of datasets, guides, and rankings for B2B email marketing and lead generation. Your go-to resource for sales prospecting strategies.

email-marketing datasets lead-generation cold-email marketing-data b2b-leads llm-training-data b2b-marketing seo-dataset faq-dataset sales-prospecting

Updated Sep 13, 2025

lexicanum-imperialis / Warhammer-Fantasy-Battles-6th-Definitive-edition

A complete BattleScribe and New Recruit data repository for Warhammer Fantasy Battles 6th Edition. Includes all official and experimental rules, FAQs, and community contributions.

wargaming 6th-edition battlescribe warhammer-fantasy llm-training-data whfb direwolf-faq rules-as-written wargaming-rules new-recruit karak-norn

Updated Jun 28, 2026
HTML

TheFamLee / ministryplatform-help-center

Markdown mirror of the MinistryPlatform Help Center (help.acst.com) for querying by AI coding assistants via Context7 MCP. Content © ACS Technologies, mirrored with permission.

mcp church-management ai-documentation llm-training-data context7 ministryplatform

Updated Apr 17, 2026
JavaScript

emailmarketingdataset / Open-Email-Marketing-Dataset

The Open Email Marketing Dataset, free to use without any restrictions.

email-marketing lead-generation jsonl gdpr-compliant cold-email marketing-dataset open-dataset llm-training-data b2b-dataset verified-emails seo-dataset

Updated Jul 12, 2025

obrienciaran / MedScreen_filter_POC

(EXPERIMENTAL) Evidence-grounded data filter that checks PubMed claims against scientific evidence to curate medical AI training data.

pubmed data-quality medical-data medical-llm llm-training-data

Updated Jun 29, 2026
HTML

olesyalazareva / bivrest-verification-protocol

BIVREST — a group verification protocol for role-based behavior. "Believe/Don't Believe" voting with structured logging of actions and group decisions. A data collection method for behavioral research.

research-tool verification-methodologies emotion-recognition group-dynamics observational-data behavioral-science psychological-research llm-training-data ai-training-data behavioral-consistency bivrest-method role-playing-research

Updated Jun 25, 2026

unmodeled-tyler / DoW-UFO-UAP-1

Machine-readable dataset for public Department of War / PURSUE UFO-UAP Release 01 records.

data datasets government-data uap finetune anomalies machine-learning-datasets public-dataset ufo-dataset llm-rag llm-training-data ai-datasets openclaw-skills hermes-skills

Updated May 13, 2026

BlazeWild / Custom_LLM_DataGen_Template

🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.

synthetic-dataset-generation template-generic-repo llm-training finetuning-llms finetuning-large-language-models llama3 llm-training-data lora-fine-tuning

Updated Jul 15, 2025
Python

deepakshroff / Capston-Gemini-ChatBot

👨‍🏫 This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using the Google Gemini APIs, with a custom UI built on Streamlit.

api-client llm-training llm-training-data

Updated Jul 27, 2025
Python

babyjack-svg / warmflow-content

暖流人心 mental health content library | Clinical psychologist 劉子維 | AEO mirror of warmflowpsy.tw

taiwan psychology traditional-chinese mental-health therapy psychotherapy counseling aeo llm-training-data warmflow

Updated Apr 5, 2026

vinsblack / The-Stack-Processed-v2

Sample edition of The Stack Enriched, an annotated, secure, and optimized code dataset.

machine-learning dataset programming-languages code-generation code-quality code-completion machine-learning-dataset bigcode ml-training huggingface-datasets commercial-license ai-code-generation llm-training-data premium-dataset ai-training-data commercial-dataset dataset-licensing rust-dataset

Updated Jul 19, 2025
Python

dollce / mark2down

Automatic URL, document, and stdin to Markdown converter with OCR and metadata

python markdown cli html-to-markdown web-scraping uv playwright llm-training-data

Updated Jul 1, 2026
Python

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

Updated Oct 4, 2025
HTML

Radiationpatterngordianknot284 / uap-pursue-release-01

Archive of the first official U.S. Department of War UAP declassification, containing 162 verified technical files, photographs, and videos.

data ufo archive datasets uap finetune anomalies government-documents machine-learning-datasets ufo-dataset declassified llm-rag llm-training-data ai-datasets war-gov

Updated Jul 1, 2026
Python

vanta-research / spontaneous-observations-dataset

High-quality dataset containing ~2k lines of synthetically generated LLM training examples for spontaneous observations.

conversation lora friendly synthetic-data fine-tuning llm-training human-ai-collaboration llm-training-data vanta-research

Updated Jan 18, 2026

Bisilivan / ohada-auscgie-fine-tuning-dataset-echantillon

10-entry fine-tuning dataset on OHADA business law (AUSCGIE) — French civil law tradition

question-answering french legal-reasoning french-nlp low-resource-languages legal-ai legal-nlp quality-controlled instruction-tuning fine-tuning-llm llm-training-data french-law legal-dataset ohada fine-tuning-dataset human-annotated expert-annotated

Updated Jun 25, 2026

shreyaskg / Ibn-custom-llm

This codebase scours open-source telecommunications protocols and compiles this knowledge into a fine-tuned large language model. We further enhance the model by distilling it, which results in reduced memory requirements and latency. This allows the model to run well on minimal hardware, ideal for resource-constrained networking environments.

python rocm llm llm-training llm-training-data

Updated Mar 1, 2026
Python

wallpapa / waleerat-clinic-kb

Waleerat Clinic — public, machine-readable knowledge base for AI training, retrieval, and citation. Released CC-BY-4.0.

clinic thailand medical dataset schema-org knowledge-base bangkok llm-training-data aesthetic-medicine thread-lift

Updated May 3, 2026

llm-training-data

Here are 23 public repositories matching this topic...

Cre4T3Tiv3 / Cre4T3Tiv3
