From 10,000 to 50,000 Strains: The Growth Story Behind Our Database
The journey from our initial 10,000 strains to 50,874: milestones, AI discovery, data quality improvements, and the road to 100,000.
When we launched the Strain Database, our collection numbered roughly 10,000 cannabis strains. It was a respectable start, more comprehensive than most directories available at the time, but we knew it was only a fraction of what existed in the global cannabis gene pool. Today, we catalogue 50,874 strains from 1,821 breeders, linked with 89,350 effect relationships and 58,788 flavor connections. This is the story of how we got here, and where we are headed next.
The Foundation: Manual Curation (0–10,000)
Every database has to start somewhere. Our initial dataset was assembled through painstaking manual research: combing through seedbank catalogues, breeder websites, cannabis cup records, and community forums. Each strain was individually verified: its name, breeder attribution, genetics (indica, sativa, hybrid), parent lineage, and basic characteristics.
This phase taught us critical lessons about data quality. We discovered early on that the cannabis world is plagued by naming inconsistencies. The same strain can appear under slightly different names across different sources: "Girl Scout Cookies" versus "GSC," "OG Kush" versus "Original Gangster Kush." Duplicate detection and name normalization became one of our earliest engineering challenges, and one that continues to evolve.
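To make the idea concrete, here is a minimal sketch of name normalization and duplicate grouping. It is an illustration under stated assumptions, not our production logic: the `ALIASES` table, the function names, and the specific normalization steps (accent folding, case folding, punctuation stripping) are hypothetical.

```python
import re
import unicodedata

# Hypothetical alias table: maps a normalized variant to its canonical key.
ALIASES = {
    "gsc": "girl scout cookies",
    "original gangster kush": "og kush",
}


def normalize_name(raw: str) -> str:
    """Return a lookup key for a strain name (illustrative rules only)."""
    # Fold accents and case so visually similar spellings compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.casefold()
    # Replace runs of punctuation/whitespace with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    # Map known aliases onto one canonical key.
    return ALIASES.get(text, text)


def find_duplicates(names):
    """Group raw names that collapse to the same normalized key."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

With these assumed rules, `find_duplicates(["Girl Scout Cookies", "GSC", "Blue Dream"])` groups the first two names under one key. A real system would likely add fuzzy matching on top, since an alias table only catches variants someone has already recorded.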
By the time we reached 10,000 strains, we had established the core data model that still powers the platform: a normalized relational schema with strains, breeders, effects, and flavors as first-class entities, connected through junction tables that capture the many-to-many relationships inherent in cannabis genetics.
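A schema of this shape can be sketched with SQLite as follows. The table and column names are assumptions chosen for illustration, and the demo rows are invented placeholders; only the overall structure (four entity tables plus junction tables) comes from the description above.

```python
import sqlite3

# Hypothetical table and column names; only the overall shape is from the article.
SCHEMA = """
CREATE TABLE breeders (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE strains (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT,                                  -- indica, sativa, or hybrid
    breeder_id INTEGER REFERENCES breeders(id),
    UNIQUE (name, breeder_id)
);
CREATE TABLE effects (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE flavors (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction tables: one row per strain-to-effect or strain-to-flavor link.
CREATE TABLE strain_effects (
    strain_id INTEGER NOT NULL REFERENCES strains(id),
    effect_id INTEGER NOT NULL REFERENCES effects(id),
    PRIMARY KEY (strain_id, effect_id)
);
CREATE TABLE strain_flavors (
    strain_id INTEGER NOT NULL REFERENCES strains(id),
    flavor_id INTEGER NOT NULL REFERENCES flavors(id),
    PRIMARY KEY (strain_id, flavor_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one placeholder strain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO breeders (id, name) VALUES (1, 'Example Breeder')")
    conn.execute(
        "INSERT INTO strains (id, name, type, breeder_id) "
        "VALUES (1, 'Demo Strain', 'hybrid', 1)"
    )
    conn.execute("INSERT INTO effects (id, name) VALUES (1, 'relaxed'), (2, 'happy')")
    conn.execute("INSERT INTO strain_effects VALUES (1, 1), (1, 2)")
    return conn


def effects_for(conn, strain_name):
    """Return the effect names linked to a strain, alphabetically."""
    rows = conn.execute(
        """
        SELECT e.name
        FROM effects e
        JOIN strain_effects se ON se.effect_id = e.id
        JOIN strains s ON s.id = se.strain_id
        WHERE s.name = ?
        ORDER BY e.name
        """,
        (strain_name,),
    )
    return [row[0] for row in rows]
```

The composite primary key on each junction table prevents the same effect or flavor from being linked to a strain twice, which matters when several pipelines write links concurrently.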
The Scaling Challenge: Semi-Automated Collection (10,000–25,000)
Manual curation does not scale. To grow beyond 10,000 strains, we developed automated scraping pipelines that could extract strain information from dozens of online sources simultaneously. These early scrapers were relatively simple: HTML parsers that targeted specific elements on seedbank product pages.
The challenge was not just collecting data, but maintaining quality. Raw scraped data is messy: inconsistent formatting, missing fields, contradictory information between sources. We built a multi-source verification system that cross-references data points across at least two independent sources before accepting them. If Seedfinder reports a strain as 70% indica but a breeder's website says 60/40, we flag the discrepancy for human review rather than blindly accepting either value.
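The verification rule described above might look roughly like this. It is a simplified sketch: the `Observation` type, the status labels, and the choice to send any disagreement to review (even when a majority of sources agree) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    source: str       # hypothetical source identifier, e.g. "seedfinder"
    indica_pct: int   # reported indica percentage


def verify_indica_pct(observations):
    """Accept a value only when two or more independent sources agree.

    Returns a (status, value) pair where status is one of the assumed
    labels "pending", "accepted", or "needs_review".
    """
    # Keep one reading per source so a repeated source does not count twice.
    by_source = {obs.source: obs.indica_pct for obs in observations}
    if len(by_source) < 2:
        return ("pending", None)          # not enough independent sources yet
    values = set(by_source.values())
    if len(values) == 1:
        return ("accepted", values.pop())  # all sources agree
    return ("needs_review", None)          # discrepancy: leave it to a human
```

For the example in the text, a 70% reading from one source and a 60% reading from another yields `("needs_review", None)`, so neither value is stored automatically.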
During this phase, we also built our enrichment pipeline. A strain entry with only a name and breeder is minimally useful. Users need effects, flavors, THC/CBD ranges, flowering times, and descriptions. Our enrichment system queues incomplete strains and systematically fills in missing data through targeted web research and AI-assisted extraction.
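One way to picture the queueing step is a priority queue keyed on how many fields a strain is missing. The field names and the "most incomplete first" ordering below are assumptions for illustration; the text above only says that incomplete strains are queued.

```python
import heapq

# Hypothetical field names for the data users need.
REQUIRED_FIELDS = (
    "effects", "flavors", "thc_range", "cbd_range", "flowering_time", "description",
)


def missing_fields(strain):
    """List the required fields that are absent or empty on a strain record."""
    return [field for field in REQUIRED_FIELDS if not strain.get(field)]


def build_queue(strains):
    """Build a heap of incomplete strains, most incomplete first (assumed policy)."""
    heap = []
    for index, strain in enumerate(strains):
        gaps = missing_fields(strain)
        if gaps:
            # Negative gap count makes the min-heap pop the emptiest record first;
            # the index breaks ties so later tuple items are never compared.
            heapq.heappush(heap, (-len(gaps), index, strain["name"], gaps))
    return heap


def next_job(heap):
    """Pop the next strain to enrich and the fields it still needs."""
    _, _, name, gaps = heapq.heappop(heap)
    return name, gaps
```

Complete records never enter the queue, so the heap length doubles as a rough backlog counter.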
The Quality Score System
To track enrichment progress quantitatively, we introduced a data quality score ranging from 0 to 100 for every strain. The score weights several factors: completeness of basic fields (name, type, breeder), presence of a description, number of linked effects and flavors, availability of cannabinoid data, and source diversity. Today, our average quality score stands at 50.4 across all strains, with 4,058 strains achieving "excellent" status (80+) and 6,625 rated "good" (60–79).
The AI Revolution: Discovery Agent (25,000–50,000)
The leap from 25,000 to 50,000 strains was powered by our AI Discovery System, an autonomous agent built on the Ollama framework using the qwen3:8b language model. The Discovery Agent operates as a sophisticated web researcher: it crawls 78+ sources through a Tor proxy network, extracts strain names and details using natural language processing, and performs triple-check deduplication against the production database before staging discoveries for human review.
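A weighted score over those factors could be sketched like this. The weights, the caps on effects, flavors, and sources, and the field names are all hypothetical; only the list of factors, the 0 to 100 range, and the 80+ and 60 to 79 thresholds come from the text.

```python
def quality_score(strain):
    """Compute an illustrative 0-100 quality score with assumed weights."""
    score = 0.0
    # Basic fields: up to 30 points, split evenly across name, type, breeder.
    basic = ("name", "type", "breeder")
    score += 30 * sum(1 for field in basic if strain.get(field)) / len(basic)
    # Description present: 20 points.
    score += 20 if strain.get("description") else 0
    # Linked effects and flavors: 15 points each, capped at 5 and 3 links.
    score += 15 * min(len(strain.get("effects", [])), 5) / 5
    score += 15 * min(len(strain.get("flavors", [])), 3) / 3
    # Any cannabinoid data (THC or CBD): 10 points.
    score += 10 if strain.get("thc") is not None or strain.get("cbd") is not None else 0
    # Source diversity: 10 points, capped at two distinct sources.
    score += 10 * min(strain.get("source_count", 0), 2) / 2
    return round(score, 1)


def tier(score):
    """Map a score to a tier; 'other' is a placeholder label for scores under 60."""
    if score >= 80:
        return "excellent"
    if score >= 60:
        return "good"
    return "other"
```

Under these assumed weights, a record with only a name scores 10, while one with all basic fields, a description, and two linked effects scores 56 and stays below the "good" tier.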
The system's architecture reflects our commitment to data quality. Discovered strains are never inserted directly into the production database. Instead, they land in a staging table where our team can review, edit, and verify each entry before approving it. This human-in-the-loop approach ensures that AI speed does not come at the cost of data integrity.
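The staging pattern can be illustrated with a small SQLite sketch. Table names, column names, and status values here are assumptions; the point is only that a discovery reaches the production table through an explicit approval step, never directly.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a production and a staging table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE strains (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE staged_strains (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            source_url TEXT,
            status TEXT NOT NULL DEFAULT 'pending'   -- hypothetical status values
        );
        """
    )
    return conn


def stage(conn, name, source_url):
    """Record a discovery in staging only; production is untouched."""
    cur = conn.execute(
        "INSERT INTO staged_strains (name, source_url) VALUES (?, ?)",
        (name, source_url),
    )
    return cur.lastrowid


def approve(conn, staged_id):
    """Copy a pending staged strain into production after human review."""
    with conn:  # commit both statements together, or roll back on error
        row = conn.execute(
            "SELECT name FROM staged_strains WHERE id = ? AND status = 'pending'",
            (staged_id,),
        ).fetchone()
        if row is None:
            return False  # unknown id, or already reviewed
        conn.execute("INSERT INTO strains (name) VALUES (?)", (row[0],))
        conn.execute(
            "UPDATE staged_strains SET status = 'approved' WHERE id = ?",
            (staged_id,),
        )
    return True
```

Keeping the approved row in staging with an updated status, rather than deleting it, preserves an audit trail of where each production entry came from.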
Source diversity has been another key factor. Our Discovery Agent draws from seedbank catalogues, breeder websites, cannabis databases, forums, news outlets, and cannabis cup winner lists. Each source category tends to surface different types of strains: legacy genetics from forums, new releases from breeders, trending varieties from dispensary aggregators. This breadth ensures our database is not biased toward any single segment of the market.
Data Quality: The Enrichment Pipeline
Raw strain records are just the beginning. Our AI-powered enrichment pipeline processes strains to add detailed descriptions, link relevant effects from our 240-effect taxonomy, associate flavor profiles from 405 defined flavors, estimate THC/CBD percentages from multiple sources, and calculate data quality scores. The pipeline uses multiple AI models for extraction and validation, with the qwen3:8b model serving as our primary extraction engine.
Enrichment is not a one-time process. As new sources become available and our AI models improve, previously enriched strains are re-evaluated and updated. We currently have 27,948 strains (55% of the database) at quality scores of 50 or above, and this number grows daily as the enrichment pipeline continues its work.
The Human Element
Despite the AI-heavy infrastructure, human curation remains essential. Our team reviews every staged discovery before it enters production. Community submissions from users who know strains firsthand add context that no web scraper can capture: personal growing notes, regional naming variations, and historical context about a strain's development.
User reviews contribute another dimension of quality. When a user reports that a strain's listed effects do not match their experience, that signal feeds into our quality assessment pipeline. Over time, community feedback has helped us correct hundreds of data points and flag questionable entries for re-verification.
Key Milestones
- 10,000 strains: Foundation established through manual curation. Core data model and quality standards defined.
- 20,000 strains: Automated scraping pipelines operational. Multi-source verification system deployed.
- 30,000 strains: Quality score system introduced. Enrichment pipeline processing thousands of strains daily.
- 40,000 strains: AI Discovery Agent launched. 78+ sources integrated with Tor proxy infrastructure.
- 50,000 strains: Full ecosystem operational, with AI discovery, human review, community submissions, and continuous enrichment working in concert.
The Road to 100,000
Our next major milestone is 100,000 strains. While the number itself is ambitious, the real goal is coverage completeness: ensuring that any strain a grower, patient, or researcher encounters in the real world has a corresponding, high-quality entry in our database. We are expanding our source network, improving AI extraction accuracy, and deepening our breeder partnerships to verify genetic lineages directly.
We are also investing in retroactive quality improvement. It is more valuable to have 50,000 deeply enriched strains than 100,000 bare-minimum entries. Our target is to bring the average quality score above 65, a level at which a typical strain has a description, at least five linked effects, three linked flavors, and verified cannabinoid data.
The Terpene Explorer, the Strain Comparison Tool, and the Medical Strain Finder all become more powerful as the underlying data grows richer. Every strain we add, every effect we link, every data point we verify makes the entire ecosystem more useful for everyone, from the casual browser to the clinical researcher.
If you have strain knowledge to contribute, we welcome your input through our submission system. Every data point matters, and the path from 50,000 to 100,000 is one we are building together.