2. The Problem: AI's Lineage Crisis
An open-source researcher releases a novel computer vision model trained on carefully curated datasets. Within months, dozens of companies fine-tune it for commercial applications generating millions in revenue. The original creator receives nothing and cannot even prove they built the foundation.
When questioned about training data sources, these companies can offer only vague documentation: their fine-tuned models inherit the gaps of a base model that carries no provenance trail. Regulators investigating potential bias hit dead ends. Investors conducting due diligence cannot verify IP ownership. This pattern repeats across every corner of the AI economy.
Opaque Ownership in a Derivative Economy
Model weights travel through the ecosystem like unmarked containers.
A researcher downloads a checkpoint from HuggingFace, fine-tunes it on proprietary data, merges it with another model using mergekit, quantizes the result for deployment, and shares it on their company's internal platform. At each step, lineage information exists only in fragmented commit messages, Slack conversations, or individual memory. Six months later, when compliance teams ask "where did this model come from," nobody can provide a definitive answer.
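To make the failure concrete, here is a hypothetical reconstruction of that pipeline in plain PyTorch; every file name is illustrative and the quantization step is elided. The point is that nothing in the saved artifacts records any of the steps.

```python
import torch

# Step 1: download a checkpoint. The file itself records no origin.
base = torch.load("base-checkpoint.bin")

# Step 2: fine-tune on proprietary data. The dataset reference lives
# only in this script, never in the saved weights.
# ... training loop elided ...
torch.save(base, "finetuned.bin")

# Step 3: merge with a second model of unknown provenance by averaging
# weights (the simplest of the strategies tools like mergekit automate).
other = torch.load("colleague-model.bin")
merged = {k: (base[k] + other[k]) / 2 for k in base}
torch.save(merged, "merged.bin")

# What ships is a plain tensor dict: no base-model pointer, no dataset
# reference, no record that a merge ever happened.
```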
The Due Diligence Problem
This opacity creates practical problems daily. An enterprise acquires a startup whose core IP includes several production AI models. During due diligence, they discover no documentation proving the startup trained these models or legally inherited them from open-source predecessors. The acquisition stalls while lawyers attempt reconstruction through Git history and employee interviews.
Software has solved this problem through Git, where every line of code traces back to an author and timestamp. Package managers like npm show complete dependency trees. Model weights exist in a pre-Git era.
The Self-Reporting Gap
HuggingFace hosts over 500,000 models, most with incomplete provenance metadata. Model card fields like base_model remain optional and self-reported. Tools for model merging create derivatives with no standardized attribution. When developers compose models by averaging their weights, the lineage lives only in their local scripts. The infrastructure for attribution exists in software ecosystems but hasn't reached machine learning at scale.
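The shallowness of the self-reported trail is easy to demonstrate with the real huggingface_hub ModelCard API; the repository id below is hypothetical.

```python
from huggingface_hub import ModelCard

# Load the model card for a fine-tuned model (hypothetical repo id).
card = ModelCard.load("some-org/some-finetuned-model")
meta = card.data.to_dict()

# base_model is optional and unverified: it may be absent entirely, or
# be whatever string the uploader typed, with no check that the weights
# actually descend from it.
print(meta.get("base_model", "no lineage declared"))
```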
Rising IP and Data Disputes
The legal system is left to resolve AI provenance questions through courtroom discovery rather than a transparent protocol.
Copyright and Training Data Litigation
The New York Times sued OpenAI for training on copyrighted articles without compensation. Getty Images sued Stability AI over training data that included watermarked stock photos. Artists filed class action suits claiming Stable Diffusion used their copyrighted works without permission. Each case requires expensive litigation to answer basic questions about training data origins.
Dataset Creators Left Behind
Data providers face similar challenges. Research institutions curate valuable datasets and release them for academic use. Years later, they discover companies built commercial models using this data, violating license terms. Proving usage requires forensic analysis of model behavior, an imperfect process. Dataset creators cannot track where their contributions flow through the AI supply chain, unlike open-source library authors who see download statistics and attribution in dependent projects.
Base Model Attribution Conflicts
Creators of base models fare no better. A researcher releases a pre-trained language model under a non-commercial license. Companies fine-tune it for enterprise products, claiming the fine-tuned weights constitute a new work. Without provenance infrastructure, establishing inheritance relationships requires legal interpretation rather than technical verification. The friction discourages open research and channels innovation toward closed ecosystems with tighter control.
Absence of Universal Registries
Open-source software succeeded because infrastructure supported attribution. Git tracks every commit with cryptographic signatures. GitHub visualizes fork networks and contribution graphs. Package managers like npm enforce dependency declarations. Developers who build on others' work do so transparently, creating audit trails that enable both credit and debugging.
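The contrast is easiest to see in a manifest. npm refuses to publish a package whose dependencies are not declared in machine-readable form; a hypothetical equivalent for model weights, which no hub enforces today, might look like this:

```python
# Hypothetical model manifest, by analogy to package.json. Every name
# and version below is illustrative.
model_manifest = {
    "name": "clinical-ner-v2",
    "version": "2.1.0",
    "dependencies": {
        "base_model": "org/base-lm@sha256:ab12...",   # pinned by content hash
        "datasets": ["org/clinical-notes@1.4.0"],
    },
    "license": "cc-by-nc-4.0",
}
# npm rejects a publish with unresolved dependencies; model hubs happily
# accept weights with this entire block missing.
```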
Machine learning operates differently.
Fragmented Platform Solutions
HuggingFace provides model cards for documentation but implementation remains inconsistent. Users fill out templates voluntarily, self-reporting base models and training procedures. No mechanism verifies accuracy. MLflow tracks experiment lineage within organizations but creates siloed records that don't travel with models across institutional boundaries. Weights & Biases captures artifact ancestry during training but only for users within their platform. DVC versions datasets and models alongside code but relies on local configuration without a universal registry.
None of these tools provide what the AI economy needs: a cross-platform registry with immutable lineage, verifiable provenance, and financial attribution.
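One primitive such a registry would need is a stable content fingerprint, so any party can verify they hold exactly the weights a lineage record refers to. A minimal sketch, assuming the weights live in a single file:

```python
import hashlib

def weights_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a weights file in 1 MiB chunks so large checkpoints fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Two parties comparing fingerprints can confirm they hold identical
# weights without trusting any self-reported metadata.
print(weights_fingerprint("merged.bin"))  # illustrative path
```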
The Binary Blob Problem
Imagine if npm maintained no centralized package registry and developers manually tracked which libraries they used. That describes the current state of model weights: files pass between organizations as binary blobs with documentation of varying quality, and everyone simply hopes the metadata travels alongside them.
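Even formats that can carry metadata do not require it. The safetensors format, for instance, reserves an optional free-form header that the real safetensors API exposes; in practice it is routinely empty. A sketch, with an illustrative file path:

```python
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    metadata = f.metadata()  # optional free-form strings, or None

print(metadata or "no provenance metadata in the file")
```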
Regulatory Requirements Without Infrastructure
Regulation is arriving before infrastructure can support it.
EU AI Act Documentation Requirements
The EU AI Act mandates comprehensive documentation for high-risk AI systems deployed in healthcare, law enforcement, and critical infrastructure. Article 10 demands data governance documentation describing training data sources and collection procedures. Article 11 requires technical documentation covering training and development methodologies. Article 12 requires record keeping with automatic logs of system operations. Organizations building AI for these sectors face compliance requirements they cannot meet with existing tools.
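An interpretive sketch, not legal text, of the record those articles imply a provider must be able to produce; the field names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class HighRiskModelRecord:
    model_id: str                      # stable identifier for the system
    training_data_sources: list[str]   # Art. 10: origins of training data
    collection_procedures: str         # Art. 10: how the data was gathered
    methodology: str                   # Art. 11: training methodology
    operation_logs: list[str] = field(default_factory=list)  # Art. 12

# The hard part is not the schema. It is filling training_data_sources
# truthfully for a model three fine-tunes removed from its base.
```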
Healthcare Compliance Challenges
Healthcare provides stark examples. The EU AI Act classifies medical diagnostic models as high-risk, requiring full provenance documentation. Hospitals deploying such systems must demonstrate training data complies with GDPR, medical privacy regulations, and ethical guidelines. When a diagnostic model derives from a base model that incorporates publicly available medical images, the hospital needs provenance documentation extending through multiple generations. Current infrastructure cannot provide this chain of custody.
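What the hospital needs is a recursive lookup: follow parent links from the deployed model back to its origins. The registry below is hypothetical; today no such shared data structure exists.

```python
def provenance_chain(model_id: str, registry: dict) -> list[dict]:
    """Walk base-model links from a deployed model back to its roots."""
    chain = []
    record = registry.get(model_id)
    while record is not None:
        chain.append(record)
        record = registry.get(record.get("base_model"))
    return chain

# Illustrative three-generation chain for a diagnostic model.
registry = {
    "diag-v3": {"id": "diag-v3", "base_model": "med-lm", "data": "hospital records"},
    "med-lm": {"id": "med-lm", "base_model": "base-lm", "data": "public medical images"},
    "base-lm": {"id": "base-lm", "base_model": None, "data": "web crawl"},
}
for record in provenance_chain("diag-v3", registry):
    print(record["id"], "<-", record["data"])
```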
GDPR and Model Provenance
GDPR compounds these challenges through requirements for lawfulness of processing and records of processing activities. Article 30 mandates documentation of data processing operations, including purposes and data sources. For AI models, this extends to training data origins and transformations through fine-tuning pipelines. The right to explanation for automated decisions requires understanding how models make predictions, which depends on knowing their lineage and training provenance.
Organizations face regulatory obligations without the technical infrastructure to fulfill them.
Global Regulatory Convergence
Financial regulators in the US and Asia signal similar directions. The Federal Reserve is exploring AI governance frameworks that require model validation and documentation. Regulators in Asian markets weighing AI deployment in banking and insurance are examining how to ensure fair and transparent systems. Every jurisdiction converges on the same requirement: traceable provenance from training data through deployment. The compliance market for AI provenance grows while the infrastructure remains fragmented and voluntary.
Why Current Solutions Fall Short
Academic research proposes provenance tracking through cryptographic signatures and blockchain registries, but prototypes remain disconnected from production ML workflows.
Academic vs Production Gap
Papers describe ideal systems that require developers to change their entire toolchain. Industry tools provide partial solutions, each optimizing for their specific use case while creating interoperability gaps. A researcher using DVC for dataset versioning, HuggingFace for model distribution, and MLflow for experiment tracking must manually coordinate provenance across three systems with no shared standards.
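The stitching itself looks something like the sketch below, which uses real client APIs (huggingface_hub, mlflow) with hypothetical identifiers; the DVC side is reduced to a comment because its lineage lives in a lock file, not in the artifact. The merged record follows no schema any other party would recognize.

```python
from huggingface_hub import ModelCard
import mlflow

card = ModelCard.load("org/finetuned-model")   # HF: self-reported model card
run = mlflow.get_run("0123456789abcdef")       # MLflow: org-local run id

provenance = {
    "base_model": card.data.to_dict().get("base_model"),  # unverified claim
    "training_params": run.data.params,                   # siloed per org
    "dataset": "see dvc.lock in the training repo",       # third vocabulary
}
# Three tools, three vocabularies, and no cross-check among them.
```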
Blockchain Projects Miss the Mark
Blockchain AI projects target different problems. Ocean Protocol focuses on data marketplaces, not model lineage. SingularityNET builds AI service marketplaces. Fetch.ai creates autonomous agent networks. Bittensor distributes AI training and inference. None establish comprehensive model provenance as their primary mission.
The infrastructure gap persists because no solution combines universal registry, cross-platform integration, financial attribution, and regulatory compliance in a single protocol. This gap defines Origyn's opportunity.