4.1 Model Genome Layer: DAG Structure

Why DAG Instead of Linear Blockchain

AI models exhibit complex lineage patterns that linear blockchain structures cannot capture naturally.

A fine-tuned model has one parent. A merged model combining specialist networks has multiple parents. Quantized versions, distilled variants, and LoRA adaptations all represent different derivative relationships. A directed acyclic graph captures these patterns naturally: nodes are models, and edges encode derivation types with metadata.

Linear blockchains like Bitcoin create single chains where each block references one parent. This works for financial transactions with sequential ordering. Model evolution requires a structure where Model C can inherit from both Model A and Model B simultaneously through merging, where contribution weights determine how much each parent influenced the result.

Git pioneered this approach for source code with merge commits pointing to multiple parent commits. The Linux kernel repository contains over one million commits organized in a DAG that tracks contributions from thousands of developers. Origyn applies this proven model to AI weights.

The architecture combines three technologies. From Git comes the parent-pointer structure enabling efficient graph traversal algorithms for ancestry queries. From IPFS comes content-addressed immutability where identical content produces identical content identifiers (CIDs), making tampering cryptographically detectable. From Ethereum Layer 2 networks comes the smart contract layer providing registration logic, royalty distribution, and validator economics at costs 10-100x lower than mainnet.

Alternatives like IOTA Tangle and Hedera Hashgraph were considered but rejected. IOTA's experimental consensus and lack of Ethereum compatibility introduced risks. Hedera's permissioned governance through a corporate council contradicts decentralization principles. The Git+IPFS+Ethereum L2 combination balances proven technology, true decentralization, and economic feasibility.

Node Structure and Schema

Each model registered in Origyn becomes a node in the global provenance DAG.

The node structure captures essential information while optimizing for cost and immutability:
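A minimal sketch of such a node, written here as a Python dict for concreteness. The field names follow the descriptions below, but the exact layout, including the per-parent derivation entries, is an assumption, and elided values ("...") are placeholders:

```python
# Illustrative provenance node; the precise schema is an assumption based
# on the field descriptions in this section. "..." values are placeholders.
example_node = {
    "modelId": "bafy...",                 # CIDv1 of the full model card on IPFS
    "parentIds": [                        # empty list for base models
        {"cid": "bafy...", "derivationType": "merge", "contribution": 0.6},
        {"cid": "bafy...", "derivationType": "merge", "contribution": 0.4},
    ],
    "metadata": {
        "creator": "0x...",               # Ethereum wallet of the author
        "timestamp": 1700000000,          # establishes precedence for IP claims
        "architecture": "transformer",
        "parameterCount": 7_000_000_000,
        "datasetCID": "bafy...",          # training data on IPFS
        "license": "apache-2.0",
        "royaltyRate": 0.05,              # share owed by derivatives
    },
    "provenanceHash": "0x...",            # SHA-256 digest stored on-chain
    "signature": "0x...",                 # creator's ECDSA signature
}
```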

The modelId field holds an IPFS CIDv1 that addresses the complete model card content. This content-addressed identifier ensures immutability: changing any metadata produces a different CID. The format uses base32 encoding (bafy...) with SHA-256 hashing and dag-pb codec, following IPFS conventions. Anyone can verify model card authenticity by computing the CID from the content and comparing it to the registered value.

Parent relationships live in the parentIds array. Base models like GPT-3 or Llama 2 have empty arrays since they represent training from scratch. Fine-tuned models list one parent with derivationType: "fine-tune" and contribution: 1.0 indicating full inheritance. Merged models list multiple parents with fractional contribution weights summing to 1.0. A model created by averaging two specialist networks would show contribution: 0.5 for each parent.

The derivationType field captures the semantic relationship (fine-tune, merge, quantize, distill, lora, prune) enabling queries like "show all quantized derivatives" or "find models merged from these two parents."

The metadata object contains human-readable information and technical specifications. Creator addresses link to Ethereum wallets providing cryptographic proof of authorship. Timestamps establish precedence for IP claims. Architecture and parameter counts help users find models matching their requirements. The datasetCID references training data stored on IPFS, creating a complete chain from raw data through model weights.

Training configuration enables reproducibility and debugging. License terms clarify usage rights. Royalty rates specify what percentage flows to this creator when derivatives monetize.

Storage architecture optimizes for cost and immutability through a three-tier system. On-chain storage on Ethereum Layer 2 contains only the provenanceHash (32 bytes), a SHA-256 digest of the full node structure. At approximately $0.01 per transaction on networks like Arbitrum or Optimism, this makes registration affordable. IPFS stores the complete JSON (~1-10 KB depending on metadata richness) addressed by the modelId CID. Pinning services ensure availability without requiring permanent payment. Arweave provides permanent backup for critical provenance records at a one-time cost of approximately $0.0084 per KB.
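As a concrete illustration, the 32-byte digest could be computed along the following lines. The helper name is illustrative, and sorted-key compact JSON is assumed as the canonical serialization; the protocol's actual canonicalization rule is not specified here:

```python
import hashlib
import json

def provenance_hash(node: dict) -> bytes:
    """SHA-256 digest of the full node structure: the 32 bytes stored
    on-chain. Canonicalization via sorted-key compact JSON is an
    assumption; the real scheme may differ."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()
```

Anyone holding the IPFS JSON can recompute the digest and compare it byte-for-byte against the on-chain value.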

This hybrid approach balances immutability (blockchain hashes), availability (IPFS), and long-term preservation (Arweave).

The signature field contains an ECDSA signature from the creator's Ethereum wallet, proving they authorized the registration. This prevents impersonation attacks where malicious actors claim ownership of popular models. During registration, the smart contract verifies the signature matches the creator address before accepting the transaction. The signature covers the entire node structure including modelId, parentIds, and metadata, ensuring integrity of the complete provenance record.
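An off-chain check mirroring the contract's verification might look like the following, using the eth_account library. Signing the 32-byte provenance digest via the EIP-191 message encoding is an assumption; the contract may sign a different message layout:

```python
from eth_account import Account
from eth_account.messages import encode_defunct

def creator_signed(node: dict, digest: bytes, signature: str) -> bool:
    """Recover the signer of the provenance digest and compare it to the
    declared creator address. Assumes the digest is what gets signed."""
    message = encode_defunct(digest)                     # EIP-191 encoding
    signer = Account.recover_message(message, signature=signature)
    return signer.lower() == node["metadata"]["creator"].lower()
```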

Edge Types and Derivative Semantics

Model derivatives take multiple forms, each with distinct technical characteristics and royalty implications:

| Derivative Type | Parents | Description | Technical Characteristics | Royalty Flow |
| --- | --- | --- | --- | --- |
| base | 0 | Original model pre-trained from scratch | Full training run, no inheritance | No upstream royalties |
| fine-tune | 1 | Specialized through additional training | Updates all or a subset of parameters | 5% to parent (default) |
| merge | 2+ | Combined weights from multiple models | Weighted averaging or SLERP interpolation | Split by contribution weights |
| quantize | 1 | Reduced numerical precision | FP32 → FP16 → INT8 conversion | 5% to parent |
| distill | 1 | Knowledge transfer to smaller architecture | Student learns from teacher's outputs | 5% to parent (teacher) |
| lora | 1 | Low-rank adapter layers added | Frozen base + small trainable adapters | 5% to parent (base) |
| prune | 1 | Removed less important parameters | Structured or unstructured pruning | 5% to parent |

Fine-tuning represents the most common derivative operation.

A practitioner takes a pre-trained model like GPT-3 and continues training on a specialized dataset. This updates model parameters to optimize for the new domain. Full fine-tuning adjusts all weights. Parameter-efficient fine-tuning (PEFT) methods like prompt tuning or adapter layers modify only a small subset. From Origyn's perspective, both register as fine-tune derivatives since they build directly on the parent model's learned representations.

Model merging combines multiple specialists into a unified system. Techniques include simple weight averaging (parameters are averaged element-wise), SLERP interpolation (spherical linear interpolation in weight space), and task arithmetic (adding or subtracting task vectors). A merged model created from 60% Model A and 40% Model B would register both as parents with corresponding contribution weights.
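For the simplest case, element-wise averaging, the merge itself looks like the following sketch. The helper is hypothetical and assumes both models share an architecture so that parameter names align:

```python
import numpy as np

def average_merge(weights_a: dict[str, np.ndarray],
                  weights_b: dict[str, np.ndarray],
                  alpha: float = 0.6) -> dict[str, np.ndarray]:
    """Element-wise weighted average of two models' parameter tensors.
    alpha is Model A's share; it doubles as Model A's contribution
    weight in the Origyn registration."""
    return {name: alpha * weights_a[name] + (1.0 - alpha) * weights_b[name]
            for name in weights_a}
```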

When this merged model generates revenue, 60% of the 5% base royalty flows to Model A's creator and 40% flows to Model B's creator. The DAG structure makes these multi-parent relationships explicit and auditable.

Quantization reduces model precision from 32-bit floating point to 16-bit floats or 8-bit integers. This decreases memory footprint and increases inference speed with minimal accuracy loss. Techniques like GPTQ and AWQ optimize quantization for large language models. The quantized variant inherits weights from its full-precision parent.

A Llama-7B-INT8 model derives from Llama-7B-FP32. Deployment environments with limited resources prefer quantized versions, but the intellectual contribution comes from the original training. Origyn captures this relationship explicitly.

Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. The student trains to match the teacher's output distributions rather than the original training labels. This creates compact models that approximate larger models' performance. GPT-4-Mini might be distilled from GPT-4, inheriting knowledge without copying weights directly.

The parent relationship links student to teacher since the teacher's capabilities enable the student's performance.

LoRA (Low-Rank Adaptation) adds small adapter matrices to a frozen base model. Instead of updating billions of parameters, LoRA trains only millions of adapter weights. The base model remains unchanged. During inference, adapter outputs combine with frozen base outputs. This enables efficient fine-tuning and model sharing: distribute the small adapter while users provide their own base model.

The parent relationship points to the frozen base since adapters are meaningless without it.

Pruning removes parameters deemed less important through magnitude pruning (remove weights with small absolute values), structured pruning (remove entire neurons or channels), or lottery ticket hypothesis approaches (find sparse subnetworks). A 7B parameter model might become a 3B parameter model after pruning. The pruned variant derives from the original since the pruning process depends on the trained weights' structure and importance.

These edge types enable semantic queries across the provenance graph. A regulator could query "show all fine-tuned derivatives of Model X deployed in healthcare" to audit a base model's downstream usage. An enterprise could ask "which datasets influenced this model through its entire ancestry" to verify compliance. A researcher could explore "what models were created by merging model families A and B" to understand cross-pollination between research directions.

Multi-Parent Handling and Contribution Weights

Model merging creates nodes with multiple parents, which requires attribution logic that respects each parent's contribution.

Origyn implements this through contribution weights: non-negative decimal values between 0.0 and 1.0 that sum to 1.0 across all parents. When Model C merges from Model A (60%) and Model B (40%), the registration specifies:
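In sketch form, reusing the assumed per-parent layout from earlier (the CID strings are placeholders):

```python
# Model C's parent entries: two parents, contribution weights summing to 1.0.
model_c_parents = [
    {"cid": "bafy...modelA", "derivationType": "merge", "contribution": 0.6},
    {"cid": "bafy...modelB", "derivationType": "merge", "contribution": 0.4},
]
```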

Royalty calculations multiply the base rate by each parent's contribution weight.

If Model C generates $10,000 in revenue with a 5% total royalty burden, Model A's creator receives $300 (5% × 0.6 × $10,000) and Model B's creator receives $200 (5% × 0.4 × $10,000). The smart contract distributes these amounts automatically when Model C's operator triggers a royalty payment.
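A minimal sketch of that split (the function name and signature are illustrative):

```python
def split_royalty(revenue: float, rate: float, parents: list[dict]) -> dict:
    """Divide the total royalty pool among parents by contribution weight."""
    pool = revenue * rate
    return {p["cid"]: pool * p["contribution"] for p in parents}

# Reproduces the worked example: 5% of $10,000 split 60/40.
split_royalty(10_000, 0.05, [
    {"cid": "modelA", "contribution": 0.6},
    {"cid": "modelB", "contribution": 0.4},
])  # -> {'modelA': 300.0, 'modelB': 200.0}
```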

Determining contribution weights objectively remains challenging. For weight averaging methods, the averaging coefficients provide natural weights. If Model C = 0.6×ModelA + 0.4×ModelB, use 0.6 and 0.4 as contribution weights. For more complex merging strategies like task arithmetic or SLERP, creators must estimate each parent's influence.

Origyn allows self-reported weights during registration, enabling creators to attribute fairly according to their methodology. The validator challenge mechanism prevents egregious misattribution: if a creator claims 0.01 contribution for a parent that clearly dominates the merged model, challengers can dispute the registration.

Contribution weights compound through multiple generations. If Model D fine-tunes from Model C, and Model C merged from Models A and B, Model D's ancestry includes all three predecessors with diluted attribution. Model D pays 5% royalty to Model C. Model C then redistributes part of its received royalty upstream: 60% to Model A and 40% to Model B according to its contribution weights.

This cascading attribution ensures base model creators receive compensation proportional to their influence even through multiple derivation steps.
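One way to express the cascade is the recursive sketch below. Whether each generation re-applies the 5% rate to what it received, or the protocol uses a different dilution rule, is an interpretation of the text:

```python
def cascade(model_id: str, amount: float, parents_of: dict,
            payouts: dict, rate: float = 0.05) -> None:
    """Push a payment up the ancestry, splitting by contribution weight.
    `parents_of` maps a model ID to its parent entries. Assumes each
    generation forwards `rate` of what it received to its own parents."""
    payouts[model_id] = payouts.get(model_id, 0.0) + amount
    for parent in parents_of.get(model_id, []):
        cascade(parent["cid"], amount * rate * parent["contribution"],
                parents_of, payouts, rate)
```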

The system handles edge cases gracefully. Merges with many parents (ensembling techniques might average 5-10 models) simply list all parent CIDs with corresponding weights. Three-way merges, four-way merges, and beyond work identically. The graph traversal algorithms remain efficient because they walk edges in topological order without requiring special cases for different parent counts.

Graph Traversal and Ancestry Algorithms

Querying model lineage requires efficient graph traversal.

Origyn implements standard DAG algorithms adapted from Git and graph theory:

Breadth-First Search (BFS) explores ancestors level by level, visiting all direct parents before grandparents. This finds the shortest path to any ancestor and identifies the closest common ancestor of two models. Implementation uses a queue data structure: start with the target model, enqueue its parents, visit each parent and enqueue their parents, continue until reaching base models.

BFS complexity is O(V + E) where V is vertex count (models) and E is edge count (parent relationships). Because the visited set expands each shared ancestor only once, a typical query examining 20 generations of a model with average branching factor 2 visits fewer than 1,000 distinct nodes, far below the 2^20 a pure tree would imply.
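A sketch of the ancestry walk, assuming a simple parents_of mapping from model ID to parent IDs:

```python
from collections import deque

def ancestors_bfs(model_id: str, parents_of: dict) -> list[str]:
    """Enumerate ancestors level by level: direct parents first, then
    grandparents, and so on. The visited set ensures each shared
    ancestor is expanded once, giving O(V + E) time."""
    seen, order = {model_id}, []
    queue = deque(parents_of.get(model_id, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(parents_of.get(node, []))
    return order
```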

Depth-First Search (DFS) explores one path completely before backtracking. This enumerates all paths from a model back to base models, useful for complete lineage documentation. Implementation uses a stack (or recursion): start with target model, visit first parent, recursively visit that parent's first parent, backtrack when reaching a base model, continue with next parent.

DFS also runs in O(V + E) time but with different memory characteristics. DFS requires stack depth proportional to the longest path, while BFS requires queue size proportional to the widest level.
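A recursive sketch over the same assumed parents_of mapping, enumerating every lineage path back to a base model (path counts can grow quickly in heavily merged graphs):

```python
def all_lineage_paths(model_id: str, parents_of: dict) -> list[list[str]]:
    """Depth-first enumeration of every path from a model back to a
    base model (a node with no parents)."""
    parents = parents_of.get(model_id, [])
    if not parents:                        # base model: the path ends here
        return [[model_id]]
    return [[model_id] + tail
            for parent in parents
            for tail in all_lineage_paths(parent, parents_of)]
```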

Topological Sort orders models so that parents always appear before children. This enables efficient royalty calculation: process models in topological order, computing each model's royalty based on already-computed parent royalties. Kahn's algorithm implements this through in-degree tracking: start with base models (in-degree 0), remove them from the graph while decrementing their children's in-degrees, repeat with newly zero-in-degree nodes until all models are processed.

Linear time O(V + E) with guarantees that acyclic graphs produce valid orderings.
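A sketch of Kahn's algorithm over the same parents_of mapping; note that a node's in-degree under parent-to-child edges is simply its number of parents:

```python
from collections import deque

def topological_order(parents_of: dict) -> list[str]:
    """Kahn's algorithm: emit parents before children, so royalties can
    be computed in one pass over the returned order."""
    indegree = {node: len(parents) for node, parents in parents_of.items()}
    children_of: dict[str, list[str]] = {}
    for node, parents in parents_of.items():
        for parent in parents:
            children_of.setdefault(parent, []).append(node)
            indegree.setdefault(parent, 0)  # parents missing from the map are bases
    queue = deque(n for n, d in indegree.items() if d == 0)  # base models first
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children_of.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order
```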

Lowest Common Ancestor (LCA) finds where two lineages converge, identifying the most recent shared predecessor. This helps detect when two models derive from the same base model or identify licensing conflicts. Tarjan's offline LCA algorithm preprocesses the graph in O(V + E) time, then answers LCA queries in near-constant amortized time.

For online queries, binary lifting achieves O(log V) per query after O(V log V) preprocessing.
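A simplified online version is sketched below, favoring clarity over the preprocessing-based approaches above. In a DAG two models can have several "lowest" common ancestors, so the sketch returns a set:

```python
def _ancestors(node: str, parents_of: dict) -> set[str]:
    """All strict ancestors of `node`, via iterative DFS."""
    stack, seen = list(parents_of.get(node, [])), set()
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents_of.get(n, []))
    return seen

def lowest_common_ancestors(a: str, b: str, parents_of: dict) -> set[str]:
    """Common ancestors of a and b that are not themselves ancestors of
    another common ancestor. Not optimal, but correct for small graphs."""
    common = (_ancestors(a, parents_of) | {a}) & (_ancestors(b, parents_of) | {b})
    return {n for n in common
            if not any(n in _ancestors(m, parents_of) for m in common)}
```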

Smart contracts implement simplified versions of these algorithms to keep gas costs manageable. Complex queries run off-chain on indexed graph databases with on-chain verification. The blockchain stores the DAG structure while indexers like The Graph maintain queryable representations. Users query the indexer for results, then verify critical paths on-chain through Merkle proofs.

This separation keeps on-chain costs bounded while enabling sophisticated analysis off-chain.

Edge Case Handling

DAGs can encounter pathological cases requiring explicit handling:

Cycle Detection: A cycle in the lineage graph would violate acyclicity and enable paradoxes like Model A deriving from Model B which derives from Model A. The registration smart contract prevents this through ancestry checks. Before accepting a new registration linking Model B to Parent A, the contract verifies Parent A is not a descendant of Model B.

This check runs in O(V + E) time using DFS from Parent A, rejecting registration if Model B appears in the ancestry. While expensive in gas costs, cycle detection is essential for graph integrity. Optimizations include caching ancestry hashes and using Bloom filters for fast negative checks.
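A sketch of the registration-time check, again over a parents_of mapping from model ID to parent IDs:

```python
def creates_cycle(child: str, proposed_parent: str, parents_of: dict) -> bool:
    """DFS upward from the proposed parent; if `child` appears in its
    ancestry, adding the edge child -> proposed_parent would close a
    cycle and the registration is rejected. Runs in O(V + E)."""
    stack, seen = [proposed_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True                   # child is already an ancestor
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents_of.get(node, []))
    return False
```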

False Parentage Claims: Malicious actors might claim their model derives from a popular base model to borrow its legitimacy or to manipulate royalty flows. Three defense mechanisms address this: First, the challenge mechanism allows anyone to dispute registrations by staking tokens. Second, validator review flags suspicious registrations for community examination. Third, model fingerprinting compares weight distributions to verify claimed relationships.

If Model B claims fine-tuning from Model A but shows zero correlation in weight space, this suggests fraud. While proving negative relationships is harder than proving positive ones, statistical tests can flag likely false claims.

Orphaned Nodes: If a parent model's IPFS content becomes unavailable through pinning failures, child models retain valid registrations but lose access to parent metadata. The on-chain provenance hash remains immutable, preserving the claim that a parent existed. Arweave backups mitigate this for critical records.

For less critical models, orphaned nodes continue functioning: the child model's metadata stands independent of parent availability. The DAG structure shows a parent relationship existed even if parent details are temporarily unavailable. When parents reappear (pinning services restore content), the links restore automatically.

Ambiguous Derivations: Some derivatives blur the lines between derivation types. A model might involve fine-tuning, then merging, then quantization in sequence. Origyn captures the immediate parent relationship: Model D derives from quantizing Model C (which itself has a lineage). The graph naturally represents multi-step processes through intermediate registrations.

If creators skip intermediate steps for convenience, they can still register the final result with the most relevant derivation type. The metadata field allows free-text explanation of complex processes.

These mechanisms ensure the DAG remains a trusted source of truth even as malicious actors attempt manipulation and real-world ambiguities complicate clean categorizations.
