Knowledge Synthesis and Wiki Layer (Proposal)

Status: draft proposal. Not part of the canonical numbered set. If accepted, the schema sections fold into 04 Data Model, the compiler folds into 05 Persistence, Storage & Ingestion, and the decision record (ADR-022 below) is appended to the decision log. Do not run the docs index/changelog maintainer against this file until it is promoted.

Precis

Thinklio today is, in the terms of the OpenBrain/Karpathy debate, a faithful query-time structured store: facts and document chunks are captured and reasoned over on demand, with provenance and confidence attached. What it lacks is the second half of that debate, the compiled view: a browsable, cross-referenced knowledge base that reads cleanly and answers fast.

This proposal adds that compiled layer without giving up the structured store as source of truth. The design rests on one principle: the wiki is derived data, never a source. Nothing edits a wiki page directly. The database stays authoritative, and pages are regenerated from it by a single writer, so the wiki can never drift from ground truth, never silently smooth away a contradiction, and never compound an error into the next cycle.

It introduces four new layers above the existing atoms:

  1. A canonical entity graph (entity), which is both the resolution layer for facts and the hierarchical taxonomy of the business.
  2. An app-level typed-relation vocabulary (relation_type, relation): verbs such as supports, contradicts, supersedes, explains, grouped into families that instruct the compiler.
  3. Visible derivations (derivation): applications such as "use A to explain Y" or "what does A mean in the context of Z", generated query-time and promoted to stored content when reused.
  4. A compiled wiki (wiki_page, wiki_section, wiki_source_link, wiki_revision): the human- and agent-readable projection, section-incremental and fully source-cited.

Contradiction analysis is not a separate feature; it is the evidential relation family surfaced rather than resolved.

1. Background and the design fork

The single question every AI knowledge system answers is when the AI does the hard thinking: at write time or at query time. Karpathy's wiki does it at write time (synthesise on ingest, browse pre-built understanding). OpenBrain does it at query time (store faithfully, reason when asked). Each breaks differently: the wiki breaks under teams, multi-agent writes, high volume and fast-moving data, and it hides what it drops; the structured store is weaker at deep synthesis, has no browsable artifact, and lets contradictions sit silently in adjacent rows.

Thinklio sits on the query-time side and should stay there for its system of record. This proposal adds a write-time projection on top, so that:

  • slow-moving, high-value knowledge gets compiled once and kept current (the wiki strength), while
  • fast-moving operational data and precise queries stay query-time against the structured store (the OpenBrain strength), and
  • the projection is always rebuilt from the authoritative database, so the failure modes of a stand-alone wiki (drift, smoothed contradictions, error compounding, merge conflicts) do not apply.

The mental model: the database is the filing cabinet and the librarian; the wiki is a study guide the librarian rewrites from the cabinet whenever the contents change. The guide is thrown away and rewritten, never patched.

2. What already exists (and is reused)

This proposal builds on, and does not replace, the current model:

  • knowledge_fact (doc 04 §9) remains the atom of structured knowledge (subject/predicate/value, confidence, sourceInteractionId, embedding, four-layer scope).
  • media, library, library_item (doc 04 §15, doc 05 §13) remain the document and chunk store.
  • note, item, contact, task, tag/entity_tag (doc 04 §8) remain the structured operational records.
  • The extract step (doc 02), the fact_extract processor (doc 04 §8.8) and the optional document "derive" step (doc 02) remain the selective-distillation paths that feed atoms.
  • The Fact Checker agent (agent-specs §03) is reused as the publish gate for the compiler.
  • The Convex Workflow component (Tier 2, ADR-018) runs the compiler durably.

The new tables sit above these. Nothing here changes the hot interactive path.

3. The entity graph (taxonomy plus resolution)

Today knowledge_fact.subject is a free string, so "Acme", "Acme Corp" and "ACME" are three different subjects and there is no tree to hang a wiki on. The entity table fixes both problems at once: it is the canonical referent that facts point to, and (via a self-referential parent) the hierarchical taxonomy of the business with detail at the leaves.

entity: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")),  // never user: this is a shared KB
  scopeId: v.string(),
  kind: v.union(
    v.literal("domain"),     // top-level business area (taxonomy node)
    v.literal("topic"),      // concept/subject area (taxonomy node)
    v.literal("person"),
    v.literal("org"),
    v.literal("project"),
    v.literal("product"),
    v.literal("policy"),
    v.literal("concept"),
  ),
  name: v.string(),
  slug: v.string(),
  aliases: v.array(v.string()),                  // drives entity resolution / dedup
  description: v.optional(v.string()),
  primaryParentId: v.optional(v.id("entity")),   // the spanning tree, for breadcrumbs
  linkedContactId: v.optional(v.id("contact")),  // person/org reuse the CRM, not duplicate it
  embedding: v.array(v.float64()),               // resolution + retrieval
  factCount: v.number(),                         // eligibility signal
  status: v.union(v.literal("active"), v.literal("merged"), v.literal("archived")),
  mergedIntoId: v.optional(v.id("entity")),      // dedup target when two entities merge
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_account_kind", ["accountId", "kind"])
  .index("by_parent", ["primaryParentId"])
  .index("by_account_slug", ["accountId", "slug"])
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "kind"],
  }),

Rules.

  • Entities of kind domain or topic are the navigational interior nodes of the taxonomy. The other kinds are the leaf referents that carry detail.
  • primaryParentId defines a single spanning tree so navigation and breadcrumbs are unambiguous. A node legitimately belonging under two parents keeps one primary parent here and gets a secondary part_of edge in relation (section 4). The taxonomy is therefore a DAG with a designated tree.
  • knowledge_fact gains an optional entityId (and keeps subject as the surface string), so facts resolve to a canonical entity. Resolution uses aliases plus embedding similarity.
  • Person and org entities link to contact via linkedContactId rather than duplicating CRM data. The entity is the knowledge-graph node; the contact remains the operational record. (See open question 2 in section 13.)
  • Merges are non-destructive: a duplicate is set to status = merged with mergedIntoId pointing at the surviving entity, and reads follow the pointer.
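The resolution and merge rules above can be sketched as follows. This is a minimal in-memory illustration, not the proposal's implementation: the Entity shape is a cut-down copy of the table, and normalise, cosine, followMerge, resolveEntity and the 0.85 similarity threshold are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical in-memory shape; field names mirror the entity table above.
type Entity = {
  id: string;
  name: string;
  aliases: string[];
  embedding: number[];
  status: "active" | "merged" | "archived";
  mergedIntoId?: string;
};

// Case- and punctuation-insensitive key, so "ACME" and "Acme" collide.
const normalise = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Reads follow mergedIntoId until they reach a non-merged entity.
function followMerge(id: string, byId: Map<string, Entity>): Entity | undefined {
  const seen = new Set<string>();
  let current = byId.get(id);
  while (current && current.status === "merged" && current.mergedIntoId) {
    if (seen.has(current.id)) return undefined; // defensive: pointer cycle
    seen.add(current.id);
    current = byId.get(current.mergedIntoId);
  }
  return current;
}

// Alias match first; otherwise the nearest embedding above an assumed threshold.
function resolveEntity(
  subject: string,
  subjectEmbedding: number[],
  entities: Entity[],
  minSimilarity = 0.85, // assumed value, not from the proposal
): Entity | undefined {
  const byId = new Map(entities.map((e) => [e.id, e] as [string, Entity]));
  const key = normalise(subject);
  const exact = entities.find((e) =>
    [e.name, ...e.aliases].some((n) => normalise(n) === key),
  );
  if (exact) return followMerge(exact.id, byId);
  let best: Entity | undefined;
  let bestScore = minSimilarity;
  for (const e of entities) {
    const score = cosine(subjectEmbedding, e.embedding);
    if (score >= bestScore) {
      best = e;
      bestScore = score;
    }
  }
  return best ? followMerge(best.id, byId) : undefined;
}
```

In a real deployment the similarity step would presumably run against the by_embedding vector index rather than a linear scan; the scan here only keeps the sketch self-contained.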

4. Typed relations: verbs as compiler instructions

A relation is a directed, typed edge. The verb does not merely label the edge; its family tells the compiler what to do when it encounters the edge. The vocabulary is defined at the application level because it is largely structural, with controlled, admin-only extension per account. It is never arbitrarily editable by end users or agents.

4.1 The vocabulary registry

relation_type: defineTable({
  verb: v.string(),                              // "supports", "contradicts", "supersedes", ...
  inverseVerb: v.optional(v.string()),           // "supported_by"; omitted when symmetric
  symmetric: v.boolean(),
  family: v.union(
    v.literal("structural"),   // builds the taxonomy: is_a, part_of, instance_of, related_to
    v.literal("evidential"),   // truth/confidence: supports, corroborates, challenges, contradicts, refutes
    v.literal("temporal"),     // currency: supersedes, upgrades, updates, deprecates, refines
    v.literal("corrective"),   // corrects, clarifies
    v.literal("causal"),       // reasoning: causes, enables, explains, implies, depends_on
  ),
  polarity: v.union(v.literal("positive"), v.literal("negative"), v.literal("neutral")),
  affectsCurrency: v.boolean(),                  // true => newer side wins, older kept as history
  compilerDirective: v.string(),                 // how the wiki renders this family/verb
  isSystem: v.boolean(),                         // true = platform-seeded, immutable
  accountId: v.optional(v.id("account")),        // omitted = platform-global; set = account extension
  description: v.string(),
})
  .index("by_verb", ["verb"])
  .index("by_family", ["family"])
  .index("by_account", ["accountId"]),

Governance. Platform-seeded verbs have isSystem = true, no accountId, and are immutable. An account admin may add account-scoped verbs (isSystem = false, accountId set) through a governed admin path only; the policy middleware denies creation or mutation of relation_type by end users and by agents. Agents may create relations using the existing vocabulary; they may not extend the vocabulary.
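As a rough illustration of the governance rule, the check the policy middleware would apply can be written as a single predicate. The Actor shape, the role names and canWriteRelationType are assumptions made for this sketch; the proposal does not define how the middleware represents callers.

```typescript
// Hypothetical caller shape; role names are illustrative only.
type Actor = {
  role: "end_user" | "agent" | "account_admin";
  accountId: string;
};

// Cut-down relation_type row: only the fields the rule depends on.
type RelationTypeRow = {
  verb: string;
  isSystem: boolean;
  accountId?: string; // absent for platform-global verbs
};

// True only when an account admin creates or edits a verb scoped to their own account.
function canWriteRelationType(actor: Actor, row: RelationTypeRow): boolean {
  if (row.isSystem) return false; // platform-seeded verbs are immutable
  if (actor.role !== "account_admin") return false; // end users and agents are denied
  return row.accountId === actor.accountId; // no cross-account extension
}
```

Creating relation rows is a separate permission: agents stay free to write edges that use verbs already in the registry.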

Seed set (representative, not exhaustive).

| Family | Verbs | Directive to compiler |
| --- | --- | --- |
| structural | is_a / has_instance, part_of / has_part, instance_of, example_of, related_to (symmetric) | Build the taxonomy tree and "related" cross-links. |
| evidential | supports / supported_by, corroborates (sym), challenges / challenged_by, contradicts (sym), refutes / refuted_by, questions | Raise/lower confidence; contradicts/refutes force the dual-attribution block, never a blended consensus. |
| temporal | supersedes / superseded_by, upgrades, updates, deprecates, refines | Prefer the newer claim; footnote the older as history. |
| corrective | corrects / corrected_by, clarifies | Like temporal, but asserts the prior claim was wrong, not merely stale. |
| causal | causes / caused_by, enables, explains / explained_by, implies, depends_on | Feed derivations (section 5); render as reasoning links. |

4.2 The edge

relation: defineTable({
  accountId: v.id("account"),
  relationTypeId: v.id("relation_type"),
  verb: v.string(),                              // denormalised for query
  family: v.string(),                            // denormalised for sweeps
  sourceType: v.union(v.literal("fact"), v.literal("entity"), v.literal("derivation")),
  sourceId: v.string(),
  targetType: v.union(v.literal("fact"), v.literal("entity"), v.literal("derivation")),
  targetId: v.string(),
  confidence: v.number(),
  severity: v.optional(v.number()),              // contradiction severity, 0..1
  rationale: v.optional(v.string()),             // one line: why this edge holds
  detectedBy: v.union(v.literal("classifier"), v.literal("agent"), v.literal("human")),
  status: v.union(v.literal("proposed"), v.literal("confirmed"), v.literal("dismissed")),
  sourceInteractionId: v.optional(v.id("interaction")),
  createdByAgent: v.optional(v.id("agent")),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_source", ["sourceType", "sourceId"])
  .index("by_target", ["targetType", "targetId"])
  .index("by_verb", ["verb"])
  .index("by_family_status", ["family", "status"]),  // contradiction audit sweeps

Rules.

  • Endpoints are polymorphic so the same table carries fact-to-fact, entity-to-entity (structural), and derivation-linked edges.
  • Symmetric verbs are stored once; the reader treats them as bidirectional via relation_type.symmetric.
  • Machine-detected edges land with status = proposed; they can be confirmed by a human, by corroborating evidence, or by a policy threshold. Dismissed edges are kept, not deleted, for audit.
  • Cross-layer conflicts (where the four-layer precedence account > agent > team > user would otherwise silently pick a winner) are written as explicit contradicts edges and surfaced, not resolved invisibly.
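One way a reader might apply the "stored once, read both ways" rule is sketched below. The Edge and RelationType shapes are trimmed copies of the tables (sourceType and targetType are left out for brevity), and edgesFor and View are hypothetical names. Skipping dismissed edges at render time is an assumption that follows from keeping them for audit only.

```typescript
type RelationType = { verb: string; inverseVerb?: string; symmetric: boolean };

type Edge = {
  verb: string;
  sourceId: string;
  targetId: string;
  status: "proposed" | "confirmed" | "dismissed";
};

// An edge as seen from one endpoint: the verb to display and the node at the other end.
type View = { verb: string; otherId: string };

function edgesFor(
  nodeId: string,
  edges: Edge[],
  types: Map<string, RelationType>,
): View[] {
  const out: View[] = [];
  for (const e of edges) {
    if (e.status === "dismissed") continue; // kept for audit, not rendered
    const t = types.get(e.verb);
    if (!t) continue; // unknown verb: nothing to render
    if (e.sourceId === nodeId) {
      out.push({ verb: e.verb, otherId: e.targetId });
    } else if (e.targetId === nodeId) {
      if (t.symmetric) {
        // Stored once, read in both directions.
        out.push({ verb: e.verb, otherId: e.sourceId });
      } else if (t.inverseVerb) {
        // Directed verb seen from the target side uses the inverse label.
        out.push({ verb: t.inverseVerb, otherId: e.sourceId });
      }
    }
  }
  return out;
}
```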

5. Derivations: applications as visible content

Extraction pulls a claim out of a source (truth-preserving, grounded). A derivation projects an idea onto a question or context: "use A to explain Y", "what does A mean in the context of Z" (generative, expansive). These are epistemically weaker and more fragile than facts, so they are modelled separately, always marked inferred, and grounded by reasoning lineage rather than quotation. Per the design decision, derivations are visible first-class content, not a hidden agent-only layer.

derivation: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")),
  scopeId: v.string(),
  kind: v.union(
    v.literal("explanation"),   // use A to explain Y
    v.literal("application"),    // what does A mean in the context of Z
    v.literal("implication"),
    v.literal("comparison"),
    v.literal("summary"),
  ),
  question: v.string(),                          // the lens/prompt
  body: v.string(),                              // generated markdown
  entityId: v.optional(v.id("entity")),          // primary subject (the "A")
  contextEntityId: v.optional(v.id("entity")),   // the context (the "Z")
  inputRefs: v.array(v.object({                  // reasoning lineage = provenance
    type: v.string(),                            // fact | entity | library_item | derivation
    id: v.string(),
  })),
  truthStatus: v.literal("inferred"),            // always weaker than a sourced fact
  confidence: v.number(),
  status: v.union(v.literal("ephemeral"), v.literal("promoted")),  // query-time vs stored
  accessCount: v.number(),
  lastAccessed: v.number(),
  inputsAsOf: v.number(),                        // staleness watermark
  reviewState: v.union(v.literal("machine"), v.literal("human_reviewed"), v.literal("flagged")),
  embedding: v.array(v.float64()),
  createdByAgent: v.optional(v.id("agent")),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_entity", ["entityId"])
  .index("by_status_access", ["status", "accessCount"])  // promotion candidates
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "scope", "scopeId"],
  }),

Lifecycle (the write-time/query-time bridge). A derivation is generated at query time when asked (the OpenBrain "think when needed" mode) and stored with status = ephemeral. When accessCount crosses a threshold it is promoted (status = promoted), gains a review pass, and becomes eligible to surface as an Applications section on the relevant wiki page (the Karpathy "compile once, keep current" mode). The promotion threshold is the dial between the two paradigms.
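A minimal sketch of the promotion step follows, assuming "crosses a threshold" means the access count reaching a configured value. recordAccess and the promoteAt parameter are hypothetical; the review pass that follows promotion is not modelled here.

```typescript
// Cut-down derivation row: only the lifecycle fields.
type Derivation = {
  status: "ephemeral" | "promoted";
  accessCount: number;
  lastAccessed: number;
};

// Returns the updated row after one read; promotes once the count reaches promoteAt.
function recordAccess(d: Derivation, now: number, promoteAt: number): Derivation {
  const accessCount = d.accessCount + 1;
  const promote = d.status === "ephemeral" && accessCount >= promoteAt;
  return {
    ...d,
    accessCount,
    lastAccessed: now,
    status: promote ? "promoted" : d.status, // promotion is one-way in this sketch
  };
}
```

Lowering promoteAt moves behaviour toward the compile-once mode; raising it keeps more derivations query-time only.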

6. The wiki tables

A wiki page is a 1:1 projection of a page-worthy entity, broken into sections so recompilation is localised. Pages carry embeddings and become a third retrieval source alongside library_item and knowledge_fact (doc 05 §14.2).

wiki_page: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")),
  scopeId: v.string(),
  entityId: v.id("entity"),
  slug: v.string(),
  title: v.string(),
  summary: v.string(),
  embedding: v.array(v.float64()),
  status: v.union(
    v.literal("draft"),
    v.literal("published"),
    v.literal("stale"),        // inputs changed since last compile; surfaced, not hidden
    v.literal("superseded"),
  ),
  reviewState: v.union(v.literal("machine"), v.literal("human_reviewed"), v.literal("flagged")),
  confidence: v.number(),
  coverage: v.number(),         // fraction of the entity's atoms actually cited
  inputsAsOf: v.number(),       // watermark: max updatedAt of sources at compile time
  version: v.number(),
  supersedesPageId: v.optional(v.id("wiki_page")),
  dirty: v.boolean(),           // needs recompile
  lockToken: v.optional(v.string()),  // single-writer lease
  lockedAt: v.optional(v.number()),
  compiledAt: v.optional(v.number()),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_entity", ["entityId"])
  .index("by_account_slug", ["accountId", "slug"])
  .index("by_status", ["status"])
  .index("by_dirty", ["dirty"])
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "scope", "scopeId"],
  }),

wiki_section: defineTable({
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  heading: v.string(),
  order: v.number(),
  sectionKind: v.union(
    v.literal("overview"),
    v.literal("facts"),
    v.literal("relationships"),
    v.literal("contradictions"),
    v.literal("applications"),   // promoted derivations
    v.literal("sources"),
  ),
  body: v.string(),              // markdown
  inputHash: v.string(),         // hash of source ids+versions; unchanged => skip LLM
  inputsAsOf: v.number(),
  confidence: v.number(),
  dirty: v.boolean(),
  updatedAt: v.number(),
})
  .index("by_page_order", ["pageId", "order"])
  .index("by_dirty", ["dirty"]),

wiki_source_link: defineTable({
  sectionId: v.id("wiki_section"),
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  sourceType: v.union(
    v.literal("fact"),
    v.literal("media"),
    v.literal("library_item"),
    v.literal("note"),
    v.literal("item"),
    v.literal("derivation"),
  ),
  sourceId: v.string(),
  claim: v.optional(v.string()),  // the specific claim used from this source
  confidence: v.number(),
})
  .index("by_section", ["sectionId"])
  .index("by_page", ["pageId"])
  .index("by_source", ["sourceType", "sourceId"]),  // reverse lookup: dirty pages when a source changes

wiki_revision: defineTable({
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  version: v.number(),
  snapshot: v.any(),              // page + sections at compile time
  diffSummary: v.optional(v.string()),  // what changed vs prior version
  compiledAt: v.number(),
})
  .index("by_page_version", ["pageId", "version"]),

Why section-level granularity matters. When one fact changes, the reverse wiki_source_link.by_source lookup marks only the sections that cite it dirty. The compiler then recompiles those sections, not the whole page, and the inputHash short-circuit skips the LLM call for any dirty section whose inputs turn out to be unchanged. This is the main defence against the cost and drift of full-wiki rebuilds, and against the "every change ripples across a dozen pages" failure the structured-vs-wiki debate warns about.
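The reverse lookup can be illustrated with plain arrays standing in for the tables. markDirtyForSource is a hypothetical helper; in Convex the filter would presumably be an indexed query on wiki_source_link.by_source rather than a scan.

```typescript
type SourceLink = {
  sectionId: string;
  pageId: string;
  sourceType: string;
  sourceId: string;
};

type Section = { id: string; pageId: string; dirty: boolean };

// Marks only the sections that cite the changed source; returns the ids it newly marked.
function markDirtyForSource(
  sourceType: string,
  sourceId: string,
  links: SourceLink[],
  sections: Section[],
): string[] {
  const citing = new Set(
    links
      .filter((l) => l.sourceType === sourceType && l.sourceId === sourceId)
      .map((l) => l.sectionId),
  );
  const marked: string[] = [];
  for (const s of sections) {
    if (citing.has(s.id) && !s.dirty) {
      s.dirty = true; // no synthesis here, only the flag
      marked.push(s.id);
    }
  }
  return marked;
}
```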

7. The compiler

The compiler is a single-writer, durable Tier 2 Workflow. Many agents write atoms concurrently; only the compiler writes pages, so there are no merge conflicts.

7.1 Triggers

  • Event-driven dirty marking (cheap, immediate). On knowledge.extracted, on source-document update, and on new evidential/temporal relations, mark the affected sections dirty via wiki_source_link.by_source and relation.by_target. No synthesis yet.
  • Debounced scheduled compile (Tier 2). A daily/weekly sweep recompiles dirty sections only. Debounce means a topic touched forty times today compiles once, not forty times. This is what keeps fast-moving data from thrashing the wiki.
  • On-demand. "Recompile this topic now."

7.2 Per-section compile steps

selectDirty        -> dirty sections, debounced and batched
acquireLock        -> per-page lease (lockToken); single writer
gatherInputs       -> entity facts + relations + promoted derivations + source rows
hashInputs         -> if inputHash unchanged, clear dirty and STOP (no LLM)
synthesise         -> generate body under the grounding contract + verb-family directives
factCheckGate      -> Fact Checker agent; block on unsourced/contradicted claims
coverageCheck      -> cited vs available atoms; record coverage; flag uncited atoms
diffAndSnapshot    -> write wiki_revision with diffSummary
publishOrFlag      -> reviewState policy: publish, or flag for human review
finalise           -> set inputsAsOf, clear dirty, release lock, emit wiki.page.compiled
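The hashInputs short-circuit in the steps above might look like the sketch below. It is a simplification under stated assumptions: updatedAt stands in for a source "version", FNV-1a stands in for whatever hash the compiler would really use, and the synthesise callback is a placeholder for the LLM call. Locking, the fact-check gate, coverage and snapshotting are omitted.

```typescript
type SourceRef = { id: string; updatedAt: number };

type Section = {
  body: string;
  inputHash: string;
  dirty: boolean;
  inputsAsOf: number;
};

// Order-independent FNV-1a hash over "id@version" keys (an assumed stand-in hash).
function hashInputs(sources: SourceRef[]): string {
  const key = sources
    .map((s) => `${s.id}@${s.updatedAt}`)
    .sort()
    .join("|");
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Clears dirty without synthesis when inputs are unchanged; otherwise regenerates the body.
function compileSection(
  section: Section,
  sources: SourceRef[],
  synthesise: (sources: SourceRef[]) => string, // placeholder for the LLM step
): { section: Section; calledLlm: boolean } {
  const inputHash = hashInputs(sources);
  if (inputHash === section.inputHash) {
    return { section: { ...section, dirty: false }, calledLlm: false };
  }
  const body = synthesise(sources);
  const inputsAsOf = sources.reduce((max, s) => Math.max(max, s.updatedAt), 0);
  return {
    section: { body, inputHash, dirty: false, inputsAsOf },
    calledLlm: true,
  };
}
```

Sorting the keys before hashing means a section is not recompiled merely because its sources were gathered in a different order.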

7.3 Accuracy safeguards

  • Grounding contract. The compiler may assert only claims backed by a source row, each written to wiki_source_link. If it cannot cite, it does not write. This is the primary anti-hallucination control.
  • Staleness is surfaced. If a source changes after compile, the page goes stale (visible) rather than reading as confident but wrong. This directly targets the worst wiki failure mode, where staleness masquerades as authority.
  • Fix-at-source only. Errors are corrected by fixing the fact and recompiling, so a mistake never compounds into the next cycle.
  • Coverage flagging. Uncited atoms on a topic are flagged ("page is missing X"), never silently dropped. Reuses the KB agent's coverage-check capability.
  • Review gate. Account policy decides which scopes/topics require human review before publish; everything else auto-publishes.
  • Audit diff. wiki_revision.diffSummary makes every cycle's changes visible, so drift is inspectable rather than invisible.
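The grounding contract and coverage flagging reduce to two simple checks, sketched here with hypothetical names (Claim, checkSection). Treating a page with no available atoms as fully covered is an assumption made only so the ratio stays defined.

```typescript
// A synthesised claim together with the source rows it cites.
type Claim = { text: string; sourceIds: string[] };

type SectionCheck = {
  ungrounded: string[]; // claims with no citation: must not be written
  uncited: string[]; // available atoms the section never cites: flag, do not drop
  coverage: number; // fraction of available atoms actually cited
};

function checkSection(claims: Claim[], availableAtomIds: string[]): SectionCheck {
  const ungrounded: string[] = [];
  const cited = new Set<string>();
  for (const c of claims) {
    if (c.sourceIds.length === 0) ungrounded.push(c.text);
    for (const id of c.sourceIds) cited.add(id);
  }
  const uncited = availableAtomIds.filter((id) => !cited.has(id));
  const total = availableAtomIds.length;
  // Assumption: an entity with no atoms counts as fully covered.
  const coverage = total === 0 ? 1 : (total - uncited.length) / total;
  return { ungrounded, uncited, coverage };
}
```

A non-empty ungrounded list would block the write under the grounding contract, while uncited feeds the "page is missing X" flag and coverage is what gets recorded on wiki_page.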

8. Eligibility: what becomes a page

Bias the wiki toward slow-moving, high-value knowledge; leave fast-moving operational data query-only.

  • Promote: account/team procedural knowledge ("how we do things"); entities whose factCount exceeds a threshold; canonical library documents; note_type = decision; facts with high accessCount (people keep needing them).
  • Hold back: user-scoped/private knowledge (this is a shared KB); low-confidence one-offs; fast-churning items such as open tickets (the "ticket speed" trap, where constant churn makes synthesis punishing and pointless).
  • A page forms only when its entity meets a mass threshold, which also stops the wiki sprawling.
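Read as a predicate, the eligibility rules might look like this. The Candidate and Policy shapes, the fastChurning flag and the threshold fields are all assumptions for illustration; the proposal leaves the actual values and the churn measure open.

```typescript
type Candidate = {
  scope: "account" | "team" | "user";
  factCount: number;
  confidence: number;
  fastChurning: boolean; // e.g. open tickets; how churn is measured is left open
};

// Hypothetical per-account policy values.
type Policy = { minFactCount: number; minConfidence: number };

function isPageEligible(c: Candidate, p: Policy): boolean {
  if (c.scope === "user") return false; // private knowledge stays out of the shared KB
  if (c.fastChurning) return false; // the "ticket speed" trap
  if (c.confidence < p.minConfidence) return false; // low-confidence one-offs
  return c.factCount >= p.minFactCount; // mass threshold
}
```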

9. Contradiction analysis

Contradiction handling is the evidential relation family, surfaced rather than resolved. Sometimes the contradiction is the most valuable thing in the store (engineering says twelve weeks, sales promised eight); resolving it into a blended ten-week narrative destroys the signal leadership needs.

  • Detect at extraction. When a fact is extracted, find candidate conflicts (same entity and predicate, or near neighbours by embedding), run a cheap classifier to label the relation, and write a relation in the evidential family with severity, confidence and a one-line rationale, status = proposed.
  • Never auto-resolve. Both claims are kept. Cross-layer precedence flags conflicts as explicit edges rather than silently letting the higher layer win.
  • Surface in the wiki. The Contradictions section renders both sides with attribution and a conflict marker. The compiler must never average conflicting sources into a false consensus.
  • Audit sweep. A scheduled contradiction-audit Workflow scans relation.by_family_status for unresolved high-severity contradictions and raises them through user_comm digests and attention surfacing. This is the native equivalent of OpenBrain's contradiction plugin.
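The detect-at-extraction step can be sketched as below. Only the "same entity and predicate" candidate rule is shown; embedding neighbours are omitted, and the classifier is injected as a plain function because the proposal does not fix its interface. detectConflicts, Verdict and ProposedEdge are hypothetical names, and the sketch emits only contradicts edges although a real classifier could label other evidential verbs.

```typescript
type Fact = { id: string; entityId: string; predicate: string; value: string };

// Assumed classifier output shape.
type Verdict = {
  conflicts: boolean;
  severity: number;
  confidence: number;
  rationale: string;
};

// Subset of the relation row that this step would write.
type ProposedEdge = {
  verb: "contradicts";
  family: "evidential";
  sourceType: "fact";
  sourceId: string;
  targetType: "fact";
  targetId: string;
  severity: number;
  confidence: number;
  rationale: string;
  detectedBy: "classifier";
  status: "proposed";
};

function detectConflicts(
  incoming: Fact,
  existing: Fact[],
  classify: (a: Fact, b: Fact) => Verdict,
): ProposedEdge[] {
  const edges: ProposedEdge[] = [];
  for (const other of existing) {
    if (other.id === incoming.id) continue;
    if (other.entityId !== incoming.entityId) continue;
    if (other.predicate !== incoming.predicate) continue;
    if (other.value === incoming.value) continue; // identical claims cannot conflict
    const v = classify(incoming, other);
    if (!v.conflicts) continue;
    edges.push({
      verb: "contradicts",
      family: "evidential",
      sourceType: "fact",
      sourceId: incoming.id,
      targetType: "fact",
      targetId: other.id,
      severity: v.severity,
      confidence: v.confidence,
      rationale: v.rationale,
      detectedBy: "classifier",
      status: "proposed", // never auto-confirmed, never auto-resolved
    });
  }
  return edges;
}
```

Both facts stay in the store untouched; the only write is the new edge.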

10. Retrieval integration

Add wiki_page as a third retrieval source in the doc 05 §14 pipeline. An agent turn queries the wiki first for fast pre-synthesised answers (with citations), and falls through to knowledge_fact and library_item when it needs precision or freshness. The retrieval ranker should down-weight stale pages and never treat a derivation as a sourced fact.
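The two ranker requirements can be illustrated with a small scoring pass. The Hit shape, rankHits, the inferred flag and the 0.5 stale penalty are assumptions for this sketch; the doc 05 §14 pipeline may represent hits quite differently.

```typescript
type Hit = {
  source: "wiki_page" | "knowledge_fact" | "library_item" | "derivation";
  id: string;
  score: number;
  stale?: boolean; // only meaningful for wiki_page hits
};

type RankedHit = Hit & { inferred: boolean };

// Down-weights stale wiki pages and labels derivations as inferred, then sorts by score.
function rankHits(hits: Hit[], stalePenalty = 0.5): RankedHit[] {
  return hits
    .map((h) => ({
      ...h,
      score: h.source === "wiki_page" && h.stale ? h.score * stalePenalty : h.score,
      inferred: h.source === "derivation", // never presented as a sourced fact
    }))
    .sort((a, b) => b.score - a.score);
}
```

Carrying the inferred flag through to the caller lets the agent cite a derivation as reasoning rather than as evidence.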

11. Build sequence

  1. entity plus knowledge_fact.entityId, with entity resolution on the existing extract/fact_extract paths. Nothing else works well until atoms resolve to canonical nodes.
  2. relation_type (seed the vocabulary) and relation, with the extraction-time classifier writing evidential and temporal edges.
  3. Contradiction surfacing and the audit sweep (delivers value before the wiki exists).
  4. derivation, query-time first, with promotion.
  5. The wiki_* tables and the compiler, section-incremental, behind the review gate.
  6. Wiki as a retrieval source.

12. Decision record (draft ADR-022)

To be appended to decision-log.md on acceptance.


ADR-022: Compiled Wiki Layer over the Structured Knowledge Store

Date: 2026-06-02
Status: Proposed

Context: Thinklio is a query-time structured knowledge store with provenance and four-layer scope, but it has no browsable, pre-synthesised view, no canonical entity graph, no typed relations between facts, and no contradiction surfacing. A stand-alone wiki (text files synthesised at ingest) would give browsability but breaks under teams, multi-agent writes, volume and fast-moving data, and it hides what it drops.

Decision: Add a compiled wiki layer as derived data over the structured store, governed by three rules: the database is the single source of truth; the wiki is never edited directly, only regenerated; every compiled claim cites a source row. Introduce a canonical entity graph (also the business taxonomy), an app-level typed-relation vocabulary, visible derivation content for applications, and a single-writer, section-incremental compiler running as a Tier 2 Workflow. Contradiction analysis is the evidential relation family surfaced, not resolved.

Reasoning: This keeps the structured store's strengths (precise queries, multi-agent concurrent writes, scale, provenance, faithful storage of conflicting facts) while adding the wiki's strengths (browsable, fast, cross-referenced understanding). Because the wiki is always rebuilt from the authoritative database, the stand-alone-wiki failure modes do not apply: no drift (staleness is surfaced and pages are regenerated), no smoothed contradictions (the compiler renders both sides), no error compounding (fix at source and recompile), no merge conflicts (single writer for pages; many writers for atoms). Section-level granularity plus debounced compilation prevents the cost and ripple problems of full rebuilds and keeps fast-moving data from thrashing the wiki.

Implications:

  • New tables: entity, relation_type, relation, derivation, wiki_page, wiki_section, wiki_source_link, wiki_revision; knowledge_fact gains entityId.
  • The relation vocabulary is platform-seeded and immutable to end users and agents; admin-only account extension is permitted.
  • Derivations are first-class visible content, generated query-time and promoted on reuse, always marked inferred.
  • The Fact Checker agent becomes the compiler's publish gate.
  • The wiki becomes a third retrieval source alongside library_item and knowledge_fact.

13. Open questions

  1. Table naming convention. CLAUDE.md states the Convex convention is now plural table names, but 04 Data Model §2 and the entire current schema use singular (knowledge_fact, library_item, note). This proposal follows the singular convention for consistency with the file it extends. The two sources must be reconciled before any of these tables ship.
  2. Entity vs contact. This proposal models person/org entities as linking to contact via linkedContactId (knowledge node points at operational record). Confirm this rather than absorbing contact into entity or vice versa.
  3. Page scope. Pages are allowed account or team scope. Confirm team-scoped pages are wanted, given the extra visibility surface.
  4. Promotion thresholds. The accessCount threshold for derivation promotion and the factCount threshold for entity page eligibility need values, ideally policy-configurable per account.
  5. Embedding dimensions. Fixed at 1536 here to match the existing tables; confirm against the platform default in platform_config.
  6. Compiler model tiering. Which model tier synthesises sections, and whether the extraction-time contradiction classifier is a cheap model with an escalation path.

14. Revision history

| Date | Change |
| --- | --- |
| 2026-06-02 | Initial draft proposal. |