Knowledge Synthesis and Wiki Layer (Proposal)
Status: draft proposal. Not part of the canonical numbered set. If accepted, the schema sections fold into 04 Data Model, the compiler folds into 05 Persistence, Storage & Ingestion, and the decision record (ADR-022 below) is appended to the decision log. Do not run the docs index/changelog maintainer against this file until it is promoted.
Precis
Thinklio today is, in the terms of the OpenBrain/Karpathy debate, a faithful query-time structured store: facts and document chunks are captured and reasoned over on demand, with provenance and confidence attached. What it lacks is the second half of that debate, the compiled view: a browsable, cross-referenced knowledge base that reads cleanly and answers fast.
This proposal adds that compiled layer without giving up the structured store as source of truth. The design rests on one principle: the wiki is derived data, never a source. Nothing edits a wiki page directly. The database stays authoritative, and pages are regenerated from it by a single writer, so the wiki can never drift from ground truth, never silently smooth away a contradiction, and never compound an error into the next cycle.
It introduces four new layers above the existing atoms:
- A canonical entity graph (`entity`), which is both the resolution layer for facts and the hierarchical taxonomy of the business.
- An app-level typed-relation vocabulary (`relation_type`, `relation`): verbs such as supports, contradicts, supersedes, explains, grouped into families that instruct the compiler.
- Visible derivations (`derivation`): applications such as "use A to explain Y" or "what does A mean in the context of Z", generated query-time and promoted to stored content when reused.
- A compiled wiki (`wiki_page`, `wiki_section`, `wiki_source_link`, `wiki_revision`): the human- and agent-readable projection, section-incremental and fully source-cited.
Contradiction analysis is not a separate feature; it is the evidential relation family surfaced rather than resolved.
1. Background and the design fork
Every AI knowledge system has to answer one question: when does the AI do the hard thinking, at write time or at query time? Karpathy's wiki does it at write time (synthesise on ingest, browse pre-built understanding). OpenBrain does it at query time (store faithfully, reason when asked). Each breaks differently: the wiki breaks under teams, multi-agent writes, high volume and fast-moving data, and it hides what it drops; the structured store is weaker at deep synthesis, has no browsable artifact, and lets contradictions sit silently in adjacent rows.
Thinklio sits on the query-time side and should stay there for its system of record. This proposal adds a write-time projection on top, so that:
- slow-moving, high-value knowledge gets compiled once and kept current (the wiki strength), while
- fast-moving operational data and precise queries stay query-time against the structured store (the OpenBrain strength), and
- the projection is always rebuilt from the authoritative database, so the failure modes of a stand-alone wiki (drift, smoothed contradictions, error compounding, merge conflicts) do not apply.
The mental model: the database is the filing cabinet and the librarian; the wiki is a study guide the librarian rewrites from the cabinet whenever the contents change, discarding the old version rather than patching it.
2. What already exists (and is reused)
This proposal builds on, and does not replace, the current model:
- `knowledge_fact` (doc 04 §9) remains the atom of structured knowledge (subject/predicate/value, confidence, `sourceInteractionId`, embedding, four-layer scope).
- `media`, `library`, `library_item` (doc 04 §15, doc 05 §13) remain the document and chunk store.
- `note`, `item`, `contact`, `task`, `tag`/`entity_tag` (doc 04 §8) remain the structured operational records.
- The `extract` step (doc 02), the `fact_extract` processor (doc 04 §8.8) and the optional document "derive" step (doc 02) remain the selective-distillation paths that feed atoms.
- The Fact Checker agent (agent-specs §03) is reused as the publish gate for the compiler.
- The Convex Workflow component (Tier 2, ADR-018) runs the compiler durably.
The new tables sit above these. Nothing here changes the hot interactive path.
3. The entity graph (taxonomy plus resolution)
Today knowledge_fact.subject is a free string, so "Acme", "Acme Corp" and "ACME" are three different subjects and there is no tree to hang a wiki on. The entity table fixes both problems at once: it is the canonical referent that facts point to, and (via a self-referential parent) the hierarchical taxonomy of the business with detail at the leaves.
```ts
entity: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")), // never user: this is a shared KB
  scopeId: v.string(),
  kind: v.union(
    v.literal("domain"), // top-level business area (taxonomy node)
    v.literal("topic"), // concept/subject area (taxonomy node)
    v.literal("person"),
    v.literal("org"),
    v.literal("project"),
    v.literal("product"),
    v.literal("policy"),
    v.literal("concept"),
  ),
  name: v.string(),
  slug: v.string(),
  aliases: v.array(v.string()), // drives entity resolution / dedup
  description: v.optional(v.string()),
  primaryParentId: v.optional(v.id("entity")), // the spanning tree, for breadcrumbs
  linkedContactId: v.optional(v.id("contact")), // person/org reuse the CRM, not duplicate it
  embedding: v.array(v.float64()), // resolution + retrieval
  factCount: v.number(), // eligibility signal
  status: v.union(v.literal("active"), v.literal("merged"), v.literal("archived")),
  mergedIntoId: v.optional(v.id("entity")), // dedup target when two entities merge
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_account_kind", ["accountId", "kind"])
  .index("by_parent", ["primaryParentId"])
  .index("by_account_slug", ["accountId", "slug"])
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "kind"],
  }),
```
Rules.
- `kind = domain | topic` are the navigational interior nodes of the taxonomy. The other kinds are the leaf referents that carry detail.
- `primaryParentId` defines a single spanning tree so navigation and breadcrumbs are unambiguous. A node legitimately belonging under two parents keeps one primary parent here and gets a secondary `part_of` edge in `relation` (section 4). The taxonomy is therefore a DAG with a designated tree.
- `knowledge_fact` gains an optional `entityId` (and keeps `subject` as the surface string), so facts resolve to a canonical entity. Resolution uses `aliases` plus embedding similarity.
- Person and org entities link to `contact` via `linkedContactId` rather than duplicating CRM data. The entity is the knowledge-graph node; the contact remains the operational record. (See the "Entity vs contact" open question in section 13.)
- Merges are non-destructive: a duplicate is set `status = merged` with `mergedIntoId`, and reads follow the pointer.
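The alias half of resolution and the merge-pointer rule can be sketched in memory as follows. This is only an illustration under assumptions: `EntityRow`, `normalise`, `followMerge` and `resolveBySurface` are hypothetical names, the suffix list in `normalise` is invented for the example, and the embedding-similarity half of resolution is left out entirely.

```typescript
// Hypothetical in-memory shape; the real rows live in the Convex `entity` table.
type EntityRow = {
  id: string;
  name: string;
  aliases: string[];
  status: "active" | "merged" | "archived";
  mergedIntoId?: string;
};

// Assumed normaliser: lower-case, strip punctuation, drop a few legal suffixes.
function normalise(surface: string): string {
  return surface
    .toLowerCase()
    .replace(/[^a-z0-9 ]+/g, " ")
    .replace(/\b(corp|inc|ltd|llc)\b/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Reads follow mergedIntoId until a non-merged row is reached.
// The `seen` set guards against an accidental pointer cycle.
function followMerge(id: string, byId: Map<string, EntityRow>): EntityRow | undefined {
  const seen = new Set<string>();
  let row = byId.get(id);
  while (row && row.status === "merged" && row.mergedIntoId && !seen.has(row.id)) {
    seen.add(row.id);
    row = byId.get(row.mergedIntoId);
  }
  return row;
}

// Match a surface string against name + aliases, then follow any merge pointer.
function resolveBySurface(surface: string, rows: EntityRow[]): EntityRow | undefined {
  const byId = new Map<string, EntityRow>();
  for (const r of rows) byId.set(r.id, r);
  const key = normalise(surface);
  const hit = rows.find((r) => [r.name, ...r.aliases].some((a) => normalise(a) === key));
  return hit ? followMerge(hit.id, byId) : undefined;
}
```

With this normaliser, "Acme", "Acme Corp" and "ACME" all reduce to the same key, so the three surface strings land on one canonical row, and a row marked `merged` is never returned directly.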
4. Typed relations: verbs as compiler instructions
A relation is a directed, typed edge. The verb does not merely label the edge; its family tells the compiler what to do when it encounters the edge. The vocabulary is defined at the application level because it is largely structural, with controlled, admin-only extension per account. It is never arbitrarily editable by end users or agents.
4.1 The vocabulary registry
```ts
relation_type: defineTable({
  verb: v.string(), // "supports", "contradicts", "supersedes", ...
  inverseVerb: v.optional(v.string()), // "supported_by"; null when symmetric
  symmetric: v.boolean(),
  family: v.union(
    v.literal("structural"), // builds the taxonomy: is_a, part_of, instance_of, related_to
    v.literal("evidential"), // truth/confidence: supports, corroborates, challenges, contradicts, refutes
    v.literal("temporal"), // currency: supersedes, upgrades, updates, deprecates, refines
    v.literal("corrective"), // corrects, clarifies
    v.literal("causal"), // reasoning: causes, enables, explains, implies, depends_on
  ),
  polarity: v.union(v.literal("positive"), v.literal("negative"), v.literal("neutral")),
  affectsCurrency: v.boolean(), // true => newer side wins, older kept as history
  compilerDirective: v.string(), // how the wiki renders this family/verb
  isSystem: v.boolean(), // true = platform-seeded, immutable
  accountId: v.optional(v.id("account")), // null = platform-global; set = account extension
  description: v.string(),
})
  .index("by_verb", ["verb"])
  .index("by_family", ["family"])
  .index("by_account", ["accountId"]),
```
Governance. Platform-seeded verbs are isSystem = true, accountId = null, and immutable. An account admin may add account-scoped verbs (isSystem = false, accountId set) through a governed admin path only; the policy middleware denies creation or mutation of relation_type by end users and by agents. Agents may create relations using the existing vocabulary; they may not extend the vocabulary.
Seed set (representative, not exhaustive).
| Family | Verbs | Directive to compiler |
|---|---|---|
| structural | is_a / has_instance, part_of / has_part, instance_of, example_of, related_to (symmetric) | Build the taxonomy tree and "related" cross-links. |
| evidential | supports / supported_by, corroborates (sym), challenges / challenged_by, contradicts (sym), refutes / refuted_by, questions | Raise/lower confidence; contradicts/refutes force the dual-attribution block, never a blended consensus. |
| temporal | supersedes / superseded_by, upgrades, updates, deprecates, refines | Prefer the newer claim; footnote the older as history. |
| corrective | corrects / corrected_by, clarifies | Like temporal, but asserts the prior claim was wrong, not merely stale. |
| causal | causes / caused_by, enables, explains / explained_by, implies, depends_on | Feed derivations (section 5); render as reasoning links. |
4.2 The edge
```ts
relation: defineTable({
  accountId: v.id("account"),
  relationTypeId: v.id("relation_type"),
  verb: v.string(), // denormalised for query
  family: v.string(), // denormalised for sweeps
  sourceType: v.union(v.literal("fact"), v.literal("entity"), v.literal("derivation")),
  sourceId: v.string(),
  targetType: v.union(v.literal("fact"), v.literal("entity"), v.literal("derivation")),
  targetId: v.string(),
  confidence: v.number(),
  severity: v.optional(v.number()), // contradiction severity, 0..1
  rationale: v.optional(v.string()), // one line: why this edge holds
  detectedBy: v.union(v.literal("classifier"), v.literal("agent"), v.literal("human")),
  status: v.union(v.literal("proposed"), v.literal("confirmed"), v.literal("dismissed")),
  sourceInteractionId: v.optional(v.id("interaction")),
  createdByAgent: v.optional(v.id("agent")),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_source", ["sourceType", "sourceId"])
  .index("by_target", ["targetType", "targetId"])
  .index("by_verb", ["verb"])
  .index("by_family_status", ["family", "status"]), // contradiction audit sweeps
```
Rules.
- Endpoints are polymorphic so the same table carries fact-to-fact, entity-to-entity (structural), and derivation-linked edges.
- Symmetric verbs are stored once; the reader treats them as bidirectional via `relation_type.symmetric`.
- Machine-detected edges land `status = proposed`; confirmation can be by human, by corroborating evidence, or by policy threshold. Dismissed edges are kept, not deleted, for audit.
- Cross-layer conflicts (where the four-layer precedence account > agent > team > user would otherwise silently pick a winner) are written as explicit `contradicts` edges and surfaced, not resolved invisibly.
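The "stored once, read as bidirectional" rule might look like the sketch below. It is a simplified illustration, not the proposal's reader: `edgesFor`, `Edge`, `RelationTypeRow` and `EdgeView` are hypothetical names, endpoints are reduced to a bare id instead of the polymorphic type-plus-id pair, and hiding dismissed edges from readers is an assumption (the rules only say dismissed edges are kept for audit).

```typescript
// Hypothetical, trimmed-down shapes of `relation_type` and `relation` rows.
type RelationTypeRow = { verb: string; inverseVerb?: string; symmetric: boolean };
type Edge = {
  verb: string;
  sourceId: string;
  targetId: string;
  status: "proposed" | "confirmed" | "dismissed";
};
type EdgeView = { verb: string; otherId: string };

// List the edges touching `nodeId`, phrased from that node's point of view.
function edgesFor(nodeId: string, edges: Edge[], types: Map<string, RelationTypeRow>): EdgeView[] {
  const out: EdgeView[] = [];
  for (const e of edges) {
    if (e.status === "dismissed") continue; // assumption: kept for audit, hidden from readers
    const t = types.get(e.verb);
    if (!t) continue; // unknown verb: not in the vocabulary
    if (e.sourceId === nodeId) {
      out.push({ verb: e.verb, otherId: e.targetId });
    } else if (e.targetId === nodeId) {
      // Symmetric verbs read the same from either end; directed verbs read as their inverse.
      const verb = t.symmetric ? e.verb : t.inverseVerb;
      if (verb) out.push({ verb, otherId: e.sourceId });
    }
  }
  return out;
}
```

A single stored `contradicts` row between two facts therefore shows up on both facts, while a `supports` row shows up as `supported_by` on its target.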
5. Derivations: applications as visible content
Extraction pulls a claim out of a source (truth-preserving, grounded). A derivation projects an idea onto a question or context: "use A to explain Y", "what does A mean in the context of Z" (generative, expansive). These are epistemically weaker and more fragile than facts, so they are modelled separately, always marked inferred, and grounded by reasoning lineage rather than quotation. Per the design decision, derivations are visible first-class content, not a hidden agent-only layer.
```ts
derivation: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")),
  scopeId: v.string(),
  kind: v.union(
    v.literal("explanation"), // use A to explain Y
    v.literal("application"), // what does A mean in the context of Z
    v.literal("implication"),
    v.literal("comparison"),
    v.literal("summary"),
  ),
  question: v.string(), // the lens/prompt
  body: v.string(), // generated markdown
  entityId: v.optional(v.id("entity")), // primary subject (the "A")
  contextEntityId: v.optional(v.id("entity")), // the context (the "Z")
  inputRefs: v.array(v.object({ // reasoning lineage = provenance
    type: v.string(), // fact | entity | library_item | derivation
    id: v.string(),
  })),
  truthStatus: v.literal("inferred"), // always weaker than a sourced fact
  confidence: v.number(),
  status: v.union(v.literal("ephemeral"), v.literal("promoted")), // query-time vs stored
  accessCount: v.number(),
  lastAccessed: v.number(),
  inputsAsOf: v.number(), // staleness watermark
  reviewState: v.union(v.literal("machine"), v.literal("human_reviewed"), v.literal("flagged")),
  embedding: v.array(v.float64()),
  createdByAgent: v.optional(v.id("agent")),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_entity", ["entityId"])
  .index("by_status_access", ["status", "accessCount"]) // promotion candidates
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "scope", "scopeId"],
  }),
```
Lifecycle (the write-time/query-time bridge). A derivation is generated query-time when asked (the OpenBrain "think when needed" mode), stored ephemeral. When accessCount crosses a threshold it is promoted (status = promoted), gains a review pass, and becomes eligible to surface as an Applications section on the relevant wiki page (the Karpathy "compile once, keep current" mode). The promotion threshold is the dial between the two paradigms.
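As a rough sketch of that lifecycle, the promotion step could be a pure function applied on each access. `recordAccess` and `promoteAt` are hypothetical names, "crosses a threshold" is read here as "reaches or exceeds", and the review pass that follows promotion is not modelled.

```typescript
// Hypothetical subset of a `derivation` row.
type DerivationState = {
  status: "ephemeral" | "promoted";
  accessCount: number;
  lastAccessed: number;
};

// Count one access and promote once accessCount reaches the threshold.
// Promotion is one-way in this sketch: a promoted derivation stays promoted.
function recordAccess(d: DerivationState, now: number, promoteAt: number): DerivationState {
  const accessCount = d.accessCount + 1;
  const status: DerivationState["status"] =
    d.status === "promoted" || accessCount >= promoteAt ? "promoted" : "ephemeral";
  return { ...d, accessCount, lastAccessed: now, status };
}
```

Raising `promoteAt` keeps more derivations on the query-time side; lowering it moves more of them into compiled wiki content.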
6. The wiki tables
A wiki page is a 1:1 projection of a page-worthy entity, broken into sections so recompilation is localised. Pages carry embeddings and become a third retrieval source alongside library_item and knowledge_fact (doc 05 §14.2).
```ts
wiki_page: defineTable({
  accountId: v.id("account"),
  scope: v.union(v.literal("account"), v.literal("team")),
  scopeId: v.string(),
  entityId: v.id("entity"),
  slug: v.string(),
  title: v.string(),
  summary: v.string(),
  embedding: v.array(v.float64()),
  status: v.union(
    v.literal("draft"),
    v.literal("published"),
    v.literal("stale"), // inputs changed since last compile; surfaced, not hidden
    v.literal("superseded"),
  ),
  reviewState: v.union(v.literal("machine"), v.literal("human_reviewed"), v.literal("flagged")),
  confidence: v.number(),
  coverage: v.number(), // fraction of the entity's atoms actually cited
  inputsAsOf: v.number(), // watermark: max updatedAt of sources at compile time
  version: v.number(),
  supersedesPageId: v.optional(v.id("wiki_page")),
  dirty: v.boolean(), // needs recompile
  lockToken: v.optional(v.string()), // single-writer lease
  lockedAt: v.optional(v.number()),
  compiledAt: v.optional(v.number()),
  updatedAt: v.number(),
})
  .index("by_account", ["accountId"])
  .index("by_entity", ["entityId"])
  .index("by_account_slug", ["accountId", "slug"])
  .index("by_status", ["status"])
  .index("by_dirty", ["dirty"])
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["accountId", "scope", "scopeId"],
  }),

wiki_section: defineTable({
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  heading: v.string(),
  order: v.number(),
  sectionKind: v.union(
    v.literal("overview"),
    v.literal("facts"),
    v.literal("relationships"),
    v.literal("contradictions"),
    v.literal("applications"), // promoted derivations
    v.literal("sources"),
  ),
  body: v.string(), // markdown
  inputHash: v.string(), // hash of source ids+versions; unchanged => skip LLM
  inputsAsOf: v.number(),
  confidence: v.number(),
  dirty: v.boolean(),
  updatedAt: v.number(),
})
  .index("by_page_order", ["pageId", "order"])
  .index("by_dirty", ["dirty"]),

wiki_source_link: defineTable({
  sectionId: v.id("wiki_section"),
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  sourceType: v.union(
    v.literal("fact"),
    v.literal("media"),
    v.literal("library_item"),
    v.literal("note"),
    v.literal("item"),
    v.literal("derivation"),
  ),
  sourceId: v.string(),
  claim: v.optional(v.string()), // the specific claim used from this source
  confidence: v.number(),
})
  .index("by_section", ["sectionId"])
  .index("by_page", ["pageId"])
  .index("by_source", ["sourceType", "sourceId"]), // reverse lookup: dirty pages when a source changes

wiki_revision: defineTable({
  pageId: v.id("wiki_page"),
  accountId: v.id("account"),
  version: v.number(),
  snapshot: v.any(), // page + sections at compile time
  diffSummary: v.optional(v.string()), // what changed vs prior version
  compiledAt: v.number(),
})
  .index("by_page_version", ["pageId", "version"]),
```
Why section-level granularity matters. When one fact changes, the reverse wiki_source_link.by_source lookup marks only the sections that cite it dirty. The compiler then recompiles only those sections, not the whole page, and the inputHash short-circuit skips the LLM call for any dirty section whose inputs turn out to be unchanged. This is the main defence against the cost and drift of full-wiki rebuilds, and against the "every change ripples across a dozen pages" failure the structured-vs-wiki debate warns about.
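A minimal sketch of the reverse lookup and the hash check, using in-memory arrays in place of the Convex indexes. All function names here are hypothetical, and FNV-1a is only a stand-in: the proposal does not say which hash `inputHash` uses.

```typescript
// Hypothetical, trimmed-down shapes of `wiki_source_link` and `wiki_section` rows.
type SourceLink = { sectionId: string; sourceType: string; sourceId: string };
type SectionRow = { id: string; inputHash: string; dirty: boolean };
type InputRef = { id: string; version: number };

// 32-bit FNV-1a over a string, returned as 8 hex digits (stand-in hash).
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// Hash of source ids + versions; sorting makes it independent of input order.
function computeInputHash(inputs: InputRef[]): string {
  const parts = inputs.map((s) => `${s.id}@${s.version}`).sort();
  return fnv1a(parts.join("|"));
}

// Reverse lookup: mark only the sections that cite the changed source.
function markDirtyForSource(
  sourceType: string,
  sourceId: string,
  links: SourceLink[],
  sections: Map<string, SectionRow>,
): string[] {
  const touched: string[] = [];
  for (const l of links) {
    if (l.sourceType !== sourceType || l.sourceId !== sourceId) continue;
    const s = sections.get(l.sectionId);
    if (s && !s.dirty) {
      s.dirty = true;
      touched.push(s.id);
    }
  }
  return touched;
}

// The short-circuit: a dirty section only needs the LLM if its inputs really changed.
function needsSynthesis(section: SectionRow, inputs: InputRef[]): boolean {
  return computeInputHash(inputs) !== section.inputHash;
}
```

Sections that do not cite the changed source are never touched, and a section marked dirty by a no-op update is cleared without a synthesis call.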
7. The compiler
The compiler is a single-writer, durable Tier 2 Workflow. Many agents write atoms concurrently; only the compiler writes pages, so there are no merge conflicts.
7.1 Triggers
- Event-driven dirty marking (cheap, immediate). On `knowledge.extracted`, on source-document update, and on new evidential/temporal relations, mark the affected sections dirty via `wiki_source_link.by_source` and `relation.by_target`. No synthesis yet.
- Debounced scheduled compile (Tier 2). A daily/weekly sweep recompiles dirty sections only. Debounce means a topic touched forty times today compiles once, not forty times. This is what keeps fast-moving data from thrashing the wiki.
- On-demand. "Recompile this topic now."
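The debounce behaviour in the triggers above can be illustrated with a small in-memory queue. `DirtyQueue` is a hypothetical stand-in: in the proposal the `dirty` flag on the section row plays this role, and the sweep is a scheduled Workflow, not a method call.

```typescript
// Hypothetical in-memory stand-in for the per-section dirty flag.
class DirtyQueue {
  // sectionId -> number of touches since the last sweep
  private pending = new Map<string, number>();

  // Event-driven marking: cheap, no synthesis.
  touch(sectionId: string): void {
    this.pending.set(sectionId, (this.pending.get(sectionId) ?? 0) + 1);
  }

  // Scheduled sweep: hand back each dirty section exactly once, then reset.
  drain(): string[] {
    const ids = Array.from(this.pending.keys());
    this.pending.clear();
    return ids;
  }
}
```

However many times a section is touched between sweeps, `drain` returns it once, so the compile cost tracks the number of distinct dirty sections rather than the number of events.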
7.2 Per-section compile steps
```text
selectDirty     -> dirty sections, debounced and batched
acquireLock     -> per-page lease (lockToken); single writer
gatherInputs    -> entity facts + relations + promoted derivations + source rows
hashInputs      -> if inputHash unchanged, clear dirty, release lock and STOP (no LLM)
synthesise      -> generate body under the grounding contract + verb-family directives
factCheckGate   -> Fact Checker agent; block on unsourced/contradicted claims
coverageCheck   -> cited vs available atoms; record coverage; flag uncited atoms
diffAndSnapshot -> write wiki_revision with diffSummary
publishOrFlag   -> reviewState policy: publish, or flag for human review
finalise        -> set inputsAsOf, clear dirty, release lock, emit wiki.page.compiled
```
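The lease and the hash short-circuit in the steps above could be sketched as below. This is an in-memory illustration under assumptions: the names are hypothetical, the 60-second lease lifetime is invented (the schema only provides `lockToken` and `lockedAt`), and the gate, coverage and snapshot steps are omitted.

```typescript
// Hypothetical subsets of `wiki_page` and `wiki_section` rows.
type PageLock = { lockToken?: string; lockedAt?: number };
type SectionState = { inputHash: string; dirty: boolean; body: string };
type Outcome = "locked" | "skipped" | "compiled";

const LEASE_TTL_MS = 60000; // assumed lease lifetime, not from the proposal

// Take the per-page lease unless another writer holds a live one.
function acquireLease(page: PageLock, token: string, now: number): boolean {
  const live =
    page.lockToken !== undefined &&
    page.lockedAt !== undefined &&
    now - page.lockedAt < LEASE_TTL_MS;
  if (live && page.lockToken !== token) return false;
  page.lockToken = token;
  page.lockedAt = now;
  return true;
}

// Release only if we still own the lease.
function releaseLease(page: PageLock, token: string): void {
  if (page.lockToken === token) {
    delete page.lockToken;
    delete page.lockedAt;
  }
}

// acquireLock -> hashInputs -> synthesise -> finalise, with the lock always released.
function compileSection(
  page: PageLock,
  section: SectionState,
  currentHash: string,
  token: string,
  now: number,
  synthesise: () => string,
): Outcome {
  if (!acquireLease(page, token, now)) return "locked";
  try {
    if (section.inputHash === currentHash) {
      section.dirty = false; // inputs unchanged: no LLM call
      return "skipped";
    }
    section.body = synthesise();
    section.inputHash = currentHash;
    section.dirty = false;
    return "compiled";
  } finally {
    releaseLease(page, token);
  }
}
```

The `finally` block is the point of the sketch: every exit path, including the early stop on an unchanged hash, gives the lease back.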
7.3 Accuracy safeguards
- Grounding contract. The compiler may assert only claims backed by a source row, each written to `wiki_source_link`. If it cannot cite, it does not write. This is the primary anti-hallucination control.
- Staleness is surfaced. If a source changes after compile, the page goes `stale` (visible) rather than reading as confident but wrong. This directly targets the worst wiki failure mode, where staleness masquerades as authority.
- Fix-at-source only. Errors are corrected by fixing the fact and recompiling, so a mistake never compounds into the next cycle.
- Coverage flagging. Uncited atoms on a topic are flagged ("page is missing X"), never silently dropped. Reuses the KB agent's coverage-check capability.
- Review gate. Account policy decides which scopes/topics require human review before publish; everything else auto-publishes.
- Audit diff. `wiki_revision.diffSummary` makes every cycle's changes visible, so drift is inspectable rather than invisible.
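The grounding contract and coverage flagging can be pictured as two small checks over the synthesiser's output. This is a sketch with hypothetical names (`Claim`, `enforceGrounding`, `coverageOf`); treating an entity with no atoms as fully covered is an assumption made only to avoid dividing by zero.

```typescript
// Hypothetical shape of one synthesised claim and the source rows it cites.
type Claim = { text: string; sourceIds: string[] };

// Grounding contract: a claim survives only if it cites at least one known source row.
function enforceGrounding(
  claims: Claim[],
  knownSourceIds: Set<string>,
): { kept: Claim[]; dropped: Claim[] } {
  const kept: Claim[] = [];
  const dropped: Claim[] = [];
  for (const c of claims) {
    const cited = c.sourceIds.filter((id) => knownSourceIds.has(id));
    if (cited.length > 0) kept.push({ text: c.text, sourceIds: cited });
    else dropped.push(c);
  }
  return { kept, dropped };
}

// Coverage: fraction of available atoms actually cited, plus the list to flag.
function coverageOf(
  kept: Claim[],
  availableAtomIds: string[],
): { fraction: number; uncited: string[] } {
  const cited = new Set<string>();
  for (const c of kept) for (const id of c.sourceIds) cited.add(id);
  const uncited = availableAtomIds.filter((id) => !cited.has(id));
  const total = availableAtomIds.length;
  const fraction = total === 0 ? 1 : (total - uncited.length) / total; // assumption for the empty case
  return { fraction, uncited };
}
```

The `uncited` list is what would drive the "page is missing X" flag, and `fraction` corresponds to the `coverage` field on `wiki_page`.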
8. Eligibility: what becomes a page
Bias the wiki toward slow-moving, high-value knowledge; leave fast-moving operational data query-only.
- Promote: account/team procedural knowledge ("how we do things"); entities whose `factCount` exceeds a threshold; canonical library documents; `note_type = decision`; facts with high `accessCount` (people keep needing them).
- Hold back: user-scoped/private knowledge (this is a shared KB); low-confidence one-offs; fast-churning items such as open tickets (the "ticket speed" trap, where constant churn makes synthesis punishing and pointless).
- A page forms only when its entity meets a mass threshold, which also stops the wiki sprawling.
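Expressed as a predicate, the promote and hold-back rules above might reduce to something like this. Every name and threshold is hypothetical: `churnPerDay` in particular is an invented stand-in for "fast-churning", which the proposal does not define numerically.

```typescript
// Hypothetical candidate summary and per-account policy.
type PageCandidate = {
  scope: "account" | "team" | "user";
  factCount: number;
  confidence: number;
  churnPerDay: number; // invented churn measure for illustration
};
type EligibilityPolicy = {
  minFactCount: number;
  minConfidence: number;
  maxChurnPerDay: number;
};

// Hold back private, low-confidence or fast-churning candidates; then apply the mass threshold.
function isPageEligible(c: PageCandidate, p: EligibilityPolicy): boolean {
  if (c.scope === "user") return false; // shared KB only
  if (c.confidence < p.minConfidence) return false; // low-confidence one-offs
  if (c.churnPerDay > p.maxChurnPerDay) return false; // the "ticket speed" trap
  return c.factCount >= p.minFactCount; // mass threshold
}
```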
9. Contradiction analysis
Contradiction handling is the evidential relation family, surfaced rather than resolved. Sometimes the contradiction is the most valuable thing in the store (engineering says twelve weeks, sales promised eight); resolving it into a blended ten-week narrative destroys the signal leadership needs.
- Detect at extraction. When a fact is extracted, find candidate conflicts (same entity and predicate, or near neighbours by embedding), run a cheap classifier to label the relation, and write a `relation` in the evidential family with `severity`, `confidence` and a one-line `rationale`, `status = proposed`.
- Never auto-resolve. Both claims are kept. Cross-layer precedence flags conflicts as explicit edges rather than silently letting the higher layer win.
- Surface in the wiki. The Contradictions section renders both sides with attribution and a conflict marker. The compiler must never average conflicting sources into a false consensus.
- Audit sweep. A scheduled contradiction-audit Workflow scans `relation.by_family_status` for unresolved high-severity contradictions and raises them through `user_comm` digests and attention surfacing. This is the native equivalent of OpenBrain's contradiction plugin.
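The detect and surface steps can be sketched with the twelve-versus-eight-weeks example. This is a deliberately crude illustration: a plain value mismatch on the same entity and predicate stands in for the classifier, the embedding near-neighbour search is omitted, and `FactRow`, `ProposedEdge`, `proposeConflicts` and `renderConflict` are hypothetical names.

```typescript
// Hypothetical, trimmed-down fact and proposed-edge shapes.
type FactRow = { id: string; entityId: string; predicate: string; value: string; source: string };
type ProposedEdge = {
  verb: "contradicts";
  family: "evidential";
  sourceId: string;
  targetId: string;
  status: "proposed";
  detectedBy: "classifier";
  rationale: string;
};

// Candidate conflicts: same entity and predicate, different value. Nothing is resolved or deleted.
function proposeConflicts(incoming: FactRow, existing: FactRow[]): ProposedEdge[] {
  return existing
    .filter(
      (f) =>
        f.id !== incoming.id &&
        f.entityId === incoming.entityId &&
        f.predicate === incoming.predicate &&
        f.value !== incoming.value,
    )
    .map(
      (f): ProposedEdge => ({
        verb: "contradicts",
        family: "evidential",
        sourceId: incoming.id,
        targetId: f.id,
        status: "proposed",
        detectedBy: "classifier",
        rationale: `${incoming.predicate}: "${incoming.value}" vs "${f.value}"`,
      }),
    );
}

// Dual attribution: both sides rendered with their sources, never averaged.
function renderConflict(a: FactRow, b: FactRow): string {
  return [
    `CONFLICT on ${a.predicate}:`,
    `- ${a.value} (source: ${a.source})`,
    `- ${b.value} (source: ${b.source})`,
  ].join("\n");
}
```

Both claims stay in the store; the only thing written is a `proposed` edge that the audit sweep and the Contradictions section can pick up.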
10. Retrieval integration
Add wiki_page as a third retrieval source in the doc 05 §14 pipeline. An agent turn queries the wiki first for fast pre-synthesised answers (with citations), and falls through to knowledge_fact and library_item when it needs precision or freshness. The retrieval ranker should down-weight stale pages and never treat a derivation as a sourced fact.
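One way to picture the two ranking rules is a post-processing pass over retrieval hits. The names and the 0.5 stale penalty are assumptions made for illustration; the proposal does not specify how strongly stale pages should be down-weighted.

```typescript
// Hypothetical retrieval hit from any of the sources.
type Hit = {
  source: "wiki_page" | "knowledge_fact" | "library_item" | "derivation";
  id: string;
  score: number;
  stale?: boolean;
};
type RankedHit = Hit & { adjusted: number; sourced: boolean };

// Down-weight stale pages and label derivations as not sourced, then sort by adjusted score.
function rankHits(hits: Hit[], stalePenalty: number = 0.5): RankedHit[] {
  return hits
    .map(
      (h): RankedHit => ({
        ...h,
        adjusted: h.stale ? h.score * stalePenalty : h.score,
        sourced: h.source !== "derivation", // derivations are inferred, never sourced facts
      }),
    )
    .sort((a, b) => b.adjusted - a.adjusted);
}
```

A consumer can then prefer fresh wiki hits for speed, fall through to facts and library items when the top wiki hit is stale, and keep the `sourced` flag attached so inferred content is never cited as fact.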
11. Build sequence
1. `entity` plus `knowledge_fact.entityId`, with entity resolution on the existing extract/`fact_extract` paths. Nothing else works well until atoms resolve to canonical nodes.
2. `relation_type` (seed the vocabulary) and `relation`, with the extraction-time classifier writing evidential and temporal edges.
3. Contradiction surfacing and the audit sweep (delivers value before the wiki exists).
4. `derivation`, query-time first, with promotion.
5. The `wiki_*` tables and the compiler, section-incremental, behind the review gate.
6. Wiki as a retrieval source.
12. Decision record (draft ADR-022)
To be appended to decision-log.md on acceptance.
ADR-022: Compiled Wiki Layer over the Structured Knowledge Store
Date: 2026-06-02
Status: Proposed
Context: Thinklio is a query-time structured knowledge store with provenance and four-layer scope, but it has no browsable, pre-synthesised view, no canonical entity graph, no typed relations between facts, and no contradiction surfacing. A stand-alone wiki (text files synthesised at ingest) would give browsability but breaks under teams, multi-agent writes, volume and fast-moving data, and it hides what it drops.
Decision: Add a compiled wiki layer as derived data over the structured store, governed by three rules: the database is the single source of truth; the wiki is never edited directly, only regenerated; every compiled claim cites a source row. Introduce a canonical entity graph (also the business taxonomy), an app-level typed-relation vocabulary, visible derivation content for applications, and a single-writer, section-incremental compiler running as a Tier 2 Workflow. Contradiction analysis is the evidential relation family surfaced, not resolved.
Reasoning: This keeps the structured store's strengths (precise queries, multi-agent concurrent writes, scale, provenance, faithful storage of conflicting facts) while adding the wiki's strengths (browsable, fast, cross-referenced understanding). Because the wiki is always rebuilt from the authoritative database, the stand-alone-wiki failure modes do not apply: no drift (staleness is surfaced and pages are regenerated), no smoothed contradictions (the compiler renders both sides), no error compounding (fix at source and recompile), no merge conflicts (single writer for pages; many writers for atoms). Section-level granularity plus debounced compilation prevents the cost and ripple problems of full rebuilds and keeps fast-moving data from thrashing the wiki.
Implications:
- New tables: `entity`, `relation_type`, `relation`, `derivation`, `wiki_page`, `wiki_section`, `wiki_source_link`, `wiki_revision`; `knowledge_fact` gains `entityId`.
- The relation vocabulary is platform-seeded and immutable to end users and agents; admin-only account extension is permitted.
- Derivations are first-class visible content, generated query-time and promoted on reuse, always marked `inferred`.
- The Fact Checker agent becomes the compiler's publish gate.
- The wiki becomes a third retrieval source alongside `library_item` and `knowledge_fact`.
13. Open questions
- Table naming convention. `CLAUDE.md` states the Convex convention is now plural table names, but 04 Data Model §2 and the entire current schema use singular (`knowledge_fact`, `library_item`, `note`). This proposal follows the singular convention for consistency with the file it extends. The two sources must be reconciled before any of these tables ship.
- Entity vs contact. This proposal models person/org entities as linking to `contact` via `linkedContactId` (knowledge node points at operational record). Confirm this rather than absorbing `contact` into `entity` or vice versa.
- Page scope. Pages are allowed `account` or `team` scope. Confirm team-scoped pages are wanted, given the extra visibility surface.
- Promotion thresholds. The `accessCount` threshold for derivation promotion and the `factCount` threshold for entity page eligibility need values, ideally policy-configurable per account.
- Embedding dimensions. Fixed at 1536 here to match the existing tables; confirm against the platform default in `platform_config`.
- Compiler model tiering. Which model tier synthesises sections, and whether the extraction-time contradiction classifier is a cheap model with an escalation path.
14. Revision history
| Date | Change |
|---|---|
| 2026-06-02 | Initial draft proposal. |