Persistence, Storage & Ingestion¶
Precis¶
Thinklio's persistence layer is Convex. All application data, from user profiles through messages, jobs, policies, and knowledge facts, lives as documents in Convex tables. The file layer is Cloudflare R2, with metadata tracked in Convex so uploads, ingestion, and governance all stay inside the same transactional boundary. There is no separate relational database, no Redis cache, no Supabase. Convex's reactive query engine makes caching structural rather than opt-in: every client subscription automatically reflects the latest database state, so there is no cache to invalidate and no background refresh to orchestrate.
This document is the authoritative reference for how Thinklio uses those two storage systems. It covers index design for the query shapes Thinklio actually performs, reactive query and mutation patterns, native vector search for knowledge retrieval, the R2 bucket architecture and presigned-URL upload flow, the media processing pipeline that turns uploaded files into indexed, retrievable content, the document intelligence layer that handles parsing and chunking, the library system that gives agents document-grounded context, backup and retention, and the execution tiering rules that keep ingestion within Convex's concurrency budget.
Entity definitions, scope rules, and field-level schemas are defined in 04 Data Model and are not repeated here. Convex platform capabilities are catalogued in 11 Convex Reference. The system-level architecture that places persistence inside the Convex-first design is in 02 System Architecture. The retired Postgres/Supabase/Redis design is preserved under archive/legacy-postgres-persistence-design.md and referenced where concepts still inform current design.
Table of contents¶
- Purpose and scope
- Storage architecture overview
- The Convex database as platform storage
- Schema, indexes, and query patterns
- Reactive queries and live data flow
- Writes, transactions, and concurrency
- Vector search and semantic retrieval
- Cloudflare R2 architecture
- File metadata, uploads, and presigned URLs
- Media system
- Media processing pipeline
- Document intelligence
- Library system
- Retrieval integration
- Media and library API
- Data lifecycle, archiving, and retention
- Backup and recovery
- Ingestion under the execution tier budget
- Monitoring
- Implementation phases
- Revision history
1. Purpose and scope¶
This document describes how the entities defined in 04 Data Model physically persist, how files in Cloudflare R2 are organised, uploaded, and linked to the data model, and how raw files become retrievable knowledge through the media and library pipelines.
Scope:
- Convex database as the persistence layer for all structured application data.
- Cloudflare R2 as the file storage layer, with metadata tracked in Convex.
- Index design for query patterns common across agents, channels, jobs, and the catalogue.
- Reactive query and mutation semantics.
- Native Convex vector search for knowledge retrieval.
- Media, processing, document intelligence, and the library system for document-grounded retrieval.
- Data retention, soft-deletion, archiving, and backup.
- Operational concerns including monitoring and execution-tier budgeting.
Out of scope:
- Entity definitions, relationships, and scope rules live in 04 Data Model.
- Convex platform capabilities, component directory, and platform-level guarantees live in 11 Convex Reference.
- The event bus, durable execution harness, channels, and messaging UX live in 06 Events, Channels & Messaging.
- Row-level security, access policies, and governance sit in 07 Security & Governance. This document covers only the application-layer scoping enforced in every query and mutation.
Related archival material: the previous generation of this design, covering PostgreSQL schemas, RLS, Redis caching, connection pooling, and the Supabase platform, is preserved in archive/legacy-postgres-persistence-design.md.
2. Storage architecture overview¶
Three physical layers, each with a clear purpose:
| Layer | System | What lives there |
|---|---|---|
| Structured application state | Convex database | Every entity from doc 04: users, accounts, teams, agents, channels, messages, jobs, policies, knowledge facts, library items, credit ledger, audit events. |
| Files and binary blobs | Cloudflare R2 | Uploaded documents, avatars, message attachments, generated exports, audio samples. File bytes only. |
| Ephemeral working state | Convex reactive query cache (automatic) | No separate cache to configure. Convex memoises query results and invalidates them automatically on underlying writes. No application code runs against a cache. |
What was Redis in the legacy architecture is now intrinsic to Convex: query caching, pub/sub, session state, rate counters. What was Supabase is now Convex + Clerk. What was Postgres is now Convex. The operational surface is configuration of two managed services rather than provisioning of three database clusters.
2.1 Boundary between Convex and R2¶
Everything that supports indexed query, transactional write, or reactive subscription lives in Convex. Anything that is a raw blob, large enough to bust the 1 MB per-document size limit, or valuable to access via direct URL (for CDN, presigned download, or third-party ingestion) lives in R2.
Concretely:
- A message lives in Convex. Its attachment bytes live in R2. The Convex message document stores the R2 key.
- An uploaded PDF: bytes in R2, media record in Convex, derived library_item chunks in Convex with vector embeddings.
- An avatar: bytes in R2, reference URL in the user_profile document in Convex.
- An audit event: entirely in Convex. Events are cheap, structured, and searchable.
2.2 Why not also use Convex file storage¶
Convex has built-in file storage (see 11 Convex Reference for the capability). Thinklio uses R2 instead for three reasons: R2 is cheaper at scale, R2 supports account-owned bring-your-own-bucket (BYOB) for enterprise deployments, and R2 can serve files directly via worker or signed URL without routing through Convex actions. Convex file storage remains available for short-lived internal artefacts where the operational simplicity is worth the cost.
2.3 Governance boundary¶
Every Convex query and mutation runs through the governance middleware defined in 07 Security & Governance. Account, team, and user scoping is enforced in process at the application layer: every helper that reads a record validates that the caller is permitted. Convex has no row-level security equivalent to Postgres RLS. The discipline shifts from "database denies access" to "every query function resolves the caller, verifies scope, and returns only what is permitted." The shift is not a regression: it is cheaper, more precise, and testable in a single language.
3. The Convex database as platform storage¶
Convex is a reactive document database with strong-consistency single-document writes, multi-document transactions inside mutations, and native reactivity. Rather than SQL, every interaction is a TypeScript function: a query reads, a mutation writes, an action is a non-transactional escape hatch for external I/O.
3.1 Shape of the database¶
- No migrations. Schema changes ship with code. Convex validates the schema on deploy. Adding a field is a non-event; removing one is a code-only migration pattern documented in 12 Developer Guide.
- Documents, not rows. Each record is a JSON-ish value. Arbitrary nesting is allowed but Thinklio flattens wherever a field is queried or indexed.
- IDs are strings. Every document has an _id of the form j987abc..., with a _creationTime millisecond timestamp. There is no surrogate integer key.
- No joins. Relations are modelled by storing the foreign _id and looking up with a dedicated query. This forces query shapes to be explicit and indexable.
- Transactional writes. A mutation reads and writes many documents atomically within a single execution. Transactions do not span mutations.
3.2 Size and cardinality limits¶
Convex imposes a 1 MB per-document cap. Thinklio avoids it by design:
- Large text payloads (documents, summaries, notes over a threshold) store the body in R2 and a pointer in Convex.
- Audit events are compact and never exceed the cap in practice.
- Knowledge facts and library items are designed to be tens to hundreds of kilobytes, well inside the limit.
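The first rule above (body in R2, pointer in Convex) can be sketched as a small placement function. This is a minimal sketch, not the platform's implementation: the 256 KiB threshold, the `StoredBody` shape, and the function name are assumptions made for illustration.

```typescript
// Hypothetical threshold, chosen to stay well under the 1 MB per-document cap.
const INLINE_LIMIT_BYTES = 256 * 1024;

type StoredBody =
  | { kind: "inline"; text: string }
  | { kind: "r2"; r2Key: string; sizeBytes: number };

// Decide whether a text payload stays in the document or is offloaded to R2.
// When "r2" is returned, the caller is assumed to upload the bytes to r2Key.
function placeBody(text: string, r2Key: string): StoredBody {
  const sizeBytes = new TextEncoder().encode(text).length; // UTF-8 size, not string length
  return sizeBytes <= INLINE_LIMIT_BYTES
    ? { kind: "inline", text }
    : { kind: "r2", r2Key, sizeBytes };
}

const small = placeBody("short note", "shared/acc_abc/documents/note_1.txt");
const large = placeBody("x".repeat(300 * 1024), "shared/acc_abc/documents/note_2.txt");
console.log(small.kind, large.kind); // inline r2
```

Measuring the encoded byte length rather than the string length matters because multi-byte characters count against the document cap.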
Total database size scales with Convex's storage tier. The Professional plan offers generous storage; beyond that self-hosted Convex removes the ceiling entirely.
3.3 Type system and validators¶
Every table has a v.* validator defined in convex/schema.ts. The validator serves three roles: schema enforcement on write, generated TypeScript types for queries and mutations, and documentation for anyone reading the schema. The full schema is version-controlled and is the canonical source of truth for entity shapes; 04 Data Model is the prose reference.
4. Schema, indexes, and query patterns¶
Thinklio defines the full database schema in convex/schema.ts. 04 Data Model is the human-readable reference; the file itself is the source of truth. This section covers the index design that makes Thinklio's query patterns efficient.
4.1 Index-before-filter discipline¶
Convex queries can do two things: read by document _id in O(1), or scan an index with an optional tail filter. The rule enforced in every query helper is: always scan an index, never a full table. A full-table scan in Convex is a linear read and a latency cliff. A properly indexed query is a range scan bounded by the index.
Every entity that is queried by an attribute other than _id carries an index for that attribute. Compound indexes are preferred over filters where a query has more than one constant predicate.
4.2 Standard index shapes¶
| Pattern | Shape | Where it applies |
|---|---|---|
| Tenant scope | .index("by_account", ["accountId"]) | Every tenant-scoped entity. First thing every query helper checks. |
| Tenant + status | .index("by_account_status", ["accountId", "status"]) | Open jobs for an account, active agents for an account, pending media ingestion. |
| Tenant + time | .index("by_account_created", ["accountId"]) | Recent messages, recent audit events, time-windowed listings. Convex appends _creationTime to every index automatically, so it is not declared explicitly. |
| User scope | .index("by_user", ["userId"]) | Personal notes, user-scoped knowledge facts, user channels. |
| Channel scope | .index("by_chat", ["chatId"]) | Messages in a channel, the most common hot-path query. |
| Ownership + parent | .index("by_parent", ["parentId"]) | Threaded replies, child jobs, sub-deliveries. Ordered by the implicit trailing _creationTime. |
4.3 Representative queries¶
Channel history for the client. The messaging UI subscribes to the last N messages in a channel:
// convex/messages.ts
import { v } from "convex/values";
import { query } from "./_generated/server";
// requireChatMember is the governance helper (import omitted).

export const list = query({
  args: { chatId: v.id("chats"), limit: v.optional(v.number()) },
  handler: async (ctx, { chatId, limit }) => {
    await requireChatMember(ctx, chatId);
    return ctx.db
      .query("message")
      .withIndex("by_chat", q => q.eq("chatId", chatId))
      .order("desc")
      .take(limit ?? 50);
  },
});
The index scan is bounded. The helper requireChatMember performs the governance check.
Account-scoped agent listing. The agents page subscribes to the agents an account has deployed:
export const listAgentsForAccount = query({
  args: {},
  handler: async (ctx) => {
    const { accountId } = await requireAccountMember(ctx);
    return ctx.db
      .query("agent")
      .withIndex("by_account_status", q =>
        q.eq("accountId", accountId).eq("status", "active")
      )
      .collect();
  },
});
Paginated audit log. An admin drilling into audit history uses cursor pagination:
export const auditPage = query({
  args: { cursor: v.union(v.string(), v.null()), pageSize: v.number() },
  handler: async (ctx, { cursor, pageSize }) => {
    const { accountId } = await requireAccountAdmin(ctx);
    return ctx.db
      .query("audit_event")
      .withIndex("by_account_created", q => q.eq("accountId", accountId))
      .order("desc")
      .paginate({ cursor, numItems: pageSize });
  },
});
Cursors are opaque strings that Convex generates. The client retains the cursor between pages; the server treats it as a continuation token over the index.
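The continuation-token contract can be illustrated from the client side. The sketch below assumes a page shape of `page`, `continueCursor`, and `isDone`, which mirrors what Convex's `paginate()` returns; `collectAll` and the in-memory `fakeAuditPage` stand-in are hypothetical names used only for illustration.

```typescript
type Page<T> = { page: T[]; continueCursor: string; isDone: boolean };
type FetchPage<T> = (cursor: string | null, numItems: number) => Promise<Page<T>>;

// Walk every page by feeding each continuation cursor back to the server.
async function collectAll<T>(fetchPage: FetchPage<T>, pageSize: number): Promise<T[]> {
  const out: T[] = [];
  let cursor: string | null = null; // null requests the first page
  for (;;) {
    const res: Page<T> = await fetchPage(cursor, pageSize);
    out.push(...res.page);
    if (res.isDone) return out;
    cursor = res.continueCursor; // opaque to the client: stored, never inspected
  }
}

// In-memory stand-in for the server. Here the cursor is just an encoded offset.
const events = ["e5", "e4", "e3", "e2", "e1"]; // newest first
const fakeAuditPage: FetchPage<string> = async (cursor, numItems) => {
  const start = cursor === null ? 0 : Number(cursor);
  const end = Math.min(start + numItems, events.length);
  return { page: events.slice(start, end), continueCursor: String(end), isDone: end >= events.length };
};

collectAll(fakeAuditPage, 2).then(all => console.log(all.join(","))); // e5,e4,e3,e2,e1
```

A real admin UI would normally fetch one page at a time on demand rather than draining the log; the loop only demonstrates how the cursor threads through successive calls.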
4.4 Filters as last resort¶
A .filter(q => ...) call runs in application code after the index scan. It is always slower than a tighter index and must only be used when the predicate is genuinely low-selectivity or dynamic. If a filter is used twice in the codebase for the same shape, the index is missing; add it.
4.5 Archiving vs. deleting¶
Soft-delete is the default. Most tenant-scoped entities carry a deletedAt timestamp. Queries scope to deletedAt === undefined. Hard delete is reserved for genuinely unused storage (expired upload stubs, session state, rate-counter windows). See section 16 for the full lifecycle model.
5. Reactive queries and live data flow¶
Convex's reactivity is the feature that makes the rest of Thinklio's UX possible. Every client that calls useQuery(api.messages.list, { chatId }) receives the current result, subscribes to it automatically, and re-renders when any write in the database invalidates that result. There is no pub/sub system, no server-sent events, no polling.
5.1 Subscription semantics¶
A query function is deterministic: given the same arguments and database state, it returns the same result. Convex tracks the documents read during a query's execution. When any of those documents is written (in a mutation), every active subscription that read them is re-evaluated. The new result is pushed to subscribed clients.
The consequence for Thinklio is that every read path the UI or an agent performs is automatically a live subscription. Writing to a channel's message table immediately updates the message list for every subscriber. Writing to a job's status field immediately updates the job-dashboard subscription.
5.2 Scoping subscriptions¶
Subscriptions can over-fire if the query reads broadly. Scoping discipline:
- Queries return only what the caller needs to render. Use projection-like helpers to trim fields when the full document is heavy.
- Avoid reading entities that change on unrelated writes (for example, avoid reading user_profile inside a hot-path query if user metadata changes frequently).
- Break hot queries into smaller composed queries when the read set is coarse.
5.3 Server-side consumers¶
Agents are also subscribers. When an agent harness runs inside a Workflow, it reads state through the same query helpers the UI uses. The reactive model therefore works identically for human clients and agent clients: agents see fresh data because their reads are reactive, not because they polled.
5.4 Useful patterns¶
- Derived state. A query that returns counts or aggregates reads the underlying data and aggregates in application code. For large aggregations, use the Aggregate Convex component (see 11 Convex Reference) which maintains rolling counts incrementally.
- Paged results with live updates. Paginated queries subscribe only to the current page. The client re-subscribes on page advance. The Convex paginate() method handles this.
- Conditional subscriptions. The client conditionally subscribes with useQuery(api.X, args ?? "skip"). The "skip" sentinel avoids fetching when arguments are not yet known.
6. Writes, transactions, and concurrency¶
6.1 Mutations are atomic¶
A mutation reads and writes arbitrarily many documents inside a single transactional execution. All writes commit together or not at all. There is no explicit begin/commit; the mutation body is the transaction.
export const sendMessage = mutation({
  args: {
    chatId: v.id("chats"),
    body: v.string(),
  },
  handler: async (ctx, { chatId, body }) => {
    const member = await requireChatMember(ctx, chatId);
    const messageId = await ctx.db.insert("message", {
      chatId,
      authorId: member.userId,
      body,
      attachments: [],
    });
    await ctx.db.patch(chatId, { lastActivityAt: Date.now() });
    await ctx.db.insert("audit_event", {
      accountId: member.accountId,
      actor: member.userId,
      kind: "message.sent",
      targetKind: "message",
      targetId: messageId,
      at: Date.now(),
    });
    return messageId;
  },
});
One mutation: two inserts and a patch, all atomic. If any step throws, nothing commits.
6.2 Optimistic concurrency¶
Convex uses optimistic concurrency. If two mutations race to modify the same document, one wins and the other retries automatically, up to a Convex-managed limit. Mutations should be idempotent where possible because retries are transparent.
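The retry behaviour can be modelled with a version check. The sketch below is a toy model of optimistic concurrency in general, not Convex's internal implementation; `Versioned`, `tryCommit`, and `runWithRetry` are hypothetical names. It shows why the mutation body may run more than once and should therefore avoid side effects outside the transaction.

```typescript
type Versioned<T> = { value: T; version: number };

// Compare-and-set: the write lands only if nobody committed since readVersion.
function tryCommit<T>(doc: Versioned<T>, readVersion: number, next: T): boolean {
  if (doc.version !== readVersion) return false;
  doc.value = next;
  doc.version += 1;
  return true;
}

// Re-run the whole read-compute-write body on conflict, up to maxAttempts.
// beforeCommit is a test hook that lets a rival write land mid-transaction.
function runWithRetry<T>(
  doc: Versioned<T>,
  body: (current: T) => T,
  maxAttempts: number,
  beforeCommit?: (attempt: number) => void,
): number {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const readVersion = doc.version;
    const next = body(doc.value); // recomputed from fresh state on every attempt
    if (beforeCommit) beforeCommit(attempt);
    if (tryCommit(doc, readVersion, next)) return attempt;
  }
  throw new Error(`gave up after ${maxAttempts} conflicting attempts`);
}

const counter: Versioned<number> = { value: 0, version: 0 };
const attempts = runWithRetry(counter, n => n + 1, 5, attempt => {
  if (attempt === 1) tryCommit(counter, counter.version, counter.value + 10); // rival write
});
console.log(attempts, counter.value); // 2 11
```

The second attempt reads the rival's value (10) before adding 1, so no update is lost.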
6.3 Actions for external I/O¶
Actions (action({ ... })) run outside the transactional model. They can call external APIs, generate embeddings, perform crypto operations, or invoke LLMs. Actions commonly wrap a read-do-write pattern: read via ctx.runQuery, act externally, persist results via ctx.runMutation. Every external call Thinklio makes originates from an action.
Actions are not atomic. A failure between external call and persistence is a distributed systems problem; the durable workflow model described in 06 Events, Channels & Messaging wraps actions in step-level durability for critical paths.
6.4 When to use mutations vs. scheduled functions vs. workflows¶
- Mutations for immediate, transactional writes in response to a user or agent action.
- Scheduled functions (ctx.scheduler.runAfter, ctx.scheduler.runAt) for deferred work that is atomic in itself but should run after the current mutation commits, or on a cron.
- Workflow component for multi-step durable processes that must survive crashes, span hours, and compose actions with mutations. Ingestion pipelines use this; see section 11.
6.5 Hot-path writes¶
Messaging is the highest-volume hot path. Writing a message:
- Insert into message (small, fixed schema).
- Patch channel.lastActivityAt.
- Insert audit_event.
- If the channel has observers (agents or bots), insert event rows for the event bus in doc 06.
All inside one mutation. Latency in the tens of milliseconds end-to-end in the Convex cloud. No cache to warm, no queue to drain.
7. Vector search and semantic retrieval¶
Convex has native vector search. Vectors live as a first-class field type on documents, and an index is declared in the schema. Searches return ranked, optionally filtered results.
7.1 Where vector search applies¶
Three places in Thinklio:
- Knowledge facts. Each fact carries an embedding of its body. Semantic retrieval surfaces related facts during an agent's turn.
- Library items. Uploaded documents are chunked and every chunk is embedded. Retrieval serves document-grounded answers.
- Message similarity (planned). For the support-triage and duplicate-detection agents, message embeddings support "find related past messages" lookups.
7.2 Schema shape¶
library_item: defineTable({
  libraryId: v.id("library"),
  accountId: v.string(),
  documentId: v.id("media"),
  chunk: v.string(),                // the text chunk
  chunkIndex: v.number(),
  embedding: v.array(v.float64()),  // 1536-dim or model-dependent
  metadata: v.any(),
})
  .index("by_library", ["libraryId"])
  .vectorIndex("by_embedding", {
    vectorField: "embedding",
    dimensions: 1536,
    filterFields: ["libraryId", "accountId"],
  }),
filterFields let a search restrict candidates by account or library before ranking, which is the only efficient multi-tenant search pattern.
7.3 Query shape¶
export const searchLibrary = action({
  args: {
    libraryId: v.id("library"),
    query: v.string(),
    k: v.optional(v.number()),
  },
  handler: async (ctx, { libraryId, query, k }) => {
    // Governance check: the caller must be allowed to read this library.
    await requireLibraryReader(ctx, libraryId);
    const embedding = await embed(query); // external embedding-model call
    return await ctx.vectorSearch("library_item", "by_embedding", {
      vector: embedding,
      limit: k ?? 10,
      // Vector filters accept a single q.eq or a q.or of q.eq terms, not an AND chain.
      // Filtering on libraryId alone assumes each library belongs to exactly one account.
      filter: q => q.eq("libraryId", libraryId),
    });
  },
});
Vector search runs in an action for two reasons: ctx.vectorSearch is only available in actions, and the embedding step calls an external model. The result is a list of document _ids with scores; a subsequent query loads the full chunk content.
7.4 Hybrid retrieval¶
Agents frequently combine vector retrieval (what is semantically similar) with metadata filtering and reranking (what is relevant to this turn). Thinklio composes:
- Vector search for the top-k candidate chunks.
- Metadata filter by recency, source, tag.
- LLM rerank (optional) for final ordering.
The RAG Convex component (see 11 Convex Reference) provides higher-level hybrid retrieval and is the recommended entry point for new agents.
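For orientation, the three-stage composition above can be sketched over an in-memory corpus. This is a minimal illustration assuming cosine similarity as the ranking function; the `Chunk` shape, `hybridRetrieve`, and the sample data are hypothetical, and it does not reproduce the API of the RAG component.

```typescript
type Chunk = { id: string; embedding: number[]; source: string; createdAt: number };
type Scored = { chunk: Chunk; score: number };

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Stage 1: top-k by vector similarity. Stage 2: metadata filter. Stage 3: optional rerank.
function hybridRetrieve(
  query: number[],
  chunks: Chunk[],
  k: number,
  keep: (c: Chunk) => boolean,
  rerank?: (hits: Scored[]) => Scored[],
): Scored[] {
  const topK = chunks
    .map(chunk => ({ chunk, score: cosine(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
  const filtered = topK.filter(hit => keep(hit.chunk));
  return rerank ? rerank(filtered) : filtered;
}

const corpus: Chunk[] = [
  { id: "a", embedding: [1, 0], source: "handbook", createdAt: 100 },
  { id: "b", embedding: [0.9, 0.1], source: "wiki", createdAt: 200 },
  { id: "c", embedding: [0, 1], source: "handbook", createdAt: 300 },
];
const hits = hybridRetrieve([1, 0], corpus, 2, c => c.source === "handbook");
console.log(hits.map(h => h.chunk.id)); // [ 'a' ]
```

Because the metadata filter runs after the top-k cut, it can return fewer than k results; over-fetching candidates in stage 1 is one way to compensate.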
7.5 Embedding freshness¶
When a library item is updated, the embedding must be regenerated. Thinklio does this eagerly in the ingestion pipeline (section 11) and lazily on explicit reindex. There is no periodic re-embedding sweep; the cost would be material and the content is stable once ingested.
8. Cloudflare R2 architecture¶
All file bytes live in Cloudflare R2. R2 is S3-compatible, globally replicated, and cheaper than Convex storage for the file-heavy paths (uploaded documents, attachments, exports).
8.1 Bucket tiers¶
Thinklio uses three bucket tiers:
| Tier | Purpose | Ownership | Key prefix |
|---|---|---|---|
| Platform-shared | Default bucket for all accounts on standard plans. | Thinklio. | shared/{accountId}/{context}/{fileId}.{ext} |
| Enterprise-dedicated | Account-specific bucket provisioned for enterprise tenants who want isolation. | Thinklio. | {bucketName}/{context}/{fileId}.{ext} |
| Account-supplied (BYOB) | The account points Thinklio at a bucket they own in their Cloudflare account. | Account. | Account-defined. |
The storage_bucket and account_storage_bucket entities in 04 Data Model track which bucket an account uses and carry the configuration (bucket name, region, credential reference in the secrets vault).
8.2 Key layout¶
The canonical key layout for the platform-shared bucket:
shared/{accountId}/{context}/{fileId}.{ext}
Examples:
shared/acc_abc/documents/med_xyz.pdf
shared/acc_abc/avatars/user_def.png
shared/acc_abc/attachments/msg_ghi.zip
shared/acc_abc/exports/report_jkl.xlsx
Context is one of documents, avatars, attachments, exports, ingestion-staging, generated. The context drives retention policy (section 16) and governs which code paths can read or write the key.
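A key builder keeps the layout in one place. The sketch below is illustrative only: `buildSharedKey` and `parseSharedKey` are hypothetical helper names, and the segment validation rules are an assumption rather than a documented platform rule.

```typescript
const CONTEXTS = [
  "documents", "avatars", "attachments", "exports", "ingestion-staging", "generated",
] as const;
type KeyContext = (typeof CONTEXTS)[number];

// Build shared/{accountId}/{context}/{fileId}.{ext}, rejecting segments that
// would change the path structure (empty, containing "/" or whitespace).
function buildSharedKey(accountId: string, context: KeyContext, fileId: string, ext: string): string {
  for (const part of [accountId, fileId, ext]) {
    if (part === "" || /[\/\s]/.test(part)) throw new Error(`invalid key segment: "${part}"`);
  }
  return `shared/${accountId}/${context}/${fileId}.${ext}`;
}

// Recover the account and context from a key, or null if it is not a shared-bucket key.
function parseSharedKey(key: string): { accountId: string; context: KeyContext; fileName: string } | null {
  const [prefix, accountId, context, fileName, ...rest] = key.split("/");
  if (prefix !== "shared" || !accountId || !context || !fileName || rest.length > 0) return null;
  if (!(CONTEXTS as readonly string[]).includes(context)) return null;
  return { accountId, context: context as KeyContext, fileName };
}

const key = buildSharedKey("acc_abc", "documents", "med_xyz", "pdf");
console.log(key); // shared/acc_abc/documents/med_xyz.pdf
```

Parsing the context back out of a key is what lets retention and access checks be driven by the key alone.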
8.3 Access patterns¶
Client-facing byte transfer never routes through Convex. Three access patterns:
- Upload: client requests a presigned PUT URL from a Convex mutation; client uploads directly to R2; client confirms via a Convex mutation that updates the media record. No bytes traverse Convex.
- Download for humans: Convex mutation issues a presigned GET URL; the browser or mobile client fetches R2 directly. Short-lived URLs (default 15 minutes; configurable per context).
- Download for agents and pipelines: ingestion actions fetch directly from R2 using the account's credentials held in the secrets vault, since they run server-side.
8.4 Retention and lifecycle¶
R2 lifecycle rules, managed by Terraform in deploy/, mirror the retention policy encoded in the media entity (section 16). Ingestion-staging is deleted after 24 hours. Documents persist until the media record is hard-deleted. Avatars persist with the user. Exports persist for 30 days by default, configurable per account.
8.5 Encryption¶
All buckets use R2-managed encryption at rest. For enterprise tenants who require customer-managed keys, the BYOB path supports Cloudflare's customer key integration; this is documented per deployment rather than baked into the platform.
9. File metadata, uploads, and presigned URLs¶
Every file in R2 has a corresponding media record in Convex. The media record is the queryable handle for the file: it carries the key, MIME type, size, ingestion status, owner, and derived metadata. File bytes are opaque to Convex; everything about the file that matters to the application is in media.
9.1 media schema (summary)¶
See 04 Data Model for the full schema. The relevant fields for persistence:
media: defineTable({
  accountId: v.string(),
  uploaderId: v.string(),            // Clerk user ID
  context: v.union(
    v.literal("document"),
    v.literal("avatar"),
    v.literal("attachment"),
    v.literal("export"),
    v.literal("generated"),
  ),
  bucketId: v.id("account_storage_bucket"),
  r2Key: v.string(),
  fileName: v.string(),
  fileType: v.string(),              // MIME
  fileSize: v.number(),              // bytes
  checksum: v.optional(v.string()),  // sha256, computed in ingestion
  ingestionStatus: v.union(
    v.literal("pending"),
    v.literal("uploading"),
    v.literal("uploaded"),
    v.literal("processing"),
    v.literal("complete"),
    v.literal("failed"),
  ),
  ingestionError: v.optional(v.string()),
  processingMeta: v.optional(v.any()),
  deletedAt: v.optional(v.number()),
})
  .index("by_account", ["accountId"])
  .index("by_account_status", ["accountId", "ingestionStatus"])
  .index("by_uploader", ["uploaderId"]);
9.2 Presigned upload flow¶
The presigned upload is the canonical pattern for adding a file to Thinklio. It decouples Convex from byte transfer.
1. Client calls api.media.requestUploadUrl({ context, fileName, fileType, fileSize }).
   - Convex mutation validates caller, storage quota, file-type allow-list.
   - Creates media record with ingestionStatus = "pending", allocates r2Key.
   - Generates presigned PUT URL against R2 (TTL ~5 minutes).
   - Returns { mediaId, uploadUrl, r2Key } to the client.
2. Client PUTs file bytes directly to uploadUrl.
   - No Convex involvement. R2 enforces max size, content-type.
3. Client calls api.media.confirmUpload({ mediaId }).
   - Convex mutation verifies the object exists in R2 (HEAD check).
   - Updates media.ingestionStatus = "uploaded", records checksum.
   - Enqueues the ingestion pipeline workflow if context = "document".
4. Ingestion workflow runs (section 11).
   - media.ingestionStatus transitions uploaded -> processing -> complete.
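The status changes in this flow form a small state machine, and guarding them in one function keeps a late or duplicate confirmation from moving a record backwards. The sketch below is an assumption-laden illustration: the edge set is inferred from the flow above and from the manual re-enqueue path, and `ALLOWED` and `transition` are hypothetical names.

```typescript
type IngestionStatus =
  | "pending" | "uploading" | "uploaded" | "processing" | "complete" | "failed";

// Assumed legal edges; the real rule set may differ.
const ALLOWED: Record<IngestionStatus, IngestionStatus[]> = {
  pending: ["uploading", "uploaded", "failed"],
  uploading: ["uploaded", "failed"],
  uploaded: ["processing", "failed"],
  processing: ["complete", "failed"],
  complete: [],
  failed: ["processing"], // assumed: manual re-enqueue by an admin
};

// Return the new status, or throw if the move is not a legal edge.
function transition(from: IngestionStatus, to: IngestionStatus): IngestionStatus {
  if (!ALLOWED[from].includes(to)) throw new Error(`illegal transition ${from} -> ${to}`);
  return to;
}

let status: IngestionStatus = "pending";
status = transition(status, "uploaded");   // confirmUpload
status = transition(status, "processing"); // workflow starts
status = transition(status, "complete");   // finalise
console.log(status); // complete
```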
9.3 Presigned download flow¶
1. Client calls api.media.requestDownloadUrl({ mediaId }).
   - Convex mutation validates caller is permitted to read the media.
   - Generates presigned GET URL (TTL configurable per context, default 15 min).
   - Returns { downloadUrl, mediaId, expiresAt }.
2. Client fetches downloadUrl directly from R2.
9.4 Upload failure recovery¶
If a client never confirms, the media record lingers in pending/uploading. A scheduled sweeper runs hourly to:
- List media where ingestionStatus in ('pending','uploading') && _creationTime < now - 1h.
- For each, check R2 for the key. If absent, hard-delete the media record. If present and older than 24h without confirmation, delete the R2 object and hard-delete the media record.
The sweeper is an action invoked by a cron defined in convex/crons.ts.
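The sweeper's per-record decision is a pure function of the record's status, its age, and whether the object exists in R2, which makes it easy to test in isolation. The following is a sketch of that decision only; `sweepDecision` and `SweepAction` are hypothetical names, and the R2 existence check is assumed to happen in the calling action.

```typescript
type SweepAction = "keep" | "delete-record" | "delete-object-and-record";
const HOUR_MS = 60 * 60 * 1000;

function sweepDecision(
  media: { ingestionStatus: string; creationTime: number },
  existsInR2: boolean,
  now: number,
): SweepAction {
  const unconfirmed = media.ingestionStatus === "pending" || media.ingestionStatus === "uploading";
  const age = now - media.creationTime;
  if (!unconfirmed || age <= HOUR_MS) return "keep"; // not a sweep candidate yet
  if (!existsInR2) return "delete-record";           // the upload never arrived
  // Bytes are present: allow 24h for a late confirmation before reclaiming them.
  return age > 24 * HOUR_MS ? "delete-object-and-record" : "keep";
}

const now = 100 * HOUR_MS;
console.log(sweepDecision({ ingestionStatus: "pending", creationTime: now - 2 * HOUR_MS }, false, now));
// delete-record
```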
9.5 Streaming large files¶
R2's presigned URLs support ranged GET natively. The client can stream multi-gigabyte exports without buffering in memory. Convex never handles the bytes.
10. Media system¶
The media system is the concept of "a file plus everything Thinklio knows about it". It unifies all file contexts under a single media table so governance, retention, quota, and ingestion logic can be applied uniformly.
10.1 Roles¶
A media record plays one of three roles, determined by its context:
- Document (ingested knowledge): bytes are parsed, chunked, and embedded into a library. The media record links to one or many library_item chunks.
- Attachment (message payload): bytes are linked to a message or interaction record. Retrieval is view-only; no ingestion.
- Asset (UI resource): bytes are avatar, export, or generated output. Linked to the user, account, or job that produced them.
One media record has exactly one context. If a document is also shared as an attachment in a chat, a separate media record (pointing to a different R2 key, or a shared one with separate metadata) is created.
10.2 Quota enforcement¶
Every account carries storage quota tracked in the account document:
account: defineTable({
  // ...
  storage: v.object({
    quotaBytes: v.number(),
    usedBytes: v.number(),  // denormalised, updated transactionally
  }),
});
requestUploadUrl refuses an upload that would exceed quota. confirmUpload increments usedBytes atomically. Hard-delete decrements it. Ingestion never mutates usedBytes because it reads existing bytes.
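The accounting rules above reduce to three small functions over the storage object. This is a sketch of the arithmetic only; the function names are hypothetical, and in the real system each would run inside the corresponding mutation so the update is transactional.

```typescript
type Storage = { quotaBytes: number; usedBytes: number };

// requestUploadUrl: refuse anything that would push usage past the quota.
function canUpload(s: Storage, fileSize: number): boolean {
  return fileSize >= 0 && s.usedBytes + fileSize <= s.quotaBytes;
}

// confirmUpload: the bytes now exist in R2, so count them.
function onConfirm(s: Storage, fileSize: number): Storage {
  return { ...s, usedBytes: s.usedBytes + fileSize };
}

// Hard delete: give the bytes back, never dropping below zero.
function onHardDelete(s: Storage, fileSize: number): Storage {
  return { ...s, usedBytes: Math.max(0, s.usedBytes - fileSize) };
}

let storage: Storage = { quotaBytes: 1000, usedBytes: 900 };
console.log(canUpload(storage, 100), canUpload(storage, 101)); // true false
storage = onConfirm(storage, 100);
storage = onHardDelete(storage, 400);
console.log(storage.usedBytes); // 600
```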
10.3 Processing rules¶
Each media context has a rule set defining what processing runs on upload. Rules live in media_processing_rule and are applied via media_processing_job. The rule set for documents includes: checksum, OCR if image-scanned, text extraction, chunking, embedding, library-item insertion. The rule set for avatars is: resize to thumbnails, virus scan, reject on fail.
10.4 Virus scanning¶
All uploads run a lightweight signature scan as step one of processing. The scanner (an action calling a third-party API) marks the media record failed on hit and quarantines the R2 object under a quarantine/ prefix. Accounts with stricter needs can subscribe to additional scanners via the media processing rule configuration.
11. Media processing pipeline¶
The processing pipeline is the durable workflow that turns an uploaded file into something queryable. It lives in the Convex Workflow component so that it survives crashes, is observable from the admin UI, and is resumable from any step.
11.1 Pipeline shape¶
confirmUpload
  -> workflow("media.ingest", { mediaId })
       step 1: fetchFromR2    (action)           reads bytes, computes sha256
       step 2: virusScan      (action)           external scanner API
       step 3: detectFormat   (action)           MIME + magic-bytes sniff
       step 4: extractText    (action)           format-specific, see section 12
       step 5: chunk          (action)           format-aware, see section 12
       step 6: embed          (action, batched)  provider call
       step 7: persistChunks  (mutation)         library_item inserts
       step 8: finalise       (mutation)         media.ingestionStatus = "complete"
Each step writes its outcome to the workflow's state. If step 5 fails, a rerun resumes from step 5 without re-fetching, re-scanning, or re-extracting text.
11.2 Step-level durability¶
The Workflow component persists each step's input and output to a dedicated table. A crash mid-step retries the step from its input. A crash between steps resumes at the next step. Steps are designed idempotent, so retries are safe.
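The resume behaviour can be modelled with a journal of completed step outputs. The sketch below is a toy model of step-level durability, not the Workflow component's API: `Step`, `Journal`, and `runWorkflow` are hypothetical names, and the journal is an in-memory Map standing in for the component's persistence table.

```typescript
type Step = { name: string; run: (input: unknown) => unknown };
type Journal = Map<string, unknown>; // step name -> persisted output

function runWorkflow(steps: Step[], initial: unknown, journal: Journal): unknown {
  let value = initial;
  for (const step of steps) {
    if (journal.has(step.name)) {
      value = journal.get(step.name); // completed earlier: reuse, do not re-run
      continue;
    }
    value = step.run(value);          // may throw; nothing is journaled for this step
    journal.set(step.name, value);
  }
  return value;
}

const calls: string[] = [];
let chunkShouldFail = true;
const steps: Step[] = [
  { name: "fetchFromR2", run: () => { calls.push("fetch"); return "bytes"; } },
  { name: "extractText", run: b => { calls.push("extract"); return `text(${b})`; } },
  { name: "chunk", run: t => {
      calls.push("chunk");
      if (chunkShouldFail) throw new Error("chunker crashed");
      return [`${t}#0`];
  } },
];

const journal: Journal = new Map();
try { runWorkflow(steps, null, journal); } catch { chunkShouldFail = false; }
const result = runWorkflow(steps, null, journal); // resumes at "chunk"
console.log(calls.join(",")); // fetch,extract,chunk,chunk
```

The fetch and extract steps run once across both invocations, which is the property that makes retries cheap; it also shows why each step must be idempotent, since the failed step itself runs twice.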
11.3 Error handling¶
A step failure moves to a retry backoff schedule: 30s, 2m, 10m, 30m, then abandonment. After abandonment, media.ingestionStatus = "failed" with ingestionError populated and an audit event emitted. Admins can manually re-enqueue from the admin UI.
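The schedule can be expressed as a lookup that returns the next delay or signals abandonment. This sketch takes one reading of the schedule (the first failure waits 30s, the fifth failure abandons); `nextRetryDelayMs` is a hypothetical helper name.

```typescript
const RETRY_DELAYS_MS = [30_000, 120_000, 600_000, 1_800_000]; // 30s, 2m, 10m, 30m

// failedAttempts counts failures so far (1 after the first failure).
// Returns the delay before the next retry, or null when the step is abandoned.
function nextRetryDelayMs(failedAttempts: number): number | null {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new Error("failedAttempts must be a positive integer");
  }
  return failedAttempts <= RETRY_DELAYS_MS.length
    ? RETRY_DELAYS_MS[failedAttempts - 1]
    : null;
}

console.log(nextRetryDelayMs(1), nextRetryDelayMs(4), nextRetryDelayMs(5)); // 30000 1800000 null
```

A null result is the point at which the caller would set ingestionStatus to "failed" and emit the audit event.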
11.4 Backpressure and the execution-tier budget¶
Ingestion runs on Tier 2 (Workflow slots) per the execution-tier model in 02 System Architecture. The tier budget imposes a concurrency cap: if every slot is in use, new uploads queue until a slot frees. For batch ingestion (account-initial data loads, migrations), Tier 3 (external queue) is used instead; see section 18.
11.5 Per-account isolation¶
A runaway account cannot starve other accounts. The Workflow component is invoked with an accountId key that participates in a fairness policy: per-account slot limits ensure no one account consumes more than its share. The limits are set per plan tier in the platform config.
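One way to picture the fairness policy is a slot table with both a global cap and a per-account cap. The sketch below is an assumption about how such limits could be checked, not a description of the Workflow component's scheduler; `SlotState`, `tryAcquire`, and `release` are hypothetical names.

```typescript
type SlotState = { perAccountLimit: number; globalLimit: number; inUse: Map<string, number> };

function totalInUse(s: SlotState): number {
  let n = 0;
  s.inUse.forEach(v => { n += v; });
  return n;
}

// Grant a slot only if both the account's share and the global budget allow it.
function tryAcquire(s: SlotState, accountId: string): boolean {
  const mine = s.inUse.get(accountId) ?? 0;
  if (mine >= s.perAccountLimit || totalInUse(s) >= s.globalLimit) return false;
  s.inUse.set(accountId, mine + 1);
  return true;
}

function release(s: SlotState, accountId: string): void {
  const mine = s.inUse.get(accountId) ?? 0;
  if (mine <= 1) s.inUse.delete(accountId);
  else s.inUse.set(accountId, mine - 1);
}

const slots: SlotState = { perAccountLimit: 2, globalLimit: 3, inUse: new Map() };
const granted = [
  tryAcquire(slots, "acc_a"), // true
  tryAcquire(slots, "acc_a"), // true
  tryAcquire(slots, "acc_a"), // false: acc_a is at its per-account limit
  tryAcquire(slots, "acc_b"), // true: acc_b still gets a slot
  tryAcquire(slots, "acc_b"), // false: global budget exhausted
];
console.log(granted.join(",")); // true,true,false,true,false
```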
12. Document intelligence¶
Document intelligence is the set of parsers, chunkers, and format-aware transformations that turn a raw file into structured, queryable content. It is the most complex stage of ingestion because file formats vary wildly in structure.
12.1 Document classes¶
Thinklio classifies documents into four classes at ingestion:
| Class | Examples | Parser |
|---|---|---|
| Structured | HTML, Markdown, XML, JSON | Native parser; structure preserved. |
| Rich document | DOCX, PPTX, PDF with text layer | Parser extracts text and layout. |
| Scanned | PDF without text layer, image of a document | OCR (Tier 3 parser). |
| Tabular | CSV, XLSX, TSV | Structure-aware; each sheet or table becomes a logical section. |
Class drives parser selection and chunking strategy.
12.2 Parser tiers¶
Three parser tiers by cost and sophistication:
- Tier 1: Native. Language-specific libraries: pdf-parse, mammoth (DOCX), Markdown parser. Fast, free, covers most files. Everything that can be parsed this way is.
- Tier 2: Document AI. Cloud document AI APIs (Google Document AI, Anthropic Claude with PDFs, or similar) for structured extraction of complex layouts: multi-column PDFs, forms, tables within PDFs.
- Tier 3: OCR. Scanned documents fall through to OCR. Tesseract for self-hosted; cloud OCR for higher accuracy.
The parse orchestrator picks the cheapest tier that produces a satisfactory result, tracking confidence scores and falling through when a tier fails to meet a threshold.
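The fall-through behaviour can be sketched as a loop over tiers in cost order. This is a minimal sketch, not the orchestrator itself: the `ParserTier` shape, the `parseWithFallthrough` name, and the idea that confidence is normalised to [0, 1] are all assumptions, and the parsers are synchronous only to keep the example short.

```typescript
// One parser attempt. Confidence is assumed to be normalised to [0, 1].
interface ParseResult {
  text: string;
  confidence: number;
}

// Hypothetical tier shape; kept synchronous for brevity (real parsers would be async).
interface ParserTier {
  name: string;
  parse: (file: Uint8Array) => ParseResult;
}

// Try tiers in ascending cost order and return the first result that meets the threshold.
function parseWithFallthrough(
  file: Uint8Array,
  tiersByCost: ParserTier[],
  threshold: number,
): { tier: string; result: ParseResult } | null {
  for (const tier of tiersByCost) {
    try {
      const result = tier.parse(file);
      if (result.confidence >= threshold) {
        return { tier: tier.name, result };
      }
    } catch {
      // A parser error is treated like a low-confidence result: fall through.
    }
  }
  return null; // no tier was satisfactory; the caller decides how to report it
}
```

Treating a thrown error the same as a low score keeps a flaky cloud API from blocking the cheaper or more expensive tiers around it.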
12.3 Format-aware chunking¶
Chunking is not "split into 1000-character slices". Thinklio chunks in ways that preserve semantic boundaries:
- Markdown/HTML: split at heading boundaries, keeping heading context in the chunk.
- PDFs: split at paragraph boundaries within pages; include page number in metadata.
- DOCX: split at paragraph style breaks.
- Tables: each row is a mini-chunk carrying column headers as metadata.
- Code-heavy text: split at function or class boundaries.
Chunks are sized to 500 to 2000 tokens depending on the embedding model's sweet spot. Overlap is 100 to 200 tokens to avoid losing context at boundaries.
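To make the Markdown case concrete, the following sketch splits at heading boundaries and records the heading breadcrumb on each chunk. It is an illustration under simplifying assumptions: the `Chunk` shape and function name are hypothetical, token-based sizing and overlap are omitted, and `#` lines inside fenced code are not special-cased.

```typescript
interface Chunk {
  chunkIndex: number;
  headingPath: string[]; // breadcrumb of headings above this chunk
  text: string;
}

// Split Markdown at ATX headings ("#" to "######"), keeping heading context per chunk.
function chunkMarkdownByHeading(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = [];
  let body: string[] = [];

  // Emit the accumulated body under the heading path that was current while it was read.
  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) {
      chunks.push({ chunkIndex: chunks.length, headingPath: [...path], text });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const level = match[1].length;
      while (path.length > level - 1) path.pop();     // leave deeper or sibling sections
      while (path.length < level - 1) path.push("");  // tolerate skipped heading levels
      path.push(match[2].trim());
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

A production chunker would then split or merge these sections to land in the target token range and add the overlap described above.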
12.4 Metadata capture¶
Every chunk carries:
- chunkIndex, position within the document
- source: page, section, heading path
- pageNumber where applicable
- headingPath: the breadcrumb of headings leading to this chunk, used for rerank and display
- createdAt, for freshness scoring in retrieval
This metadata is what makes the library system more than a flat vector index.
12.5 Embedding model selection¶
The default embedding model is chosen at the platform level in platform_config and can be overridden per account. When the account supplies its own OpenRouter/OpenAI key, embeddings are billed to them; otherwise, to platform credits. The chosen model's dimensions are fixed for the account's library once created; changing model requires a reindex.
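The selection and billing rules above might be expressed as follows. This is a sketch only: the type names, field names, and both functions are hypothetical and do not reflect the actual platform_config schema.

```typescript
interface EmbeddingModel {
  model: string;
  dimensions: number;
}

// Hypothetical per-account settings; field names are illustrative.
interface AccountEmbeddingSettings {
  override?: EmbeddingModel;   // per-account override of the platform default
  hasOwnProviderKey: boolean;  // true when the account supplies its own provider key
}

interface ResolvedEmbedding extends EmbeddingModel {
  billedTo: "account" | "platform";
}

// The account override wins over the platform default; billing follows key ownership.
function resolveEmbedding(
  platformDefault: EmbeddingModel,
  account: AccountEmbeddingSettings,
): ResolvedEmbedding {
  const chosen = account.override ?? platformDefault;
  return {
    model: chosen.model,
    dimensions: chosen.dimensions,
    billedTo: account.hasOwnProviderKey ? "account" : "platform",
  };
}

// A library's vectors are tied to the model that produced them, so any change forces a reindex.
function requiresReindex(indexedWith: EmbeddingModel, next: EmbeddingModel): boolean {
  return indexedWith.model !== next.model || indexedWith.dimensions !== next.dimensions;
}
```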
12.6 Generated document ingestion¶
Documents that Thinklio itself generates (reports, exports, agent-produced summaries) can be ingested back into a library. The same pipeline handles this, with the context = "generated" flag preserved through so governance can distinguish user-uploaded content from agent-produced content during retrieval.
13. Library system¶
A library is a named, scoped collection of ingested documents. Agents are configured with one or more library assignments; retrieval queries those libraries.
13.1 Library scopes¶
Libraries are account-scoped by default. Platform-scoped libraries (managed by Thinklio) exist for standard corpora (best-practice knowledge bases for a domain, for example). Enterprise accounts can also define team-scoped and user-scoped libraries for finer-grained knowledge.
| Scope | Purpose | Who manages |
|---|---|---|
| Platform | Shared across every tenant. | Thinklio. |
| Account | The default for a tenant. | Account admins. |
| Team | A subset of an account's knowledge. | Team leads. |
| User | Personal knowledge. | The user themselves. |
13.2 Library assignment¶
Agent templates declare library_assignments. At deployment, the assignment is resolved into concrete library references. The agent harness passes those references to retrieval on every turn:
library_assignment: v.array(v.object({
  libraryId: v.id("library"),
  weight: v.number(),             // 0.0 to 1.0, rerank weight
  mode: v.union(
    v.literal("primary"),         // agent prefers answers from this library
    v.literal("supplementary"),   // use only if primary yields nothing
    v.literal("restrict"),        // agent cannot answer outside this library
  ),
}));
13.3 Library composition¶
A library is typically composed of many documents uploaded over time. Documents can be added and removed; the library's vector index is updated transactionally. Deleting a document removes its chunks; the library's search continues working against remaining chunks.
13.4 Cross-account sharing¶
Libraries never cross account boundaries in the data layer. The closest approximation is a platform-scoped library, which Thinklio mirrors into every account's retrieval path. For account-to-account sharing (rare; primarily for enterprise agency-client relationships), the sharing is modelled explicitly via library_share records that grant named accounts read access.
13.5 Library hygiene¶
A library has health signals:
- Size distribution: chunk count per document, to flag unusually small or large documents.
- Freshness: last-updated timestamp per document.
- Hit rate: how often retrieval surfaces this document's chunks. Low hit rate may indicate irrelevant uploads; high hit rate on stale documents suggests a content refresh is warranted.
These signals drive the admin UI's library management view.
14. Retrieval integration¶
Retrieval is how an agent turns a user's question into a list of document chunks to ground its answer.
14.1 Retrieval pipeline¶
agent turn starts
  -> resolveLibraryAssignments(agent)        returns library refs with weights
  -> for each library ref:
       -> searchLibrary(query, libraryRef)   vector search (section 7)
  -> combine(results, weights)               merge and dedupe
  -> metadataFilter(results, turn)           recency, source, tags
  -> rerank(results, query)                  optional, LLM or cross-encoder
  -> return top K chunks as grounding context
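The combine step in the pipeline above could look something like the sketch below. It assumes, hypothetically, that each library's raw scores are comparable, that the assignment weight is applied multiplicatively, and that a chunk is identified by its mediaId and chunkIndex; none of these details are specified here, and the type and function names are illustrative.

```typescript
interface ScoredChunk {
  mediaId: string;
  chunkIndex: number;
  score: number;
}

interface LibraryResults {
  weight: number;         // the assignment's rerank weight, 0.0 to 1.0
  results: ScoredChunk[]; // raw vector-search hits for one library
}

// Merge per-library hits, keep the best weighted score per chunk, return the top K.
function combine(perLibrary: LibraryResults[], topK: number): ScoredChunk[] {
  const best = new Map<string, ScoredChunk>();
  for (const { weight, results } of perLibrary) {
    for (const hit of results) {
      const key = `${hit.mediaId}:${hit.chunkIndex}`;
      const weighted: ScoredChunk = { ...hit, score: hit.score * weight };
      const existing = best.get(key);
      if (existing === undefined || weighted.score > existing.score) {
        best.set(key, weighted); // dedupe: the same chunk can surface via several libraries
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```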
14.2 Hybrid with knowledge facts¶
Retrieval operates against two data sources in parallel: library items (document-grounded, chunk-sized) and knowledge facts (small, structured atoms learned through interaction). The harness merges results across both, weighted by the agent's configured balance. See 04 Data Model for the knowledge_fact entity and 03 Agent Architecture & Extensibility for how the harness uses retrieval.
14.3 Recency and freshness¶
For time-sensitive domains (news, policy updates), retrieval boosts chunks with recent createdAt. The boost is a configurable score adjustment, not a hard filter. For timeless domains, the boost is zero.
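One way to realise a boost that adjusts scores without filtering is an additive term that decays with age. The exponential half-life form, the parameter names, and the function below are assumptions chosen for illustration; the actual adjustment curve is not specified here.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Add a freshness bonus that halves every `halfLifeDays`; a zero weight disables it.
function applyRecencyBoost(
  score: number,
  createdAt: number,   // chunk creation time, epoch milliseconds
  now: number,         // current time, epoch milliseconds
  boostWeight: number, // 0 for timeless domains
  halfLifeDays: number,
): number {
  if (boostWeight === 0) return score;
  const ageDays = Math.max(0, now - createdAt) / MS_PER_DAY;
  return score + boostWeight * Math.pow(0.5, ageDays / halfLifeDays);
}
```

Because the bonus only ever adds to the score, an old but highly relevant chunk can still outrank a fresh, weakly relevant one.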
14.4 Citation¶
Every chunk returned to the agent carries its mediaId and chunkIndex. When the agent composes a response using a chunk, it cites by mediaId so the UI can surface a "view source" link. The library system exposes an API that resolves a mediaId to a presigned R2 download URL for the originating document, positioned to the citation's page where supported.
14.5 No retrieval fallback¶
If retrieval returns nothing, the agent does not silently fabricate. Depending on the agent's policy, it either asks a clarifying question, declares "I don't have grounding for this in my libraries", or proceeds without grounding and marks the response as ungrounded. The policy is configured per agent.
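The three configured behaviours can be modelled as a small decision function. The policy names, the `TurnPlan` shape, and `planTurn` are hypothetical labels for the options described above, not the harness's actual types.

```typescript
// Hypothetical names for the three per-agent options described above.
type NoGroundingPolicy = "clarify" | "decline" | "proceed_ungrounded";

type TurnPlan =
  | { kind: "grounded"; chunks: string[] }
  | { kind: "ask_clarifying_question" }
  | { kind: "decline"; message: string }
  | { kind: "ungrounded" }; // response must be marked as ungrounded downstream

// Decide how the turn proceeds based on retrieval output and the agent's policy.
function planTurn(chunks: string[], policy: NoGroundingPolicy): TurnPlan {
  if (chunks.length > 0) {
    return { kind: "grounded", chunks };
  }
  switch (policy) {
    case "clarify":
      return { kind: "ask_clarifying_question" };
    case "decline":
      return { kind: "decline", message: "I don't have grounding for this in my libraries" };
    case "proceed_ungrounded":
      return { kind: "ungrounded" };
  }
}
```

Making the ungrounded path an explicit variant, rather than an empty chunk list, is what lets the UI flag such responses.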
15. Media and library API¶
The Convex function API exposed to clients. Names are illustrative; see convex/media.ts and convex/library.ts for the canonical surface.
15.1 Media functions¶
| Function | Kind | Purpose |
|---|---|---|
| media.requestUploadUrl | mutation | Allocate a presigned upload URL. |
| media.confirmUpload | mutation | Finalise an upload; enqueue ingestion. |
| media.requestDownloadUrl | mutation | Generate a presigned GET URL. |
| media.get | query | Look up a media record by ID. |
| media.listByAccount | query | List for the current account; paginated. |
| media.listByContext | query | Filtered by context. |
| media.softDelete | mutation | Mark deleted; triggers R2 lifecycle scheduling. |
| media.retryIngestion | mutation | Re-enqueue a failed ingestion workflow. |
15.2 Library functions¶
| Function | Kind | Purpose |
|---|---|---|
| library.create | mutation | Create a new library at the appropriate scope. |
| library.addDocument | mutation | Attach a processed media record to a library. |
| library.removeDocument | mutation | Detach; removes all chunks and re-indexes. |
| library.list | query | List libraries the caller has access to. |
| library.search | action | Vector + metadata search. |
| library.health | query | Size, freshness, hit-rate signals. |
15.3 Governance in the API¶
Every function resolves the caller via Clerk (ctx.auth.getUserIdentity()), looks up the user's membership, and scopes reads and writes accordingly. This is the standard pattern across Thinklio and is documented in 12 Developer Guide.
16. Data lifecycle, archiving, and retention¶
16.1 Soft-delete is the default¶
Tenant-scoped entities carry a deletedAt timestamp. Setting it hides the record from normal queries. The record persists until retention expires or explicit hard-delete runs.
16.2 Retention classes¶
| Class | Examples | Retention |
|---|---|---|
| Active | Messages, agent responses, notes, tasks | Indefinite while account is active. |
| Audit | audit_event, access logs | 2 years minimum (compliance); streamed to long-term storage via Fivetran CDC. |
| Ephemeral | Rate counters, session state, upload stubs | Hours to days. Swept by a cron. |
| Soft-deleted | Any entity with deletedAt | 30 days by default, configurable per account; then hard-deleted. |
| Account-terminated | All tenant data after account closure | 30-day grace, then hard-delete everywhere including R2. |
16.3 Sweeper jobs¶
Two crons handle lifecycle:
- sweeps/expired_soft_delete: daily. Hard-deletes records older than the retention window. R2 lifecycle rules pick up deleted objects separately.
- sweeps/upload_stubs: hourly. Cleans up unconfirmed media records per section 9.4.
Both are actions invoking mutations to read and delete in bounded batches.
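The bounded-batch pattern can be sketched against an in-memory store. In this illustration the `Map` stands in for a Convex table and each loop iteration stands in for one mutation call made by the action; the names `selectExpired` and `sweepExpired` are hypothetical.

```typescript
interface SoftDeletable {
  id: string;
  deletedAt?: number; // epoch milliseconds; undefined means the record is live
}

// Pick at most `batchSize` records whose soft-delete is older than the retention window.
function selectExpired(
  records: SoftDeletable[],
  now: number,
  retentionMs: number,
  batchSize: number,
): SoftDeletable[] {
  return records
    .filter((r) => r.deletedAt !== undefined && now - r.deletedAt > retentionMs)
    .slice(0, batchSize);
}

// Repeat small batches until nothing is left, so no single step handles an unbounded set.
function sweepExpired(
  store: Map<string, SoftDeletable>,
  now: number,
  retentionMs: number,
  batchSize: number,
): number {
  let deleted = 0;
  for (;;) {
    const batch = selectExpired(Array.from(store.values()), now, retentionMs, batchSize);
    if (batch.length === 0) return deleted;
    for (const record of batch) store.delete(record.id); // one batch ~ one mutation
    deleted += batch.length;
  }
}
```

Keeping each batch small bounds how much any single transactional step reads and writes, however large the backlog.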
16.4 Account offboarding¶
When an account is terminated, a one-time workflow:
- Marks the account status = "offboarded".
- Starts a 30-day grace period. Users retain read access for data export.
- At grace expiry, runs a deletion workflow: enumerate every tenant-scoped table, delete all records where accountId = {target}, delete all R2 objects under shared/{accountId}/*, delete BYOB bucket credentials, delete all Clerk org membership records.
- Emits a final audit event recording the deletion, scoped to the platform rather than the (now-deleted) account.
16.5 Right to erasure¶
A user may request deletion of their personal data without the account being terminated (GDPR, CCPA, Australian Privacy Act equivalents). Scope:
- user_profile: deleted; references to it are reassigned to a tombstone user so audit events keep referential integrity.
- User-scoped knowledge facts, notes, libraries: hard-deleted.
- User-authored messages: redacted (body replaced with a tombstone marker) rather than deleted, to preserve chat context for other participants.
- Avatars and user-uploaded assets: deleted.
The workflow is orchestrated via a Convex Workflow so it is auditable and resumable.
17. Backup and recovery¶
17.1 Convex backups¶
Convex provides snapshot backups and point-in-time recovery on the Professional plan and above:
- Daily snapshots, retained 30 days by default, taken automatically.
- Point-in-time recovery within the retention window, to restore any state in the past.
- Manual snapshots, triggered via the Convex dashboard or CLI for pre-migration safety.
Restoration is a Convex-assisted procedure; Thinklio does not run its own database backup infrastructure.
17.2 R2 backups¶
R2 does not have automatic cross-region replication by default; Thinklio configures it per bucket via Cloudflare's bucket replication rules. Primary buckets replicate to a secondary region for disaster recovery. Restore is a bucket-level operation orchestrated via Cloudflare's tooling.
17.3 Fivetran CDC¶
Change data capture to Fivetran streams every Convex write to a long-term warehouse (BigQuery, Snowflake, or equivalent per deployment). This serves three purposes: analytics, compliance audit, and a belt-and-braces backup from which the database could be rebuilt if Convex backups were ever insufficient. Fivetran configuration lives in deploy/.
17.4 Recovery drills¶
A recovery drill runs quarterly:
- Take a fresh manual Convex snapshot in staging.
- Restore to a scratch project.
- Verify schema integrity and sample-record consistency.
- Restore one R2 bucket from its replication target.
- Document any divergence in the runbook.
Drill outcomes are tracked in the decision log.
18. Ingestion under the execution tier budget¶
The execution tiering model in 02 System Architecture governs how Thinklio allocates Convex Workflow and Workpool slots. Ingestion is a major consumer of those slots, and this section sets the rules.
18.1 Tier assignments¶
| Workload | Tier | Rationale |
|---|---|---|
| Single document upload | Tier 2 (Workflow) | Durable, survives crash, user-initiated and responsive. |
| Batch upload (UI, 10-100 files) | Tier 2 | Same, per-document workflows. |
| Bulk import (1000+ files, migrations) | Tier 3 (external queue) | Would exhaust Tier 2 budget. Dispatched to an external worker that calls back into Convex mutations. |
| Periodic reindex (model upgrade) | Tier 3 | Scale-out job, no user-facing latency requirement. |
18.2 Backpressure signals¶
The Workflow component exposes slot occupancy metrics. When occupancy exceeds 80%, new Tier 2 ingestion workflows queue rather than running immediately. When occupancy exceeds 95%, Tier 2 ingestion is gated for 30 seconds to let in-flight work drain. A runaway account can trigger a per-tenant gate independently.
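The two thresholds translate into a simple admission decision. The function and type names below are hypothetical, and occupancy is assumed to arrive as a fraction between 0 and 1; how the metric is actually read from the Workflow component is not shown.

```typescript
type Admission = "run" | "queue" | "gate";

const QUEUE_THRESHOLD = 0.8;  // above this, new Tier 2 ingestion queues
const GATE_THRESHOLD = 0.95;  // above this, Tier 2 ingestion is gated
const GATE_DURATION_MS = 30_000; // how long a gate holds to let in-flight work drain

// Map current slot occupancy (0.0 to 1.0) to an admission decision for a new workflow.
function admitIngestion(occupancy: number): Admission {
  if (occupancy > GATE_THRESHOLD) return "gate";
  if (occupancy > QUEUE_THRESHOLD) return "queue";
  return "run";
}

// When gated, compute the earliest time the caller should re-check occupancy.
function nextCheckAt(now: number, decision: Admission): number {
  return decision === "gate" ? now + GATE_DURATION_MS : now;
}
```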
18.3 External queue integration¶
Tier 3 is an external queue (Cloud Tasks, SQS, or equivalent per deployment) that receives ingestion jobs, runs them as worker processes, and calls back into Convex via HTTP actions to persist chunks and update status. The worker runs the same parser, chunker, and embedder logic; the split is purely about where the compute executes.
18.4 Promotion criteria¶
A workload gets promoted from Tier 2 to Tier 3 when monitoring shows sustained Tier 2 saturation. The decision is operational, recorded in the decision log, and reversible. Thinklio starts every workload on the lowest tier that can carry it.
19. Monitoring¶
19.1 Key metrics¶
| Metric | What it tells you |
|---|---|
| Convex function latency (p50, p95, p99) | Query and mutation health. |
| Workflow step duration per kind | Ingestion bottlenecks. |
| Workflow success rate | Ingestion reliability. |
| Workflow retry counts | Flaky external dependencies. |
| Tier 2 slot occupancy | Proximity to the concurrency ceiling. |
| R2 upload error rate | Client connectivity, credential health. |
| Vector index search latency | Retrieval responsiveness. |
| Soft-delete/hard-delete throughput | Sweeper job health. |
| Storage used vs. quota per account | Quota breach risk. |
| Embedding provider error rate | LLM availability. |
19.2 Alerting¶
Alerts are defined in the Convex dashboard and Cloudflare monitoring. Key alerts:
- Tier 2 slot occupancy > 90% for 5 minutes.
- Ingestion success rate < 95% over a 15-minute window.
- Any workflow stuck > 6 hours without progress.
- R2 upload error rate > 2%.
- Convex storage growth > 20% week-over-week (unexplained).
19.3 Admin UI surfaces¶
The platform admin dashboard exposes per-account views of storage usage, ingestion pipeline state (active, queued, failed), library health, and recent error logs. See 12 Developer Guide for the admin UI architecture.
20. Implementation phases¶
Phase 1: Core persistence (complete)¶
- Convex project with the schema from 04 Data Model.
- R2 buckets provisioned, platform-shared tier live.
- Media upload flow (presigned URL, confirm, soft-delete).
- Single-document ingestion pipeline via Workflow.
- Basic vector search against library_item.
Phase 2: Ingestion hardening¶
- Document intelligence tier 2 (cloud document AI) enabled.
- Format-aware chunking across all four document classes.
- Library health signals surfaced in admin UI.
- Sweepers for upload stubs and soft-deletes.
- Fivetran CDC pipeline live.
Phase 3: Platform libraries and BYOB¶
- Platform-scoped libraries live for the standard domains.
- BYOB R2 integration live for enterprise tenants.
- Library hygiene dashboard.
- Customer-managed encryption keys for R2.
Phase 4: Scale-out ingestion¶
- Tier 3 external queue integration for bulk imports.
- Periodic reindex workflow for model upgrades.
- Load and performance testing across concurrent accounts.
Phase 5: Long-tail¶
- Cross-account library sharing (library_share flow).
- Right-to-erasure workflow productionised.
- Quarterly recovery drill instrumented.
Status tracking is in 13 Implementation Plan & Status.
21. Revision history¶
| Date | Version | Change |
|---|---|---|
| 2026-04-17 | 1.0.0 | Initial consolidated release. Merged old docs 06 (Persistence & Caching v03), 20 (Document Ingestion & Storage v02), 24 (Supabase Platform, Database & Auth v02), and 41 (Document Intelligence v01). |
| 2026-04-17 | 2.0.0 | Full rewrite for Convex-first persistence. The previous version (Postgres + Supabase + Redis design) is preserved under archive/legacy-postgres-persistence-design.md. This version covers Convex as the platform database, R2 as the file layer, native vector search, the ingestion pipeline built on the Workflow component, and execution-tier budgeting for ingestion. |