Developer Guide¶
Overview¶
This document is the operational playbook for developers working on the Thinklio codebase. It covers:
- Part A: Repository and development environment. Monorepo layout, local setup, running Convex and Next.js in watch mode, mobile integration.
- Part B: Convex + Clerk setup guide. The canonical reference for wiring Convex (backend) and Clerk (auth) together across development and production environments, for web, mobile, and native apps.
- Part C: Programming conventions. Coding conventions in the Convex + TypeScript stack: query / mutation / action patterns, response shapes, error handling, logging, data access patterns, and a clearly labelled archival summary of the legacy Go backend conventions.
- Part D: Custom agent and integration developer guide. Building custom agents, registering external tools, subscribing to events, and integrating with the Thinklio platform as an external service.
- Part E: Testing and observability. Test strategy, CI pipeline, profiling, OpenTelemetry + Prometheus metrics, structured logging, and alerting rules.
- Part F: Deployment, administration, and operations. Initial deployment, admin dashboard, backup and recovery, upgrades, scaling, monitoring, routine maintenance, troubleshooting, and security operations.
The Thinklio stack at the time of this document: Convex (reactive TypeScript backend, database, server functions, durable workflows), Clerk (auth, organisations, RBAC), Next.js 15 (App Router, React 19, TypeScript, Tailwind CSS v4) for web, Flutter/Dart (planned) for mobile, Cloudflare R2 for object storage, OpenRouter / Anthropic API for LLM, Postmark for transactional email, Voyage AI for embeddings. The legacy Go services and PostgreSQL / Redis / Supabase stack are being retired; where legacy material remains useful (for example, the deployment topology for the Go monolith, or the Redis Streams event bus design), it is preserved in archival subsections and clearly flagged.
Table of contents¶
Part A: Repository and development environment¶
- 1. Repository structure
- 2. Prerequisites
- 3. Local development
- 4. Environment configuration
Part B: Convex + Clerk setup guide¶
- 5. Architecture summary
- 6. Convex setup
- 7. Clerk setup
- 8. Next.js integration
- 9. Mobile and native integration
- 10. User metadata
- 11. Deployment checklist
- 12. Cross-app identity (future option)
- 13. Troubleshooting Convex + Clerk
Part C: Programming conventions¶
- 14. Single reactive backend
- 15. Request lifecycle
- 16. Query, mutation, and action patterns
- 17. Response and error conventions
- 18. Logging conventions
- 19. Data access patterns
- 20. Legacy Go backend conventions (archival)
Part D: Custom agent and integration developer guide¶
Part E: Testing and observability¶
- 24. Testing strategy
- 25. CI pipeline
- 26. Profiling mode
- 27. OpenTelemetry and Prometheus
- 28. Structured logging in production
- 29. Testing and observability backlog
Part F: Deployment, administration, and operations¶
- 30. System components
- 31. Infrastructure and prerequisites
- 32. Initial deployment
- 33. Administration
- 34. Backup and recovery
- 35. Upgrades
- 36. Scaling
- 37. Monitoring and alerting
- 38. Routine maintenance
- 39. Troubleshooting production
- Revision history
Part A: Repository and development environment¶
Part A describes the monorepo, its conventions, and how to bring Thinklio up locally for development work. For operational deployment see Part F.
1. Repository structure¶
Thinklio lives in a single monorepo. The active layout is Convex-first: the Convex backend is the canonical server, the Next.js app consumes it via the Convex React client, and the Flutter mobile app (planned) consumes it via the Convex Dart client. Legacy Go services remain in the tree under cmd/ and internal/ for reference during the migration and are being retired incrementally.
thinklio/
├── convex/ # Convex backend: schema, queries, mutations, actions, HTTP routes
│ ├── schema.ts # Canonical schema (all tables, indexes, vector indexes)
│ ├── auth.config.ts # Clerk JWT issuer configuration
│ ├── http.ts # HTTP routes (Clerk webhook, Postmark inbound, tool callbacks)
│ ├── _generated/ # Generated types and API (do not edit)
│ └── <domain>.ts # One file per domain: agent, channel, knowledge, jobs, etc.
├── apps/
│ ├── web/ # Next.js 15 web app (React 19, TypeScript, Tailwind v4)
│ │ ├── src/app/ # App Router routes
│ │ ├── src/components/ # Shared UI components
│ │ ├── src/lib/ # Client-side utilities, Convex helpers
│ │ └── middleware.ts # Clerk middleware (required for production)
│ └── mobile/ # Flutter/Dart mobile app (planned)
├── packages/ # Shared packages (types, i18n bundles, UI primitives)
├── docs/ # Canonical documentation set (this file is 12)
│ └── product/ # Numbered product docs 01–14 + living docs
├── cmd/ # Legacy Go service entry points (being retired)
├── internal/ # Legacy Go packages (being retired)
├── pkg/ # Legacy Go public packages
├── migrations/ # Legacy SQL migrations (Supabase era)
├── deploy/ # Docker, Compose, reverse proxy configs (legacy; see Part F)
└── tests/ # Integration and end-to-end tests
The Convex backend is the source of truth for all new application logic and data. When you need to understand a domain, start at convex/schema.ts for the data shape and then the matching convex/<domain>.ts file for the server functions.
2. Prerequisites¶
| Tool | Version | Purpose |
|---|---|---|
| Node.js | 20 LTS or newer | Required for Convex CLI, Next.js, and TypeScript tooling |
| npm | 10+ (or pnpm 9+) | Package manager |
| Convex CLI | Latest | Installed via npm install -D convex; run with npx convex |
| Clerk account | any | Development instance for local dev, production instance for deployed environments |
| Git | 2.40+ | Source control |
| Flutter | 3.22+ | Only needed for mobile work (optional) |
| golangci-lint | any | Only needed if working on legacy Go code (optional) |
Cloud service accounts required for full functionality: Convex (reactive backend), Clerk (auth), Cloudflare R2 (object storage), OpenRouter or Anthropic (LLM), Voyage AI (embeddings), Postmark (email). Each has a free or low-cost dev tier sufficient to get the platform running end to end.
3. Local development¶
Two long-running processes drive the inner development loop. Run them in separate terminals.
# Terminal 1: Convex backend in watch mode (pushes on save)
npx convex dev
# Terminal 2: Next.js web app in dev mode
cd apps/web && npm run dev
npx convex dev watches convex/** and pushes changes to your Convex development deployment on save. Server logs stream to the terminal. The Next.js dev server runs at http://localhost:3000 by default, with hot module reload.
Common Convex CLI commands:
npx convex dev # Watch mode (default workflow)
npx convex dev --once # Push once and exit (CI, scripted setup)
npx convex deploy # Deploy to the configured deployment
npx convex run <fn> # Invoke a server function by name
npx convex import <file> # Bulk import data (dev only)
npx convex export # Export all tables to disk
npx convex dashboard # Open the Convex dashboard for this deployment
Common web app commands:
cd apps/web
npm run dev # Start Next.js dev server
npm run build # Production build (standalone output)
npm run start # Run the production build
npm run lint # Run ESLint
npx tsc --noEmit # Type check only
Convex + Next.js integration is detailed in Part B (sections 5 through 13). At minimum, the Next.js app needs NEXT_PUBLIC_CONVEX_URL, NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY, and CLERK_SECRET_KEY in .env.local; the Convex deployment needs CLERK_JWT_ISSUER_DOMAIN and CLERK_WEBHOOK_SECRET.
4. Environment configuration¶
Environment variables split into three groups by where they are read:
- Next.js build-time (NEXT_PUBLIC_*). Baked into the JS bundle at build time. Required during next build.
- Next.js runtime (server-only). Read at request time by server components, API routes, and middleware.
- Convex deployment env. Set in the Convex dashboard (Settings > Environment Variables) or via npx convex env set. Available to server functions via process.env.
Reference set for a Thinklio deployment:
| Variable | Location | Purpose |
|---|---|---|
| NEXT_PUBLIC_CONVEX_URL | Next.js build-time | Convex deployment URL for the React client |
| NEXT_PUBLIC_CONVEX_SITE_URL | Next.js build-time | Convex HTTP site URL (for webhook endpoints referenced by the app) |
| NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY | Next.js build-time | Clerk publishable key (pk_test_* / pk_live_*) |
| CLERK_SECRET_KEY | Next.js runtime | Clerk secret key for server-side auth helpers |
| CLERK_JWT_ISSUER_DOMAIN | Convex deployment | Clerk JWT issuer (validated by Convex auth) |
| CLERK_WEBHOOK_SECRET | Convex deployment | Signing secret for /clerk-webhook verification |
| OPENROUTER_API_KEY | Convex deployment | LLM API access (alternative: ANTHROPIC_API_KEY) |
| VOYAGE_API_KEY | Convex deployment | Embedding model API access |
| TAVILY_API_KEY | Convex deployment | Web search tool |
| POSTMARK_SERVER_TOKEN | Convex deployment | Transactional email and inbound email channel |
| R2_ENDPOINT, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, R2_BUCKET_NAME | Convex deployment | Cloudflare R2 S3-compatible storage for documents |
| TELEGRAM_BOT_TOKEN | Convex deployment | Telegram channel adapter |
Appendix-style details for Clerk configuration, JWT templates, and webhook setup are in Part B.
Important: NEXT_PUBLIC_* values become part of the built JavaScript. Any Docker build for the web app must declare these as ARG directives so they are available during next build. CLERK_SECRET_KEY and other runtime-only secrets are passed via container env at runtime, not baked into the image.
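A minimal sketch of that split, assuming a conventional two-stage Next.js Dockerfile (stage names, paths, and the workspace build command are illustrative, not the actual Thinklio Dockerfile):

```dockerfile
# Build stage: NEXT_PUBLIC_* values must be ARGs so `next build` can bake them in.
FROM node:20-alpine AS builder
ARG NEXT_PUBLIC_CONVEX_URL
ARG NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY
ENV NEXT_PUBLIC_CONVEX_URL=$NEXT_PUBLIC_CONVEX_URL \
    NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=$NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY
WORKDIR /app
COPY . .
RUN npm ci && npm run build --workspace apps/web

# Runtime stage: CLERK_SECRET_KEY and other secrets arrive as container env
# at run time; they are never declared as ARGs or baked into a layer.
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/apps/web/.next/standalone ./
CMD ["node", "apps/web/server.js"]
```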
Part B: Convex + Clerk setup guide¶
Part B is the canonical setup reference for wiring Convex (backend) and Clerk (auth) together. It applies to every Novansa app, with Thinklio-specific notes called out where they differ from the common pattern. Currently each app (Thinklio, CalmerFlow) maintains its own Clerk instance; a future option for a shared instance is described in section 12.
5. Architecture summary¶
Browser / App → Client (Clerk auth) → Convex (validated JWT) → Database
↑
Clerk webhooks (user/org sync)
Key integration points:
- Clerk issues JWTs containing user and optional organisation claims.
- Convex validates JWTs using Clerk's JWKS endpoint (configured in convex/auth.config.ts).
- Clerk webhooks notify Convex of user and organisation lifecycle events.
- Next.js middleware handles Clerk's auth proxy for production domains.
6. Convex setup¶
6.1 Development deployment¶
Every Convex project has a development deployment created automatically the first time you run npx convex dev.
Environment variables (in the repo root .env.local):
CONVEX_DEPLOYMENT=dev:<deployment-slug>
NEXT_PUBLIC_CONVEX_URL=https://<deployment-slug>.<region>.convex.cloud
NEXT_PUBLIC_CONVEX_SITE_URL=https://<deployment-slug>.<region>.convex.site
Running locally: run npx convex dev as described in section 3.
6.2 Production deployment¶
Production deployments are created in the Convex dashboard under your project.
Deploy command:
npx convex deploy
# or, targeting a specific deployment:
CONVEX_DEPLOYMENT=prod:<deployment-slug> npx convex deploy
Required environment variables (set in Convex dashboard > Settings > Environment Variables):
| Variable | Purpose | Example |
|---|---|---|
| CLERK_JWT_ISSUER_DOMAIN | Clerk domain for JWT validation | https://clerk.example.com |
| CLERK_WEBHOOK_SECRET | Webhook signing secret from Clerk | whsec_... |
Additional app-specific env vars (LLM keys, payment provider keys, etc.) are set per deployment.
6.3 Auth configuration¶
Convex validates Clerk JWTs via convex/auth.config.ts:
export default {
providers: [
{
domain: process.env.CLERK_JWT_ISSUER_DOMAIN!,
applicationID: "convex",
},
],
};
applicationID: "convex" must match the name of the JWT template created in Clerk (see section 7.3).
7. Clerk setup¶
7.1 Development instance¶
Clerk development instances use test keys (pk_test_*, sk_test_*) and do not require DNS configuration.
Environment variables (in the web app's .env.local):
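A sketch of the expected entries, using the development key placeholders from section 8.4:

```
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_...
CLERK_SECRET_KEY=sk_test_...
```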
And for Convex (set in .env.local or Convex dashboard):
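A sketch of the Convex-side values; the accounts.dev issuer form below is an assumption for development instances — copy the exact issuer from the Clerk JWT template:

```
CLERK_JWT_ISSUER_DOMAIN=https://<your-instance>.clerk.accounts.dev
CLERK_WEBHOOK_SECRET=whsec_...
```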
Development instances:
- Use Clerk's hosted UI at *.clerk.accounts.dev
- Support test users and test organisations
- JWT templates work identically to production
- Webhooks can be tested via Clerk's dashboard or through a tunnel (for example, ngrok)
7.2 Production instance¶
Production instances use live keys (pk_live_*, sk_live_*) and require DNS configuration.
Steps to create:
1. In Clerk dashboard > your app > Enable Production.
2. Enter your application domain (for example, app.example.com).
3. Choose Primary application (Clerk API at clerk.<your-domain>).
4. Add the DNS records Clerk provides (CNAME for clerk.<domain>, plus email verification records).
5. Wait for SSL provisioning (usually a few minutes).
The Clerk Free plan (10,000 MAU) includes production mode. No paid plan is needed initially.
Environment variables for deployment (Coolify, Vercel, etc.):
| Variable | Type | Value |
|---|---|---|
| NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY | Build arg | pk_live_... |
| CLERK_SECRET_KEY | Runtime env | sk_live_... |
NEXT_PUBLIC_* variables are baked into the JS bundle at build time in Next.js. CLERK_SECRET_KEY is runtime-only (server-side). The Dockerfile must declare NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY as an ARG so it is available during next build (see section 4).
7.3 JWT template for Convex¶
Clerk must have a JWT template named convex. Without this template, Convex cannot validate tokens.
Create in: Clerk dashboard > JWT Templates > New Template.
Settings:
- Name: convex
- Issuer: https://clerk.<your-domain> (production) or auto-set (development)
- JWKS Endpoint: auto-configured by Clerk
Minimal claims (no organisations):
{
"aud": "convex",
"name": "{{user.full_name}}",
"email": "{{user.primary_email_address}}",
"picture": "{{user.image_url}}",
"email_verified": "{{user.email_verified}}"
}
Extended claims (with organisations, required for Thinklio):
{
"aud": "convex",
"name": "{{user.full_name}}",
"email": "{{user.primary_email_address}}",
"picture": "{{user.image_url}}",
"email_verified": "{{user.email_verified}}",
"org_id": "{{org.id}}",
"org_role": "{{org.role}}",
"org_slug": "{{org.slug}}"
}
Apps that use Clerk Organisations (Thinklio) require the org_* claims. Apps that do not (for example, CalmerFlow) use the minimal template. The template name (convex) must match applicationID in auth.config.ts regardless.
Optional: include public metadata in JWTs so it is available in Convex without an extra API call.
Inside Convex, read this via ctx.auth.getUserIdentity(). Keep the total session token payload under 1.2 KB (browser cookie limit).
7.4 Webhooks¶
Clerk webhooks sync user and organisation data to Convex. The webhook endpoint runs on the Convex HTTP deployment (not on the Next.js app).
Endpoint URL: https://<deployment-slug>.<region>.convex.site/clerk-webhook (the Convex site URL, not the cloud URL).
Minimum events (all apps):
- user.created
- user.updated
- user.deleted
Additional events (apps using organisations):
- organization.created
- organization.updated
- organization.deleted
- organizationMembership.created
- organizationMembership.deleted
After creating the webhook, Clerk provides a signing secret (whsec_...). Set this as CLERK_WEBHOOK_SECRET on the Convex deployment.
Webhook handler pattern (Convex HTTP):
// convex/http.ts
import { httpRouter } from "convex/server";
import { httpAction } from "./_generated/server";
const http = httpRouter();
http.route({
path: "/clerk-webhook",
method: "POST",
handler: httpAction(async (ctx, request) => {
// 1. Verify signature using svix
// 2. Parse event type
// 3. Dispatch to appropriate mutation:
// user.created -> upsert user_profile
// user.updated -> patch user_profile
// user.deleted -> soft-delete or hard-delete
}),
});
export default http;
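Step 1 of the skeleton (signature verification) is normally delegated to the svix package. For illustration only, the underlying scheme — an HMAC-SHA256 over "svix-id.svix-timestamp.raw-body", keyed by the base64-decoded part of the whsec_ secret — can be sketched without the dependency:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Illustrative sketch of svix-style signature checking. In production use the
// svix package, which also enforces timestamp tolerance to prevent replays.
export function verifySvixSignature(
  secret: string, // "whsec_<base64 key>" from the Clerk dashboard
  svixId: string,
  svixTimestamp: string,
  rawBody: string,
  signatureHeader: string, // "v1,<base64 sig>"; may hold several, space-separated
): boolean {
  const key = Buffer.from(secret.slice("whsec_".length), "base64");
  const expected = createHmac("sha256", key)
    .update(`${svixId}.${svixTimestamp}.${rawBody}`)
    .digest();
  return signatureHeader.split(" ").some((candidate) => {
    const sig = Buffer.from(candidate.split(",")[1] ?? "", "base64");
    return sig.length === expected.length && timingSafeEqual(sig, expected);
  });
}
```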
7.5 Organisations¶
Clerk organisations are optional per app. Use them when the app has multi-tenant workspaces (Thinklio). Skip them for single-user apps (CalmerFlow).
When using organisations:
- The JWT template must include org_id, org_role, org_slug claims.
- Convex middleware extracts org_id from every JWT and scopes queries to the organisation.
- All data tables include an accountId foreign key (in Convex terms, v.id("account")).
- Members are auto-added to organisation-wide resources via webhooks.
When not using organisations:
- The JWT template omits org claims.
- Convex scopes queries by the user's profile ID instead.
- Sharing is handled by app-level constructs (for example, board invites in CalmerFlow).
8. Next.js integration¶
8.1 Provider hierarchy¶
RootLayout (layout.tsx)
└─ ClerkProvider
└─ ConvexProviderWithClerk (passes useAuth to Convex)
└─ App content
Apps using organisations add an AccountGuard or similar wrapper inside the Convex provider to validate the active organisation and load account data. For Thinklio, the provider tree is:
RootLayout
└─ ClerkProvider
└─ ConvexProviderWithClerk
└─ AccountGuard (validates org + loads account data)
└─ App content
Example wiring:
// apps/web/src/app/providers.tsx
"use client";
import { ClerkProvider, useAuth } from "@clerk/nextjs";
import { ConvexReactClient } from "convex/react";
import { ConvexProviderWithClerk } from "convex/react-clerk";
const convex = new ConvexReactClient(process.env.NEXT_PUBLIC_CONVEX_URL!);
export function Providers({ children }: { children: React.ReactNode }) {
return (
<ClerkProvider>
<ConvexProviderWithClerk client={convex} useAuth={useAuth}>
{children}
</ConvexProviderWithClerk>
</ClerkProvider>
);
}
8.2 Middleware¶
Next.js middleware runs clerkMiddleware(), which handles:
- Session token validation on every request
- Clerk's FAPI proxy in production (routes /clerk requests to Clerk's API)
// apps/web/middleware.ts
import { clerkMiddleware } from "@clerk/nextjs/server";
export default clerkMiddleware();
export const config = {
matcher: [
"/((?!_next|[^?]*\\.(?:html?|css|js(?!on)|jpe?g|webp|png|gif|svg|ttf|woff2?|ico|csv|docx?|xlsx?|zip|webmanifest)).*)",
"/(api|trpc)(.*)",
],
};
This file is required for production. Without it, Clerk's token requests return 404 and auth enters an infinite retry loop.
8.3 Server-side Convex client¶
For server components and API routes that need authenticated Convex access:
import { ConvexHttpClient } from "convex/browser";
import { auth } from "@clerk/nextjs/server";
import { redirect } from "next/navigation";
export async function getAuthenticatedConvex() {
const { getToken, userId } = await auth();
if (!userId) redirect("/login");
const token = await getToken({ template: "convex" });
if (!token) redirect("/login");
const client = new ConvexHttpClient(process.env.NEXT_PUBLIC_CONVEX_URL!);
client.setAuth(token);
return { client };
}
8.4 Environment variables¶
Development (.env.local):
NEXT_PUBLIC_CONVEX_URL=https://<dev-deployment>.<region>.convex.cloud
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_test_...
CLERK_SECRET_KEY=sk_test_...
Production (set in deployment platform):
NEXT_PUBLIC_CONVEX_URL=https://<prod-deployment>.<region>.convex.cloud
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_live_...
CLERK_SECRET_KEY=sk_live_...
9. Mobile and native integration¶
9.1 Flutter¶
Use the clerk_flutter package. Clerk provides a ClerkAuth widget that wraps the app.
Access user and metadata through the SDK's user object.
Convex integration: pass the Clerk session token to the Convex Dart client as a bearer token. There is no ConvexProviderWithClerk equivalent for Flutter yet; token refresh is managed manually.
9.2 Native iOS (Swift)¶
Use the Clerk Swift SDK.
import ClerkSDK
Clerk.configure(publishableKey: "pk_test_...")
if let user = Clerk.shared.user {
let apps = user.publicMetadata["apps"]
}
9.3 Native Android (Kotlin)¶
Use the clerk-android SDK.
Clerk.configure(publishableKey = "pk_test_...")
val user = Clerk.shared.user
val apps = user?.publicMetadata?.get("apps")
9.4 Mobile considerations¶
- Mobile apps use pk_test_* / pk_live_* keys directly (no middleware proxy needed).
- publicMetadata and unsafeMetadata are readable from all mobile SDKs. privateMetadata is never exposed to any client SDK.
- Token refresh is handled by the Clerk SDK automatically.
- For Convex: obtain the session token via the Clerk SDK and pass it to Convex's HTTP client or websocket client as a bearer token.
10. User metadata¶
Clerk supports three metadata tiers on every user record:
| Type | Client-readable | Server-writable | In JWTs | Size limit |
|---|---|---|---|---|
| Public | All SDKs | Backend API only | Optional (via template) | 8 KB |
| Private | Never | Backend API only | Never | 8 KB |
| Unsafe | All SDKs | Client + Backend | Optional (via template) | 8 KB |
10.1 Reading metadata¶
Client-side (any platform): read user.publicMetadata or user.unsafeMetadata from the SDK's user object (for example, via the useUser() hook in React).
Server-side (Next.js):
import { clerkClient } from "@clerk/nextjs/server";
const user = await clerkClient.users.getUser(userId);
const allMeta = user.publicMetadata;
const privateMeta = user.privateMetadata; // only available server-side
Inside Convex (via JWT claims):
const identity = await ctx.auth.getUserIdentity();
const metadata = identity?.metadata; // only if included in JWT template
10.2 Writing metadata¶
Server-side only for public and private. Unsafe metadata can also be written from the client.
import { clerkClient } from "@clerk/nextjs/server";
await clerkClient.users.updateUserMetadata(userId, {
publicMetadata: { tier: "pro", apps: ["calmerflow", "thinklio"] },
privateMetadata: { stripeCustomerId: "cus_xxx" },
});
10.3 Best practices¶
- Use public for anything the client needs to read (tier, feature flags, app list).
- Use private for secrets the client must never see (payment IDs, internal flags).
- Use unsafe sparingly, and only for user-editable preferences where tampering is harmless.
- If including metadata in JWTs, keep total claims under 1.2 KB.
- Metadata changes do not appear in the current session token until the next refresh. They are not instant.
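The 1.2 KB budget can be checked mechanically before changing a JWT template. A small sketch (the claim values are illustrative, matching the extended template in section 7.3):

```typescript
// Estimate the serialized size of custom JWT claims against the ~1.2 KB
// session token budget. Values below are illustrative, not real data.
export function claimsByteSize(claims: Record<string, unknown>): number {
  return new TextEncoder().encode(JSON.stringify(claims)).length;
}

const claims = {
  aud: "convex",
  name: "Ada Lovelace",
  email: "ada@example.com",
  picture: "https://img.clerk.com/abc123",
  email_verified: true,
  org_id: "org_2abc",
  org_role: "admin",
  org_slug: "acme",
};

console.log(claimsByteSize(claims) < 1200 ? "within budget" : "too large");
```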
11. Deployment checklist¶
11.1 New development environment¶
1. Create Convex project (automatic dev deployment).
2. Create Clerk application (development instance).
3. Create convex JWT template in Clerk (minimal or extended claims, per app).
4. Set CLERK_JWT_ISSUER_DOMAIN on Convex dev deployment.
5. Set up Clerk webhook pointing to Convex dev site URL and set CLERK_WEBHOOK_SECRET.
6. Fill in .env.local with Clerk and Convex values.
7. Run npx convex dev and start the app.
8. Sign up and verify the full flow.
11.2 New production environment¶
1. Create production deployment in Convex dashboard.
2. Enable production in Clerk dashboard.
3. Configure DNS records for Clerk (clerk.<domain> CNAME + email records).
4. Wait for Clerk SSL provisioning.
5. Create convex JWT template in Clerk production (same claims as dev).
6. Set environment variables on Convex production:
    - CLERK_JWT_ISSUER_DOMAIN
    - CLERK_WEBHOOK_SECRET
    - app-specific keys (LLM, Postmark, R2, etc.)
7. Set environment variables in deployment platform:
    - NEXT_PUBLIC_CONVEX_URL (build arg)
    - NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY (build arg)
    - CLERK_SECRET_KEY (runtime)
8. Ensure the web Dockerfile declares ARG NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY and ARG NEXT_PUBLIC_CONVEX_URL.
9. Deploy Convex functions (npx convex deploy).
10. Set up Clerk webhook for production pointing to prod Convex site URL.
11. Deploy the web app and verify the full flow.
11.3 Thinklio-specific notes¶
- Clerk Organisations: required. The JWT template must include org_id, org_role, org_slug claims (section 7.3 extended claims).
- Webhooks: all user events plus all organisation and organisationMembership events (section 7.4).
- What the webhooks do:
    - user.created / user.updated upserts into user_profile.
    - organization.created creates account, the default General channel, assigns a storage bucket, and auto-installs the Assistant agent.
    - organizationMembership.created adds the user to all organisation-type channels.
    - Delete events archive or remove the corresponding records.
- Seeding platform data:
# Set deployment target (omit for dev)
export CONVEX_DEPLOYMENT=prod:<deployment-slug>
# Seed agent catalogue (safe to re-run; upserts by slug)
npx convex run seed:syncAgentCatalog
# Seed storage buckets (idempotent)
npx convex run seed:seedStorageBuckets
Adding new agents to the catalogue: add the agent definition to CATALOG_AGENTS in convex/seed.ts, then run npx convex run seed:syncAgentCatalog on each deployment. Existing entries update by slug; new entries are inserted.
- Inspecting data: npx convex run admin:listAccounts.
- Docker / monorepo note: outputFileTracingRoot in next.config.ts must point to the monorepo root (../../). Without it, standalone builds do not include hoisted node_modules.
12. Cross-app identity (future option)¶
Status: Not implemented. Documented here as a design option for when seamless cross-product integration is desired.
Currently each Novansa app (Thinklio, CalmerFlow, etc.) maintains its own Clerk instance with separate user pools. This is simple and provides clean isolation, but a user with accounts on multiple apps has separate identities, separate login sessions, and no awareness between products.
12.1 Shared Clerk instance¶
A single Clerk instance could serve all Novansa apps. Users would sign up once and use the same credentials everywhere.
What this enables:
- Single sign-on across all Novansa products.
- A unified user record with shared metadata.
- Cross-app awareness: CalmerFlow could surface Thinklio data (and vice versa) without a separate integration auth flow.
- Centralised billing and subscription management.
How it works:
- One Clerk application with one user pool.
- Each app gets its own Convex deployment (data stays isolated).
- Each Convex deployment has its own JWT template pointing to the same Clerk issuer.
- publicMetadata tracks which apps a user has activated:
{
"publicMetadata": {
"apps": ["calmerflow", "thinklio"],
"calmerflow": { "tier": "pro", "onboarded": true },
"thinklio": { "tier": "free", "onboarded": false }
}
}
- Each app reads its own namespace from metadata and ignores the rest.
- Convex webhook handlers are per-deployment: each app receives the same Clerk events and handles only what it needs.
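The per-app namespace convention can be sketched as a small helper. The metadata shape is taken from the example above; the helper itself is hypothetical:

```typescript
// Hypothetical helper: each app reads only its own namespace from the shared
// publicMetadata and ignores the rest.
type SharedMetadata = {
  apps?: string[];
  [app: string]: unknown;
};

export function appNamespace<T>(meta: SharedMetadata, app: string): T | undefined {
  if (!meta.apps?.includes(app)) return undefined; // app not activated for this user
  return meta[app] as T | undefined;
}

const meta: SharedMetadata = {
  apps: ["calmerflow", "thinklio"],
  calmerflow: { tier: "pro", onboarded: true },
  thinklio: { tier: "free", onboarded: false },
};
```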
Trade-offs:
- All apps share rate limits, MAU counts, and billing on the Clerk plan.
- A Clerk outage affects all products simultaneously.
- User deletion is global: you cannot delete a user from one app without removing them from all.
- Different apps may need different auth flows (for example, CalmerFlow uses email OTP, Thinklio might add SSO), so a shared instance must support the union of all requirements.
- Clerk's pricing is per-MAU across the instance, which may be cheaper or more expensive depending on user overlap.
12.2 Migration path (when ready)¶
- Create a new shared Clerk production instance.
- Export users from each app's Clerk instance.
- Merge user records by email, combining metadata namespaces.
- Import into the shared instance.
- Update each app's env vars to point to the shared instance.
- Update each Convex deployment's CLERK_JWT_ISSUER_DOMAIN.
- Re-create webhooks pointing each app's Convex site URL to the shared Clerk instance.
This is a one-way door. Merging back to separate instances requires re-splitting user data.
13. Troubleshooting Convex + Clerk¶
"No active organisation" error. The JWT does not include org_id. Check that the Clerk JWT template has "org_id": "{{org.id}}" in its claims. Only relevant for apps using organisations.
Clerk token requests return 404. Missing middleware.ts in the Next.js app. This file must exist and export clerkMiddleware() for production.
Webhooks not firing.
- Check that the webhook endpoint URL matches the Convex site URL (not cloud URL).
- Verify CLERK_WEBHOOK_SECRET is set on the Convex deployment.
- Check Clerk webhook logs in the dashboard for delivery failures.
User data missing in Convex. Webhooks create the data. If the webhook was not configured when the user was created, either trigger a user.updated event by editing the user in the Clerk dashboard or manually insert the record via the Convex dashboard.
Session token too large. If including metadata in JWTs, keep total custom claims under 1.2 KB. Large publicMetadata objects should be fetched server-side rather than embedded in the token.
Metadata changes not reflected immediately. Metadata updates do not appear in the current session token until the next refresh. For time-sensitive changes (for example, upgrading a subscription tier), call session.reload() on the client or wait for the next natural token refresh.
Part C: Programming conventions¶
Part C covers the coding conventions used inside the Convex + TypeScript backend. It replaces and condenses the Go-era programming guide. A compact archival summary of the legacy Go backend conventions is preserved at the end of this part (section 20) for anyone working with retiring services.
14. Single reactive backend¶
All application logic and data live inside Convex. There are no separate gateway, queue, or agent worker services. The Convex runtime provides:
- Reactive queries. Reads that return fresh results to every subscribed client whenever underlying data changes, without any pub/sub plumbing.
- Serialisable mutations. Writes that run under optimistic concurrency control with automatic retries. Each mutation is a single transaction.
- Actions. Long-running server code that can call external APIs (LLM providers, webhooks, R2). Actions can schedule mutations and other actions.
- Scheduler. A built-in durable scheduler for deferred work. Replaces external queues.
- HTTP routes. First-class HTTP endpoints for webhooks and integrations (convex/http.ts).
- Vector search. Native vector indexes on tables for embedding-based knowledge retrieval.
- File storage. Native blob storage; used via R2 as the canonical bucket for Thinklio documents.
Functional decomposition happens through modules, not services. A domain lives in convex/<domain>.ts (for example convex/agent.ts, convex/channel.ts, convex/knowledge.ts) and exposes a flat set of query / mutation / action functions that other modules call through the generated api object.
15. Request lifecycle¶
A client call follows this path:
1. React component issues useQuery / useMutation / useAction
│
2. Convex React client sends request over websocket (queries, mutations)
│ or HTTPS (actions with long-running work)
│ Auth token from Clerk is attached automatically via ConvexProviderWithClerk
│
3. Convex runtime authenticates the request
│ ctx.auth.getUserIdentity() returns the Clerk identity (including org_id)
│ Unauthenticated requests receive null; handlers decide whether to allow
│
4. Function handler runs
│ Queries: read-only, fully reactive
│ Mutations: transactional, retried on conflict
│ Actions: can call fetch(), schedule follow-up work
│
5. Result returned to the client
│ Queries continue to re-run automatically when dependencies change
│ Mutations / actions return once; side effects propagate via reactive queries
For channel-originated requests (Telegram, Postmark inbound email, external API), the same path applies with the authentication step swapped for the channel's own credential verification before the handler resolves the originating user.
16. Query, mutation, and action patterns¶
All server functions are defined using the generated builders from convex/_generated/server. Each function declares an argument schema and a handler. The argument schema is enforced at runtime by Convex and at compile time by TypeScript.
Query (read-only, reactive):
// convex/task.ts
import { query, mutation } from "./_generated/server";
import { v } from "convex/values";
export const listByAccount = query({
args: {
accountId: v.id("account"),
status: v.optional(v.union(v.literal("todo"), v.literal("in_progress"), v.literal("done"))),
limit: v.optional(v.number()),
},
handler: async (ctx, args) => {
const identity = await ctx.auth.getUserIdentity();
if (!identity) throw new Error("unauthenticated");
// Verify the caller belongs to the account (application-layer enforcement)
await assertAccountMember(ctx, identity.subject, args.accountId);
let q = ctx.db.query("task").withIndex("by_account", (q) => q.eq("accountId", args.accountId));
if (args.status) q = q.filter((q) => q.eq(q.field("status"), args.status));
return await q.take(args.limit ?? 100);
},
});
Mutation (transactional, writes):
// convex/task.ts (continued)
import { mutation } from "./_generated/server";
export const create = mutation({
args: {
accountId: v.id("account"),
title: v.string(),
priority: v.optional(v.union(v.literal("low"), v.literal("normal"), v.literal("high"))),
},
handler: async (ctx, args) => {
const identity = await ctx.auth.getUserIdentity();
if (!identity) throw new Error("unauthenticated");
const userId = await resolveUserProfileId(ctx, identity.subject);
await assertAccountMember(ctx, identity.subject, args.accountId);
return await ctx.db.insert("task", {
accountId: args.accountId,
createdBy: userId,
title: args.title,
status: "todo",
priority: args.priority ?? "normal",
});
},
});
Action (long-running, external calls):
// convex/llm.ts
import { action } from "./_generated/server";
import { v } from "convex/values";
import { internal } from "./_generated/api";
export const generateResponse = action({
args: {
interactionId: v.id("interaction"),
model: v.string(),
messages: v.array(v.object({ role: v.string(), content: v.string() })),
},
handler: async (ctx, args) => {
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
method: "POST",
headers: {
Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({ model: args.model, messages: args.messages }),
});
if (!response.ok) throw new Error(`OpenRouter request failed: ${response.status}`);
const body = await response.json();
await ctx.runMutation(internal.interaction.appendAssistantMessage, {
interactionId: args.interactionId,
content: body.choices[0].message.content,
usage: body.usage,
});
return body.choices[0].message.content;
},
});
Calling conventions:
- From React: useQuery(api.task.listByAccount, { accountId }), useMutation(api.task.create), useAction(api.llm.generateResponse).
- From another Convex function: await ctx.runQuery(api.task.listByAccount, { ... }), await ctx.runMutation(api.task.create, { ... }), await ctx.runAction(api.llm.generateResponse, { ... }).
- Internal functions: defined with the internalQuery, internalMutation, and internalAction builders. Not exposed to clients; callable only from other Convex functions. Used for cross-module orchestration that should not be directly invokable by end users.
17. Response and error conventions¶
Convex functions return native TypeScript values (objects, arrays, primitives, Id<> references). There is no HTTP response envelope wrapping these results; the { data, meta } envelope from the Go era is gone. Clients receive the raw value.
Errors are thrown using ConvexError so they surface cleanly to the client:
import { ConvexError } from "convex/values";
if (!identity) {
throw new ConvexError({ code: "unauthenticated", message: "Sign in to continue." });
}
if (task.accountId !== args.accountId) {
throw new ConvexError({ code: "forbidden", message: "Task does not belong to this account." });
}
The error payload is available on the client as error.data. The canonical error code set used across the Thinklio Convex backend:
| Code | Meaning |
|---|---|
| unauthenticated | Missing or invalid auth |
| forbidden | Authenticated but not authorised for this action |
| not_found | Target entity does not exist (or is not visible to the caller) |
| bad_request | Caller-supplied data is invalid |
| conflict | Optimistic concurrency conflict or uniqueness violation |
| budget_exceeded | Budget gate tripped (see 07-security-governance.md Part B) |
| policy_denied | Policy evaluator denied the action |
| rate_limited | Per-caller rate limit hit |
| internal_error | Unexpected server-side failure |
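On the client, a thin helper can translate these codes into behaviour. A minimal sketch, assuming the { code, message } payload shape thrown above; the retry policy and wording here are illustrative, not part of the platform contract:

```typescript
// Sketch: client-side handling of the canonical error codes above.
// The payload shape follows the ConvexError examples in this section;
// which codes are "retryable" is an assumption for illustration.
type ErrorPayload = { code: string; message: string };

function isRetryable(payload: ErrorPayload): boolean {
  // Transient failures are worth retrying; caller errors are not.
  return payload.code === "rate_limited" || payload.code === "internal_error";
}

function userFacingMessage(payload: ErrorPayload): string {
  switch (payload.code) {
    case "unauthenticated":
      return "Please sign in to continue.";
    case "forbidden":
      return "You do not have access to this resource.";
    case "not_found":
      return "That item no longer exists.";
    default:
      // Fall back to the server-provided message for other codes.
      return payload.message;
  }
}

console.log(isRetryable({ code: "rate_limited", message: "slow down" })); // true
```

In a React component, the payload would come from the caught error's `error.data` field.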
For external-facing APIs (Channel, Platform, Integration) served via convex/http.ts, responses use the platform's three-surface error envelope documented in 09-external-api-tool-integration.md. Human-readable message fields are localised per the rules in 10-client-applications.md Part D.
18. Logging conventions¶
Server functions use standard console methods, which Convex captures into its log stream. Structured context goes in as a second argument:
console.log("harness_step_completed", {
interactionId: ix._id,
step: "llm_call",
durationMs: Date.now() - startedAt,
tokensIn: usage.input_tokens,
tokensOut: usage.output_tokens,
cost: usage.cost_usd,
});
Logs are viewable in the Convex dashboard (Logs tab) and can be shipped to an external aggregator via the Convex log streaming export. The observability conventions (OpenTelemetry, Prometheus, alerts) are documented in Part E.
Correlation IDs are managed per interaction. The interactionId is the canonical correlation key for all logs related to a single user turn through the harness. When a job is involved, include jobId as well. When a delegation chain is running, include the parent interactionId too.
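The correlation rules above can be captured in a small helper so log call sites stay consistent. A sketch only; this helper is illustrative, not an existing module:

```typescript
// Sketch: build the structured log context described above. interactionId is
// always present; jobId and the parent interaction id are included only when
// the turn involves a job or a delegation chain.
type LogContext = {
  interactionId: string;
  jobId?: string;
  parentInteractionId?: string;
};

function logContext(
  interactionId: string,
  opts: { jobId?: string; parentInteractionId?: string } = {},
): LogContext {
  return {
    interactionId,
    ...(opts.jobId ? { jobId: opts.jobId } : {}),
    ...(opts.parentInteractionId ? { parentInteractionId: opts.parentInteractionId } : {}),
  };
}

// Usage with the console.log convention from this section:
console.log("harness_step_completed", { ...logContext("ix_123", { jobId: "job_9" }), step: "llm_call" });
```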
19. Data access patterns¶
19.1 Indexes before filters¶
Every list-style query should hit an index. Convex indexes are defined on the schema and referenced with withIndex. Avoid filter() for primary access paths: it scans.
// Good: uses the composite index (accountId, createdAt desc)
await ctx.db
.query("interaction")
.withIndex("by_account_recent", (q) => q.eq("accountId", accountId))
.order("desc")
.take(50);
// Acceptable for narrowing after an index: filter runs against a small set
await ctx.db
.query("task")
.withIndex("by_account", (q) => q.eq("accountId", accountId))
.filter((q) => q.eq(q.field("status"), "todo"))
.take(100);
// Bad: full table scan filtered in memory
await ctx.db.query("task").filter((q) => q.eq(q.field("accountId"), accountId)).collect();
19.2 Account scoping is application-layer¶
Every tenant-scoped read and write must check that the authenticated user belongs to the accountId being acted on. This check runs in a shared helper (assertAccountMember) called from every query and mutation that touches account data. Convex has no automatic row-level security; the helpers are the enforcement point.
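A minimal sketch of such a helper, assuming a membership lookup keyed by (userId, accountId). The ctx type is narrowed to the one operation used so the logic can be exercised against an in-memory fake; the real helper would take a Convex QueryCtx, query a membership index, and throw a ConvexError with code "forbidden":

```typescript
// Sketch of the assertAccountMember enforcement point described above.
// The MembershipLookup interface and table shape are assumptions for
// illustration; in production this wraps a Convex index query.
interface MembershipLookup {
  findMembership(userId: string, accountId: string): Promise<{ _id: string } | null>;
}

async function assertAccountMember(
  ctx: MembershipLookup,
  userId: string,
  accountId: string,
): Promise<void> {
  const membership = await ctx.findMembership(userId, accountId);
  if (!membership) {
    // Real code: throw new ConvexError({ code: "forbidden", ... })
    throw new Error("forbidden: caller is not a member of this account");
  }
}

// In-memory fake standing in for the database lookup:
const fake: MembershipLookup = {
  async findMembership(userId, accountId) {
    return userId === "u1" && accountId === "acc1" ? { _id: "m1" } : null;
  },
};

assertAccountMember(fake, "u1", "acc1").then(() => console.log("member ok")); // prints "member ok"
```

Because the check is application-layer, every new query or mutation touching account data must call it explicitly; code review should treat a missing call as a bug.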
19.3 Writes are transactional¶
A mutation sees a consistent snapshot of the database and commits atomically. If two mutations collide on the same record, Convex retries one of them. Write code that is safe to retry: read, compute, write, without side effects that depend on position in the retry loop.
Side effects that cannot be retried (sending an email, calling an LLM) must run in an action, not a mutation. Actions can schedule follow-up mutations to record results.
19.4 Pagination uses indexes and cursors¶
Convex provides .paginate({ cursor, numItems }) on indexed queries. The cursor is opaque and round-trips through the client. Avoid offset-style pagination: it does not scale and is not supported idiomatically.
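The opaque-cursor contract can be illustrated with a toy implementation over a plain array. This shows only the round-trip shape, not the mechanism: the toy cursor is a stringified offset purely for illustration, while Convex's real cursor format is internal and must never be parsed by clients:

```typescript
// Toy illustration of the .paginate({ cursor, numItems }) contract: the
// client stores continueCursor and sends it back unchanged to get the
// next page.
type Page<T> = { page: T[]; isDone: boolean; continueCursor: string };

function paginate<T>(items: T[], cursor: string | null, numItems: number): Page<T> {
  const start = cursor === null ? 0 : Number(cursor); // toy cursor: an offset
  const page = items.slice(start, start + numItems);
  const next = start + page.length;
  return { page, isDone: next >= items.length, continueCursor: String(next) };
}

const all = ["a", "b", "c", "d", "e"];
const first = paginate(all, null, 2);                    // page ["a","b"], isDone false
const second = paginate(all, first.continueCursor, 2);   // page ["c","d"]
```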
19.5 Vector search¶
Knowledge retrieval uses vector indexes declared in the schema. Vector indexes support filter expressions to scope retrieval to the caller's account, team, or user.
const hits = await ctx.vectorSearch("knowledge_fact", "by_embedding", {
vector: queryEmbedding,
limit: 20,
filter: (q) => q.eq("accountId", accountId),
});
Matching fact documents are then fetched by id in the calling function.
19.6 Archiving vs deleting¶
Prefer soft-delete (isArchived: true) for user-authored content and for anything referenced by audit trails. Hard delete only when GDPR deletion or equivalent compliance flow demands it. The user.deleted path anonymises contributions, hard-deletes user knowledge, and records the deletion in the audit log.
20. Legacy Go backend conventions (archival)¶
The Go backend is being retired. This subsection is preserved for anyone working on the remaining Go services until they are decommissioned. New code should not be written against these patterns.
Single binary, logical services. All Go services (gateway, agent, context, tool, queue, usage) ran within one process under cmd/server/main.go. Services communicated via Redis Streams events, not in-process calls, so the topology could split back out if ever needed.
Request lifecycle (Go era).
HTTP Request → Kill Switch middleware → Auth middleware → Route handler
→ publish message.received event → agent consumer group → harness
→ publish message.response event → gateway consumer group → channel adapter
Response envelope. Every Go API response used { data, meta } or { error, meta } where meta always carried request_id and timestamp. Helpers: api.JSON, api.Created, api.Accepted, api.BadRequest, api.Unauthorized, api.Forbidden, api.NotFound, api.InternalError.
Error codes. bad_request, unauthorized, forbidden, not_found, internal_error, method_not_allowed, slug_taken, create_failed, platform_unavailable. The Convex set (section 17) is a superset and covers the same meanings.
Database. Supabase Cloud PostgreSQL via Supavisor pooler. Pool sized at MaxConns 10, MinConns 2, MaxConnLifetime 30 min, MaxConnIdleTime 5 min. Row-level security used for client-facing queries; the Go service role connection bypassed RLS and enforced scoping at the application layer via lookupAccountID() and friends.
Query patterns. List endpoints used dynamic SQL with parameterised filters ($1, $2, ...) and argument counter argN. Create endpoints used INSERT ... RETURNING id. Update endpoints used allowlisted field maps to prevent injection. Delete endpoints either hard-deleted or soft-deleted via is_archived = true.
Event bus. Redis Streams. Each logical service had its own consumer group; each instance had a unique consumer name. Streams were named events:<type> (for example events:message.received). Events carried id, type, source, agent_id, user_id, team_id, account_id, session_id, parent_id, payload, metadata (with trace_id, version, priority), and created_at. Published with XADD, consumed with XREADGROUP.
Auth. Supabase JWT primary, API key secondary. The Auth middleware extracted sub (user UUID) and email from Supabase tokens and user_id, account_id from API keys (thk_...). The kill switch middleware returned 503 when platform_status != "online".
Logging. log/slog with JSON output at INFO by default. Contextual fields included interaction_id, agent_id, error.
Repository shape.
thinklio/
├── cmd/server/ # Single binary entry point
├── internal/ # Application packages (29 of them)
│ ├── admin/ api/ auth/ channel/ comms/ config/ database/
│ ├── delegation/ documents/ email/ event/ feedback/ harness/
│ ├── health/ integrations/ jobs/ knowledge/ llm/ notification/
│ ├── oauth/ planning/ platform/ storage/ telegram/ templates/
│ ├── tenant/ tools/ usage/ webhooks/
├── pkg/ # Public types and clients
├── deploy/ # Docker, Compose, reverse proxy
└── tests/ # Integration and end-to-end tests
Everything in the retired Go codebase that still matters conceptually (the harness, the knowledge layers, the delegation model, the channels, the platform services) has been reimplemented in Convex. Design details for those systems live in 02-system-architecture.md, 03-agent-architecture.md, 05-persistence-storage-and-ingestion.md, 06-events-channels-and-messaging.md, and 08-agents-catalogue.md.
Part D: Custom agent and integration developer guide¶
Part D is for two audiences: engineers building custom agents (both internal and external), and integration partners registering external tools, subscribing to events, or pushing data to the platform. For the full API contract reference, see 09-external-api-tool-integration.md. Part D focuses on the practical developer journey.
21. Custom agent development¶
21.1 Agent definition¶
Custom agents are registered via the Capabilities API (POST /v1/capabilities) with a manifest:
{
"version": "1",
"kind": "agent",
"name": "Research Assistant",
"slug": "research-assistant",
"description": "Finds and summarises information from the web",
"accent_colour": "#3B82F6",
"execution": {
"type": "platform",
"system_prompt": "You are a research assistant. Use the web_search tool to find information and provide well-sourced summaries."
},
"capability_level": "tools_only",
"tools": ["web_search", "web_reader", "memory_store"],
"channels": ["api", "telegram"],
"metadata": {
"origin": "installed",
"author": "Your Name",
"version": "1.0.0"
}
}
System prompt guidelines:
- Start with a clear role definition.
- Specify tool usage patterns (when to use which tool).
- Define response format expectations.
- Include any domain-specific constraints.
- The system prompt is augmented at runtime with knowledge facts, plan performance data, and the i18n locale directive.
Execution types:
"platform": agent runs on Thinklio, uses the system prompt and tool definitions."external": agent runs on your infrastructure; Thinklio sends HTTP requests toendpoint_url.
21.2 Tool selection and configuration¶
Assign tools to an agent via the agent_tool table or through the manifest's tools array. Each assignment has a permission level:
"read": can use tools withtrust_level = "read"."write": can use"read"and"write"tools."admin": can use all tools (requires account or platform admin).
Built-in tools available for assignment:
- current_time returns the current time in a timezone.
- memory_store stores knowledge facts.
- memory_search searches knowledge facts.
- web_search (Tavily) performs web search.
- web_reader reads web pages.
- task_* (internal or Todoist) handles task management.
- google_calendar provides calendar access (requires user OAuth).
- gmail provides email access (requires user OAuth).
- hubspot provides CRM operations.
The tool catalogue and MCP server catalogue are documented in 09-external-api-tool-integration.md.
21.3 Knowledge seeding¶
Pre-load knowledge for a custom agent by inserting facts with scope = "agent":
// convex/seed.ts
await ctx.db.insert("knowledge_fact", {
scope: "agent",
scopeId: agentId,
agentId,
subject: "company",
predicate: "specializes in",
value: "SaaS analytics",
category: "domain",
confidence: 1.0,
locale: "en",
});
Alternatively, use the memory_store tool during initial interactions to build up the knowledge base organically from conversation.
21.4 Agent templates¶
Templates are reusable agent configurations. When an agent is instantiated from a template, it inherits system prompt, tool assignments, view definitions, and configuration schema. The agent.templateId field links the agent back to its source template.
Templates are seeded on deployment from convex/seed.ts. The seeder is idempotent and upserts by slug. Adding a new template:
- Add a template entry to CATALOG_AGENTS in convex/seed.ts.
- Run npx convex run seed:syncAgentCatalog on each deployment.
- Existing entries update by slug; new entries are inserted.
21.5 Composed agents (delegation)¶
To create a multi-agent composition:
- Create the delegate agent(s) with their own system prompts and tools.
- Register each delegate as an agent-type tool:
await ctx.db.insert("tool", {
slug: "research_agent",
name: "Research Agent",
description: "Delegates research tasks",
type: "agent",
trustLevel: "read",
parameterSchema: {
type: "object",
properties: {
task: { type: "string" },
context: { type: "string" },
},
},
config: { agentId: delegateAgentId },
status: "active",
});
- Assign the agent-type tool to the orchestrator agent via agent_tool.
- The orchestrator's system prompt should describe when to delegate.
Delegation limits:
- Maximum depth defaults to 3 levels (configurable per account).
- Each delegation creates a child interaction with proper lineage tracking (parent interaction id and agent lineage chain).
- The child interaction runs the full harness independently.
- Cycles are rejected at composition time and at runtime.
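The depth and cycle checks above can be sketched as a pure guard over the agent lineage chain. The lineage-as-array representation and the function name are assumptions for illustration; the real implementation lives in the delegation module:

```typescript
// Sketch: reject delegation when the target agent already appears in the
// lineage chain (cycle) or when the chain has reached the maximum depth.
function canDelegate(
  lineage: string[],      // agent ids from the root orchestrator to the current agent
  targetAgentId: string,
  maxDepth = 3,           // default depth limit per the text above
): { allowed: boolean; reason?: string } {
  if (lineage.includes(targetAgentId)) {
    return { allowed: false, reason: "cycle" };
  }
  if (lineage.length >= maxDepth) {
    return { allowed: false, reason: "max_depth" };
  }
  return { allowed: true };
}

console.log(canDelegate(["orchestrator"], "research_agent").allowed); // true
console.log(canDelegate(["a", "b"], "a").reason);                     // "cycle"
```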
21.6 Testing custom agents¶
- Unit test tool handlers: implement the Handler interface and test in isolation.
- Integration test via Channel API: send messages via POST /v1/channels/:id/messages and verify responses.
- Monitor interactions: query the interaction and step tables to inspect execution flow, costs, and error details.
- Check knowledge: verify that the knowledge extraction path captures the expected facts.
Part E covers the broader testing strategy.
22. Integration API¶
22.1 Registering external tools¶
Register a new tool via POST /v1/capabilities:
{
"version": "1",
"kind": "tool",
"name": "Weather Lookup",
"slug": "weather_lookup",
"description": "Returns current weather for a location",
"execution": {
"type": "external",
"endpoint_url": "https://your-service.com/tools/weather",
"health_check_url": "https://your-service.com/health",
"timeout_seconds": 10
},
"parameter_schema": {
"type": "object",
"properties": {
"location": { "type": "string", "description": "City name or coordinates" }
},
"required": ["location"]
},
"return_schema": {
"type": "object",
"properties": {
"temperature": { "type": "number" },
"conditions": { "type": "string" }
}
},
"trust_level": "read",
"execution_mode": "immediate"
}
Response:
The tool is created with status: "active". Assign it to agents via the agent_tool table or through the agent's manifest.
Slug uniqueness: slugs must be unique across all tools. A 409 Conflict is returned if the slug is already taken.
22.2 Event subscription¶
Subscribe to events via webhooks. The webhook system dispatches events to registered URLs.
Events currently dispatched to webhooks:
- message.response
- interaction.completed
- job.state_changed
See 09-external-api-tool-integration.md for the full event schema and signing rules.
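As a shape reference only: webhook receivers typically verify an HMAC signature over the raw request body before trusting the payload. The secret format and exact scheme below are assumptions; the authoritative signing rules are the ones in 09-external-api-tool-integration.md.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical verification sketch: HMAC-SHA256 over the raw body, compared
// in constant time. Header name, secret prefix, and algorithm are assumptions.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so guard first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}

const body = JSON.stringify({ type: "interaction.completed" });
const sig = createHmac("sha256", "whsec_test").update(body).digest("hex");
console.log(verifySignature(body, sig, "whsec_test")); // true
```

Always verify against the raw bytes as received; re-serialising the parsed JSON can change key order or whitespace and break the comparison.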
22.3 Agent event webhooks¶
Push structured events to agents via POST /v1/agent-events:
{
"agent_id": "agent-uuid",
"event_type": "external_trigger",
"payload": {
"source": "crm",
"action": "deal_closed",
"data": { "deal_id": "123", "amount": 50000 }
}
}
This enqueues a message.received event for the specified agent, triggering the harness to run with the payload available as inbound context.
22.4 Callback pattern¶
For deferred or asynchronous tool execution, return results via POST /v1/callbacks/{call_id}:
{
"job_reference": "job-uuid",
"status": "resolved",
"result": {
"output": "The weather in Berlin is 15C and sunny."
}
}
The callback updates the job state and sets has_useful_output = true. The job system's observer pattern then triggers a follow-up interaction with the originating agent.
22.5 Listing capabilities¶
GET /v1/capabilities returns all registered tools and agents:
[
{ "id": "...", "slug": "web_search", "name": "Web Search", "kind": "tool", "type": "internal", "trust_level": "read" },
{ "id": "...", "slug": "research-bot", "name": "Research Bot", "kind": "agent", "capability_level": "tools_only", "execution_type": "platform" }
]
Filter by kind: GET /v1/capabilities?kind=tool or GET /v1/capabilities?kind=agent.
Get a specific capability: GET /v1/capabilities/{id} returns the full manifest (for agents) or basic info (for tools).
23. App UI integration¶
The Thinklio app (Next.js 15, React 19) consumes the Convex backend through the Convex React client. This is the canonical pattern; REST endpoints are only used from non-Convex clients and from the external Channel / Platform / Integration APIs defined in 09-external-api-tool-integration.md.
23.1 Consuming Convex from React¶
Reads:
import { useQuery } from "convex/react";
import { api } from "@/convex/_generated/api";
const tasks = useQuery(api.task.listByAccount, { accountId });
// tasks === undefined while loading; then the typed result; re-renders on change
Writes:
import { useMutation } from "convex/react";
const createTask = useMutation(api.task.create);
await createTask({ accountId, title: "Review the brief" });
Long-running work (external API calls, LLM generations):
import { useAction } from "convex/react";
const generate = useAction(api.llm.generateResponse);
const text = await generate({ interactionId, model, messages });
23.2 Auth integration¶
The app is wrapped in ClerkProvider and ConvexProviderWithClerk (see section 8.1). Inside components, use Clerk's React hooks for identity and organisation selection:
import { useUser, useOrganization } from "@clerk/nextjs";
const { user } = useUser();
const { organization } = useOrganization();
Convex sees the authenticated identity automatically: every useQuery / useMutation / useAction call carries the Clerk token, and server functions read it via ctx.auth.getUserIdentity().
23.3 Realtime updates¶
Convex queries are reactive by default. Any component subscribed to a query automatically re-renders when the underlying data changes, regardless of who made the change (this user, another user in the same account, a background action, a webhook). There is no separate channel, subscription, or polling layer to manage.
When the update source is a long-running action that writes via ctx.runMutation, the change propagates to subscribed clients as soon as the mutation commits.
23.4 Agent views¶
Agent views define the UI components rendered for each agent. Retrieved via useQuery(api.agent.getViews, { agentId }).
Views are defined in the agent's manifest (or inherited from the template manifest):
[
{
"slug": "board",
"name": "Task Board",
"type": "kanban",
"data_source": "task",
"config": {
"group_by": "status",
"columns": ["todo", "in_progress", "done"]
}
},
{
"slug": "chat",
"name": "Chat",
"type": "conversation",
"config": {}
}
]
The frontend reads these view definitions and renders the appropriate UI components. The component layer for the Thinklio app is documented in 10-client-applications.md Part B.
23.5 External HTTP callers¶
For clients that cannot use the Convex React client (non-Thinklio apps, CLIs, server-to-server integrations), the Channel, Platform, and Integration APIs expose a conventional HTTPS + JSON surface. Authentication uses the platform API key (thk_...). See 09-external-api-tool-integration.md for endpoint-by-endpoint contracts, rate limits, and error codes.
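A hypothetical request builder for that surface. The api.thinklio.ai host matches the scrape target used elsewhere in this guide, but the base URL and Bearer header scheme are assumptions here; consult 09-external-api-tool-integration.md for the real contract.

```typescript
// Sketch: construct an authenticated request to the external HTTP surface.
// The caller passes req.url and req.init straight to fetch().
type PlatformRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body?: string };
};

function buildPlatformRequest(path: string, apiKey: string, body?: unknown): PlatformRequest {
  const headers: Record<string, string> = { Authorization: `Bearer ${apiKey}` }; // assumed scheme
  if (body !== undefined) headers["Content-Type"] = "application/json";
  return {
    url: `https://api.thinklio.ai${path}`, // assumed base URL
    init: {
      method: body === undefined ? "GET" : "POST",
      headers,
      ...(body !== undefined ? { body: JSON.stringify(body) } : {}),
    },
  };
}

const req = buildPlatformRequest("/v1/capabilities?kind=tool", "thk_example");
console.log(req.init.headers.Authorization); // "Bearer thk_example"
```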
Part E: Testing and observability¶
Part E is the canonical reference for how Thinklio is tested and observed. It sets the test strategy, describes the CI pipeline, explains the profiling mode, specifies the OpenTelemetry and Prometheus metrics Thinklio exports, and lists the structured log lines and Alertmanager rules that production runs on.
Parts of this section reference the retiring Go services (they still exist, they still need to be tested and observed until they are decommissioned) and parts describe the Convex-era equivalents that are taking over.
24. Testing strategy¶
24.1 Philosophy: critical path first¶
Tests target the paths users actually hit. We do not aim for a coverage percentage; we aim for confidence on the hot path. High-signal tests sitting on the critical path beat low-signal tests padding the coverage number.
24.2 Convex test layout¶
Convex functions are tested with convex-test, which runs the same function code against an in-memory Convex runtime. Test files live next to the functions they cover.
Typical test:
import { convexTest } from "convex-test";
import { expect, test } from "vitest";
import schema from "./schema";
import { api } from "./_generated/api";
test("listByAccount returns only account-scoped tasks", async () => {
const t = convexTest(schema);
// Seed
await t.run(async (ctx) => {
const accountId = await ctx.db.insert("account", { slug: "acme", name: "Acme" });
await ctx.db.insert("task", { accountId, title: "A", status: "todo", createdBy: "u1" });
return accountId;
});
// Act + assert via the public API
const result = await t.withIdentity({ subject: "user_1" }).query(api.task.listByAccount, { accountId });
expect(result).toHaveLength(1);
});
24.3 Go test inventory (archival)¶
The retiring Go services ship with the following test matrix. Maintain it until the corresponding service is decommissioned.
| Package | Tests | What's covered |
|---|---|---|
| internal/triage | 30 cases | All intent categories, entity extraction, curly quotes, scoring, thresholds, edge cases |
| internal/tools | 3 cases | CurrentTimeHandler, tool formatting, policy decisions |
| internal/api | 1 case | Response formatting |
| internal/auth | | JWT validation |
| internal/config | | Config loading, defaults |
| internal/delegation | | Cycle detection, depth limits |
| internal/documents | | Document chunking |
| internal/event | | Event creation, bus |
| internal/harness | | Executor basics |
| internal/jobs | | Job types |
| internal/knowledge | | Fact extraction, JSON |
| internal/llm | | Response parsing |
| internal/usage | | Budget status, nil handling |
| tests/integration | | Provider registry, tool resolution |
24.4 Running tests¶
# Convex tests (vitest)
npm test # Run all tests once
npm test -- --watch # Watch mode
npm test -- convex/task.test.ts # Single file
# Web type check
cd apps/web && npx tsc --noEmit
# Web lint
cd apps/web && npm run lint
# Legacy Go tests (while services remain in the tree)
go test ./... -v -count=1
go test -tags profile ./... -v -count=1
make test-cover
go test ./internal/triage/ -v
25. CI pipeline¶
GitHub Actions workflow: .github/workflows/ci.yml. Runs on push to main and on pull requests. Jobs run in parallel.
Convex job:
- npm ci
- npm run typecheck (wraps tsc --noEmit over the convex/ source)
- npm test (vitest over convex-test)
- npx convex dev --once against a CI deployment (validates schema push)
Web job:
- npm ci
- tsc --noEmit (type check)
- npm run build (production build)
Go job (while legacy services remain):
- go build ./...
- golangci-lint (10 linters)
- go test ./... -v -count=1
26. Profiling mode¶
The legacy Go services support a profile build tag that turns on fine-grained tracing without changing the production binary's behaviour or performance.
26.1 Build tags¶
# Development and testing: profiling enabled
go build -tags profile ./cmd/server/
# Production: profiling disabled (zero overhead)
go build ./cmd/server/
26.2 Profile package (internal/profile/)¶
Two files with build tags:
- profile.go (//go:build profile): real tracing with slog.Debug.
- noop.go (//go:build !profile): no-op stubs, zero overhead.
26.3 Usage¶
defer profile.TraceStart("context_assembly")()
defer profile.TraceStart("llm_call", "model", model)()
26.4 Current instrumentation points¶
- interaction: full interaction lifecycle.
- triage_parse: computational parser.
- llm_call: OpenRouter / Anthropic API call.
For the Convex backend, profiling is replaced by the Convex dashboard's built-in function timing view and by the OpenTelemetry integration described in section 27.
27. OpenTelemetry and Prometheus¶
27.1 Architecture¶
Go server / Convex actions ── /metrics endpoint ──→ Prometheus (analytics_nbg1)
↓
Grafana (analytics_nbg1)
↓
Dashboards + alerts
For Convex, a lightweight actions-side OpenTelemetry exporter ships metrics to an intermediary Prometheus Pushgateway (or directly to a metrics collector that Prometheus scrapes). The Convex dashboard provides native per-function timing for queries and mutations and is the fastest path for day-to-day investigation.
27.2 Prometheus metrics¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| thinklio_interaction_duration_seconds | Histogram | tier (triage/llm) | Interaction duration |
| thinklio_triage_intent_total | Counter | intent | Triage classification count |
| thinklio_llm_call_duration_seconds | Histogram | model | LLM API call duration |
| thinklio_tool_execution_duration_seconds | Histogram | tool | Tool execution duration |
| thinklio_cache_hit_total | Counter | n/a | Cache hits (Redis in the Go era; the Convex query cache behaves differently, see note below) |
| thinklio_cache_miss_total | Counter | n/a | Cache misses |
| thinklio_interaction_cost_total | Counter | n/a | Total LLM cost (USD) |
Plus standard Go runtime metrics (goroutines, GC, memory) from the retiring Go services, and Convex function timing from the Convex dashboard for the reactive backend.
Note on cache metrics under Convex: Convex maintains a reactive query cache internally and does not expose Redis-style hit and miss counters. The cache metrics above apply to the retiring Go services; the Convex-era equivalent is the function execution count and duration in the Convex dashboard.
27.3 Metrics endpoint¶
The /metrics endpoint returns Prometheus-format metrics. Scrape it from analytics_nbg1.
27.4 Prometheus configuration¶
Add to prometheus.yml on analytics_nbg1:
scrape_configs:
- job_name: 'thinklio'
scrape_interval: 15s
static_configs:
- targets: ['api.thinklio.ai:443']
scheme: https
27.5 Grafana dashboard¶
Recommended panels:
Row 1: Overview
- Interaction rate (req/min)
- P50 / P95 / P99 interaction duration
- Triage hit rate (percent handled at Tier 2)
- Cache hit rate
Row 2: Performance
- LLM call duration histogram
- Tool execution duration by tool
- Interaction duration by tier (triage vs LLM)
Row 3: Cost
- Cumulative LLM cost
- Cost per interaction trend
- Cost by model
Row 4: Triage analysis
- Intent distribution (pie chart)
- Triage intent over time (stacked area)
27.6 Alerting¶
Recommended Alertmanager rules:
groups:
- name: thinklio
rules:
- alert: HighInteractionLatency
expr: histogram_quantile(0.95, rate(thinklio_interaction_duration_seconds_bucket[5m])) > 10
for: 5m
labels:
severity: warning
annotations:
summary: P95 interaction latency exceeds 10s
- alert: HighCacheMissRate
expr: rate(thinklio_cache_miss_total[5m]) / (rate(thinklio_cache_hit_total[5m]) + rate(thinklio_cache_miss_total[5m])) > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: Cache miss rate exceeds 50%
- alert: LLMErrorRate
expr: rate(thinklio_interaction_duration_seconds_count{tier="llm"}[5m]) == 0 and rate(thinklio_triage_intent_total[5m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: No LLM interactions completing; possible API outage
28. Structured logging in production¶
All logs are JSON. Convex functions use console.log with a structured context object (section 18). Retiring Go services use slog.JSONHandler. The field names below are common to both.
| Event | Level | Key fields |
|---|---|---|
| triage | INFO | interaction_id, intent, strength, ambiguity, subject |
| step_completed | INFO | interaction_id, step, duration_ms, tokens_in, tokens_out, cost |
| tier_2_direct_response | INFO | interaction_id, action, duration_ms |
| interaction_completed | INFO | interaction_id, total_cost, total_duration_ms |
| channel_triage | INFO | intent, strength, ambiguity, subject |
| cache_warmed | INFO | agents, tools, knowledge_scopes, duration_ms |
| batch_flush_failed | ERROR | interaction_id, error |
28.1 Log aggregation¶
Ship logs from Coolify container stdout (for legacy Go services) and from Convex log streaming (for the Convex backend) to a log aggregation service. Grafana Loki on analytics_nbg1 is recommended. Use interaction_id as the correlation key across sources.
28.2 Existing infrastructure (analytics_nbg1)¶
| Service | Purpose | Status |
|---|---|---|
| Prometheus | Metrics collection | Ready: add scrape target |
| Grafana | Dashboards | Ready: create Thinklio dashboard |
| Alertmanager | Alerting | Ready: add Thinklio rules |
| Metabase | SQL analytics | Available for ad-hoc queries |
| Umami | Web analytics | Running for app.thinklio.ai |
28.3 Configuration status¶
| Item | Status |
|---|---|
| Prometheus scrape target | Done: api.thinklio.ai:443 via HTTPS, up=1 confirmed |
| Grafana dashboard | Done: 8 panels imported (Thinklio Platform) |
| Alertmanager rules | Not yet: add latency, cache miss, LLM outage alerts |
| Coolify /metrics access | Works: auth middleware passes unauthenticated requests through |
29. Testing and observability backlog¶
29.1 Testing¶
- Channel API tests (convex/http.ts handlers): test the triage synchronous path vs the asynchronous path, missing fields, and auth.
- Integration tests: full pipeline over convex-test verifying triage to knowledge lookup to response. Legacy Go integration tests in tests/integration/pipeline_test.go remain while Go services exist.
- CI Redis service: add a Redis container to GitHub Actions for the legacy Go integration tests (until those services retire).
29.2 Observability¶
- Alertmanager rules: P95 latency > 10s, cache miss rate > 50%, LLM outage detection (see alert rules in section 27.6).
- Triage logging: log every parse result (intent, strength, entities) for all messages, both Tier 2 and Tier 3, to measure actual traffic distribution and Tier 2 hit rate. Required before expanding Tier 2 (per the Smart Input Triage design in 03-agent-architecture.md).
- Additional profile instrumentation (Go era): add profile.TraceStart to Redis get / set, Postgres queries, and individual tool execution.
- Tool execution metrics: instrument the tool executor with observe.RecordToolExecution per tool call.
- Grafana dashboard enhancements: triage intent distribution pie chart, interaction duration by tier, tool execution breakdown.
Part F: Deployment, administration, and operations¶
Part F describes how to deploy, configure, administer, and maintain a Thinklio installation. It covers the full lifecycle from initial deployment through to scaled production, including day-to-day administration via the admin dashboard.
This part has two flavours: the Convex-era deployment model (which is the path forward) and the legacy Go monolith deployment (preserved as reference while it is retired). Where the two diverge, both are documented side by side with the current recommendation called out.
This is not a code-level implementation guide. It assumes access to the Thinklio GitHub repository and a competent system administrator familiar with Linux, Docker, Clerk, and Convex.
30. System components¶
30.1 Convex-era deployment (current)¶
| Component | Technology | Purpose |
|---|---|---|
| Convex | Managed (convex.dev) | Backend: database, server functions, scheduler, HTTP routes, vector search |
| Clerk | Managed (clerk.com) | Authentication, organisations, RBAC |
| Web app (thinklio-app) | Next.js 15 (Node runtime) | User-facing web interface, deployed via Coolify or Vercel |
| Admin dashboard | Next.js 15 (Node runtime) | Account, user, agent, job, and system management (includes Agent Studio) |
| Cloudflare R2 | Managed | Object storage for documents and artefacts |
| Postmark | Managed | Transactional email and inbound email channel |
| OpenRouter / Anthropic | Managed | LLM provider |
| Voyage AI | Managed | Embedding provider |
No servers, no Redis, no Postgres, and no reverse proxy are operated by Thinklio in the Convex-era deployment. The remaining operational surfaces are the web app runtime, the admin dashboard runtime, and the managed external services.
30.2 Legacy Go deployment (archival)¶
| Component | Technology | Purpose |
|---|---|---|
| Thinklio Server | Single Go binary | All platform services (gateway, agent, context, tool, queue, usage) running as logical services within a single process |
| Redis | Redis 7 | Cache, event bus (Streams), rate limiting, sessions, active job store |
| Supabase Cloud | Managed PostgreSQL | Primary data store, auth, Vault, Realtime, RLS |
| Admin Dashboard | React/Next.js | Account, user, agent, job, and system management |
| Reverse Proxy | Nginx or Caddy | SSL termination, routing, static asset serving |
The legacy Go stack runs on a single VPS (or a fleet behind a load balancer for scale). Operational details for this stack are kept in sections 32.3, 33.2, 34.2, 35.2, 36, 37, 38, and 39.
31. Infrastructure and prerequisites¶
31.1 Convex-era prerequisites¶
- Convex production deployment (created in the Convex dashboard under the Thinklio project).
- Clerk production instance with the `convex` JWT template and webhooks configured (see Part B).
- Cloudflare R2 bucket and access keys.
- OpenRouter or Anthropic API key.
- Voyage AI API key.
- Postmark server token (for outbound transactional mail and inbound email channel).
- Hosting target for the web app and admin dashboard (Coolify, Vercel, Cloudflare Pages, or equivalent). Minimum: Node 20 runtime, 512 MB RAM, 0.5 vCPU per app.
- Domain name with DNS control.
External dependencies shared with the legacy stack: LLM provider account, Telegram Bot Token (if using Telegram channel), DNS provider, SMTP provider (optional). A backup storage destination is only needed for the legacy Redis store, not for the Convex deployment.
31.2 Legacy Go infrastructure requirements¶
Minimum (single VPS, development / testing):
- 4 vCPU, 8 GB RAM, 160 GB SSD
- Ubuntu 24.04 LTS
- Docker Engine 24+ and Docker Compose v2
- Public IPv4 address
- Domain name with DNS control
Recommended (single VPS, small production):
- 8 vCPU, 16 GB RAM, 320 GB SSD
- Same software stack as above
- Automated backup destination (S3-compatible, for example Cloudflare R2)
Scaled (multi-VPS production): see section 36.2.
External dependencies for the legacy stack: Supabase Cloud project, LLM provider account, Telegram bot token, DNS provider, S3-compatible backup storage, SMTP provider (optional).
31.3 Repository access¶
Clone the Thinklio repository:
See section 1 for the repository layout.
32. Initial deployment¶
32.1 Convex-era deployment¶
- Provision external services.
  - Create the Convex production deployment in the Convex dashboard.
  - Create the Clerk production instance and configure DNS per Part B section 7.2.
  - Create the Cloudflare R2 bucket and access keys.
  - Obtain OpenRouter, Voyage AI, and Postmark credentials.
- Configure Convex.
  - Set environment variables on the Convex production deployment (section 4 and Part B): `CLERK_JWT_ISSUER_DOMAIN`, `CLERK_WEBHOOK_SECRET`, `OPENROUTER_API_KEY`, `VOYAGE_API_KEY`, `TAVILY_API_KEY`, `POSTMARK_SERVER_TOKEN`, R2 credentials, Telegram bot token.
  - Deploy Convex functions: `CONVEX_DEPLOYMENT=prod:<slug> npx convex deploy`.
- Configure Clerk.
  - Create the `convex` JWT template with extended claims (Part B section 7.3).
  - Create the Clerk webhook targeting the Convex site URL (Part B section 7.4).
  - Copy the webhook signing secret into Convex env as `CLERK_WEBHOOK_SECRET`.
- Deploy the web app and admin dashboard.
  - Build and deploy to Coolify (or equivalent). Required build args: `NEXT_PUBLIC_CONVEX_URL`, `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY`. Required runtime env: `CLERK_SECRET_KEY`.
  - Point DNS at the deployment target.
- Seed platform data.
export CONVEX_DEPLOYMENT=prod:<deployment-slug>
npx convex run seed:syncAgentCatalog
npx convex run seed:seedStorageBuckets
- Verify. Sign up, confirm a user_profile row appears, create an account, create an agent, connect a channel, and send a test message.
No VPS provisioning, no Docker Compose, no reverse proxy configuration. If the web app host does not terminate SSL on its own, add a thin reverse proxy (Caddy, Cloudflare) in front; otherwise the hosting platform handles it.
32.2 Setting up the admin dashboard¶
The admin dashboard is a separate Next.js application (or a protected surface within the main web app) that talks to the same Convex deployment. It uses Clerk's role-gating to limit access to platform or account admins. Deployment is the same as the main web app: set the build args, deploy, and point DNS.
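The role gate behind the admin dashboard reduces to a check on the Clerk session claims. A minimal sketch, assuming the role is carried in a `metadata.role` claim with `platform_admin` / `account_admin` values — the claim name and role strings are assumptions here, not the canonical Thinklio claim shape, which comes from the Clerk JWT template in Part B:

```typescript
// Hypothetical session-claim shape; the real claim names are defined by
// the Clerk JWT template configured in Part B.
type SessionClaims = { metadata?: { role?: string } };

// Roles allowed into the admin dashboard (assumed values).
const ADMIN_ROLES = new Set(["platform_admin", "account_admin"]);

// True when the signed-in user may open the admin dashboard.
function canAccessAdmin(claims: SessionClaims | null): boolean {
  const role = claims?.metadata?.role;
  return role !== undefined && ADMIN_ROLES.has(role);
}
```

In practice this check would run inside the Next.js middleware (or a layout-level guard) before rendering any admin route.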
32.3 Legacy Go deployment (archival)¶
For anyone deploying or maintaining the retiring Go monolith, the original single-VPS deployment path is preserved here.
Step 1: provision and harden the VPS. Ubuntu 24.04. Non-root user with sudo. Root SSH disabled. SSH on a non-standard port. UFW allowing SSH, 80, 443. fail2ban. Automatic security updates.
Step 2: configure DNS.
thinklio.yourdomain.com A -> VPS IP
api.thinklio.yourdomain.com A -> VPS IP
admin.thinklio.yourdomain.com A -> VPS IP
Step 3: configure environment. Copy deploy/compose/.env.example to deploy/compose/.env and fill in:
# Supabase
SUPABASE_DB_URL= # Supavisor pooler connection string
SUPABASE_JWT_SECRET= # JWT validation secret
SUPABASE_SERVICE_ROLE_KEY= # Service role key for admin operations
SUPABASE_URL= # Supabase project URL
# Redis
REDIS_URL= # Redis connection string
REDIS_PASSWORD= # Redis password (optional for dev)
# LLM
LLM_PROVIDER=openrouter
LLM_API_KEY= # Your LLM provider API key
LLM_DEFAULT_MODEL=anthropic/claude-sonnet-4-20250514
TELEGRAM_BOT_TOKEN= # From BotFather
# Job system
JOB_TIMEOUT_CHECK_INTERVAL=60
DEFAULT_JOB_TIMEOUT=1800
MAX_JOB_TIMEOUT=86400
DEFAULT_MAX_DELEGATION_DEPTH=3
# Admin dashboard
NEXT_PUBLIC_API_URL=https://api.thinklio.yourdomain.com
ADMIN_INITIAL_EMAIL=
ADMIN_INITIAL_PASSWORD=
# Backups
BACKUP_S3_ENDPOINT=
BACKUP_S3_BUCKET=
BACKUP_S3_ACCESS_KEY=
BACKUP_S3_SECRET_KEY=
BACKUP_S3_REGION=
Step 4: deploy. From deploy/compose, run docker compose up -d.
The first startup pulls or builds container images, starts Redis locally, starts the Thinklio server (which connects to Supabase Cloud), builds and starts the admin dashboard, configures the reverse proxy with automatic SSL via Let's Encrypt, and seeds the initial admin user.
Step 5: verify.
docker compose ps # All services Up (healthy)
curl https://api.thinklio.yourdomain.com/health # 200
curl https://admin.thinklio.yourdomain.com # Loads
docker compose logs --tail 50 thinklio
docker compose logs --tail 50 admin
docker compose logs --tail 50 redis
Step 6: initial configuration. Log into the admin dashboard. Change the initial password. Create the first account. Create the first agent (or use a template). Connect the Telegram bot. Send a test message.
Docker Compose structure (legacy):
services:
redis: # Port 6379, internal only
thinklio: # Single Go binary, Port 8001, exposed via reverse proxy
admin: # Port 3000, exposed via reverse proxy
nginx: # Ports 80, 443, public-facing
Networks: frontend (nginx, thinklio, admin) and backend (thinklio, redis). Redis is never exposed to the public network. PostgreSQL is managed by Supabase Cloud and is not part of the local deployment.
33. Administration¶
The admin dashboard provides a web interface for managing the platform. It communicates with the same backend as external integrations, authenticated with admin-level credentials (Clerk role + platform admin flag in the Convex era; admin JWT tokens in the Go era).
33.1 Functional areas¶
Account management: create account (name, slug, plan, initial budget), edit settings and policies (including delegation limits) and budgets, suspend account (blocks all agent interactions), view account (members, teams, agents, usage summary).
Team management: create team within an account, manage members (invite, set roles: admin, member, readonly, remove), archive team (deactivates agents, preserves knowledge and audit trail).
User management: view users across the platform with account and team memberships, edit user (roles, reset password, manage channel connections), suspend user (blocks all interactions), delete user (GDPR-compliant deletion: user knowledge hard-deleted, contributions anonymised, job observer registrations removed, deletion logged).
Agent management: create agent from scratch or from template (configure name, system prompt, capability level, LLM model), edit agent, manage tools (assign or remove, set permission levels: read, readwrite), manage delegations (add or remove agent-as-tool delegations, configure invocation contracts and per-delegation restrictions), manage assignments (assign to users, teams, or accounts with scope, budget, and per-assignment tool restrictions), pause agent (kill switch: stops all new interactions, cancels pending jobs), resume agent, archive agent (deactivates permanently, preserves audit trail), view agent (knowledge facts, recent interactions, active jobs, usage, tool execution history, delegation graph).
Agent Studio: compose agents (visual interface for building agents that delegate to other specialists), delegation graph (view delegation structure, detect cycles, understand depth), per-delegation restrictions (configure what actions each delegate can perform in this composition), templates (create and deploy composed agent templates, for example "PA with Scheduler and Research"), cycle detection (prevents circular delegation configurations).
Template management: create template (reusable agent configuration: system prompt, tools, capability level, seed knowledge, delegation relationships), edit template, scope templates (platform-wide or account-specific), deploy from template (create a new agent pre-configured from a template).
API key management: create API key (scoped to an account, specific agent, or API surface), set permissions (read-only, read-write, admin), set rate limits, revoke API key (immediate), view usage (API call history per key).
Job monitoring: active jobs (list non-terminal jobs across the platform), job detail (subjob progress, observer registrations, context bundle, dispatch target), cancel job (manual cancel of stuck or unwanted jobs), job history (search terminal jobs by agent, type, state, date range), timeout configuration (view and adjust default and per-tool timeouts).
Monitoring and reporting: system health (service status, database connections, Redis status in the legacy stack including job store memory, queue depth), usage reports (cost by account, team, user, agent over time periods, including delegation cost breakdown), audit log viewer (searchable security and operational events), budget status (current spend vs budget for all accounts and teams).
Platform controls: platform kill switch (immediately halts all agent interactions, in-flight interactions complete their current step then stop, pending jobs cancelled, no new interactions accepted; emergency stop for the entire system), resume platform, maintenance mode (displays a maintenance message to users, queues incoming messages for processing when maintenance ends), feature flags (enable / disable platform features without deployment).
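The admission behaviour described above — kill switch rejects new interactions outright, maintenance mode queues them for later — can be sketched as a single decision function. State and result names are illustrative, not the actual Thinklio implementation:

```typescript
// Assumed platform control flags; names are illustrative.
type PlatformState = { killSwitch: boolean; maintenance: boolean };

type Admission = "accept" | "queue" | "reject";

// The kill switch wins over maintenance mode: when it is engaged, no new
// interactions are accepted at all. Maintenance mode queues incoming
// messages for processing once maintenance ends.
function admitInteraction(state: PlatformState): Admission {
  if (state.killSwitch) return "reject";
  if (state.maintenance) return "queue";
  return "accept";
}
```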
33.2 Command-line administration (legacy Go era)¶
# Redis operations
deploy/scripts/redis-backup.sh # Manual backup
deploy/scripts/redis-restore.sh <file> # Restore from backup
# Service operations
deploy/scripts/service-health.sh # Check all service health
deploy/scripts/service-restart.sh <svc> # Restart a specific service
deploy/scripts/service-logs.sh <svc> # Tail logs for a service
# Platform operations
deploy/scripts/platform-pause.sh # Kill switch via CLI
deploy/scripts/platform-resume.sh # Resume via CLI
# Job operations
deploy/scripts/jobs-active.sh # List active jobs in Redis
deploy/scripts/jobs-cancel.sh <id> # Cancel a specific job
# Maintenance
deploy/scripts/cache-clear.sh # Clear Redis cache (preserves job store)
deploy/scripts/archive-events.sh # Run event archival manually
33.3 Command-line administration (Convex era)¶
# Inspect data
npx convex run admin:listAccounts
npx convex run admin:listActiveJobs
# Platform controls
npx convex run admin:pausePlatform
npx convex run admin:resumePlatform
# Kill switches
npx convex run admin:pauseAgent --agentId <id>
npx convex run admin:resumeAgent --agentId <id>
# Job operations
npx convex run admin:cancelJob --jobId <id>
# Seed / upsert catalogue
npx convex run seed:syncAgentCatalog
npx convex run seed:seedStorageBuckets
All admin mutations and actions are defined in convex/admin.ts and are gated on the platform admin flag on the caller's user profile.
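The gate in convex/admin.ts boils down to a guard on the caller's user profile before any work is done. A sketch of that pattern, with an assumed `isPlatformAdmin` field name (the actual field lives in the Convex schema and may be named differently):

```typescript
// Assumed profile shape; the real user_profile schema is defined in
// convex/schema.ts.
type UserProfile = { isPlatformAdmin?: boolean };

// Throws unless the caller is a platform admin. Admin mutations and
// actions call this first, so unauthorised calls fail before any write.
function requirePlatformAdmin(profile: UserProfile | null): void {
  if (!profile?.isPlatformAdmin) {
    throw new Error("FORBIDDEN: platform admin required");
  }
}
```

Centralising the check in one helper keeps every admin entry point consistent and makes the gate easy to audit.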
34. Backup and recovery¶
34.1 Convex-era backup and recovery¶
Convex handles durable storage and backup automatically. The Convex team maintains daily snapshots and point-in-time recovery for production deployments. Use npx convex export to take a full snapshot on demand (for example before a risky migration).
R2 objects are backed up per the R2 bucket's configuration (versioning + lifecycle rules recommended). For the Thinklio documents bucket, enable object versioning with a 30-day retention for previous versions.
There is no Redis to back up and no Postgres to back up in the Convex era. Active durable workflows and scheduled jobs live inside Convex itself and are covered by the Convex snapshot.
Recovery time objective (RTO): minutes, via Convex dashboard rollback or reimport from convex export snapshot.
Recovery point objective (RPO): minutes (continuous), via Convex point-in-time recovery on production deployments.
34.2 Legacy Go backup and recovery (archival)¶
Database (Supabase Cloud): automatic daily backups at 02:00 UTC. Pro plan includes point-in-time recovery (PITR) with 7-day retention. Full backup retention: 30 days.
Redis (local): RDB snapshots stored locally and uploaded to S3-compatible storage. Automated daily snapshot at 03:00 UTC. Retention: 30 daily snapshots, 12 weekly snapshots, 6 monthly snapshots.
Active jobs in Redis are not included in Supabase backups. If Redis data is lost, active jobs can be recovered from the event log in Supabase, though in-progress work may need to be re-dispatched.
Manual backup: deploy/scripts/redis-backup.sh.
Recovery procedure:
- Stop all services except Redis: `docker compose stop thinklio admin`.
- For Redis recovery, restore the snapshot: `deploy/scripts/redis-restore.sh backups/thinklio-redis-20260321-030000.rdb`.
- For database recovery via Supabase, use the dashboard to restore from PITR or contact Supabase support for older backups.
- Clear the Redis cache after a database restore (it is stale): `deploy/scripts/cache-clear.sh`.
- Restart: `docker compose up -d`.
- Verify: `deploy/scripts/service-health.sh`.
Recovery time objective (RTO): < 15 minutes for full restore from backup.
Recovery point objective (RPO): < 24 hours (daily backup frequency for both systems).
Disaster recovery (VPS loss):
- Provision a new VPS with the same specifications.
- Follow the initial deployment steps (Step 1 to Step 4).
- Instead of Step 5 (verify), restore from backups: the Supabase Cloud database is already restored via the Supabase dashboard; restore Redis from the latest S3 backup using `deploy/scripts/redis-restore.sh`.
- Update DNS records to point to the new VPS.
- Verify all services and functionality.
- Active jobs at the time of failure will need to be manually reviewed and potentially re-dispatched.
35. Upgrades¶
35.1 Convex-era upgrades¶
- Convex functions: `npx convex deploy` pushes new function code. Convex performs atomic swaps; no downtime.
- Schema migrations: `npx convex deploy` validates the schema. Breaking schema changes should be staged (add new fields as optional, backfill via a one-shot action, then flip to required in a subsequent deploy).
- Web app and admin dashboard: rebuild and redeploy through the hosting platform's standard flow. Near-zero downtime for rolling updates.
- Clerk: managed; Clerk handles its own upgrades transparently.
- External services (OpenRouter, Voyage, R2, Postmark): no operator action needed.
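The staged-migration advice above (optional field, one-shot backfill, flip to required) hinges on the backfill being idempotent, so the one-shot action can be re-run safely. A sketch of the backfill step with a hypothetical `slug` field derived from `name` — both field names are illustrative, not the actual Thinklio schema:

```typescript
// Stage 2 of a staged schema migration: fill in a new optional field for
// documents that do not have it yet. Documents already migrated are
// returned unchanged, so re-running the backfill is safe.
type AccountDoc = { name: string; slug?: string };

function backfillSlug(doc: AccountDoc): AccountDoc {
  if (doc.slug !== undefined) return doc; // idempotent: already migrated
  return { ...doc, slug: doc.name.trim().toLowerCase().replace(/\s+/g, "-") };
}
```

In a real migration this function would run inside a one-shot Convex action that pages through the table and patches each document; once every document has the field, the schema can mark it required.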
35.2 Legacy Go upgrades (archival)¶
Service upgrades via Coolify: Coolify detects repository changes and deploys automatically (if auto-deploy is configured) or via manual trigger in the Coolify UI.
Service upgrades via Docker Compose:
Docker Compose performs rolling restarts. Services are stateless, so no data is lost during restart. Active jobs in Redis persist across service restarts.
Database migrations: managed by Supabase. When deploying a new version of Thinklio that includes schema changes, Supabase automatically runs pending migrations during the deployment process. The Thinklio server verifies that all expected migrations are present before starting. If a migration fails, the server does not start; fix the migration via Supabase and redeploy. Breaking migrations are documented in the changelog and may require a maintenance window coordinated with Supabase support.
Admin dashboard upgrades: rebuilt and redeployed as part of the standard deployment process. Static assets are served by the reverse proxy.
Redis upgrades: require a brief maintenance window.
- Enable maintenance mode via the admin dashboard.
- Back up: `deploy/scripts/redis-backup.sh`.
- Stop: `docker compose stop`.
- Update the Redis image version in `docker-compose.yml`.
- Restart: `docker compose up -d`.
- Verify Redis connectivity and active job count.
- Disable maintenance mode.
Redis data is ephemeral for caching purposes (cache misses are self-healing) and operationally persistent for active jobs (which are flushed to Supabase on terminal state). Unprocessed stream messages are redelivered via consumer groups.
Supabase project upgrades: Supabase handles PostgreSQL version upgrades transparently. For major version upgrades, monitor the Supabase status dashboard; the Thinklio server automatically retries during the short upgrade window.
36. Scaling¶
36.1 Convex-era scaling¶
Convex scales transparently. No operator action is required to handle higher query volumes, more concurrent users, or more background workflows; Convex allocates capacity automatically.
Capacity planning in this model focuses on:
- LLM provider rate limits. OpenRouter / Anthropic have per-org rate limits; request an increase as load grows.
- Clerk MAU. Clerk pricing is per monthly active user; check the plan tier against projected usage.
- Postmark volume. Outbound and inbound mail are metered per message.
- R2 storage and egress. R2 is cheap, but egress to non-Cloudflare destinations bills; keep clients hitting R2 directly via signed URLs rather than proxying through the web app.
The Next.js web app is horizontally scalable by the hosting platform. On Coolify, set the number of replicas; on Vercel, autoscaling is automatic.
36.2 Legacy Go scaling (archival)¶
Vertical scaling indicators:
- CPU consistently > 70 percent during normal operation
- Memory consistently > 80 percent utilised
- Response latency increasing under normal load
- Queue depth growing faster than it drains
- Redis memory usage approaching maxmemory (check active job store size)
- Supabase database connections frequently at pool limit
Vertical scaling steps:
- Resize the VPS via the provider's control panel.
- Increase Thinklio server resources: GOMAXPROCS via container resource limits, worker pools for CPU-bound operations.
- Increase Redis resources: `maxmemory` allocation; adjust the eviction policy if needed.
- Increase connection pools: update `SUPABASE_POOL_SIZE` in the environment.
- Upgrade the Supabase plan to increase connection limits.
- Restart services.
Practical limits of a single VPS. A well-configured single VPS (16 vCPU, 32 GB RAM) with Supabase Pro plan can typically handle hundreds of concurrent users, thousands of daily interactions, tens of thousands of knowledge facts, hundreds of active jobs, and multiple accounts and teams. Beyond this, horizontal scaling is needed.
Horizontal scaling architecture:
VPS 1 (edge)
├── Nginx (reverse proxy, SSL, load balancing)
├── Thinklio Server (2+ instances)
└── Admin Dashboard
VPS 2+ (additional Thinklio instances)
├── Thinklio Server (2+ instances)
└── Redis (shared, or local with clustering)
Supabase Cloud (managed)
├── PostgreSQL Primary
└── (optional) Read replicas for scaling
External
└── S3-compatible backup storage
Scaling strategy. The single Thinklio binary can be horizontally scaled because all services within it are stateless: multiple instances run behind a load balancer, each instance connects to the same Supabase Cloud database, each instance connects to the same Redis (or Redis Cluster for larger scales), no inter-instance communication or coordination is needed, scaling is linear.
Steps to distribute services:
- Provision additional VPS instances with Docker installed.
- Set up private networking between VPS instances.
- Deploy the Thinklio server to all VPS instances.
- Configure the reverse proxy (on VPS 1) to load-balance across Thinklio instances using health checks.
- Configure Redis: either shared instance on VPS 2 or Redis Cluster.
- Update Supabase connection strings for all instances to use the same Supabase Cloud project.
- Verify inter-instance communication across VPS boundaries.
- Update backup procedures to exclude instance-local caches (back up Redis and Supabase only).
37. Monitoring and alerting¶
37.1 Convex-era monitoring¶
- Convex dashboard (Logs and Functions tabs): per-function timing, error rates, recent invocations. This is the primary investigation surface.
- Clerk dashboard: sign-in volume, error rates, webhook delivery status.
- Hosting platform metrics (Coolify, Vercel): CPU, memory, request rates, error rates for the web app and admin dashboard.
- External metrics exported to Prometheus (Part E section 27): platform-level metrics for interactions, LLM costs, and tool executions.
- Grafana dashboards backed by Prometheus for platform-wide views, and Convex logs for function-level investigation.
37.2 Health checks¶
Every service (legacy Go and the Next.js apps) exposes a /health endpoint. It reports service status, dependency status, uptime, and version. In the Convex era, a lightweight /health on the web app confirms the Convex client can connect and the Clerk key is valid.
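A health payload of the kind described above can be assembled from dependency probes; a minimal sketch, where the field names (`status`, `deps`, `version`, `uptimeSeconds`) are illustrative rather than the canonical Thinklio response shape:

```typescript
// Illustrative /health response builder. In the Convex era the probes
// would confirm the Convex client can connect and the Clerk key is valid.
type HealthReport = {
  status: "ok" | "degraded";
  deps: Record<string, boolean>;
  version: string;
  uptimeSeconds: number;
};

function healthReport(
  deps: Record<string, boolean>, // e.g. { convex: true, clerk: true }
  version: string,
  startedAtMs: number,
  nowMs: number
): HealthReport {
  const ok = Object.values(deps).every(Boolean);
  return {
    status: ok ? "ok" : "degraded",
    deps,
    version,
    uptimeSeconds: Math.floor((nowMs - startedAtMs) / 1000),
  };
}
```

Reporting `degraded` rather than failing outright lets the Prometheus health-check alert distinguish a down service from one with a sick dependency.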
37.3 Legacy Go metrics¶
All services in the legacy stack expose Prometheus-compatible metrics at /metrics:
- Request rates, latencies, error rates per endpoint
- Queue depths and processing rates
- Cache hit rates
- Supabase connection pool utilisation
- Active interactions and step execution times
- Budget utilisation per account
- Active job count, job dispatch rate, job completion rate
- Job timeout rate
- Delegation depth distribution
- Redis memory usage by key pattern (cache vs jobs vs sessions)
37.4 Dashboards¶
Grafana dashboards deployed as part of the monitoring stack provide:
- Platform Overview: all services, request rates, error rates, latency.
- Agent Performance: per-agent interaction counts, costs, response times, delegation activity.
- Data Layer: Convex function timing (Convex era); Supabase query performance, Redis cache effectiveness, Redis job store utilisation (legacy Go era).
- Queue Health: task throughput, retry rates, dead letter queue size (legacy Go era); Convex scheduler backlog (Convex era).
- Job System: active jobs, subjob completion rates, timeout rates, observer notification delivery.
- Budget and Usage: spend tracking, budget utilisation, cost trends (including delegation cost breakdown).
37.5 Alerting¶
Alerts are configured in Prometheus with notification via email or Telegram:
| Alert | Condition | Severity |
|---|---|---|
| Service down | Health check fails for > 1 minute | Critical |
| High error rate | 5xx rate > 5 percent for > 2 minutes | Warning |
| Queue backlog | Queue depth > 1000 for > 5 minutes | Warning |
| Cache hit rate low | < 70 percent for > 10 minutes | Warning |
| Budget exceeded | Any account / team over budget | Info |
| Supabase connection errors | Connection failures for > 1 minute (legacy Go) | Critical |
| Convex auth failures | Auth error rate > 5 percent for > 2 minutes | Critical |
| Disk space | < 20 percent free on any volume | Warning |
| Certificate expiry | SSL cert expires in < 14 days | Warning |
| Job timeout rate high | > 10 percent of jobs timing out over 1 hour | Warning |
| Redis memory high | Job store memory > 80 percent of maxmemory (legacy Go) | Warning |
| Webhook delivery failures | > 5 percent failure rate over 10 minutes | Warning |
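As a sketch, the first two rows of the table could be expressed as Prometheus alerting rules along the following lines. The `up` metric is standard for scrape targets; the `http_requests_total` metric name and its `job` / `code` labels are assumptions about the Thinklio exporter, not confirmed names:

```yaml
groups:
  - name: thinklio
    rules:
      - alert: ServiceDown
        expr: up{job="thinklio"} == 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Health check failing for {{ $labels.instance }}"
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="thinklio", code=~"5.."}[2m]))
            / sum(rate(http_requests_total{job="thinklio"}[2m])) > 0.05
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "5xx rate above 5% for more than 2 minutes"
```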
38. Routine maintenance¶
38.1 Convex-era routine maintenance¶
Weekly. Review the Convex dashboard for slow functions. Review the Clerk dashboard for auth anomalies. Check the web app hosting platform for deployment health. Review the webhook delivery log (Clerk, Postmark, any registered external webhooks).
Monthly. Review and rotate API keys if needed. Review usage reports for budget planning (LLM spend trending, subscription tiering). Check for npm dependency security updates (npm audit). Review and clear archived data per retention policy.
Quarterly. Full disaster recovery drill against a scratch Convex deployment (export production snapshot, import into the drill deployment, verify functionality). Security audit (access logs, permission grants, API key usage, external tool registrations, MCP installations). Capacity planning review (growth trends, scaling needs, job volume trends).
38.2 Legacy Go routine maintenance (archival)¶
Daily (automated). Redis backup to S3 (RDB snapshot). Event archival via Supabase (events older than retention period moved to archive). Terminal job archival via Supabase. Redis memory check (verify under maxmemory limit, check job store size). Health check verification.
Weekly. Review monitoring dashboards for trends. Check Supabase slow query log for optimisation opportunities. Review queue dead letter queue for recurring failures. Review job timeout patterns (may indicate misconfigured timeouts). Verify backup integrity (spot-check restore of Redis snapshot).
Monthly. Review and rotate API keys if needed. Review usage reports for budget planning. Check for dependency security updates. Review and clear archived data per retention policy. Check Supabase plan usage (connections, storage, etc).
Quarterly. Full disaster recovery test (restore from backup to a test instance). Security audit (review access logs, permission grants, API key usage, external tool registrations). Capacity planning review. Supabase statistics review (index usage, query performance, storage trends).
39. Troubleshooting production¶
39.1 Convex-era troubleshooting¶
Web app fails to build. Check that NEXT_PUBLIC_CONVEX_URL and NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY are available as build args. Check that the Dockerfile declares them as ARG. Check the Convex deployment slug matches the target environment.
Clerk token requests return 404. Missing middleware.ts in the Next.js app. Part B section 13.
Convex functions return unauthenticated unexpectedly. The convex JWT template may be missing or misnamed. Verify applicationID: "convex" in convex/auth.config.ts matches the Clerk JWT template name.
Webhook not firing. Part B section 13: check endpoint URL is the Convex site URL (not cloud URL), CLERK_WEBHOOK_SECRET is set, Clerk dashboard delivery log.
Interaction is slow. Check the Convex dashboard for slow queries or mutations. Check LLM provider latency (OpenRouter / Anthropic have status pages). Check knowledge retrieval: if the vector index is not being used, rewrite the query to use .vectorSearch(...).
Agent not responding. Check the agent is not paused (admin dashboard). Check budget enforcement (admin dashboard). Check the LLM API key is valid. Check the channel adapter: Telegram bot token, Postmark domain signing, etc. Check the Convex logs for errors from the relevant action.
39.2 Legacy Go troubleshooting (archival)¶
Service won't start.
- Check logs: `docker compose logs <service>`.
- Common causes: Supabase connection failed (check the connection string, verify the Supabase project is active), missing environment variable, port conflict, Redis unreachable.
Agent not responding.
- Check gateway health: `curl https://api.thinklio.yourdomain.com/health`.
- Check Thinklio server logs: `docker compose logs thinklio`.
- Check the event bus: verify messages are being published and consumed from Redis.
- Check Supabase connectivity.
- Check LLM provider.
- Check budget: verify the user, team, or account has not exceeded their budget.
- Check agent status: verify the agent is not paused.
Slow responses.
- Check which step is slow (query the steps table in Supabase for the interaction).
- Common bottlenecks: context assembly (indexing, cold cache), LLM call (provider latency, model choice), tool execution (external API latency, timeout config), delegation (delegate agent performance).
- Supabase dashboard SQL editor for slow queries.
- Redis cache hit rate via metrics dashboard.
Job system issues.
- Job stuck in a non-terminal state: check the execution engine (n8n, external service), the dispatch target (webhook URL), and the timeout monitor, or cancel manually via `deploy/scripts/jobs-cancel.sh <job_id>`.
- Follow-up interactions not triggering: verify the Thinklio server is subscribed to `job.state_changed` events via Redis Streams, check the observer registration for the job in Supabase, and check server logs for notification handling errors.
- Job timeout monitor not running: check server logs and verify `JOB_TIMEOUT_CHECK_INTERVAL`; the monitor runs as a cron process within the Thinklio server.
Database issues.
- Connection refused: Supabase project status, connection string, Supabase status dashboard.
- Too many connections: upgrade Supabase plan, reduce pool size.
- Slow queries: Supabase dashboard query analysis, add indexes, review query plans.
- Disk full (Supabase): archive old events and terminal jobs, increase storage via plan upgrade, review retention policies.
Redis issues.
- Memory full: check the `maxmemory` setting, eviction policy, job store size, and cache-pattern memory leaks.
- Stream lag: consumer groups falling behind; scale Thinklio instances or Redis.
- Connection refused: Redis status, password config, network connectivity.
40. Security operations¶
40.1 Responding to a security incident¶
- Assess. Determine scope and severity from audit logs (admin dashboard; also Convex logs for function-level detail; also Supabase audit for legacy).
- Contain. Use the admin dashboard kill switch if needed (agent-level or platform-level). The platform kill switch cancels pending jobs.
- Investigate. Review audit logs, access logs, event history, delegation chains.
- Remediate. Block compromised accounts, revoke API keys, deregister suspicious external tools, patch vulnerability.
- Recover. Resume services, verify integrity.
- Document. Record the incident, response, and preventive measures in the decision log.
40.2 Revoking access¶
- User: suspend via admin dashboard; immediately blocks all interactions.
- API key: revoke via admin dashboard; immediately invalidates the key.
- Agent: pause via admin dashboard; immediately stops all agent interactions and cancels pending jobs.
- External tool: disable via admin dashboard; immediately prevents agents from invoking the tool.
- Account: suspend via admin dashboard; blocks all account activity.
- Platform: kill switch via admin dashboard; halts everything, cancels all pending jobs.
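The revocation actions above share one pattern: block the entity immediately, and for agent pause and the platform kill switch, also cancel pending jobs in scope. A sketch of that job-cancellation scoping in plain TypeScript; the `Scope` names and `PendingJob` fields are assumptions for illustration, not the admin dashboard's actual API:

```typescript
type Scope = "user" | "apiKey" | "agent" | "tool" | "account" | "platform";

// Hypothetical pending-job record.
interface PendingJob {
  id: string;
  agentId: string;
  accountId: string;
}

// Returns the ids of pending jobs to cancel for a given revocation scope.
// Per the list above, only agent pause and the platform kill switch cancel
// jobs; the other scopes block new activity but leave in-flight jobs alone.
function jobsToCancel(scope: Scope, targetId: string, pending: PendingJob[]): string[] {
  switch (scope) {
    case "agent":
      return pending.filter((j) => j.agentId === targetId).map((j) => j.id);
    case "platform":
      return pending.map((j) => j.id); // kill switch: cancel everything
    default:
      return []; // user / apiKey / tool / account: block access only
  }
}

// Example: two pending jobs on different agents.
const pending: PendingJob[] = [
  { id: "j1", agentId: "agent-1", accountId: "acct-1" },
  { id: "j2", agentId: "agent-2", accountId: "acct-1" },
];
```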
40.3 Audit log access¶
The admin dashboard provides a searchable audit log viewer. For bulk analysis:
- Convex era: query `convex/audit.ts:listEvents` directly, or export the `audit_event` table via `npx convex export --table audit_event`.
- Legacy Go era: query the events table via Supabase:

```sql
SELECT type, user_id, agent_id, payload, created_at
FROM events
WHERE type LIKE 'security.%'
  AND created_at > NOW() - INTERVAL '7 days'
ORDER BY created_at DESC;
```
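For Convex-era bulk analysis, the equivalent filter can be applied to the exported `audit_event` rows. A minimal sketch in plain TypeScript; the `AuditEvent` field names are assumptions about the export shape, not the actual schema:

```typescript
// Hypothetical shape of an exported audit_event row (field names are assumptions).
interface AuditEvent {
  type: string;
  userId?: string;
  agentId?: string;
  createdAt: number; // epoch millis
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Mirror of the legacy SQL: security.* events from the last seven days, newest first.
function recentSecurityEvents(events: AuditEvent[], now: number): AuditEvent[] {
  return events
    .filter((e) => e.type.startsWith("security.") && e.createdAt > now - SEVEN_DAYS_MS)
    .sort((a, b) => b.createdAt - a.createdAt);
}

// Example over a small exported sample.
const now = Date.UTC(2026, 3, 16);
const recent = recentSecurityEvents(
  [
    { type: "security.key_revoked", createdAt: now - 1000 },
    { type: "security.login_blocked", createdAt: now - 8 * 24 * 60 * 60 * 1000 }, // too old
    { type: "job.state_changed", createdAt: now - 1000 }, // not a security event
  ],
  now,
);
```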
References¶
- Convex and Clerk setup. `11-convex-reference.md` for Convex component and schema conventions. Part B of this document is the canonical setup reference.
- Architecture. `02-system-architecture.md` for service topology, execution tiers, and governance-as-middleware. `03-agent-architecture.md` for the harness, context assembly, delegation, and extensibility model.
- Data model. `04-data-model.md` for schema and indexes. `05-persistence-storage-and-ingestion.md` for storage, caching, and ingestion pipelines.
- Events and messaging. `06-events-channels-and-messaging.md` for the event bus, channel adapters, and Postmark inbound email channel.
- Security and governance. `07-security-governance.md` for the security model, governance policy framework, credential management, and MCP permission model. Referenced by sections on auth, policy denial codes, and audit events.
- External APIs. `09-external-api-tool-integration.md` for the Channel, Platform, and Integration API contracts, the MCP server reference, the tool integration developer guide, and event webhook delivery.
- Client applications. `10-client-applications.md` for the app UI specification (Part B), the docs and developer portal UI (Part C), and the internationalisation architecture (Part D).
- Product and strategy. `01-product-and-strategy.md` for pricing, plans, and enforcement that the admin dashboard surfaces.
- Implementation plan. `13-implementation-plan-and-status.md` for phasing, milestones, and current build status.
Revision history¶
| Version | Date | Change |
|---|---|---|
| 1.0.0 | 2026-04-16 | Initial consolidated release. Supersedes pre-consolidation docs 14 (Deployment & Administration v03), 35 (Programming Guide v01), 50 (Convex + Clerk Setup Guide v02), 55 (Testing & Observability v01). Programming Guide editorially translated from the Go + Supabase + Redis + Postgres stack to the Convex + Clerk stack: query / mutation / action patterns replacing the Go request lifecycle, ConvexError replacing the HTTP { data, meta } envelope, console.log with structured context replacing slog.JSONHandler, index-first data access replacing the SQL query pattern catalogue. The original Go conventions are preserved as a compact archival subsection (section 20). The Convex + Clerk Setup Guide is absorbed into Part B in full. Testing & Observability is updated to reflect the convex-test + vitest pattern, with the Go test matrix preserved as an archival subsection. Deployment & Administration is updated with a Convex-era deployment path (Part F section 32.1), with the original single-VPS Docker Compose deployment preserved as archival (section 32.3). Table names are singular per the project convention (user_profile, account, agent, agent_tool, agent_assignment, audit_event, knowledge_fact). Cross-references retargeted to the 14-document canonical set. No content loss beyond renumbering, sentence-case normalisation, and the stack-migration edits called out above. |