Data Agent¶
Thinklio Built-in Agent Specification Version 0.1 | March 2026
1. Purpose and Problem Statement¶
The Data Agent is a general-purpose data manipulation agent. Its primary role in the standard pipeline is to work with structured outputs from other agents — most commonly source lists from the Research Agent — but its scope is broader than any single pipeline.
It handles three categories of work:
- Transformation — converting agent output to other formats (e.g. a source list to CSV, a draft to plain text)
- Combination — merging, deduplicating, or intersecting outputs from multiple agent runs
- Filtering and enrichment — applying criteria to reduce, sort, or annotate a dataset
The Data Agent is deliberately general. It is not the right tool for writing, checking, or rendering — those belong to the other pipeline agents. But anything that looks like "take this structured data and reshape it" is in scope.
2. Position in the Pipeline¶
The Data Agent can appear at multiple points:
Research Agent ×N → [Data Agent: merge] → Writer Agent
Research Agent → Writer Agent → [Data Agent: export CSV]
It is typically used either before the Writer Agent (to prepare a combined source list) or after the full pipeline (to export results in a consumable format). It can also be used standalone, outside any pipeline context.
3. Invocation Modes¶
**Programmatic (agent-to-agent).** A coordinator passes one or more data inputs and a transformation specification. The Data Agent returns the transformed output. No UI required.

**Standalone (user-initiated).** A user selects inputs and configures a transformation in the UI. Useful for ad hoc exports, merges, or analysis outside a defined pipeline.
4. Core Capabilities¶
4.1 Source List Operations¶
These are the most common pipeline use cases.
| Operation | Description |
|---|---|
| Merge | Combine two or more source lists into one, deduplicating by DOI or URL |
| Filter | Reduce a source list by criteria: date range, relevance score threshold, source type, keyword presence |
| Sort | Reorder by relevance score, date, or publication |
| Deduplicate | Remove duplicate sources across lists, with configurable match logic (exact URL, DOI, or title similarity) |
| Top-N | Return only the top N sources by relevance score |
| Annotate | Add tags or notes to sources in bulk based on criteria |
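The Merge and Deduplicate operations above can be sketched as follows. This is a minimal illustration, not the Thinklio implementation; it assumes each source is a dict with optional `doi` and `url` fields (field names are illustrative), and it implements the `doi_or_url` match logic with first-occurrence-wins semantics.

```python
def merge_source_lists(*lists, dedup_by="doi_or_url"):
    """Combine source lists, keeping the first occurrence of each source.

    Identity is the DOI when present, falling back to the URL; sources
    with neither are kept unconditionally.
    """
    seen = set()
    merged = []
    for source_list in lists:
        for source in source_list:
            key = source.get("doi") or source.get("url")
            if key is None:
                merged.append(source)  # no usable identity; keep it
            elif key not in seen:
                seen.add(key)
                merged.append(source)
    return merged
```

Title-similarity matching (the third configurable mode) would need a fuzzier key function; see also the open question on diff identity in section 11.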
4.2 Export Operations¶
| Operation | Description |
|---|---|
| To CSV | Export a source list or any tabular agent output as CSV |
| To JSON | Export any agent output as JSON |
| To markdown table | Convert a source list to a formatted markdown reference table |
| Citation list | Produce a formatted citation list in a specified style (APA, Chicago, Vancouver, etc.) |
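As a sketch of the To CSV operation: assuming a source is a flat-ish dict (field names below are illustrative, not the Thinklio schema), export can project a chosen set of columns and drop nested structures. How nested data such as facts lists should survive export is an open question noted in section 11.

```python
import csv
import io

def source_list_to_csv(sources, fields=("title", "url", "relevance_score")):
    """Flatten a source list to CSV text, projecting only `fields`.

    Nested metadata (facts lists, etc.) is simply omitted here.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields))
    writer.writeheader()
    for source in sources:
        writer.writerow({f: source.get(f, "") for f in fields})
    return buf.getvalue()
```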
4.3 Draft Operations¶
| Operation | Description |
|---|---|
| Extract sections | Pull named sections from a Draft as separate text outputs |
| Flatten | Convert a structured Draft to plain text |
| Word count analysis | Return word counts per section vs. targets |
| Diff | Compare two versions of a Draft and return a structured diff |
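The word count analysis operation might look like the sketch below, under the assumption that a Draft can be viewed as a mapping from section name to text and that per-section targets are supplied alongside it (both assumptions are illustrative; the Draft schema is defined by the Writer Agent, not here).

```python
def word_count_report(draft_sections, targets):
    """Per-section word counts vs. targets.

    `draft_sections` maps section name -> text; `targets` maps section
    name -> target word count. `delta` is None when no target is set.
    """
    report = {}
    for name, text in draft_sections.items():
        count = len(text.split())
        target = targets.get(name)
        report[name] = {
            "words": count,
            "target": target,
            "delta": count - target if target is not None else None,
        }
    return report
```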
4.4 General Purpose Operations¶
| Operation | Description |
|---|---|
| Schema map | Map fields from one structured output to another schema |
| Aggregate | Count, sum, or group records by a field |
| Join | Join two datasets on a common field |
| Pivot | Reshape tabular data |
| Lookup | Enrich records by matching against a reference dataset (e.g. add category labels to sources) |
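The Aggregate operation's count-by-field mode is the simplest of these; a sketch, assuming records are dicts (records missing the field group under `None`):

```python
from collections import Counter

def aggregate_count(records, by):
    """Count records grouped by the value of field `by`."""
    return Counter(record.get(by) for record in records)
```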
5. Configuration¶
5.1 Admin Configuration¶
| Setting | Description |
|---|---|
| Max input records | Hard limit on records that can be processed in a single run |
| Export formats permitted | Which export formats are available in this workspace |
| External data sources | Whether the Data Agent may query external data sources (beyond agent outputs) |
5.2 Run-time Parameters¶
| Parameter | Type | Description |
|---|---|---|
| inputs | DataInput[] | One or more inputs. Each is a reference to an agent output, a saved dataset, or inline data. |
| operations | Operation[] | Ordered list of operations to apply. Operations are applied sequentially. |
| output_format | enum | json, csv, markdown, or source_list (typed Thinklio output) |
| save_to | reference | Optional record to attach the output to |
| label | string | Optional label for the output dataset |
Operations are composable. A run might specify: merge two source lists → filter by relevance > 0.7 → top 20 → export as CSV.
6. Operation Specification¶
Operations are defined as a pipeline of steps. Each step has a type and type-specific parameters:
Example pipeline spec:

```json
[
  {
    "type": "merge",
    "params": {
      "inputs": ["source_list_a", "source_list_b"],
      "dedup_by": "doi_or_url"
    }
  },
  {
    "type": "filter",
    "params": {
      "field": "relevance_score",
      "operator": "gte",
      "value": 0.7
    }
  },
  {
    "type": "top_n",
    "params": {
      "n": 20,
      "order_by": "relevance_score",
      "direction": "desc"
    }
  },
  {
    "type": "export",
    "params": {
      "format": "csv"
    }
  }
]
```
This declarative approach means coordinator agents can construct operation pipelines programmatically without custom code.
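To make the declarative approach concrete, here is a minimal interpreter for such a pipeline spec. It is a sketch, not the Data Agent's implementation: it supports only the four step types from the example, assumes records are dicts with `doi`/`url`/score fields, and stubs export to a format tag on the result.

```python
def run_pipeline(steps, datasets):
    """Apply an ordered list of operation steps to named input datasets.

    `datasets` maps input names to lists of record dicts; each step is a
    {"type": ..., "params": ...} dict as in the example spec.
    """
    data = []
    for step in steps:
        p = step["params"]
        if step["type"] == "merge":
            seen, data = set(), []
            for name in p["inputs"]:
                for rec in datasets[name]:
                    key = rec.get("doi") or rec.get("url")
                    if key not in seen:
                        seen.add(key)
                        data.append(rec)
        elif step["type"] == "filter":
            ops = {"gte": lambda a, b: a >= b, "lte": lambda a, b: a <= b}
            data = [r for r in data
                    if ops[p["operator"]](r[p["field"]], p["value"])]
        elif step["type"] == "top_n":
            data = sorted(data, key=lambda r: r[p["order_by"]],
                          reverse=(p["direction"] == "desc"))[:p["n"]]
        elif step["type"] == "export":
            return {"format": p["format"], "records": data}
    return {"format": "json", "records": data}
```

A coordinator would build the `steps` list as plain data and never touch this logic, which is the point of the declarative design.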
7. Output Structure¶
The Data Agent returns a DataOutput:
DataOutput
├── output_id UUID
├── label string
├── format enum
├── record_count integer
├── generated_at timestamp
├── operations Operation[] (log of applied operations)
└── data string | object (format-dependent)
For source_list output format, the result is a valid SourceList that can be passed directly to the Writer Agent.
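The DataOutput envelope above could be modelled as follows. Field names come from the spec; the defaults and the decision to leave the operation log as plain dicts are illustrative assumptions, not part of the specification.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class DataOutput:
    """Envelope for a Data Agent result (see section 7)."""
    format: str                     # output format enum
    record_count: int
    data: Any                       # string | object, format-dependent
    label: str = ""
    operations: list = field(default_factory=list)  # log of applied ops
    output_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```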
8. User Interface¶
8.1 Configuration Screen¶
- Input selector: add one or more inputs (agent output references, saved datasets, or file upload for CSV/JSON)
- Operation builder: drag-and-drop or ordered form for adding and sequencing operations
- Each operation type shows relevant parameter fields
- Preview of record count after each operation (where computable without full execution)
- Output format: dropdown
- Label: text field
- Save to: record picker
8.2 Progress View¶
- Record counts at each operation step
- Status messages
- Cancel button
8.3 Results View¶
- Record count summary
- Preview of output (first 20 rows for tabular formats, full view for small datasets)
- Download button (CSV, JSON, or markdown)
- Save to media system option (for outputs that warrant artefact treatment)
- Copy as markdown table button
9. Data Model Integration¶
| Data Object | Interaction |
|---|---|
| Note | Data output saved as a Note (e.g. a merged source list for reference) |
| Task | Export or merge run tracked as a Task subtask |
| Item | Structured data attached to an Item (e.g. a CSV of research for an enquiry) |
10. Use Cases¶
UC-1: Merge research streams¶
A coordinator runs three Research Agent instances (academic, news, general) for a newsletter. Before passing to the Writer Agent, it invokes the Data Agent to merge and deduplicate the three source lists, filter to relevance > 0.65, and return a single consolidated source list.
UC-2: Export source list to CSV¶
A user has a source list from a research run and wants to share it with a colleague who doesn't use Thinklio. They open the Data Agent, select the source list, add an export-to-CSV operation, and download the file.
UC-3: Citation list generation¶
A researcher wants a formatted reference list from their source list. They run the Data Agent with a citation-list operation set to Vancouver style, then copy the output into their document.
UC-4: Compare research runs¶
A scheduled Research Agent has been running for four weeks. A user wants to see what new sources have appeared since week one. They pass the week-one and week-four source lists to the Data Agent with a diff operation to identify net-new sources.
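The net-new comparison in UC-4 reduces to a set difference on source identity. A sketch, matching by DOI when present and URL otherwise (the title-similarity fallback is the open question raised in section 11 and is not attempted here; field names are illustrative):

```python
def net_new_sources(old, new):
    """Sources in `new` whose identity does not appear in `old`."""
    def key(source):
        return source.get("doi") or source.get("url")
    old_keys = {key(s) for s in old}
    return [s for s in new if key(s) not in old_keys]
```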
UC-5: Relevance filtering before writing¶
A coordinator passes a large source list (50 sources) from a broad research run to the Data Agent, which filters to the top 15 by relevance score before passing the reduced list to the Writer Agent. This keeps the writer focused and reduces token cost.
11. Open Questions¶
- Should the Data Agent be able to invoke the Research Agent's APIs directly (e.g. to enrich a source list with fresh metadata), or should it only work on existing agent outputs? Direct API access makes it more powerful but blurs the boundary with the Research Agent.
- For the diff operation on source lists, what constitutes "the same source" when DOIs are absent — title similarity threshold, URL match, or both?
- The general-purpose operations (join, pivot, lookup) introduce significant scope. Should these be phase-two capabilities, with the initial release focused on source list and export operations only?
- CSV export of a source list loses the hierarchical structure (facts lists, nested metadata). Should the export flatten this, or produce multiple related CSV files?