Data Agent¶
Thinklio Built-in Agent Specification Version 0.1 | March 2026
1. Purpose and Problem Statement¶
The Data Agent is a general-purpose data manipulation agent. Its primary role in the standard pipeline is to work with structured outputs from other agents — most commonly source lists from the Research Agent — but its scope is broader than any single pipeline.
It handles three categories of work:
- Transformation — converting agent output to other formats (e.g. a source list to CSV, a draft to plain text)
- Combination — merging, deduplicating, or intersecting outputs from multiple agent runs
- Filtering and enrichment — applying criteria to reduce, sort, or annotate a dataset
The Data Agent is deliberately general. It is not the right tool for writing, checking, or rendering — those belong to the other pipeline agents. But anything that looks like "take this structured data and reshape it" is in scope.
2. Position in the Pipeline¶
The Data Agent can appear at multiple points:
Research Agent ×N → [Data Agent: merge] → Writer Agent
Research Agent → Writer Agent → [Data Agent: export CSV]
It is typically used either before the Writer Agent (to prepare a combined source list) or after the full pipeline (to export results in a consumable format). It can also be used standalone, outside any pipeline context.
3. Invocation Modes¶
**Programmatic (agent-to-agent).** A coordinator passes one or more data inputs and a transformation specification. The Data Agent returns the transformed output. No UI required.

**Standalone (user-initiated).** A user selects inputs and configures a transformation in the UI. Useful for ad hoc exports, merges, or analysis outside a defined pipeline.
4. Core Capabilities¶
4.1 Source List Operations¶
These are the most common pipeline use cases.
| Operation | Description |
|---|---|
| Merge | Combine two or more source lists into one, deduplicating by DOI or URL |
| Filter | Reduce a source list by criteria: date range, relevance score threshold, source type, keyword presence |
| Sort | Reorder by relevance score, date, or publication |
| Deduplicate | Remove duplicate sources across lists, with configurable match logic (exact URL, DOI, or title similarity) |
| Top-N | Return only the top N sources by relevance score |
| Annotate | Add tags or notes to sources in bulk based on criteria |
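The Merge and Deduplicate operations above can be sketched as follows. This is a minimal illustration, not the Thinklio implementation; it assumes each source is a dict with optional `doi` and `url` fields (field names are illustrative), and it implements the `doi_or_url` match logic with first-occurrence-wins semantics.

```python
def merge_source_lists(*lists, dedup_by="doi_or_url"):
    """Combine source lists, keeping the first occurrence of each source.

    Identity is the DOI when present, falling back to the URL; sources
    with neither are kept unconditionally.
    """
    seen = set()
    merged = []
    for source_list in lists:
        for source in source_list:
            key = source.get("doi") or source.get("url")
            if key is None:
                merged.append(source)  # no usable identity; keep it
            elif key not in seen:
                seen.add(key)
                merged.append(source)
    return merged
```

Title-similarity matching (the third configurable mode) would need a fuzzier key function; see also the open question on diff identity in section 11.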
4.2 Export Operations¶
| Operation | Description |
|---|---|
| To CSV | Export a source list or any tabular agent output as CSV |
| To JSON | Export any agent output as JSON |
| To markdown table | Convert a source list to a formatted markdown reference table |
| Citation list | Produce a formatted citation list in a specified style (APA, Chicago, Vancouver, etc.) |
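As a sketch of the To CSV operation: assuming a source is a flat-ish dict (field names below are illustrative, not the Thinklio schema), export can project a chosen set of columns and drop nested structures. How nested data such as facts lists should survive export is an open question noted in section 11.

```python
import csv
import io

def source_list_to_csv(sources, fields=("title", "url", "relevance_score")):
    """Flatten a source list to CSV text, projecting only `fields`.

    Nested metadata (facts lists, etc.) is simply omitted here.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(fields))
    writer.writeheader()
    for source in sources:
        writer.writerow({f: source.get(f, "") for f in fields})
    return buf.getvalue()
```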
4.3 Draft Operations¶
| Operation | Description |
|---|---|
| Extract sections | Pull named sections from a Draft as separate text outputs |
| Flatten | Convert a structured Draft to plain text |
| Word count analysis | Return word counts per section vs. targets |
| Diff | Compare two versions of a Draft and return a structured diff |
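The word count analysis operation might look like the sketch below, under the assumption that a Draft can be viewed as a mapping from section name to text and that per-section targets are supplied alongside it (both assumptions are illustrative; the Draft schema is defined by the Writer Agent, not here).

```python
def word_count_report(draft_sections, targets):
    """Per-section word counts vs. targets.

    `draft_sections` maps section name -> text; `targets` maps section
    name -> target word count. `delta` is None when no target is set.
    """
    report = {}
    for name, text in draft_sections.items():
        count = len(text.split())
        target = targets.get(name)
        report[name] = {
            "words": count,
            "target": target,
            "delta": count - target if target is not None else None,
        }
    return report
```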
4.4 General Purpose Operations¶
| Operation | Description |
|---|---|
| Schema map | Map fields from one structured output to another schema |
| Aggregate | Count, sum, or group records by a field |
| Join | Join two datasets on a common field |
| Pivot | Reshape tabular data |
| Lookup | Enrich records by matching against a reference dataset (e.g. add category labels to sources) |
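The Aggregate operation's count-by-field mode is the simplest of these; a sketch, assuming records are dicts (records missing the field group under `None`):

```python
from collections import Counter

def aggregate_count(records, by):
    """Count records grouped by the value of field `by`."""
    return Counter(record.get(by) for record in records)
```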
5. Configuration¶
5.1 Admin Configuration¶
| Setting | Description |
|---|---|
| Max input records | Hard limit on records that can be processed in a single run |
| Export formats permitted | Which export formats are available in this workspace |
| External data sources | Whether the Data Agent may query external data sources (beyond agent outputs) |
5.2 Run-time Parameters¶
| Parameter | Type | Description |
|---|---|---|
| inputs | DataInput[] | One or more inputs. Each is a reference to an agent output, a saved dataset, or inline data. |
| operations | Operation[] | Ordered list of operations to apply. Operations are applied sequentially. |
| output_format | enum | json, csv, markdown, or source_list (typed Thinklio output) |
| save_to | reference | Optional record to attach the output to |
| label | string | Optional label for the output dataset |
Operations are composable. A run might specify: merge two source lists → filter by relevance > 0.7 → top 20 → export as CSV.
6. Operation Specification¶
Operations are defined as a pipeline of steps. Each step has a type and type-specific parameters:
Example pipeline spec:

```json
[
  {
    "type": "merge",
    "params": {
      "inputs": ["source_list_a", "source_list_b"],
      "dedup_by": "doi_or_url"
    }
  },
  {
    "type": "filter",
    "params": {
      "field": "relevance_score",
      "operator": "gte",
      "value": 0.7
    }
  },
  {
    "type": "top_n",
    "params": {
      "n": 20,
      "order_by": "relevance_score",
      "direction": "desc"
    }
  },
  {
    "type": "export",
    "params": {
      "format": "csv"
    }
  }
]
```
This declarative approach means coordinator agents can construct operation pipelines programmatically without custom code.
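To make the declarative approach concrete, here is a minimal interpreter for such a pipeline spec. It is a sketch, not the Data Agent's implementation: it supports only the four step types from the example, assumes records are dicts with `doi`/`url`/score fields, and stubs export to a format tag on the result.

```python
def run_pipeline(steps, datasets):
    """Apply an ordered list of operation steps to named input datasets.

    `datasets` maps input names to lists of record dicts; each step is a
    {"type": ..., "params": ...} dict as in the example spec.
    """
    data = []
    for step in steps:
        p = step["params"]
        if step["type"] == "merge":
            seen, data = set(), []
            for name in p["inputs"]:
                for rec in datasets[name]:
                    key = rec.get("doi") or rec.get("url")
                    if key not in seen:
                        seen.add(key)
                        data.append(rec)
        elif step["type"] == "filter":
            ops = {"gte": lambda a, b: a >= b, "lte": lambda a, b: a <= b}
            data = [r for r in data
                    if ops[p["operator"]](r[p["field"]], p["value"])]
        elif step["type"] == "top_n":
            data = sorted(data, key=lambda r: r[p["order_by"]],
                          reverse=(p["direction"] == "desc"))[:p["n"]]
        elif step["type"] == "export":
            return {"format": p["format"], "records": data}
    return {"format": "json", "records": data}
```

A coordinator would build the `steps` list as plain data and never touch this logic, which is the point of the declarative design.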
7. Output Structure¶
The Data Agent returns a DataOutput:
DataOutput
├── output_id UUID
├── label string
├── format enum
├── record_count integer
├── generated_at timestamp
├── operations Operation[] (log of applied operations)
└── data string | object (format-dependent)
For source_list output format, the result is a valid SourceList that can be passed directly to the Writer Agent.
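The DataOutput envelope above could be modelled as follows. Field names come from the spec; the defaults and the decision to leave the operation log as plain dicts are illustrative assumptions, not part of the specification.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class DataOutput:
    """Envelope for a Data Agent result (see section 7)."""
    format: str                     # output format enum
    record_count: int
    data: Any                       # string | object, format-dependent
    label: str = ""
    operations: list = field(default_factory=list)  # log of applied ops
    output_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())
```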
8. User Interface¶
8.1 Configuration Screen¶
- Input selector: add one or more inputs (agent output references, saved datasets, or file upload for CSV/JSON)
- Operation builder: drag-and-drop or ordered form for adding and sequencing operations
- Each operation type shows relevant parameter fields
- Preview of record count after each operation (where computable without full execution)
- Output format: dropdown
- Label: text field
- Save to: record picker
8.2 Progress View¶
- Record counts at each operation step
- Status messages
- Cancel button
8.3 Results View¶
- Record count summary
- Preview of output (first 20 rows for tabular formats, full view for small datasets)
- Download button (CSV, JSON, or markdown)
- Save to media system option (for outputs that warrant artefact treatment)
- Copy as markdown table button
9. Data Model Integration¶
| Data Object | Interaction |
|---|---|
| Note | Data output saved as a Note (e.g. a merged source list for reference) |
| Task | Export or merge run tracked as a Task subtask |
| Item | Structured data attached to an Item (e.g. a CSV of research for an enquiry) |
10. Use Cases¶
UC-1: Merge research streams¶
A coordinator runs three Research Agent instances (academic, news, general) for a newsletter. Before passing to the Writer Agent, it invokes the Data Agent to merge and deduplicate the three source lists, filter to relevance > 0.65, and return a single consolidated source list.
UC-2: Export source list to CSV¶
A user has a source list from a research run and wants to share it with a colleague who doesn't use Thinklio. They open the Data Agent, select the source list, add an export-to-CSV operation, and download the file.
UC-3: Citation list generation¶
A researcher wants a formatted reference list from their source list. They run the Data Agent with a citation-list operation set to Vancouver style, then copy the output into their document.
UC-4: Compare research runs¶
A scheduled Research Agent has been running for four weeks. A user wants to see what new sources have appeared since week one. They pass the week-one and week-four source lists to the Data Agent with a diff operation to identify net-new sources.
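The net-new comparison in UC-4 reduces to a set difference on source identity. A sketch, matching by DOI when present and URL otherwise (the title-similarity fallback is the open question raised in section 11 and is not attempted here; field names are illustrative):

```python
def net_new_sources(old, new):
    """Sources in `new` whose identity does not appear in `old`."""
    def key(source):
        return source.get("doi") or source.get("url")
    old_keys = {key(s) for s in old}
    return [s for s in new if key(s) not in old_keys]
```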
UC-5: Relevance filtering before writing¶
A coordinator passes a large source list (50 sources) from a broad research run to the Data Agent, which filters to the top 15 by relevance score before passing the reduced list to the Writer Agent. This keeps the writer focused and reduces token cost.
11. Open Questions¶
- Should the Data Agent be able to invoke the Research Agent's APIs directly (e.g. to enrich a source list with fresh metadata), or should it only work on existing agent outputs? Direct API access makes it more powerful but blurs the boundary with the Research Agent.
- For the diff operation on source lists, what constitutes "the same source" when DOIs are absent — title similarity threshold, URL match, or both?
- The general-purpose operations (join, pivot, lookup) introduce significant scope. Should these be phase-two capabilities, with the initial release focused on source list and export operations only?
- CSV export of a source list loses the hierarchical structure (facts lists, nested metadata). Should the export flatten this, or produce multiple related CSV files?