
Data Agent

Thinklio Built-in Agent Specification
Version 0.1 | March 2026


1. Purpose and Problem Statement

The Data Agent is a general-purpose data manipulation agent. Its primary role in the standard pipeline is to work with structured outputs from other agents — most commonly source lists from the Research Agent — but its scope is broader than any single pipeline.

It handles three categories of work:

  • Transformation — converting agent output to other formats (e.g. a source list to CSV, a draft to plain text)
  • Combination — merging, deduplicating, or intersecting outputs from multiple agent runs
  • Filtering and enrichment — applying criteria to reduce, sort, or annotate a dataset

The Data Agent is deliberately general. It is not the right tool for writing, checking, or rendering — those belong to the other pipeline agents. But anything that looks like "take this structured data and reshape it" is in scope.


2. Position in the Pipeline

The Data Agent can appear at multiple points:

Research Agent ×N  →  [Data Agent: merge]  →  Writer Agent
Research Agent     →  Writer Agent          →  [Data Agent: export CSV]

It is typically used either before the Writer Agent (to prepare a combined source list) or after the full pipeline (to export results in a consumable format). It can also be used standalone, outside any pipeline context.


3. Invocation Modes

Programmatic (agent-to-agent): A coordinator passes one or more data inputs and a transformation specification. The Data Agent returns the transformed output. No UI required.

Standalone (user-initiated): A user selects inputs and configures a transformation in the UI. Useful for ad hoc exports, merges, or analysis outside a defined pipeline.


4. Core Capabilities

4.1 Source List Operations

These are the most common pipeline use cases.

  • Merge: Combine two or more source lists into one, deduplicating by DOI or URL
  • Filter: Reduce a source list by criteria such as date range, relevance score threshold, source type, or keyword presence
  • Sort: Reorder by relevance score, date, or publication
  • Deduplicate: Remove duplicate sources across lists, with configurable match logic (exact URL, DOI, or title similarity)
  • Top-N: Return only the top N sources by relevance score
  • Annotate: Add tags or notes to sources in bulk based on criteria
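To make the merge-and-deduplicate behaviour concrete, here is a minimal sketch of the Merge operation's "doi_or_url" match. The field names (`doi`, `url`, `title`) are illustrative assumptions, not a confirmed Thinklio source schema.

```python
# Hypothetical sketch: merge source lists, deduplicating by DOI or URL.
# Field names (doi, url, title) are assumptions, not a confirmed schema.

def merge_source_lists(*lists, dedup_by="doi_or_url"):
    """Combine source lists, keeping the first occurrence of each source."""
    seen, merged = set(), []
    for source in (s for lst in lists for s in lst):
        # Prefer the DOI as the identity key; fall back to a normalised URL.
        key = source.get("doi") or source.get("url", "").rstrip("/").lower()
        if key and key in seen:
            continue
        seen.add(key)
        merged.append(source)
    return merged

a = [{"doi": "10.1/x", "title": "A"}, {"url": "https://ex.org/p1", "title": "B"}]
b = [{"doi": "10.1/x", "title": "A (dup)"}, {"url": "https://ex.org/p2", "title": "C"}]
merged = merge_source_lists(a, b)  # three unique sources; the duplicate DOI is dropped
```

First-occurrence-wins is one plausible policy; a real implementation would also need the title-similarity fallback noted under Deduplicate.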

4.2 Export Operations

  • To CSV: Export a source list or any tabular agent output as CSV
  • To JSON: Export any agent output as JSON
  • To markdown table: Convert a source list to a formatted markdown reference table
  • Citation list: Produce a formatted citation list in a specified style (APA, Chicago, Vancouver, etc.)
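As an illustration of the markdown-table export, a minimal sketch follows. The column set is an assumption; the actual export would presumably be driven by the source list schema.

```python
# Illustrative sketch of "to markdown table"; the column choice is an assumption.

def to_markdown_table(sources, columns=("title", "url", "relevance_score")):
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join(" --- " for _ in columns) + "|"
    rows = [
        "| " + " | ".join(str(s.get(c, "")) for c in columns) + " |"
        for s in sources
    ]
    return "\n".join([header, divider, *rows])

table = to_markdown_table(
    [{"title": "Paper A", "url": "https://ex.org/a", "relevance_score": 0.9}]
)
```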

4.3 Draft Operations

  • Extract sections: Pull named sections from a Draft as separate text outputs
  • Flatten: Convert a structured Draft to plain text
  • Word count analysis: Return word counts per section vs. targets
  • Diff: Compare two versions of a Draft and return a structured diff
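The word-count analysis could look something like the sketch below. The Draft shape (sections with `name`, `text`, and `target_words`) is assumed here, not taken from a published schema.

```python
# Sketch of "word count analysis": counts per section compared with targets.
# The Draft shape (name/text/target_words per section) is an assumption.

def word_count_report(draft):
    report = []
    for section in draft["sections"]:
        count = len(section["text"].split())
        target = section.get("target_words")
        report.append({
            "section": section["name"],
            "words": count,
            "target": target,
            # delta < 0 means the section is under its target
            "delta": count - target if target is not None else None,
        })
    return report

draft = {"sections": [{"name": "Intro", "text": "one two three four", "target_words": 5}]}
report = word_count_report(draft)
```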

4.4 General Purpose Operations

  • Schema map: Map fields from one structured output to another schema
  • Aggregate: Count, sum, or group records by a field
  • Join: Join two datasets on a common field
  • Pivot: Reshape tabular data
  • Lookup: Enrich records by matching against a reference dataset (e.g. add category labels to sources)
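For a flavour of the general-purpose operations, here is a minimal Aggregate (group + count) sketch; the `source_type` field is illustrative.

```python
# Minimal sketch of the Aggregate operation (count grouped by a field).
# The source_type field name is illustrative, not a confirmed schema.
from collections import Counter

def aggregate_count(records, by):
    """Count records grouped by the value of one field."""
    return dict(Counter(r.get(by) for r in records))

sources = [
    {"title": "A", "source_type": "academic"},
    {"title": "B", "source_type": "news"},
    {"title": "C", "source_type": "academic"},
]
counts = aggregate_count(sources, by="source_type")
```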

5. Configuration

5.1 Admin Configuration

  • Max input records: Hard limit on records that can be processed in a single run
  • Export formats permitted: Which export formats are available in this workspace
  • External data sources: Whether the Data Agent may query external data sources (beyond agent outputs)

5.2 Run-time Parameters

  • inputs (DataInput[]): One or more inputs; each is a reference to an agent output, a saved dataset, or inline data
  • operations (Operation[]): Ordered list of operations, applied sequentially
  • output_format (enum): json, csv, markdown, or source_list (typed Thinklio output)
  • save_to (reference): Optional record to attach the output to
  • label (string): Optional label for the output dataset

Operations are composable. A run might specify: merge two source lists → filter by relevance > 0.7 → top 20 → export as CSV.


6. Operation Specification

Operations are defined as a pipeline of steps. Each step has a type and type-specific parameters:

Operation
├── type        enum
└── params      object (type-specific)

Example pipeline spec:

[
  {
    "type": "merge",
    "params": {
      "inputs": ["source_list_a", "source_list_b"],
      "dedup_by": "doi_or_url"
    }
  },
  {
    "type": "filter",
    "params": {
      "field": "relevance_score",
      "operator": "gte",
      "value": 0.7
    }
  },
  {
    "type": "top_n",
    "params": {
      "n": 20,
      "order_by": "relevance_score",
      "direction": "desc"
    }
  },
  {
    "type": "export",
    "params": {
      "format": "csv"
    }
  }
]

This declarative approach means coordinator agents can construct operation pipelines programmatically without custom code.
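To show how a coordinator might execute such a spec, here is a deliberately minimal interpreter for the step types above. It is a sketch under simplifying assumptions: input resolution is a plain dict lookup, only the "doi_or_url" dedup mode is handled, and the export step covers CSV only.

```python
# Minimal sketch of an interpreter for the declarative pipeline in §6.
# Assumptions: inputs resolve via a dict; dedup_by "doi_or_url" is the only
# merge mode sketched; export handles CSV only.
import csv
import io

def run_pipeline(steps, datasets):
    data = []
    for step in steps:
        p = step["params"]
        if step["type"] == "merge":
            seen = set()
            for name in p["inputs"]:
                for rec in datasets[name]:
                    key = rec.get("doi") or rec.get("url")
                    if key not in seen:
                        seen.add(key)
                        data.append(rec)
        elif step["type"] == "filter":
            ops = {"gte": lambda a, b: a >= b, "lte": lambda a, b: a <= b}
            data = [r for r in data if ops[p["operator"]](r[p["field"]], p["value"])]
        elif step["type"] == "top_n":
            data = sorted(data, key=lambda r: r[p["order_by"]],
                          reverse=p["direction"] == "desc")[: p["n"]]
        elif step["type"] == "export" and p["format"] == "csv":
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=sorted({k for r in data for k in r}))
            writer.writeheader()
            writer.writerows(data)
            data = buf.getvalue()
    return data

datasets = {
    "source_list_a": [{"url": "u1", "relevance_score": 0.9},
                      {"url": "u2", "relevance_score": 0.5}],
    "source_list_b": [{"url": "u1", "relevance_score": 0.9},
                      {"url": "u3", "relevance_score": 0.8}],
}
spec = [
    {"type": "merge", "params": {"inputs": ["source_list_a", "source_list_b"],
                                 "dedup_by": "doi_or_url"}},
    {"type": "filter", "params": {"field": "relevance_score",
                                  "operator": "gte", "value": 0.7}},
    {"type": "top_n", "params": {"n": 20, "order_by": "relevance_score",
                                 "direction": "desc"}},
    {"type": "export", "params": {"format": "csv"}},
]
result = run_pipeline(spec, datasets)  # CSV text with the two surviving sources
```

Because each step only reads the output of the previous one, adding a new operation type is a matter of adding a branch, which is what makes the declarative form attractive for coordinators.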


7. Output Structure

The Data Agent returns a DataOutput:

DataOutput
├── output_id       UUID
├── label           string
├── format          enum
├── record_count    integer
├── generated_at    timestamp
├── operations      Operation[] (log of applied operations)
└── data            string | object (format-dependent)

For source_list output format, the result is a valid SourceList that can be passed directly to the Writer Agent.
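One way the DataOutput envelope could look in code is sketched below; this is an illustration of the field list above, not the canonical Thinklio type definitions.

```python
# Sketch of the DataOutput envelope from §7; not canonical Thinklio types.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid

@dataclass
class Operation:
    type: str
    params: dict

@dataclass
class DataOutput:
    label: str
    format: str                  # json | csv | markdown | source_list
    record_count: int
    data: Any                    # string or object, format-dependent
    operations: list = field(default_factory=list)   # log of applied operations
    output_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    generated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

out = DataOutput(label="merged sources", format="csv",
                 record_count=2, data="title,url\n")
```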


8. User Interface

8.1 Configuration Screen

  • Input selector: add one or more inputs (agent output references, saved datasets, or file upload for CSV/JSON)
  • Operation builder: drag-and-drop or ordered form for adding and sequencing operations
  • Each operation type shows relevant parameter fields
  • Preview of record count after each operation (where computable without full execution)
  • Output format: dropdown
  • Label: text field
  • Save to: record picker

8.2 Progress View

  • Record counts at each operation step
  • Status messages
  • Cancel button

8.3 Results View

  • Record count summary
  • Preview of output (first 20 rows for tabular formats, full view for small datasets)
  • Download button (CSV, JSON, or markdown)
  • Save to media system option (for outputs that warrant artefact treatment)
  • Copy as markdown table button

9. Data Model Integration

  • Note: Data output saved as a Note (e.g. a merged source list for reference)
  • Task: Export or merge run tracked as a Task subtask
  • Item: Structured data attached to an Item (e.g. a CSV of research for an enquiry)

10. Use Cases

UC-1: Merge research streams

A coordinator runs three Research Agent instances (academic, news, general) for a newsletter. Before passing to the Writer Agent, it invokes the Data Agent to merge and deduplicate the three source lists, filter to relevance > 0.65, and return a single consolidated source list.

UC-2: Export source list to CSV

A user has a source list from a research run and wants to share it with a colleague who doesn't use Thinklio. They open the Data Agent, select the source list, add an export-to-CSV operation, and download the file.

UC-3: Citation list generation

A researcher wants a formatted reference list from their source list. They run the Data Agent with a citation-list operation set to Vancouver style, then copy the output into their document.

UC-4: Compare research runs

A scheduled Research Agent has been running for four weeks. A user wants to see what new sources have appeared since week one. They pass the week-one and week-four source lists to the Data Agent with a diff operation to identify net-new sources.

UC-5: Relevance filtering before writing

A coordinator passes a large source list (50 sources) from a broad research run to the Data Agent, which filters to the top 15 by relevance score before passing the reduced list to the Writer Agent. This keeps the writer focused and reduces token cost.


11. Open Questions

  • Should the Data Agent be able to invoke the Research Agent's APIs directly (e.g. to enrich a source list with fresh metadata), or should it only work on existing agent outputs? Direct API access makes it more powerful but blurs the boundary with the Research Agent.
  • For the diff operation on source lists, what constitutes "the same source" when DOIs are absent — title similarity threshold, URL match, or both?
  • The general-purpose operations (join, pivot, lookup) introduce significant scope. Should these be phase-two capabilities, with the initial release focused on source list and export operations only?
  • CSV export of a source list loses the hierarchical structure (facts lists, nested metadata). Should the export flatten this, or produce multiple related CSV files?
