/docprocess | Type: Embedded | PCID required: No
Document processing tools: convert between formats (CSV, XLSX, HTML, PDF, Markdown, Word, XML, JSON), extract text (PDF, DOCX, PPTX), OCR images and PDFs, fill PDF forms, fill Word templates, create Word documents, validate CSVs, extract invoice line items with AI, and process Word documents with AI.
Tools
| Tool | Description |
|---|---|
docprocess_csv_to_xlsx | Convert CSV files to Excel |
docprocess_xlsx_to_csv | Convert Excel files to CSV |
docprocess_html_to_pdf | Convert HTML to PDF |
docprocess_md_to_docx | Convert Markdown to Word |
docprocess_md_to_pdf | Convert Markdown to PDF |
docprocess_xml_to_json | Convert XML to JSON |
docprocess_docx_to_txt | Extract text from Word documents |
docprocess_pdf_to_txt | Extract text from PDF files |
docprocess_pptx_to_txt | Extract text from PowerPoint files |
docprocess_ocr | Extract text from images/PDFs with OCR |
docprocess_ocr_poll | Poll OCR job status |
docprocess_fill_pdf | Fill a PDF form with data |
docprocess_fill_word_tpl | Fill a Word template with data |
docprocess_create_word | Create a Word document from JSON spec |
docprocess_word_ai | Edit Word documents with AI |
docprocess_word_ai_poll | Poll Word AI processing status |
docprocess_validate_csv | Validate CSV structure and data quality |
docprocess_invoice_extract | Extract line items from invoices (PDF/image) with AI |
docprocess_invoice_extract_poll | Poll invoice extraction job status |
Conversion tools — common pattern
The nine conversion and text-extraction tools below all share the same base parameters and response structure.
Common parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_urls | string[] | Yes | — | URLs of the source files to convert |
file_links_expire_in_days | number | No | 7 | Days until output file links expire (1–30) |
Common response fields:
| Field | Type | Description |
|---|---|---|
message | string | Summary message |
files | object[] | Array of conversion results |
files[].file_url | string | URL of the original file |
files[].file_size | number | Size of the original file in bytes |
files[].mime_type | string | MIME type of the original file |
files[].filename | string | Name of the original file |
files[].converted.file_url | string | URL of the converted file |
files[].converted.file_size | number | Size of the converted file in bytes |
files[].converted.mime_type | string | MIME type of the converted file |
files[].converted.filename | string | Name of the converted file |
total | number | Total number of files processed |
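
As a rough sketch of the common pattern, the helpers below build the shared arguments and read the converted file URLs out of a response shaped like the fields above. The helper names and the sample response values are hypothetical; how a tool is actually invoked depends on your client and is not shown here.

```python
from __future__ import annotations

from typing import Any


def build_conversion_args(file_urls: list[str], expire_in_days: int = 7) -> dict[str, Any]:
    """Build the arguments shared by the conversion tools."""
    if not file_urls:
        raise ValueError("file_urls must contain at least one URL")
    if not 1 <= expire_in_days <= 30:
        raise ValueError("file_links_expire_in_days must be between 1 and 30")
    return {"file_urls": list(file_urls), "file_links_expire_in_days": expire_in_days}


def converted_urls(response: dict[str, Any]) -> dict[str, str]:
    """Map each original file URL to the URL of its converted file."""
    return {f["file_url"]: f["converted"]["file_url"] for f in response.get("files", [])}


# Hypothetical response, shaped like the common response fields.
sample_response = {
    "message": "1 file converted",
    "files": [
        {
            "file_url": "https://example.com/data.csv",
            "file_size": 120,
            "mime_type": "text/csv",
            "filename": "data.csv",
            "converted": {
                "file_url": "https://example.com/out/data.xlsx",
                "file_size": 4096,
                "mime_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                "filename": "data.xlsx",
            },
        }
    ],
    "total": 1,
}
```

Validating the 1–30 range locally simply surfaces an out-of-range expiry before the request is sent.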
docprocess_csv_to_xlsx
Convert one or more CSV files to Excel format.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_xlsx_to_csv
Convert Excel files to CSV. Multi-sheet workbooks produce a separate CSV per sheet.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_html_to_pdf
Convert HTML files to PDF. URLs must point directly to .html files.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_md_to_docx
Convert Markdown files to Word documents.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_md_to_pdf
Convert Markdown files to PDF with configurable page size and orientation.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_urls | string[] | Yes | — | URLs of the Markdown files to convert |
file_links_expire_in_days | number | No | 7 | Days until output file links expire (1–30) |
pdf_format | string | No | "a4" | Page size — "a4" or "letter" |
pdf_orientation | string | No | "portrait" | Page orientation — "portrait" or "landscape" |
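
A minimal sketch of assembling the docprocess_md_to_pdf arguments, assuming you want to reject values outside the documented options before calling the tool. The helper name is hypothetical, and whether the service itself accepts other spellings (such as uppercase "A4") is not stated in this reference.

```python
from __future__ import annotations

from typing import Any

PDF_FORMATS = ("a4", "letter")
PDF_ORIENTATIONS = ("portrait", "landscape")


def build_md_to_pdf_args(
    file_urls: list[str],
    pdf_format: str = "a4",
    pdf_orientation: str = "portrait",
    expire_in_days: int = 7,
) -> dict[str, Any]:
    """Build docprocess_md_to_pdf arguments, rejecting undocumented values."""
    if pdf_format not in PDF_FORMATS:
        raise ValueError(f"pdf_format must be one of {PDF_FORMATS}")
    if pdf_orientation not in PDF_ORIENTATIONS:
        raise ValueError(f"pdf_orientation must be one of {PDF_ORIENTATIONS}")
    if not 1 <= expire_in_days <= 30:
        raise ValueError("file_links_expire_in_days must be between 1 and 30")
    return {
        "file_urls": list(file_urls),
        "file_links_expire_in_days": expire_in_days,
        "pdf_format": pdf_format,
        "pdf_orientation": pdf_orientation,
    }
```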
docprocess_xml_to_json
Convert XML files to JSON. The result can be returned inline or stored as a file.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_urls | string[] | Yes | — | URLs of the XML files to convert |
file_links_expire_in_days | number | No | 7 | Days until output file links expire (1–30) |
store_xml_json | boolean | No | false | true to store the JSON as a file, false to return data inline |
docprocess_docx_to_txt
Extract plain text from Word documents.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_pdf_to_txt
Extract plain text from PDF files.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_pptx_to_txt
Extract plain text from PowerPoint files.
Parameters: Common parameters only; see conversion tools common pattern.
Response fields: Common response fields; see above.
docprocess_ocr
Extract text from images or PDFs using OCR. Supports PNG, JPEG, GIF, WebP, BMP, TIFF, SVG, and PDF. Can run synchronously or asynchronously.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
fileUrls | string[] | Yes | — | URLs of images or PDFs to OCR |
languageHints | string[] | No | — | Language hints for OCR (e.g. ["en", "es"]) |
extractTextOnly | boolean | No | true | Extract plain text only |
collectionId | string | No | — | File storage collection ID for output files |
async | boolean | No | false | Run asynchronously — poll with docprocess_ocr_poll |
Response fields:
| Field | Type | Description |
|---|---|---|
success | boolean | Whether the request succeeded |
async | boolean | Whether the job is running asynchronously |
collectionId | string | Collection ID (if provided) |
capability | string | OCR capability used |
totalFiles | number | Total files submitted |
successfulFiles | number | Files processed successfully |
failedFiles | number | Files that failed |
totalOcrPages | number | Total pages processed |
results | object[] | Array of per-file results |
results[].fileUrl | string | URL of the input file |
results[].inputFileName | string | Original file name |
results[].outputFileName | string | Output file name |
results[].outputMimeType | string | MIME type of the output |
results[].extractedText | string | Extracted text content |
results[].fileId | string | File ID in storage |
results[].signedUrl | string | Signed URL for the output file |
status | string | Job status |
responseId | string | ID for polling async jobs |
message | string | Status message |
docprocess_ocr_poll
Poll the status of an asynchronous OCR job.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
responseId | string | Yes | Response ID from the original docprocess_ocr call |
Response fields:
| Field | Type | Description |
|---|---|---|
status | string | Job status — "completed", "failed", "processing", or "queued" |
responseId | string | The response ID |
results | object[] | Per-file results (same structure as docprocess_ocr results) |
error | string | Error message (when status is "failed") |
message | string | Status message |
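
The asynchronous OCR flow can be sketched as a small polling loop. Here `call_tool` is a hypothetical callable standing in for however your client invokes a tool by name; the 5-second interval and the attempt limit are assumptions, since this reference does not state a polling interval for OCR jobs.

```python
from __future__ import annotations

import time
from typing import Any, Callable

# Statuses after which docprocess_ocr_poll will not change its answer.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_ocr(
    call_tool: Callable[[str, dict[str, Any]], dict[str, Any]],
    response_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 60,
) -> dict[str, Any]:
    """Poll docprocess_ocr_poll until the job is completed or failed."""
    for _ in range(max_attempts):
        result = call_tool("docprocess_ocr_poll", {"responseId": response_id})
        if result.get("status") in TERMINAL_STATUSES:
            return result
        # Still "queued" or "processing": wait before asking again.
        time.sleep(interval_s)
    raise TimeoutError(f"OCR job {response_id} did not finish after {max_attempts} polls")
```

The caller still has to inspect the returned status: a "failed" result carries an error message rather than results.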
docprocess_fill_pdf
Fill a PDF form with provided data. The PDF must have fillable form fields (not just a static document). Field matching is automatic: exact name match first, then case-insensitive partial match as a fallback.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
pdf_url | string | Yes | — | URL of the PDF form to fill |
form_data | object | Yes | — | Field-name-to-value pairs (see field types below) |
output_filename | string | No | "filled_form.pdf" | Name for the output file |
file_links_expire_in_days | number | No | 7 | Days until the output link expires |
form_data field types:
| Field type | Value format | Example |
|---|---|---|
| Text fields | "string value" | {"full_name": "Jane Doe"} |
| Checkboxes | true / false or "checked" | {"agree_terms": true} |
| Dropdowns | Option text (case-insensitive) | {"state": "California"} |
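
To make the matching order concrete, here is an illustrative sketch of "exact match first, then case-insensitive partial match". It is not the service's implementation: the function name is hypothetical, and treating "partial" as a substring match in either direction is an assumption.

```python
from __future__ import annotations


def match_field(key: str, pdf_field_names: list[str]) -> str | None:
    """Return the PDF field a form_data key would plausibly map to, or None."""
    # Step 1: exact name match wins.
    if key in pdf_field_names:
        return key
    # Step 2 (assumed semantics): case-insensitive substring match, either direction.
    needle = key.lower()
    for name in pdf_field_names:
        candidate = name.lower()
        if needle in candidate or candidate in needle:
            return name
    return None


# Example form_data covering the three field types in the table above.
form_data = {
    "full_name": "Jane Doe",   # text field
    "agree_terms": True,       # checkbox
    "state": "California",     # dropdown, matched by option text
}
```

Because partial matching can pick an unintended field when names overlap, using exact field names in form_data is the safer choice.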
Response fields:
| Field | Type | Description |
|---|---|---|
success | boolean | Whether the operation succeeded |
status | string | Status message |
file_url | string | URL of the filled PDF |
filename | string | Output file name |
size | number | File size in bytes |
fields_filled | number | Number of fields filled |
field_summary | string | Summary of filled fields |
docprocess_fill_word_tpl
Fill a Word template (.docx) with data while preserving all formatting. Templates use placeholder syntax in the document text, and the tool replaces each placeholder with your data.
Placeholder syntax:
| Syntax | Purpose | Example |
|---|---|---|
{name} | Simple value | {company} → “Acme Corp” |
{user.firstName} | Nested path | Access nested objects |
{#items}...{/items} | Loop | Repeat a section for each array item |
{#condition}...{/condition} | Conditional | Show section only if truthy |
{^condition}...{/condition} | Inverted conditional | Show section only if falsy |
{price * quantity} | Expression | Evaluate simple expressions |
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
template_url | string | Yes | — | URL of the Word template (.docx) |
data | object | Yes | — | Data object with values for template placeholders (keys are case-sensitive) |
output_filename | string | No | "filled_template.docx" | Name for the output file |
file_links_expire_in_days | number | No | 7 | Days until the output link expires |
Response fields:
| Field | Type | Description |
|---|---|---|
success | boolean | Whether the operation succeeded |
status | string | Status message |
file_url | string | URL of the filled document |
filename | string | Output file name |
size | number | File size in bytes |
placeholders_filled | number | Number of placeholders filled |
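
The sketch below shows a data object that would line up with the placeholder syntax described for this tool, plus a small helper illustrating how a nested path such as {user.firstName} relates to the data. The placeholder names and values are made up for illustration, and the helper only mirrors the idea of path lookup; it is not how the tool resolves placeholders internally.

```python
from __future__ import annotations

from typing import Any


def resolve_path(data: dict[str, Any], path: str) -> Any:
    """Follow a dotted placeholder path (e.g. "user.firstName") through nested dicts."""
    current: Any = data
    for part in path.split("."):
        current = current[part]  # keys are case-sensitive, as in the `data` parameter
    return current


# Hypothetical data for a template containing {company}, {user.firstName},
# a {#items}...{/items} loop, and an {^isPaid}...{/isPaid} inverted conditional.
template_data = {
    "company": "Acme Corp",
    "user": {"firstName": "Jane"},
    "items": [
        {"name": "Widget", "price": 5, "quantity": 2},
        {"name": "Gadget", "price": 12, "quantity": 1},
    ],
    "isPaid": False,
}
```

With this data, the loop section would repeat once per entry in items, and the inverted conditional section would be shown because isPaid is falsy.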
docprocess_create_word
Create a Word document (.docx) from scratch using a JSON specification. Build reports, invoices, contracts, and other professional documents programmatically, with no template required.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
document_spec | object | Yes | — | JSON specification defining the document structure (see below) |
output_filename | string | No | "created_document.docx" | Name for the output file |
file_links_expire_in_days | number | No | 7 | Days until the output link expires |
Document spec structure
The document_spec object contains a sections array. Each section has a children array of elements.
Element types
Heading: requires level (1–6) and text.
| Property | Type | Required | Description |
|---|---|---|---|
type | "heading" | Yes | Element type |
level | number | Yes | Heading level (1–6) |
text | string | Yes | Heading text |
alignment | string | No | "left", "center", "right", or "justified" |
Paragraph: set either text or children, not both.
| Property | Type | Required | Description |
|---|---|---|---|
type | "paragraph" | Yes | Element type |
text | string | No | Simple text (use this or children, not both) |
children | object[] | No | Array of formatted text runs (see formatting below) |
alignment | string | No | "left", "center", "right", or "justified" |
Bullet: requires text.
| Property | Type | Required | Description |
|---|---|---|---|
type | "bullet" | Yes | Element type |
text | string | Yes | Bullet text |
level | number | No | Indentation level (0+, default 0) |
alignment | string | No | "left", "center", "right", or "justified" |
Table: requires rows.
| Property | Type | Required | Description |
|---|---|---|---|
type | "table" | Yes | Element type |
rows | array[] | Yes | Array of rows, each row is an array of cell values (strings or {"text": "...", "bold": true}) |
alignment | string | No | "left", "center", "right", or "justified" |
Text run formatting
When using children in a paragraph, each text run supports these properties:
| Property | Type | Description |
|---|---|---|
text | string | Text content (required) |
bold | boolean | Bold |
italic | boolean | Italic |
underline | boolean | Underline |
strikethrough | boolean | Strikethrough |
doubleStrikethrough | boolean | Double strikethrough |
highlight | string | Background highlight — "yellow", "green", "cyan", "magenta", "red", "blue", "darkBlue", "darkGreen", "darkYellow", "lightGray", "darkGray", "black", "white" |
superScript | boolean | Superscript |
subScript | boolean | Subscript |
allCaps | boolean | ALL CAPITALS |
smallCaps | boolean | Small Capitals |
color | string | Hex color (e.g. "FF0000" for red) |
size | number | Font size in half-points (e.g. 28 = 14pt) |
font | string | Font family (e.g. "Arial", "Times New Roman") |
Example
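A small document_spec assembled from the element types and text run properties documented above, written as a Python dict and serialized to JSON. The document content is invented for illustration, and `pt_to_half_points` is a hypothetical helper for the half-point `size` unit.

```python
import json


def pt_to_half_points(points: float) -> int:
    """Convert a font size in points to the half-point units used by `size`."""
    return int(round(points * 2))


document_spec = {
    "sections": [
        {
            "children": [
                {"type": "heading", "level": 1, "text": "Quarterly Report", "alignment": "center"},
                {"type": "paragraph", "text": "Summary of results."},
                {
                    "type": "paragraph",
                    # Formatted runs: use `children` instead of `text`, not both.
                    "children": [
                        {"text": "Status: "},
                        {"text": "on track", "bold": True, "color": "FF0000",
                         "size": pt_to_half_points(14)},
                    ],
                },
                {"type": "bullet", "text": "North region", "level": 0},
                {
                    "type": "table",
                    "rows": [
                        [{"text": "Region", "bold": True}, {"text": "Owner", "bold": True}],
                        ["North", "Jane Doe"],
                    ],
                },
            ]
        }
    ]
}

# JSON form of the spec, as it would be passed in the document_spec parameter.
spec_json = json.dumps(document_spec, indent=2)
```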
Response fields:
| Field | Type | Description |
|---|---|---|
success | boolean | Whether the operation succeeded |
status | string | Status message |
file_url | string | URL of the created document |
filename | string | Output file name |
size | number | File size in bytes |
element_count | number | Number of elements in the document |
docprocess_word_ai
Edit an existing Word document with AI while preserving all formatting. The AI reads the document content, applies the requested changes, and produces a new .docx with the original styling intact.
This tool runs asynchronously. Call it to start processing, then poll with docprocess_word_ai_poll until the result is ready.
Workflow
- Call docprocess_word_ai with the document URL and task description; it returns a responseId
- Poll docprocess_word_ai_poll with that responseId every 5–10 seconds
- When status is "completed", download the result from resultFile.url
What you can do
| Task type | Example task values | Suggested strategy |
|---|---|---|
| Translation | "Translate to Spanish", "Translate to Japanese" | DENSE_CHANGES |
| Grammar & spelling | "Fix all grammar and spelling errors" | SPARSE_CHANGES |
| Rewriting | "Rewrite in a formal tone", "Simplify for an 8th-grade reading level" | DENSE_CHANGES |
| Summarization | "Summarize to 2 paragraphs" | DENSE_CHANGES |
| Tone adjustment | "Make this more concise and professional" | DENSE_CHANGES |
| Minor edits | "Replace all instances of 'Acme Corp' with 'Globex Inc'" | SPARSE_CHANGES |
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
documentUrl | string | Yes | — | URL of the .docx file to process |
task | string | Yes | — | Natural language description of what to do |
model | string | No | claude-sonnet-4-5-20250929 | AI model — claude-sonnet-4-5-20250929, gpt-4.1, gpt-4o, or gemini-2.5-flash |
strategy | string | No | auto-detected | "SPARSE_CHANGES" for minor edits (grammar, spelling, find-replace) or "DENSE_CHANGES" for major changes (translation, rewriting, summarization). Auto-detected if omitted. |
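
A minimal sketch of building the docprocess_word_ai arguments, assuming you want to catch an unsupported model or strategy name locally. The helper name is hypothetical; optional parameters are left out entirely when not given, so the documented default model and strategy auto-detection apply.

```python
from __future__ import annotations

from typing import Any

MODELS = ("claude-sonnet-4-5-20250929", "gpt-4.1", "gpt-4o", "gemini-2.5-flash")
STRATEGIES = ("SPARSE_CHANGES", "DENSE_CHANGES")


def build_word_ai_args(
    document_url: str,
    task: str,
    model: str | None = None,
    strategy: str | None = None,
) -> dict[str, Any]:
    """Build docprocess_word_ai arguments; omitted options fall back to the defaults."""
    args: dict[str, Any] = {"documentUrl": document_url, "task": task}
    if model is not None:
        if model not in MODELS:
            raise ValueError(f"model must be one of {MODELS}")
        args["model"] = model
    if strategy is not None:
        if strategy not in STRATEGIES:
            raise ValueError(f"strategy must be one of {STRATEGIES}")
        args["strategy"] = strategy
    return args
```

For example, a translation task would pass strategy="DENSE_CHANGES", while a find-and-replace edit would pass "SPARSE_CHANGES", following the suggested strategies in the task table.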
Preserved formatting
All document formatting is maintained during AI processing:
- Bold, italic, underline, strikethrough
- Font families, sizes, and colors
- Text highlighting and backgrounds
- Paragraph alignment and spacing
- Bulleted and numbered lists
- Tables with all formatting
- Headers and footers
- Page breaks and sections
- Images and charts (untouched)
Response fields:
| Field | Type | Description |
|---|---|---|
status | string | Job status |
responseId | string | ID for polling with docprocess_word_ai_poll |
message | string | Status message |
task | string | The task description |
docprocess_word_ai_poll
Poll for the result of a Word AI job. Call every 5–10 seconds until status is "completed" or "failed". When completed, download the processed document from resultFile.url.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
responseId | string | Yes | The responseId returned by docprocess_word_ai |
Response fields:
| Field | Type | Description |
|---|---|---|
status | string | Job status — "completed", "failed", "queued", or "processing" |
responseId | string | The response ID |
resultFile | object | The processed document (when completed) |
resultFile.url | string | URL of the processed document |
resultFile.filename | string | Output file name |
resultFile.size | number | File size in bytes |
error | string | Error message (when status is "failed") |
message | string | Status message |
docprocess_validate_csv
Validate the structure and data quality of CSV files. Reports errors, warnings, and statistics including encoding, delimiter, column/row counts, and detected data types.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
file_urls | string[] | Yes | — | URLs of the CSV files to validate |
file_links_expire_in_days | number | No | 7 | Days until output file links expire (1–30) |
Response fields:
| Field | Type | Description |
|---|---|---|
message | string | Summary message |
files | object[] | Array of validation results |
files[].validation.isValid | boolean | Whether the CSV is valid |
files[].validation.errors | string[] | Validation errors found |
files[].validation.warnings | string[] | Validation warnings |
files[].validation.statistics.encoding | string | Detected file encoding |
files[].validation.statistics.delimiter | string | Detected delimiter character |
files[].validation.statistics.hasHeaders | boolean | Whether headers were detected |
files[].validation.statistics.columnCount | number | Number of columns |
files[].validation.statistics.rowCount | number | Number of rows |
files[].validation.statistics.emptyRowCount | number | Number of empty rows |
files[].validation.statistics.totalCells | number | Total number of cells |
files[].validation.statistics.emptyCells | number | Number of empty cells |
files[].validation.dataTypes | object | Detected data types per column |
total | number | Total number of files validated |
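
One way to consume the validation response is to reduce each file's result to a one-line summary. This is a sketch with a hypothetical helper name; the sample response is invented, and in particular the error message text and the shape of dataTypes are assumptions, since this reference does not spell them out.

```python
from __future__ import annotations

from typing import Any


def summarize_validation(response: dict[str, Any]) -> list[str]:
    """Return one summary line per validated file."""
    lines = []
    for f in response.get("files", []):
        v = f["validation"]
        stats = v["statistics"]
        state = "valid" if v["isValid"] else "invalid"
        lines.append(
            f"{state}: {stats['rowCount']} rows x {stats['columnCount']} columns, "
            f"{len(v['errors'])} error(s), {len(v['warnings'])} warning(s)"
        )
    return lines


# Hypothetical response, shaped like the response fields above.
sample_validation = {
    "message": "1 file validated",
    "files": [
        {
            "validation": {
                "isValid": False,
                "errors": ["Row 3 has 4 columns, expected 5"],
                "warnings": [],
                "statistics": {
                    "encoding": "utf-8",
                    "delimiter": ",",
                    "hasHeaders": True,
                    "columnCount": 5,
                    "rowCount": 10,
                    "emptyRowCount": 0,
                    "totalCells": 50,
                    "emptyCells": 2,
                },
                "dataTypes": {"id": "number"},
            }
        }
    ],
    "total": 1,
}
```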
docprocess_invoice_extract
Extract structured line items from invoices (PDF or image). Uses AI to discover columns dynamically from the document, so different invoice types (legal, product, etc.) produce different column names. Supports multi-page PDFs with automatic sharding and parallel processing. Runs asynchronously; poll with docprocess_invoice_extract_poll to retrieve results.
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
fileUrl | string | Yes | — | URL to the invoice file. Supported: PDF, JPEG, PNG |
pagesPerShard | number | No | 3 | Pages per processing shard for PDFs (1–20). Smaller values may improve accuracy for dense invoices |
Response fields:
| Field | Type | Description |
|---|---|---|
jobId | string | Job ID for polling with docprocess_invoice_extract_poll |
status | string | Initial status — "queued" |
createdAt | string | ISO timestamp when the job was created |
message | string | Human-readable status message with polling instructions |
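
As a worked example of the pagesPerShard setting, the sketch below estimates how many shards a PDF would produce. It assumes pages are split into consecutive shards of at most pagesPerShard pages each; the exact sharding scheme is not described in this reference, and the function name is hypothetical.

```python
import math


def shard_count(total_pages: int, pages_per_shard: int = 3) -> int:
    """Estimate the number of shards for a PDF, assuming consecutive fixed-size shards."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if not 1 <= pages_per_shard <= 20:
        raise ValueError("pagesPerShard must be between 1 and 20")
    return math.ceil(total_pages / pages_per_shard)
```

Under that assumption, a 10-page invoice with the default of 3 pages per shard gives ceil(10 / 3) = 4 shards, while setting pagesPerShard to 1 gives 10 shards.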
docprocess_invoice_extract_poll
Check the status of an invoice line-item extraction job. Call after docprocess_invoice_extract; poll every 10–15 seconds until status is "completed" or "failed". When completed, returns artifact URLs for the extracted CSV, JSON, and summary files.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
jobId | string | Yes | The jobId returned by docprocess_invoice_extract |
Response fields:
| Field | Type | Description |
|---|---|---|
jobId | string | Job ID |
status | string | "queued", "in_progress", "completed", or "failed" |
createdAt | string | ISO timestamp when the job was created |
completedAt | string | ISO timestamp when the job completed (when status is "completed") |
artifacts | object | Output artifact URLs (when status is "completed") |
artifacts.csvUrl | string | URL to download the extracted line items as CSV |
artifacts.jsonUrl | string | URL to download the extracted line items as JSON |
artifacts.summaryUrl | string | URL to download the processing summary |
error | string | Error message (when status is "failed") |
message | string | Human-readable status message |
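
A sketch of handling a single poll response from docprocess_invoice_extract_poll: return the artifact URLs once the job has completed, signal a failure, or report that the job is still running. The helper name and the sample values are hypothetical.

```python
from __future__ import annotations

from typing import Any


def invoice_artifacts(poll_response: dict[str, Any]) -> dict[str, str] | None:
    """Return artifact URLs for a completed job, None while it is still running."""
    status = poll_response.get("status")
    if status == "failed":
        raise RuntimeError(poll_response.get("error", "invoice extraction failed"))
    if status != "completed":
        # "queued" or "in_progress": poll again after 10-15 seconds.
        return None
    artifacts = poll_response["artifacts"]
    return {
        "csv": artifacts["csvUrl"],
        "json": artifacts["jsonUrl"],
        "summary": artifacts["summaryUrl"],
    }
```

Note that this tool reports a running job as "in_progress", whereas the OCR and Word AI poll tools use "processing", so status checks should not be shared blindly between them.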

