75 changes: 75 additions & 0 deletions api-reference/workflow/workflows.mdx
@@ -2232,6 +2232,81 @@ Allowed values for `subtype` and `model_name` include the following:
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-multimodal-3"`

### Extract node

An **Extract** node has a `type` of `structured_data_extractor` and a `subtype` of `llm`.

<AccordionGroup>
<Accordion title="Python SDK">
```python
extractor_workflow_node = WorkflowNode(
name="Extractor",
subtype="llm",
type="structured_data_extractor",
settings={
"output_mode": "<prepend-or-overwrite>",
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
)
```
</Accordion>
<Accordion title="curl, Postman">
```json
{
"name": "Extractor",
"type": "structured_data_extractor",
"subtype": "llm",
"settings": {
"output_mode": "<prepend-or-overwrite>",
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
}
```
</Accordion>
</AccordionGroup>

Fields for `settings` include:

- `output_mode`: _Optional_. The mode in which to output the extracted data. Allowed values include `prepend` (the default if not otherwise specified) and `overwrite`:

- `prepend`: Prepend the extracted data according to the schema specified in `schema_to_extract` to the data that is partitioned according to the default Unstructured document elements format.
- `overwrite`: Output only the extracted data according to the schema specified in `schema_to_extract`.

- `schema_to_extract`: _Required_. The schema or guidance for the structured data that you want to extract. One (and only one) of the following must also be specified:

- `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string.
- `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.

- Allowed values for `provider` and `model` include the following:

- `"provider": "anthropic"`

- `"model": "..."`

- `"provider": "azure_openai"`

- `"model": "..."`

- `"provider": "bedrock"`

- `"model": "..."`

- `"provider": "openai"`

- `"model": "..."`

[Learn more](/ui/data-extractor).
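To make the placeholders concrete, the following sketch builds a `settings` payload using an invented single-field schema and a hypothetical model identifier (both are illustrations only, not values confirmed by this reference). Note that `json_schema` must be a single string, so the schema is serialized with `json.dumps`:

```python
import json

# A minimal, hypothetical extraction schema in OpenAI Structured Outputs format.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice's unique identifier"
        }
    },
    "required": ["invoice_number"],
    "additionalProperties": False
}

settings = {
    "output_mode": "prepend",  # or "overwrite"
    "schema_to_extract": {
        # Specify json_schema or extraction_guidance, but not both.
        "json_schema": json.dumps(schema)
    },
    "provider": "anthropic",
    "model": "<model>"  # replace with a model identifier from the list above
}

print(json.dumps(settings, indent=2))
```

The same dictionary can be passed as the `settings` argument to `WorkflowNode` (Python SDK) or sent as the `settings` object in the JSON request body (`curl`, Postman).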

## List templates

To list templates, use the `UnstructuredClient` object's `templates.list_templates` function (for the Python SDK) or the `GET` method to call the `/templates` endpoint (for `curl` or Postman).
1 change: 1 addition & 0 deletions docs.json
@@ -120,6 +120,7 @@
"pages": [
"ui/document-elements",
"ui/partitioning",
"ui/data-extractor",
"ui/chunking",
{
"group": "Enriching",
Binary file added img/ui/data-extractor/house-plant-care.png
Binary file added img/ui/data-extractor/medical-invoice.png
Binary file added img/ui/data-extractor/real-estate-listing.png
Binary file added img/ui/data-extractor/schema-builder.png
Binary file not shown.
257 changes: 256 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
@@ -465,9 +465,264 @@ embedding model that is provided by an embedding provider. For the best embeddin
6. When you are done, click the close (**X**) button above the output on the right side of the screen to return to the workflow designer, where you can continue refining the workflow as you see fit.

## Step 7: Experiment with structured data extraction

In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process by which Unstructured
automatically extracts data from your source documents into a format that you define up front. For example, in addition to Unstructured
partitioning your source documents into elements with types such as `NarrativeText` and `UncategorizedText`, you can have Unstructured
output key information from the source documents in a custom structured data format, appearing within a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, and `email`.

1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.

![Adding an extract node](/img/ui/walkthrough/AddExtract.png)

2. On the **Details** tab of the node's settings pane, under **Provider**, select **Anthropic**. Under **Model**, select **Claude Sonnet 4.5**. This is the model that Unstructured will use to do the structured data extraction.

<Note>
The list of available models for structured data extraction is constantly being updated. Your list might also be different, depending on your Unstructured
account type. If **Anthropic** and **Claude Sonnet 4.5** are not available, choose another available provider and model from the lists.

If you have an Unstructured **Business** account and want to add more models to this list, contact your
Unstructured account administrator or Unstructured sales representative, or email Unstructured Support at
[support@unstructured.io](mailto:support@unstructured.io).
</Note>

3. Click **Upload JSON**.
4. In the **JSON Schema** box, enter the following JSON schema, and then click **Use this Schema**:

```json
{
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Full title of the research paper"
},
"authors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Author's full name"
},
"affiliation": {
"type": "string",
"description": "Author's institutional affiliation"
},
"email": {
"type": "string",
"description": "Author's email address"
}
},
"required": [
"name",
"affiliation",
"email"
],
"additionalProperties": false
},
"description": "List of paper authors with their affiliations"
},
"abstract": {
"type": "string",
"description": "Paper abstract summarizing the research"
},
"introduction": {
"type": "string",
"description": "Introduction section describing the problem and motivation"
},
"methodology": {
"type": "object",
"properties": {
"approach_name": {
"type": "string",
"description": "Name of the proposed method (e.g., StrokeNet)"
},
"description": {
"type": "string",
"description": "Detailed description of the methodology"
},
"key_techniques": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of key techniques used in the approach"
}
},
"required": [
"approach_name",
"description",
"key_techniques"
],
"additionalProperties": false
},
"experiments": {
"type": "object",
"properties": {
"datasets": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Dataset name"
},
"description": {
"type": "string",
"description": "Dataset description"
},
"size": {
"type": "string",
"description": "Dataset size (e.g., number of sentence pairs)"
}
},
"required": [
"name",
"description",
"size"
],
"additionalProperties": false
},
"description": "Datasets used for evaluation"
},
"baselines": {
"type": "array",
"items": {
"type": "string"
},
"description": "Baseline methods compared against"
},
"evaluation_metrics": {
"type": "array",
"items": {
"type": "string"
},
"description": "Metrics used for evaluation"
},
"experimental_setup": {
"type": "string",
"description": "Description of experimental configuration and hyperparameters"
}
},
"required": [
"datasets",
"baselines",
"evaluation_metrics",
"experimental_setup"
],
"additionalProperties": false
},
"results": {
"type": "object",
"properties": {
"main_findings": {
"type": "string",
"description": "Summary of main experimental findings"
},
"performance_improvements": {
"type": "array",
"items": {
"type": "object",
"properties": {
"dataset": {
"type": "string",
"description": "Dataset name"
},
"metric": {
"type": "string",
"description": "Evaluation metric (e.g., BLEU)"
},
"baseline_score": {
"type": "number",
"description": "Baseline method score"
},
"proposed_score": {
"type": "number",
"description": "Proposed method score"
},
"improvement": {
"type": "number",
"description": "Improvement over baseline"
}
},
"required": [
"dataset",
"metric",
"baseline_score",
"proposed_score",
"improvement"
],
"additionalProperties": false
},
"description": "Performance improvements over baselines"
},
"parameter_reduction": {
"type": "string",
"description": "Description of parameter reduction achieved"
}
},
"required": [
"main_findings",
"performance_improvements",
"parameter_reduction"
],
"additionalProperties": false
},
"related_work": {
"type": "string",
"description": "Summary of related work and prior research"
},
"conclusion": {
"type": "string",
"description": "Conclusion section summarizing contributions and findings"
},
"limitations": {
"type": "string",
"description": "Limitations and challenges discussed in the paper"
},
"acknowledgments": {
"type": "string",
"description": "Acknowledgments section"
},
"references": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of cited references"
}
},
"additionalProperties": false,
"required": [
"title",
"authors",
"abstract",
"introduction",
"methodology",
"experiments",
"results",
"related_work",
"conclusion",
"limitations",
"acknowledgments",
"references"
]
}
```

5. Immediately above the **Source** node, click **Test**.
6. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears. This shows the output from the last node in the workflow.
7. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks).
8. When you are done, click the close (**X**) button above the output on the right side of the screen to return to the workflow designer, where you can continue refining the workflow as you see fit.
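If you plan to hand-edit or reuse a long schema like the one in step 4, a quick local sanity check can catch mistakes before you upload it. The following stdlib-only sketch (an illustration, not part of the Unstructured SDK) verifies that the text parses as JSON and that each object level follows the conventions used above: `additionalProperties` set to `false` and every property listed in `required`:

```python
import json

def check_structured_outputs_schema(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the schema looks OK."""
    problems = []
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    def walk(node, path):
        # Recursively inspect every object-typed schema node.
        if not isinstance(node, dict):
            return
        if node.get("type") == "object":
            if node.get("additionalProperties") is not False:
                problems.append(f"{path}: additionalProperties should be false")
            props = set(node.get("properties", {}))
            required = set(node.get("required", []))
            if props != required:
                problems.append(
                    f"{path}: required {sorted(required)} != properties {sorted(props)}"
                )
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, f"{path}.{key}")
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    walk(item, f"{path}.{key}[{i}]")

    walk(schema, "$")
    return problems
```

Paste your schema text into `check_structured_outputs_schema` before uploading; any returned messages point at the offending object path.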

## Next steps

Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing
Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing
context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.

Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file.
3 changes: 2 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui.mdx
@@ -116,6 +116,7 @@ You can also do the following:

What's next?

- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="code" />&nbsp;&nbsp;[Learn how to extract structured data in a custom format from your local file](/ui/data-extractor#use-the-structured-data-extractor-from-the-start-page).
- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, custom structured data extraction, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="database" />&nbsp;&nbsp;[Learn how to do large-scale batch processing of multiple files and semi-structured data that are stored in remote locations instead](/ui/quickstart#remote-quickstart).
- <Icon icon="desktop" />&nbsp;&nbsp;[Learn more about the Unstructured user interface](/ui/overview).