75 changes: 75 additions & 0 deletions api-reference/workflow/workflows.mdx
@@ -2232,6 +2232,81 @@ Allowed values for `subtype` and `model_name` include the following:
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-multimodal-3"`

### Extract node

An **Extract** node has a `type` of `structured_data_extractor` and a `subtype` of `llm`.

<AccordionGroup>
<Accordion title="Python SDK">
```python
extractor_workflow_node = WorkflowNode(
name="Extractor",
subtype="llm",
type="structured_data_extractor",
settings={
"output_mode": "<prepend-or-overwrite>",
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
)
```
</Accordion>
<Accordion title="curl, Postman">
```json
{
"name": "Extractor",
"type": "structured_data_extractor",
"subtype": "llm",
"settings": {
"output_mode": "<prepend-or-overwrite>",
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
}
```
</Accordion>
</AccordionGroup>

Fields for `settings` include:

- `output_mode`: _Optional_. The mode in which to output the extracted data. Allowed values include `prepend` (the default if not otherwise specified) and `overwrite`:

- `prepend`: Prepend the extracted data according to the schema specified in `schema_to_extract` to the data that is partitioned according to the default Unstructured document elements format.
- `overwrite`: Output only the extracted data according to the schema specified in `schema_to_extract`.

- `schema_to_extract`: _Required_. The schema or guidance for the structured data that you want to extract. One (and only one) of the following must also be specified:

- `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string.
- `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.

- Allowed values for `provider` and `model` include the following:

- `"provider": "anthropic"`

- `"model": "..."`

- `"provider": "azure_openai"`

- `"model": "..."`

- `"provider": "bedrock"`

- `"model": "..."`

- `"provider": "openai"`

- `"model": "..."`

[Learn more](/ui/data-extractor).
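To make the placeholders concrete, the following sketch builds a `settings` payload using an invented single-field schema and a hypothetical model identifier (both are illustrations only, not values confirmed by this reference). Note that `json_schema` must be a single string, so the schema is serialized with `json.dumps`:

```python
import json

# A minimal, hypothetical extraction schema in OpenAI Structured Outputs format.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "The invoice's unique identifier"
        }
    },
    "required": ["invoice_number"],
    "additionalProperties": False
}

settings = {
    "output_mode": "prepend",  # or "overwrite"
    "schema_to_extract": {
        # Specify json_schema or extraction_guidance, but not both.
        "json_schema": json.dumps(schema)
    },
    "provider": "anthropic",
    "model": "<model>"  # replace with a model identifier from the list above
}

print(json.dumps(settings, indent=2))
```

The same dictionary can be passed as the `settings` argument to `WorkflowNode` (Python SDK) or sent as the `settings` object in the JSON request body (`curl`, Postman).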

## List templates

To list templates, use the `UnstructuredClient` object's `templates.list_templates` function (for the Python SDK) or the `GET` method to call the `/templates` endpoint (for `curl` or Postman).
1 change: 1 addition & 0 deletions docs.json
@@ -120,6 +120,7 @@
"pages": [
"ui/document-elements",
"ui/partitioning",
"ui/data-extractor",
"ui/chunking",
{
"group": "Enriching",
Binary file added img/ui/data-extractor/house-plant-care.png
Binary file added img/ui/data-extractor/medical-invoice.png
Binary file added img/ui/data-extractor/real-estate-listing.png
Binary file added img/ui/data-extractor/schema-builder.png
Binary file not shown.
257 changes: 256 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
@@ -465,9 +465,264 @@ embedding model that is provided by an embedding provider. For the best embeddin
6. When you are done, click the close (**X**) button above the output on the right side of the screen to return to the workflow designer, where you can continue refining the workflow as you see fit.

## Step 7: Experiment with structured data extraction

In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process by which Unstructured
automatically extracts data from your source documents into a format that you define up front. For example, in addition to Unstructured
partitioning your source documents into elements with types such as `NarrativeText` and `UncategorizedText`, you can have Unstructured
output key information from the source documents in a custom structured data format, appearing within a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, and `email`.

1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.

![Adding an extract node](/img/ui/walkthrough/AddExtract.png)

2. On the **Details** tab of the node's settings pane, under **Provider**, select **Anthropic**. Under **Model**, select **Claude Sonnet 4.5**. This is the model that Unstructured will use to do the structured data extraction.

<Note>
The list of available models for structured data extraction is constantly being updated. Your list might also be different, depending on your Unstructured
account type. If **Anthropic** and **Claude Sonnet 4.5** are not available, choose another available provider and model from the lists.

If you have an Unstructured **Business** account and want to add more models to this list, contact your
Unstructured account administrator or Unstructured sales representative, or email Unstructured Support at
[support@unstructured.io](mailto:support@unstructured.io).
</Note>

3. Click **Upload JSON**.
4. In the **JSON Schema** box, enter the following JSON schema, and then click **Use this Schema**:

```json
{
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Full title of the research paper"
},
"authors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Author's full name"
},
"affiliation": {
"type": "string",
"description": "Author's institutional affiliation"
},
"email": {
"type": "string",
"description": "Author's email address"
}
},
"required": [
"name",
"affiliation",
"email"
],
"additionalProperties": false
},
"description": "List of paper authors with their affiliations"
},
"abstract": {
"type": "string",
"description": "Paper abstract summarizing the research"
},
"introduction": {
"type": "string",
"description": "Introduction section describing the problem and motivation"
},
"methodology": {
"type": "object",
"properties": {
"approach_name": {
"type": "string",
"description": "Name of the proposed method (e.g., StrokeNet)"
},
"description": {
"type": "string",
"description": "Detailed description of the methodology"
},
"key_techniques": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of key techniques used in the approach"
}
},
"required": [
"approach_name",
"description",
"key_techniques"
],
"additionalProperties": false
},
"experiments": {
"type": "object",
"properties": {
"datasets": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Dataset name"
},
"description": {
"type": "string",
"description": "Dataset description"
},
"size": {
"type": "string",
"description": "Dataset size (e.g., number of sentence pairs)"
}
},
"required": [
"name",
"description",
"size"
],
"additionalProperties": false
},
"description": "Datasets used for evaluation"
},
"baselines": {
"type": "array",
"items": {
"type": "string"
},
"description": "Baseline methods compared against"
},
"evaluation_metrics": {
"type": "array",
"items": {
"type": "string"
},
"description": "Metrics used for evaluation"
},
"experimental_setup": {
"type": "string",
"description": "Description of experimental configuration and hyperparameters"
}
},
"required": [
"datasets",
"baselines",
"evaluation_metrics",
"experimental_setup"
],
"additionalProperties": false
},
"results": {
"type": "object",
"properties": {
"main_findings": {
"type": "string",
"description": "Summary of main experimental findings"
},
"performance_improvements": {
"type": "array",
"items": {
"type": "object",
"properties": {
"dataset": {
"type": "string",
"description": "Dataset name"
},
"metric": {
"type": "string",
"description": "Evaluation metric (e.g., BLEU)"
},
"baseline_score": {
"type": "number",
"description": "Baseline method score"
},
"proposed_score": {
"type": "number",
"description": "Proposed method score"
},
"improvement": {
"type": "number",
"description": "Improvement over baseline"
}
},
"required": [
"dataset",
"metric",
"baseline_score",
"proposed_score",
"improvement"
],
"additionalProperties": false
},
"description": "Performance improvements over baselines"
},
"parameter_reduction": {
"type": "string",
"description": "Description of parameter reduction achieved"
}
},
"required": [
"main_findings",
"performance_improvements",
"parameter_reduction"
],
"additionalProperties": false
},
"related_work": {
"type": "string",
"description": "Summary of related work and prior research"
},
"conclusion": {
"type": "string",
"description": "Conclusion section summarizing contributions and findings"
},
"limitations": {
"type": "string",
"description": "Limitations and challenges discussed in the paper"
},
"acknowledgments": {
"type": "string",
"description": "Acknowledgments section"
},
"references": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of cited references"
}
},
"additionalProperties": false,
"required": [
"title",
"authors",
"abstract",
"introduction",
"methodology",
"experiments",
"results",
"related_work",
"conclusion",
"limitations",
"acknowledgments",
"references"
]
}
```

5. Immediately above the **Source** node, click **Test**.
6. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears. This shows the output from the last node in the workflow.
7. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks).
8. When you are done, click the close (**X**) button above the output on the right side of the screen to return to the workflow designer, where you can continue refining the workflow as you see fit.
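If you plan to hand-edit or reuse a long schema like the one in step 4, a quick local sanity check can catch mistakes before you upload it. The following stdlib-only sketch (an illustration, not part of the Unstructured SDK) verifies that the text parses as JSON and that each object level follows the conventions used above: `additionalProperties` set to `false` and every property listed in `required`:

```python
import json

def check_structured_outputs_schema(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the schema looks OK."""
    problems = []
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    def walk(node, path):
        # Recursively inspect every object-typed schema node.
        if not isinstance(node, dict):
            return
        if node.get("type") == "object":
            if node.get("additionalProperties") is not False:
                problems.append(f"{path}: additionalProperties should be false")
            props = set(node.get("properties", {}))
            required = set(node.get("required", []))
            if props != required:
                problems.append(
                    f"{path}: required {sorted(required)} != properties {sorted(props)}"
                )
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, f"{path}.{key}")
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    walk(item, f"{path}.{key}[{i}]")

    walk(schema, "$")
    return problems
```

Paste your schema text into `check_structured_outputs_schema` before uploading; any returned messages point at the offending object path.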

## Next steps

Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing
Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing
context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.

Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file.
3 changes: 2 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui.mdx
@@ -116,6 +116,7 @@ You can also do the following:

What's next?

- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="code" />&nbsp;&nbsp;[Learn how to extract structured data in a custom format from your local file](/ui/data-extractor#use-the-structured-data-extractor-from-the-start-page).
- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, custom structured data extraction, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="database" />&nbsp;&nbsp;[Learn how to do large-scale batch processing of multiple files and semi-structured data that are stored in remote locations instead](/ui/quickstart#remote-quickstart).
- <Icon icon="desktop" />&nbsp;&nbsp;[Learn more about the Unstructured user interface](/ui/overview).