21 changes: 16 additions & 5 deletions generate_synthetic_table/flow.py
@@ -765,6 +765,7 @@ def build_synthetic_table_graph(
     llm: ChatOpenAI,
     provider: str = "openai",
     qa_only: bool = False,
+    skip_qa: bool = False,
 ) -> StateGraph:
     """
     Assemble the LangGraph pipeline.
@@ -773,6 +774,7 @@ def build_synthetic_table_graph(
         llm: LLM instance
         provider: LLM provider name
         qa_only: If True, generate QA directly from image without synthetic data generation
+        skip_qa: If True, skip QA generation after table generation (table only mode)
     """

     graph = StateGraph(TableState)
@@ -783,7 +785,7 @@ def build_synthetic_table_graph(
         graph.add_edge(START, "generate_qa_from_image")
         graph.add_edge("generate_qa_from_image", END)
     else:
-        # Full pipeline mode
+        # Full pipeline mode (or table-only mode if skip_qa=True)
         graph.add_node("image_to_html", image_to_html_node(llm))
         graph.add_node("pymupdf_parse", pymupdf_parse_node)
         graph.add_node("validate_parsed_table", validate_parsed_table_node(llm))
@@ -795,7 +797,9 @@ def build_synthetic_table_graph(
         graph.add_node("self_reflection", self_reflection_node(llm))
         graph.add_node("revise_synthetic_table", revise_synthetic_table_node(llm))
         graph.add_node("parse_synthetic_table", parse_synthetic_table_node(llm))
-        graph.add_node("generate_qa", generate_qa_node(llm))
+
+        if not skip_qa:
+            graph.add_node("generate_qa", generate_qa_node(llm))

         # Routing based on provider and input type
         def route_start(state: TableState) -> str:
@@ -842,8 +846,13 @@ def route_start(state: TableState) -> str:
         )

         graph.add_edge("revise_synthetic_table", "self_reflection")
-        graph.add_edge("parse_synthetic_table", "generate_qa")
-        graph.add_edge("generate_qa", END)
+
+        # Final edge: skip QA if requested
+        if skip_qa:
+            graph.add_edge("parse_synthetic_table", END)
+        else:
+            graph.add_edge("parse_synthetic_table", "generate_qa")
+            graph.add_edge("generate_qa", END)

     return graph

@@ -914,6 +923,7 @@ def run_synthetic_table_flow(
     azure_deployment: str | None = None,
     azure_endpoint: str | None = None,
     qa_only: bool = False,
+    skip_qa: bool = False,
     image_paths: List[str] | None = None,
     domain: str | None = None,
     # Checkpointing options
@@ -935,6 +945,7 @@ def run_synthetic_table_flow(
         azure_deployment: Azure OpenAI deployment name
         azure_endpoint: Azure OpenAI endpoint URL
         qa_only: If True, skip synthetic data generation and only generate QA from image
+        skip_qa: If True, generate table only without QA generation
         image_paths: Optional list of image paths for multi-image processing
         domain: Optional domain for prompt customization (e.g. 'public')
         enable_checkpointing: Whether to enable checkpointing
@@ -955,7 +966,7 @@ def run_synthetic_table_flow(
         config_path=config_path,
     )

-    graph = build_synthetic_table_graph(llm, provider=provider, qa_only=qa_only)
+    graph = build_synthetic_table_graph(llm, provider=provider, qa_only=qa_only, skip_qa=skip_qa)

     # Checkpointing setup
     if enable_checkpointing:
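
Review note on the `flow.py` change: the new `skip_qa` branch sits inside the `else:` arm of the `qa_only` check, so, as the diff reads, `skip_qa` has no effect when `qa_only=True`. The sketch below only models which terminal edges each flag combination produces. It does not import LangGraph; `plan_tail_edges` and the string stand-ins for `START`/`END` are hypothetical names used for illustration, not part of the repository.

```python
# Stand-ins for LangGraph's START/END sentinels (assumption: plain strings
# are enough to illustrate the wiring; the real code uses langgraph objects).
START, END = "START", "END"


def plan_tail_edges(qa_only: bool = False, skip_qa: bool = False) -> list[tuple[str, str]]:
    """Return the edges that decide where the pipeline terminates."""
    if qa_only:
        # QA-only mode: skip_qa is never consulted on this branch.
        return [(START, "generate_qa_from_image"), ("generate_qa_from_image", END)]
    if skip_qa:
        # Table-only mode: stop right after the synthetic table is parsed.
        return [("parse_synthetic_table", END)]
    # Full pipeline: parse the table, then generate QA, then stop.
    return [("parse_synthetic_table", "generate_qa"), ("generate_qa", END)]


# Print the terminal wiring for every flag combination.
for flags in [(False, False), (False, True), (True, False), (True, True)]:
    print(flags, plan_tail_edges(*flags))
```

If passing both flags is meant to be an error rather than a silent no-op, a guard at the top of `run_synthetic_table_flow` would be one place to put it; the diff itself does not say which behavior is intended.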
82 changes: 62 additions & 20 deletions generate_synthetic_table/prompts/academic.yaml
@@ -94,37 +94,79 @@ generate_qa_from_image: |

 generate_synthetic_table: |
   You are a Synthetic Data Generator specializing in Academic Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided original table but contains entirely new, realistic synthetic academic data.
+
+  **⚠️ CRITICAL INSTRUCTION: DO NOT COPY ORIGINAL DATA ⚠️**
+  Your task is to generate a new HTML table with the SAME STRUCTURE as the original but COMPLETELY DIFFERENT academic data values.

   **Inputs:**
-  1. **Original Table Structure:**
+  1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
   {html}

-  2. **Table Summary:**
+  2. **Table Summary (describes the data patterns to follow):**
   {summary}

   **Requirements:**
-  1. **Structure:** Keep the exact same HTML structure.
-  2. **Data:** Replace ALL cell values with new, synthetic academic data.
-     - Use realistic Korean student names, university names, course titles, and grades.
-     - Contexts: Transcripts, Research Papers, Enrollment Stats, Faculty Lists.
-     - Do NOT use real private data.
-  3. **Consistency:** Ensure mathematical consistency (e.g., sum of credits, correct GPA calculations if visible).
-  4. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, merges, rowspan, colspan).
+  2. **Headers:** Keep header text the same (column names, category labels).
+  3. **⚠️ Data Transformation - ABSOLUTELY MANDATORY ⚠️:**
+     - **ALL data cell values MUST be replaced with completely new synthetic values.**
+     - **NEVER copy any original data values** - generate fresh, realistic alternatives.
+     - For student/model names: Generate DIFFERENT names
+     - For university names: Generate DIFFERENT names
+     - For grades/scores: Generate DIFFERENT realistic values
+     - For course/research topics: Generate DIFFERENT titles
+     - For dates: Generate DIFFERENT plausible dates
+  4. **Styling:** Use **Tailwind CSS** classes (NO inline styles). **Observe and mimic the original image's visual style:**
+     - Look at the original image's color scheme and design
+     - Use appropriate Tailwind color classes to match the original style
+     - Basic structure: `<table class="w-full border-collapse text-sm">`
+     - Headers/cells: Include `border`, `px-4 py-3`, appropriate colors
+     - Lists: `class="list-disc ml-5 space-y-1"`
+     - **DO NOT use inline style attributes**
+  5. **Domain Consistency:** Ensure academic logic (credits sum correctly, GPA valid)
+  6. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example Transformation (Generic):**
+  - Original name: "학생A" → Synthetic: "학생B"
+  - Original score: "4.0" → Synthetic: "3.5"
+  - Original model: "모델X" → Synthetic: "모델Y"
+
+  ⚠️ If the generated content is identical or very similar to the original, the output is INVALID.

 generate_synthetic_table_from_image: |
   You are a Synthetic Data Generator specializing in Academic Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided image but contains entirely new, realistic synthetic academic data.
+
+  **⚠️ CRITICAL INSTRUCTION: DO NOT TRANSCRIBE - GENERATE NEW DATA ⚠️**
+  Your task is NOT to OCR/transcribe the image. Instead, you must:
+  1. Understand the table's STRUCTURE from the image
+  2. Understand it's an ACADEMIC table
+  3. Generate COMPLETELY NEW synthetic academic data that fits the domain but uses ENTIRELY DIFFERENT values

   **Inputs:**
-  1. **Image:** An image of an academic table.
+  1. **Image:** An image of an academic table. Use this to understand structure and domain ONLY.

   **Requirements:**
-  1. **Structure Preservation:** Accurately reconstruct the table structure.
-  2. **Data Generation:** Replace ALL cell values with new, synthetic academic data.
-     - Use realistic Korean student names, course titles, grades, research topics.
-  3. **Styling:** Use **Tailwind CSS** classes (same as default).
-     - `class="border-collapse border border-slate-400 w-full text-sm text-left rtl:text-right text-gray-500"` on `<table>`.
-     - `class="border border-slate-300 p-2 bg-gray-50 font-semibold"` on `<th>`.
-     - `class="border border-slate-300 p-2"` on `<td>`.
-  4. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure Preservation:** Accurately reconstruct the table structure, including rowspan/colspan.
+  2. **Headers:** Keep header text the same as in the image.
+  3. **⚠️ Data Generation - ABSOLUTELY CRITICAL ⚠️:**
+     - **NEVER copy the data values from the image** - this is NOT an OCR task
+     - **ALL cell content must be completely NEW and DIFFERENT**
+     - For student/model names: Generate DIFFERENT names
+     - For grades/scores: Generate DIFFERENT values
+     - For course/research topics: Generate DIFFERENT titles
+  4. **Styling:** Use **Tailwind CSS** classes exclusively (NO inline styles).
+     - `<table>`: `class="w-full border-collapse text-sm"`
+     - `<thead>`: `class="bg-gradient-to-r from-indigo-700 to-indigo-800 text-white"`
+     - `<th>`: `class="border border-indigo-300 px-4 py-3 font-semibold text-left"`
+     - `<tbody>`: `class="divide-y divide-slate-200"`
+     - `<tr>` (body rows): `class="hover:bg-indigo-50 transition-colors"`
+     - `<td>`: `class="border border-slate-200 px-4 py-3 text-slate-700"`
+     - `<ul>`: `class="list-disc ml-5 space-y-1 text-slate-600"`
+     - **DO NOT use inline style attributes**
+  5. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example (Generic):**
+  - Name in image: "이름X" → Generate: "이름Y"
+  - Score in image: "점수A" → Generate: "점수B"
+
+  ⚠️ If the generated content is identical or very similar to the image, the output is INVALID.
93 changes: 73 additions & 20 deletions generate_synthetic_table/prompts/business.yaml
@@ -94,37 +94,90 @@ generate_qa_from_image: |

 generate_synthetic_table: |
   You are a Synthetic Data Generator specializing in Business Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided original table but contains entirely new, realistic synthetic business data.
+
+  **⚠️ CRITICAL INSTRUCTION: DO NOT COPY ORIGINAL DATA ⚠️**
+  Your task is to generate a new HTML table with the SAME STRUCTURE as the original but COMPLETELY DIFFERENT business data values.
+  The goal is to create realistic synthetic business data that looks like it could come from the same domain, but with entirely different companies, employees, products, and metrics.

   **Inputs:**
-  1. **Original Table Structure:**
+  1. **Original Table Structure (for structure reference ONLY - DO NOT copy the data values):**
   {html}

-  2. **Table Summary:**
+  2. **Table Summary (describes the data patterns to follow):**
   {summary}

   **Requirements:**
-  1. **Structure:** Keep the exact same HTML structure.
-  2. **Data:** Replace ALL cell values with new, synthetic business data.
-     - Use realistic Korean company names, department names, product lines, and financial metrics.
-     - Contexts: Sales Reports, Inventory, HR Employee Lists, Marketing Campaigns.
-     - Do NOT use real private data.
-  3. **Consistency:** Ensure mathematical consistency (e.g., Q1 + Q2 + Q3 + Q4 = Total).
-  4. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure:** Keep the exact same HTML structure (rows, columns, headers, merges, rowspan, colspan).
+  2. **Headers:** Keep header text the same (column names, category labels like 기업경쟁력, 시장경쟁력).
+  3. **⚠️ Data Transformation - ABSOLUTELY MANDATORY ⚠️:**
+     - **ALL data cell values MUST be replaced with completely new synthetic values.**
+     - **NEVER copy any original data values** - generate fresh, realistic alternatives.
+     - For company/team names: Generate DIFFERENT names (e.g., "A팀" → "B팀")
+     - For employee names: Generate DIFFERENT Korean names (e.g., "김OO" → "박OO")
+     - For business metrics: Generate DIFFERENT numbers (e.g., "100억" → "150억")
+     - For strategy/description text: Write DIFFERENT content with similar structure
+     - For bullet point items: Create DIFFERENT but domain-appropriate content
+  4. **Styling:** Use **Tailwind CSS** classes (NO inline styles). **Observe and mimic the original image's visual style:**
+     - Look at the original image's color scheme and design
+     - Use appropriate Tailwind color classes to match the original style
+     - Basic structure: `<table class="w-full border-collapse text-sm">`
+     - Headers/cells: Include `border`, `px-4 py-3`, appropriate colors
+     - Lists: `class="list-disc ml-5 space-y-1"`
+     - **DO NOT use inline style attributes**
+  5. **Domain Consistency:**
+     - Ensure business logic (Q1+Q2+Q3+Q4=Total, percentages add up)
+     - Use realistic Korean business terminology
+     - Contexts: Sales Reports, Inventory, HR Employee Lists, Marketing Campaigns
+  6. **Output:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example Transformation (Generic):**
+  - Original name: "A팀" → Synthetic: "B팀"
+  - Original amount: "5억원" → Synthetic: "7.3억원"
+  - Original description: "신규 사업 추진" → Synthetic: "해외 시장 진출"
+
+  ⚠️ If the generated content is identical or very similar to the original, the output is INVALID.
+  Remember: The synthetic table should look like a completely different business dataset from the same domain.

 generate_synthetic_table_from_image: |
   You are a Synthetic Data Generator specializing in Business Data.
-  Your task is to generate a new HTML table that mirrors the structure of the provided image but contains entirely new, realistic synthetic business data.
+
+  **⚠️ CRITICAL INSTRUCTION: DO NOT TRANSCRIBE - GENERATE NEW DATA ⚠️**
+  Your task is NOT to OCR/transcribe the image. Instead, you must:
+  1. Understand the table's STRUCTURE from the image (rows, columns, merged cells, nested structures)
+  2. Understand it's a BUSINESS table (기업경쟁력, 시장경쟁력, 매출, 실적 등)
+  3. Generate COMPLETELY NEW synthetic business data that fits the domain but uses ENTIRELY DIFFERENT values

   **Inputs:**
-  1. **Image:** An image of a business table.
+  1. **Image:** An image of a business table. Use this to understand structure and domain ONLY.

   **Requirements:**
-  1. **Structure Preservation:** Accurately reconstruct the table structure.
-  2. **Data Generation:** Replace ALL cell values with new, synthetic business data.
-     - Use realistic Korean company names, products, sales figures.
-  3. **Styling:** Use **Tailwind CSS** classes (same as default).
-     - `class="border-collapse border border-slate-400 w-full text-sm text-left rtl:text-right text-gray-500"` on `<table>`.
-     - `class="border border-slate-300 p-2 bg-gray-50 font-semibold"` on `<th>`.
-     - `class="border border-slate-300 p-2"` on `<td>`.
-  4. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`.
+  1. **Structure Preservation:** Accurately reconstruct the table structure, including `rowspan` and `colspan` for merged cells.
+  2. **Headers:** Keep header text (column names, category labels like 기업경쟁력, 차별화 요소) the same as in the image.
+  3. **⚠️ Data Generation - ABSOLUTELY CRITICAL ⚠️:**
+     - **NEVER copy the data values from the image** - this is NOT an OCR task
+     - **ALL cell content must be completely NEW and DIFFERENT from the original**
+     - Generate COMPLETELY NEW synthetic business values for all data cells:
+       * For company/team names: Generate DIFFERENT names (e.g., "A팀" → "B팀")
+       * For business metrics: Generate DIFFERENT numbers (e.g., "100억" → "150억")
+       * For strategy/description text: Write DIFFERENT content with similar structure
+       * For bullet point items: Create DIFFERENT but domain-appropriate items
+       * For employee names: Generate DIFFERENT Korean names (e.g., "김OO" → "박OO")
+     - The synthetic table should look like a COMPLETELY DIFFERENT business report from the same industry
+  4. **Styling:** Use **Tailwind CSS** classes (NO inline styles). **Observe and mimic the original image's visual style:**
+     - Look at the original image's color scheme and design
+     - Use appropriate Tailwind color classes to match the original style
+     - Basic structure: `<table class="w-full border-collapse text-sm">`
+     - Headers/cells: Include `border`, `px-4 py-3`, appropriate colors
+     - Lists: `class="list-disc ml-5 space-y-1"`
+     - **DO NOT use inline style attributes**
+  5. **Output Format:** Return ONLY the raw HTML string starting with `<table>` and ending with `</table>`. No markdown code blocks.
+
+  **Example of Expected Behavior (Generic):**
+  If the image shows a business table with:
+  - Team name: "영업팀" → Generate different: "마케팅팀"
+  - Revenue: "10억원" → Generate different: "15억원"
+  - Strategy: "시장 확대" → Generate different: "신규 진출"
+  - Bullet point items → Generate completely different items
+
+  ⚠️ If the generated content is identical or very similar to the image, the output is INVALID.
+  Remember: The output should be a new synthetic business dataset, not a transcription of the original.
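
Review note on the prompt changes: the rewritten prompts state three machine-checkable rules (output is a bare `<table>…</table>` string with no markdown fence, no inline `style` attributes, no data values copied from the original). The sketch below shows one way those rules could be checked after generation, using only the standard library. `audit_synthetic_table` and `_TableAudit` are hypothetical helpers, not part of this PR, and the "copied" check is a simple exact-match on `<td>` text, which is an assumption about what "identical" should mean. Note that the prompts ask for output starting with `<table>` while also requiring a `class` attribute on that tag, so the check accepts any `<table` prefix.

```python
from html.parser import HTMLParser


class _TableAudit(HTMLParser):
    """Collects tags that carry an inline style and the text of each <td>."""

    def __init__(self) -> None:
        super().__init__()
        self.inline_style_tags: list[str] = []
        self.cells: list[str] = []
        self._in_td = False
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if any(name == "style" for name, _ in attrs):
            self.inline_style_tags.append(tag)
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


def audit_synthetic_table(output: str, original_cells: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means no issue found."""
    problems: list[str] = []
    text = output.strip()
    # Accept "<table ...>" because the prompts also require Tailwind classes.
    if not (text.startswith("<table") and text.endswith("</table>")):
        problems.append("output is not a bare <table>...</table> string")
    audit = _TableAudit()
    audit.feed(text)
    if audit.inline_style_tags:
        problems.append(f"inline style on: {sorted(set(audit.inline_style_tags))}")
    copied = sorted(c for c in audit.cells if c and c in original_cells)
    if copied:
        problems.append(f"cells copied from the original: {copied}")
    return problems


good = '<table class="w-full border-collapse text-sm"><tr><td class="border px-4 py-3">B팀</td></tr></table>'
bad = '```html\n<table><tr><td style="color:red">A팀</td></tr></table>\n```'
print(audit_synthetic_table(good, {"A팀"}))
print(audit_synthetic_table(bad, {"A팀"}))
```

A check like this would only catch verbatim copies; the prompts' "very similar" condition would need a fuzzier comparison, which is left out here.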