Commit 13d868a
GML-2049 chunker updates (#29)
### **PR Type**
Enhancement, Bug fix, Tests
___
### **Description**
- Enable apiToken for TigerGraph connections
- Skip getToken when token provided
- Add unit tests for apiToken
- Bound HTML/Markdown chunks recursively (4096)
- Default fallback size and overlap support
- Fix PDF image paths, form artifacts
- Handle spaces; deduplicate table rows
- Route graph stats to function calls
- Update provider prompts for counts
___
### Diagram Walkthrough
```mermaid
flowchart LR
CHUNK["Chunkers updated (defaults, recursive)"]
HTML["HTML chunker\nfallback+recursive"]
MD["Markdown chunker\nfallback+recursive"]
CHAR["Character chunker\n4096 fallback"]
RECUR["Recursive chunker\n4096 fallback"]
CONN["DB connections\napiToken support"]
CFG["Config init\napiToken passthrough"]
PDF["PDF extractor\nimage+markdown fixes"]
PROMPT["Routing prompts\nGraph stats -> functions"]
LOAD["Loader\nconfigurable batch/delay"]
DOCKER["Compose\nTG service optional"]
CHUNK -- "applies to" --> HTML
CHUNK -- "applies to" --> MD
CHUNK -- "applies to" --> CHAR
CHUNK -- "applies to" --> RECUR
CFG -- "used by" --> CONN
CONN -- "unit tests" --> PROMPT
PDF -- "clean images/markdown" --> CHUNK
PROMPT -- "provider prompts updated" --> CHUNK
LOAD -- "tunable throughput" --> CFG
DOCKER -- "external TG supported" --> CONN
```
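
The "fallback+recursive" behaviour shown for the HTML and Markdown chunkers can be illustrated with a small sketch. It assumes that header sections within the size limit pass through unchanged and that oversized ones are split on progressively finer separators; `split_recursive`, `bound_sections`, and the separator list are hypothetical names chosen for illustration, not the functions in this PR. Only the 4096 fallback comes from the description.

```python
from typing import List, Sequence

DEFAULT_CHUNK_SIZE = 4096  # fallback size named in the PR description


def split_recursive(
    text: str,
    chunk_size: int = DEFAULT_CHUNK_SIZE,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text into pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        finer = separators[i + 1:]
        chunks: List[str] = []
        current = ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
            if len(part) <= chunk_size:
                current = part
            else:
                # The part alone is too large: recurse with finer separators.
                chunks.extend(split_recursive(part, chunk_size, finer))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def bound_sections(sections: Sequence[str], chunk_size: int = DEFAULT_CHUNK_SIZE) -> List[str]:
    """Pass small sections through; recursively split oversized ones."""
    bounded: List[str] = []
    for section in sections:
        bounded.extend(split_recursive(section, chunk_size))
    return bounded


sections = ["short", "x" * 25, "alpha beta gamma delta epsilon"]
result = bound_sections(sections, chunk_size=10)
```

Every chunk in `result` is at most 10 characters long, whichever separator level produced it.
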
<details> <summary><h3> File Walkthrough</h3></summary>
<table><thead><tr><th></th><th align="left">Relevant
files</th></tr></thead><tbody><tr><td><strong>Enhancement</strong></td><td><details><summary>8
files</summary><table>
<tr>
<td><strong>character_chunker.py</strong><dd><code>Default to 4096 and
validate overlaps</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-086cb1310ad96c42ae62b4fde5d4878bb5553f711b6d051005450b85a17492cb">+6/-6</a>
</td>
</tr>
<tr>
<td><strong>html_chunker.py</strong><dd><code>Recursive split for
oversized header sections</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-d99da1157b1f0eea3c23bf54cfe0d42cba987287b112c15f4b35b16e2e498ac1">+31/-3</a>
</td>
</tr>
<tr>
<td><strong>markdown_chunker.py</strong><dd><code>Fallback size and
recursive markdown splitting</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-c42407e62189ab854f9c10b7ea1b0b16701f2188daca41e36d3ce569e756984a">+20/-13</a>
</td>
</tr>
<tr>
<td><strong>recursive_chunker.py</strong><dd><code>Default recursive
chunk size set to 4096</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-6afa5baf1bf76a0ba886adfd895408a1c772cbc24235a9fcbdbb1bae8cac69b5">+4/-2</a>
</td>
</tr>
<tr>
<td><strong>config.py</strong><dd><code>Support static apiToken and
conditional getToken</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-1bacff878451e5aa9c6d164150c7b2daad028d5e7acba90bb720cb73ffdd827b">+2/-1</a>
</td>
</tr>
<tr>
<td><strong>connections.py</strong><dd><code>Use apiToken directly; skip
getToken; async support</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-2c15601c7002076bf82559499eeb3f746145bb1433c7883f1c77e61b24a50d20">+29/-1</a>
</td>
</tr>
<tr>
<td><strong>base_llm.py</strong><dd><code>Route graph statistics
questions to function calls</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-9eb1335890737196b08ce1158f1de3ff08db71d02a613fbc9b967c347a0aa36d">+8/-2</a>
</td>
</tr>
<tr>
<td><strong>supportai_ingest.py</strong><dd><code>Pass chunk
size/overlap to HTML chunker</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-b4a80de039d3bcbf47c3bb7705354de123426efc4abd3d1f0e93570721c0f820">+3/-1</a>
</td>
</tr>
</table></details></td></tr><tr><td><strong>Bug
fix</strong></td><td><details><summary>1 file</summary><table>
<tr>
<td><strong>text_extractors.py</strong><dd><code>Fix image paths; clean
PDF markdown artifacts</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-c749a2c8ea1a8bc0a734203aef7fa5aa9300d705006afbb4cac26985c2ac257d">+99/-6</a>
</td>
</tr>
</table></details></td></tr><tr><td><strong>Configuration
changes</strong></td><td><details><summary>3 files</summary><table>
<tr>
<td><strong>ecc_util.py</strong><dd><code>Update chunker defaults and
pass new parameters</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-890bb6f3c6fbe84bfda83faf66d59a1f8058f9760e9e2ee4cac1c388a90f276f">+5/-3</a>
</td>
</tr>
<tr>
<td><strong>graph_rag.py</strong><dd><code>Configurable batch size and
optional upsert delay</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-55a1e5b20c75c4a71f03a1541658d1a4de6567501d51550a1464f49901cb626a">+7/-5</a>
</td>
</tr>
<tr>
<td><strong>docker-compose.yml</strong><dd><code>Comment out TigerGraph
service; externalize dependency</code> </dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3">+12/-12</a>
</td>
</tr>
</table></details></td></tr><tr><td><strong>Tests</strong></td><td><details><summary>1
file</summary><table>
<tr>
<td><strong>test_connections.py</strong><dd><code>Add unit tests for
apiToken connection handling</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-8a4af273a44b613bebaa3e29ef95b20f56d221885b6a3332223d5b4d2203880e">+117/-0</a>
</td>
</tr>
</table></details></td></tr><tr><td><strong>Documentation</strong></td><td><details><summary>7
files</summary><table>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-2d8cd2c2831fbe9bf617715e1f3283566c7e35b6cb8c76337fbe1fca93d234dc">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-7a05cd838d0fbc74cd095d6300d38aa945902e8571917f14545fa433b1ba2f6f">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-ebb5cd586870142fc16974ce4dec23ec858ec49cec6a599a301df699d4a91cf5">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-c2b1f009ab574e291d9fb2f989cc75b7eb0ea287ce87ca72f36ee863d36e3786">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-cdc5a67e8b6a4d3e4810329f036c2f08b062094f26d73c0133f6cfc504a45efb">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-ed98f91d4d2bb1d67ac44004e04e536d3e38b5c0c3c844887d95824c8aa13c0b">+1/-1</a>
</td>
</tr>
<tr>
<td><strong>generate_function.txt</strong><dd><code>Clarify count
queries route to Count functions</code>
</dd></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-6bc117f2b29afcf62aa14d094d48537e9ffeacd6ab2b4a8b1cd9251be736f7f1">+1/-1</a>
</td>
</tr>
</table></details></td></tr><tr><td><strong>Additional
files</strong></td><td><details><summary>1 file</summary><table>
<tr>
<td><strong>generate_function.txt</strong></td>
<td><a
href="https://github.com/tigergraph/graphrag/pull/29/files#diff-85d6281f16d53975597ec7e01ddfc9895652f050e631c62392b70d4f8defd794">+1/-1</a>
</td>
</tr>
</table></details></td></tr></tbody></table>
</details>
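
The `character_chunker.py` entry above ("Default to 4096 and validate overlaps") can be read as something like the following sketch. It assumes that a missing or non-positive size falls back to 4096 and that the overlap must be smaller than the chunk size; `chunk_characters` and its parameter names are hypothetical, not the signature used in the repository.

```python
from typing import List

DEFAULT_CHUNK_SIZE = 4096  # fallback size named in the PR description


def chunk_characters(text: str, chunk_size: int = 0, overlap: int = 0) -> List[str]:
    """Fixed-size character chunks where consecutive chunks share `overlap` chars."""
    if chunk_size <= 0:
        chunk_size = DEFAULT_CHUNK_SIZE  # assumed fallback behaviour
    if overlap < 0 or overlap >= chunk_size:
        # An overlap >= chunk_size would make the window stall or move backwards.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = chunk_characters("abcdefghij", chunk_size=4, overlap=1)
```

Here `pieces` is `["abcd", "defg", "ghij"]`: each chunk starts with the last character of the previous one.
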
___
21 files changed
Lines changed: 351 additions & 63 deletions
File tree
- common
- chunkers
- db
- llm_services
- prompts
- aws_bedrock_claude3haiku
- aws_bedrock_titan
- azure_open_ai_gpt35_turbo_instruct
- custom/aml
- gcp_vertexai_palm
- google_gemini
- llama_70b
- openai_gpt4
- utils
- ecc/app
- graphrag
- graphrag
- app/supportai
- tests