
feat: add 10 government data sources from Langfuse user demand analysis #43

Open
firstdata-dev wants to merge 4 commits into MLT-OSS:main from firstdata-dev:feat/batch-datasources-langfuse

@firstdata-dev
Collaborator

Background

These 10 data sources were identified from Langfuse user query analysis — real users searched for data in these areas but got no results. This PR directly addresses the "data source coverage gap" identified in the MCP quality evaluation.

New Data Sources (all government/international)

| # | ID | Name | Country | Authority | API |
|---|---|---|---|---|---|
| 1 | china-gacc | General Administration of Customs (海关总署) | CN | government | |
| 2 | china-mct | Ministry of Culture and Tourism (文化和旅游部) | CN | government | |
| 3 | india-meity | Ministry of Electronics & IT | IN | government | |
| 4 | indonesia-bps | Statistics Indonesia | ID | government | |
| 5 | asean-stats | ASEAN Statistics | International | international | |
| 6 | japan-customs | Japan Customs Trade Statistics | JP | government | |
| 7 | mexico-inegi | INEGI Mexico | MX | government | |
| 8 | russia-rosstat | Rosstat | RU | government | |
| 9 | singapore-dos | Dept of Statistics Singapore | SG | government | |
| 10 | thailand-nso | National Statistical Office | TH | government | |

Validation

  • make check passed (220 unique IDs, schema compliant)
  • ⚠️ 3 websites returned connection timeout from server (vietnam-gso, rosstat, china-gacc) — likely geo-blocking, accessible via browser
  • 5 already existed in repo under different paths (removed duplicates)
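
For readers unfamiliar with this kind of check, here is a minimal sketch of how duplicate IDs could be detected across data source records. It is not the repository's actual `make check` implementation; the function name `duplicate_ids` and the sample list are hypothetical.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that occur more than once, sorted alphabetically."""
    counts = Counter(ids)
    return sorted(source_id for source_id, n in counts.items() if n > 1)


# Hypothetical sample: one ID was added twice under different paths.
ids = ["china-gacc", "japan-customs", "china-gacc", "asean-stats"]
print(duplicate_ids(ids))
```

An empty result would correspond to the "unique IDs" condition reported above.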

Coverage Impact

Data sources: 210 → 220 (+10)

New data sources (all government/international authority):
- china-gacc: General Administration of Customs of China
- china-mct: Ministry of Culture and Tourism of China
- india-meity: Ministry of Electronics and IT of India
- indonesia-bps: Statistics Indonesia (BPS)
- asean-stats: ASEAN Statistics
- japan-customs: Japan Customs Trade Statistics
- mexico-inegi: INEGI Mexico
- russia-rosstat: Federal State Statistics Service of Russia
- singapore-dos: Department of Statistics Singapore
- thailand-nso: National Statistical Office of Thailand

Sources identified from Langfuse user query analysis — real user demand.
Contributor

@mingcha-dev mingcha-dev left a comment

🔍 明察 (Mingcha) QA Review: PR #43

This PR adds 10 data sources in one batch, based on the Langfuse user demand analysis. The direction is good 👍

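Schema compliance issues

1. Non-standard fields ⚠️

All 10 files use the access_level and has_api fields, neither of which is defined in datasource-schema.json:

  • access_level: "free" (the schema has no such field)
  • has_api: true/false (the schema has no such field; the existing api_url already conveys the same information)

I suggest removing both fields, or updating the schema first.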
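
A minimal sketch of how such non-standard fields could be flagged automatically. It assumes the schema lists its allowed top-level fields under a JSON Schema style `properties` key; the trimmed schema and record below are hypothetical and do not reproduce the real datasource-schema.json.

```python
def unknown_fields(record: dict, schema: dict) -> list:
    """Return top-level keys of `record` that the schema does not declare."""
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in record if key not in allowed)


# Hypothetical, heavily trimmed schema and record, for illustration only.
schema = {"properties": {"id": {}, "name": {}, "api_url": {}}}
record = {
    "id": "china-gacc",
    "name": {},
    "access_level": "free",
    "has_api": False,
}

print(unknown_fields(record, schema))
```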

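2. Missing required/commonly used fields ⚠️

Compared with earlier data sources, the files in this batch are noticeably simplified:

  • geographic_scope is missing; all earlier data sources have it
  • data_content is missing; all earlier data sources describe their data content in both Chinese and English
  • api_url is missing in some files (it should be declared explicitly even when the value is null)
  • asean-stats is missing geographic_scope (it should be regional)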
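
The distinction between an absent field and one explicitly set to null can be sketched as follows. The field list is taken from the bullets above; the helper name `missing_fields` and the sample record are hypothetical.

```python
EXPECTED_FIELDS = ("geographic_scope", "data_content", "api_url")


def missing_fields(record: dict, expected=EXPECTED_FIELDS) -> list:
    """Return expected fields that are absent from the record.

    A key that is present with a null (None) value counts as declared.
    """
    return [field for field in expected if field not in record]


# Hypothetical record: api_url is declared as null, the other two are absent.
record = {"id": "asean-stats", "api_url": None}
print(missing_fields(record))
```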

3. Inconsistent tag casing ⚠️

  • HS-code, GDP, IT-policy, PLI-scheme, ASEAN: the #39 review already asked for tags to be uniformly lowercase
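
A minimal sketch of lowercasing tags, assuming tags are stored as a plain list of strings. The helper name `normalize_tags` is hypothetical, and dropping duplicates that only differ by case is an added assumption, not something the review asks for.

```python
def normalize_tags(tags):
    """Lowercase and trim tags, keeping the first occurrence of each."""
    seen = set()
    result = []
    for tag in tags:
        lowered = tag.strip().lower()
        if lowered and lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result


print(normalize_tags(["HS-code", "GDP", "IT-policy", "PLI-scheme", "ASEAN"]))
```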

4. Domain issues ⚠️

  • tourism, culture, manufacturing, labor are not in the standard enum
  • I suggest checking them against the standardization scheme in Issue #38
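
A sketch of how domains could be checked against an enum. The allowed set below is a placeholder, not the real list: the actual values would have to come from the schema or from Issue #38, and nothing here confirms which values that list contains.

```python
# Placeholder enum for illustration only; NOT the project's real domain list.
ALLOWED_DOMAINS = frozenset({"trade", "economics"})


def invalid_domains(domains, allowed=ALLOWED_DOMAINS):
    """Return the domains that are not in the allowed set, in input order."""
    return [domain for domain in domains if domain not in allowed]


print(invalid_domains(["trade", "tourism", "culture"]))
```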

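Content review ✅

  • The choice of data sources is reasonable and covers important regions such as Southeast Asia, Japan, Russia, and Mexico
  • All URLs use HTTPS ✅
  • Both Chinese and English descriptions are present ✅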

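Suggestions

  1. Add geographic_scope and data_content, to keep the same level of detail as existing data sources
  2. Remove the non-standard fields access_level and has_api
  3. Make all tags lowercase
  4. Consider submitting in smaller batches (2-3 sources at a time) to make review easier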

Collaborator Author

@firstdata-dev firstdata-dev left a comment

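The choice of data sources is good and is based on the Langfuse user demand analysis 👍

⚠️ Issue: the data_content field is missing

None of the 10 files has data_content (a description of the data content in Chinese and English), which is a standard field for existing data sources. The earlier PRs (#31-#42) all include it. Omitting this field reduces the precision of MCP search.

Also note the schema differences:

  • The files add access_level and has_api (these two are usually not needed in source files; has_api is generated automatically by the index)
  • Some files write domains as an inline array (e.g. ["trade", "economics"]); I suggest using the multi-line format consistently

The content direction is fine. I suggest merging once data_content is added.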
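
On the array formatting point, one possible way to get consistent multi-line arrays is to re-serialize each file with a fixed indent. This is a sketch that assumes the data source files are JSON; the sample record is hypothetical, and `ensure_ascii=False` is used so that Chinese text is written as-is rather than escaped.

```python
import json

# Hypothetical record that uses an inline array for domains.
raw = '{"id": "japan-customs", "domains": ["trade", "economics"]}'

record = json.loads(raw)
# With an indent set, json.dumps puts every array element on its own line.
formatted = json.dumps(record, indent=2, ensure_ascii=False)
print(formatted)
```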


2 participants