feat: add 10 government data sources from Langfuse user demand analysis #43
Open
firstdata-dev wants to merge 4 commits into MLT-OSS:main from
New data sources (all government/international authority):
- china-gacc: General Administration of Customs of China
- china-mct: Ministry of Culture and Tourism of China
- india-meity: Ministry of Electronics and IT of India
- indonesia-bps: Statistics Indonesia (BPS)
- asean-stats: ASEAN Statistics
- japan-customs: Japan Customs Trade Statistics
- mexico-inegi: INEGI Mexico
- russia-rosstat: Federal State Statistics Service of Russia
- singapore-dos: Department of Statistics Singapore
- thailand-nso: National Statistical Office of Thailand

Sources identified from Langfuse user query analysis (real user demand).
mingcha-dev (Contributor) suggested changes on Mar 10, 2026
🔍 明察 QA Review: PR #43
Batch addition of 10 data sources based on Langfuse user demand analysis. The direction is good 👍
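Schema compliance issues
1. Non-standard fields ⚠️
All 10 files use the access_level and has_api fields, neither of which is in datasource-schema.json:
- access_level: "free" (no such field in the schema)
- has_api: true/false (no such field in the schema; the existing api_url already expresses the same information)
Suggest removing both fields, or updating the schema first.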
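A check for this kind of drift could look like the sketch below. It assumes datasource-schema.json is a JSON Schema with a top-level "properties" object; the trimmed-down schema and the helper name unknown_fields are hypothetical and only illustrate the idea.

```python
def unknown_fields(record: dict, schema: dict) -> list[str]:
    """Return the keys of `record` that the schema does not declare."""
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in record if key not in allowed)


# Hypothetical, heavily trimmed schema and record, for illustration only.
schema = {"properties": {"id": {}, "name": {}, "api_url": {}}}
record = {
    "id": "china-gacc",
    "name": "General Administration of Customs of China",
    "access_level": "free",
    "has_api": False,
}

print(unknown_fields(record, schema))
```

If the real schema sets additionalProperties to false, a regular JSON Schema validator would report the same two fields without a custom helper.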
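2. Missing required/common fields ⚠️
Compared with earlier data sources, the format of this batch is noticeably simplified:
- Missing geographic_scope: all earlier data sources have it
- Missing data_content: all earlier data sources have Chinese and English descriptions of the data content
- Some files are missing api_url (it should be declared explicitly even when the value is null)
- asean-stats is missing geographic_scope (should be regional)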
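The distinction between an absent key and a key explicitly set to null can be sketched as follows. The field list comes from this review; the helper name missing_fields and the sample record are hypothetical.

```python
EXPECTED_FIELDS = ("geographic_scope", "data_content", "api_url")


def missing_fields(record: dict, expected=EXPECTED_FIELDS) -> list[str]:
    """Return expected fields that are absent from `record`.

    A key whose value is None (null in the file) counts as declared,
    so only keys that are not present at all are reported.
    """
    return [field for field in expected if field not in record]


# Hypothetical record: api_url is declared as null, the other two are absent.
record = {"id": "asean-stats", "api_url": None}

print(missing_fields(record))
```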
3. Inconsistent tag casing ⚠️
- HS-code, GDP, IT-policy, PLI-scheme, ASEAN: the earlier #39 review already asked for all-lowercase tags
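Lowercasing can be done mechanically before committing. A minimal sketch, assuming tags are plain strings; the helper name lowercase_tags is hypothetical, and dropping duplicates that only differ by case is an extra assumption, not something the #39 review asked for.

```python
def lowercase_tags(tags: list[str]) -> list[str]:
    """Lowercase every tag, keeping first-seen order and dropping duplicates."""
    seen = set()
    result = []
    for tag in tags:
        normalized = tag.strip().lower()
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


print(lowercase_tags(["HS-code", "GDP", "IT-policy", "PLI-scheme", "ASEAN"]))
```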
4. Domain issues ⚠️
- tourism, culture, manufacturing, labor: not in the standard enum
- Suggest checking these against the standardization scheme in Issue #38
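A domain check against the enum might look like this sketch. The real list is whatever Issue #38 standardizes; the STANDARD_DOMAINS set below is a placeholder and the helper name is hypothetical.

```python
# Placeholder only: replace with the enum agreed on in Issue #38.
STANDARD_DOMAINS = {"trade", "economics"}


def nonstandard_domains(domains: list[str], allowed=STANDARD_DOMAINS) -> list[str]:
    """Return the domains that are not part of the allowed enum, in input order."""
    return [domain for domain in domains if domain not in allowed]


print(nonstandard_domains(["trade", "tourism", "culture"]))
```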
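Content review ✅
- The choice of data sources is reasonable and covers important regions such as Southeast Asia, Japan, Russia, and Mexico
- All URLs use HTTPS ✅
- Both Chinese and English descriptions are present ✅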
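Suggestions
- Add geographic_scope and data_content to keep the same level of detail as existing data sources
- Remove the non-standard fields access_level and has_api
- Use lowercase for all tags
- Consider submitting in smaller batches (2-3 sources each) to make review easier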
firstdata-dev (Collaborator, Author) commented on Mar 10, 2026
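Good choice of data sources, based on Langfuse user demand analysis 👍
data_content field
None of the 10 files has data_content (Chinese and English descriptions of the data content), which is a standard field in existing data sources. The earlier PRs (#31-#42) all have it. Without this field, MCP search precision drops.
Also note the schema differences:
- Extra access_level and has_api fields (these are usually not needed in the source files; has_api is generated automatically by the index)
- Some files use inline array format for domains (e.g. ["trade", "economics"]); suggest using multi-line format consistently
The content direction is fine. Suggest merging once data_content is added.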
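If the source files are JSON (an assumption, the PR page does not show them), re-serializing with an indent is one way to get multi-line arrays everywhere. The record below is a hypothetical fragment.

```python
import json

# Hypothetical fragment of a data source file with an inline array.
inline = '{"id": "japan-customs", "domains": ["trade", "economics"]}'

record = json.loads(inline)

# With indent set, json.dumps puts every array element on its own line.
# ensure_ascii=False keeps Chinese descriptions readable instead of \u escapes.
formatted = json.dumps(record, indent=2, ensure_ascii=False)
print(formatted)
```

Running every file through the same formatter keeps diffs small, since later tag or domain additions then touch one line each.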
Background
These 10 data sources were identified from Langfuse user query analysis — real users searched for data in these areas but got no results. This PR directly addresses the "data source coverage gap" identified in the MCP quality evaluation.
New Data Sources (all government/international)
Validation
make check passed (220 unique IDs, schema compliant)
Coverage Impact
Data sources: 210 → 220 (+10)
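What make check runs is not shown on this page. As a rough sketch of the ID uniqueness part only, assuming each data source carries a string id, duplicates could be detected like this (the helper name duplicate_ids is hypothetical):

```python
from collections import Counter


def duplicate_ids(ids: list[str]) -> list[str]:
    """Return every ID that occurs more than once, sorted alphabetically."""
    counts = Counter(ids)
    return sorted(source_id for source_id, n in counts.items() if n > 1)


# Illustrative input using IDs from this PR; the repeated entry is deliberate.
ids = ["china-gacc", "japan-customs", "china-gacc", "thailand-nso"]
print(duplicate_ids(ids))
```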