diff --git a/docs/dashboard.md b/docs/dashboard.md
new file mode 100644
index 0000000..f85c5f0
--- /dev/null
+++ b/docs/dashboard.md
@@ -0,0 +1,150 @@
+# InfiniMetrics Dashboard User Guide
+## 1. Dashboard Overview
+
+InfiniMetrics Dashboard provides a unified interface to visualize benchmark and evaluation results of AI accelerators across the following scenarios:
+
+- Communication (NCCL / Collective Communication)
+
+- Training (Training / Distributed Training)
+
+- Inference (Direct / Service Inference)
+
+- Operator (Core Operator Performance)
+
+The benchmark framework produces two types of outputs:
+
+```
+JSON -> configuration / environment / scalar metrics
+CSV -> curves / time-series data
+```
+The Dashboard automatically loads test results and provides unified analysis capabilities, including:
+
+- Run ID fuzzy search: locate specific test runs using partial Run IDs
+
+- General filters: filter results by framework, model, device count, etc.
+
+- Multi-run comparison: select multiple runs to compare performance
+
+- Performance visualization: display curves such as latency / throughput / loss
+
+- Statistics and configuration view: inspect throughput statistics, runtime configuration, and environment details
+
+For example, you can enter:
+```
+allreduce
+service
+```
+to perform fuzzy matching on Run IDs.
+
+Example screenshot:
+![Run ID search](./images/runid_research.jpg)
+## 2. Running the Dashboard
+### 2.1 Environment Requirements
+
+Before using the Dashboard, install the following dependencies:
+```
+streamlit
+plotly
+pandas
+```
+### 2.2 Start the Dashboard
+
+Run the following command in the project root directory:
+```
+python -m streamlit run dashboard/app.py
+```
+Access URL after startup:
+```
+Local URL: http://localhost:8501
+Network URL: http://:8501
+```
+Explanation:
+
+Local URL: accessible only on the local machine
+
+Network URL: accessible from other machines within the same network
+
+## 3. Communication Test Analysis
+
+Path:
+```
+Dashboard → Communication Performance Test
+```
+Supported features:
+```
+Bandwidth analysis curve - peak bandwidth
+
+Latency analysis curve - average latency
+
+Test duration
+
+GPU memory usage
+
+Communication configuration analysis
+```
+Example screenshot:
+![Communication Test](./images/dashboard_communication.jpg)
+
+## 4. Inference Test Analysis
+
+Path:
+```
+Dashboard → Inference Performance Test
+```
+Modes:
+```
+Direct Inference
+Service Inference
+```
+Displayed metrics:
+```
+TTFT
+
+Latency
+
+Throughput
+
+GPU memory usage
+
+Inference configuration analysis
+```
+Example screenshot:
+![Inference Test](./images/dashboard_inference.jpg)
+
+## 5. Training Test Analysis
+
+Path:
+```
+Dashboard → Training Performance Test
+```
+Supported features:
+```
+Loss curve
+
+Perplexity curve
+
+Throughput curve
+
+GPU memory usage
+
+Training configuration analysis
+```
+Example screenshot:
+![Training Test](./images/dashboard_training.jpg)
+
+## 6. Operator Test Analysis
+
+Path:
+```
+Dashboard → Operator Performance Test
+```
+Supported metrics:
+```
+latency
+
+flops
+
+bandwidth
+```
+Example screenshot:
+![Operator Test](./images/dashboard_operators.jpg)
diff --git a/docs/images/dashboard_communication.jpg b/docs/images/dashboard_communication.jpg
new file mode 100644
index 0000000..4f56efc
Binary files /dev/null and b/docs/images/dashboard_communication.jpg differ
diff --git a/docs/images/dashboard_inference.jpg b/docs/images/dashboard_inference.jpg
new file mode 100644
index 0000000..7c31a4f
Binary files /dev/null and b/docs/images/dashboard_inference.jpg differ
diff --git a/docs/images/dashboard_operators.jpg b/docs/images/dashboard_operators.jpg
new file mode 100644
index 0000000..88b0943
Binary files /dev/null and b/docs/images/dashboard_operators.jpg differ
diff --git a/docs/images/dashboard_training.jpg b/docs/images/dashboard_training.jpg
new file mode 100644
index 0000000..4b354f7
Binary files /dev/null and b/docs/images/dashboard_training.jpg differ
diff --git a/docs/images/runid_research.jpg b/docs/images/runid_research.jpg
new file mode 100644
index 0000000..6eb1e64
Binary files /dev/null and b/docs/images/runid_research.jpg differ
diff --git a/docs/zh/dashboard.md b/docs/zh/dashboard.md
new file mode 100644
index 0000000..c48d109
--- /dev/null
+++ b/docs/zh/dashboard.md
@@ -0,0 +1,157 @@
+# InfiniMetrics Dashboard User Guide
+
+## 1. Dashboard Overview
+
+InfiniMetrics Dashboard provides a unified view of test and evaluation results for AI accelerators in the following scenarios:
+
+- Communication (NCCL / collective communication)
+- Training (Training / distributed training)
+- Inference (Direct / Service inference)
+- Operator (core operator performance)
+
+The test framework outputs two types of data:
+```
+JSON -> configuration / environment / scalar metrics
+CSV -> curves / time-series data
+```
+The Dashboard automatically loads test results and provides unified analysis features, including:
+
+- Run ID fuzzy search: quickly locate test runs using a partial Run ID
+
+- General filters: filter by framework, model, device count, and other criteria
+
+- Multi-run comparison: select multiple test runs at once to compare performance
+
+- Performance visualization: display performance curves such as latency / throughput / loss
+
+- Statistics and configuration view: inspect throughput statistics, run configuration, and environment information
+
+For example, you can enter:
+```
+allreduce
+service
+```
+to run a fuzzy-match search on Run IDs.
+
+Example screenshot:
+
+![Run ID search](../images/runid_research.jpg)
+## 2. Running the Dashboard
+### 2.1 Environment Dependencies
+Before using the Dashboard, install the following dependencies:
+```
+streamlit
+plotly
+pandas
+```
+### 2.2 Starting the Dashboard
+Run the following in the project root directory:
+```
+python -m streamlit run dashboard/app.py
+```
+Access addresses shown after a successful start:
+```
+Local URL: http://localhost:8501
+Network URL: http://:8501
+```
+Notes:
+
+Local URL: accessible from the local machine only
+
+Network URL: accessible from other machines on the same network
+
+## 3. Communication Test Analysis
+Path:
+
+```
+Dashboard → Communication Performance Test
+```
+
+Supported:
+```
+Bandwidth analysis curve - peak bandwidth
+
+Latency analysis curve - average latency
+
+Test duration
+
+GPU memory usage
+
+Communication configuration analysis
+```
+
+Example screenshot:
+
+![Communication Test](../images/dashboard_communication.jpg)
+## 4. Inference Test Analysis
+
+Path:
+
+```
+Dashboard → Inference Performance Test
+```
+
+Modes:
+```
+Direct Inference
+Service Inference
+```
+Displayed metrics:
+```
+TTFT
+
+Latency
+
+Throughput
+
+GPU memory usage
+
+Inference configuration analysis
+```
+Example screenshot:
+
+![Inference Test](../images/dashboard_inference.jpg)
+
+## 5. Training Test Analysis
+Path:
+
+```
+Dashboard → Training Performance Test
+```
+
+Supported:
+```
+Loss curve
+
+Perplexity curve
+
+Throughput curve
+
+GPU memory usage
+
+Training configuration analysis
+```
+Example screenshot:
+
+![Training Test](../images/dashboard_training.jpg)
+
+## 6. Operator Test Analysis
+
+Path:
+
+```
+Dashboard → Operator Performance Test
+```
+
+Supported:
+```
+latency
+
+flops
+
+bandwidth
+```
+
+Example screenshot:
+
+![Operator Test](../images/dashboard_operators.jpg)