diff --git a/dashboard/top-sql.md b/dashboard/top-sql.md
index 73c3b5e0b9185..2a51cb7dfe56c 100644
--- a/dashboard/top-sql.md
+++ b/dashboard/top-sql.md
@@ -1,58 +1,60 @@
 ---
-title: TiDB Dashboard Top SQL page
-summary: TiDB Dashboard Top SQL allows real-time monitoring and visualization of CPU overhead for SQL statements in your database. It helps optimize performance by identifying high CPU load statements and provides detailed execution information. It's suitable for analyzing performance issues and can be accessed through TiDB Dashboard or a browser. The feature has a slight impact on cluster performance and is now generally available for production use.
+title: TiDB Dashboard TopSQL page
+summary: Use TopSQL to identify queries that consume high CPU, network, and logical IO resources
 ---
 
-# TiDB Dashboard Top SQL Page
+# TiDB Dashboard TopSQL Page
 
-With Top SQL, you can monitor and visually explore the CPU overhead of each SQL statement in your database in real-time, which helps you optimize and resolve database performance issues. Top SQL continuously collects and stores CPU load data summarized by SQL statements at any seconds from all TiDB and TiKV instances. The collected data can be stored for up to 30 days. Top SQL presents you with visual charts and tables to quickly pinpoint which SQL statements are contributing the high CPU load of a TiDB or TiKV instance over a certain period of time.
+TiDB Dashboard TopSQL helps you visually analyze the most resource-intensive queries on a specific TiDB or TiKV instance over a period of time. By default, TopSQL continuously collects CPU load data from each TiDB and TiKV instance and retains the data for up to 30 days. For TiKV instances, you can also enable **TiKV Network IO collection (multi-dimensional)** in the settings panel to view metrics such as `Network Bytes` and `Logical IO Bytes`, and analyze the results by `Query`, `Table`, `DB`, or `Region`.
 
-Top SQL provides the following features:
+TopSQL provides the following features:
 
-* Visualize the top 5 types of SQL statements with the highest CPU overhead through charts and tables.
-* Display detailed execution information such as queries per second, average latency, and query plan.
-* Collect all SQL statements that are executed, including those that are still running.
-* Allow viewing data of a specific TiDB and TiKV instance.
+* Show the top `5`, `20`, or `100` records with the highest load in the selected time range, and automatically aggregate the rest into `Others`.
+* Sort hotspots by `CPU Time` or `Network Bytes`, and when a TiKV instance is selected, by `Logical IO Bytes` as well.
+* Analyze load by `Query` and view SQL and execution plan details. When a TiKV instance is selected, you can also aggregate and analyze data by `Table`, `DB`, or `Region`.
+* Zoom in on a selected time range in the chart, manually refresh data, enable auto refresh, and export table data to CSV.
+* Collect all SQL statements that are running, including unfinished statements.
+* View data for a specific TiDB or TiKV instance in the cluster.
 
 ## Recommended scenarios
 
-Top SQL is suitable for analyzing performance issues. The following are some typical Top SQL scenarios:
+TopSQL is suitable for analyzing performance issues in scenarios such as the following:
 
-* You discovered that an individual TiKV instance in the cluster has a very high CPU usage through the Grafana charts. You want to know which SQL statements cause the CPU hotspots so that you can optimize them and better leverage all of your distributed resources.
-* You discovered that the cluster has a very high CPU usage overall and queries are slow. You want to quickly figure out which SQL statements are currently consuming the most CPU resources so that you can optimize them.
-* The CPU usage of the cluster has drastically changed and you want to know the major cause.
-* Analyze the most resource-intensive SQL statements in the cluster and optimize them to reduce hardware costs.
+* You find that one TiDB or TiKV instance has very high CPU usage and want to quickly identify which queries are consuming the most CPU resources.
+* The overall cluster becomes slower and you want to identify the queries that currently consume the most resources, or compare the major query changes before and after a workload shift.
+* You want to locate hotspots at a higher level and analyze TiKV-side resource usage by `Table`, `DB`, or `Region`.
+* You want to troubleshoot TiKV hotspots from the perspective of network traffic or logical IO instead of CPU alone.
 
-Top SQL cannot be used in the following scenarios:
+TopSQL is not suitable for the following scenarios:
 
-- Top SQL cannot be used to pinpoint non-performance issues, such as incorrect data or abnormal crashes.
-- Top SQL does not support analyzing database performance issues that are not caused by high CPU load, such as transaction lock conflicts.
+- It cannot answer non-performance questions such as data correctness issues or abnormal crashes.
+- It is not designed to directly analyze issues such as lock conflicts or transaction semantic errors that are not caused by resource consumption.
 
 ## Access the page
 
-You can access the Top SQL page using either of the following methods:
+You can access the TopSQL page using either of the following methods:
 
-* After logging in to TiDB Dashboard, click **Top SQL** in the left navigation menu.
+* After logging in to TiDB Dashboard, click **TopSQL** in the left navigation menu.
 
-    ![Top SQL](/media/dashboard/top-sql-access.png)
+    ![TopSQL](/media/dashboard/top-sql-access.png)
 
 * Visit in your browser. Replace `127.0.0.1:2379` with the actual PD instance address and port.
 
-## Enable Top SQL
+## Enable TopSQL
 
 > **Note:**
 >
-> To use Top SQL, your cluster should be deployed or upgraded with a recent version of TiUP (v1.9.0 or above) or TiDB Operator (v1.3.0 or above). If your cluster was upgraded using an earlier version of TiUP or TiDB Operator, see [FAQ](/dashboard/dashboard-faq.md#a-required-component-ngmonitoring-is-not-started-error-is-shown) for instructions.
+> To use TopSQL, your cluster should be deployed or upgraded with a recent version of TiUP (v1.9.0 or above) or TiDB Operator (v1.3.0 or above). If your cluster was upgraded using an earlier version of TiUP or TiDB Operator, see [FAQ](/dashboard/dashboard-faq.md#a-required-component-ngmonitoring-is-not-started-error-is-shown) for instructions.
 
-Top SQL is not enabled by default as it has a slight impact on cluster performance (within 3% on average) when enabled. You can enable Top SQL by the following steps:
+TopSQL is disabled by default because it has a slight impact on cluster performance, usually less than 3%. You can enable TopSQL as follows:
 
-1. Visit the [Top SQL page](#access-the-page).
-2. Click **Open Settings**. On the right side of the **Settings** area, switch on **Enable Feature**.
+1. Visit the [TopSQL page](#access-the-page).
+2. Click **Open Settings**. In the **Settings** panel on the right, turn on **Enable Feature**.
 3. Click **Save**.
 
-After enabling the feature, wait up to 1 minute for Top SQL to load the data. Then you can see the CPU load details.
+After TopSQL is enabled, only data collected from that point forward is available. Historical fine-grained data before enabling TopSQL is not backfilled. New data is usually visible after about 1 minute. After TopSQL is disabled, existing historical data remains queryable until it expires, but no new TopSQL data is collected.
 
-In addition to the UI, you can also enable the Top SQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
+In addition to the UI, you can also enable the TopSQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
 
 {{< copyable "sql" >}}
 
@@ -60,58 +62,99 @@ In addition to the UI, you can also enable the Top SQL feature by setting the Ti
 SET GLOBAL tidb_enable_top_sql = 1;
 ```
 
-## Use Top SQL
+### Enable TiKV Network IO collection (optional)
 
-The following are the common steps to use Top SQL.
+If you want to use `Order By Network`, `Order By Logical IO`, or `By Region` on a TiKV instance, continue in the same settings panel and turn on **Enable TiKV Network IO collection (multi-dimensional)**, and then save the setting.
 
-1. Visit the [Top SQL page](#access-the-page).
+As shown in the following screenshot, the **Settings** panel contains both **Enable Feature** and **Enable TiKV Network IO collection (multi-dimensional)**:
 
-2. Select a particular TiDB or TiKV instance that you want to observe the load.
+![Enable TiKV network IO collection](/media/dashboard/top-sql-settings-enable-tikv-network-io.png)
 
-    ![Select Instance](/media/dashboard/top-sql-usage-select-instance.png)
+This setting increases storage and query overhead. After it is enabled, the configuration is applied to all current TiKV nodes. The new data might still take about 1 minute to appear. If some TiKV nodes fail to enable the setting, the page displays a warning, and newly collected data might be incomplete.
 
-    If you are unsure of which TiDB or TiKV instance to observe, you can select an arbitrary instance. Also, when the cluster CPU load is extremely unbalanced, you can first use Grafana charts to determine the specific instance you want to observe.
+For TiKV nodes that are added later by scaling out, this switch does not automatically take effect. You need to manually turn it on again so that the configuration is applied to all TiKV nodes. If you want newly added TiKV nodes to automatically enable this capability, add the following configuration under `server_configs.tikv` in the TiUP cluster topology file and re-apply the TiKV configuration using TiUP:
 
-3. Observe the charts and tables presented by Top SQL.
+```yaml
+server_configs:
+  tikv:
+    resource-metering.enable-network-io-collection: true
+```
 
-    ![Chart and Table](/media/dashboard/top-sql-usage-chart.png)
+For more information about TiUP topology configuration, see [TiDB Cluster Topology Reference](/tiup/tiup-cluster-topology-reference.md).
 
-    The size of the bars in the bar chart represents the size of CPU resources consumed by the SQL statement at that moment. Different colors distinguish different types of SQL statements. In most cases, you only need to focus on the SQL statements that have a higher CPU resource overhead in the corresponding time range in the chart.
+## Use TopSQL {#use-top-sql}
 
-4. Click a SQL statement in the table to show more information. You can see detailed execution metrics of different plans of that statement, such as Call/sec (average queries per second) and Scan Indexes/sec (average number of index rows scanned per second).
+The following are the common steps to use TopSQL:
 
-    ![Details](/media/dashboard/top-sql-details.png)
+1. Visit the [TopSQL page](#access-the-page).
+
+2. Select the TiDB or TiKV instance that you want to observe.
+
+    ![Select Instance](/media/dashboard/top-sql-usage-select-instance.png)
+
+    If you are not sure which instance to inspect, you can first identify the busy node from Grafana or the Overview page, and then return to TopSQL for deeper analysis.
 
-5. Based on these initial clues, you can further explore the [SQL Statement](/dashboard/dashboard-statement-list.md) or [Slow Queries](/dashboard/dashboard-slow-query.md) page to find the root cause of high CPU consumption or large data scans of the SQL statement.
+3. Set the time range, and use **Refresh** or auto refresh when needed.
 
-    You can adjust the time range in the time picker or select a time range in the chart to get a more precise and detailed look at the problem. A smaller time range can provide more detailed data, with precision of up to 1 second.
+    You can adjust the time range in the time picker, or drag over a range in the chart to zoom in. A smaller time range provides more fine-grained data, with precision down to 1 second.
 
     ![Change time range](/media/dashboard/top-sql-usage-change-timerange.png)
 
-    If the chart is out of date, you can click the **Refresh** button or select Auto Refresh options from the **Refresh** drop-down list.
+    If the chart is out of date, click **Refresh**, or select an auto refresh interval from the **Refresh** drop-down list.
 
     ![Refresh](/media/dashboard/top-sql-usage-refresh.png)
 
-6. View the CPU resource usage by table or database level to quickly identify resource usage at a higher level. Currently, only TiKV instances are supported.
+4. Choose the observation mode.
 
-    Select a TiKV instance, and then select **By TABLE** or **By DB**:
+    - Use `Limit` to display the top `5`, `20`, or `100` records.
+    - Use `Order By` to sort by `CPU Time` or `Network Bytes`. When a TiKV instance is selected, you can also sort by `Logical IO Bytes`.
+    - Use `By Query`, `By Table`, `By DB`, or `By Region` to switch the aggregation dimension. The last three options are available only for TiKV instances.
+
+    When a TiKV instance is selected and **TiKV Network IO collection (multi-dimensional)** is enabled, the `Order By` drop-down list shows `Order By CPU`, `Order By Network`, and `Order By Logical IO`.
+
+    ![Select order by](/media/dashboard/top-sql-usage-select-order-by.png)
 
     ![Select aggregation dimension](/media/dashboard/top-sql-usage-select-agg-by.png)
 
-    View the aggregated results at a higher level:
+    `By Region`, `Order By Network`, and `Order By Logical IO` depend on **TiKV Network IO collection (multi-dimensional)**. If the feature is disabled but historical data is still retained, the page can continue to display historical data and warns that newly collected data might be incomplete.
+
+5. Observe hotspot records in the chart and table.
+
+    ![Chart and Table](/media/dashboard/top-sql-usage-chart.png)
+
+    Each block in the chart represents resource consumption under the current sort dimension, and different colors represent different records. The table is sorted by the current metric and includes an extra `Others` row that summarizes all non-Top N records.
+
+6. In the `By Query` view, click a row in the table to expand query details by execution plan.
+
+    ![Details](/media/dashboard/top-sql-details.png)
+
+    In the detail panel, you can view the query template, query template ID, plan template ID, and execution plan text. The detail table shows different metrics depending on the selected instance type:
+
+    - For TiDB instances, the detail table typically shows `Call/sec` and `Latency/call`.
+    - For TiKV instances, the detail table typically shows `Call/sec`, `Scan Rows/sec`, and `Scan Indexes/sec`.
+
+    In the `By Table`, `By DB`, or `By Region` views, the page shows aggregated results rather than per-plan SQL details.
+
+7. On a TiKV instance, if you need to analyze hotspots at a higher level, switch to `By Table`, `By DB`, or `By Region` to view aggregated results.
 
     ![Aggregated results at DB level](/media/dashboard/top-sql-usage-agg-by-db-detail.png)
 
-## Disable Top SQL
+8. Based on these clues, continue with the [SQL Statements](/dashboard/dashboard-statement-list.md) page or the [Slow Queries](/dashboard/dashboard-slow-query.md) page to investigate the root cause.
+
+    In the `By Query` view, you can also click **Search in SQL Statements** in the table to jump to the corresponding SQL Statements page. If you need to analyze the current table data offline, use `Download to CSV`.
 
-You can disable this feature by following these steps:
+## Disable TopSQL
 
-1. Visit [Top SQL page](#access-the-page).
-2. Click the gear icon in the upper right corner to open the settings screen and switch off **Enable Feature**.
+You can disable TopSQL by following these steps:
+
+1. Visit the [TopSQL page](#access-the-page).
+2. Click the settings icon in the upper-right corner, and turn off **Enable Feature**.
 3. Click **Save**.
-4. In the popped-up dialog box, click **Disable**.
+4. In the confirmation dialog, click **Disable**.
+
+After TopSQL is disabled, no new TopSQL data is collected. Existing historical data remains available until it expires.
 
-In addition to the UI, you can also disable the Top SQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
+In addition to the UI, you can also disable the TopSQL feature by setting the TiDB system variable [`tidb_enable_top_sql`](/system-variables.md#tidb_enable_top_sql-new-in-v540):
 
 {{< copyable "sql" >}}
 
@@ -119,32 +162,50 @@ In addition to the UI, you can also disable the Top SQL feature by setting the T
 SET GLOBAL tidb_enable_top_sql = 0;
 ```
 
+### Disable TiKV Network IO collection
+
+If you want to stop collecting TiKV `Network Bytes`, `Logical IO Bytes`, and related multi-dimensional data while keeping TopSQL CPU analysis enabled, turn off **Enable TiKV Network IO collection (multi-dimensional)** in the settings panel.
+
+After this setting is disabled:
+
+- Historical network IO and logical IO data remains viewable until it expires.
+- New `Network Bytes`, `Logical IO Bytes`, and `By Region` data is no longer collected.
+
 ## Frequently asked questions
 
-**1. Top SQL cannot be enabled and the UI displays "required component NgMonitoring is not started"**.
+**1. TopSQL cannot be enabled and the UI displays "required component NgMonitoring is not started".**
 
 See [TiDB Dashboard FAQ](/dashboard/dashboard-faq.md#a-required-component-ngmonitoring-is-not-started-error-is-shown).
 
-**2. Will performance be affected after enabling Top SQL?**
+**2. Will performance be affected after enabling TopSQL?**
 
-This feature has a slight impact on cluster performance. According to our benchmark, the average performance impact is usually less than 3% when the feature is enabled.
+TopSQL itself has a slight impact on cluster performance. According to our benchmark, the average performance impact is usually less than 3%. If you also enable **TiKV Network IO collection (multi-dimensional)**, there is additional storage and query overhead.
 
 **3. What is the status of this feature?**
 
 It is now a generally available (GA) feature and can be used in production environments.
 
-**4. What is the meaning of "Other Statements"?**
+**4. What does `Others` mean in the UI?**
+
+`Others` represents the aggregated result of all non-Top N records under the current sort dimension. You can use it to understand how much of the total load comes from the Top N records.
+
+**5. What is the relationship between the CPU overhead displayed by TopSQL and the actual CPU usage of the process?**
+
+Their correlation is strong but they are not exactly the same thing. For example, the cost of writing multiple replicas is not counted in the TiKV CPU overhead displayed by TopSQL. In general, SQL statements with higher CPU usage result in higher CPU overhead displayed in TopSQL.
 
-"Other Statement" counts the total CPU overhead of all non-Top 5 statements. With this information, you can learn the CPU overhead contributed by the Top 5 statements compared with the overall.
+**6. What does the Y-axis of the TopSQL chart mean?**
 
-**5. What is the relationship between the CPU overhead displayed by Top SQL and the actual CPU usage of the process?**
+The Y-axis represents resource consumption under the currently selected sort dimension. When `Order By CPU` is selected, it represents CPU time. When `Order By Network` is selected, it represents network bytes. When `Order By Logical IO` is selected, it represents logical IO bytes.
 
-Their correlation is strong but they are not exactly the same thing. For example, the cost of writing multiple replicas is not counted in the TiKV CPU overhead displayed by Top SQL. In general, SQL statements with higher CPU usage result in higher CPU overhead displayed in Top SQL.
+**7. Does TopSQL collect running (unfinished) SQL statements?**
 
-**6. What is the meaning of the Y-axis of the Top SQL chart?**
+Yes. At each point in time, the TopSQL chart shows the load of all currently running SQL statements under the selected dimension, so unfinished SQL statements are included as well.
 
-It represents the size of CPU resources consumed. The more resources consumed by a SQL statement, the higher the value is. In most cases, you do not need to care about the meaning or unit of the specific value.
+**8. Why can't I see new `Order By Network`, `Order By Logical IO`, or `By Region` data?**
 
-**7. Does Top SQL collect running (unfinished) SQL statements?**
+These views depend on **TiKV Network IO collection (multi-dimensional)**. Check the following items:
 
-Yes. The bars displayed in the Top SQL chart at each moment indicate the CPU overhead of all running SQL statements at that moment.
+- Make sure that you have selected a TiKV instance.
+- Make sure that **Enable TiKV Network IO collection (multi-dimensional)** is turned on in the settings panel.
+- Make sure that the relevant TiKV nodes have successfully enabled the configuration. If only some nodes are enabled, the page warns that newly collected data might be incomplete.
+- If you recently scaled out new TiKV nodes, enable `resource-metering.enable-network-io-collection` in the TiKV default configuration managed by TiUP. Otherwise, newly added nodes do not automatically inherit the setting.
diff --git a/media/dashboard/top-sql-access.png b/media/dashboard/top-sql-access.png
index 62a14c98b07ce..3f7cf525df95c 100644
Binary files a/media/dashboard/top-sql-access.png and b/media/dashboard/top-sql-access.png differ
diff --git a/media/dashboard/top-sql-details.png b/media/dashboard/top-sql-details.png
index c38be5bd57786..7357ba92c51cd 100644
Binary files a/media/dashboard/top-sql-details.png and b/media/dashboard/top-sql-details.png differ
diff --git a/media/dashboard/top-sql-settings-enable-tikv-network-io.png b/media/dashboard/top-sql-settings-enable-tikv-network-io.png
new file mode 100644
index 0000000000000..0511908eb42e9
Binary files /dev/null and b/media/dashboard/top-sql-settings-enable-tikv-network-io.png differ
diff --git a/media/dashboard/top-sql-usage-agg-by-db-detail.png b/media/dashboard/top-sql-usage-agg-by-db-detail.png
index 116cbd055bab8..40959319968af 100644
Binary files a/media/dashboard/top-sql-usage-agg-by-db-detail.png and b/media/dashboard/top-sql-usage-agg-by-db-detail.png differ
diff --git a/media/dashboard/top-sql-usage-change-timerange.png b/media/dashboard/top-sql-usage-change-timerange.png
index ee92126357a92..0dd5399c1b4fc 100644
Binary files a/media/dashboard/top-sql-usage-change-timerange.png and b/media/dashboard/top-sql-usage-change-timerange.png differ
diff --git a/media/dashboard/top-sql-usage-chart.png b/media/dashboard/top-sql-usage-chart.png
index 7817a7f09e46b..560812d79d6e9 100644
Binary files a/media/dashboard/top-sql-usage-chart.png and b/media/dashboard/top-sql-usage-chart.png differ
diff --git a/media/dashboard/top-sql-usage-refresh.png b/media/dashboard/top-sql-usage-refresh.png
index 1e365fd480924..a93449751eebb 100644
Binary files a/media/dashboard/top-sql-usage-refresh.png and b/media/dashboard/top-sql-usage-refresh.png differ
diff --git a/media/dashboard/top-sql-usage-select-agg-by.png b/media/dashboard/top-sql-usage-select-agg-by.png
index 8d88f81d757a1..a81848639ab39 100644
Binary files a/media/dashboard/top-sql-usage-select-agg-by.png and b/media/dashboard/top-sql-usage-select-agg-by.png differ
diff --git a/media/dashboard/top-sql-usage-select-instance.png b/media/dashboard/top-sql-usage-select-instance.png
index 2276791202025..995946eec11fe 100644
Binary files a/media/dashboard/top-sql-usage-select-instance.png and b/media/dashboard/top-sql-usage-select-instance.png differ
diff --git a/media/dashboard/top-sql-usage-select-order-by.png b/media/dashboard/top-sql-usage-select-order-by.png
new file mode 100644
index 0000000000000..a643c494a93c0
Binary files /dev/null and b/media/dashboard/top-sql-usage-select-order-by.png differ
diff --git a/tidb-cloud/tidb-cloud-clinic.md b/tidb-cloud/tidb-cloud-clinic.md
index 12885991e8381..f112fabb80125 100644
--- a/tidb-cloud/tidb-cloud-clinic.md
+++ b/tidb-cloud/tidb-cloud-clinic.md
@@ -93,7 +93,7 @@ For more information, see [Slow Queries in TiDB Dashboard](https://docs.pingcap.
 
 ## Monitor TopSQL
 
-TiDB Cloud Clinic provides TopSQL information, enabling you to monitor and visually explore the CPU overhead of each SQL statement in your database in real time. This helps you optimize and resolve database performance issues.
+TiDB Cloud Clinic provides TopSQL information to help you visually analyze the most resource-intensive queries on a specific TiDB or TiKV instance over a period of time. By default, TopSQL continuously collects CPU load data. For TiKV instances, if TiKV network IO collection is enabled, you can also inspect `Network Bytes` and `Logical IO Bytes`, and analyze hotspots by `Query`, `Table`, `DB`, or `Region`. This helps you identify and troubleshoot performance issues from multiple resource dimensions instead of CPU alone.
 
 To view TopSQL, take the following steps:
 
@@ -103,9 +103,9 @@ To view TopSQL, take the following steps:
 
 3. Select a specific TiDB or TiKV instance to observe its load. You can use the time picker or select a time range in the chart to refine your analysis.
 
-4. Analyze the charts and tables displayed by TopSQL.
+4. Analyze the charts and tables displayed by TopSQL. Depending on the selected instance and enabled metrics, you can use `Order By` and the available aggregation dimensions to inspect CPU, network, or logical IO hotspots.
 
-For more information, see [TopSQL in TiDB Dashboard](https://docs.pingcap.com/tidb/stable/top-sql).
+For more information, see [TopSQL in TiDB Dashboard](https://docs.pingcap.com/tidb/stable/top-sql).
 
 ## Generate benchmark reports