Commit 11c7ad9

Add cpu and traffic into performance overview (#18698)

1 parent 2a4c91b, commit 11c7ad9

6 files changed: +138 -60 lines

dashboard/dashboard-monitoring.md (+64 -44)
@@ -21,31 +21,31 @@ If the TiDB cluster is deployed using TiUP, you can also view the Performance Ov
 
 The Performance Overview dashboard orchestrates the metrics of TiDB, PD, and TiKV, and presents each of them in the following sections:
 
-- Overview: Database time and SQL execution time summary. By checking different colors in the overview, you can quickly identify the database workload profile and the performance bottleneck.
+- **Overview**: Database time and SQL execution time summary. By checking the different colors in the overview, you can quickly identify the database workload profile and the performance bottleneck.
 
-- Load profile: Key metrics and resource usage, including database QPS, connection information, the MySQL command types the application interacts with TiDB, database internal TSO and KV request OPS, and resource usage of the TiKV and TiDB.
+- **Load profile**: Key metrics and resource usage, including database QPS, connection information, the MySQL command types that the application uses to interact with TiDB, database internal TSO and KV request OPS, and resource usage of TiDB and TiKV.
 
-- Top-down latency breakdown: Query latency versus connection idle time ratio, query latency breakdown, TSO/KV request latency during execution, breakdown of write latency within TiKV.
+- **Top-down latency breakdown**: Query latency versus connection idle time ratio, query latency breakdown, TSO/KV request latency during execution, and breakdown of write latency within TiKV.
 
 The following sections illustrate the metrics on the Performance Overview dashboard.
 
 ### Database Time by SQL Type
 
-- database time: Total database time per second
-- sql_type: Database time consumed by each type of SQL statements per second
+- `database time`: Total database time per second
+- `sql_type`: Database time consumed by each type of SQL statement per second
 
 ### Database Time by SQL Phase
 
-- database time: Total database time per second
-- get token/parse/compile/execute: Database time consumed in four SQL processing phases
+- `database time`: Total database time per second
+- `get token/parse/compile/execute`: Database time consumed in the four SQL processing phases
 
 In general, the SQL execution phase is shown in green and the other phases in red. If the non-green areas are large, much of the database time is consumed in phases other than execution, and further cause analysis is required.
 
 ### SQL Execute Time Overview
 
-- execute time: Database time consumed during SQL execution per second
-- tso_wait: Concurrent TSO waiting time per second during SQL execution
-- kv request type: Time waiting for each KV request type per second during SQL execution. The total KV request wait time might exceed SQL execution time, because KV requests are concurrent.
+- `execute time`: Database time consumed during SQL execution per second
+- `tso_wait`: Concurrent TSO waiting time per second during SQL execution
+- `kv request type`: Time waiting for each KV request type per second during SQL execution. The total KV request wait time might exceed the SQL execution time because KV requests are issued concurrently (see the sketch after this hunk).
 
 Green metrics stand for common KV write requests (such as prewrite and commit), blue metrics stand for common read requests, and metrics in other colors stand for unexpected situations that require your attention. For example, pessimistic lock KV requests are marked red and TSO waiting is marked dark brown.
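Why the stacked `kv request type` series can exceed `execute time`: concurrent KV requests accumulate wait time even while they overlap in wall-clock time. A minimal sketch with hypothetical wait times (illustration only, not TiDB code):

```python
# Hypothetical per-type KV wait times (in seconds) for one SQL statement.
kv_waits = {"Get": 0.04, "Cop": 0.05, "Prewrite": 0.03}

total_wait = sum(kv_waits.values())  # 0.12 s of accumulated waiting time
wall_clock = max(kv_waits.values())  # about 0.05 s if the requests fully overlap

# The dashboard sums wait time across concurrent requests, so the stacked
# kv request series can legitimately exceed the execute time series.
print(f"summed wait {total_wait:.2f} s vs. wall clock {wall_clock:.2f} s")
```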

@@ -77,50 +77,70 @@ Generally, `tso - request` divided by `tso - cmd` is the average size of TSO req
 
 ### Connection Count
 
-- total: Number of connections to all TiDB instances
-- active connections: Number of active connections to all TiDB instances
+- `total`: Number of connections to all TiDB instances
+- `active connections`: Number of active connections to all TiDB instances
 - Number of connections to each TiDB instance
 
-### TiDB CPU
+### TiDB CPU/Memory
 
-- avg: Average CPU utilization across all TiDB instances
-- delta: Maximum CPU utilization of all TiDB instances minus minimum CPU utilization of all TiDB instances
-- max: Maximum CPU utilization across all TiDB instances
+- `CPU-Avg`: Average CPU utilization across all TiDB instances
+- `CPU-Delta`: Maximum CPU utilization of all TiDB instances minus minimum CPU utilization of all TiDB instances
+- `CPU-Max`: Maximum CPU utilization across all TiDB instances
+- `CPU-Quota`: Number of CPU cores that can be used by TiDB
+- `Mem-Max`: Maximum memory utilization across all TiDB instances
 
-### TiKV CPU/IO MBps
+### TiKV CPU/Memory
 
-- CPU-Avg: Average CPU utilization of all TiKV instances
-- CPU-Delta: Maximum CPU utilization of all TiKV instances minus minimum CPU utilization of all TiKV instances
-- CPU-MAX: Maximum CPU utilization among all TiKV instances
-- IO-Avg: Average MBps of all TiKV instances
-- IO-Delt: Maximum MBps of all TiKV instances minus minimum MBps of all TiKV instances
-- IO-MAX: Maximum MBps of all TiKV instances
+- `CPU-Avg`: Average CPU utilization across all TiKV instances
+- `CPU-Delta`: Maximum CPU utilization of all TiKV instances minus minimum CPU utilization of all TiKV instances
+- `CPU-Max`: Maximum CPU utilization across all TiKV instances
+- `CPU-Quota`: Number of CPU cores that can be used by TiKV
+- `Mem-Max`: Maximum memory utilization across all TiKV instances
+
+### PD CPU/Memory
+
+- `CPU-Max`: Maximum CPU utilization across all PD instances
+- `CPU-Quota`: Number of CPU cores that can be used by PD
+- `Mem-Max`: Maximum memory utilization across all PD instances
+
+### Read Traffic
+
+- `TiDB -> Client`: The outbound traffic statistics from TiDB to the client
+- `Rocksdb -> TiKV`: The data flow that TiKV retrieves from RocksDB during read operations within the storage layer
+
+### Write Traffic
+
+- `Client -> TiDB`: The inbound traffic statistics from the client to TiDB
+- `TiDB -> TiKV: general`: The rate at which foreground transactions are written from TiDB to TiKV
+- `TiDB -> TiKV: internal`: The rate at which internal transactions are written from TiDB to TiKV
+- `TiKV -> Rocksdb`: The flow of write operations from TiKV to RocksDB
+- `RocksDB Compaction`: The total read and write I/O flow generated by RocksDB compaction operations
 
 ### Duration
 
-- Duration: Execution time
+- `Duration`: Execution time
 
     - The duration from when TiDB receives a request from the client until TiDB executes the request and returns the result to the client. In general, client requests are sent in the form of SQL statements; however, this duration can include the execution time of commands such as `COM_PING`, `COM_SLEEP`, `COM_STMT_FETCH`, and `COM_SEND_LONG_DATA`.
     - TiDB supports Multi-Query, which means the client can send multiple SQL statements at one time, such as `select 1; select 1; select 1;`. In this case, the total execution time of this query includes the execution time of all SQL statements.
 
-- avg: Average time to execute all requests
-- 99: P99 duration to execute all requests
-- avg by type: Average time to execute all requests in all TiDB instances, collected by type: `SELECT`, `INSERT`, and `UPDATE`
+- `avg`: Average time to execute all requests
+- `99`: P99 duration to execute all requests
+- `avg by type`: Average time to execute all requests in all TiDB instances, collected by type: `SELECT`, `INSERT`, and `UPDATE`
 
 ### Connection Idle Duration
 
 Connection Idle Duration indicates the duration for which a connection stays idle.
 
-- avg-in-txn: Average connection idle duration when the connection is within a transaction
-- avg-not-in-txn: Average connection idle duration when the connection is not within a transaction
-- 99-in-txn: P99 connection idle duration when the connection is within a transaction
-- 99-not-in-txn: P99 connection idle duration when the connection is not within a transaction
+- `avg-in-txn`: Average connection idle duration when the connection is within a transaction
+- `avg-not-in-txn`: Average connection idle duration when the connection is not within a transaction
+- `99-in-txn`: P99 connection idle duration when the connection is within a transaction
+- `99-not-in-txn`: P99 connection idle duration when the connection is not within a transaction
 
 ### Parse Duration, Compile Duration, and Execute Duration
 
-- Parse Duration: Time consumed in parsing SQL statements
-- Compile Duration: Time consumed in compiling the parsed SQL AST to execution plans
-- Execution Duration: Time consumed in executing execution plans of SQL statements
+- `Parse Duration`: Time consumed in parsing SQL statements
+- `Compile Duration`: Time consumed in compiling the parsed SQL AST into execution plans
+- `Execution Duration`: Time consumed in executing the execution plans of SQL statements
 
 All three of these metrics include the average duration and the 99th percentile duration in all TiDB instances.
 
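A small tail of slow requests can pull the 99th percentile far above the average, which is why the panels above report both series. A minimal sketch with hypothetical durations (illustration only):

```python
import statistics

# Hypothetical per-request durations in milliseconds.
durations_ms = [2, 3, 3, 4, 5, 6, 8, 12, 30, 250]

avg = statistics.mean(durations_ms)                  # 32.3 ms, inflated by one outlier
p99 = statistics.quantiles(durations_ms, n=100)[98]  # ~226 ms with the default method

print(f"avg = {avg:.1f} ms, p99 = {p99:.1f} ms")
```

A healthy `avg` therefore does not rule out user-visible latency spikes; check the `99` series as well.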
@@ -134,25 +154,25 @@ Average time consumed in executing gRPC requests in all TiKV instances based on
 
 ### PD TSO Wait/RPC Duration
 
-- wait - avg: Average time in waiting for PD to return TSO in all TiDB instances
-- rpc - avg: Average time from sending TSO requests to PD to receiving TSO in all TiDB instances
-- wait - 99: P99 time in waiting for PD to return TSO in all TiDB instances
-- rpc - 99: P99 time from sending TSO requests to PD to receiving TSO in all TiDB instances
+- `wait - avg`: Average time waiting for PD to return TSO in all TiDB instances
+- `rpc - avg`: Average time from sending a TSO request to PD to receiving the TSO in all TiDB instances
+- `wait - 99`: P99 time waiting for PD to return TSO in all TiDB instances
+- `rpc - 99`: P99 time from sending a TSO request to PD to receiving the TSO in all TiDB instances
 
 ### Storage Async Write Duration, Store Duration, and Apply Duration
 
-- Storage Async Write Duration: Time consumed in asynchronous write
-- Store Duration: Time consumed in store loop during asynchronously write
-- Apply Duration: Time consumed in apply loop during asynchronously write
+- `Storage Async Write Duration`: Time consumed in asynchronous write
+- `Store Duration`: Time consumed in the store loop during asynchronous write
+- `Apply Duration`: Time consumed in the apply loop during asynchronous write
 
 All three of these metrics include the average duration and P99 duration in all TiKV instances.
 
 Average storage async write duration = Average store duration + Average apply duration
 
 ### Append Log Duration, Commit Log Duration, and Apply Log Duration
 
-- Append Log Duration: Time consumed by Raft to append logs
-- Commit Log Duration: Time consumed by Raft to commit logs
-- Apply Log Duration: Time consumed by Raft to apply logs
+- `Append Log Duration`: Time consumed by Raft to append logs
+- `Commit Log Duration`: Time consumed by Raft to commit logs
+- `Apply Log Duration`: Time consumed by Raft to apply logs
 
 All three of these metrics include the average duration and P99 duration in all TiKV instances.
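The identity above (`Average storage async write duration = Average store duration + Average apply duration`) holds by linearity of the mean, because each request's async write duration decomposes into its store time plus its apply time; the analogous identity is not guaranteed for P99. A minimal check with hypothetical timings (illustration only):

```python
# Hypothetical per-request store and apply durations in milliseconds.
store_ms = [0.8, 1.1, 0.9, 1.4]
apply_ms = [1.5, 2.0, 1.8, 2.3]

def mean(xs):
    return sum(xs) / len(xs)

# Each request's async write duration is its store time plus its apply time,
# so the averages satisfy: avg(async write) = avg(store) + avg(apply).
async_write_ms = [s + a for s, a in zip(store_ms, apply_ms)]
assert abs(mean(async_write_ms) - (mean(store_ms) + mean(apply_ms))) < 1e-9
```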

media/performance/titan_disable.png (120 KB)

media/performance/titan_enable.png (108 KB)

media/performance/tpcc_cpu_memory.png (549 KB)

media/performance/tpcc_read_write_traffic.png (442 KB)

performance-tuning-methods.md (+74 -16)
@@ -216,34 +216,92 @@ In this workload, only `ANALYZE` statements are running in the cluster:
 - The total number of KV requests per second is 35.5 and the number of Cop requests per second is 9.3.
 - Most of the KV processing time is spent on `Cop-internal_stats`, which indicates that the most time-consuming KV request is `Cop` from internal `ANALYZE` operations.
 
-#### TiDB CPU, TiKV CPU, and IO usage
+#### CPU and memory usage
 
-In the TiDB CPU and TiKV CPU/IO MBps panels, you can observe the logical CPU usage and IO throughput of TiDB and TiKV, including average, maximum, and delta (maximum CPU usage minus minimum CPU usage), based on which you can determine the overall CPU usage of TiDB and TiKV.
+In the CPU/Memory panels of TiDB, TiKV, and PD, you can monitor their respective logical CPU usage and memory consumption, including the average CPU, maximum CPU, delta CPU (maximum CPU usage minus minimum CPU usage), CPU quota, and maximum memory usage. Based on these metrics, you can determine the overall resource usage of TiDB, TiKV, and PD.
 
-- Based on the `delta` value, you can determine if CPU usage in TiDB is unbalanced (usually accompanied by unbalanced application connections) and if there are read/write hot spots among the cluster.
-- With an overview of TiDB and TiKV resource usage, you can quickly determine if there are resource bottlenecks in your cluster and whether TiKV or TiDB needs scale-out.
+- Based on the `delta` value, you can determine whether CPU usage in TiDB or TiKV is unbalanced. For TiDB, a high `delta` usually means that application connections are unbalanced among the TiDB instances; for TiKV, a high `delta` usually means that there are read/write hot spots in the cluster. The sketch after this list shows how these series are derived.
+- With an overview of TiDB, TiKV, and PD resource usage, you can quickly determine whether there are resource bottlenecks in your cluster and whether TiDB, TiKV, or PD needs to be scaled out or scaled up.
 
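The following sketch shows how the average, maximum, and delta series are derived from per-instance readings, using hypothetical TiKV values (illustration only, not the dashboard's actual query):

```python
# Hypothetical CPU utilization (%) of each TiKV instance at one sample point.
tikv_cpu = {"tikv-0": 1505.0, "tikv-1": 1290.0, "tikv-2": 1234.0}

cpu_avg = sum(tikv_cpu.values()) / len(tikv_cpu)  # the CPU-Avg series
cpu_max = max(tikv_cpu.values())                  # the CPU-Max series
cpu_delta = cpu_max - min(tikv_cpu.values())      # the CPU-Delta series

# A large delta relative to the average suggests imbalance: unbalanced
# application connections for TiDB, or read/write hot spots for TiKV.
print(f"avg={cpu_avg:.0f}%  max={cpu_max:.0f}%  delta={cpu_delta:.0f}%")
```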
-**Example 1: High TiDB resource usage**
+**Example 1: High TiKV resource usage**
 
-In this workload, each TiDB and TiKV is configured with 8 CPUs.
+In the following TPC-C workload, each TiDB and TiKV instance is configured with 16 CPUs, and PD is configured with 4 CPUs.
 
-![TPC-C](/media/performance/tidb_high_cpu.png)
+![TPC-C](/media/performance/tpcc_cpu_memory.png)
 
-- The average, maximum, and delta CPU usage of TiDB are 575%, 643%, and 136%, respectively.
-- The average, maximum, and delta CPU usage of TiKV are 146%, 215%, and 118%, respectively. The average, maximum, and delta I/O throughput of TiKV are 9.06 MB/s, 19.7 MB/s, and 17.1 MB/s, respectively.
+- The average, maximum, and delta CPU usage of TiDB are 761%, 934%, and 322%, respectively. The maximum memory usage is 6.86 GiB.
+- The average, maximum, and delta CPU usage of TiKV are 1343%, 1505%, and 283%, respectively. The maximum memory usage is 27.1 GiB.
+- The maximum CPU usage of PD is 59.1%, and the maximum memory usage is 221 MiB.
 
-Obviously, TiDB consumes more CPU, which is near the bottleneck threshold of 8 CPUs. It is recommended that you scale out the TiDB.
+Obviously, TiKV consumes more CPU, which is expected because TPC-C is a write-heavy scenario. To improve performance, it is recommended to scale out TiKV.
 
-**Example 2: High TiKV resource usage**
+#### Data traffic
 
-In the TPC-C workload below, each TiDB and TiKV is configured with 16 CPUs.
+The read and write traffic panels offer insights into the traffic patterns within your TiDB cluster, letting you comprehensively monitor the data flow from clients to the database and between its internal components.
 
-![TPC-C](/media/performance/tpcc_cpu_io.png)
+- Read traffic
 
-- The average, maximum, and delta CPU usage of TiDB are 883%, 962%, and 153%, respectively.
-- The average, maximum, and delta CPU usage of TiKV are 1288%, 1360%, and 126%, respectively. The average, maximum, and delta I/O throughput of TiKV are 130 MB/s, 153 MB/s, and 53.7 MB/s, respectively.
+    - `TiDB -> Client`: the outbound traffic statistics from TiDB to the client
+    - `Rocksdb -> TiKV`: the data flow that TiKV retrieves from RocksDB during read operations within the storage layer
 
-Obviously, TiKV consumes more CPU, which is expected because TPC-C is a write-heavy scenario. It is recommended that you scale out the TiKV to improve performance.
+- Write traffic
+
+    - `Client -> TiDB`: the inbound traffic statistics from the client to TiDB
+    - `TiDB -> TiKV: general`: the rate at which foreground transactions are written from TiDB to TiKV
+    - `TiDB -> TiKV: internal`: the rate at which internal transactions are written from TiDB to TiKV
+    - `TiKV -> Rocksdb`: the flow of write operations from TiKV to RocksDB
+    - `RocksDB Compaction`: the total read and write I/O flow generated by RocksDB compaction operations. If `RocksDB Compaction` is significantly higher than `TiKV -> Rocksdb` and your average row size is larger than 512 bytes, you can enable Titan to reduce the compaction I/O flow, setting `min-blob-size` to `"512B"` or `"1KB"` and `blob-file-compression` to `"zstd"` as follows:
+
+        ```toml
+        [rocksdb.titan]
+        enabled = true
+        [rocksdb.defaultcf.titan]
+        min-blob-size = "1KB"
+        blob-file-compression = "zstd"
+        ```
+
+**Example 1: Read and write traffic in the TPC-C workload**
+
+The following is an example of read and write traffic in the TPC-C workload.
+
+- Read traffic
+
+    - `TiDB -> Client`: 14.2 MB/s
+    - `Rocksdb -> TiKV`: 469 MB/s. Note that both read operations (`SELECT` statements) and write operations (`INSERT`, `UPDATE`, and `DELETE` statements) require reading data from RocksDB into TiKV before committing a transaction.
+
+- Write traffic
+
+    - `Client -> TiDB`: 5.05 MB/s
+    - `TiDB -> TiKV: general`: 13.1 MB/s
+    - `TiDB -> TiKV: internal`: 5.07 KB/s
+    - `TiKV -> Rocksdb`: 109 MB/s
+    - `RocksDB Compaction`: 567 MB/s
+
+![TPC-C](/media/performance/tpcc_read_write_traffic.png)
+
+**Example 2: Write traffic before and after Titan is enabled**
+
+The following example shows the performance changes before and after Titan is enabled. For an insert workload with 6 KB records, Titan significantly reduces the write traffic and compaction I/O, enhancing the overall performance and resource utilization of TiKV.
+
+- Write traffic before Titan is enabled
+
+    - `Client -> TiDB`: 510 MB/s
+    - `TiDB -> TiKV: general`: 187 MB/s
+    - `TiDB -> TiKV: internal`: 3.2 KB/s
+    - `TiKV -> Rocksdb`: 753 MB/s
+    - `RocksDB Compaction`: 10.6 GB/s
+
+![Titan Disable](/media/performance/titan_disable.png)
+
+- Write traffic after Titan is enabled
+
+    - `Client -> TiDB`: 586 MB/s
+    - `TiDB -> TiKV: general`: 295 MB/s
+    - `TiDB -> TiKV: internal`: 3.66 KB/s
+    - `TiKV -> Rocksdb`: 1.21 GB/s
+    - `RocksDB Compaction`: 4.68 MB/s
+
+![Titan Enable](/media/performance/titan_enable.png)
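One way to read the Example 2 figures above is to compare compaction I/O against the rate at which TiKV writes into RocksDB. This back-of-envelope ratio (an interpretation of the panel readings, not an official metric) makes the drop in compaction overhead explicit:

```python
# Panel readings from Example 2 above, converted to MB/s.
before = {"tikv_to_rocksdb": 753.0, "compaction": 10600.0}  # Titan disabled
after = {"tikv_to_rocksdb": 1210.0, "compaction": 4.68}     # Titan enabled

# Compaction I/O per MB ingested into RocksDB: a rough proxy for the extra
# I/O that compaction adds on top of each megabyte written.
ratio_before = before["compaction"] / before["tikv_to_rocksdb"]  # about 14.1
ratio_after = after["compaction"] / after["tikv_to_rocksdb"]     # about 0.004

print(f"compaction I/O per MB written: {ratio_before:.1f} -> {ratio_after:.3f}")
```

With large values moved into Titan blob files, most of the compaction traffic disappears, which matches the intent of `min-blob-size` in the configuration shown earlier.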
 
 ### Query latency breakdown and key latency metrics
 
