101 changes: 83 additions & 18 deletions docs/gettingStarted/before-you-start-the-poc.md
@@ -2,32 +2,41 @@
{
"title": "Before You Start the POC",
"language": "en",
"description": "Common issues new users encounter with table design, data loading, and query tuning in Apache Doris.",
"description": "Apache Doris POC checklist: covers table design (data model, sort key, partitioning, bucketing), data loading best practices, query tuning, and data lake (Hive, Iceberg, Paimon) query optimization to help new users complete POC validation quickly.",
"sidebar_label": "Before You Start the POC"
}
---

# Before You Start the POC
This document highlights common issues that new users may encounter, with the goal of accelerating the POC process. The content is organized by the typical POC workflow:

This document highlights common issues that new users may encounter, with the goal of accelerating the POC process.
1. **Table Design** — Choose the data model, sort key, partitioning, and bucketing strategy.
2. **Data Loading** — Pick the right loading method and avoid common pitfalls.
3. **Query Tuning** — Diagnose slow queries and optimize bucketing and index configuration.
4. **Data Lake Queries** — Additional optimization tips for Lakehouse scenarios.

## Table Design

Creating a table in Doris involves four decisions that affect load and query performance.
Creating a table in Doris involves four decisions that affect load and query performance: data model, sort key, partitioning, and bucketing.

### Data Model

| If your data is... | Use | Why |
Choose the model based on how your data is written:

| Data Characteristics | Recommended Model | Why |
|---|---|---|
| Append-only (logs, events, facts) | **Duplicate Key** (default) | Keeps all rows. Best query performance. |
| Updated by primary key (CDC, upsert) | **Unique Key** | New rows replace old rows with the same key. |
| Pre-aggregated metrics (PV, UV, sums) | **Aggregate Key** | Rows are merged with SUM/MAX/MIN at write time. |
| Append-only (logs, events, facts) | **Duplicate Key** (default) | Keeps all rows, best query performance |
| Updated by primary key (CDC, upsert) | **Unique Key** | New rows replace old rows with the same key |
| Pre-aggregated metrics (PV, UV, sums) | **Aggregate Key** | Rows are merged with SUM/MAX/MIN at write time |

**Duplicate Key works for most scenarios.** See [Data Model Overview](../table-design/data-model/overview).

### Sort Key

Put the column you filter on most frequently first, with fixed-size types (INT, BIGINT, DATE) before VARCHAR. Doris builds a [prefix index](../table-design/index/prefix-index) on the first 36 bytes of key columns but stops at the first VARCHAR. Add [inverted indexes](../table-design/index/inverted-index/overview) for other columns that need fast filtering.
Doris builds a [prefix index](../table-design/index/prefix-index) on the first 36 bytes of key columns. Follow these principles when setting the sort key:

- **Frequently filtered columns first**: Put the columns most commonly used in WHERE conditions at the front.
- **Fixed-size types first**: Place INT, BIGINT, DATE, and other fixed-size types before VARCHAR, because the prefix index stops at the first VARCHAR column.
- **Add inverted indexes**: For columns not covered by the prefix index, add [inverted indexes](../table-design/index/inverted-index/overview) to speed up filtering.
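
The ordering principles above can be sketched in a table definition. This is only an illustration and is not taken from the original document: the table name `example_events`, every column name, and the bucket count are hypothetical.

```sql
-- Hypothetical table; names, types, and bucket count are illustrative assumptions.
CREATE TABLE example_events (
    event_date DATE   NOT NULL,         -- most frequently filtered column goes first
    site_id    INT    NOT NULL,         -- fixed-size types come before any VARCHAR
    user_id    BIGINT NOT NULL,
    url        VARCHAR(512),            -- not part of the sort key
    INDEX idx_url (url) USING INVERTED  -- inverted index for filters on url
)
DUPLICATE KEY(event_date, site_id, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
```

On a test cluster with fewer than three BEs, the statement may also need `PROPERTIES ("replication_num" = "1")`, because the default replica count is 3.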

### Partitioning

@@ -39,10 +48,12 @@ Default is **Random bucketing** (recommended for Duplicate Key tables). Use `DIS

**How to choose bucket count:**

1. **Multiple of BE count** to ensure even data distribution. When BEs are added later, queries typically scan multiple partitions, so performance holds up.
2. **As low as possible** to avoid small files.
3. **Compressed data per bucket ≤ 20 GB** (≤ 10 GB for Unique Key). Check with `SHOW TABLETS FROM your_table`.
4. **No more than 128 per partition.** Consider partitioning first if you need more. In extreme cases the upper bound is 1024, but this is rarely needed in production.
| Principle | Details |
|---|---|
| Multiple of BE count | Ensures even data distribution. When BEs are added later, queries typically scan multiple partitions, so performance holds up |
| As low as possible | Avoids producing small files |
| Compressed data per bucket ≤ 20 GB | ≤ 10 GB for Unique Key tables. Check with `SHOW TABLETS FROM your_table` |
| No more than 128 per partition | Consider adding more partitions first if you need more. In extreme cases the upper bound is 1024, but this is rarely needed in production |
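
As a worked example of the sizing rules above, assume a cluster of 3 BEs and about 100 GB of compressed data per partition (both figures are made up for illustration). The 20 GB limit calls for at least 5 buckets, and rounding up to a multiple of the BE count gives 6:

```sql
-- Assumed inputs: 100 GB compressed per partition, 20 GB per bucket, 3 BEs.
-- Inner CEIL: minimum buckets to stay under the size limit.
-- Outer CEIL: round that minimum up to a multiple of the BE count.
SELECT CEIL(CEIL(100 / 20) / 3) * 3 AS suggested_buckets;  -- 6
```

For a Unique Key table, the same calculation would use 10 instead of 20 as the per-bucket limit.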

### Example Templates

@@ -94,9 +105,9 @@ AUTO PARTITION BY RANGE(date_trunc(`dt`, 'day'))
DISTRIBUTED BY HASH(site_id) BUCKETS 10;
```

## Performance Pitfalls
## Data Loading

### Load
Choose the right loading method and follow these best practices to avoid common performance issues:

- **Don't use `INSERT INTO VALUES` for bulk data.** Use [Stream Load](../data-operate/import/import-way/stream-load-manual) or [Broker Load](../data-operate/import/import-way/broker-load-manual) instead. See [Loading Overview](../data-operate/import/load-manual).
- **Batch writes on the client side.** High-frequency small imports cause version accumulation. If not feasible, use [Group Commit](../data-operate/import/group-commit-manual).
@@ -105,11 +116,65 @@ DISTRIBUTED BY HASH(site_id) BUCKETS 10;

See [Load Best Practices](../data-operate/import/load-best-practices).
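
When client-side batching is not feasible, Group Commit can be enabled per session. The following is a minimal sketch that assumes a Doris version with Group Commit support and a hypothetical table `example_events(event_date, site_id, user_id, url)`. See the Group Commit manual linked above for the authoritative options.

```sql
-- Enable Group Commit in asynchronous mode for this session.
SET group_commit = async_mode;

-- Small, frequent inserts are then batched on the server side.
-- example_events and its values are hypothetical.
INSERT INTO example_events VALUES ('2024-01-01', 1, 1001, 'https://example.com/a');
```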

### Query
## Query Tuning

### Bucketing

Bucket count directly affects query parallelism and scheduling overhead — strike a balance between the two:

- **Don't over-bucket.** Too many small tablets create scheduling overhead and can degrade query performance by up to 50%.
- **Don't under-bucket.** Too few tablets limit CPU parallelism.
- **Avoid data skew.** Check tablet sizes with `SHOW TABLETS`. Switch to Random bucketing or a higher-cardinality bucket column if sizes vary significantly.
- **Don't over-bucket.** Too many small tablets create scheduling overhead and can degrade query performance by up to 50%. See [Bucketing](#bucketing) for sizing guidelines.
- **Don't under-bucket.** Too few tablets limit CPU parallelism. See [Bucketing](#bucketing) for sizing guidelines.

See [Bucketing](#bucketing) for sizing guidelines.
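
A skew check might look like the following sketch. `your_table` is the placeholder used earlier in this document, and `example_logs` with its columns and bucket count is hypothetical.

```sql
-- List every tablet of the table; compare the reported sizes to spot skew.
SHOW TABLETS FROM your_table;

-- If sizes vary significantly, a new table can be declared with Random bucketing.
CREATE TABLE example_logs (
    log_time DATETIME NOT NULL,
    message  VARCHAR(1024)
)
DUPLICATE KEY(log_time)
DISTRIBUTED BY RANDOM BUCKETS 6;
```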

### Indexes

- **Put the right columns in the sort key.** Unlike systems such as PostgreSQL, Doris only indexes the first 36 bytes of key columns and stops at the first VARCHAR. Columns beyond this prefix won't benefit from the sort key. Add [inverted indexes](../table-design/index/inverted-index/overview) for those columns. See [Sort Key](#sort-key).

### Diagnostic Tools

See [Query Profile](../query-acceleration/query-profile) to diagnose slow queries.

## Data Lake Queries

If your POC involves querying data in Hive, Iceberg, Paimon, or other data lakes through Doris (i.e., a Lakehouse scenario), the following points have the greatest impact on test results.

### Ensure Partition Pruning is Effective

Lake tables often hold massive amounts of data. Always include partition columns in your WHERE conditions so that Doris only scans the necessary partitions. Use `EXPLAIN <SQL>` to check the `partition` field and verify that pruning is working:

```
0:VPAIMON_SCAN_NODE(88)
partition=203/0 -- 203 partitions pruned, 0 actually scanned
```

If the partition count is much higher than expected, check whether your WHERE conditions correctly match the partition columns.
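
A pruning check could be run as in the sketch below. The catalog, database, table, and column names are hypothetical, and `dt` is assumed to be the partition column.

```sql
-- Hypothetical external table; dt is assumed to be its partition column.
EXPLAIN
SELECT count(*)
FROM paimon_catalog.sales_db.orders
WHERE dt = '2024-01-01';
```

In the plan output, inspect the `partition` field on the scan node, as in the excerpt above.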

### Enable Data Cache

Remote storage (HDFS/object storage) has significantly higher IO latency than local disks. Data Cache caches recently accessed remote data on BE local disks, **delivering near-internal-table query performance for repeated queries on the same dataset**.

- Cache is disabled by default. See the [Data Cache](../lakehouse/data-cache) documentation to configure and enable it.
- Since version 4.0.2, **cache warmup** is supported, allowing you to proactively load hot data before POC testing.

:::tip
During POC, run a query once to populate the cache, then use the latency of the second query as the benchmark. This more accurately reflects steady-state production performance.
:::

### Address Small Files

Data lake storage often contains a large number of small files. Each file produces at least one split, so many small files mean many splits, which increases FE memory pressure (potentially causing OOM) and raises query planning overhead.

- **Fix at source (recommended):** Periodically compact small files on the Hive/Spark side, keeping each file above 128 MB.
- **Doris-side safeguard:** Use `SET max_file_split_num = 50000;` (supported since 4.0.4) to limit the maximum number of splits per scan and prevent OOM.

### Use Query Profile for Diagnosis

The bottleneck of data lake queries is typically IO rather than computation. [Query Profile](../query-acceleration/query-profile) can help locate the root cause of slow queries. Focus on:

- **Split count and data volume**: Determine if too much data is being scanned.
- **MergeIO metrics**: If `MergedBytes` is much larger than `RequestBytes`, read amplification is severe. Reduce `merge_io_read_slice_size_bytes` (default 8 MB) to mitigate.
- **Cache hit rate**: Confirm that Data Cache is working effectively.
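
The session setup for such a diagnosis might look like this sketch. It assumes that `merge_io_read_slice_size_bytes` can be set per session; the 4 MB value is only an illustrative starting point, not a recommendation from the original document.

```sql
-- Collect a profile for subsequent queries in this session.
SET enable_profile = true;

-- Assumption: this variable is adjustable per session.
-- 4194304 bytes = 4 MB, half of the 8 MB default mentioned above.
SET merge_io_read_slice_size_bytes = 4194304;
```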

For more optimization techniques, see [Data Lake Query Optimization](../lakehouse/best-practices/optimization).
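@@ -2,32 +2,41 @@
{
"title": "Before You Start the POC",
"language": "zh-CN",
"description": "Common issues new users encounter with table design, data loading, and query tuning in Apache Doris.",
"description": "Apache Doris POC must-read: covers table design (data model, sort key, partitioning, bucketing), data loading best practices, query tuning, and data lake (Hive, Iceberg, Paimon) query optimization to help new users complete POC validation quickly.",
"sidebar_label": "Before You Start the POC"
}
---

# Before You Start the POC
This document summarizes common issues that new users encounter, with the goal of accelerating the POC process. The content is organized by the typical POC workflow:

This document summarizes common issues that new users encounter, with the goal of accelerating the POC process.
1. **Table Design**: Choose the data model, sort key, partitioning, and bucketing strategy.
2. **Data Loading**: Pick the right loading method and avoid common pitfalls.
3. **Query Tuning**: Troubleshoot slow queries and optimize bucketing and index configuration.
4. **Data Lake Queries**: Additional optimization points for Lakehouse scenarios.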

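## Table Design

Creating a table in Doris involves four decisions that affect load and query performance.
Creating a table in Doris involves four decisions that affect load and query performance: data model, sort key, partitioning, and bucketing.

### Data Model

| Data Characteristics | Use | Why |
Choose the model based on how your data is written:

| Data Characteristics | Recommended Model | Why |
|---|---|---|
| Append-only (logs, events, fact tables) | **Duplicate Key** (default) | Keeps all rows; best query performance |
| Updated by primary key (CDC, upsert) | **Unique Key** | New rows replace old rows with the same key |
| Pre-aggregated metrics (PV, UV, sums) | **Aggregate Key** | Rows are merged with SUM/MAX/MIN at write time |
| Append-only (logs, events, fact tables) | **Duplicate Key** (default) | Keeps all rows; best query performance |
| Updated by primary key (CDC, upsert) | **Unique Key** | New rows replace old rows with the same key |
| Pre-aggregated metrics (PV, UV, sums) | **Aggregate Key** | Rows are merged with SUM/MAX/MIN at write time |

**Duplicate Key works for most scenarios.** See [Data Model Overview](../table-design/data-model/overview).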

### Sort Key(排序键)

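Put the column most frequently used for filtering first, with fixed-size types (INT, BIGINT, DATE) before VARCHAR. Doris builds a [prefix index](../table-design/index/prefix-index) on the first 36 bytes of the sort key, but truncates it at the first VARCHAR. Add [inverted indexes](../table-design/index/inverted-index/overview) for other columns that need fast filtering.
Doris builds a [prefix index](../table-design/index/prefix-index) on the first 36 bytes of the sort key. Follow these principles when setting the sort key:

- **Frequently filtered columns first**: Put the columns most commonly used in WHERE conditions at the front.
- **Fixed-size types first**: Place INT, BIGINT, DATE, and other fixed-size types before VARCHAR, because the prefix index is truncated at the first VARCHAR.
- **Add inverted indexes**: For columns not covered by the prefix index, add [inverted indexes](../table-design/index/inverted-index/overview) to speed up filtering.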

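### Partitioning

@@ -39,10 +48,12 @@

**How to choose the bucket count:**

1. **Set it to a multiple of the BE count** to ensure even data distribution. When BEs are added later, queries typically touch multiple partitions, so performance is not affected.
2. **As low as possible** to avoid small files.
3. **Compressed data per bucket ≤ 20 GB** (≤ 10 GB for Unique Key tables). Check with `SHOW TABLETS FROM your_table`.
4. **No more than 128 buckets per partition.** Consider partitioning first if you need more. In extreme cases the upper bound is 1024, but this is rarely needed in production.
| Principle | Details |
|---|---|
| Multiple of the BE count | Ensures even data distribution. When BEs are added later, queries typically touch multiple partitions, so performance is not affected |
| As low as possible | Avoids producing small files |
| Compressed data per bucket ≤ 20 GB | ≤ 10 GB for Unique Key tables. Check with `SHOW TABLETS FROM your_table` |
| No more than 128 buckets per partition | Consider adding partitions first if you need more. In extreme cases the upper bound is 1024, but this is rarely needed in production |

### Table Templates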

@@ -94,9 +105,9 @@ AUTO PARTITION BY RANGE(date_trunc(`dt`, 'day'))
DISTRIBUTED BY HASH(site_id) BUCKETS 10;
```

## Performance Pitfalls
## Data Loading

### Load
Choose the right loading method and follow these best practices to avoid common performance issues:

- **Don't use `INSERT INTO VALUES` for bulk data.** Use [Stream Load](../data-operate/import/import-way/stream-load-manual) or [Broker Load](../data-operate/import/import-way/broker-load-manual) instead. See [Loading Overview](../data-operate/import/load-manual).
- **Prefer batching writes on the client side.** High-frequency small-batch imports cause version accumulation. If that is not feasible, use [Group Commit](../data-operate/import/group-commit-manual).
@@ -105,11 +116,65 @@ DISTRIBUTED BY HASH(site_id) BUCKETS 10;

See [Load Best Practices](../data-operate/import/load-best-practices).

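### Query
## Query Tuning

### Bucketing

The bucket count directly affects query parallelism and scheduling overhead, so you need to strike a balance between the two:

- **Don't over-bucket.** Too many small tablets create scheduling overhead and can degrade query performance by up to 50%.
- **Don't under-bucket.** Too few tablets limit CPU parallelism.
- **Avoid data skew.** Check tablet sizes with `SHOW TABLETS`. If sizes differ noticeably, switch to Random bucketing or choose a higher-cardinality bucket column.
- **Don't over-bucket.** Too many small tablets create scheduling overhead and can degrade query performance by up to 50%. See [Bucketing](#分桶) for choosing the bucket count.
- **Don't under-bucket.** Too few tablets limit CPU parallelism. See [Bucketing](#分桶) for choosing the bucket count.

See [Bucketing](#分桶) for choosing the bucket count.

### Indexes

- **Set the sort key correctly.** Unlike systems such as PostgreSQL, Doris only indexes the first 36 bytes of the sort key and truncates at the first VARCHAR. Columns beyond the prefix cannot benefit from the sort key; add [inverted indexes](../table-design/index/inverted-index/overview) for them. See [Sort Key](#sort-key排序键).

### Diagnostic Tools

Use [Query Profile](../query-acceleration/query-profile) to diagnose slow queries.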

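## Data Lake Queries

If your POC involves querying lake data in Hive, Iceberg, Paimon, and similar systems through Doris (that is, a Lakehouse scenario), the following points have the greatest impact on test results.

### Ensure Partition Pruning Takes Effect

Lake tables often hold massive amounts of data. Always include partition columns in your WHERE conditions so that Doris scans only the necessary partitions. Use `EXPLAIN <SQL>` to check the `partition` field and confirm that pruning is in effect: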

```
0:VPAIMON_SCAN_NODE(88)
partition=203/0 -- 203 partitions pruned, 0 actually scanned
```

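If the partition count is much higher than expected, check whether the WHERE conditions of the query correctly match the partition columns.

### Enable Data Cache

Remote storage (HDFS/object storage) has IO latency several times higher than local disks. Data Cache caches recently accessed remote data on BE local disks, so **repeated queries on the same dataset achieve query performance close to that of internal tables**.

- The cache is disabled by default. See the [Data Cache](../lakehouse/data-cache) documentation to configure it.
- Since version 4.0.2, **cache warmup** is supported, allowing you to proactively load hot data before POC testing.

:::tip
During the POC, run a query once to populate the cache, then use the latency of the second query as the benchmark. This gives a more accurate picture of steady-state production performance.
:::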

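### Address Small Files

Lake data often contains a large number of small files. Small files are broken into a large number of splits, which increases FE memory pressure (even to the point of OOM) and raises query planning overhead.

- **Fix at the source (recommended):** Periodically compact small files on the Hive/Spark side, keeping each file above 128 MB.
- **Doris-side safeguard:** Use `SET max_file_split_num = 50000;` (supported since 4.0.4) to limit the maximum number of splits per scan and prevent OOM.

### Use Query Profile for Diagnosis

The bottleneck of lake queries is usually IO rather than computation. [Query Profile](../query-acceleration/query-profile) can locate the root cause of slow queries. Focus on:

- **Split count and data volume**: Determine whether too much data is being scanned.
- **MergeIO metrics**: If `MergedBytes` is much larger than `RequestBytes`, read amplification is severe. Reduce `merge_io_read_slice_size_bytes` (default 8 MB) to mitigate it.
- **Cache hit rate**: Confirm that Data Cache is working effectively.

For more optimization techniques, see [Data Lake Query Optimization](../lakehouse/best-practices/optimization).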