Postgres存储引擎性能简测：Hydra vs Heap

Hydra是为Postgres设计的OLAP引擎。和原生的Heap引擎相比，它的优势究竟如何呢？

最近在Aliyun上做了简单的测试对比了一下，感觉名副其实。

测试环境

Server端操作系统使用了Fedora官方镜像Fedora-Cloud-Base-38-1.6.x86_64.qcow2，自行编译安装了Postgres 14.9。

Client端操作系统选择了Arch官方镜像Arch-Linux-x86_64-cloudimg-20230915.178838.qcow2，自行编译安装了Postgres 14.9。

服务器选择了ecs.r6.xlarge，为4vCore+32GB。服务器CPU为Intel Xeon Platinum 8269CY（Cascade Lake），2.5GHz主频，睿频3.2GHz。

数据挂载使用了ESSD AutoPL型块存储，大小为100GB，开起了性能突发模式。（血泪的教训，千万不要开启这个模式，哪怕用大一点的硬盘也比这个强）。

测试工具

这里直接选用了Postgres官方提供了pgbench工具。pgbench主要针对OLTP场景，因此在OLAP场景下，自己另外选择了几个分析性SQL语句，简单的对比了一下。

pgbench的数据集模拟了银行储户场景。其中，pgbench_branches表存储银行机构，pgbench_accounts表存储储蓄账户，pgbench_history表存储储蓄账户的交易记录。每家银行机构对应10万储蓄账户。如果需要增加数据集大小，可以在初始化的时候提供更大的scale factor因子，工具可以生成更多的银行机构数量。

数据集初始化

这次scale factor选择了4000。也就是，生成4000个银行机构，和4亿储蓄账户。

/pg/14/bin/pgbench -i -s 4000 -h ${PG_HOST} -U postgres postgres

在Heap引擎中，生成的数据大小为61GB。

在Hydra引擎中，生成的数据大小仅为12GB。五分之一！OMG，硬盘空间还是很节省的。

用时对比

Engine	Total Time	Insert Time	Vacuum Time	Primary Key Index Creation Time
Heap	1534.9	515.27	519.48	500.15
Hydra	478.46	273.79	25.36	179.30

可以看出，无论是批量插入数据，还是Vacuum、索引构建这种维护性操作，Hydra在用时上都大为节省。

单点查询测试

单点查询对于OLTP场景更重要。Hydra因为列式存储的原因，并不擅长此场景。

/pg/14/bin/pgbench -c 100 -j 150 -t 1000 -h ${PG_HOST} --select-only -U postgres postgres

对Heap引擎，测试结果还不错

pgbench (14.9)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 4000
query mode: simple
number of clients: 100
number of threads: 1
number of transactions per client: 100
number of transactions actually processed: 10000/10000
latency average = 21.048 ms
latency stddev = 14.351 ms
initial connection time = 229.074 ms
tps = 4468.516841 (without initial connection time)

但对于Hydra引擎，上述参数测试极慢。这里命令会每10s报告一次阶段性的结果，做个参考。

$> /pg/14/bin/pgbench -c 100 -j 150 -t 1000 -h ${PG_HOST} --select-only -U postgres -P 10 postgres
pgbench (14.9)
starting vacuum...end.
progress: 10.0 s, 288.5 tps, lat 337.378 ms stddev 34.884
progress: 20.0 s, 302.1 tps, lat 331.063 ms stddev 25.183
progress: 30.0 s, 301.8 tps, lat 331.303 ms stddev 22.239
progress: 40.0 s, 303.1 tps, lat 331.062 ms stddev 23.108
progress: 50.0 s, 299.2 tps, lat 334.344 ms stddev 23.207
progress: 60.0 s, 300.4 tps, lat 332.583 ms stddev 23.476
progress: 70.0 s, 300.3 tps, lat 332.280 ms stddev 24.434
progress: 80.0 s, 301.6 tps, lat 332.331 ms stddev 27.919
^C

分析性查询

在初始化后的数据集上，这里跑了一下pgbench的默认脚本。默认脚本能够，修改储蓄账户的当前账额，增加交易记录。当储蓄账户有了变更，就可以做一些分析了。

/pg/14/bin/pgbench -c 10 -j 15 -t 100 -h ${PG_HOST} -U postgres postgres

分析场景一

这个场景较为简单。找出变更后的pgbench_accounts表中最大的100个账户。

select * from pgbench_accounts order by abalance desc limit 100;

在Heap引擎中，explain analyze结果是这样的

QUERY PLAN                                                                            
----------
 Limit  (cost=14595447.87..14595459.54 rows=100 width=97) (actual time=256246.481..256252.433 rows=100 loops=1)
   ->  Gather Merge  (cost=14595447.87..53488445.92 rows=333345280 width=97) (actual time=256246.480..256252.424 rows=100 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=14594447.85..15011129.45 rows=166672640 width=97) (actual time=256243.351..256243.356 rows=67 loops=3)
               Sort Key: abalance DESC
               Sort Method: top-N heapsort  Memory: 39kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 39kB
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..8224339.40 rows=166672640 width=97) (actual time=0.341..246211.910 rows=133333333 loops=3)
 Planning Time: 0.073 ms
 Execution Time: 256252.458 ms
(12 rows)

PG启动了2个worker进程，并行地扫描了全表。

而在Hydra引擎中，explain analyze结果是这样的

QUERY PLAN                                                                                   
----------
 Limit  (cost=2206605.35..2206617.64 rows=100 width=97) (actual time=40191.632..40198.202 rows=100 loops=1)
   ->  Gather Merge  (cost=2206605.35..51349585.35 rows=400001000 width=97) (actual time=40191.630..40198.193 rows=100 loops=1)
         Workers Planned: 7
         Workers Launched: 7
         ->  Sort  (cost=2205605.23..2348462.73 rows=57143000 width=97) (actual time=40150.003..40150.010 rows=88 loops=8)
               Sort Key: abalance DESC
               Sort Method: top-N heapsort  Memory: 39kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 2:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 3:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 4:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 5:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 6:  Sort Method: top-N heapsort  Memory: 39kB
               ->  Parallel Custom Scan (ColumnarScan) on pgbench_accounts  (cost=0.00..21640.86 rows=57143000 width=97) (actual time=11.977..18766.711 rows=50000000 loops=8)
                     Columnar Projected Columns: aid, bid, abalance, filler
 Planning Time: 4.511 ms
 Execution Time: 40198.252 ms
(18 rows)

PG无视了默认参数启动了7个worker进程，并行地扫描了全表。

当修改了默认参数后，set max_parallel_workers_per_gather = 8，Heap引擎的表现稍有加强，但仍然劣于Hydra的表现

postgres=# explain analyze select * from pgbench_accounts order by abalance desc limit 100;
QUERY PLAN                                                                           
----------
 Limit  (cost=8969663.60..8969675.96 rows=100 width=97) (actual time=127956.821..127957.997 rows=100 loops=1)
   ->  Gather Merge  (cost=8969663.60..58436286.42 rows=400014336 width=97) (actual time=127956.819..127957.986 rows=100 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         ->  Sort  (cost=8968663.46..9093667.94 rows=50001792 width=97) (actual time=127946.697..127946.707 rows=90 loops=8)
               Sort Key: abalance DESC
               Sort Method: top-N heapsort  Memory: 39kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 2:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 3:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 4:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 5:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 6:  Sort Method: top-N heapsort  Memory: 39kB
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..7057630.92 rows=50001792 width=97) (actual time=0.118..120697.600 rows=50000000 loops=8)
 Planning Time: 0.066 ms
 Execution Time: 127958.024 ms
(17 rows)

只有给abalance列添加索引后，Heap引擎的分析时间才减下来。但这个属于空间换时间了。

postgres=# CREATE INDEX acc_abalance_idx ON pgbench_accounts (abalance);
CREATE INDEX
Time: 425301.879 ms (07:05.302)

postgres=# explain analyze select * from pgbench_accounts order by abalance desc limit 100;
QUERY PLAN                                                                             
----------
 Limit  (cost=0.57..4.05 rows=100 width=97) (actual time=0.052..3.333 rows=100 loops=1)
   ->  Index Scan Backward using acc_abalance_idx on pgbench_accounts  (cost=0.57..13911160.57 rows=400000000 width=97) (actual time=0.050..3.324 rows=100 loops=1)
 Planning Time: 0.388 ms
 Execution Time: 3.347 ms
(4 rows)

Time: 4.220 ms

分析场景二

这次是，找出3个帐额不为0的储蓄账户。

select * from pgbench_accounts where abalance != 0 limit 3;

在Heap引擎中，explain analyze结果是这样的

QUERY PLAN                                                                   
----------
 Limit  (cost=1000.00..7183635.50 rows=1 width=97) (actual time=186459.205..271126.284 rows=3 loops=1)
   ->  Gather  (cost=1000.00..7183635.50 rows=1 width=97) (actual time=186459.203..271126.281 rows=3 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..7182635.40 rows=1 width=97) (actual time=133853.479..254216.098 rows=2 loops=8)
               Filter: (abalance <> 0)
               Rows Removed by Filter: 49971072
 Planning Time: 0.282 ms
 Execution Time: 271126.301 ms
(9 rows)

在Hydra引擎中，explain analyze结果是这样的

QUERY PLAN                                                                        
----------
 Limit  (cost=0.00..0.30 rows=3 width=97) (actual time=24996.779..24996.782 rows=3 loops=1)
   ->  Custom Scan (ColumnarScan) on pgbench_accounts  (cost=0.00..40151586.02 rows=400001000 width=97) (actual time=24996.777..24996.779 rows=3 loops=1)
         Columnar Projected Columns: aid, bid, abalance, filler
         Columnar Vectorized Filter: (abalance <> 0)
 Planning Time: 7.015 ms
 Execution Time: 25065.955 ms
(6 rows)

不等谓词真是万恶之源，PG选择了全表扫描。

让人疑惑的是，Heap引擎中，哪怕有索引也不用。也许是个bug。

而Hydra有自己的定制的扫描和过滤方法。尽管规划的时间较长，但总体执行时间只有Heap引擎的十分之一！

分析场景三

这次是一个比较复杂的查询。找出非零帐额账户数量最高的10家银行机构。

select bid, count(0) as cnt 
from pgbench_accounts
where abalance != 0 
group by bid 
order by cnt desc 
limit 10;

在Heap引擎中，explain analyze结果是这样的

QUERY PLAN                                                                             
----------
 Limit  (cost=7183635.54..7183635.54 rows=1 width=12) (actual time=256178.105..256178.221 rows=10 loops=1)
   ->  Sort  (cost=7183635.54..7183635.54 rows=1 width=12) (actual time=256178.104..256178.219 rows=10 loops=1)
         Sort Key: (count(0)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=7183635.51..7183635.53 rows=1 width=12) (actual time=256175.444..256177.863 rows=3681 loops=1)
               Group Key: bid
               ->  Sort  (cost=7183635.51..7183635.51 rows=1 width=4) (actual time=256175.434..256176.439 rows=9997 loops=1)
                     Sort Key: bid
                     Sort Method: quicksort  Memory: 853kB
                     ->  Gather  (cost=1000.00..7183635.50 rows=1 width=4) (actual time=0.344..256171.733 rows=9997 loops=1)
                           Workers Planned: 8
                           Workers Launched: 7
                           ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..7182635.40 rows=1 width=4) (actual time=132371.762..256165.095 rows=1250 loops=8)
                                 Filter: (abalance <> 0)
                                 Rows Removed by Filter: 49998750
 Planning Time: 0.993 ms
 Execution Time: 256178.292 ms
(17 rows)

在Hydra引擎中，explain analyze结果是这样的

QUERY PLAN                                                                                      
----------
 Limit  (cost=11820.57..11820.57 rows=1 width=12) (actual time=4849.581..4853.673 rows=10 loops=1)
   ->  Sort  (cost=11820.57..11820.57 rows=1 width=12) (actual time=4849.579..4853.671 rows=10 loops=1)
         Sort Key: (count(0)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=11820.54..11820.56 rows=1 width=12) (actual time=4849.074..4853.536 rows=877 loops=1)
               Group Key: bid
               ->  Sort  (cost=11820.54..11820.54 rows=1 width=4) (actual time=4849.062..4853.249 rows=999 loops=1)
                     Sort Key: bid
                     Sort Method: quicksort  Memory: 71kB
                     ->  Gather  (cost=1000.00..11820.53 rows=1 width=4) (actual time=4806.238..4852.964 rows=999 loops=1)
                           Workers Planned: 7
                           Workers Launched: 7
                           ->  Parallel Custom Scan (ColumnarScan) on pgbench_accounts  (cost=0.00..10820.43 rows=57143000 width=4) (actual time=4792.031..4794.666 rows=125 loops=8)
                                 Columnar Projected Columns: bid, abalance
                                 Columnar Vectorized Filter: (abalance <> 0)
 Planning Time: 4.714 ms
 Execution Time: 4862.387 ms
(16 rows) 

Hydra只在两个必要的列上使用7个worker执行了并行全列扫描，并使用了向量化过滤器完成过滤，仅仅使用了Heap引擎1.9%的时间，堪称碾压。

总结

在OLAP场景中，动辄需要全列分析时：Heap因为行式元组的原因，被迫全表扫描；而Hydra只读取必要的列，而且还是压缩过的，所以在硬盘IO方面有很大优势。

要想在OLAP场景中使用Heap引擎，需要设计良好的索引结构，用空间换时间。

对于存储类似历史记录这样的分析性数据，通常没有OLTP的需求。假如对分析延迟没有强烈的需求，不想增加很多索引，那么可以考虑Hydra引擎，既节省空间，分析性能还还不错。