Continuing with the setup of my new server, I moved on to filesystem settings. Unlike my previous servers, this one runs FreeBSD with a ZFS file system. ZFS has many features, but the important ones for osm2pgsql are transparent on-disk compression and an adjustable record size.
There is contradictory information available about both of these tunables. Gregory Smith's *PostgreSQL 9.0 High Performance* suggests adjusting the recordsize to match the PostgreSQL block size of 8K when doing scattered random IO, and using transparent compression for scans of data too small to be compressed by TOAST. On the other hand, some mailing list posts have suggested 128K for ETL workloads like osm2pgsql.
I was unable to find any ZFS tuning suggestions specific to spatial databases, so I was operating completely in the dark, which called for benchmarking.
Using the previous PostgreSQL tuning, I ran a series of imports, adjusting the ZFS recordsize and compression settings.
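For reference, switching between the tested variants comes down to two dataset properties. The commands below are only a sketch, assuming a hypothetical pool `tank` with the tablespace on a `tank/postgres` dataset; the names are placeholders rather than my actual layout.

```sh
# Sanity check: the 8K recordsize is chosen to match PostgreSQL's block size (8192 bytes).
psql -c "SHOW block_size;"

# One of the four tested combinations: 8K records with lz4 compression.
# recordsize only applies to files written after it is set, so set it before the import.
zfs set recordsize=8K tank/postgres
zfs set compression=lz4 tank/postgres

# The other combinations swap in recordsize=128K and/or compression=off.
```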
Results
| Compression | 8K recordsize | 128K recordsize |
|---|---|---|
| None | 6.3h | 8.4h |
| lz4 | 6.1h | 10.8h |
Processing time covers nodes, ways and relations, and this stage is single-threaded. A faster SSD-only server with a faster CPU can complete this stage in 4 hours.
| Compression | 8K recordsize | 128K recordsize |
|---|---|---|
| None | 18.4h | 21.8h |
| lz4 | 12.8h | 30.4h |
This is the PostgreSQL index and cluster stage, and its time is dominated by the creation of a 100GB GIN index on a bigint[] column.
| Compression | 8K recordsize | 128K recordsize |
|---|---|---|
| None | 27.2h | 32.7h |
| lz4 | 21.5h | 43.7h |
The final table shows the total import time. Pending ways is not broken out separately, as it is largely unaffected by ZFS settings.
Record size
The one clear conclusion is that 8K recordsizes are faster than 128K recordsizes. This holds true if you further subdivide the import into individual tasks like processing ways or relations. I found no cases where 128K was faster.
Compression and a 128K recordsize is a particularly bad combination. I'm not familiar enough with the internals of ZFS to say for certain, but it's possible that every time PostgreSQL writes an 8K block, ZFS is forced to fetch the 128K record, decompress it, merge in the new 8K, and recompress it.
An examination of the CPU usage supports this: the difference between 128K compressed and uncompressed is bigger than the difference between 8K compressed and uncompressed, for both CPU utilization and total CPU time used.
Compression
Compression is more interesting: it is a performance gain with an 8K recordsize, but a loss with 128K. The 8K case matters more, as it is faster all around, so it is worth looking at in more detail.
Times are in seconds, for the 8K recordsize imports.

| Compression setting | | off | lz4 |
|---|---|---|---|
| Processing | node | 1516 | 1514 |
| | way | 5589 | 5938 |
| | relation | 15853 | 14716 |
| | Total | 22958 | 22168 |
| Pending ways | | 8746 | 8889 |
| PostgreSQL index and cluster | | 66330 | 46365 |
| Total | | 98058 | 77441 |
Pending ways is unique in that it is the only part of the import that is purely writes and IO bound, with the load consisting of many COPY statements. It is also the only part that is slower on a compressed ZFS volume.
Total CPU usage
With compression, CPU utilization can be expected to be higher simply because the import completes faster, and on a sufficiently fast IO system CPU could again become the limiting factor. The 8K lz4 import used 33.9 CPU-hours while the 8K uncompressed import used 34.2 CPU-hours. On a modern system with multiple cores, CPU usage is unlikely to be a reason to reject lz4 compression, though it may be an issue with slower methods like lzjb.
Conclusions
A recordsize of 8K is preferable to 128K for ZFS volumes holding a tablespace under all conditions found. For anything but the purely write portions of the import, lz4 compression offers speed advantages, and very significant ones for large GIN index creation.
Further work
ZFS settings were kept constant at a 128K recordsize with no compression for the xlog volume. This may not be optimal, given the gains shown with an 8K recordsize on the tablespace volume.
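A natural follow-up would be to try the tablespace settings on the xlog dataset as well. A minimal sketch, again assuming a hypothetical `tank/pgxlog` dataset holding pg_xlog:

```sh
# Untested follow-up: mirror the tablespace settings on the WAL dataset.
# As before, recordsize only affects newly written files.
zfs set recordsize=8K tank/pgxlog
zfs set compression=lz4 tank/pgxlog
```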
Large PostGIS geometries are stored compressed, with TOAST and `MAIN` storage. This compression is on top of the lz4 compression done by ZFS, so it results in duplicated work. Better performance might be obtained by switching all `MAIN` storage to `PLAIN` or `EXTERNAL`.
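As a sketch of what that change might look like (the `gis` database name and the osm2pgsql default table and column names here are assumptions for illustration, not settings from this import):

```sh
# Hypothetical example: store the geometry column of one osm2pgsql output table
# uncompressed and out-of-line, leaving compression to ZFS instead of TOAST.
# Repeat for the other planet_osm_* tables; existing rows are not rewritten.
psql -d gis -c "ALTER TABLE planet_osm_polygon ALTER COLUMN way SET STORAGE EXTERNAL;"
```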