The tile list used for performance testing a stylesheet is critical. It needs to represent a realistic mix of zooms and tile complexity, and be large enough to have reasonable caching behavior. The best source for a tile list is logs from a real rendering server.
Logs for the tile.openstreetmap.org rendering servers are available: both tile accesses and one day's worth of rendering. The former contains a lot of cache hits, while the latter is the actual workload of the rendering server.
There's lots of interesting information in the file, but all that's needed here are the log lines indicating the start of rendering a tile. These are the lines containing START TILE, followed by the name of the style, zoom level, x range, y range, and age of the old rendered tile.
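Assuming the downloaded rendering log has been saved as renderd.log (a placeholder name), those lines can be pulled out with grep:

```bash
# Keep only the lines marking the start of rendering a tile.
grep 'START TILE' renderd.log > start_tile.log
```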
Making a list
A bit of magic with sed can turn the log file into a list of tiles in a standard z/x/y form.
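The exact invocation depends on the precise layout of the log lines, which isn't shown here. As an illustrative sketch only, assuming whitespace-separated fields after START TILE in the order described above (style, zoom, x range, y range), an equivalent awk one-liner could look like this, taking the start of each range:

```bash
# Hypothetical sketch: field positions depend on the actual log layout.
# Prints one z/x/y line per rendering request.
awk '{
    for (i = 1; i < NF; i++) {
        if ($i == "START" && $(i + 1) == "TILE") {
            z = $(i + 3)                # zoom level
            split($(i + 4), xr, "-")    # x range, e.g. 1234-1241
            split($(i + 5), yr, "-")    # y range
            print z "/" xr[1] "/" yr[1]
            break
        }
    }
}' start_tile.log > rendered_tiles.txt
```

The file names are placeholders; rendered_tiles.txt stands in for the intermediate list used below.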
Looking at the file, we can see how many tiles were rendered at each zoom level.
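With the list in z/x/y form, the per-zoom counts can be produced with standard tools, for example:

```bash
# Count how many rendering requests there were at each zoom level.
cut -d/ -f1 rendered_tiles.txt | sort -n | uniq -c
```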
This shows that of the 354 659 requests, only 12 were from zoom levels below 13. Low-zoom tiles follow different caching logic: instead of being frequently re-rendered, they are re-rendered in bulk every month. For a stable benchmark, these low-zoom tiles can be discarded. All the tiles need to be shifted up 3 zoom levels as well, and can be put into the same z/x/y format as used before.
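One way to do both steps, assuming "shifted up 3 zoom levels" means mapping each entry to the enclosing tile three zoom levels lower (so x and y are each divided by 8, matching an 8x8 metatile), is a short awk filter; that interpretation of the shift is an assumption here:

```bash
# Hypothetical sketch: drop the few low-zoom requests and map each remaining
# entry to the tile three zoom levels lower (x and y divided by 8).
awk -F/ '$1 >= 13 {
    printf "%d/%d/%d\n", $1 - 3, int($2 / 8), int($3 / 8)
}' rendered_tiles.txt > all_tiles.txt
```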
The full list of about 354k requests is too long for a reasonable benchmark. Instead, a list of about 20k should represent about 90 minutes of load on the rendering server. Generating this list is easy with head -n20000 all_tiles.txt > tiles.txt.
It's important that this list is big enough, which can be checked by clearing the memory cache and then generating the tiles: it takes about a quarter of the list to use all the RAM.
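On Linux, one common way to drop the operating system's page cache before such a check is shown below; note that PostgreSQL's shared_buffers are separate and are only cleared by restarting the server, and the service name depends on the distribution:

```bash
# Flush dirty pages to disk, then drop the page cache, dentries and inodes.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
# Restarting PostgreSQL also clears its shared_buffers.
sudo systemctl restart postgresql
```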
Running the benchmark
Running the benchmark as before with time parallel -a tiles.txt -j8 --progress curl -s -o /dev/null http://localhost:8080/{}.pbf and discarding the first run results in an average time of 1216.5 seconds and a standard deviation of 2.7 seconds.
Experience tells me that this is reasonable. A longer tile list would improve the standard deviation but take too long to run, while a shorter list would have too much error.
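A sketch of how the repeated runs might be driven and summarised, with the run count, file names, and use of GNU time as assumptions:

```bash
# Run the benchmark five times, recording the wall-clock time of each run.
for run in 1 2 3 4 5; do
    /usr/bin/time -f '%e' -o "run_${run}.time" \
        parallel -a tiles.txt -j8 curl -s -o /dev/null http://localhost:8080/{}.pbf
done

# Discard the first (cache-warming) run, then compute mean and standard deviation.
cat run_2.time run_3.time run_4.time run_5.time |
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "mean %.1f s, sd %.1f s\n", m, sqrt(ss / n - m * m) }'
```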
Indexing problems
When you create a GiST index, the result is non-deterministic, so rendering performance changes after a REINDEX DATABASE command. This used to be particularly bad when clustering on a GiST index, because both the table order and the indexes were then non-deterministic. If purely testing a stylesheet change, this doesn't matter because the same indexes can be used with multiple versions of the style, but if the testing involves a reimport with osm2pgsql, it throws a problem into the mix.
The only fix is to reindex multiple times and run the benchmark on each index result. Doing this five times, for a total of 25 results, gives an average time of 1203 seconds and a standard deviation of 11 seconds, much higher than before.
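A sketch of that loop, assuming the osm2pgsql database is called gis and the benchmark above has been wrapped in a script named run_benchmark.sh (both names are assumptions):

```bash
# Rebuild every index in the database, then benchmark against the fresh
# index layout; repeating gives a spread of results across index builds.
for i in 1 2 3 4 5; do
    psql -d gis -c 'REINDEX DATABASE gis;'
    ./run_benchmark.sh | tee "reindex_${i}.results"
done
```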