Paul’s Blog

A blog without a good name

Sampling Slow Postgres Queries

Part of optimizing a map style or application using PostgreSQL is finding which queries are taking the most time, and improving them. The first step is finding the slow queries.

A popular way to do this is using log_min_duration_statement to log all queries over a certain duration, or using log_statement to log every query. Both approaches have downsides.
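For reference, both are ordinary configuration settings; a minimal sketch of enabling them, assuming superuser access (the 250 ms threshold here is illustrative, not from this post):

ALTER SYSTEM SET log_min_duration_statement = '250ms';  -- log statements slower than 250 ms
-- or, to log every statement:
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();  -- both settings take effect on configuration reload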

Logging only slow queries gives a misleading view of what is using the most resources. A query that takes 100 seconds to run but only runs once a day uses fewer resources than a 100 ms query running every second, yet the fast, frequently repeated query might never appear in the logs. This is common with map rendering, where low-zoom queries are slow but infrequently run.

Logging all queries is comprehensive, but it can have a noticeable performance impact: every query causes a write to the server logs. This can be an unacceptable drag on a production server.

Both methods also require changing the server's logging settings. This means adjusting the server configuration, having the setting apply to all databases, stripping extraneous information out of the logs, and generally dealing with a format that is hard to parse. In a shared environment this can be impossible to get done. If statements contain newlines, the newlines make it into the logfile unescaped, making the entries hard to parse with normal tools.

Fortunately, there is another method: sampling running queries.

Sampling

pg_stat_activity is a system view that contains information about active server processes, including the queries they are running. By periodically querying it we can gather a sample of which queries are running and how often. This method isn't without downsides: pg_stat_activity cuts off very long query texts (at track_activity_query_size bytes, 1024 by default), which matters with stylesheets, where a complex layer can involve several KB of SQL. Fortunately, with rendering we can identify the queries by looking at the style definitions.
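If you administer the server, you can raise that limit; a minimal sketch, assuming superuser access (the new value only applies after a restart):

SHOW track_activity_query_size;                       -- 1024 bytes by default
ALTER SYSTEM SET track_activity_query_size = 16384;   -- requires a server restart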

Not all the columns of pg_stat_activity are useful, and we only want queries on the rendering database. Some SQL takes care of this:

SELECT
    pid,
    NOW(),
    NOW()-query_start AS query_duration,
    query
  FROM pg_stat_activity
  WHERE datname = 'gis'
    AND pid != pg_backend_pid()
    AND state != 'idle';

This restricts the output to what we are interested in, and instead of when each query started, we record how long it has been running. Because the query text can contain newlines, we wrap the entire statement in a COPY statement to take care of escaping, and call it in a loop from bash, logging to a file.

for i in `seq 1 163`; do 
psql -AtqX -d postgres -c "COPY (SELECT pid, NOW(), NOW()-query_start AS query_duration, query FROM pg_stat_activity WHERE datname = 'gis' AND pid != pg_backend_pid() AND state != 'idle') TO STDOUT;" >> query_log.txt;
sleep 22;
done

22 seconds is chosen because it doesn't divide evenly into 1 minute or 5 minutes. Monitoring and osm2pgsql updates often happen at those intervals, and this avoids sampling during every single update, or during none of them. The 163 iterations give just under an hour of samples (163 × 22 s ≈ 3586 s).
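A quick sanity check that 22 s stays out of phase with those intervals (my own illustrative query, not part of the logging itself):

SELECT 60 % 22 AS one_minute_remainder, 300 % 22 AS five_minute_remainder;
-- 16 | 14: both non-zero, so samples drift relative to the minute boundaries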

Loading

Lots of tools exist for log analysis, but because understanding these logs requires knowledge of PostgreSQL, we'll use PostgreSQL for the analysis too. A suitable table for the data can be created and loaded with

CREATE TABLE query_log (
  pid int,
  sample_time timestamptz,
  run_length interval,
  query text,
  PRIMARY KEY (sample_time, pid)
);
\copy query_log FROM query_log.txt
CLUSTER query_log USING query_log_pkey;
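
With the data loaded, one simple first pass (my own illustrative query, not from the logging setup above) is to count how often each query text was caught running. Since samples are 22 seconds apart, each appearance represents roughly 22 seconds of execution time:

SELECT
    query,
    COUNT(*) AS samples,
    COUNT(*) * interval '22 seconds' AS approx_total_time
  FROM query_log
  GROUP BY query
  ORDER BY samples DESC
  LIMIT 10;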

Need help optimizing your PostGIS queries? I have experience finding slow parts of queries and improving them. If you want to hire me for some work, send me an email.