Loading a Pgsnapshot Schema With a Planet: Take 2

Awhile back I ran my own jxapi server, running with postgresql 8.4. I recently upgraded my server, and in the process I decided to re-import the database into postgresql 9.1.

There are two ways to load the data into the databases. The first of these is to use osmosis and --write-pgsql. The second is to use osmosis, --write-pgsql-dump and \copy statements to load the data into postgresql. This is reportedly faster and allows for more tuning of postgresql. Last time I used --write-pgsql but this time around I decided to use --write-pgsql-dump.

Getting the data

Even though you don’t need the data right away, you want to start downloading your planet.osm right away. You can do this with a program like curl, or you can try the planet torrents. The torrents are always at least as fast as downloading directly if you use a client that supports webseed, which most do. Most torrent clients also support throttling the bandwidth used which is useful if you want to use your internet connection while downloading.

Installing osmosis

You want osmosis-0.39 or later, which can be found on bretth’s site. When I tried I was unable to download the -latest version and had to download -0.39 specifically.

Reading the planet with osmosis

If you finish your planet download early or intend to use the planet file for other purposes, you should recompress it from bzip to gzip, which, although less space efficient, is much faster to decompress. Do this with bunzip2 -c planet-xxxxxx.osm.bz2 | gzip -3 > planet-xxxxxx.osm.gz. This is fairly quick on a dual-core CPU since one core will decompress while the other compresses.

Once you have osmosis and planet.osm, you want to start osmosis on processing the planet file. The osmosis documentation is extensive. Before you start, you will want to allocate more RAM to osmosis. You do this by creating a ~/.osmosis file and insert export JAVACMD_OPTIONS="-Xmx8G". You want to give osmosis as much RAM as you can spare, particularly if using InMemory node storage. I gave it 16 GB.

You may need to chmod +x bin/osmosis to continue.

Crafting osmosis command lines is somewhat of a confusing art. What is essential to remember is order matters. You will end up with something like zcat osm/planet/planet-111109.osm.gz | ./osmosis-0.39/bin/osmosis --fast-read-xml file=- --log-progress interval=15 --write-pgsql-dump directory=pgsqldump enableBboxBuilder=yes enableLineStringBuilder=yes nodeLocationStoreType=InMemory.

Breaking this down into parts:

zcat osm/planet/planet-111109.osm.gz | ./osmosis-0.39/bin/osmosis: This reads the gzipped planet file and feeds it to osmosis. If you didn’t recompress to gzip, then use bzcat on the bzip.
--fast-read-xml file=-: This reads from the standard input (“-”) and produces an output stream which is then used by…
--log-progress interval=60: This takes the output stream from the –fast-read-xml task and prints progress reports every 60 seconds. It sends the output stream to the next task unchanged.
--write-pgsql-dump directory=pgsqldump enableBboxBuilder=no enableLinestringBuilder=yes nodeLocationStoreType=InMemory: This takes the output stream and converts it to pgsql dump files in the directory pgsqldump. Let’s take a loot at the options
directory=pgsqldump: The directory where the output will be saved. This needs to be local to the machine, network latency will kill performance. AS of late 2011 you need approximately 240 GB for these files.
enableBboxBuilder=no: The information generated by this setting is reportedly not used by jxapi, so we don’t need to generate it.
enableLinestringBuilder=yes: You can generate this on import to the database, but it makes the import more complicated. You have to do the appropriate analyze commands mid-import or it will take years to finish. Just set it to yes and have osmosis do the work at this stage.
nodeLocationStoreType=InMemory: how osmosis stores nodes. If you have enough memory, you want InMemory. If not you want either TempFile or CompactTempFile. See the osmosis documentation for more information. When running with -Xmx12G java threw a garbage collection error towards the end of the nodes. I fixed it with export JAVACMD_OPTIONS="-Xmx16G -XX:-UseGCOverheadLimit" in my .osmosis file.

When you finally run the command, you want to use screen or another program to keep it running in the background. This took about 5 hours for me.

Installing postgresql

You need to install postgresql. This may vary between distributions, but on ubuntu sudo apt-get install postgresql-9.1 postgresql-contrib-9.1 postgresql-9.1-postgis should get you everything you need. If you don’t have them, you’ll want also do sudo apt-get install screen openjdk-7-jre-headless to get other programs you’ll need.

Optimizing the database

A pgsnapshot database differs from other databases in a few ways. Data loss on power outages can be tolerated since osmosis will just reprocess the diff file anyways, and it’s a big database. The postgresql wiki has a page on database tuning.

The important settings are shared_buffers, checkpoint_segments, and the write-ahead-log settings. There are also some settings you want to change specifically for the import.

I found the following settings worked for me with 16GB of ram:

shared_buffers = 2GB
work_mem = 128MB
maintenance_work_mem = 1GB
wal_level = minimal
synchronous_commit = off
checkpoint_segments = 64
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
default_statistics_target = 1000

The wal_level and synchronous_commit settings give the potential for data loss if a power outage occurs mid-update, but osmosis will re-run any diffs that were interrupted so this does not matter.

There are additional settings you can change for a faster import.

autovacuum = off
fsync = off

The first turns off auto-vacuum during the import and allows you to run a vacuum at the end. The second will introduce data corruption in case of a power outage and is dangerous. If you have a power outage while importing the data you will have to drop the data from the database and re-import, but it’s faster. Just remember to change these settings back after importing. fsync has no effect on query times once the data is loaded.

Also remember to restart postgresql. /etc/init.d/postgresql restart or equivalent.

You may get an error and need to increase the shared memory size. Edit /etc/sysctl.d/30-postgresql-shm.conf and run sudo sysctl -p /etc/sysctl.d/30-postgresql-shm.conf. I use kernel.shmmax=17179869184 and kernel.shmall=4194304 for a 16GB segment size.

Setting up the database

The jxapi documentation contains some commands to run to set up the database. These are only valid on 8.4 and can screw up the database on 9.1. You want to run these commands instead

sudo su - postgres

createdb xapi

createlang plpgsql xapi may already be done by default

createuser xapi You do want this to be a superuser

psql -d xapi -c "ALTER ROLE xapi PASSWORD 'xapi';"

psql -d xapi -f /usr/share/postgresql/9.1/contrib/postgis-1.5/postgis.sql

psql -d xapi -f /usr/share/postgresql/9.1/contrib/postgis-1.5/spatial_ref_sys.sql These two paths might change from version to version.

psql -d xapi -c "CREATE EXTENSION hstore" This is the big change from 8.4. What was hstore-new is now hstore, and extensions are enabled differently.

psql -d xapi -f ~/osmosis-0.39/script/pgsnapshot_schema_0.6.sql

psql -d xapi -f ~/osmosis-0.39/script/pgsnapshot_schema_0.6_linestring Vary with wherever you installed osmosis.

psql -d xapi -c "CREATE INDEX idx_nodes_tags ON nodes USING GIN(tags);"

psql -d xapi -c "CREATE INDEX idx_ways_tags ON ways USING GIN(tags);"

psql -d xapi -c "CREATE INDEX idx_relations_tags ON relations USING GIN(tags);"

exit will take you back to the original user.

Setting up the import

The pgsnapshot_load_0.6.sql file requires some changes before use, so place a copy of it in the pgsqldump directory. Open it up and comment out the `SELECT DropGeometryColumn, SELECT AddGeometryColumn, UPDATE ways SET bbox and SELECT MakeLine(c.geom) AS way_line statements. Also add \timing before the \copy statements to track how long it takes.

It is also possible to comment out the CLUSTER statements. CLUSTER places nearby node near each other on disk. This is useful if you are running bbox queries against your jxapi, but not very useful if you are filtering by tags.

The CREATE INDEX idx_ways_bbox ON ways USING gist (bbox); statement can be commented out as jxapi does not use this information.

Running the import

In the pgsqldump directory as the postgresql user, run psql -d xapi -f pgsnapshot_load_0.6.sql

If you placed pgsnapshot_load_0.6.sql in a different directory you need to be in the pgsqldump directory, not the directory where the .sql is. It will then run. Go and find something else to do, this will take some time.

users.txt:            0h  0m 0.67 s
nodes.txt:            4h 54m
ways.txt:             3h  3m
way_nodes.txt:           50m
relations.txt:            1m 14   s
relation_members.txt:        29.1 s

Creating indexes:

node primary key:                48m
way primary key:                 12m
way_nodes primary key:           55m
relations primary key:           14s
relation_members primary key:     9s
nodes geometry index:          14h8m
way_nodes node ID index:       1h29m
realtion_members member index:   15s
ways bounding box index:         n/a
way linestring index:            52m

Next are the two CLUSTER statements. I aborted these after half a day. CLUSTER requires 1-2x table size in free disk space which I do not have. Based on disk usage I estimated it would take 2-3 days to CLUSTER. Reportedly the correct way to do this is to pre-sort the files before loading.

Do not forget the ANALYZE. Without it, queries will not finish in your lifetime. ANALYZE took me 20 minutes.

Setting up jxapi

Next you want to set up jxapi. This can be done concurrently with the other steps but you will be unable to test until the ANALYZE command is finished.

First of all, install tomcat7 with sudo apt-get install tomcat7. Then download the latest version of jxapi from github. Copy it as xapi.war to /var/lib/tomcat7/webapps

You’re done! Enjoy your data, and don’t forget to turn fsync and autovacuum back on.

Paul’s Blog

A blog without a good name