Update: I have updated the post with information about the Sqoop problem with reserved keywords. See below or this issue.
Update 2: The problem with Hue seems to be HUE-54. Hue just doesn’t seem to handle anything beyond ASCII at the moment (Python and UTF-8 always were a frickle beast)
Update 3: Last problem solved. Direct import with PostgreSQL seems to be working. See SQOOP-38 and below.
Update 4: There’s yet another problem with the direct import option for PostgreSQL. Don’t use it for any tables that contain boolean columns:
SQOOP-43

As you might or might not know (depending on how you found your way to this blog post) I’m a heavy user of OpenStreetMap (OSM) and I try to promote it whenever I can. I run the smallish website OSMdoc which analyzes a bit of the OSM data. As I also work with the HStack (Hadoop, HBase, etc.) I always wanted to combine those two. So this article shows how to install Hadoop and Hive on a fresh installation of Ubuntu 10.4 and load OSM data into it to run queries against it.

I am using Cloudera’s CDH3 (version Beta 2) distribution for this. I could have also used their pre-built Virtual Machine but I wanted to learn more and install everything myself. If you follow this post you should hopefully end up with a working way to use Hive and OSM yourself. While the way I use may not be perfect it is one that doesn’t require us to write any code for now.

Here is the outline of what I’m doing:

  • Starting with a fresh updated installation of Ubuntu 10.4 in a virtual machine (I’m using Virtualbox)
  • Install PostgreSQL and Osmosis
  • Import OSM data into PostgreSQL
  • Install Hadoop, Hue, Sqoop and Hive
  • Load data from PostgreSQL to Hive using Sqoop
  • Run Hive queries

Install and set up PostgreSQL

We’ll need PostgreSQL to store the OSM data as is usual in the OSM world. Later on we’ll also use it to host the metastore for Hive.

This one is easy thanks to the package system of Ubuntu:

sudo apt-get -y install postgresql
sudo -u postgres createuser -S -D -R -P osm
sudo -u postgres createdb -O osm osm
sudo sed -i "/^# \"local\"/i\local all all md5" /etc/postgresql/8.4/main/pg_hba.conf
sudo service postgresql-8.4 restart

This install PostgreSQL 8.4, creates a osm user which owns a database called osm and allows access for this user from the local machine with the password. Chose any password, just remember it.

Install Osmosis

For those who don’t know: Osmosis is the tool of choice for any tasks related to OSM data. It is used by almost everyone to keep databases updated and for various other tasks. It also runs on the OSM servers to provide diff files for OSM data. We use it here to import OSM data in a simple database schema.

Again this is relatively easy. Unfortunately there is no .deb package to install for Osmosis (yet) so we’ve got to do it manually. The first step is to install Java. It is recommended to use the Sun version of Java for Hadoop but as of Ubuntu 10.04 OpenJDK is the default. We’ve got to add another repository so we can install the Sun version first. If that is not possible for you or you want to use OpenJDK it should still work but there were some bugs in there that made the Hadoop/HBase guys recommend the Sun JDK.

As we’ll need Osmosis only occasionally or for a one-off job I won’t bother setting it up and installing it in the system directories. You’ll also have to get some OSM data. For testing I just use a small extract provided by the Geofabrik but you might want to parse the whole planet or other extracts.

sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get -y install sun-java6-jdk
cd ~
wget http://dev.openstreetmap.org/~bretth/osmosis-build/osmosis-bin-latest.tgz
tar xvfz osmosis-bin-latest.tgz
psql -U osm -f osmosis-0.35.1/script/contrib/apidb_0.6.sql -d osm
osmosis-0.35.1/bin/osmosis --read-xml file="<OSM XML file here>" --write-apidb host="localhost" database="osm" user="osm" password="<your password>"

Depending on which OSM data set you chose to import this can take a while. But the beautiful thing is that we can continue with installing Hadoop while this import is running. So open a new terminal and continue, Osmosis will eventually finish.

Installing Hadoop, Hue, Hive and Sqoop

Thanks to Cloudera this is pretty straightforward:

# Add Cloudera repositories
sudo sh -c 'echo "deb http://archive.cloudera.com/debian `lsb_release -c -s`-cdh3 contrib" > /etc/apt/sources.list.d/cloudera.list'
sudo sh -c 'echo "deb-src http://archive.cloudera.com/debian `lsb_release -c -s`-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list'
wget -q -O - http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install hadoop

# Install Hue
sudo apt-get -y install hadoop-0.20-conf-pseudo-hue

# Install Sqoop
sudo apt-get -y install sqoop

# Install Hive
sudo apt-get -y install hadoop-hive
sudo -u postgres createuser -S -D -R -P hive
sudo -u postgres createdb -O hive hive

After this is finished all those tools should be installed and PostgreSQL is prepared for Hive. What is left to do is to set up Hive and Hue to use PostgreSQL instead of SQLlite/Derby so they can share the metastore.

You need to edit the file /etc/hive/conf/hive-site.xml and in particular the following properties:

  • javax.jdo.option.ConnectionURL: In our current setup this would be jdbc:postgresql://localhost/osm
  • javax.jdo.option.ConnectionUserName: hive
  • javax.jdo.option.ConnectionPassword: The password you chose earlier for the database user
  • javax.jdo.option.ConnectionDriverName: org.postgresql.Driver

Additionally you need to set the property hive_conf_dir in the file /etc/hue/hue-beeswax.ini to /etc/hive/conf.

If you followed my post so far and used the password hive for the PostgreSQL user you can just run the following commands:

wget http://gist.github.com/raw/485836/512357fef1be0ac9cf8596770939355fc61a4d1c/hive-site.xml
sudo mv hive-site.xml /etc/hive/conf
sudo chown root:root /etc/hive/conf/hive-site.xml
wget http://gist.github.com/raw/485836/e32b00bc69744509123ef584be226328c37ccf77/hue-beeswax.ini
sudo mv hue-beeswax.ini /etc/hue

Now you need to download the current JDBC driver for PostgreSQL. At the time of this writing this is version 8.4-701:

wget http://jdbc.postgresql.org/download/postgresql-8.4-701.jdbc4.jar
sudo mv postgresql-8.4-701.jdbc4.jar /usr/lib/hadoop-0.20/lib/
sudo chown -R hadoop:hadoop /usr/lib/hadoop-0.20/lib/postgresql-8.4-701.jdbc4.jar

Now all that is left is to start Hadoop and Hue. Please note that the startup takes a while (at least for me and I’m not sure if that is correct):

for x in /etc/init.d/hadoop-0.20-*; do sudo $x start; done
sudo /etc/init.d/hue start

After this you should have three web interfaces:

Take the time to look at all those sites and make sure that everything seems fine. Also browse through HDFS using Hue and see if Beeswax and Hive work by clicking on the Tables button; there should be no tables for now.

Importing the OSM data from PostgreSQL to HDFS

We’re almost done. All that’s left to do is to get the data from PostgreSQL to HDFS and into Hive. That’s what Sqoop is for and here is how to run it:

sqoop import --connect "jdbc:postgresql://localhost/osm" --username osm --password <your password> --table node_tags --hive-import

An alternative to this command is the so called direct mode which does not use JDBC to connect to PostgreSQL but uses the psql tool to issue COPY commands which provides a speed boost to the export. There is currently a bug in Sqoop (see below) so the command to start Sqoop is a little bit different to circumvent this:

sqoop import --direct --connect "jdbc:postgresql://localhost:5432/osm" --username osm --password <your password> --table node_tags --hive-import

Note: Do not forget the port number in the JDBC URL!

The options should be pretty straight forward and easy to understand. But there’s a caveat. Or two, or three. I’ll update this post if anything changes with these points:

  • The import of the nodes, ways, relations, etc. tables doesn’t work with this option as Sqoop seems to have a bug (at least I think Sqoop can do something about it) with reserved words in Hive. timestamp is one such word and it is unfortunately the name of a column in all those tables. So the above command would fail with an error if you used the nodes table instead. I’ll update this post when I had time to investigate this further. Update: I’ve opened a ticket for the problem and attached a patch. Reserved words like timestamp can be escaped by backticks: `timestamp`. So for now you’ll have to create the table manually. See the Getting Started guide and the Data Definition Language for details.
  • Sqoop tells me that I should use the –direct option for better performance but I only got error messages. I’ll investigate this further. The above method may be slower but it works for now. Update: Found the problem and reported it to SQOOP-38. I’ve updated the text above with a way to use the direct import
  • Unfortunately Hue does seem to have problems with non ASCII data somewhere. This means that Beeswax doesn’t work for us at the moment as we got almost every character from the known Unicode set in the database somewhere. The command line however works and I’ll show a quick example. Update: This seems to be HUE-54

Now to actually query your data – and I’ll use the node_tags table as an example here – you just have to start Hive with the hive command and enter your query:

SELECT * FROM node_tags;
SELECT k, COUNT(k) AS count FROM node_tags GROUP BY k ORDER BY count DESC;
SELECT k, count FROM (SELECT k, COUNT(k) AS count FROM node_tags GROUP BY k) sub WHERE sub.count > 100 ORDER BY count DESC;

Those are pretty basic but they are the basis for the OSMdoc data so it’s what I wanted in the first place. The last query looks a bit complicated but Hive doesn’t have the HAVING clause yet so this is a workaround. It might not be very fast on small data sets but at least it is predictable. On PostgreSQL similar queries on a whole planet would take hours or days. And this is scalable.

Conclusion

I was happy to find that the whole setup was pretty easy. I first did this over a year ago and it was much more involved then. It’s great seeing the Hadoop community still going strong. As I’ve mentioned there are a few drawbacks and a few problems left but for my use case this is enough for now to provide a much needed refresh of the OSMdoc data. The last update was about a year ago.

My method has room for improvement – I for one don’t really need the PostgreSQL database and would love to store all the data in HBase but I’m lacking the resources. So for now PostgreSQL is an intermediate step. I’ve also not yet evaluated the performance but to compare PostgreSQL and Hive would be unfair at best. And the last thing I’ll need to do is to keep the data in HDFS updated and synchronized with the latest OSM updates.

Let me know if there are any questions or problems and I’d be glad to help either in the OpenStreetMap or the Hadoop world.

I’d like to end this post with a call for sponsors: OSMdoc in particular and the OpenStreetMap community in general could benefit greatly from a few more resources for our toy projects. I’d love to host a HBase version of OSM somewhere to allow for queries against it and to allow for much improved analysis in OSMdoc. Strato has kindly donated three of their largest servers to FOSSGIS for use in the OSM community but they are overloaded and there’s already a huge waiting list for new projects. So if you or your company has anything to spare please contact me.

Reference

You’ll find all the necessary commands in the following files: