Update May 2012: Some or all of this will be outdated now because the latest versions of YCSB are using Maven. I haven’t updated this article to reflect this.
HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java.
Yahoo has published a paper, Benchmarking Cloud Serving Systems with YCSB, along with the accompanying tool (YCSB). At the moment I am not interested in comparing different database systems against each other but only in benchmarking HBase. This is useful for testing custom patches and their performance impact, or for trying out different configuration options.
No matter which kind of workload you choose, however, keep in mind that this is an artificial benchmark and that it can’t replace a test with your real data and load.
In this short blog post I’m going to outline how to get YCSB running against a current version of HBase. I’m going to show this on a single machine. In a real test setup you should of course be running YCSB on a different machine (or multiple machines) than your HBase cluster. A YCSB benchmark consists of two phases: a load phase and a transaction phase. The load phase imports data into the database and collects various statistics while doing so; the transaction phase then runs operations against that data. There are multiple predefined workloads that mimic typical database usage scenarios, and you can also define your own.
I am using a clean Ubuntu 10.04 installation but this should work on other distributions just as well.
While you’ll probably run it against an already set-up cluster, I will be using HBase in standalone mode here, in its second development release of 0.89.
For YCSB I’ve used the latest version checked out from GitHub but the latest released version (0.1.2 at the time of this writing) should work equally well. So do this:
$ sudo apt-get -y install ant openjdk-6-jdk git-core
$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
$ wget http://apache.easy-webs.de/hbase/hbase-0.89.20100726/hbase-0.89.20100726-bin.tar.gz
$ tar xvzf hbase-0.89.20100726-bin.tar.gz
$ hbase-0.89.20100726/bin/start-hbase.sh
$ hbase-0.89.20100726/bin/hbase shell
create 'usertable', 'family'
exit
$ git clone http://github.com/brianfrankcooper/YCSB.git
$ cp hbase-0.89.20100726/lib/* YCSB/db/hbase/lib
$ cd YCSB
$ ant
$ ant dbcompile-hbase
As you can see YCSB requires a table called usertable in HBase and it has to contain one column family with an arbitrary name (i.e. family in my case). YCSB also needs all the libraries (jars) that the HBase client needs to run. The easiest is to just copy everything from HBase’s lib directory to the appropriate directory in YCSB.
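If you want to script the table creation instead of typing it into the interactive shell, a minimal sketch could look like the following. The function name create_usertable is made up for this example, and it assumes that the HBase shell executes commands piped to it on standard input, which you should verify for your HBase version.

```shell
# Hypothetical helper: create the table YCSB expects without an interactive session.
# $1 = path to the hbase launcher script, e.g. hbase-0.89.20100726/bin/hbase
# $2 = column family name (optional, defaults to "family")
create_usertable() {
  hbase_bin=$1
  family=${2:-family}
  # Assumption: the HBase shell reads and runs commands from stdin.
  printf "create 'usertable', '%s'\nexit\n" "$family" | "$hbase_bin" shell
}

# Usage:
#   create_usertable hbase-0.89.20100726/bin/hbase family
```

Whatever family name you pass here has to match the columnfamily property given to YCSB later on.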
At this point we should have HBase running somewhere and YCSB and its HBase driver compiled. Time to load some data into HBase.
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=1000 -s > load.dat
A few things to note here:
- This loads only 1000 records into HBase. You will want to increase the number to 100 million or more on a real test.
- The documentation is pretty good so make sure to read it should you have problems.
- The documentation suggests not specifying properties (like recordcount) on the command line but in a property file instead. You’ll find instructions on how to do this on the aforementioned page.
- The -s parameter causes YCSB to print status messages to System.err every ten seconds; remove it if you don’t want them.
- After the load operation has finished you can find statistics in the file the output was redirected to (load.dat in the command above).
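As a sketch of the property-file approach mentioned in the list above: the two -p settings from the load command can be moved into a file and passed with an additional -P argument. The file name hbase-load.properties is an arbitrary choice for this example, and it assumes that YCSB accepts several -P files and processes them in the order given; check the documentation for your version.

```shell
# Write the settings that were passed with -p above into a properties file.
# The file name is an arbitrary choice for this sketch.
cat > hbase-load.properties <<'EOF'
columnfamily=family
recordcount=1000
EOF

# The load command then takes the file as a second -P argument
# (run from the YCSB directory):
#   java -cp 'build/ycsb.jar:db/hbase/lib/*' com.yahoo.ycsb.Client -load \
#     -db com.yahoo.ycsb.db.HBaseClient \
#     -P workloads/workloada -P hbase-load.properties -s > load.dat
```

Keeping these values in a file makes it easier to repeat a run with exactly the same settings.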
Now we’ll run the transactions part of the workload (again, for explanations see the documentation of YCSB):
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=1000000 -s -threads 10 -target 100 > transactions.dat
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=1000000 -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat
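The load and transaction commands share most of their arguments, so a small wrapper can cut down on typing. This is only a sketch: the function name ycsb_hbase and the YCSB_DRY_RUN variable are made up for this example, and the paths assume you are in the YCSB directory as in the commands above.

```shell
# Hypothetical helper: run one YCSB phase against HBase with the options
# shared by the commands above. Set YCSB_DRY_RUN=1 to only print the command.
ycsb_hbase() {
  phase=$1   # -load for the load phase, -t for the transaction phase
  shift
  ${YCSB_DRY_RUN:+echo} java -cp 'build/ycsb.jar:db/hbase/lib/*' \
    com.yahoo.ycsb.Client "$phase" \
    -db com.yahoo.ycsb.db.HBaseClient \
    -P workloads/workloada \
    -p columnfamily=family \
    "$@"
}

# Usage, from the YCSB directory:
#   ycsb_hbase -load -p recordcount=1000 -s > load.dat
#   ycsb_hbase -t -p operationcount=1000000 -s -threads 10 -target 100 > transactions.dat
```

Everything after the phase flag is passed through unchanged, so the time series options from the second transaction command can simply be appended.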
After each run you should inspect the transactions.dat file. For explanations I’ll once again refer to the documentation. We’ve used workloada in these examples but there are in fact multiple predefined workloads (which are listed and explained in the documentation).
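To get a quick overview of a result file without scrolling through every latency bucket or time series sample, something like the following sketch may help. It assumes the output consists of lines of the form "[SECTION], Metric, value" (for example "[OVERALL], Throughput(ops/sec), ..."), which may differ between YCSB versions; the function name ycsb_summary is made up for this example.

```shell
# Hypothetical helper: print only the summary lines of a YCSB result file.
# Assumption: lines look like "[SECTION], Metric, value"; lines whose second
# field is a number (histogram buckets, time series samples) are skipped.
ycsb_summary() {
  grep -E '^\[[A-Z-]+\], (RunTime|Throughput|Operations|AverageLatency|MinLatency|MaxLatency)' "$1"
}

# Usage:
#   ycsb_summary transactions.dat
```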
That’s it. As you can see YCSB is pretty easy to set up. I still hope this guide was helpful in getting started with it. Let me know if you have any questions.