<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>new Put(&#34;lars&#34;.toBytes(&#34;UTF-8&#34;))</title>
	<atom:link href="http://blog.lars-francke.de/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.lars-francke.de</link>
	<description>some crap about the future</description>
	<lastBuildDate>Thu, 17 May 2012 13:35:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Some open source news/releases &#8211; April 2012</title>
		<link>http://blog.lars-francke.de/2012/05/15/some-open-source-newsreleases-april-2012/</link>
		<comments>http://blog.lars-francke.de/2012/05/15/some-open-source-newsreleases-april-2012/#comments</comments>
		<pubDate>Tue, 15 May 2012 09:15:11 +0000</pubDate>
		<dc:creator>Lars Francke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.lars-francke.de/?p=246</guid>
		<description><![CDATA[Due to a stolen laptop and a holiday I won&#8217;t have much to say about April and May. I&#8217;ve missed a lot but I decided to also record all the new releases of libraries and related software I&#8217;m using. This is by no means complete. I hope to get back on to Hadoop + HBase&#8230;]]></description>
			<content:encoded><![CDATA[<p>Due to a stolen laptop and a holiday I won&#8217;t have much to say about April and May. I&#8217;ve missed a lot but I decided to also record all the new releases of libraries and related software I&#8217;m using. This is by no means complete. I hope to get back to Hadoop + HBase news.</p>
<ul>
<li>2.4.2012: Bigtop 0.3.0: <a href="http://incubator.apache.org/bigtop/release-notes.html" rel="nofollow">http://incubator.apache.org/bigtop/release-notes.html</a></li>
<li>5.4.2012: Hadoop 1.0.2: <a href="http://hadoop.apache.org/common/docs/r1.0.2/releasenotes.html" rel="nofollow">http://hadoop.apache.org/common/docs/r1.0.2/releasenotes.html</a></li>
<li>5.4.2012: Ganglia 3.3.5: <a href="http://ganglia.info/?p=506" rel="nofollow">http://ganglia.info/?p=506</a></li>
<li>5.4.2012: Tomcat 7.0.27: <a href="http://tomcat.apache.org/tomcat-7.0-doc/changelog.html" rel="nofollow">http://tomcat.apache.org/tomcat-7.0-doc/changelog.html</a> (includes WebSocket support)</li>
<li>11.4.2012: Commons Compress 1.4: <a href="http://commons.apache.org/compress/changes-report.html#a1.4" rel="nofollow">http://commons.apache.org/compress/changes-report.html#a1.4</a></li>
<li>11.4.2012: Puppet 2.7.13: <a href="https://projects.puppetlabs.com/projects/puppet/wiki/Release_Notes#2.7.13" rel="nofollow">https://projects.puppetlabs.com/projects/puppet/wiki/Release_Notes#2.7.13</a></li>
<li>12.4.2012: Lucene &amp; Solr 3.6: <a href="http://www.lucidimagination.com/blog/2012/04/12/lucene-solr-3-6-released/" rel="nofollow">http://www.lucidimagination.com/blog/2012/04/12/lucene-solr-3-6-released/</a></li>
<li>15.4.2012: Commons IO 2.3: <a href="http://commons.apache.org/io/upgradeto2_3.html" rel="nofollow">http://commons.apache.org/io/upgradeto2_3.html</a></li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1328140">20.4.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-3614">HBASE-3614</a> (<em>Expose per-region request rate metrics</em>)</li>
<li>24.4.2012: Apache Cassandra 1.1: <a href="https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces26" rel="nofollow">https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces26</a></li>
<li>24.4.2012: Apache Gora 0.2: <a href="http://gora.apache.org/releases.html#24+April%2C+2012%3A+Apache+Gora+0.2+released" rel="nofollow">http://gora.apache.org/releases.html#24+April%2C+2012%3A+Apache+Gora+0.2+released</a></li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1329574">24.4.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-4393">HBASE-4393</a> (<em>Implement a canary monitoring program</em>) a tool &#8220;to gather a list of the regions in the cluster, then iterate over them doing lightweight operations (eg short scans) to provide metrics about latency as well as alert on availability issues.&#8221;</li>
<li>24.4.2012: <a href="http://www.cloudera.com/blog/2012/04/introducing-cdh4-beta-2/">CDH4 Beta 2</a> has been released. Scheduled to be the last beta before release</li>
<li>26.4.2012: Logback 1.0.2: <a href="http://logback.qos.ch/news.html" rel="nofollow">http://logback.qos.ch/news.html</a></li>
<li>26.4.2012: Java 7 Update 4: <a href="http://www.oracle.com/us/corporate/press/1603497" rel="nofollow">http://www.oracle.com/us/corporate/press/1603497</a> the first JDK released for Mac OS. This does not include all the necessary bits and pieces to replace Apple&#8217;s Java yet as far as I can tell. That&#8217;s scheduled for Update 6.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.lars-francke.de/2012/05/15/some-open-source-newsreleases-april-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=29341&amp;popout=1&amp;url=http%3A%2F%2Fblog.lars-francke.de%2F2012%2F05%2F15%2Fsome-open-source-newsreleases-april-2012%2F&amp;language=en_GB&amp;category=text&amp;title=Some+open+source+news%2Freleases+%26%238211%3B+April+2012&amp;description=Due+to+a+stolen+laptop+and+a+holiday+I+won%26%238217%3Bt+have+much+to+say+about+April+and+May.+I%26%238217%3Bve+missed+a+lot+but+I+decided+to+also+record+all+the...&amp;tags=blog" type="text/html" />
	</item>
		<item>
		<title>Hadoop + HBase commit log March 2012</title>
		<link>http://blog.lars-francke.de/2012/04/01/hadoop-hbase-commit-log-march-2012/</link>
		<comments>http://blog.lars-francke.de/2012/04/01/hadoop-hbase-commit-log-march-2012/#comments</comments>
		<pubDate>Sun, 01 Apr 2012 20:34:06 +0000</pubDate>
		<dc:creator>Lars Francke</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>

		<guid isPermaLink="false">http://blog.lars-francke.de/?p=237</guid>
		<description><![CDATA[I&#8217;ve been looking at the commits for Hadoop and HBase for a while now but seeing as others might be interested in a short list of the highlights as well I started writing these commits down. These mostly include notable new features or improvements. As usual there are a lot of bug fixes and other&#8230;]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been looking at the commits for Hadoop and HBase for a while now but seeing as others might be interested in a short list of the highlights as well I started writing these commits down. These mostly include notable new features or improvements. As usual there are a lot of bug fixes and other commits not mentioned here. This list just represents what I personally found to be interesting last month.</p>
<p>I only monitor the trunk of both projects.</p>
<ul>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1296534" rel="nofollow">2.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HDFS-1623" rel="nofollow">HDFS-1623</a> (<em>High Availability Framework for HDFS NN</em>) has been merged (documented at <a href="https://issues.apache.org/jira/browse/HDFS-2733" rel="nofollow">HDFS-2733</a>) to provide High Availability for HDFS using two NameNodes (Active and Standby). For now this requires a shared NFS directory and does not support automatic failover.</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1298641" rel="nofollow">8.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-5074" rel="nofollow">HBASE-5074</a> (<em>support checksums in HBase block cache</em>) has been committed. reduces iops by saving block checksums in the HFile (version upgraded to 2.1) which are verified when reading a block into the block cache. Only when this fails a roundtrip is made to HDFS. This is helpful because HDFS saves checksums in a separate file from the actual data blocks and thus requiring two random reads</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1301165" rel="nofollow">15.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-4608" rel="nofollow">HBASE-4608</a> (<em>HLog Compression</em>) should speed up HBase write speed. This is missing documentation and is disabled by default. Enable by setting <code>hbase.regionserver.wal.enablecompression</code> in <code>hbase-site.xml</code>to <code>true</code>. This is a custom compression using a dictionary that ensures that on a crash all data is still recoverable.</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1302740" rel="nofollow">20.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HADOOP-8121" rel="nofollow">HADOOP-8121</a> (<em>A</em><em>ctive Directory Group Mapping Service</em>) allows group memberships for users to be read from Active Directory (or LDAP in general)</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1303474" rel="nofollow">21.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HDFS-2834" rel="nofollow">HDFS-2834</a> (<em>ByteBuffer-based read API for DFSInputStream</em>) for a read performance increase of up to a factor of two. Read <a href="https://issues.apache.org/jira/browse/HDFS-2834?focusedCommentId=13222216&amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13222216" rel="nofollow">comment</a> for more information</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1304616" rel="nofollow">23.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-5616" rel="nofollow">HBASE-5616</a> (<em>Make compaction code standalone</em>) for now mainly an internal code change but allows easier profiling of the code for future improvements. This introduced a tool that allows to run Compactions from the command line.</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1305499" rel="nofollow">26.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HBASE-5533" rel="nofollow">HBASE-5533</a> (<em>Add more metrics to HBase</em>) adds some really useful (sounding) Metrics</li>
<li><a href="http://svn.apache.org/viewvc?view=revision&amp;revision=1305673" rel="nofollow">26.3.2012</a>: <a href="https://issues.apache.org/jira/browse/HADOOP-8206" rel="nofollow">HADOOP-8206</a> (<em>Common portion of ZK-based failover controller</em>) is a first step in the direction of better failover detection which wasn&#8217;t part of the initial HDFS-1623 commit. This is tracked in <a href="https://issues.apache.org/jira/browse/HDFS-2185" rel="nofollow">HDFS-2185</a>.</li>
<li>27.3.2012: There was a vote and Hadoop now has a more sensible version naming. What was branch-0.23 is now called version 2, branch-0.22 will stay as it is and will probably be released as such as well. Trunk was called 0.24 so far and what exactly is going to happen is still being <a href="http://thread.gmane.org/gmane.comp.jakarta.lucene.hadoop.general/1050/">discussed</a>.</li>
</ul>
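<p>As a rough sketch, the HLog compression switch from HBASE-4608 mentioned in the list would look like this in <code>hbase-site.xml</code>. This is based only on the property name given above and is untested against any particular HBase build:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;hbase.regionserver.wal.enablecompression&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;</pre>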
]]></content:encoded>
			<wfw:commentRss>http://blog.lars-francke.de/2012/04/01/hadoop-hbase-commit-log-march-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=29341&amp;popout=1&amp;url=http%3A%2F%2Fblog.lars-francke.de%2F2012%2F04%2F01%2Fhadoop-hbase-commit-log-march-2012%2F&amp;language=en_GB&amp;category=text&amp;title=Hadoop+%2B+HBase+commit+log+March+2012&amp;description=I%26%238217%3Bve+been+looking+at+the+commits+for+Hadoop+and+HBase+for+a+while+now+but+seeing+as+others+might+be+interested+in+a+short+list+of+the+highlights+as+well...&amp;tags=blog" type="text/html" />
	</item>
		<item>
		<title>Setting up a Hadoop cluster &#8211; Part 1: Manual Installation</title>
		<link>http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/</link>
		<comments>http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/#comments</comments>
		<pubDate>Wed, 26 Jan 2011 15:23:01 +0000</pubDate>
		<dc:creator>Lars Francke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cdh]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://blog.lars-francke.de/?p=85</guid>
		<description><![CDATA[Introduction This has also been posted on the GBIF Developer blog. I&#8217;ll answer questions in both places and update both blogs as needed. In the last few months I was tasked several times with setting up Hadoop clusters. Those weren&#8217;t huge &#8211; two to thirteen machines &#8211; but from what I read and hear this&#8230;]]></description>
			<content:encoded><![CDATA[<h1>Introduction</h1>
<p><em>This has also been <a href="http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html">posted</a> on the <a href="http://gbif.blogspot.com/">GBIF Developer blog</a>. I&#8217;ll answer questions in both places and update both blogs as needed.</em></p>
<p>In the last few months I was tasked several times with setting up Hadoop clusters. Those weren&#8217;t huge &#8211; two to thirteen machines &#8211; but from what I read and hear this is a common use case, especially for companies just starting with Hadoop or setting up a first small test cluster. While there is a huge amount of material in the form of official documentation, blog posts, articles and books, most of it stops just where it gets interesting: dealing with all the stuff you really have to do to set up a cluster, cleaning logs, maintaining the system, knowing what to tune and how, etc.</p>
<p>I&#8217;ll try to describe all the hoops we had to jump through and all the steps involved to get our Hadoop cluster up and running. Probably trivial stuff for experienced Sysadmins but if you&#8217;re a Developer and find yourself in the &#8220;Devops&#8221; role all of a sudden I hope it is useful to you.</p>
<p>While working at <a href="http://www.gbif.org">GBIF</a> I was asked to set up a Hadoop cluster on 15 existing and 3 new machines. So the first interesting thing about this setup is that it is a heterogeneous environment: Three different configurations at the moment. This is where our first goal came from: We wanted some kind of automated configuration management. We need to try different cluster configurations and be able to shift roles around the cluster without having to do a lot of manual work on each machine. We decided to use a tool called <a href="http://www.puppetlabs.com/">Puppet</a> for this task.</p>
<p>While Hadoop is not currently in production at GBIF there are mid- to long-term plans to switch parts of our infrastructure to various components of the HStack. Namely MapReduce jobs with Hive and perhaps Pig (there is already strong knowledge of SQL here) and also storing of large amounts of raw data in HBase to be processed asynchronously (~500 million records by next year) and indexed in a Lucene/Solr solution possibly using something like Katta to distribute indexes. For good measure we also have fairly complex geographic calculations and map-tile rendering that could be done on Hadoop. So we have those 18 machines and no real clue how they&#8217;ll be used and which services we&#8217;d need in the end.</p>
<h1>Environment</h1>
<p>As mentioned before we have three different server configurations. We&#8217;ve put those machines in three logical clusters <em>c1</em>, <em>c2</em> and <em>c3</em> and number the machines within each cluster consecutively (our master, for example, is currently running on <em>c1n1</em>):</p>
<ul>
<li><em>c1</em> (10 machines): Intel(R) Xeon(R) CPU X3363 @ 2.83GHz, 2x6MB (quad), 8 GB RAM, 2 x 500 GB SATA 7.2K</li>
<li><em>c2</em> (3 machines): 2 x Intel(R) Xeon(R) CPU E5630 @ 2.53GHz (quad), 24 GB RAM, 6 x 250 GB SATA 5.4K</li>
<li><em>c3</em> (5 machines): Intel(R) Xeon(R) CPU X3220 @ 2.40GHz (quad), 4 GB RAM, 2 x 160 GB SATA 7.2K</li>
<li><a href="http://www.centos.org/">CentOS</a> 5.5</li>
<li>The machines are in different racks but connected to only one switch</li>
</ul>
<p>We realize that this is a very heterogeneous cluster configuration. We also realize that some people highly discourage use of old machines or machines with little RAM but the <em>c1</em> and <em>c3</em> clusters were old unused machines and this way they still serve a purpose and we&#8217;ve had no problems so far using this setup.</p>
<h1>Goal</h1>
<p>These were the goals we set out to achieve on our cluster and these are also all the things I&#8217;ll try to describe in this or a following post:</p>
<ul>
<li>Puppet for setting up the services and configuring machine state</li>
<li><a href="https://docs.cloudera.com/display/DOC/Hadoop+Installation">CDH3</a> (Beta 3)
<ul>
<li>Hadoop HDFS + MapReduce incl. Hadoop LZO</li>
<li>Hue</li>
<li>Zookeeper</li>
<li>HBase</li>
</ul>
</li>
<li>Easily distributable packages for Hadoop, Hive and Pig to be used by the employees to access the cluster from their own workstations</li>
<li>Benchmarks &amp; Optimizations</li>
</ul>
<p>Be warned: This is going to be a very long post and unfortunately it is the nature of these things that some of the information is bound to be outdated pretty quickly so let me know if something has changed and I&#8217;ll alter the post.</p>
<h1>Manual installation</h1>
<p>Before we use Puppet to do everything automatically I will show how it can be done manually. I think it is important to know all the steps in case something goes wrong or you decide not to use Puppet at all. When I talk about &#8220;the server&#8221; I always mean &#8220;all servers in your cluster&#8221; except when noted otherwise. I highly recommend not skipping this part even if you want to use Puppet.</p>
<h2>Operating System</h2>
<p>For now I&#8217;ll just assume a vanilla CentOS 5.5 installation is already present. There&#8217;s nothing special you need. I recommend just the bare minimum; everything else needed can be installed at a later time. A few words though about things you might want to do: Your servers probably have multiple disks. You shouldn&#8217;t use any RAID or LVM on any of your slaves (i.e. DataNodes/TaskTrackers). Just use a JBOD configuration. In our cluster all disks are in a simple structure:</p>
<ul>
<li><code>/mnt/disk1</code></li>
<li><code>/mnt/disk2</code></li>
<li>&#8230;</li>
</ul>
<p>There are also two tweaks for your slaves you can do:</p>
<ul>
<li>Mount your data disks with <code>noatime</code> (e.g. <code>/dev/sdc1 /mnt/disk3 ext3 defaults,noatime 1 2</code> which btw. implies <code>nodiratime</code>)</li>
<li>By default a certain number of blocks is reserved on ext file systems (I&#8217;m not familiar with others); check by running <code>tune2fs -l /dev/sdc1</code> and looking at the <em>Reserved block count</em>. This is useful on system disks so that critical processes can still write some data when the disk is full, but on our data disks it is wasted space. By default 5% of a disk is reserved for this. I recommend setting this down to 1% by running <code>tune2fs -m 1 &lt;device&gt;</code> on all your data disks (e.g. <code>tune2fs -m 1 /dev/sdc1</code>), which frees up quite a bit of disk space. You can also set it to 0% if you want, though I went with 1% for our cluster. Keep the default setting for your system disks though!</li>
</ul>
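<p>Put together, the two tweaks could be applied roughly like this. It is only a sketch: the device names and the mount point are assumptions taken from the examples above, so adjust them to your own data disks and leave the system disk out:</p>
<pre class="brush:shell"># Assumed data disk devices; never include the system disk here
for dev in /dev/sdb1 /dev/sdc1; do
  # Lower the reserved blocks to 1% and show the resulting count
  tune2fs -m 1 "$dev"
  tune2fs -l "$dev" | grep -i "Reserved block count"
done
# Apply noatime to an already mounted data disk without a reboot
mount -o remount,noatime /mnt/disk3</pre>
<p>The remount only lasts until the next reboot, so the <code>noatime</code> option still has to go into <code>/etc/fstab</code> as shown in the list.</p>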
<p>On your NameNode however use any means you feel necessary to secure your data. You know your requirements better than I do. Use RAID and/or LVM however you like. We don&#8217;t have any special resources so our NameNode is running on one of our regular servers at the moment. We might change that in the future.</p>
<h2>A note on Cloudera&#8217;s Package system &amp; naming</h2>
<p>Cloudera provides the various components of Hadoop in different packages but they follow a simple structure: There is one <code>hadoop-0.20</code> package which contains all the jars, config files, directories, etc. needed for all the roles. And then there are packages like <code>hadoop-0.20-namenode</code> which are only a few kilobytes in size and contain just the appropriate start and stop scripts for the role in question.</p>
<h2>1. Common requirements</h2>
<p>Most of the commands in this guide need to be executed as <code>root</code>. I&#8217;ve chosen the easy route here and just logged in as <code>root</code>. If you&#8217;re operating as a non-privileged user remember to use <code>su</code>, <code>sudo</code> or any other means to ensure you have the proper rights.</p>
<h3>Repository</h3>
<ul>
<li><a href="https://docs.cloudera.com/display/DOC/CDH3+Installation">Cloudera documentation</a></li>
</ul>
<p>As all the packages we&#8217;re going to install are provided by Cloudera we need to add their repository to our cluster:</p>
<pre class="brush:shell">curl http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo &gt; /etc/yum.repos.d/cloudera-cdh3.repo</pre>
<h3>Java installation</h3>
<ul>
<li><a href="https://docs.cloudera.com/display/DOC/Java+Development+Kit+Installation">Cloudera documentation</a></li>
<li><a href="http://www.oracle.com/technetwork/java/javase/downloads/index.html">Java downloads</a></li>
<li>We&#8217;re using JDK 6 Update 23</li>
</ul>
<p>You have to download the JDK from Oracle&#8217;s website yourself as license issues prevent it from being added to the repositories. Choose the correct system (probably Linux x64) and make sure to download the file ending in <code>-rpm.bin</code> (e.g. <code>jdk-6u23-linux-x64-rpm.bin</code>). You might have to do this from a client machine because you need a browser that works with the Oracle site. So on any one machine execute the following:</p>
<pre class="brush:shell">unzip jdk-6u23-linux-x64-rpm.bin</pre>
<p>You should now have a bunch of .rpm files but you only need one of them: <code>jdk-6u23-linux-amd64.rpm</code>. Copy this file to your servers and install it as root using rpm:</p>
<pre class="brush:shell">rpm -Uvh ./jdk-6u23-linux-amd64.rpm</pre>
<h3>Time</h3>
<p>While not a hard requirement it makes a lot of things easier if the clocks on your servers are synchronized. I added this part at the last minute because we just realized that <code>ntpd</code> was disabled on three of our machines (c2) by accident, which caused some problems. It is worth taking a look at the clocks now and setting up <code>ntp</code> properly before you start.</p>
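<p>On a stock CentOS 5 machine, getting <code>ntpd</code> installed, enabled at boot and running should look something like this (a minimal sketch that assumes the default time servers in <code>/etc/ntp.conf</code> are acceptable for your network):</p>
<pre class="brush:shell">yum install -y ntp
chkconfig ntpd on
service ntpd start
# List the peers ntpd is talking to; it can take a few minutes to settle
ntpq -p</pre>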
<h3>DNS</h3>
<p>It doesn&#8217;t matter if you use a DNS server or hosts files or any other means for the servers to find each other. But make sure this works! Do it now! Even if you think everything&#8217;s set up correctly. Another thing that you should check is if the local hostname resolves to the public IP address. If you&#8217;re using a DNS server you can use <code>dig</code> to test this but that doesn&#8217;t take into account the <code>/etc/hosts</code> file so here is a simple test to see if it is correct:</p>
<pre class="brush:shell">ping -c 1 `hostname`</pre>
<p>This should resolve to the public IP and not to <code>127.0.0.1</code>.</p>
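<p>If you want to run this check on many machines, a small script along these lines may help. It is only a sketch: it relies on <code>getent</code>, which follows the same lookup order as the rest of the system (including <code>/etc/hosts</code>), and it only looks at the first address returned:</p>
<pre class="brush:shell">name=$(hostname)
# First address the local hostname resolves to
ip=$(getent hosts "$name" | awk 'NR==1 { print $1 }')
case "$ip" in
  127.*|::1|"") echo "WARNING: $name resolves to '$ip'" ;;
  *)            echo "OK: $name resolves to $ip" ;;
esac</pre>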
<h3>Firewall</h3>
<p>Hadoop uses a lot of ports for its internal and external communications. We&#8217;ve just allowed all traffic between the servers in the cluster and clients. But if you don&#8217;t want to do that you can also selectively open the required ports. I try to mention them but they can all be changed in the configuration files. I might also miss some due to our config so I&#8217;d be glad if someone could point those out to me.</p>
<h3>Packages</h3>
<p>We&#8217;re going to use lzo compression, the Hadoop native libraries as well as hue so there are a few common dependencies on all machines in the cluster which can be easily installed:</p>
<pre class="brush:shell">rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum install -y lzo hue-plugins hadoop-0.20-native</pre>
<h3>Directories</h3>
<p>We also need some directories later on so we can just create them now:</p>
<pre class="brush:shell">mkdir &lt;data disk&gt;/hadoop
chown root:hadoop &lt;data disk&gt;/hadoop</pre>
<p>Cloudera uses the <a href="http://linux.die.net/man/8/alternatives">alternatives</a> system to manage configuration. In <code>/etc/hadoop/conf</code> is the currently activated configuration. Look at the contents of <code>/etc/hadoop</code> and you&#8217;ll find all the installed configurations. At the moment there is only a <code>conf.empty</code> directory which we&#8217;ll use as our starting point:</p>
<pre class="brush:shell">cp -R /etc/hadoop/conf.empty /etc/hadoop/conf.cluster</pre>
<p>Now feel free to edit the configuration files in <code>/etc/hadoop/conf.cluster</code> but we&#8217;ll go through them as well later in this post. The last step is to activate this configuration:</p>
<pre class="brush:shell">/usr/sbin/alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50</pre>
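<p>To check which configuration is active afterwards, something like the following should work. This is a sketch; it assumes the packaged <code>conf.empty</code> is registered with a priority lower than 50, so that the new entry wins automatically:</p>
<pre class="brush:shell"># Shows all registered configurations and the one the link currently points to
/usr/sbin/alternatives --display hadoop-0.20-conf
# The files listed here should be the ones from conf.cluster
ls -l /etc/hadoop-0.20/conf/</pre>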
<h3>LZO</h3>
<ul>
<li><a href="http://code.google.com/p/hadoop-gpl-compression/">hadoop-gpl-compression project</a></li>
<li>Todd Lipcon&#8217;s <a href="https://github.com/toddlipcon/hadoop-lzo">hadoop-lzo</a> &amp; Kevin Weil&#8217;s <a href="https://github.com/kevinweil/hadoop-lzo">hadoop-lzo</a> projects</li>
</ul>
<p>Due to licensing issues the LZO bindings for Hadoop cannot be distributed the same way as the rest of the packages. So this &#8211; once again &#8211; involves a few manual steps. After these bindings were removed from Hadoop itself a few versions ago they moved to the hadoop-gpl-compression project on Google Code which (as far as I know) still works but hasn&#8217;t seen any development for over a year. Thankfully though Twitter&#8217;s Kevin Weil and Cloudera&#8217;s Todd Lipcon have picked up the project and maintained it. They regularly sync their github repositories so both should have almost the same code. I&#8217;m going to use Todd&#8217;s version here as it should be better synced with CDH releases.</p>
<p>You have to download the code from the repository, build the native libraries as well as the jar file and distribute those files on your cluster. You need to do this only on one machine which ideally should run the same OS version as the servers in your cluster. When you&#8217;re finished you can just copy the result to all servers. We&#8217;re using version 0.4.9 so we use this to download and build:</p>
<pre class="brush:shell">rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum install -y lzo-devel
wget --no-check-certificate https://github.com/toddlipcon/hadoop-lzo/tarball/0.4.9
tar xvfz toddlipcon-hadoop-lzo-0.4.9-0-g0e70051.tar.gz
wget http://www.apache.org/dist/ant/binaries/apache-ant-1.8.2-bin.tar.bz2
tar jxvf apache-ant-1.8.2-bin.tar.bz2
cd toddlipcon-hadoop-lzo-0e70051
JAVA_HOME=/usr/java/latest/ BUILD_REVISION="0.4.9" ../apache-ant-1.8.2/bin/ant tar</pre>
<p>The ant version that comes with CentOS 5.5 didn&#8217;t work for me, which is why I downloaded a new one. This should leave you with a <code>hadoop-lzo-0.4.9.tar.gz</code> file in the build directory which you can extract to get all the necessary files for your servers:</p>
<ul>
<li><code>hadoop-lzo-0.4.9.jar</code> needs to be copied into <code>/usr/lib/hadoop/lib</code> on each server</li>
<li><code>lib/native/Linux-amd64-64</code> needs to be copied into <code>/usr/lib/hadoop/lib/native</code> on each server</li>
</ul>
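<p>Copying the two pieces to every server can be scripted, for example like this. Treat it as a sketch: the host names are hypothetical (they merely follow our naming scheme) and the paths inside the extracted tarball are an assumption, so check where the jar and the native directory actually end up on your build machine:</p>
<pre class="brush:shell"># Hypothetical host list; replace with all servers in your cluster
for host in c1n1 c1n2 c2n1; do
  scp hadoop-lzo-0.4.9/hadoop-lzo-0.4.9.jar "root@$host:/usr/lib/hadoop/lib/"
  scp -r hadoop-lzo-0.4.9/lib/native/Linux-amd64-64 "root@$host:/usr/lib/hadoop/lib/native/"
done</pre>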
<h3>cron &amp; log cleaning</h3>
<p>We&#8217;ve had a problem with unintentional debug logs filling up our hard drives. The investigations that followed that incident resulted in a <a href="http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/">blog post</a> by <a href="http://www.larsgeorge.com/">Lars George</a> explaining all the log files Hadoop writes. It is a worthwhile read.</p>
<p>Hadoop writes tons of logs in various processes and phases and you should make sure that these don&#8217;t fill up your hard drives. There are two instances in the current CDH3b3 where you have to intervene manually:</p>
<ul>
<li>Hadoop daemon logs</li>
<li>Job XML files on the JobTracker</li>
</ul>
<p>Hadoop uses a <a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/DailyRollingFileAppender.html"><code>DailyRollingFileAppender</code></a> which unfortunately doesn&#8217;t have a <a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/RollingFileAppender.html#maxBackupIndex"><code>maxBackupIndex</code></a> setting like the <a href="http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/RollingFileAppender.html"><code>RollingFileAppender</code></a>. So either change the appender or manually clean up logs after a few days. We chose the second path and added a very simple cron job to run daily:</p>
<pre class="brush:shell">find /var/log/hadoop/ -type f -mtime +14 -name "hadoop-hadoop-*" -delete</pre>
<p>This job deletes log files that are older than 14 days. We&#8217;ll take care of the Job XML files in a similar way at the JobTracker.</p>
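<p>One way to schedule the cleanup is a file in <code>/etc/cron.d</code>. The file name and the time of day below are assumptions, not what we necessarily use; note that entries in <code>/etc/cron.d</code> need the user name as an extra field:</p>
<pre class="brush:shell"># /etc/cron.d/hadoop-log-cleanup (hypothetical file name)
# Runs every day at 03:15 as root
15 3 * * * root find /var/log/hadoop/ -type f -mtime +14 -name "hadoop-hadoop-*" -delete</pre>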
<h2>2. HDFS</h2>
<p>One property needs to be set for both the NameNode and the DataNodes in the file <code>/etc/hadoop/conf/core-site.xml</code>: <code>fs.default.name</code>. So just add this and replace <code>$namenode</code> with the IP or name of your NameNode:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;fs.default.name&lt;/name&gt;
  &lt;value&gt;hdfs://$namenode:8020&lt;/value&gt;
&lt;/property&gt;</pre>
<h3>2.1. NameNode</h3>
<p>Installing the NameNode is straightforward:</p>
<pre class="brush:shell">yum install -y hadoop-0.20-namenode</pre>
<p>This installs the startup scripts for the NameNode. The core package was already installed in the previous step. Now we need to change the configuration, create some directories and format the NameNode.</p>
<p>In <code>/etc/hadoop/conf/hdfs-site.xml</code> add the <code>dfs.name.dir</code> property which <em>&#8220;determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.&#8221;</em> We mentioned before that we&#8217;re using a JBOD configuration. We do this even for our NameNode. So in our case the NameNode has two disks mounted at <code>/mnt/disk1</code> and <code>/mnt/disk2</code> but you might want to write to just one location if you use RAID. As it says in the documentation the NameNode will write to each of the locations. You can also write to a third location: an NFS mount which serves as a backup. Our configuration looks like this:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.name.dir&lt;/name&gt;
  &lt;value&gt;/mnt/disk1/hadoop/dfs/name,/mnt/disk2/hadoop/dfs/name&lt;/value&gt;
&lt;/property&gt;</pre>
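<p>The directories named in <code>dfs.name.dir</code> have to exist with the right owner before the NameNode is formatted and started. For the two locations from our configuration that could be done like this (adjust the paths if yours differ):</p>
<pre class="brush:shell">mkdir -p /mnt/disk1/hadoop/dfs/name /mnt/disk2/hadoop/dfs/name
chown -R hdfs:hadoop /mnt/disk1/hadoop/dfs /mnt/disk2/hadoop/dfs</pre>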
<p>Make sure to create the <code>dfs</code> directories before starting the NameNode. They need to belong to <code>hdfs:hadoop</code>. Formatting the NameNode is all that&#8217;s left:</p>
<pre class="brush:shell">su hdfs -c "/usr/bin/hadoop namenode -format"</pre>
<p>Once you&#8217;ve done all that you can enable the service so it will be started upon system boot and start the NameNode:</p>
<pre class="brush:shell">chkconfig hadoop-0.20-namenode on
service hadoop-0.20-namenode start</pre>
<p>You should now be able to see the web interface on your NameNode at port 50070. The ports that need to be opened to clients on the NameNode are 50070 (web interface, 50470 if you enabled SSL) and 8020 (for HDFS command line interaction). Only port 8020 needs to be open to the other servers in the cluster.</p>
<p>We also use a cron job to run the HDFS Balancer every evening:</p>
<pre class="brush:shell">/usr/lib/hadoop-0.20/bin/start-balancer.sh -threshold 5</pre>
<h3>2.2 DataNodes</h3>
<p>The DataNodes handle all the data by storing it and serving it to clients. You can run a DataNode on your NameNode, and this is often done for small or test clusters, but as soon as you have more than three to five machines or rely on your cluster for production use you should use a dedicated NameNode. After all our preparations, setting up the DataNodes is easy. We need to set the property <code>dfs.data.dir</code> in the file <code>/etc/hadoop/conf/hdfs-site.xml</code>. It <em>&#8220;determines where on the local filesystem an DFS data node should store its blocks.  If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.&#8221;</em> These are the directories that the actual data bytes of HDFS are written to. If you specify multiple directories the DataNode will write to them in turn, which gives good performance when reading the data.</p>
<p>This is an example of what we are using:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.data.dir&lt;/name&gt;
  &lt;value&gt;/mnt/disk1/hadoop/dfs/data,/mnt/disk2/hadoop/dfs/data&lt;/value&gt;
&lt;/property&gt;</pre>
<p>Make sure to create the <code>dfs</code> directories before starting the DataNodes. They need to belong to <code>hdfs:hadoop</code>. When that&#8217;s done you just need to install the DataNode, activate the startup scripts and start it:</p>
<pre class="brush:shell">yum install -y hadoop-0.20-datanode
chkconfig hadoop-0.20-datanode on
service hadoop-0.20-datanode start</pre>
<p>Your DataNode should now be up and running. If you have configured it correctly it should also have connected to the NameNode, it should be visible in the <em>Live Nodes</em> list of the web interface, and the configured capacity should have gone up. Ports that need to be opened to clients are 50075 (web interface, 50475 if you enabled SSL) and 50010 (for data transfer). For the cluster you need to open ports 50010 and 50020.</p>
<h2>3. MapReduce</h2>
<p>MapReduce is split into two parts as well: a JobTracker and multiple TaskTrackers. For smaller clusters the NameNode and the JobTracker can run on the same server, but depending on your usage and available memory you might need to run them on separate servers. We have 18 servers, 17 slaves and 1 master (with NameNode, JobTracker and other services), which hasn&#8217;t been a problem so far. We need three properties set on all servers (in <code>mapred-site.xml</code>) to get started.</p>
<ul>
<li><code>mapred.job.tracker</code>: <em>&#8220;The host and port that the MapReduce job tracker runs at.  If &#8216;local&#8217;, then jobs are run in-process as a single map and reduce task.&#8221;</em>
<ul>
<li>This just points to your JobTracker. There is no default port for this in Hadoop 0.20 but 8021 is often used.</li>
<li>Our value (replace <code>$jobtracker</code> with the name or IP of your designated JobTracker): <code>$jobtracker:8021</code></li>
</ul>
</li>
<li><code>mapred.local.dir</code>: <em>&#8220;The local directory where MapReduce stores intermediate data files.  May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.&#8221;</em>
<ul>
<li>As it says, this is a local directory where MapReduce stores intermediate data, and we spread it out over all our disks.</li>
<li>Our value: <code>/mnt/disk1/hadoop/mapreduce,/mnt/disk2/hadoop/mapreduce</code></li>
<li>Create the directories on each server with the owner <code>mapred:hadoop</code></li>
</ul>
</li>
<li><code>mapred.system.dir</code>: <em>&#8220;The shared directory where MapReduce stores control files.&#8221; </em>
<ul>
<li>This is a path in HDFS where MapReduce stores stuff</li>
<li>Our value: <code>/hadoop/mapreduce/system</code></li>
<li>If <code>dfs.permissions</code> is enabled you need to create the parent directory in HDFS and make <code>mapred</code> its owner. Execute this command on any server in your cluster: <code>su hdfs -c "/usr/bin/hadoop fs -mkdir /hadoop/mapreduce &amp;&amp; /usr/bin/hadoop fs -chown mapred:hadoop /hadoop/mapreduce"</code></li>
</ul>
</li>
</ul>
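<p>Put together, the three properties above could look roughly like this in <code>mapred-site.xml</code>. This is only a sketch built from the values in the list; <code>$jobtracker</code> is a placeholder for the name or IP of your own JobTracker:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;mapred.job.tracker&lt;/name&gt;
  &lt;value&gt;$jobtracker:8021&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.local.dir&lt;/name&gt;
  &lt;value&gt;/mnt/disk1/hadoop/mapreduce,/mnt/disk2/hadoop/mapreduce&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.system.dir&lt;/name&gt;
  &lt;value&gt;/hadoop/mapreduce/system&lt;/value&gt;
&lt;/property&gt;</pre>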
<h3>3.1 JobTracker</h3>
<p>The JobTracker is very easy to set up and start:</p>
<pre class="brush:shell">yum install -y hadoop-0.20-jobtracker
chkconfig hadoop-0.20-jobtracker on
service hadoop-0.20-jobtracker start</pre>
<p>The web interface should now be available at port 50030 on your JobTracker.  Ports 50030 (web interface) and 8021 (not well defined in Hadoop 0.20 but if you followed my configuration this is correct) need to be opened to clients. Only 8021 is necessary for the TaskTrackers.</p>
<p>If the JobTracker is restarted some old files will not be cleaned up. That&#8217;s why we added another small cronjob to run daily:</p>
<pre class="brush:shell">find /var/log/hadoop/ -type f -mtime +3 -name "job_*_conf.xml" -delete</pre>
<h3>3.2 TaskTracker</h3>
<p>The TaskTrackers are as easy to install as the JobTracker:</p>
<pre class="brush:shell">yum install -y hadoop-0.20-tasktracker
chkconfig hadoop-0.20-tasktracker on
service hadoop-0.20-tasktracker start</pre>
<p>The TaskTracker should now be up and running and visible in the JobTracker&#8217;s Nodes list.  Only port 50060 needs to be opened to clients for a minimalistic web interface. Other than that no other ports are needed as TaskTrackers check in at the JobTracker regularly (heartbeat) and get assigned Tasks at the same time.</p>
<h2>4. Configuration</h2>
<p>I&#8217;ll discuss a few configuration properties here that range from &#8220;necessary to change&#8221; to &#8220;nice to know about&#8221;. I&#8217;ll mention the following things for each property:</p>
<ul>
<li>The default value,</li>
<li>the value we use for our cluster at GBIF,</li>
<li>some of the defaults are quite old and have never been changed so I might mention a value I deem safe to use for everybody,</li>
<li>if we set the property to final so it can&#8217;t be overridden by clients (we set a lot of the parameters to final for purely documentary reasons, even those that can&#8217;t be overwritten in the first place),</li>
<li>if the property has been renamed or deprecated in Hadoop 0.21,</li>
<li>and if this property is required in a client configuration file or only on the cluster. If I don&#8217;t mention it, it&#8217;s not needed on the clients.</li>
</ul>
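<p>For reference, marking a property as final is done with a <code>final</code> element inside the property. A minimal sketch, using the <code>fs.default.name</code> value from the HDFS section (<code>$namenode</code> is again a placeholder for your own NameNode):</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;fs.default.name&lt;/name&gt;
  &lt;value&gt;hdfs://$namenode:8020&lt;/value&gt;
  &lt;!-- final: configuration resources loaded later cannot override this value --&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;</pre>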
<p>Here are the default configuration files for Hadoop 0.20.2 and 0.21:</p>
<ul>
<li>core-default.xml: <a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/core/core-default.xml">0.20.2</a>, <a href="https://github.com/apache/hadoop-common/blob/release-0.21.0/src/java/core-default.xml">0.21</a></li>
<li>hdfs-default.xml: <a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/hdfs/hdfs-default.xml">0.20.2</a>, <a href="https://github.com/apache/hadoop-hdfs/blob/release-0.21.0/src/java/hdfs-default.xml">0.21</a></li>
<li>mapred-default.xml: <a href="https://github.com/apache/hadoop-common/blob/release-0.20.2/src/mapred/mapred-default.xml">0.20.2</a>, <a href="https://github.com/apache/hadoop-mapreduce/blob/release-0.21.0/src/java/mapred-default.xml">0.21</a></li>
</ul>
<p>I know that this duplicates some of the sections above, but I want to keep this Configuration section as a reference.</p>
<h3><code>core-site.xml</code></h3>
<h4><code>fs.default.name</code></h4>
<ul>
<li>Default: <code>file:///</code></li>
<li>We: <code>hdfs://$namenode:8020</code></li>
<li>We set this to final</li>
<li>Renamed to <code>fs.defaultFS</code> in Hadoop 0.21</li>
<li>Needed on the clients</li>
</ul>
<p>This is used to specify the default file system. It defaults to your local file system, which is why it needs to be set to an HDFS address. This is important for client configuration as well so your local configuration file should include this element.</p>
<h4><code>hadoop.tmp.dir</code></h4>
<ul>
<li>Default: <code>/tmp/hadoop-${user.name}</code></li>
<li>CDH3 Default: <code>/var/lib/hadoop-0.20/cache/${user.name}</code></li>
<li>We: Left it at the CDH3 default</li>
<li>We set this to final</li>
</ul>
<p>As mentioned in the default file this is mainly a base for other temporary directories. If all other configuration options are set correctly there shouldn&#8217;t be too much data in here.</p>
<h4><code>fs.trash.interval</code></h4>
<ul>
<li>Default: <code>0</code></li>
<li>We: <code>10080</code></li>
<li>We set this to final</li>
</ul>
<p>Hadoop has a Trash feature where removed files (using the command line tools) are moved to a .Trash folder in the user&#8217;s home folder. If set to 0 this feature is disabled, but if set to a non-zero value this is the number of minutes between Trash cleaner runs. As we have a lot of users in our system using Hadoop for the first time we chose a safe value here (10080 minutes are 7 days).</p>
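<p>As a sketch, the value listed above could be written like this in <code>core-site.xml</code>. The value is given in minutes, so one week works out to 7 * 24 * 60 = 10080:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;fs.trash.interval&lt;/name&gt;
  &lt;!-- 7 days * 24 hours * 60 minutes = 10080 minutes; 0 would disable the Trash --&gt;
  &lt;value&gt;10080&lt;/value&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;</pre>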
<h4><code>fs.checkpoint.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/dfs/namesecondary</code></li>
<li>We: <code>/mnt/disk1/hadoop/dfs/namesecondary,/mnt/disk2/hadoop/dfs/namesecondary</code></li>
<li>We set this to final</li>
</ul>
<p>The secondary NameNode stores the images it needs to merge here. If it is a comma-separated list, the data is replicated to all these locations on the local disks.</p>
<h4><code>io.file.buffer.size</code></h4>
<ul>
<li>Default: <code>4096</code></li>
<li>Safe: <code>65536</code></li>
<li>We: <code>131072</code> (32 * 4096)</li>
<li>Can be overwritten by clients</li>
<li>We set this to final</li>
</ul>
<p>This is used for buffers all over the place to copy, store and write data to. It should be a multiple of 4096 and it should be safe to use 65536 today but we use double that. The performance gain is not enormous but there have been blog posts in the past measuring the impact and it was positive. We&#8217;ve also done our own tests and saw a small performance gain. If you use HBase be careful not to set this too high.</p>
<h4><code>io.compression.codecs</code></h4>
<ul>
<li>Default: <code>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</code></li>
<li>We: <code>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</code></li>
<li>We set this to final</li>
</ul>
<p>This lists all installed compression codecs. If you followed my manual you&#8217;ve got to add two more to the default list of codecs: <code>LzoCodec</code> and <code>LzopCodec</code>.</p>
<h4><code>io.compression.codec.lzo.class</code></h4>
<ul>
<li>We: <code>com.hadoop.compression.lzo.LzopCodec</code></li>
</ul>
<p>I have actually no idea why this setting is needed as I couldn&#8217;t find any reference where it is actually used in the code but I didn&#8217;t look very hard so I might be wrong. All I know is that the documentation mentions that this property needs to be set.</p>
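<p>Taken together, the two LZO-related settings above might look like this in <code>core-site.xml</code>. This is a sketch that simply reuses the values listed above, and it assumes that the LZO codec classes are available on the classpath of every node:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;io.compression.codecs&lt;/name&gt;
  &lt;!-- the three default codecs plus LzoCodec and LzopCodec --&gt;
  &lt;value&gt;org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec&lt;/value&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;io.compression.codec.lzo.class&lt;/name&gt;
  &lt;value&gt;com.hadoop.compression.lzo.LzopCodec&lt;/value&gt;
&lt;/property&gt;</pre>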
<h4><code>webinterface.private.actions</code></h4>
<ul>
<li>Default: <code>false</code></li>
<li>We: <code>true</code></li>
<li>We set this to final</li>
</ul>
<p>By setting this to <code>true</code> the web interfaces for the JobTracker and NameNode gain some advanced options like killing a job. It makes life a lot easier while still in development or evaluation. But you probably should set this to false once you rely on your Hadoop cluster for production use.</p>
<h3><code>hdfs-site.xml</code></h3>
<h4><code>dfs.name.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/dfs/name</code></li>
<li>We: <code>/mnt/disk1/hadoop/dfs/name,/mnt/disk2/hadoop/dfs/name</code></li>
<li>We set this to final</li>
</ul>
<p>This is an important setting, which is why I&#8217;ve already mentioned it above. The NameNode stores the name table in these directories and replicates all information to each of them. One of them could be a mount on a remote disk (e.g. NFS) to have a backup.</p>
<h4><code>dfs.data.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/dfs/data</code></li>
<li>We: <code>/mnt/disk1/hadoop/dfs/data,/mnt/disk2/hadoop/dfs/data</code></li>
<li>We set this to final</li>
</ul>
<p>This is another important setting as explained above. It differs from <code>dfs.name.dir</code> in that the data is not replicated to all disks but distributed among these locations. The DataNodes save the actual data in these locations, so more space is better. The easiest thing is to use dedicated disks for this. If you store anything other than Hadoop data on the disks make sure to set <code>dfs.datanode.du.reserved</code> (see below).</p>
<h4><code>dfs.namenode.handler.count</code></h4>
<ul>
<li>Default: <code>10</code></li>
<li>We: <code>20</code></li>
<li>Safe: 10-20</li>
<li>We set this to final</li>
</ul>
<p>The number of threads the NameNode uses to serve requests. This depends highly on your usage and size of your cluster. We&#8217;ve tried a bunch of different values and settled on 20 without seeing any notable differences. <code>nnbench</code> is probably a good tool to benchmark this. If you&#8217;ve got a large cluster or many file operations (create or delete) you can try upping this value.</p>
<h4><code>dfs.datanode.handler.count</code></h4>
<ul>
<li>Default: <code>3</code></li>
<li>We: <code>5</code></li>
<li>Safe: 5-10</li>
<li>We set this to final</li>
</ul>
<p>The number of threads DataNodes use. I can&#8217;t tell what a good value is for large clusters but the <code>TestDFSIO</code> benchmark seems like a good test to run to find a good value here. Just play around. We&#8217;ve tried a bunch of different values up to 20 and didn&#8217;t see a difference so we chose a value slightly larger than the default.</p>
<h4><code>dfs.datanode.du.reserved</code></h4>
<ul>
<li>Default: <code>0</code></li>
<li>We: Left the default</li>
</ul>
<p>This many bytes will be left free on the volumes used by the DataNodes (see <code>dfs.data.dir</code>). As our drives are dedicated to Hadoop we left this at 0 but if the drives host other stuff as well set this to an appropriate value.</p>
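<p>If your drives are shared with other software, a reservation might look like the following sketch for <code>hdfs-site.xml</code>. The value of 10 GB is purely hypothetical and only there to show the unit; the property takes plain bytes:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.datanode.du.reserved&lt;/name&gt;
  &lt;!-- hypothetical example: 10 * 1024 * 1024 * 1024 = 10737418240 bytes (10 GB) kept free on each volume --&gt;
  &lt;value&gt;10737418240&lt;/value&gt;
&lt;/property&gt;</pre>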
<h4><code>dfs.permissions</code></h4>
<ul>
<li>Default: <code>true</code></li>
<li>We: <code>true</code></li>
<li>Renamed to <code>dfs.permissions.enabled</code> in Hadoop 0.21</li>
<li>We set this to final</li>
</ul>
<p>This enables permission checking in HDFS. Unless you use Secure Hadoop (which we don&#8217;t, so I don&#8217;t cover it here) it is still easy for anyone to read, write and delete anything on the cluster as there is no authentication of users. So this is purely a safety measure to avoid messing with the wrong data by accident.</p>
<h4><code>dfs.replication</code></h4>
<ul>
<li>Default: <code>3</code></li>
<li>We: <code>3</code></li>
<li>Can be used in the client configuration</li>
</ul>
<p>This is the default replication level used for new files in HDFS. If you change this value later on, no existing files will be changed (that can be done on the command line though). Every file in HDFS can have a different replication level. This just sets the default.</p>
<h4><code>dfs.block.size</code></h4>
<ul>
<li>Default: <code>67108864</code></li>
<li>We: <code>134217728</code></li>
<li>Renamed to <code>dfs.blocksize</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>This setting applies per file and is only used for new files. Files saved to HDFS are split into blocks at most this large (64 MB by default). This has multiple implications. The more blocks you have the more load there is on your NameNode, so if you have many files that are larger than the block size you might set this larger. If your files are mostly smaller than this you waste no space: all files only take as much space as they actually have data (this is unlike other file systems where a file takes up at least one block no matter how large it really is). So NameNode load (and its memory requirements) is one factor. The deciding factor for us to set this higher by default is that a lot of our calculations in MapReduce are very fast and Mappers finish quickly. As one Mapper usually processes one block and Mappers take a while to set up, we chose a higher block size so that each Mapper has more data to process.</p>
<p>This can be set on a per file basis so you really have to find your own perfect value, perhaps even on a per dataset basis.</p>
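<p>A sketch of the cluster-wide default in <code>hdfs-site.xml</code>, using the value listed above. Clients can carry the same property in their own configuration if they want a different default for the files they write:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.block.size&lt;/name&gt;
  &lt;!-- 128 * 1024 * 1024 = 134217728 bytes (128 MB), twice the default of 67108864 --&gt;
  &lt;value&gt;134217728&lt;/value&gt;
&lt;/property&gt;</pre>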
<h4><code>dfs.balance.bandwidthPerSec</code></h4>
<ul>
<li>Default: <code>1048576</code></li>
<li>We: <code>2097152</code></li>
<li>Renamed to <code>dfs.datanode.balance.bandwidthPerSec</code> in Hadoop 0.21</li>
<li>We set this to final</li>
</ul>
<p>This property configures the number of bytes per second (default is 1 MB/s) that a DFS balancing operation can use per DataNode. The default is pretty low so we doubled it. We don&#8217;t use a lot of bandwidth in our cluster at the moment so this is not a problem, but the right value depends on your use case. The higher this number the faster balancing operations will complete. We run balancing every night in a cron job so we want it to be finished by morning.</p>
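<p>Written out for <code>hdfs-site.xml</code>, the doubled limit from above would look roughly like this (a sketch; the value is in bytes per second):</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.balance.bandwidthPerSec&lt;/name&gt;
  &lt;!-- 2 * 1024 * 1024 = 2097152 bytes per second (2 MB/s) per DataNode --&gt;
  &lt;value&gt;2097152&lt;/value&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;</pre>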
<h4><code>dfs.hosts</code></h4>
<ul>
<li>Default: no default set</li>
<li>We: <code>/etc/hadoop/conf/allowed_hosts</code></li>
<li>We set this to final</li>
</ul>
<p>This file has to contain one name per line. Every name is the name of a DataNode that is allowed to connect to the NameNode. This prevents accidents like what happened to me: I test everything in Virtual Machines so I started a bunch of them, deployed the live config and forgot to change the NameNode so all of a sudden a bunch of Virtual Machines joined our HDFS cluster and blocks began replicating there&#8230;. So it is a good thing to explicitly list all allowed hosts in this file.</p>
<h4><code>dfs.support.append</code></h4>
<ul>
<li>Default: <code>false</code></li>
<li>We: <code>true</code></li>
<li>As far as I know this option has been removed in Hadoop 0.21 and is enabled by default</li>
<li>We set this to final</li>
</ul>
<p>This option has quite a history. To make it short: If you&#8217;re using CDH3 set this to true, otherwise leave it false. You want/need this on true if you plan to use HBase.</p>
<h4><code>dfs.datanode.max.xcievers</code></h4>
<ul>
<li>Default: <code>256</code></li>
<li>Safe: <code>1024</code></li>
<li>We: <code>2048</code></li>
<li>Yes, this is misspelt in Hadoop and it hasn&#8217;t been fixed in Hadoop 0.21.</li>
<li>We set this to final</li>
</ul>
<p>This is the maximum number of threads a DataNode may use (for example for file access to the local file system). There used to be bugs in Hadoop so that the default was a bit too low and needed to be set higher. Even today it&#8217;s worth setting it higher, which doesn&#8217;t carry a lot of risk, especially if you&#8217;re using HBase.</p>
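<p>For a cluster that should also serve HBase, the last two settings might be combined like this in <code>hdfs-site.xml</code>. This is only a sketch that reuses the values listed above; note the misspelt property name, which has to be typed exactly like this:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;dfs.support.append&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;!-- "xcievers" is the spelling Hadoop expects --&gt;
  &lt;name&gt;dfs.datanode.max.xcievers&lt;/name&gt;
  &lt;value&gt;2048&lt;/value&gt;
  &lt;final&gt;true&lt;/final&gt;
&lt;/property&gt;</pre>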
<h3><code>mapred-site.xml</code></h3>
<p>HDFS is pretty straightforward to configure and benchmark. MapReduce is more of a black art unfortunately. I&#8217;ll describe the MapReduce process here because it is important to understand where all the properties come in so you can safely change their values and tweak the performance. In my first draft of this post I wrote that I won&#8217;t go into much detail on the internals of the MapReduce process. (Un-)fortunately this wasn&#8217;t as easy as I thought and it has grown into a full-blown explanation of everything I know. It is very possible that something&#8217;s wrong here so please correct me if you see something that is off. And if you&#8217;re not interested in how this works just skip to the descriptions of the properties themselves.</p>
<p>All of this is valid for Hadoop 0.20.2+737 (the CDH version). I know that some things have changed in Hadoop 0.21 but that&#8217;s left for another time.</p>
<h4>The Map side</h4>
<p>While a Map is running it collects output records in an in-memory buffer called <code>MapOutputBuffer</code>. If there are no reducers a <code>DirectMapOutputCollector</code> is used instead, which makes most of the rest obsolete as it writes immediately to disk. The total size of this in-memory buffer is set by the <code>io.sort.mb</code> property and defaults to <em>100 MB</em> (which is converted to a byte value using a bit shift operation [<code>100 &lt;&lt; 20 = 104857600</code>]). Out of these <em>100 MB</em> a share of <code>io.sort.record.percent</code> is reserved for tracking record boundaries. This property defaults to <em>0.05</em> (i.e. <em>5%</em> which means <em>5 MB</em> in the default case). Each record to track takes <em>16 bytes</em> (4 integers of 4 bytes each) of memory which means the buffer can track <em>327680</em> map output records with the default settings. The rest of the memory (<em>104857600 bytes &#8211; (16 bytes * 327680) = 99614720 bytes</em>) is used to store the actual bytes to be collected (in the default case this will be <em>95 MB</em>). While Map outputs are collected they are stored in the remaining memory and their location in the in-memory buffer is tracked as well. Once one of these two buffers reaches a threshold specified by <code>io.sort.spill.percent</code>, which defaults to <em>0.8</em> (i.e. <em>80%</em>), the buffer is flushed to disk:</p>
<pre class="brush:plain">0.8 * 99614720 = 79691776
0.8 * 327680 = 262144</pre>
<p>Look in the log output of your Maps and you&#8217;ll see these three lines at the beginning of every log:</p>
<pre class="brush:plain">2010-12-05 01:33:04,912 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-12-05 01:33:04,996 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-12-05 01:33:04,996 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680</pre>
<p>You should recognize these numbers!</p>
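<p>To make the connection to the configuration explicit, here is a sketch of the three properties with their default values as they could appear in <code>mapred-site.xml</code>. The comments just repeat the arithmetic from above:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;io.sort.mb&lt;/name&gt;
  &lt;!-- 100 MB in total: 100 &lt;&lt; 20 = 104857600 bytes --&gt;
  &lt;value&gt;100&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;io.sort.record.percent&lt;/name&gt;
  &lt;!-- 5% of 104857600 = 5242880 bytes, at 16 bytes per record = 327680 records;
       the remaining 99614720 bytes hold the data --&gt;
  &lt;value&gt;0.05&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;io.sort.spill.percent&lt;/name&gt;
  &lt;!-- spill at 80%: 79691776 bytes of data or 262144 records, whichever comes first --&gt;
  &lt;value&gt;0.80&lt;/value&gt;
&lt;/property&gt;</pre>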
<p>Now while the Map is running you might see log lines like these:</p>
<pre class="brush:plain">2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 19361312; bufvoid = 99614720
2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-12-05 01:33:09,558 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0</pre>
<p>This means we&#8217;ve reached the maximum number of records we can track even though our buffer is still pretty empty (<em>99614720 - 19361312 bytes</em> still free). If however your buffer is the cause of your spill you&#8217;ll see a line like this:</p>
<pre class="brush:plain">2010-12-05 01:33:08,823 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full = true</pre>
<p>All of this spilling to disk is done in a separate thread so that the Map can continue running. That&#8217;s also the reason why the spill begins early (when the buffer is only <em>80%</em> full) so it doesn&#8217;t fill up before a spill is finished. If a single Map output record is too large to fit into the in-memory buffer a separate spill is done for this one value. A spill actually consists of one file per partition, meaning one file per Reducer.</p>
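<p>If your spills are mostly caused by <code>record full = true</code>, as in the example above, the split between the two buffers can be shifted. The following is only a back-of-the-envelope estimate and not a tested recommendation: the spill above holds 19361312 bytes in 262144 records, i.e. roughly 74 bytes per record. With 16 bytes of bookkeeping per record, both buffers would fill up at about the same time with a record share of 16 / (16 + 74), or roughly 0.18:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;io.sort.record.percent&lt;/name&gt;
  &lt;!-- estimate for this one job only: 19361312 bytes / 262144 records = about 74 bytes per record,
       16 / (16 + 74) = about 0.18; measure your own jobs before changing this --&gt;
  &lt;value&gt;0.18&lt;/value&gt;
&lt;/property&gt;</pre>
<p>Since this depends entirely on the size of your map output records it makes more sense to set it per job than cluster-wide.</p>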
<p>After a Map task has finished there may be multiple spills on the TaskTracker. Those files have to be merged into one single sorted file per partition which is then fetched by the Reducers. The property <code>io.sort.factor</code> says how many of those spill files will be merged into one file at a time. The lower the number, the more passes are required to arrive at the goal. The default of <em>10</em> is very low and raising it to <em>100</em> has been considered (and in fact, looking at the code, it sometimes is set to <em>100</em> by default). This property can make a pretty huge difference if your Mappers output a lot of data. Not much memory is needed for this property but the larger it is the more open files there will be so make sure to set this to a reasonable value. To find such a value you should run a few MapReduce jobs that you&#8217;d expect to see in production use and carefully monitor the log files.</p>
<p>Watch out for log messages like these:</p>
<ul>
<li><code>Merging &lt;numSegments&gt; sorted segments</code></li>
<li><code>Down to the last merge-pass, with &lt;numSegments&gt; segments left of total size: &lt;totalBytes&gt; bytes</code></li>
<li><code>Merging &lt;segmentsToMerge.size()&gt; intermediate segments out of a total of &lt;totalSegments&gt;</code></li>
</ul>
<p>This is the process on the Map side where this factor is used. If your Mappers only have one spill file all of this doesn&#8217;t matter. So if you try to benchmark this make sure to use a job with a lot of Map output data. If you only see a line like &#8220;<code>Finished spill 0</code>&#8221; but none of the above you&#8217;re only producing one spill file which doesn&#8217;t require any merging or further sorting. This is the ideal situation and you should try to get the number of spilled records/files as low as possible.</p>
<h4>The Reduce side</h4>
<p>The reduce phase has three different steps: Copy, Sort (which should really be called Merge) and Reduce.</p>
<p>During the Copy phase the Reducer tries to fetch the output of the Maps from the TaskTrackers and store it on the Reducer either in memory or on disk. The property <code>mapred.reduce.parallel.copies</code> (which defaults to <em>5</em>) defines how many Threads are started per Reduce task to fetch Map output from the TaskTrackers.</p>
<p>Here&#8217;s an example log from the beginning of a Reducer log:</p>
<pre class="brush:plain">2010-12-05 01:53:03,846 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=334063200, MaxSingleShuffleLimit=83515800
2010-12-05 01:53:03,879 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Need another 1870 map output(s) where 0 is already in progress
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for merging on-disk files
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread waiting: Thread for merging on-disk files
2010-12-05 01:53:03,880 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for merging in memory files
2010-12-05 01:53:03,881 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Thread started: Thread for polling Map Completion Events</pre>
<p>You can see two things in these log lines. First of all the <code>ShuffleRamManager</code> is started and afterwards you see that this Reducer needs to fetch 1870 map outputs (meaning we had 1870 Mappers). The map output is fetched and shuffled into memory (that&#8217;s what the <code>ShuffleRamManager</code> is for). You can control its behavior using the <code>mapred.job.shuffle.input.buffer.percent</code> property (default is <em>0.7</em>). <a href="http://download.oracle.com/javase/6/docs/api/java/lang/Runtime.html#maxMemory()">Runtime.getRuntime().maxMemory()</a> is used to get the available memory, which unfortunately returns slightly <a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4686462">incorrect</a> values, so be careful when setting this. We&#8217;ll get back to the last four lines later.</p>
<p>Our child tasks are running with <code>-Xmx512m</code> (536870912 bytes) so 70% of that should be <em>375809638 bytes</em> but the <code>ShuffleRamManager</code> reports <em>334063200</em>. No big deal, just be aware of it. There&#8217;s a hardcoded limit of 25% of the buffer that a single map output may not surpass. If it is larger than that it will be written to disk (see the MaxSingleShuffleLimit value above: 334063200 * 0.25 = 83515800).</p>
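<p>The settings behind these numbers could look like the following sketch in <code>mapred-site.xml</code>. It assumes that the child heap size is configured through <code>mapred.child.java.opts</code>, which isn&#8217;t stated explicitly above; the other two values are the defaults mentioned in the text:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;mapred.child.java.opts&lt;/name&gt;
  &lt;!-- 512 MB heap = 536870912 bytes --&gt;
  &lt;value&gt;-Xmx512m&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.job.shuffle.input.buffer.percent&lt;/name&gt;
  &lt;!-- 0.70 * 536870912 = 375809638 bytes in theory, 334063200 according to the log above;
       a single map output may use at most 25% of that: 83515800 bytes --&gt;
  &lt;value&gt;0.70&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.reduce.parallel.copies&lt;/name&gt;
  &lt;!-- number of copier threads per Reduce task --&gt;
  &lt;value&gt;5&lt;/value&gt;
&lt;/property&gt;</pre>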
<p>Now that everything&#8217;s set up the copiers will start their work and fetch the output. You&#8217;ll see a bunch of log lines like these:</p>
<pre class="brush:plain">2010-12-05 01:53:11,114 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201012031527_0021_m_000011_0, compressed len: 454055, decompressed len: 454051
2010-12-05 01:53:11,114 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 454051 bytes (454055 raw bytes) into RAM from attempt_201012031527_0021_m_000011_0
2010-12-05 01:53:11,133 INFO org.apache.hadoop.mapred.ReduceTask: Read 454051 bytes from map-output for attempt_201012031527_0021_m_000011_0
2010-12-05 01:53:11,133 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201012031527_0021_m_000011_0 -&gt; (70, 6) from c1n7.gbif.org</pre>
<p>In the first line you see that a map output was successfully copied and it could read the size of the data from the headers. The next line is actually what we&#8217;ve talked about earlier: The map output will now be decompressed (if it was compressed) and saved into memory using the <code>ShuffleRamManager</code>. The third line acknowledges that this succeeded. And the last line is information for a <a href="https://issues.apache.org/jira/browse/HADOOP-3647">bug</a> and should have been removed already according to a comment in the source code.</p>
<p>If for whatever reason the map output doesn&#8217;t fit into memory you will see a similar log line to the second one above but &#8220;<code>RAM</code>&#8221; will be replaced by &#8220;<code>Local-FS</code>&#8221; and the fourth line will be missing. You obviously want as much data into memory as possible so shuffling on to the Local-FS is a warning sign or at least a sign for possible optimizations.</p>
<p>While map outputs are still being fetched, two threads (Thread for merging on-disk files and Thread for merging in memory files) wait for certain conditions before they become active. The conditions are as follows:</p>
<ul>
<li>The used memory in the in-memory buffer is above <code>mapred.job.shuffle.merge.percent</code> (default is 66%, in our example that would mean 334063200 * 0.66 = 220481712 bytes) <em>and</em> there are at least two map outputs in the buffer</li>
<li>or there are more than <code>mapred.inmem.merge.threshold</code> (defaults to 1000) map outputs in the in-memory buffer, independent of the size</li>
<li>or if there are at least 2 * <code>io.sort.factor</code> - 1 files on <em>disk</em>.</li>
</ul>
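<p>The three thresholds from this list map to the following properties. This is a sketch for <code>mapred-site.xml</code> with the default values mentioned in this post; the byte figure in the first comment comes from our example above:</p>
<pre class="brush:xml">&lt;property&gt;
  &lt;name&gt;mapred.job.shuffle.merge.percent&lt;/name&gt;
  &lt;!-- in-memory merge once 66% of the shuffle buffer is used:
       0.66 * 334063200 = 220481712 bytes in the example --&gt;
  &lt;value&gt;0.66&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;mapred.inmem.merge.threshold&lt;/name&gt;
  &lt;!-- in-memory merge once this many map outputs are buffered, whatever their size --&gt;
  &lt;value&gt;1000&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;io.sort.factor&lt;/name&gt;
  &lt;!-- on-disk merge threshold derived from this value: 2 * 10 - 1 = 19 files --&gt;
  &lt;value&gt;10&lt;/value&gt;
&lt;/property&gt;</pre>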
<p>When one of the first two conditions triggers you&#8217;ll see something like this:</p>
<pre class="brush:plain">2010-12-05 01:53:42,106 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 501 segments...
2010-12-05 01:53:42,114 INFO org.apache.hadoop.mapred.Merger: Merging 501 sorted segments
...
2010-12-05 01:53:46,492 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012031527_0021_r_000103_0 Merge of the 501 files in-memory complete. Local file is /mnt/disk1/hadoop/mapreduce/local/taskTracker/lfrancke/jobcache/job_201012031527_0021/attempt_201012031527_0021_r_000103_0/output/map_1.out of size 220545981</pre>
<p>This could actually trigger the third condition as it writes a new file to disk. When that happens you&#8217;ll see something like this:</p>
<pre class="brush:plain">2010-12-10 14:28:23,289 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201012101346_0001_r_000012_0We have  19 map outputs on disk. Triggering merge of 10 files</pre>
<p>The <code>io.sort.factor</code> was set to the default of 10. 10 (out of the 19) files will be merged into one, leaving 10 on disk (i.e. <code>io.sort.factor</code>).</p>
<p>Both of these (the in-memory and the on-disk merge, the latter is also called <em>Interleaved on-disk merge</em>) will produce a new single output file and write it to disk. All of this only goes on as long as map outputs are still being fetched. When that&#8217;s finished we wait for running merges to finish but don&#8217;t start any new ones in these threads:</p>
<pre class="brush:plain">2010-12-05 01:59:10,598 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 3 files left.
2010-12-05 01:59:10,599 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 314 files left.</pre>
<p>As you can see from the timestamps no merges were running in our case so everything just shut down. During the copy phase we finished a total of three in-memory merges, which is why we currently have three files on disk. 314 more map outputs are still in the in-memory buffer. This concludes the Copy phase and the Sort phase begins:</p>
<pre class="brush:plain">2010-12-05 01:59:10,605 INFO org.apache.hadoop.mapred.Merger: Merging 314 sorted segments
2010-12-05 01:59:10,605 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 314 segments left of total size: 127512782 bytes
2010-12-05 01:59:13,903 INFO org.apache.hadoop.mapred.ReduceTask: Merged 314 segments, 127512782 bytes to disk to satisfy reduce memory limit
2010-12-05 01:59:13,904 INFO org.apache.hadoop.mapred.ReduceTask: Merging 4 files, 788519164 bytes from disk
2010-12-05 01:59:13,905 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-05 01:59:13,905 INFO org.apache.hadoop.mapred.Merger: Merging 4 sorted segments
2010-12-05 01:59:14,493 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 4 segments left of total size: 788519148 bytes</pre>
<p>There are two things happening here. First of all the remaining 314 files that are still in memory are merged into one file on the disk (the first three lines). So now there are four files on the disk. These four files are merged into one.</p>
<p>The option <code>mapred.job.reduce.input.buffer.percent</code>, which is set to 0 by default, allows the Reducer to keep some map outputs in memory. The following is a snippet with this property set to 0.7:</p>
<pre class="brush:plain">2010-12-05 23:11:55,657 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3 files, 661137901 bytes from disk
2010-12-05 23:11:55,660 INFO org.apache.hadoop.mapred.ReduceTask: Merging 312 segments, 127381881 bytes from memory into reduce
2010-12-05 23:11:55,661 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2010-12-05 23:11:55,688 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 661137889 bytes
2010-12-05 23:11:55,688 INFO org.apache.hadoop.mapred.Merger: Merging 313 sorted segments
2010-12-05 23:11:55,689 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 313 segments left of total size: 788519778 bytes</pre>
<p>You can see that instead of merging the 312 segments from memory to disk they are kept in memory while the three files on disk are merged into one and all of the resulting 313 segments are streamed into the reducer.</p>
<p>There seems to be a bug in Hadoop though. I&#8217;m not 100% sure about this one so any insight would be appreciated. When the following conditions are true segments from the memory don&#8217;t seem to be written to disk even if they should be according to the configuration:</p>
<ul>
<li>There are segments in memory that should be written to disk before the reduce task begins according to <code>mapred.job.reduce.input.buffer.percent</code></li>
<li><em>and</em> there are more files on disk than <code>io.sort.factor</code></li>
</ul>
<p>If this happens you see this:</p>
<pre class="brush:plain">2010-12-10 16:39:40,671 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 14 segments, 18888592 bytes in memory for intermediate, on-disk merge
2010-12-10 16:39:40,673 INFO org.apache.hadoop.mapred.ReduceTask: Merging 10 files, 4143441520 bytes from disk
2010-12-10 16:39:40,674 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-10 16:39:40,674 INFO org.apache.hadoop.mapred.Merger: Merging 24 sorted segments
2010-12-10 16:39:40,859 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 24 segments left of total size: 4143441480 bytes</pre>
<p>So the steps being done in the Sort phase are the following:</p>
<ol>
<li>Merge all segments (= map outputs) that are still in memory and don&#8217;t fit into the memory specified by <code>mapred.job.reduce.input.buffer.percent</code> into one file on disk <em>if</em> there are fewer than <code>io.sort.factor</code> files on disk, so we end up with at most <code>io.sort.factor</code> files on disk after this step. If there are already <code>io.sort.factor</code> or more files on disk but there are map outputs that need to be written out of memory, keep them in memory for now.
<ol>
<li>In the first case you&#8217;ll see a log message like this: <code>Merged 314 segments, 127512782 bytes to disk to satisfy reduce memory limit</code></li>
<li>In the second case you&#8217;ll see this: <code>Keeping 14 segments, 18888592 bytes in memory for intermediate, on-disk merge</code></li>
</ol>
</li>
<li>All files on disk and all remaining files in memory that need to be merged (case 1.b) are determined. You&#8217;ll see a log message like this: &#8220;<code>Merging 4 files, 788519164 bytes from disk</code>&#8221;.</li>
<li>All files that remain in memory during the Reduce phase are determined: &#8220;<code>Merging 312 segments, 127381881 bytes from memory into reduce</code>&#8221;.</li>
<li>All files (on disk + in-memory) from step 2 are merged together using <code>io.sort.factor</code> as the merge factor, which means that there might be intermediate merges to disk.</li>
<li>Merge all remaining in-memory (from step 3.) and on-disk files (from step 4.) into one stream to be read by the Reducer. This is done in a streaming fashion without writing new data to disk and just returning an Iterator to the Reduce phase.</li>
</ol>
<p>This Iterator is given to the Reducer and so the Reduce phase starts.</p>
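<p>If you want to see which of these steps a particular reduce task went through, you can count the corresponding messages in its log. The following is a rough sketch; the function name is made up and the search patterns are simply copied from the log lines quoted in this post.</p>
<pre class="brush:shell"># Summarize which merge steps show up in a reduce task log.
# $1 is the path to the log file.
merge_summary() {
  log=$1
  echo "in-memory merges during copy: $(grep -c 'Initiating in-memory merge' "$log")"
  echo "on-disk merges during copy: $(grep -c 'Triggering merge of' "$log")"
  echo "flushed to disk for reduce: $(grep -c 'to satisfy reduce memory limit' "$log")"
  echo "kept in memory for on-disk merge: $(grep -c 'in memory for intermediate, on-disk merge' "$log")"
}

# Usage (hypothetical path):
#   merge_summary /path/to/reduce-task/syslog</pre>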
<p>Well, this turned out to be a rather detailed description of the process, but it helps in understanding the configuration properties available to you. See below for a detailed list of all the relevant properties:</p>
<h4><code>io.sort.factor</code></h4>
<ul>
<li>Default: <code>10</code></li>
<li>We: <code>100</code></li>
<li>Safe: 20-100</li>
<li>Renamed to <code>mapreduce.task.io.sort.factor</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>I&#8217;ve explained pretty thoroughly what this parameter does so I won&#8217;t go into detail here. The whole situation with <code>io.sort.factor</code> and <code>io.sort.mb</code> is not ideal but as long as they are the options we have and the defaults are very low it is pretty safe to change them to more reasonable values. It is worthwhile to take a look at your logs and search for the lines mentioned in the explanation above. This can be set on a per-job basis and for jobs that run frequently it&#8217;s worth finding a good job-specific value.</p>
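<p>One way to try job-specific values without touching the cluster configuration is to pass them as generic <code>-D</code> options on the command line. The helper below is just a sketch to build those options. It assumes that the job&#8217;s main class parses generic options (e.g. by running through <code>ToolRunner</code>); the function name, jar, class and paths are all hypothetical.</p>
<pre class="brush:shell"># Turn "name=value" arguments into generic -D options for "hadoop jar".
hadoop_job_opts() {
  for kv in "$@"; do
    printf '%s %s ' -D "$kv"
  done
}

# Usage (jar, class and paths are hypothetical):
#   hadoop jar my-job.jar com.example.MyJob $(hadoop_job_opts io.sort.factor=100 io.sort.mb=200) in out</pre>
<p>The generic options have to come before the job&#8217;s own arguments.</p>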
<h4><code>io.sort.mb</code></h4>
<ul>
<li>Default: <code>100</code></li>
<li>Renamed to <code>mapreduce.task.io.sort.mb</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>You can adjust the amount of memory used in the Mappers to collect Map outputs with this parameter. This parameter obviously depends heavily on the amount of memory you have available in total for your child VMs and on the memory requirements of your tasks. Your goal should be to minimize the amount of spilling that has to be performed as explained above and to utilize the available memory as well as possible. If your Map tasks don&#8217;t need a lot of memory themselves you can use almost all available memory here. The default settings allocate 200 MB for child VMs and half of that is used for the output buffer so your Map tasks have about 100 MB available by default.</p>
<h4><code>io.sort.record.percent</code></h4>
<ul>
<li>Default: <code>0.05</code></li>
<li>This has been removed in favor of <a href="https://issues.apache.org/jira/browse/MAPREDUCE-64">automatic configuration</a> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>The output buffer on the map side is split into two parts. One stores the actual bytes of the output data and the other one stores 16 bytes of metadata per output record. This property specifies how much of the buffer (<code>io.sort.mb</code>) is used for tracking the metadata. The default is 5%, which is often too low for jobs whose map tasks emit small records. Look for log lines indicating that a spill to disk occurred because of <code>record full = true</code>. If this happens try to increase this value. This is another property which is very specific to the jobs you&#8217;re running so it might need tuning for each and every job.</p>
<p>Thankfully this mechanism has been replaced in Hadoop 0.21.</p>
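<p>For the pre-0.21 behaviour you can estimate how many records fit into the metadata part of the buffer. This is a back-of-the-envelope sketch using the 16 bytes per record mentioned above. The function name is made up, the percentage is passed as a whole number (5 for 0.05), and the spill threshold (<code>io.sort.spill.percent</code>) is ignored to keep it simple.</p>
<pre class="brush:shell"># Rough number of map output records whose metadata fits into the
# accounting part of the buffer.
# $1 = io.sort.mb, $2 = io.sort.record.percent as integer percent.
# Assumes 16 bytes of metadata per record.
max_buffered_records() {
  mb=$1; percent=$2
  echo $(( mb * 1024 * 1024 * percent / 100 / 16 ))
}

# max_buffered_records 100 5    prints 327680 (the defaults)
# max_buffered_records 100 10   prints 655360</pre>
<p>If your records are so small that this many of them take up far less than the rest of the buffer, the metadata part fills up first and you get the <code>record full = true</code> spills.</p>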
<h4><code>io.sort.spill.percent</code></h4>
<ul>
<li>Default: <code>0.8</code></li>
<li>Renamed to <code>mapreduce.map.sort.spill.percent</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>This property configures how full the map output buffer has to be before its data is written (spilled) to disk. The spilling process runs in a separate thread and output is still collected while it is running, so it is important to start this process before the buffer is completely full: once the buffer is full the map task pauses until space is available again.</p>
<h4><code>mapred.job.tracker</code></h4>
<ul>
<li>Default: <code>local</code></li>
<li>We: <code>&lt;jobtracker&gt;:8021</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.jobtracker.address</code> in Hadoop 0.21</li>
<li>Needed on the clients</li>
</ul>
<p>This lets the client know where to find the JobTracker and it lets the JobTracker know which port to bind to.</p>
<h4><code>mapred.local.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/mapred/local</code></li>
<li>We: <code>/mnt/disk1/hadoop/mapreduce/local,/mnt/disk2/hadoop/mapreduce/local</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.cluster.local.dir</code> in Hadoop 0.21</li>
</ul>
<p>This lets the MapReduce servers know where to store intermediate files. This may be a comma-separated list of directories to spread the load. Make sure there&#8217;s enough space here for all your intermediate files. We share the same disks for MapReduce and HDFS.</p>
<h4><code>mapred.system.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/mapred/system</code></li>
<li>We: <code>/hadoop/mapred/system</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.jobtracker.system.dir</code> in Hadoop 0.21</li>
</ul>
<p>This is a folder in the <code>defaultFS</code> where MapReduce stores some control files. In our case that would be a directory in HDFS. If you have <code>dfs.permissions</code> enabled (which it is by default) make sure that this directory exists and is owned by mapred:hadoop.</p>
<h4><code>mapred.temp.dir</code></h4>
<ul>
<li>Default: <code>${hadoop.tmp.dir}/mapred/temp</code></li>
<li>We: <code>/tmp/mapreduce</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.cluster.temp.dir</code> in Hadoop 0.21</li>
</ul>
<p>This is a folder to store temporary files in. It is hardly used, if at all. If I understand the description correctly this is supposed to be in HDFS but I&#8217;m not entirely sure after reading the source code. So we set this to a directory that exists on the local filesystem as well as in HDFS.</p>
<h4><code>mapred.map.tasks</code></h4>
<ul>
<li>Default: <code>2</code></li>
<li>Renamed to <code>mapreduce.job.maps</code> in Hadoop 0.21</li>
</ul>
<p>It is important to realize that this is just a hint for MapReduce as to the number of Maps it should use. In most cases this value is ignored and the actual number of Maps depends on the input data and is determined automatically. For those rare cases where this value is used we set it to about 90% of our map slot capacity. This can be set client-side per job so if you have a job that relies on this property you should set it there to an appropriate value.</p>
<h4><code>mapred.reduce.tasks</code></h4>
<ul>
<li>Default: <code>1</code></li>
<li>Renamed to <code>mapreduce.job.reduces</code> in Hadoop 0.21</li>
</ul>
<p>This differs from the property for map tasks in that it is often not possible to calculate a &#8220;native&#8221; or optimal number of reduce tasks for a job. With this property you can specify the number of reduce tasks to start for a given job. The default is very low. The description suggests setting this to 99% of the cluster capacity so that all reduces finish in one wave. This is sensible when you use the default scheduler but as soon as multiple jobs run in parallel it&#8217;s hard to guarantee that all reduces of one job finish in one wave. We&#8217;re constantly playing around with this and currently have this at about 50% of our capacity.</p>
<p>This too can be specified on a per-job basis.</p>
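<p>As a rough illustration of those percentages, here is a small sketch that derives a value from the reduce slot capacity. It assumes a homogeneous cluster where every TaskTracker has the same number of reduce slots; the function name and the numbers are made up.</p>
<pre class="brush:shell"># Suggested value for mapred.reduce.tasks.
# $1 = number of TaskTrackers, $2 = reduce slots per TaskTracker,
# $3 = share of the reduce capacity to use, as integer percent.
suggested_reduce_tasks() {
  echo $(( $1 * $2 * $3 / 100 ))
}

# suggested_reduce_tasks 20 4 99   prints 79
# suggested_reduce_tasks 20 4 50   prints 40</pre>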
<h4><code>mapred.jobtracker.taskScheduler</code></h4>
<ul>
<li>Default: <code>org.apache.hadoop.mapred.JobQueueTaskScheduler</code></li>
<li>We: <code>org.apache.hadoop.mapred.FairScheduler</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.jobtracker.taskscheduler</code> in Hadoop 0.21</li>
</ul>
<p>With the default configuration all jobs are placed in a priority FIFO queue and submitted one after the other. This is fine for testing but it doesn&#8217;t utilize the available resources very well. This property allows you to change the scheduler used. These are the available schedulers in CDH3b3:</p>
<ul>
<li>JobQueueTaskScheduler</li>
<li><a href="http://archive.cloudera.com/cdh/3/hadoop/fair_scheduler.html">FairScheduler</a></li>
<li><a href="http://archive.cloudera.com/cdh/3/hadoop/capacity_scheduler.html">CapacityScheduler</a></li>
</ul>
<p>Depending on the scheduler you decide to use there may be additional properties which I&#8217;m not going to mention here. Have a look at the dedicated documentation.</p>
<h4><code>mapred.reduce.parallel.copies</code></h4>
<ul>
<li>Default: <code>5</code></li>
<li>We: ~20-50</li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.reduce.shuffle.parallelcopies</code> in Hadoop 0.21</li>
</ul>
<p>The reduce tasks have to fetch the map outputs from the remote servers. They have to fetch the output from each map, of which there may be thousands. This option lets you parallelize the copy process. Tuning this value is very worthwhile. In our first tests this property gave us one of the best performance increases of all properties. We started to increase this property in steps of 5 and looked very carefully at the logs and our monitoring system to find a value that works for us. We&#8217;ve not yet finished this process but values between 20 and 50 seem to mostly work without problems.</p>
<h4><code>mapred.tasktracker.map.tasks.maximum</code> &amp; <code>mapred.tasktracker.reduce.tasks.maximum</code></h4>
<ul>
<li>Default: 2</li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.tasktracker.map.tasks.maximum</code> &amp; <code>mapreduce.tasktracker.reduce.tasks.maximum</code> in Hadoop 0.21</li>
</ul>
<p>This setting is very important and we&#8217;ve yet to find values that we are comfortable with. This setting can be different on each TaskTracker and defines how many map or reduce task &#8220;slots&#8221; there are on a specific TaskTracker. You need to set these to values that don&#8217;t overload your servers while still fully utilizing them. You also need to make sure that there&#8217;s enough memory for all tasks and services running on a server (see mapred.child.java.opts).</p>
<p>By setting this property to different values depending on your server configuration you can easily use heterogeneous hardware in your cluster. Each distinct hardware configuration will have these properties set to different values.</p>
<p>A general rule from the book <a href="http://oreilly.com/catalog/0636920010388">Hadoop: The Definitive Guide</a> says that these properties can be set to <code>number of cores - 1</code>. We&#8217;ve tried various settings now but found the load on the servers to be very high with those settings so we&#8217;ll have to do more benchmarking.</p>
<h4><code>mapred.child.java.opts</code></h4>
<ul>
<li>Default: <code>-Xmx200m</code></li>
<li>Can be used in the client configuration</li>
</ul>
<p>These are the options given to each child JVM that is started (map and reduce tasks). The default just sets the maximum memory to 200 MB. This can be set on the client to pass options needed for a specific job. GC logging for example can be enabled as well. This isn&#8217;t configurable on a per-TaskTracker basis so you have to make sure that every machine in your cluster fulfills the requirements. Available memory needs to be at least <code>(mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum) * Xmx</code>.</p>
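<p>The formula above is easy to check for a given machine. Here is a minimal sketch with a made-up function name; keep in mind that it only covers the heap of the child JVMs, so the JVMs&#8217; own overhead and the other services on the server come on top of that.</p>
<pre class="brush:shell"># Minimum memory in MB that all child JVMs of one TaskTracker can claim.
# $1 = map slots, $2 = reduce slots, $3 = -Xmx value in MB.
child_heap_total_mb() {
  echo $(( ($1 + $2) * $3 ))
}

# child_heap_total_mb 2 2 200    prints 800 (the defaults)
# child_heap_total_mb 7 7 1024   prints 14336</pre>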
<h4><code>mapred.inmem.merge.threshold</code></h4>
<ul>
<li>Default: <code>1000</code></li>
<li>We: <code>0</code></li>
<li>Renamed to <code>mapreduce.reduce.merge.inmem.threshold</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>I&#8217;ve explained the effect of this property in the MapReduce description above but to reiterate: The reduce side fetches map outputs to memory. Once the memory is full or this many map outputs are in memory they are merged together to one file on the disk. This can be set on a per job basis but as a default we&#8217;ve disabled this behavior and just flush to disk when the memory is full. This seems to have been better for all our jobs so far but it&#8217;s definitely a property to look out for.</p>
<h4><code>mapred.job.shuffle.merge.percent</code></h4>
<ul>
<li>Default: <code>0.66</code></li>
<li>Renamed to <code>mapreduce.reduce.shuffle.merge.percent</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>Once the memory buffer in the copy (shuffle) phase of the reduce task is this full a background thread will start to merge all map outputs collected in memory so far and write them to a single file on disk. This is similar to what&#8217;s happening on the map side. In the default configuration the <code>mapred.inmem.merge.threshold</code> parameter might actually trigger a merge before this value is hit. We haven&#8217;t yet played around with this property but you&#8217;d have to be careful not to set it so high that the copy processes have to wait for the buffer to be emptied again. That could be a huge performance hit.</p>
<p>An addition to Hadoop&#8217;s logging would be nice that lets us know how full the buffer is the moment a merge finishes.</p>
<h4><code>mapred.job.shuffle.input.buffer.percent</code></h4>
<ul>
<li>Default: <code>0.7</code></li>
<li>Renamed to <code>mapreduce.reduce.shuffle.input.buffer.percent</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>This is the amount of memory from the total available memory (specified by mapred.child.java.opts) that&#8217;s allocated for collecting map outputs in memory on the reduce side. Another parameter we haven&#8217;t played around with but my guess would be that this can be easily set a little bit higher.</p>
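<p>To get a feeling for the sizes involved, here is a small sketch that combines this property with <code>mapred.job.shuffle.merge.percent</code>. The function names are made up and percentages are passed as whole numbers (70 for 0.7). It also assumes that the full <code>-Xmx</code> value is available, whereas the JVM usually reports a bit less, so treat the results as an upper bound.</p>
<pre class="brush:shell"># Size of the reduce-side in-memory shuffle buffer.
# $1 = child heap in bytes, $2 = shuffle input buffer percent (integer).
shuffle_buffer_bytes() {
  echo $(( $1 * $2 / 100 ))
}

# Fill level in bytes at which the in-memory merge starts.
# $1 = child heap in bytes, $2 = shuffle input buffer percent,
# $3 = shuffle merge percent (both as integer percent).
merge_trigger_bytes() {
  echo $(( $(shuffle_buffer_bytes "$1" "$2") * $3 / 100 ))
}

# shuffle_buffer_bytes 1000000000 70     prints 700000000
# merge_trigger_bytes 1000000000 70 66   prints 462000000</pre>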
<h4><code>mapred.job.reduce.input.buffer.percent</code></h4>
<ul>
<li>Default: <code>0.0</code></li>
<li>Renamed to <code>mapreduce.reduce.input.buffer.percent</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>Usually map outputs are written to disk when the sort phase (on the reduce side) ends. If you have reduce tasks that don&#8217;t need a lot of memory themselves you can set this to a higher value so that map outputs up to this amount of memory (as a fraction of the total available memory) aren&#8217;t written to disk but kept in memory. This is obviously faster than an intermediate spill to disk. Should be considered on a per-job basis.</p>
<h4><code>mapred.map.tasks.speculative.execution</code> &amp; <code>mapred.reduce.tasks.speculative.execution</code></h4>
<ul>
<li>Default: <code>true</code></li>
<li>We: <code>false</code></li>
<li>We will set this to final once we&#8217;re in production</li>
<li>Renamed to <code>mapreduce.map.speculative</code> &amp; <code>mapreduce.reduce.speculative</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>Speculative Execution starts multiple instances of certain map or reduce tasks when it detects certain circumstances (like an unusually slow task or node) to avoid waiting too long for stragglers. This sounds like a good idea and we&#8217;ve got it enabled at the moment but when we go to production this will probably be disabled as it uses valuable resources on the cluster that mostly go to waste and while one job may finish faster all the others have to wait longer.</p>
<h4><code>mapred.job.reuse.jvm.num.tasks</code></h4>
<ul>
<li>Default: <code>1</code></li>
<li>Renamed to <code>mapreduce.job.jvm.numtasks</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>Child JVMs are spawned for the map and reduce tasks. This parameter lets you reuse these VMs for multiple tasks. The default value creates a new JVM for each task which has some overhead (the book says about one second per JVM). We&#8217;ve played around with it a bit and it can make things faster but you&#8217;ve got to be careful with memory leaks and shared state. Basically you should be sure that your jobs can handle this. If you have a performance critical job you can play around with this but we&#8217;ve had some OutOfMemory errors when using this so we&#8217;re conservative at the moment. If you set it to <code>-1</code> a JVM will never be destroyed.</p>
<h4><code>tasktracker.http.threads</code></h4>
<ul>
<li>Default: <code>40</code></li>
<li>We: <code>80</code></li>
<li>We set this to final</li>
<li>Renamed to <code>mapreduce.tasktracker.http.threads</code> in Hadoop 0.21</li>
</ul>
<p>The map output is fetched by the reducers from the TaskTrackers via HTTP. This property lets you adjust the number of threads that serve those requests. When we upped the parallel copies we had some errors about fetch-failures so we slowly increased this value. Those two parameters need to be carefully tuned. 80 seemed to cause no problems for us so we stuck to it for now. You have to restart your TaskTrackers after changing this value.</p>
<h4><code>mapred.compress.map.output</code></h4>
<ul>
<li>Default: <code>false</code></li>
<li>We: <code>true</code></li>
<li>Renamed to <code>mapreduce.map.output.compress</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>Turning this on will compress the output of your Mappers using SequenceFile compression. Depending on the codec you choose this computation may be CPU intensive and result in varying degrees of compression. We&#8217;ve benchmarked jobs of different sizes with this intermediate compression enabled and disabled and while some of them took slightly longer than before it is still good to enable it. The cost isn&#8217;t too high and there is a lot less intermediate data generated. Less I/O in general is good especially if multiple jobs are running.</p>
<h4><code>mapred.map.output.compression.codec</code></h4>
<ul>
<li>Default: <code>org.apache.hadoop.io.compress.DefaultCodec</code></li>
<li>We: <code>com.hadoop.compression.lzo.LzoCodec</code></li>
<li>Renamed to <code>mapreduce.map.output.compress.codec</code> in Hadoop 0.21</li>
<li>Can be used in the client configuration</li>
</ul>
<p>With this property you select the compression codec to use for Map output compression. So far we&#8217;ve only tried LZO. This choice was based on the experience of others and the general properties of the algorithm, which is very fast but sacrifices a bit of compression efficiency for its speed. We plan to test the other algorithms as well.</p>
<h4><code>mapred.hosts</code></h4>
<ul>
<li>Default: no default set</li>
<li>We: <code>/etc/hadoop/conf/allowed_hosts</code></li>
<li>Renamed to <code>mapreduce.jobtracker.hosts.filename</code> in Hadoop 0.21</li>
<li>We set this to final</li>
</ul>
<p>This works like <code>dfs.hosts</code> but specifies which TaskTrackers are allowed to get work from the JobTracker. Both files have the same format so it&#8217;s quite common for them to be the same file.</p>
<h1>Conclusion</h1>
<p>After setting up all these parameters the way you like them you should have a fully functional but basic Hadoop cluster running. You can submit jobs, use HDFS etc. But there are a few more things that we can do like installing Hive, Hue, Pig, Sqoop, etc. We&#8217;ve also yet to cover Puppet. All of this is hopefully forthcoming in more blog posts in the future.</p>
<p>We&#8217;re also very interested in other users (or interested people and companies) of Hadoop, HBase &amp; Co. in Scandinavia who would be interested in a Hadoop Meetup. We&#8217;re located in Copenhagen. And I personally am also interested in other users from the Hamburg area. So contact me if you&#8217;re interested.</p>
<p>If you have any questions or spot any problems or mistakes please let me know in the comments or by <a href="mailto:lars.francke@gmail.com">mail</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lars-francke.de/2011/01/26/setting-up-a-hadoop-cluster-part-1-manual-installation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=29341&amp;popout=1&amp;url=http%3A%2F%2Fblog.lars-francke.de%2F2011%2F01%2F26%2Fsetting-up-a-hadoop-cluster-part-1-manual-installation%2F&amp;language=en_GB&amp;category=text&amp;title=Setting+up+a+Hadoop+cluster+%26%238211%3B+Part+1%3A+Manual+Installation&amp;description=Introduction+This+has+also+been+posted+on+the+GBIF+Developer+blog.+I%26%238217%3Bll+answer+questions+in+both+places+and+update+both+blogs+as+needed.+In+the+last+few+months+I+was...&amp;tags=cdh%2Ccloudera%2Chadoop%2Cblog" type="text/html" />
	</item>
		<item>
		<title>Performance testing HBase using YCSB</title>
		<link>http://blog.lars-francke.de/2010/08/16/performance-testing-hbase-using-ycsb/</link>
		<comments>http://blog.lars-francke.de/2010/08/16/performance-testing-hbase-using-ycsb/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 16:00:19 +0000</pubDate>
		<dc:creator>Lars Francke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[ycsb]]></category>

		<guid isPermaLink="false">http://blog.lars-francke.de/?p=62</guid>
		<description><![CDATA[Update May 2012: Some or all of this will be outdated now because the latest versions of YCSB are using Maven. I haven&#8217;t updated this article to reflect this. I assume most of you know what HBase is but just in case here is a snippet from Wikipedia: HBase is an open source, non-relational, distributed&#8230;]]></description>
			<content:encoded><![CDATA[<p><em>Update May 2012: Some or all of this will be outdated now because the latest versions of YCSB are using Maven. I haven&#8217;t updated this article to reflect this.</em></p>
<p>I assume most of you know what <a href="http://hbase.apache.org">HBase</a> is but just in case here is a snippet from <a href="http://en.wikipedia.org/wiki/HBase">Wikipedia</a>:</p>
<blockquote><p>HBase is an open source, non-relational, distributed database modeled after Google&#8217;s BigTable and is written in Java.</p></blockquote>
<p>Yahoo has published a <a href="http://research.yahoo.com/node/3202">paper</a> and the accompanying <a href="http://github.com/brianfrankcooper/YCSB">tool</a> (YCSB) about <em>Benchmarking Cloud Serving Systems with YCSB</em>. At the moment I am not interested in comparing different database systems against each other but only in benchmarking HBase. This is useful to test custom patches and their performance impact or to test different configuration options.</p>
<p>No matter which kind of workload you choose, however, keep in mind that this is an artificial benchmark and that it can&#8217;t replace a test with your real data and load.</p>
<p>In this short blog post I&#8217;m going to outline how to get YCSB running against a current version of HBase. I&#8217;m going to show this on a single machine. In a real test setup you should of course be running YCSB on a different machine (or <a href="http://wiki.github.com/brianfrankcooper/YCSB/running-a-workload-in-parallel">multiple machines</a>) than your HBase cluster. A YCSB benchmark consists of two phases: a <em>load</em> and a <em>transaction</em> phase. The <em>load</em> phase measures various statistics while importing a bunch of data into the database while the <em>transaction</em> phase does just that, i.e. transactions on the data. There are multiple predefined workloads that mimic typical database usage scenarios and you can also define your own.</p>
<h2>Requirements/Setup</h2>
<p>I am using a clean Ubuntu 10.04 installation but this should work on other distributions just as well.</p>
<p>While you&#8217;ll probably run it against an already set up cluster I will be using HBase in standalone mode here in its second development release of 0.89.</p>
<p>For YCSB I&#8217;ve used the latest version checked out from Github but the latest released version (<a href="http://github.com/brianfrankcooper/YCSB/downloads">0.1.2</a> at the time of this writing) should work equally well. So do this:</p>
<pre class="brush:shell">$ sudo apt-get -y install ant openjdk-6-jdk git-core
$ export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
$ wget http://apache.easy-webs.de/hbase/hbase-0.89.20100726/hbase-0.89.20100726-bin.tar.gz
$ tar xvzf hbase-0.89.20100726-bin.tar.gz
$ hbase-0.89.20100726/bin/start-hbase.sh
$ hbase-0.89.20100726/bin/hbase shell
  create 'usertable', 'family'
  exit
$ git clone http://github.com/brianfrankcooper/YCSB.git
$ cp hbase-0.89.20100726/lib/* YCSB/db/hbase/lib
$ cd YCSB
$ ant
$ ant dbcompile-hbase</pre>
<p>As you can see YCSB requires a table called <code>usertable</code> in HBase and it has to contain one column family with an arbitrary name (i.e. <code>family</code> in my case). YCSB also needs all the libraries (jars) that the HBase client needs to run. The easiest way is to just copy everything from HBase&#8217;s <code>lib</code> directory to the appropriate directory in YCSB.</p>
<h2>Running YCSB</h2>
<p>At this point we should have HBase running somewhere and YCSB and its HBase driver compiled. Time to load some data into HBase.</p>
<pre class="brush:shell">java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=1000 -s &gt; load.dat</pre>
<p>A few things to note here:</p>
<ul>
<li>This loads only 1000 records into HBase. You will want to increase the number to 100 million or more on a real test.</li>
<li>The <a href="http://wiki.github.com/brianfrankcooper/YCSB/running-a-workload">documentation</a> is pretty good so make sure to read it should you have problems.</li>
<li>The documentation suggests not specifying properties (like recordcount) on the command line but in a property file instead. You&#8217;ll find instructions on how to do this on the aforementioned page.</li>
<li>The <code>-s</code> parameter causes YCSB to print status messages to System.err every ten seconds. Remove it if you don&#8217;t want them.</li>
<li>After the load operation has finished you can find statistics in the <code>load.dat</code> file.</li>
</ul>
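<p>As a rough sketch of the property-file approach mentioned in the list above, the two <code>-p</code> settings could be moved into a small file. The file name <code>hbase-load.properties</code> is my own pick, not something YCSB requires:</p>
<pre class="brush:shell"># Collect the overrides in a property file (the file name is an arbitrary choice)
printf '%s\n' 'columnfamily=family' 'recordcount=1000' | tee hbase-load.properties</pre>
<p>You would then pass <code>-P hbase-load.properties</code> after <code>-P workloads/workloada</code> in place of the two <code>-p</code> options. As far as I understand the documentation, a value from a later property file overrides the same value from an earlier one.</p>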
<p>Now we&#8217;ll run the transactions part of the workload (again, for explanations see the documentation of YCSB):</p>
<pre class="brush:shell" style="font: normal normal normal 12px/18px Consolas, Monaco, 'Courier New', Courier, monospace;">java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=1000000 -s -threads 10 -target 100 &gt; transactions.dat</pre>
<p>or</p>
<pre class="brush:shell">java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=1000000 -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 &gt; transactions.dat</pre>
<p>After each run you should inspect the <code>transactions.dat</code> file. For explanations I&#8217;ll once again refer to the documentation. We&#8217;ve used <code>workloada</code> in these examples but there are in fact multiple predefined workloads (which are listed and explained in the <a href="http://wiki.github.com/brianfrankcooper/YCSB/core-workloads">documentation</a>).</p>
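<p>If you only want the headline numbers from a result file, a small helper along these lines can filter them out. It is just a sketch: it assumes the summary lines look like <code>[OVERALL], Throughput(ops/sec), ...</code>, and the function name <code>ycsb_summary</code> is made up for this post:</p>
<pre class="brush:shell"># Print only the run time, throughput and average latency lines of a YCSB result file
ycsb_summary() {
  grep -E '^\[[A-Z-]+], (RunTime|Throughput|AverageLatency)' "$1"
}</pre>
<p>Call it as <code>ycsb_summary transactions.dat</code> (or with <code>load.dat</code>) after a run has finished.</p>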
<p>That&#8217;s it. As you can see YCSB is pretty easy to set up. I still hope this guide was helpful in getting started with it. Let me know if you have any questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lars-francke.de/2010/08/16/performance-testing-hbase-using-ycsb/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=29341&amp;popout=1&amp;url=http%3A%2F%2Fblog.lars-francke.de%2F2010%2F08%2F16%2Fperformance-testing-hbase-using-ycsb%2F&amp;language=en_GB&amp;category=text&amp;title=Performance+testing+HBase+using+YCSB&amp;description=Update+May+2012%3A+Some+or+all+of+this+will+be+outdated+now+because+the+latest+versions+of+YCSB+are+using+Maven.+I+haven%26%238217%3Bt+updated+this+article+to+reflect+this.+I...&amp;tags=hbase%2Cycsb%2Cblog" type="text/html" />
	</item>
		<item>
		<title>Processing OpenStreetMap data with Hive</title>
		<link>http://blog.lars-francke.de/2010/07/22/processing-openstreetmap-data-with-hive/</link>
		<comments>http://blog.lars-francke.de/2010/07/22/processing-openstreetmap-data-with-hive/#comments</comments>
		<pubDate>Thu, 22 Jul 2010 15:00:33 +0000</pubDate>
		<dc:creator>Lars Francke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cloudera]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[openstreetmap]]></category>

		<guid isPermaLink="false">http://blog.lars-francke.de/?p=7</guid>
		<description><![CDATA[Update: I have updated the post with information about the Sqoop problem with reserved keywords. See below or this issue. Update 2: The problem with Hue seems to be HUE-54. Hue just doesn&#8217;t seem to handle anything beyond ASCII at the moment (Python and UTF-8 always were a frickle beast) Update 3: Last problem solved.&#8230;]]></description>
			<content:encoded><![CDATA[<p><em>Update: I have updated the post with information about the Sqoop problem with reserved keywords. See below or </em><a href="https://issues.cloudera.org/browse/SQOOP-37"><em>this issue</em></a><em>.<br />
Update 2: The problem with Hue seems to be <a href="https://issues.cloudera.org/browse/HUE-54">HUE-54</a>. Hue just doesn&#8217;t seem to handle anything beyond ASCII at the moment (Python and UTF-8 always were a frickle beast)<br />
Update 3: Last problem solved. Direct import with PostgreSQL seems to be working. See <a href="https://issues.cloudera.org/browse/SQOOP-38">SQOOP-38</a> and below.<br />
Update 4: There&#8217;s yet another problem with the direct import option for PostgreSQL. Don&#8217;t use it for any tables that contain boolean columns: </em><em><a href="https://issues.cloudera.org/browse/SQOOP-43">SQOOP-43</a></em></p>
<p>As you might or might not know (depending on how you found your way to this blog post) I&#8217;m a heavy user of OpenStreetMap (OSM) and I try to promote it whenever I can. I run the smallish website <a title="OSMdoc" href="http://osmdoc.com">OSMdoc</a> which analyzes a bit of the OSM data. As I also work with the HStack (<a title="Hadoop" href="https://hadoop.apache.org/">Hadoop</a>, <a title="HBase" href="http://hbase.apache.org/">HBase</a>, etc.) I always wanted to combine those two. So this article shows how to install Hadoop and <a title="Hive" href="https://hadoop.apache.org/hive/">Hive</a> on a fresh installation of <a title="Ubuntu" href="http://www.ubuntu.com/">Ubuntu 10.04</a> and load OSM data into it to run queries against it.</p>
<p>I am using <a href="http://www.cloudera.com/">Cloudera&#8217;s</a> <a href="http://www.cloudera.com/hadoop/">CDH3</a> (version Beta 2) distribution for this. I could have also used their pre-built <a href="http://www.cloudera.com/developers/downloads/">Virtual Machine</a> but I wanted to learn more and install everything myself. If you follow this post you should hopefully end up with a working way to use Hive and OSM yourself. While the way I use may not be perfect it is one that doesn&#8217;t require us to write any code for now.</p>
<p>Here is the outline of what I&#8217;m doing:</p>
<ul>
<li>Start with a fresh, updated installation of Ubuntu 10.04 in a virtual machine (I&#8217;m using <a href="http://www.virtualbox.org/">Virtualbox</a>)</li>
<li>Install <a href="http://www.postgresql.org/">PostgreSQL</a> and <a href="http://wiki.openstreetmap.org/wiki/Osmosis">Osmosis</a></li>
<li>Import OSM data into PostgreSQL</li>
<li>Install Hadoop, <a href="http://archive.cloudera.com/cdh/3/hue/manual.html">Hue</a>, <a href="http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html">Sqoop</a> and Hive</li>
<li>Load data from PostgreSQL to Hive using Sqoop</li>
<li>Run Hive queries</li>
</ul>
<h2>Install and set up PostgreSQL</h2>
<p>We&#8217;ll need PostgreSQL to store the OSM data as is usual in the OSM world. Later on we&#8217;ll also use it to host the metastore for Hive.</p>
<p>This one is easy thanks to the package system of Ubuntu:</p>
<pre class="brush:shell">sudo apt-get -y install postgresql
sudo -u postgres createuser -S -D -R -P osm
sudo -u postgres createdb -O osm osm
sudo sed -i "/^# \"local\"/i\local all all md5" /etc/postgresql/8.4/main/pg_hba.conf
sudo service postgresql-8.4 restart</pre>
<p>This installs PostgreSQL 8.4, creates an <code>osm</code> user which owns a database called <code>osm</code> and allows password-authenticated access for this user from the local machine. Choose any password, just remember it.</p>
<h2>Install Osmosis</h2>
<p>For those who don&#8217;t know: Osmosis is the tool of choice for any tasks related to OSM data. It is used by almost everyone to keep databases updated and for various other tasks. It also runs on the OSM servers to provide diff files for OSM data. We use it here to import OSM data in a simple database schema.</p>
<p>Again this is relatively easy. Unfortunately there is no .deb package to install for Osmosis (yet) so we&#8217;ve got to do it manually. The first step is to install Java. It is recommended to use the Sun version of Java for Hadoop but as of Ubuntu 10.04 <a href="http://openjdk.java.net/">OpenJDK</a> is the default. We&#8217;ve got to add another repository so we can install the Sun version first. If that is not possible for you or you want to use OpenJDK it should still work but there were some bugs in there that made the Hadoop/HBase guys recommend the Sun JDK.</p>
<p>As we&#8217;ll need Osmosis only occasionally or for a one-off job I won&#8217;t bother setting it up and installing it in the system directories. You&#8217;ll also have to get some OSM data. For testing I just use a small extract <a title="Geofabrik download server" href="http://download.geofabrik.de/osm/">provided</a> by <a title="You need OSM services ask these guys" href="http://www.geofabrik.de/">Geofabrik</a> but you might want to parse the whole planet or other extracts.</p>
<pre class="brush:shell">sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get -y install sun-java6-jdk
cd ~
wget http://dev.openstreetmap.org/~bretth/osmosis-build/osmosis-bin-latest.tgz
tar xvfz osmosis-bin-latest.tgz
psql -U osm -f osmosis-0.35.1/script/contrib/apidb_0.6.sql -d osm
osmosis-0.35.1/bin/osmosis --read-xml file="&lt;OSM XML file here&gt;" --write-apidb host="localhost" database="osm" user="osm" password="&lt;your password&gt;"</pre>
<p>Depending on which OSM data set you chose to import this can take a while. But the beautiful thing is that we can continue with installing Hadoop while this import is running. So open a new terminal and continue, Osmosis will eventually finish.</p>
<h2>Installing Hadoop, Hue, Hive and Sqoop</h2>
<p>Thanks to Cloudera this is pretty straightforward:</p>
<pre class="brush:shell"># Add Cloudera repositories
sudo sh -c 'echo "deb http://archive.cloudera.com/debian `lsb_release -c -s`-cdh3 contrib" &gt; /etc/apt/sources.list.d/cloudera.list'
sudo sh -c 'echo "deb-src http://archive.cloudera.com/debian `lsb_release -c -s`-cdh3 contrib" &gt;&gt; /etc/apt/sources.list.d/cloudera.list'
wget -q -O - http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install hadoop

# Install Hue
sudo apt-get -y install hadoop-0.20-conf-pseudo-hue

# Install Sqoop
sudo apt-get -y install sqoop

# Install Hive
sudo apt-get -y install hadoop-hive
sudo -u postgres createuser -S -D -R -P hive
sudo -u postgres createdb -O hive hive</pre>
<p>After this is finished all those tools should be installed and PostgreSQL is prepared for Hive. What is left to do is to set up Hive and Hue to use PostgreSQL instead of SQLite/Derby so they can share the metastore.</p>
<p>You need to edit the file <code><a href="http://wiki.apache.org/hadoop/Hive/AdminManual/MetastoreAdmin#Local_Metastore">/etc/hive/conf/hive-site.xml</a></code> and in particular the following properties:</p>
<ul>
<li><code>javax.jdo.option.ConnectionURL</code>: In our current setup this would be <code>jdbc:postgresql://localhost/osm</code></li>
<li><code>javax.jdo.option.ConnectionUserName</code>: <code>hive</code></li>
<li><code>javax.jdo.option.ConnectionPassword</code>: The password you chose earlier for the database user</li>
<li><code>javax.jdo.option.ConnectionDriverName</code>: <code>org.postgresql.Driver</code></li>
</ul>
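<p>To double-check the edit, a helper like the following prints every <code>javax.jdo.option</code> property name together with the line that follows it. This is only a sketch: it assumes the <code>name</code> and <code>value</code> elements sit on consecutive lines, as in the usual Hadoop-style configuration layout, and the function name is made up:</p>
<pre class="brush:shell"># Show each metastore connection property and the line after it (normally its value)
show_metastore_settings() {
  grep -A 1 'javax.jdo.option' "$1"
}</pre>
<p>Call it as <code>show_metastore_settings /etc/hive/conf/hive-site.xml</code> and compare the output with the list above. Keep in mind that the output includes the database password.</p>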
<p>Additionally you need to set the property <code>hive_conf_dir</code> in the file <code><a href="http://archive.cloudera.com/cdh/3/hue/manual.html#_beeswax_the_hive_ui">/etc/hue/hue-beeswax.ini</a></code> to <code>/etc/hive/conf</code>.</p>
<p>If you followed my post so far <em>and</em> used the password <code>hive</code> for the PostgreSQL user you can just run the following commands:</p>
<pre class="brush:as3">wget http://gist.github.com/raw/485836/512357fef1be0ac9cf8596770939355fc61a4d1c/hive-site.xml
sudo mv hive-site.xml /etc/hive/conf
sudo chown root:root /etc/hive/conf/hive-site.xml
wget http://gist.github.com/raw/485836/e32b00bc69744509123ef584be226328c37ccf77/hue-beeswax.ini
sudo mv hue-beeswax.ini /etc/hue</pre>
<p>Now you need to <a title="PostgreSQL JDBC site" href="http://jdbc.postgresql.org/download/postgresql-8.4-701.jdbc4.jar">download</a> the current JDBC driver for PostgreSQL. At the time of this writing this is version <a title="PostgreSQL JDBC Driver 8.4-701 download" href="http://jdbc.postgresql.org/download/postgresql-8.4-701.jdbc4.jar">8.4-701</a>:</p>
<pre class="brush:shell">wget http://jdbc.postgresql.org/download/postgresql-8.4-701.jdbc4.jar
sudo mv postgresql-8.4-701.jdbc4.jar /usr/lib/hadoop-0.20/lib/
sudo chown -R hadoop:hadoop /usr/lib/hadoop-0.20/lib/postgresql-8.4-701.jdbc4.jar</pre>
<p>Now all that is left is to start Hadoop and Hue. Please note that the startup takes a while (at least it does for me, and I&#8217;m not sure whether that is normal):</p>
<pre class="brush:shell">for x in /etc/init.d/hadoop-0.20-*; do sudo $x start; done
sudo /etc/init.d/hue start</pre>
<p>After this you should have three web interfaces:</p>
<ul>
<li><a href="http://localhost:50030">http://localhost:50030</a> &#8211; MapReduce</li>
<li><a href="http://localhost:50070">http://localhost:50070</a> &#8211; HDFS</li>
<li><a href="http://localhost:8088">http://localhost:8088</a> &#8211; Hue</li>
</ul>
<p>Take the time to look at all those sites and make sure that everything seems fine. Also browse through HDFS using Hue and see if Beeswax and Hive work by clicking on the <em>Tables</em> button; there should be no tables for now.</p>
<h2>Importing the OSM data from PostgreSQL to HDFS</h2>
<p>We&#8217;re almost done. All that&#8217;s left to do is to get the data from PostgreSQL to HDFS and into Hive. That&#8217;s what Sqoop is for and here is how to run it:</p>
<pre class="brush:shell">sqoop import --connect "jdbc:postgresql://localhost/osm" --username osm --password &lt;your password&gt; --table node_tags --hive-import</pre>
<p>An alternative to this command is the so-called <em>direct</em> mode, which does not use JDBC to connect to PostgreSQL but uses the <em>psql</em> tool to issue <a href="http://www.postgresql.org/docs/8.4/interactive/sql-copy.html">COPY</a> commands, which speeds up the export. There is currently a bug in Sqoop (see below), so the command to start Sqoop is slightly different to work around it:</p>
<pre class="brush:shell">sqoop import --direct --connect "jdbc:postgresql://localhost:5432/osm" --username osm --password &lt;your password&gt; --table node_tags --hive-import</pre>
<p><strong>Note: </strong>Do not forget the port number in the JDBC URL!</p>
<p>The options should be pretty straightforward and easy to understand. But there&#8217;s a caveat. Or two, or three. I&#8217;ll update this post if anything changes with these points:</p>
<ul>
<li>The import of the nodes, ways, relations, etc. tables doesn&#8217;t work with this option as Sqoop seems to have a bug (at least I think Sqoop can do something about it) with reserved words in Hive. <code>timestamp</code> is one such word and it is unfortunately the name of a column in all those tables. So the above command would fail with an error if you used the nodes table instead. I&#8217;ll update this post once I&#8217;ve had time to investigate this further. <strong>Update</strong><strong>:</strong> I&#8217;ve opened a <a href="https://issues.cloudera.org/browse/SQOOP-37">ticket</a> for the problem and attached a patch. Reserved words like <em>timestamp</em> can be escaped by backticks: <code>`timestamp`</code>. So for now you&#8217;ll have to create the table manually. See the <a href="http://wiki.apache.org/hadoop/Hive/GettingStarted">Getting Started</a> guide and the <a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL">Data Definition Language</a> for details.</li>
<li>Sqoop tells me that I should use the <code>--direct</code> option for better performance but I only got error messages. I&#8217;ll investigate this further. The above method may be slower but it works for now. <strong>Update:</strong> Found the problem and reported it as <a href="https://issues.cloudera.org/browse/SQOOP-38">SQOOP-38</a>. I&#8217;ve updated the text above with a way to use the direct import.</li>
<li>Unfortunately Hue seems to have problems with non-ASCII data somewhere. This means that Beeswax doesn&#8217;t work for us at the moment, as we have almost every character from the known Unicode set in the database somewhere. The command line however works and I&#8217;ll show a quick example. <strong>Update:</strong> This seems to be <a href="https://issues.cloudera.org/browse/HUE-54">HUE-54</a>.</li>
</ul>
<p>Now to actually query your data &#8211; and I&#8217;ll use the <code>node_tags</code> table as an example here &#8211; you just have to start Hive with the <code>hive</code> command and enter your query:</p>
<pre class="brush:sql">SELECT * FROM node_tags;
SELECT k, COUNT(k) AS count FROM node_tags GROUP BY k ORDER BY count DESC;
SELECT k, count FROM (SELECT k, COUNT(k) AS count FROM node_tags GROUP BY k) sub WHERE sub.count &gt; 100 ORDER BY count DESC;</pre>
<p>Those are pretty basic but they are the basis for the OSMdoc data so it&#8217;s what I wanted in the first place. The last query looks a bit complicated but Hive doesn&#8217;t have the HAVING clause yet so this is a workaround. It might not be very fast on small data sets but at least it is predictable. On PostgreSQL similar queries on a whole planet would take hours or days. And this is scalable.</p>
<h2>Conclusion</h2>
<p>I was happy to find that the whole setup was pretty easy. I first did this over a year ago and it was much more involved then. It&#8217;s great seeing the Hadoop community still going strong. As I&#8217;ve mentioned there are a few drawbacks and a few problems left but for <em>my</em> use case this is enough for now to provide a much-needed refresh of the OSMdoc data. The last update was about a year ago.</p>
<p>My method has room for improvement &#8211; I for one don&#8217;t really need the PostgreSQL database and would love to store all the data in HBase but I&#8217;m lacking the resources. So for now PostgreSQL is an intermediate step. I&#8217;ve also not yet evaluated the performance but to compare PostgreSQL and Hive would be unfair at best. And the last thing I&#8217;ll need to do is to keep the data in HDFS updated and synchronized with the latest OSM updates.</p>
<p>Let me know if there are any questions or problems and I&#8217;d be glad to help either in the OpenStreetMap or the Hadoop world.</p>
<p>I&#8217;d like to end this post with a call for sponsors: OSMdoc in particular and the OpenStreetMap community in general could benefit greatly from a few more resources for our toy projects. I&#8217;d love to host an HBase version of OSM somewhere to allow for queries against it and to allow for much improved analysis in OSMdoc. <a href="http://www.strato.de/">Strato</a> has kindly <a href="http://wiki.openstreetmap.org/wiki/FOSSGIS/Server">donated</a> three of their largest servers to <a href="http://www.fossgis.de/">FOSSGIS</a> for use in the OSM community but they are overloaded and there&#8217;s already a huge <a href="http://wiki.openstreetmap.org/wiki/FOSSGIS/Server/Projects#Vorgeschlagene_Projekte.2FProposed_Projects">waiting list</a> for new projects. So if you or your company has anything to spare please <a href="mailto:lars.francke@gmail.com">contact</a> me.</p>
<h2>Reference</h2>
<p>You&#8217;ll find all the necessary commands in the following files:<br />
<script src="http://gist.github.com/485301.js"> </script></p>
<p><script src="http://gist.github.com/485422.js"> </script></p>
<p><script src="http://gist.github.com/485836.js"> </script></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lars-francke.de/2010/07/22/processing-openstreetmap-data-with-hive/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<atom:link rel="payment" href="https://flattr.com/submit/auto?user_id=29341&amp;popout=1&amp;url=http%3A%2F%2Fblog.lars-francke.de%2F2010%2F07%2F22%2Fprocessing-openstreetmap-data-with-hive%2F&amp;language=en_GB&amp;category=text&amp;title=Processing+OpenStreetMap+data+with+Hive&amp;description=Update%3A+I+have+updated+the+post+with+information+about+the+Sqoop+problem+with+reserved+keywords.+See+below+or+this+issue.+Update+2%3A+The+problem+with+Hue+seems+to+be+HUE-54....&amp;tags=cloudera%2Chadoop%2Chive%2Copenstreetmap%2Cblog" type="text/html" />
	</item>
	</channel>
</rss>

