Faster HBase - hardware matters

As I've written earlier, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:



Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper

Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster



Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)



Hardware:

c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad), 8GB RAM, 2x500G SATA 5.4K

c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x1TB SATA 7.2K





Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:




Figure 1: Scan performance of new cluster (2x 1gig ethernet)

So, 1 million records/second? Yes, please! That's a performance increase of about 300% over the c2 cluster I tested in my previous posts. Given that we doubled the number of machines and quadrupled the number of drives, that improvement feels about right. But the decline in performance as number of mappers is increased is still a bit suspect - we'd hope that performance would go up with more workers. This is the same behaviour we saw when we were limited by network in our original tests, and ganglia again shows a similar pattern, though this time it looks like we're hitting a limit around the 2 gig ethernet bond:






Figure 2: bytes_in (MB/s) with dual gigabit ethernet




Figure 3: bytes_out (MB/s) with dual gigabit ethernet



Unfortunately our master switch is currently full, so we don't have the extra 6 ports needed to test a triple bond - but given our past experience I feel reasonably confident that it would change Figure 1 such that scan performance increases with number of mappers up to some disk I/O contention limit.





Hang-on, what about data locality?


But there's a bigger question here - why are we using so much network bandwidth in the first place? Why stress about major compactions and data locality when it doesn't seem to get used? Therein lies the rub - PerformanceEvaluation can't take advantage of data locality. Tim wrote about the tremendous importance of TableInputFormat in ensuring maximum scan performance from MapReduce, and PerformanceEvaluation doesn't do that. It assigns a block of ids to scan to different mappers at random, meaning that at best one in six mappers (in our setup) will coincidentally have local data to read, and the rest will all transfer their data across the network. This isn't a bug in PerformanceEvaluation, per se, because it was written to try and emulate the tests that Google ran in their seminal white paper on BigTable, rather than act as a true benchmark for scanning performance. But if you're new to this stuff (as I was) it sure can be confusing. When we switched to scanning our real data using TableInputFormat our throughput jumped to 2M/sec from the 1M/sec we got using PerformanceEvaluation.



Conclusions



While we learned a lot from using PerformanceEvaluation to test our clusters, and it helped to uncover any misconfigurations and taught us how to fine tune lots of parameters, it is not a good tool for benchmarking scan performance. As Tim wrote, scans across our real occurrence data (~370M records) using TableInputFormat are finishing in 3 minutes - for our needs that is excellent and means we're happy with our cluster upgrade. Stay tuned for news about the occurrence download service that Lars and I are writing to take advantage of all this speed :)

0 comments:

Post a Comment