Experiments with HBase, Phoenix, and SQL at Scale

For a short time we have a large Hadoop/HBase/Phoenix cluster at our disposal before it will be handed off to its designated use. Many 100's of machines, many 1000's of cores, many dozen TB of aggregate ram. (not at liberty to state the exact size of the cluster, nor its exact H/W configuration).

We decided to use this opportunity to perform some semi-scientific tests of Phoenix and HBase at scale to see how much of the hardware we can actually utilize and drive for single queries.

For that we loaded the TPC-H LINEITEM table with 600m and 6bn rows, respectively.

The 600m table uses about 372GB of data (176GB on disk after FAST_DIFF encoding), the 6bn row table measured 4.3TB (1.65TB on disk). While not a lot of data at all for this type of cluster (it all fits easily in the HBase block cache) it none-the-less lets us gauge how Phoenix and HBase can scale their workloads across the cluster.

We performed the following queries over the data:


  • SELECT COUNT(*) FROM T;
    With we measure the scan performance
  • SELECT COUNT(*) FROM T WHERE l_quantity < 1000;
    Here we measure scan performance with a simple aggregate
  • SELECT * FROM T ORDER BY l_extendedprice;
    Terrasort anyone? (we canceled the query as soon as the first result is returned, in to measure the sort performance)
  • SELECT COUNT(*) FROM T, J WHERE l_extendedprice = k1;
    Measure how Phoenix' join operator scales
  • SELECT COUNT(*) FROM T WHERE l_extendedprice = <x>;
    With and without a local index.

The Good

  1. We were able to drive 1000's of cores by sizing Phoenix' client threadpool accordingly. HBase does not have a fully supported asynchronous client, so many threads are needed not doing any actual work, but just triggering work on the RegionServers and waiting for the results.
  2. Phoenix uses guide-posts to break work down in chunks smaller then a region, and as we increased the number of threads the performance scaled linearly with it, utilizing the cores on the servers - for queries that can be scaled that way that is.
  3. HBase did a good job of automatically splitting the data and distributing it over the cluster. We found that as long as the data fits into the aggregate cache it did not matter how many regions (and hence server) were involved, indicating there's a potentially for crunching _way_ larger data sets with a single client. There was room to run larger queries and multiple queries at the same time.
  4. Phoenix can sort 4.3TB in 26.4s, engaging 1000's of cores on 100's of machines.
  5. A LOCAL index over 4.3TB is created in 1 minute. Even with many region it was still effective in reducing the query times significantly.
  6. Due to the chunking and chunk queuing when a query is executing, resource allocation is surprisingly fair.

Numbers:

First we varied the number of a driver threads on the client for the first query.
Notice that we can scan through 6bn rows, or 4.3TB of data in 6.55s.
Same with a simple filter applied. We can search 6bn rows/4.3TB of data in 8.36s.
Next we performed sorting. As soon as the query returns the first result we cancel it.
(note that with an added LIMIT clause Phoenix does an O(N) scan, instead of a O(N*log(N)) sort).
As mentioned, a single client can drive the cluster to sort 4.3TB in 26.4s!
Joins... This time fixing the threadpool size around 4000, and varying the number of a join keys. In the 128k case we performed 786 trillion comparisons of a BigInt value in 26s
Lastly, we also created LOCAL indexes on the data. Due to their local nature they are created in just a few second, and even at many 1000 regions still were effective to reduce query times significantly.

Index creation on the 6bn row/4.7TB table took almost exactly 1 minute.
It brought a scan selecting a few 1000 of the 6nb rows from 8s to about 10ms - and note that this is a table with many 1000 regions, where each region needs to pinged with "do you have these keys?"

The Bad

Joins naturally do not scale well. Phoenix now has a merge join option, which sorts the relations at the servers and then performs a final merge on the client. At the sizes we're talking about this can't be done realistically. Perhaps joins with highly selective filters will work, something we'll test soon.

There is a lot of work to do. In order to be efficient Phoenix/HBase need to work on denormalized data - which should not really come as a surprise.

So we measured joins only for situations where they can be executed as a broadcast-hashjoin. With default configuration this failed at 2^18 (256k) joins keys.

LOCAL indexes of a table are mixed in the same single column family. Creating many of them make that column family the dominant one in terms of size, driving HBase's size based split decisions.

Unlike MapReduce, Spark and some distributed databases, Phoenix has no resource allocation or restarting framework, other than what's provided by HBase. Phoenix is not useful choice for queries that run for more than (say) 1h, or where partial failure on some machines is likely.

And the Ugly

  • When queries naturally do not scale, they simply "never" (as in days) return, churning the client forever until canceled or timing out, failing.
  • Some results seemed surprising at first. OFFSET queries become very slow with increasing OFFSETs. Upon inspection that makes sense
  • Distinct queries on high cardinality tables churned for a long time and then failed.
  • There is no cost based optimizer, Phoenix uses simple heuristics to determine how to execute a query. For simple queries that works well, but breaks down for complex joins.
  • Dropping a local index is very expensive as each entry is marked for deletion individually.

Summary

To drive large queries you must configure the Phoenix client accordingly:
Increase phoenix.query.threadPoolSize (1000, 2000, or 4000) and phoenix.query.queueSize (maybe 100000).

Semi- (or un-)scientific; more of a qualitative tests whether a larger cluster can actually be utilized to execute a single query. Phoenix/HBase do quite well in terms of scaling. We will continue to do more tests and drive the fixed cost down, to allow Phoenix/HBase to utilize the machines better.
We'll have this cluster just a little bit longer. More to come. And if you have any other Phoenix/HBase testing you'd like us to do, please comment, and I'll see what we can do.

Commentaires

Posts les plus consultés de ce blog

Controlling Parallelism in Spark by controlling the input partitions by controlling the input partitions

Spark performance optimization: shuffle tuning

Spark optimization