LargeCollections - A fast and persistent cache with a java.util.Map interface

This week Axiomine released LargeCollections as open source under the Apache License.
LargeCollections provides java.util.Map implementations backed by LevelDB. Because the data lives on disk rather than in JVM heap memory, your collections can grow very large.
The primary purpose behind creating LargeCollections was to support java.util.Map, but there are also implementations of java.util.List and java.util.Set. The List implementation has some caveats, which are described below.
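For illustration, here is a minimal sketch of what using such a map looks like. The concrete LargeCollections class and its constructor are not shown here (they depend on your key and value types; see the classes under com.axiomine.largecollections.util); the point is simply that code written against the java.util.Map interface works unchanged whether the map is heap-based or disk-backed.

```java
// Sketch only: the LargeCollections construction step is deliberately omitted,
// because the concrete class name varies by key/value type. The calling code
// below depends only on java.util.Map.
import java.util.Map;

public class CacheUsage {
    // 'cache' can be a java.util.HashMap during testing and a disk-backed
    // LargeCollections map in production; this code is identical either way.
    static void recordVisit(Map<String, Long> cache, String userId) {
        Long visits = cache.get(userId);
        cache.put(userId, visits == null ? 1L : visits + 1L);
    }
}
```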

Key Design Principles

The underlying java.util.Map implementations are backed by LevelDB. LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from byte-array keys to byte-array values.
While LargeCollections supports any Serializable/Externalizable/Writable/Kryo-serializable Key and Value classes, the underlying implementation stores everything as byte arrays (similar to HBase). Thus every key/value instance needs to be converted to a byte array when it is written to the LevelDB backing store, and converted back from a byte array to a Java instance when it is read.
To support these conversions from object to byte array and back, every java.util.Map subclass provided by the LargeCollections library needs a Serializer/Deserializer (SerDes) pair, one each for the Key and the Value class.
Each SerDes pair implements the following standard interfaces:
com.axiomine.largecollections.serdes.TurboSerializer
com.axiomine.largecollections.serdes.TurboDeSerializer
Extremely fast serializers and deserializers are provided in the com.axiomine.largecollections.serdes package for common data types such as
  • String
  • Integer
  • Long
  • Float
  • Double
  • Char
  • Byte
  • Byte-Array
For custom serializers, the Kryo library is supported. Kryo support also provides indirect support for standard types such as Serializable and Externalizable.
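As a rough illustration, a hand-written SerDes pair for your own value class might look like the sketch below. The TurboSerializer/TurboDeSerializer shapes shown here are assumptions (a single method converting to or from a byte array each way); the actual interface definitions in com.axiomine.largecollections.serdes are authoritative.

```java
// Minimal sketch of a custom key/value SerDes pair.
// The two interfaces below are stand-ins with the ASSUMED shape of the
// library's TurboSerializer/TurboDeSerializer; check the real interfaces
// in com.axiomine.largecollections.serdes before using this.
import java.nio.charset.StandardCharsets;

interface TurboSerializer<T> {
    byte[] serialize(T t);
}

interface TurboDeSerializer<T> {
    T deserialize(byte[] bytes);
}

class UserSerDes {
    static class User {
        final int id;
        final String name;
        User(int id, String name) { this.id = id; this.name = name; }
    }

    // Serialize a User as UTF-8 "id|name" bytes
    static final TurboSerializer<User> SERIALIZER =
        u -> (u.id + "|" + u.name).getBytes(StandardCharsets.UTF_8);

    // Rebuild the User from the same byte layout
    static final TurboDeSerializer<User> DESERIALIZER = bytes -> {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split("\\|", 2);
        return new User(Integer.parseInt(parts[0]), parts[1]);
    };
}
```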

FAQ

When should you use LargeCollections?

As far as possible, never! However, there will be times when your Map/List/Set instances are slightly larger than your JVM heap can handle (in the 200 MB-1 GB range). Use LargeCollections in those situations. Examples of such use cases are:
  1. Text mining algorithms typically need a document-term matrix, whose size explodes as the number of documents grows, so sampling is often used to build models. With LargeCollections you can store a document-term matrix for millions of documents: for the cost of a slight performance degradation, you can produce models on a single machine that use large amounts of data. It is well known in the machine learning business that "more data trumps better models", and LargeCollections lets you use more data without incurring the complexity of a distributed implementation.
  2. In MapReduce it is preferable to use map-side joins over reduce-side joins, but a map-side join needs one side of the data to be accessible in every Mapper. Several techniques have evolved to handle this problem (see Merge Join in Pig). However, if one side of your data is relatively small (300 MB-1 GB), you can safely keep a LargeCollections instance in each of your Mappers and perform the join there (a sketch appears after this list).
  3. You can use LargeCollections whenever you need access to a cached dataset in your Mappers or Reducers. This can save you considerable complexity in designing your MapReduce programs.
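Here is a hedged sketch of use case 2: a Hadoop Mapper that performs a map-side join against a lookup table held behind the java.util.Map interface. How the LargeCollections map is constructed and shipped to the mapper nodes is an assumption and not shown (a plain HashMap stands in below); the join logic itself is just ordinary Map lookups.

```java
// Map-side join sketch. In a real job, 'smallSide' would be a LargeCollections
// map backed by a LevelDB store available on each mapper node; a HashMap is
// used here only to keep the sketch self-contained.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Map<String, String> smallSide = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Assumption: load or open the small side of the join here,
        // e.g. a LevelDB directory distributed to each node.
        smallSide.put("42", "Jane Doe");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Large-side record: "userId\t<rest of record>"
        String[] fields = value.toString().split("\t", 2);
        String match = smallSide.get(fields[0]);
        if (match != null) {
            // Emit the joined record from the Mapper, no Reducer needed
            context.write(new Text(fields[0]), new Text(match + "\t" + fields[1]));
        }
    }
}
```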

Give me more pointers on when I should and should not use LargeCollections

If you can simply use java.util.HashMap by managing JVM memory (using the -Xmx option), use it. If your key-value store is large (on the order of 10-100 GB), use something like MongoDB. If it is very large (on the order of terabytes), use something like HBase or Cassandra.
But there is an "uncanny valley" of Map sizes, roughly in the range of 100 MB-1 GB. It is not large enough to justify MongoDB, and even less so HBase, yet it is large enough that you will get annoying OutOfMemoryErrors every now and then. Worse still, your performance will start degrading due to frequent major garbage collections. LargeCollections was built to support these "uncanny valley" use cases.

Can I serialize LargeCollections instances to disk?

Yes, you can. A LargeCollections instance is composed of two components:
  1. The metadata of the collection - e.g. its size and the location on disk of the LevelDB store
  2. The LevelDB store on disk
Only the metadata is serialized; when a LargeCollections instance is deserialized, only the metadata is read back, and the LevelDB store on disk is simply reused in both cases. You might move your LevelDB store after serialization. There are system parameters you can override so that the deserialized instance points to the new path of the LevelDB store. See the documentation for more details.
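For illustration, a hedged sketch of writing such an instance to disk and reading it back with standard Java object serialization is shown below. Whether the concrete map class goes through Serializable or Externalizable, and the exact system parameters used to relocate the LevelDB store, are assumptions; the point is that only the metadata travels through the object streams while the LevelDB directory stays where it is.

```java
// Sketch, assuming the LargeCollections map class participates in standard
// Java object serialization. Only the collection metadata is written; the
// LevelDB store remains on disk and is re-attached on deserialization.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Map;

public class PersistExample {

    static void save(Map<String, String> largeMap, String path) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(largeMap); // writes only the collection metadata
        }
    }

    @SuppressWarnings("unchecked")
    static Map<String, String> load(String path) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            // If the LevelDB directory was moved after serialization, the
            // (assumed) system-parameter override described above must point
            // at the new path before this call.
            return (Map<String, String>) in.readObject();
        }
    }
}
```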

What about java.util.List support?

LargeCollections supports a limited version of java.util.List. Its limitations are listed below:
  1. Use it primarily for write-once, read-many-times lists.
  2. The list can be updated by index, but insertions and deletions by index are not supported.
  3. contains(Object obj) is a heuristic operation that depends on an underlying Bloom filter, so it behaves more like "mightContain". Implementing true "contains" behavior would require iterating over the entire list, which is very expensive.
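To see why this is only a heuristic, the sketch below uses Guava's BloomFilter (purely for illustration; whether LargeCollections uses Guava internally is an assumption) to show the characteristic Bloom-filter behavior: no false negatives, but occasional false positives.

```java
// Demonstrates "mightContain" semantics with a Bloom filter.
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class MightContainDemo {
    public static void main(String[] args) {
        // 1,000,000 expected insertions, ~1% false-positive probability
        BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        filter.put("apple");

        // Never a false negative: an added element always reports true.
        System.out.println(filter.mightContain("apple"));   // true

        // Usually false, but can occasionally be true (a false positive);
        // this is why the List's contains() is really "mightContain".
        System.out.println(filter.mightContain("banana"));
    }
}
```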

What about java.util.Set support?

It is supported. See the package com.axiomine.largecollections.util for the relevant classes.

What are the alternatives to LargeCollections?

A commercial alternative is BigMemory from Terracotta. It uses the Ehcache API and is designed to do a lot more than just be a persistent cache. Our goal was to support only a persistent cache with a java.util interface.
An open-source alternative is ChronicleMap. In spirit it is very akin to LargeCollections in that it supports the java.util.Map interface, but it uses memory-mapped files instead of LevelDB for persistence. In practice it tries to do what BigMemory does, in that it can be distributed.
LargeCollections is not distributed. Our goal was to support the standard use case: use a Map when your memory usage falls in the uncanny valley. No more and no less.
