libhashring – high performance consistent hashing

In a previous post I talked about the need to shard Redis data and how I accomplished this by adding shard/hashing support to erldis, an Erlang client for Redis. This solution proved to work well, it distributed our data very well amongst many Redis servers — but there was one problem. Performance.

In the change I made to erldis, the hash ring was stored in ETS (an erlang memory store) and anytime a key was hashed, the ring had to be retrieved from ETS. The problem with this is that Erlang copies the entire ring when it comes out of ETS. The ring we used was very large, with thousands of items. Copying this every single time became a huge performance hit.

To mitigate the performance problems, I decided to implement the hash ring in C, and write an Erlang driver to use it. That is how libhashring was born. It is very high performance and currently has bindings for Erlang and Java. It is currently deployed in our production environment and its speed is incredible. I am confident that increasing the size of the ring as we add capacity will not cause an impact to the hash ring’s performance.

libhashring supports MD5 and SHA-1. MD5 seems to be about 25% faster than SHA-1, so if you want the extra performance, MD5 is probably the best bet.

Feel free to fork libhashring and make it even better, I’d be really happy to get some feedback and contributions.

P.S. – After finishing the hash ring, I came across libketama and realized it pretty much solves the same problem that I had. Too bad I didn’t see it sooner!

C example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
hash_ring_t *ring = hash_ring_create(8, HASH_FUNCTION_SHA1);
char *slotA = "slotA";
char *slotB = "slotB";

char *keyA = "keyA";
char *keyB = "keyBBBB";
char *keyC = "keyB_";

hash_ring_node_t *node;

assert(hash_ring_add_node(ring, (uint8_t*)slotA, strlen(slotA)) == HASH_RING_OK);
assert(hash_ring_add_node(ring, (uint8_t*)slotB, strlen(slotB)) == HASH_RING_OK);

node = hash_ring_find_node(ring, (uint8_t*)keyA, strlen(keyA));
assert(node != NULL && node->nameLen == strlen(slotA) && memcmp(node->name, slotA, strlen(slotA)) == 0);

node = hash_ring_find_node(ring, (uint8_t*)keyB, strlen(keyB));
assert(node != NULL && node->nameLen == strlen(slotA) && memcmp(node->name, slotA, strlen(slotA)) == 0);

node = hash_ring_find_node(ring, (uint8_t*)keyC, strlen(keyC));
assert(node != NULL && node->nameLen == strlen(slotB) && memcmp(node->name, slotB, strlen(slotB)) == 0);

hash_ring_free(ring);
Leave the first comment

Thrift and ZooKeeper

Thrift provides a great framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application that is written in a language that there are Thrift bindings for (which is…just about every mainstream one, and more).

This is great for systems that are heterogeneous — for example, you could write a user authentication service in Java, but call it from your Ruby web application.

Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc,. This is great because instead of writing all the RPC code — you can just get straight to your service logic. Thrift uses TCP (not sure if UDP is/will be supported) and so a given service is bound to a particular port.

As you start to scale an infrastructure with Thrift services, you may find, as I have, that putting all of your Thrift server IP/port combinations in a configuration file that your clients read is just not…the best. If you have an environment where Thrift servers go down for maintenance (or crash), or you add more capacity on the fly, a dynamic way to manage the state of all of the services is needed.

Enter Apache ZooKeeper, a distributed, fault tolerant, highly available system for managing configuration information, naming, and more (distributed synchronization, anyone?).

ZooKeeper appears like a filesystem to clients, it is a hierarchy of znodes, which are analogous to directories or files, both of which can contain a small amount of data.

ZooKeeper can be used to store Thrift service location information, allowing clients to dynamically discover Thrift services. ZooKeeper even provides a way to create ephemeral znodes, which means that once your Thrift service goes down, it will be removed from ZooKeeper automatically. And if that isn’t cool enough, ZooKeeper even supports watches, where clients can ask to be notified whenever a znode changes. This means that clients can begin using new Thrift service capacity instantly, and when failures happen, clients will stop attempting to contact a down Thrift service.

Laying out your Thrift services in ZooKeeper is important. Clients will need to know about the layout when performing service discovery. For example, you can do something such as:

1
2
3
/services/user/user0000000001
/services/user/user0000000002
....

Clients can get a list of all znodes at /services/user to receive the list of servers for the user service. The only problem is…getting the list of znodes is only half the battle. user0000000001 doesn’t really tell you how to access the user service on that node. This is why its important to store some kind of service location data with the znode.

ZooKeeper allows you to set the data for a znode, so when a node comes up, it just needs to also set data at the znode that describes how to locate or access the service. I’ve adopted to use a URI as the znode data — this is very flexible and easy to parse and read from most languages. For Thrift services I am using a URI that uses a thrift scheme:

1
thrift://10.1.1.10:5656

The URI can be easily adapted to your application, such as adding query string parameters with any extra custom metadata.

Thrift is great, and with ZooKeeper, its even better. I’m in the process of implementing this integration now and am looking forward to all of the benefits it has to offer. I would love to hear any feedback about this approach if anyone has personal experience, or just some good ideas. What is everyone else using to solve this type of problem?

Oh and also, if ZooKeeper isn’t your thing, you should definitely check out Doozer. It has very similar features to ZooKeeper, and although new, it is definitely on my list of projects to watch. Oh, and did I mention that it is written in golang?

Doozer is a highly-available, completely consistent store for small amounts of extremely important data. When the data changes, it can notify connected clients immediately (no polling), making it ideal for infrequently-updated data for which clients want real-time updates. Doozer is good for name service, database master elections, and configuration data shared between several machines.

Leave the first comment

Redis sharding with Erlang

I’ve been using a great Erlang Redis library erldis recently and as our data set has grown much larger, we have decided we need to start sharding. Unfortunately the erldis library does not have support for sharding so I’ve forked the repository and made the changes. You can find my fork here. Pull request here.

A little background

Redis has been really great for us so far but as we get more data and more operations per second, we realized we need to start utilizing multiple Redis instances. Sharding basically splits up your keyspace across multiple servers in a consistent fashion. The typical example for finding out where a key belongs is by doing something like:

1
hash(Key) mod N

Where N is the index to a list of shards. This allows all of your clients to be consistent with knowing where a key belongs. The above algorithm is rather naive and the implementation that I’ve added to erldis is a bit different.

Hash Ring

In erldis when you create a ring each item or node is replicated on the ring multiple times, (128 by default in erldis), so that the hash ring is more evenly distributed. When a key is hashed, it is placed somewhere on the ring, and if you move clockwise the next node that is encountered is the one that will contain the key. If you are already at the end of the ring it just moves around the circle to the first node. Here is the Erlang representation of a ring with NumReplicas=2 with three nodes:

1
2
3
4
5
6
7
8
9
10
11
Eshell V5.8  (abort with ^G)
1> hash_ring:create_ring(["redis01", "redis02", "redis03"], 2).
{2,
 [{"redis01",
   545684121738837007922020352670259130010627269168},
  {"redis03",615626879509609616059735092207848107514679530568},
  {"redis02",831940194106586542005694020470691157762681039493},
  {"redis02",862200733042270842852150771881884931000854854035},
  {"redis01",976184092715215764651301117507107557316103934589},
  {"redis03",
   1324913878318290717698351186295127020927484244520}]}

The hash algorithm used above is the name of the node concatenated with the replica number. So if you look at the ring above, the first item, 545684121738837007922020352670259130010627269168 was generated by doing:

1
2
3
Eshell V5.8  (abort with ^G)
1> <<Int:160/unsigned-integer>> = crypto:sha("redis012"), Int.
545684121738837007922020352670259130010627269168

The hashed string is “redis01″ + 2 (which is the replica number). By knowing the algorithm for where a node lies on the ring, a client in any language can take advantage of the shards.

Adding/Removing nodes

The first thing I have to say about adding nodes is to not do it. Salvatore Sanfilippo wrote a great post about sharding Redis, and his recommendation is that re-hashing is difficult and instead you should create a large number of small Redis instances up front, that way you don’t have to add more nodes down the road.

For example, you might want to create 128 small Redis instances (multiple per box) and as you grow, just move those instances to their own dedicated (or larger) boxes. This lets you have a large amount of shards up front so you don’t have to worry about re-hashing for a hopefully long time.

And for removing nodes, failures etc,. try to use Redis’s replication and use some form of HA so that if a Redis instance dies, you have another one that is up to date to takeover (keepalived, Linux HA, etc,.).

Final thoughts

Redis has been fantastic and I’m really looking forward to finishing up on our Redis “cluster” and modifying our clients to support sharding. As for right now, I’m going to be looking into ways to migrate our existing Redis data into the shards in real-time and then “flipping the switch” on the clients.

Leave the first comment

MoosTrax for Blackberry — Looking for testers

I am finally going to work on updating MoosTrax for BlackBerry — as there are a few outstanding bugs that need to be addressed. The most important bug I will be working on is consistency of location updates in newer, post-4.2 OS BlackBerry devices.

If you are interested in helping test the upcoming version, please e-mail moostraxsupport@gmail.com.

One comment so far, add another

MoosTrax released for iPhone

I am happy to say that MoosTrax has been approved by Apple and is now available for download in the App Store. Thanks to everyone who helped with the beta testing — it was a great help.

Note: MoosTrax works on devices running iOS 4+.

iPhone Support/Help
MoosTrax iTunes Link

Leave the first comment