Wednesday, September 25, 2013

Java: SynchronousQueue vs LinkedBlockingQueue/ArrayBlockingQueue

I recently came across some legacy code that used SynchronousQueue to transfer data between threads. Since I have almost always used ArrayBlockingQueue (for bounded sizes) or LinkedBlockingQueue (for unbounded sizes), I could not understand the rationale for using SynchronousQueue. I did look at the source code for SynchronousQueue, and it turns out there is no real underlying queue at all. Instead it implements a handoff: the queue has zero capacity, and a put() blocks until a consumer is ready to take() the element.
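The handoff semantic is easy to see in a small demo. This is my own minimal sketch (class and thread names are mine, not from the legacy code): the producer's put() rendezvouses with the consumer's take(), and size() always reports 0 because nothing is ever stored.

```java
import java.util.concurrent.SynchronousQueue;

public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<Integer> queue = new SynchronousQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    // take() blocks until a producer arrives with an element
                    System.out.println("took " + queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (int i = 0; i < 3; i++) {
            // put() blocks until the consumer is ready to take()
            queue.put(i);
        }
        consumer.join();

        // The queue never holds anything: size() is always 0
        System.out.println("size = " + queue.size());
    }
}
```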

I ran a simple experiment to see how throughput varies across these three queues. For this test the objects put into the queue were Integers, with -Xms512m and -Xmx1024m, on Java 1.7.0_17.

Hardware used: Windows 7 64-bit, 8 GB RAM, Core i5 CPU @ 2.53 GHz.
The LinkedBlockingQueue and ArrayBlockingQueue were set to a capacity of 1 for an apples-to-apples comparison. I ran multiple tests, varying the number of puts into the queue.
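A stripped-down version of such a throughput harness (the class name, element count, and timing approach here are my own reconstruction, not the exact test I ran) could look like this: one producer and one consumer move a fixed number of Integers through each queue, and we time the transfer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;

public class QueueThroughputTest {
    static final int COUNT = 100_000; // number of puts per run

    // Time how long one producer takes to push COUNT Integers
    // through the given queue to one consumer, in milliseconds.
    static long timeTransfer(BlockingQueue<Integer> queue) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < COUNT; i++) {
                    queue.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        consumer.start();
        for (int i = 0; i < COUNT; i++) {
            queue.put(i);
        }
        consumer.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("SynchronousQueue:    " + timeTransfer(new SynchronousQueue<Integer>()) + " ms");
        System.out.println("ArrayBlockingQueue:  " + timeTransfer(new ArrayBlockingQueue<Integer>(1)) + " ms");
        System.out.println("LinkedBlockingQueue: " + timeTransfer(new LinkedBlockingQueue<Integer>(1)) + " ms");
    }
}
```

The absolute numbers will vary by machine and JVM; what matters is the relative ordering across the three queues.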

Results:
As you can see from the table below, at a capacity of 1 the SynchronousQueue provided extremely good throughput compared to a unit-capacity linked or array-backed blocking queue. Under this condition, the SynchronousQueue is about 7x-10x faster than the other two blocking queue implementations.


Tuesday, April 2, 2013

Mongodb evaluation



I have been evaluating MongoDB for some time as a possible candidate for a database replacement.
What is appealing about MongoDB:

  • Easy to setup and administer via the mongo shell
  • Schemaless entities (collections) and a good driver API. This allows for fast prototyping.
  • Index support. By default MongoDB provides a primary index on _id. Any other index (simple or compound) is treated as a secondary index.
  • Replication (replica sets). You can "install" a replica set configuration once in the shell and it persists through restarts.
  • Sharding. This is useful for shared-nothing scenarios where you want to partition data across the cluster, making each node responsible for its own "shard" of the data.

Cons:

  • No transaction support
  • Write locking. At the time of this writing, MongoDB seems to be using database-level write locking. This can be a bottleneck for threads updating different collections in the same database.
  • Asynchronous replication. This does not seem to me to be the safest way to replicate. The onus is on the secondary nodes to pull the oplog from the primary, and this asynchronicity can lead to issues during failover.
  • No triggers at the API level. Triggers could be useful in cases where you want notifications about collection updates. Mongo suggests that you instead write tailable cursors on collections and monitor them.
  • Not embeddable. This may not be a real problem for most users; it is just a pet peeve of mine, perhaps because of my previous experience with BerkeleyDB and embeddable in-memory data grids. Embedding support allows you to build a "compute" grid rather than keep computation on the client side with a separate data cluster.
  • Disk usage. I noticed that even a small dataset of 36 MB of raw data leads to 128 MB of actual storage space plus another 32 MB of _id index space. When the database files are created, MongoDB preemptively allocates much larger chunks (extents) on disk, so a 36 MB raw data set can end up occupying 5x-6x that space on disk.
  • No support for joins. This means you have to break a join query into smaller queries and manually join the results on the client side.
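To illustrate the client-side join point above, here is a minimal sketch. The collection names, fields, and data are hypothetical, and plain maps stand in for the documents two separate collection queries would return; the join itself (index one result set by the join key, then merge) is the part that would look the same with the real driver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClientSideJoin {
    // Small helper to build a document-like map from key/value pairs
    static Map<String, Object> doc(Object... kv) {
        Map<String, Object> m = new HashMap<>();
        for (int i = 0; i < kv.length; i += 2) m.put((String) kv[i], kv[i + 1]);
        return m;
    }

    public static void main(String[] args) {
        // Stand-ins for the results of two separate find() calls
        List<Map<String, Object>> orders = Arrays.asList(
                doc("_id", 1, "customerId", 100, "total", 25),
                doc("_id", 2, "customerId", 101, "total", 40));
        List<Map<String, Object>> customers = Arrays.asList(
                doc("_id", 100, "name", "alice"),
                doc("_id", 101, "name", "bob"));

        // Index the "right" side of the join by its key...
        Map<Object, Map<String, Object>> customersById = new HashMap<>();
        for (Map<String, Object> c : customers) {
            customersById.put(c.get("_id"), c);
        }

        // ...then merge each order with its customer on the client
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> o : orders) {
            Map<String, Object> row = new HashMap<>(o);
            row.put("customerName", customersById.get(o.get("customerId")).get("name"));
            joined.add(row);
        }
        System.out.println(joined.size() + " joined rows");
        System.out.println(joined.get(0).get("customerName"));
    }
}
```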
I will post more about read and write performance from my tests in my next post.