Tuesday, April 2, 2013

MongoDB evaluation



I have been evaluating MongoDB for some time as a possible candidate for a database replacement.
What is appealing about MongoDB:

  • Easy to set up and administer via the mongo shell.
  • Schemaless entities (collections) and a good driver API. This allows for fast prototyping.
  • Index support. By default MongoDB provides a primary index on _id. Any other index (simple or compound) is treated as a secondary index.
  • Replication (replica sets). You can "install" a replica set configuration once from the shell and it persists through restarts.
  • Sharding. This is useful for shared-nothing scenarios where you want to partition data across the cluster, making each node responsible for its own "shard" of data.
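
To make the sharding idea concrete, here is a minimal sketch in plain Python of how documents could be assigned to shards by hashing a shard key. It is only an illustration of the shared-nothing partitioning concept: the names shard_for and partition and the user_id field are hypothetical, and MongoDB's real routing (which splits the key space into chunks and balances them across shards) is more involved than a simple modulo.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a shard-key value to a shard number with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(documents, shard_key, num_shards):
    """Group documents by the shard that would own them."""
    shards = {i: [] for i in range(num_shards)}
    for doc in documents:
        shards[shard_for(doc[shard_key], num_shards)].append(doc)
    return shards


# Hypothetical documents; "user_id" plays the role of the shard key.
docs = [{"user_id": i, "name": "user%d" % i} for i in range(10)]
shards = partition(docs, "user_id", 3)
```

Each document lands on exactly one shard, and the same key always maps to the same shard, so a lookup by shard key only needs to contact one node.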

Cons:

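  • No transaction support. Atomicity is limited to operations on a single document; there are no multi-document transactions.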
  • Write locking. At the time of this writing, MongoDB seems to be using database-level write locking. This can be a bottleneck for threads updating different collections in the same database.
  • Asynchronous replication. This does not seem to me to be the safest way to replicate. The onus is on the secondary nodes to pull the oplog from the primary, and this asynchronicity can lead to issues during failover, since writes that have not yet reached a secondary may be lost.
  • No triggers at the API level. Triggers could be useful in cases where you want notifications about collection updates. Mongo suggests that you instead open tailable cursors on collections and monitor them. Note that tailable cursors only work on capped collections.
  • Not embeddable. This may not be a real problem for most users; it is a pet peeve of mine, probably because of my previous experience with BerkeleyDB and embeddable in-memory data grids. Embedding support allows you to build a "compute" grid rather than having computation on the client side and a separate data cluster.
  • Disk usage. I noticed that even a small dataset of 36 MB of raw data leads to 128 MB of actual storage space and another 32 MB of _id index space. When the actual database files are created, mongo preallocates much larger chunks (extents) on disk. So a 36 MB raw data set could take 5x or 6x that space on disk.
  • No support for joins. This means you will have to break up a join query into smaller queries and manually join the results on the client side.
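
As a rough illustration of the manual join mentioned in the last point, the sketch below joins two result sets on the client side with a hash join. The lists of dictionaries stand in for the results of two separate queries; the collection shapes, field names (customer_id, name, total) and the function client_side_join are all made up for this example.

```python
def client_side_join(orders, customers):
    """Inner-join orders to customers on orders.customer_id == customers._id."""
    # Build a lookup table from the smaller result set first.
    customers_by_id = {c["_id"]: c for c in customers}
    joined = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        if customer is not None:  # skip orders without a matching customer
            joined.append(dict(order, customer_name=customer["name"]))
    return joined


# Hypothetical documents standing in for the results of two separate queries.
customers = [{"_id": 1, "name": "Ada"}, {"_id": 2, "name": "Grace"}]
orders = [
    {"_id": 100, "customer_id": 1, "total": 25},
    {"_id": 101, "customer_id": 2, "total": 40},
    {"_id": 102, "customer_id": 3, "total": 15},
]
result = client_side_join(orders, customers)
```

The cost is one extra round trip per joined collection plus the memory for the lookup table, which is the price of doing on the client what a relational database would do on the server.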

I will post more on read and write performance from my tests in my next post.

