Apache has recently released Cassandra 0.6, a large-scale distributed database system originally developed at Facebook and now maintained by the Apache Software Foundation. Cassandra is popular and is used by big names like Rackspace, Twitter and Digg. Is this a threat to MySQL and the like? Cassandra now comes with built-in support for Hadoop (the Apache Hadoop project develops open-source software for reliable, scalable, distributed computing).
What is NoSQL?
The NoSQL movement (the name is misleading; NoREL or "Not Only SQL" describes it better) is gaining momentum and is on track to become the most popular emerging next-generation database concept of 2010. NoSQL is simply a rapidly evolving new breed of databases that compete with the horde of traditional relational database systems such as MySQL, MS SQL and PostgreSQL. They are:
- Non Relational
- Distributed, Large Scale Databases
- Horizontally Scalable (More nodes/servers)
- Open Source
- Schema Free
- Eventually Consistent (BASE – Basically Available, Soft state, Eventual consistency)
- Easy Replication Support
- Simple API support
Why Non Relational, Distributed Databases?
Relational database systems have been around for a long time and power many giant e-commerce websites. The essence of a relational database is non-redundant relations: tables are designed so that redundant data is minimized (normalization). For a huge database this becomes a problem, because the data has to be kept redundantly across servers and nodes. Efficient redundancy and parallelism are hard to achieve in relational database systems (at least not trivially), and this tends to leave a single point of failure.
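To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational database. The customers and orders tables and the sample rows are made up for illustration. Each fact is stored once, so a full view of an order has to be reassembled with a join, which is easy on one server and awkward once the tables live on different machines.

```python
import sqlite3


def build_shop_db():
    """Create an in-memory database with two normalized tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            item TEXT
        );
        INSERT INTO customers VALUES (1, 'Alice', 'Chennai');
        INSERT INTO orders VALUES (10, 1, 'book');
        INSERT INTO orders VALUES (11, 1, 'lamp');
        """
    )
    return conn


def orders_with_customer(conn):
    """Reassemble the full picture with a join across both tables."""
    query = """
        SELECT o.id, c.name, c.city, o.item
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
    """
    return conn.execute(query).fetchall()


conn = build_shop_db()
# The customer's city is stored exactly once, so a single UPDATE
# changes what every joined order row reports.
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE id = 1")
print(orders_with_customer(conn))
```

The customer's city appears in only one row of one table, which is exactly the non-redundancy the relational model aims for, and also why every read of an order depends on both tables being reachable.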
So for huge databases running into multiple terabytes, a relational database is not a good fit. This is why Amazon, Google and Facebook started working on non-relational databases. In a distributed database, information is spread redundantly across a ring of identical nodes or servers, and data is queried through a key map. This reduces the risk of a single point of failure. Data is redundant and striped across nodes, so a change made in one place is eventually propagated (asynchronously) to the other nodes, hence the name Eventually Consistent.
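The ring idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not how Cassandra or Dynamo are actually implemented: the Ring class, the node names, the choice of MD5 for placing keys, and the replica count of 3 are all made up for illustration. A write lands on the first node for the key right away and reaches the other replicas only when propagate() runs, which mimics asynchronous, eventually consistent replication.

```python
import hashlib
from bisect import bisect_right
from collections import deque


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy ring of nodes with replicated, lazily propagated writes."""

    def __init__(self, node_names, replicas=3):
        self.replicas = min(replicas, len(node_names))
        # Nodes sorted by their position on the ring.
        self.ring = sorted((ring_position(name), name) for name in node_names)
        self.positions = [pos for pos, _ in self.ring]
        self.stores = {name: {} for name in node_names}  # per-node key map
        self.pending = deque()  # replica updates not yet applied

    def nodes_for(self, key):
        """First node clockwise from the key, plus the next replicas-1 nodes."""
        start = bisect_right(self.positions, ring_position(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

    def put(self, key, value):
        primary, *others = self.nodes_for(key)
        self.stores[primary][key] = value  # applied immediately
        for node in others:
            self.pending.append((node, key, value))  # applied later

    def get(self, key, node=None):
        """Read from a given node, or from the key's first node by default."""
        node = node or self.nodes_for(key)[0]
        return self.stores[node].get(key)

    def propagate(self):
        """Apply all queued replica updates (stands in for async replication)."""
        while self.pending:
            node, key, value = self.pending.popleft()
            self.stores[node][key] = value


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
ring.put("user:42", "Alice")
second_replica = ring.nodes_for("user:42")[1]
print(ring.get("user:42", node=second_replica))  # replica has not caught up yet
ring.propagate()
print(ring.get("user:42", node=second_replica))  # replica now agrees
```

Between put() and propagate(), a read from a replica can return stale data. That window is what "eventually consistent" means in practice: the copies agree eventually, not instantly.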
Notable Proprietary Implementations of NoSQL
- Amazon’s Dynamo – A distributed storage system. Unlike a relational database system, it does not break data into tables; instead, all objects are stored and looked up via a key map.
- Google’s BigTable – A compressed, high-performance database built on Google’s proprietary platform. It is an extremely large DBMS capable of spanning several thousand servers and handling databases in the range of several petabytes.
Notable Open Source Implementations of NoSQL
- Apache’s Cassandra – The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
- HBase – HBase is the Hadoop database. Use it when you need random, real-time read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware.
- Hypertable – Hypertable is an open source project based on published best practices in solving large-scale data-intensive tasks. It tries to bring the benefits of new levels of both performance and scale to many data-driven businesses who are currently limited by previous-generation platforms.
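The ColumnFamily-based data model mentioned for Cassandra and Bigtable can be pictured as nested maps: row key, then column family, then column name, then value. The sketch below is a rough illustration only. The ColumnFamilyStore class, its method names and the sample rows are hypothetical and do not correspond to the API of Cassandra, HBase or Hypertable.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy in-memory model: row key -> column family -> column -> (timestamp, value)."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))
        self._clock = 0  # logical clock standing in for real timestamps

    def insert(self, row_key, family, column, value):
        """Write one cell; no schema is declared up front."""
        self._clock += 1
        self.rows[row_key][family][column] = (self._clock, value)

    def get(self, row_key, family, column):
        """Return the value of one cell, or None if it was never written."""
        cell = self.rows.get(row_key, {}).get(family, {}).get(column)
        return None if cell is None else cell[1]

    def get_family(self, row_key, family):
        """Return all columns of one family for a row, sorted by column name."""
        columns = self.rows.get(row_key, {}).get(family, {})
        return {name: value for name, (_, value) in sorted(columns.items())}


store = ColumnFamilyStore()
store.insert("user:1", "profile", "name", "Alice")
store.insert("user:1", "profile", "email", "alice@example.com")
# A different row can carry entirely different columns (schema free).
store.insert("user:2", "profile", "nickname", "bob")
print(store.get_family("user:1", "profile"))
print(store.get_family("user:2", "profile"))
```

Each row holds only the columns that were actually written to it, which is what the "schema free" property in the list above amounts to in this model.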
However, it should be noted that all the technology discussed here applies to very large-scale database systems. Our usual small-scale web systems can continue to use an RDBMS like MySQL, which is more than sufficient for handling a couple of GBs of data in a non-distributed environment.
Via [ReadWriteWeb] and other sources