28
2017
Let’s move to NoSQL Databases with MongoDB – Mazarin
Introduction to NoSQL
NoSQL databases are designed to store and retrieve data in a distributed environment, and they are
mostly used with big data and real-time web applications. Non-relational data stores have existed since the late
1960s, but they were not known as “NoSQL” back then. Even though some NoSQL databases support SQL-like query
languages, NoSQL is not a replacement for SQL. It is rather a complementary addition to RDBMS and SQL. MongoDB, Bigtable,
Redis, Neo4j, RavenDB, Cassandra, HBase and CouchDB are popular NoSQL databases available in
the market.
Why do we need NoSQL?
Today’s web, mobile and IoT applications have one or more of the following characteristics.
- Support large numbers of concurrent users (tens of thousands, perhaps millions)
- Deliver highly responsive experiences
- Be available at all times, with no downtime
- Handle semi-structured and unstructured data
- Rapidly adapt to changing requirements with frequent updates and new features
Since it was challenging to meet these requirements with typical relational databases, the need for NoSQL emerged.
CAP theorem & NoSQL
NoSQL databases are commonly described in terms of Brewer’s CAP theorem, which Eric Brewer published in 2000. The theorem states a basic trade-off that applies to any distributed system.
The CAP theorem names three guarantees: Consistency, Availability and Partition Tolerance. It is impossible to provide all three simultaneously, so a system is said to choose a combination of two. No distributed system is safe from network failures, so network partitioning generally has to be tolerated, and the real choice is between consistency and availability while a partition lasts. When choosing consistency over availability, the system returns an error or a time-out if it cannot guarantee that a particular piece of information is up to date because of the partition. When choosing availability over consistency, the system always processes the query and tries to return the most recent available version of the information, even if it cannot guarantee that it is up to date. When the distributed system is running normally, without network failure, both availability and consistency can be satisfied.
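The two choices described above can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the `Replica` class, its `mode` flag and the `"v1"` value are hypothetical names made up for this example, not part of any real database.

```python
class Replica:
    """A hypothetical replica that may be cut off from the primary."""

    def __init__(self, mode):
        self.mode = mode          # "CP" favours consistency, "AP" favours availability
        self.value = "v1"         # last value replicated from the primary
        self.partitioned = False  # True while the primary is unreachable

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot prove the value is current, so refuse to answer.
            raise TimeoutError("cannot confirm latest value during partition")
        # AP mode (or a healthy network): answer with the best value we have.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True  # simulate a network failure

print("AP replica:", ap.read())  # served, but possibly stale
try:
    cp.read()
except TimeoutError as err:
    print("CP replica:", err)
```

Once the partition heals (`partitioned = False`), both replicas answer normally, which matches the last sentence of the paragraph above.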
Types of NoSQL Databases
There are four main types of NoSQL databases, and all are designed for storing, retrieving, and managing information.
Key-Value Database
A key-value database stores each item of data under a unique key, which is used to find the record quickly. The value is opaque to the database: there are no fields to update, so the entire value must be rewritten when a change is made. Redis is one of the most popular key-value databases.
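A minimal sketch of that "replace the whole value" behaviour, using a plain Python dictionary as a stand-in for the store. The key `user:1001` and the record contents are invented for illustration only.

```python
import json

store = {}  # stand-in for a key-value database: key -> opaque value

# The value is just a blob as far as the store is concerned.
store["user:1001"] = json.dumps({"name": "Kamal", "city": "Colombo"})

# There is no "update one field" operation: read the whole value,
# change it in the application, then write the whole value back.
record = json.loads(store["user:1001"])
record["city"] = "Kandy"
store["user:1001"] = json.dumps(record)

print(json.loads(store["user:1001"])["city"])  # Kandy
```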
Graph Database
A graph database represents and stores data using nodes, edges and properties. Its strength lies in traversing the
connections between the nodes. However, graph databases generally require all data to fit on one machine, which limits their scalability. Neo4j is a Java-based graph database.
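To make the idea of traversal concrete, here is a small sketch that walks relationships breadth-first. It assumes an invented data set (the people and the `FOLLOWS` relationship are hypothetical) and is not how Neo4j stores or queries data internally.

```python
from collections import deque

# Nodes carry properties; edges are (source, relationship, target) triples.
nodes = {"alice": {"age": 30}, "bob": {"age": 25},
         "carol": {"age": 41}, "dave": {"age": 35}}
edges = [("alice", "FOLLOWS", "bob"),
         ("bob", "FOLLOWS", "carol"),
         ("carol", "FOLLOWS", "dave")]


def reachable(start, max_hops):
    """Return the nodes reachable from start within max_hops edges."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for src, _rel, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append((dst, hops + 1))
    return seen - {start}


print(sorted(reachable("alice", 2)))  # ['bob', 'carol']
```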
Column Family Database
A column family database stores key-value pairs in which each key is mapped to a value that is a set of columns. It is designed to store and process very large amounts of data distributed over many machines. Each column consists of a column name, a value and a timestamp. There are two types of column families: the standard column family, which contains only columns, and the super column family, which contains a map of columns.
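The row key, column name, value and timestamp described above can be sketched with nested dictionaries. The helper names `put` and `get` and the sample rows are assumptions for this example, and the "newest timestamp wins" rule is one common way such stores resolve conflicting writes, not a description of any specific product.

```python
import time

# row key -> {column name -> (value, timestamp)}
users = {}


def put(row_key, column, value, timestamp=None):
    """Write one column; the cell with the newest timestamp is kept."""
    ts = time.time() if timestamp is None else timestamp
    row = users.setdefault(row_key, {})
    current = row.get(column)
    if current is None or ts >= current[1]:
        row[column] = (value, ts)


def get(row_key, column):
    """Return only the value part of the stored (value, timestamp) cell."""
    return users[row_key][column][0]


put("user:1", "name", "Kamal", timestamp=1)
put("user:1", "email", "kamal@example.com", timestamp=1)
put("user:2", "name", "Nimal", timestamp=1)  # rows need not share columns
put("user:1", "email", "k.perera@example.com", timestamp=2)

print(get("user:1", "email"))  # k.perera@example.com
```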
Document Database
A document database is essentially a subclass of the key-value store that stores each record as a “document”. Unlike a relational database, it stores all information for a given object in a single instance. It also supports efficient querying and indexing. MongoDB is the leading document database.
RDBMS vs NoSQL
SQL Schema vs NoSQL Schemaless
In SQL it is mandatory to define tables, fields and field types, while it is optional to define primary keys, foreign keys, indexes, triggers and stored procedures. The data structure is fixed, and the schema must be designed and implemented before any business logic.
In NoSQL it is not necessary to define the document design, collections and so on in advance. The data structure is not fixed, and data can be added anywhere, at any time. This makes NoSQL better suited to projects where the initial data requirements are difficult to ascertain.
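As a rough illustration of the schemaless idea, the following sketch stores documents with different fields side by side. The `orders` list and the `gift_wrap` field are invented for this example.

```python
orders = []  # a "collection"; no structure is declared up front

orders.append({"_id": 1, "item": "laptop", "qty": 1})

# A later requirement adds a new field. Older documents are left untouched
# and no migration step is needed.
orders.append({"_id": 2, "item": "phone", "qty": 2, "gift_wrap": True})

# The trade-off: readers must tolerate documents that lack the field.
wrapped = [order["_id"] for order in orders if order.get("gift_wrap", False)]
print(wrapped)  # [2]
```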
SQL vs NoSQL Scaling
An RDBMS is not designed to run efficiently on clusters. It is largely limited to scaling up, that is, adding more processors, memory and storage to a single physical server. This becomes expensive as enterprises have to purchase larger servers, and it can result in downtime if the database has to be taken offline for hardware upgrades.
In contrast, NoSQL databases run well on clusters and scale out by adding more servers, on demand and without downtime. They were engineered to distribute reads, writes and storage, so they are easy to install, configure and scale.
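One simple way to picture how reads and writes get distributed is to hash each key onto a server. This is only a sketch under that assumption: the server names and the `server_for` helper are hypothetical, and real systems usually use more elaborate schemes (consistent hashing or range-based partitioning), because plain modulo hashing moves most keys whenever a server is added.

```python
import hashlib


def server_for(key, servers):
    """Pick a server for a key by hashing the key (simple modulo scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


servers = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
for key in ["user:1", "user:2", "user:3", "user:4"]:
    print(key, "->", server_for(key, servers))
```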
SQL Normalization vs NoSQL De-normalization
In SQL, data is read and written by disassembling and reassembling objects, which results in inefficiency, as illustrated in diagram 1.5-a.
On the other hand, NoSQL databases read and write data in formats such as XML, YAML and JSON, as well as binary forms like BSON. This eliminates the object-relational impedance mismatch and the overhead of ORM frameworks, which leads to faster queries. Denormalized data is inefficient when it is updated frequently, but normalization techniques can also be used in NoSQL, as shown in diagram 1.5-b.
Figure 1: Normalization techniques (Source: https://www.couchbase.com/resources/why-nosql)
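The difference between disassembling an object into rows and storing it whole can be sketched as follows. The `order` object, the two in-memory "tables" and the `reassemble` helper are all made up for this illustration.

```python
import json

order = {"id": 7, "customer": "Amal",
         "lines": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}]}

# Relational style: the object is taken apart into rows in two tables...
orders_table = [(order["id"], order["customer"])]
lines_table = [(order["id"], line["sku"], line["qty"]) for line in order["lines"]]


def reassemble(order_id):
    """...and has to be joined back together on every read."""
    oid, customer = next(row for row in orders_table if row[0] == order_id)
    lines = [{"sku": sku, "qty": qty}
             for (owner, sku, qty) in lines_table if owner == order_id]
    return {"id": oid, "customer": customer, "lines": lines}


# Document style: the object is written and read back as one unit.
document = json.dumps(order)

print(reassemble(7) == json.loads(document))  # True
```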
Introduction to MongoDB
MongoDB is an open source, cross-platform, document-oriented database that provides high performance, high availability and easy scalability. It has become one of the most popular NoSQL databases in the market since its inception in 2009. Because the MongoDB server is open source, users can install and use it free of charge. Many MongoDB clients are available to connect it to applications written in different languages. According to the following report, MongoDB became the fastest-growing NoSQL database.
Figure 2: NoSQL databases (Source: https://www.slideshare.net/mongodb/webinar-how-to-visually-explore-and-manipulate-your-mongodb-data)
Why use MongoDB?
There are many reasons to use MongoDB over a traditional RDBMS. MongoDB stores data as BSON, a binary-encoded form of JSON, so a document can be stored regardless of how many attributes it has. This suits systems that need to keep mixed types of data in a single collection. MongoDB servers can easily be configured for a cluster environment, where they support a large number of concurrent connections using the clustered server resources and keep data highly available, which helps avoid server downtime. A cluster can also handle fast data growth, such as thousands to millions of write queries per second. MongoDB can select and process large data sets without slowing down the system or its operations, whereas an RDBMS typically divides the data into small batches and processes them in order to keep database performance at an optimal level.
Data Modeling
Unlike an RDBMS, MongoDB does not need a declared data structure: a collection has a dynamic schema. In practice, the documents in a collection usually share similar fields and structure. Data in a collection can be modeled in two ways: the normalized data model and the denormalized data model. Since MongoDB has a flexible document structure, the data model that best fits the system requirements can be chosen.
MongoDB provides two types of data model: the embedded data model and the normalized data model.
Embedded Data Model (AKA denormalized model)
In this model, all related data is contained in a single document. Embedding allows faster read operations than the normalized model, and it allows related data to be read, written and updated with a single database operation.
As an example, consider the following diagram. The document has user id, user name, contact and access fields. The contact and access fields could be normalized, but here they are kept in the same document.
Figure 3: Embedded Data Model (Source: https://docs.mongodb.com/manual/core/data-model-design/)
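An embedded document along the lines of the figure could look like the sketch below. The field values are invented, and a Python dictionary stands in for the collection, so this only illustrates the shape of the data rather than an actual MongoDB call.

```python
# One document holds the user together with its contact and access data.
user = {
    "_id": "u100",
    "username": "kamal",
    "contact": {"phone": "011-555-0100", "email": "kamal@example.com"},
    "access": {"level": 5, "group": "dev"},
}

users = {user["_id"]: user}  # stand-in for a collection keyed by _id

# A single lookup returns the user and all related data at once.
doc = users["u100"]
print(doc["contact"]["email"], doc["access"]["group"])
```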
Normalized Data Model
This model keeps related data in multiple documents. The main (parent) document is related to its sub (child) documents through references. The normalized data model is useful for representing data in multiple hierarchies and nested arrays of data. MongoDB does not provide foreign key references, so CRUD operations in the application have to establish the relationship and then execute the operation.
Consider the following diagram as an example.
The document has user id, user name, contact and access fields. The contact and access fields can be considered normalizable data, which here is kept in separate documents.
Figure 4: Normalized Data Model (Source: https://docs.mongodb.com/manual/core/data-model-design/)
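The same user in normalized form could be sketched like this. The three in-memory collections, the `user_id` reference field and the `load_user` helper are assumptions for illustration. The point is that the application, not the database, follows the references.

```python
# Parent documents and child documents live in separate collections.
users = {"u100": {"_id": "u100", "username": "kamal"}}
contacts = [{"_id": "c1", "user_id": "u100",
             "phone": "011-555-0100", "email": "kamal@example.com"}]
access = [{"_id": "a1", "user_id": "u100", "level": 5, "group": "dev"}]


def load_user(user_id):
    """Resolve the references by hand: one lookup per collection."""
    user = dict(users[user_id])
    user["contact"] = next(c for c in contacts if c["user_id"] == user_id)
    user["access"] = next(a for a in access if a["user_id"] == user_id)
    return user


full = load_user("u100")
print(full["username"], full["contact"]["email"], full["access"]["level"])
```

Reading one user now takes three lookups instead of one, but the contact and access data can be updated in one place.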
Install MongoDB using Docker
Creating a MongoDB instance using Docker is very simple. The following steps show how to install a MongoDB instance in Docker.
Create a Dockerfile for MongoDB
The following Dockerfile creates a MongoDB version 3.0.1 Docker image based on Ubuntu 14.04.
FROM ubuntu:14.04
MAINTAINER chpa@mazarin.lk
# Import MongoDB public GPG key AND create a MongoDB list file
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
RUN echo "deb http://repo.mongodb.org/apt/ubuntu "$(lsb_release -sc)"/mongodb-org/3.0 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.0.list
# Update apt-get sources AND install MongoDB
RUN apt-get update && apt-get install -y mongodb-org=3.0.1 mongodb-org-server=3.0.1 mongodb-org-shell=3.0.1 mongodb-org-mongos=3.0.1 mongodb-org-tools=3.0.1
# Create the MongoDB data directory
RUN mkdir -p /data/db
# Expose port 27017 from the container to the host
EXPOSE 27017
# Set usr/bin/mongod as the dockerized entry-point application
ENTRYPOINT ["/usr/bin/mongod"]
Build the Docker image.
To build the image, execute the following command in the directory where the Dockerfile is located. The trailing dot tells Docker to use the current directory as the build context.
docker build --tag mazarin/mongo:v1 .
Run the created Docker image.
The following command creates a Docker container instance named “mongo” and publishes port 27017. The container’s /data/db directory is mounted from the data directory under the current host directory, so the database files reside outside the container.
docker run --name mongo -p 27017:27017 -v $(pwd)/data:/data/db -d mazarin/mongo:v1
Now we have successfully created a Docker container running MongoDB version 3.0.1. Any MongoDB client can access this instance through the default MongoDB port, 27017.
Create a replica set using MongoDB
Scalability is considered a key element of NoSQL database design, and the MongoDB architecture addresses it in its own way. MongoDB uses replica sets to keep copies of the data on several servers, which provides redundancy and lets reads be spread over more than one member. The steps below create a MongoDB replica set on a local machine using Docker instances.
1. Multiple MongoDB instances are needed to create a replica set. Run the Docker image three
times, each with a different port and its own data directory. The additional configuration option “--replSet” indicates that we are
creating a cluster and that each instance belongs to a replica set named “rs0”.
docker run --name mongo_001 -p 28001:27017 -v $(pwd)/data_001:/data/db -d mazarin/mongo:v1 --replSet rs0
docker run --name mongo_002 -p 28002:27017 -v $(pwd)/data_002:/data/db -d mazarin/mongo:v1 --replSet rs0
docker run --name mongo_003 -p 28003:27017 -v $(pwd)/data_003:/data/db -d mazarin/mongo:v1 --replSet rs0
2. Log in to one of the MongoDB instances using a preferred client.
mongo --port 28001
3. Initiate the cluster from that shell. The instance shows as SECONDARY at first and automatically becomes PRIMARY within a few seconds.
rs.initiate()
4. Find the IP addresses of the other running mongo instances so that they can be added to the primary. Run the command once for each of the other containers.
docker inspect mongo_002 | grep IPAddress
5. Add the secondary mongo instances to the primary mongo instance, one rs.add() call per IP address.
rs.add("172.18.0.2")
6. Check the status of the replica set after adding the secondary mongo instances.
rs.status();
Now we have successfully created a MongoDB replica set. Add some data to the PRIMARY, and you can read the same data from a SECONDARY mongo server.
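What happens between the PRIMARY and the SECONDARY members can be pictured with a simplified model: the primary records each write in an ordered operation log, and the secondaries replay the entries they have not yet applied. The `Member` and `ReplicaSet` classes below are hypothetical and leave out elections, failures and networking. They are a sketch of the idea, not MongoDB's implementation.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0  # number of log entries already applied


class ReplicaSet:
    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries
        self.oplog = []  # ordered log of write operations

    def write(self, key, value):
        """Writes go to the primary and are appended to the log."""
        self.primary.data[key] = value
        self.oplog.append((key, value))
        self.primary.applied = len(self.oplog)

    def sync(self):
        """Each secondary replays the entries it has not seen yet."""
        for member in self.secondaries:
            for key, value in self.oplog[member.applied:]:
                member.data[key] = value
            member.applied = len(self.oplog)


rs0 = ReplicaSet(Member("mongo_001"), [Member("mongo_002"), Member("mongo_003")])
rs0.write("greeting", "hello")
print(rs0.secondaries[0].data)  # {} until the secondary has synced
rs0.sync()
print(rs0.secondaries[0].data)  # {'greeting': 'hello'}
```

The gap between the two print statements is why a read from a secondary can briefly return older data than the primary.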
Important
- If the status of the cluster keeps saying STARTUP, check the host name of the primary server. If it is not an IP address, run the commands below on the primary server to correct it.
cfg = rs.conf()
cfg.members[0].host = ""
rs.reconfig(cfg)
rs.conf()
- To read data from the SECONDARY servers, you may have to execute the command below on the SECONDARY server.
rs.slaveOk()
References
- https://en.wikipedia.org/wiki/Graph_database
- http://data-magnum.com/lesson-5-key-value-stores-aka-tuple-stores/
- http://www.getbreezenow.com/zza-mongo
- https://10kloc.wordpress.com/tag/column-family/
- https://www.couchbase.com/resources/why-nosql
- https://www.slideshare.net/mongodb/webinar-how-to-visually-explore-and-manipulate-your-mongodb-data
- https://docs.mongodb.com/manual/core/data-model-design/
Authors
- Kaushal Senevirathne
- Charith Padmasiri
- Sirikumara Ranathunga