4.2 : Document Databases
Introduction to Document Databases
Motivations
Document databases were created as an alternative to SQL, which has problems such as:
- Rigid schema
- Not easily scalable (for distributed machines)
- Requires unintuitive joins
Document databases, namely MongoDB, offered the perks of:
- An easy interface with common languages (like Java/JS/PHP)
- Tech which could run anywhere (VMs, cloud etc.)
- Keeping the essential features of RDBMSs, while learning from key-value noSQL systems
Data Model
They're document based, where a document has a max size of 16MB
Documents are in a BSON format, consisting of key-value pairs
Each document is stored in a collection
Collections:
- Have an index set in common
- Are like tables in a relational database
- Do not need to have a uniform structure
JSON
They use JSON (JavaScript Object Notation), which is easy for humans to read/write and for computers to parse/generate
JSON objects can be nested
JSON is built on name/value pairs, and an ordered list of values
BSON
Under the hood, the use BSON, which is essentially Binary JSON(a binary-encoded serialisation of JSON-like docs)
BSON allows for referencing
BSON's embedded structure reduces the need for joins
BSOn has the goals of being lightweight, traversable and efficient (at encoding and decoding)
An example of some BSON is:
{
"_id": "37010",
"city": "ADAMS",
"pop": 2660,
"state": "TN",
"councilman": {
"name": "John Smith",
"address": "13 Scenic Way"
}
}
BSON has many types which can be used
It also has an _id field by default, which has some special characteristics:
- Value serves as the primary key for the collection
- Value is unique, immutable and may be any non-array type
- Default data type is ObjectId, which is "small, likely unique, fast to generate and ordered", so sorting on it is similar to sorting on creation time
MongoDB vs SQL
MongoDB | SQL |
---|---|
Document | Tuple |
Collection | Table/View |
PK: _id Field | PK: Any Attribute(s) |
Uniformity not required | Uniform Relational Schema |
Index | Index |
Embedded Structure (pre-joined documents) | Joins (explicit) |
CRUD
MongoDB follows a very logical set of CRUD instructions, which can be found in their CRUD documentation
Schema Design
MongoDB is basically schema free
It has a 'flexible' schema, where certain records can have different fields, which can add more complexity of checking which fields you have to the application code
This means there are two methods of joining:
Embedding
This is when we have a one-to-one relationship
We embed a document inside of another one, so they just make one overall document
It can be used for one-to-many relationships, but the embedded data is now duplicated in every document which has it embedded, which is wasting space...
Linking
This is used when we have a one-to-many relationship
This is where we put the id of one document in the other, so they can be linked together
It's like a foreign key in SQL
We join these in the application code
Embedding vs Linking
Embedding is like pre-joining data
However, embedding means if you only want one of the documents, you're still also retrieving the embedded one every time
Document level operations are easy for the server to handle
It's better to embed when the "many" objects always appear with their parents
Linking is used when you don't always need the "many" data, meaning you are more flexible
Indexes
Without indexes, MongoDB must scan every document, which can get inefficient because large volumes of data are being processed
Indexes are special data structures that store a small portion of the collection's data set in an easy to traverse form
Using the example of a collection users
:
- Indexes can be created using:
db.users.ensureIndex({score: 1})
- Indexes can be shown using:
db.users.getIndexes()
- Indexes can be dropped using:
db.users.dropIndex({score: 1})
- To get a document which describes the process and if/which indexes are used:
db.users.find().explain()
- To override the default index selection:
db.users.find().hint({score: 1})
In MongoDB, there are three types of index:
- Single Field Indexes:
db.users.ensureIndex({score: 1})
- Compound Field Indexes:
db.users.ensureIndex({userid:1, score: 1})
- Multikey Indexes:
db.users.ensureIndex({addr.zip: 1})
Aggregation
Aggregation operations process data records and return computed results
They simplify application code and limit resource requirements
They're modelled on the concept of data processing pipelines
They provide:
- Filters - these operate like queries
- Document transformations - these modify the form of the output document
- Tool for grouping and sorting
- Tools for aggregating the content of arrays
They can also be used to do things such as calculating the average or concatenating a string
Some example pipeline stages are:
$limit
$skip
$sort
$match
$group
$distinct