
#DENORMALISED - 20th September 2012

##Deep dive into MongoDB

MongoDB has found a sweet spot in enterprise adoption due to its ease of use, powerful query model and natural progression for those coming from a SQL mindset. But under the hood it has some unusual design approaches that can trip up the unwary.

Equal Experts has first hand experience of putting MongoDB to use on a number of production systems. In this talk Julian Browne will talk about the power and the pitfalls of MongoDB using practical examples that highlight its inner workings.

##Prerequisites

Assumes MongoDB has been downloaded and that there's a writable /data/db directory for mongod to access
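
If the /data/db directory doesn't exist yet it can be created up front (a small sketch assuming a Unix-like system; adjust ownership to suit your setup):

$ sudo mkdir -p /data/db
$ sudo chown $(whoami) /data/db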

##Script

###Basics

cd to the MongoDB bin directory and start it up

$ mongod --rest

The --rest option provides a useful http connection at http://localhost:28017/

Let's see what mongod is up to:

$ mongostat  

Important columns are query/update/delete/getmore (what clients are up to); mapped (size of data, starts at 0m); the process resource utilisation (vsize, res, faults) and the frequency of the global lock (locked)

###Exploring MongoDB

Let's make some data using the mongo shell

$ mongo
> 

Help is your best friend

> help
db.help()                    help on db methods
db.mycoll.help()             help on collection methods
sh.help()                    sharding helpers
rs.help()                    replica set helpers
help admin                   administrative help
help connect                 connecting to a db help
help keys                    key shortcuts
help misc                    misc things to know
help mr                      mapreduce

show dbs                     show database names
show collections             show collections in current database
show users                   show users in current database
show profile                 show most recent system.profile entries with time >= 1ms
show logs                    show the accessible logger names
show log [name]              prints out the last segment of log in memory, 'global' is default
use <db_name>                set current database
db.foo.find()                list objects in collection foo
db.foo.find( { a : 1 } )     list objects in foo where a == 1
it                           result of the last line evaluated; use to further iterate
DBQuery.shellBatchSize = x   set default number of items to display on shell
exit                         quit the mongo shell

Also, for demos etc., cls clears the screen and starts back at the top, which is quite useful:

> cls

Another useful feature is that typing a command without its brackets prints its implementation:

> db.stats
function (scale) {
	return this.runCommand({dbstats:1, scale:scale});
}

What databases exist?

> show dbs

There are no databases yet, so let's make one:

> use dave

"db" is conventional notation for 'the current' database. Now let's create a collection called "colin" in dave:

> db.createCollection('colin')

This makes a conventional, empty collection to put documents in. We can see the effect it's had by running mongostat from the MongoDB bin directory:

$ mongostat

This now shows some storage as 'mapped'. If we go to the file system in /data/db we can see the files that have been created to manage our data:

$ cd /data/db
$ ls -lsk

We should have a db storage file of 64MB (dave.0) plus a namespace file (dave.ns), plus a 128MB file preallocated for future use to save time later (dave.1). This ahead-of-time creation can be prevented if desired with the --noprealloc option to mongod.
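
For example, the demo could have been started without preallocation (after stopping the current mongod):

$ mongod --rest --noprealloc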

Now let's add a simple document from the mongo shell

> db.colin.save({hello: "world"})

And back to the file system to check out what's happened. We can look inside the dave.0 file and actually see our JSON doc saved in BSON form:

$ od -c dave.0

Note: field names are stored as is (not compressed or tokenised) and repeated for every document so it's important to keep key names short to minimise storage needs and hence allow for more docs to be in memory concurrently.
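
A hypothetical illustration of that trade-off: the two documents below hold the same data, but the second spends far fewer bytes on key names, and that saving is repeated for every document in the collection:

> db.colin.save({"customerName" : "dave", "customerAge" : 42})
> db.colin.save({"n" : "dave", "a" : 42})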

###Populating MongoDB

First, let's add some useful functions to the shell to make adding some interesting documents easier:

> // generate random string of length

function randString(length) {
	var text  = "";
	var chars = "abcdefghijklmnopqrstuvwxyz0123456789";
	for(var i=0; i < length; i++ )
		text += chars.charAt(Math.floor(Math.random() * chars.length));
	return text;
}

> // generate random number up to max

function randInt(max) {
	return Math.floor(Math.random() * max);
}

> // generate a random number of likes

function randLikes(max) {
	var sports = [ "football", "rugby", "hockey", "tennis", "running", "equestrian", "shotput", "javelin", "high jump", "long jump" ];
	var likes = [ ];
	for(var i=0; i < max; i++) {
		var likeId = randInt(sports.length);
		likes.push(sports[likeId]);
		sports = sports.slice(0,likeId).concat(sports.slice(likeId+1));
	}
	return likes;
}

Now we can add half a million documents to the colin collection:

>	for(var id=0; id < 500000; id++) {
		db.colin.save({
			"_id"   : id,
			"name"  : randString(20),
			"age"   : randInt(100),
			"likes" : randLikes(3)
		});	
	}

And mongostat provides some useful information about how things are performing

$ mongostat

A MacBook Pro with SSD typically creates these documents at a rate of between 9,000 and 12,000 per second. Note the mapped data total rising and the global lock being invoked. Writes are of course async so this is a useful baseline figure, not a meaningful benchmark.

Back in the shell we can see what MongoDB thinks we have:

> db.colin.stats()

###Finding Data

To find all docs in the collection we pass an empty JSON document to colin's find function

> db.colin.find({})

We can also page through data by assigning those results to a cursor

> var c = db.colin.find({})  
> c.hasNext()					// more data?
true
> c.next()  					// get next document
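
Cursors can also be walked in one go with forEach; a minimal sketch, filtered on age so it doesn't print all 500,000 documents:

> db.colin.find({"age" : 99}).forEach(function(doc) { printjson(doc); })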

Get the total number of documents in the collection

> db.colin.find({}).count() 

Or, simpler:

> db.colin.count()  

Find has a lot of options; one nice one is regex support:

> db.colin.find({name: /^Dave/})

Though searches are case sensitive by default, so it's better to try:

> db.colin.find({name: /^Dave/i})

Want to find all people who like football? We can search into arrays too without extra command overhead:

> db.colin.find({"likes" : "football"})

Find all people who are 18

> db.colin.find({"age" : 18})

Find how many

> db.colin.find({"age" : 18}).count()  

Find all people who are under 18

> db.colin.find({"age" : { $lt : 18 } })  

How many are there:

> db.colin.find({"age" : { $lt : 18 } }).count()

Note that:

> db.colin.find({"age" : 18})

takes a perceptible amount of time to return with 500,000 documents. When developing for a production system this is a warning sign worth investigating. The first step is to tell the profiler to report on long-running queries. In this case let's set the reporting threshold very low, to 20ms:

> db.setProfilingLevel(1,20)

Now when those queries are run again we'll get a useful report logged to the console where mongod was run
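
Profiled operations are also written to the system.profile collection in the current database, which can be queried like any other collection; for example, the five most recent entries:

> db.system.profile.find().sort({ ts : -1 }).limit(5)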

It's also worth using explain() to find out how MongoDB is servicing the query:

> db.colin.find({"age" : 18}).explain()

Explain tells us all 500,000 documents are being scanned for that query. If this query is important then we should add an index to help MongoDB out:

> db.colin.ensureIndex({"age" : 1})

Checking explain() again should show an enormous drop in response time (from 140ms to about 5ms) and in documents scanned. Note though that indexes aren't free. They take up space and they also need to be updated when new documents are added. Index use must be balanced against write impact.

Indexes can also enforce uniqueness for fields.
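
For example, a hypothetical unique constraint on name (inserts that duplicate an existing name would then be rejected):

> db.colin.ensureIndex({"name" : 1}, {unique : true})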

What about that count operation?

> db.colin.find({age: 25}).count()

MongoDB doesn't use counting btrees for its indexes so even with an index count() queries on large document sets can still be slower than expected.

If after creating an index you would like to see how bad performance would have been without it, you can hint the query to ignore the index:

> db.colin.find({age: 25}).hint({$natural:1})

The aggregation function provides a useful alternative for queries that may in the past have required map/reduce

> db.colin.aggregate( { $match : { "likes" : "football" } } );  

This will fail with 500,000 documents as the output must remain within the 16MB document size limit. Here's one that narrows the result set and should work:

> db.colin.aggregate( { $match : { "likes" : "football", age : { $lt : 20 } }  } );  
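
Stages can also be chained; a minimal sketch that groups those young football fans by age and counts them (field names as created earlier):

> db.colin.aggregate(
      { $match : { "likes" : "football", "age" : { $lt : 20 } } },
      { $group : { "_id" : "$age", "total" : { $sum : 1 } } }
  );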

###Updating Data

Changing a field to a value of identical (or smaller) length is very quick in MongoDB:

> for(var id=0; id < 500000; id++) {
		db.colin.update({"_id" : id}, { $set : { "postcode" : "EC1V 7DP" } });
	}

Transaction rates (async) can reach 45,000/second if there's no index on colin and around 11,000/second with an index

> db.colin.dropIndexes()

will remove all non-_id indexes for this test. Checking

> db.colin.stats()

Shows what MongoDB is using for the Padding Factor between documents. This is intelligently increased when documents grow a lot.
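
To pull out just that figure (paddingFactor is the field name reported in the stats output of this era):

> db.colin.stats().paddingFactor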

Transaction rates can be seriously impacted if a document grows beyond its current slot and has to be moved to another place in the data file. This script makes the 20 char name 50 chars long:

> for(var id=0; id < 500000; id++) {
		var newName = randString(50);
		db.colin.update({ "_id" : id }, { $set : { "name" : newName } } );
	}

and performs at about 15,000 writes/second without an index, 10,000 writes/sec with an index on age and 3,000 writes/sec with an index on name. It's worth comparing the 50,000 writes/sec when the document stays the same size or shrinks with the 3,000 writes/sec when there's an index on name and name is growing in size. In some cases it's worth padding out documents when they are created to keep expansion to a minimum.
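
A hedged sketch of that padding trick: create the document with a throwaway filler field and then $unset it, leaving slack in the record for later growth (the field name "padding" and the _id are purely illustrative):

> db.colin.save({ "_id" : 500001, "name" : randString(20), "padding" : randString(100) })
> db.colin.update({ "_id" : 500001 }, { $unset : { "padding" : "" } })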

Collections can become fragmented, just like a file system, over time.

> db.runCommand( { compact : 'colin' } )

will defrag them, but note the % global lock in mongostat while this is happening. Best not to do this in a live production environment when there are a lot of documents.

###Deleting Data

> db.colin.drop()

This empties out the data but db files remain

$ cd /data/db  
$ ls -lsk

But:

> db.dropDatabase()

Removes the files

##Further Reading
