Accessing the database can be one of the least performant operations in an entire application. If the DB sends back gobs of unneeded structural data, it gets even worse: all of that unused metadata slows the response and compounds the latency issue. In my VERY limited experience with MongoDB, I'm not too thrilled with the amount of metadata I get back from SOME queries. Unlike SQL, where column names live once in the table definition, the field names of each record (document) get returned from the DB to the application along with the values. That's a lot of overhead in some situations. Of course, in some cases that metadata is completely necessary, and you just have to bite the bullet.

However, for some queries there may be a really easy way to avoid this overhead; if so, I didn't see it in the MongoDB course I'm taking or while perusing the docs.

Here's a single example document in a collection called 'families':

{
	"familyName" : "Doe",
	"friendFamilies" : [ ],
	"familyMembers" : [
		{
			"familyMembersId" : ObjectId("52818bf26ffc7e6d09000001"),
			"parent" : true,
			"givenName" : "John",
			"nickName" : null,
			"secretPin" : 8315
		},
		{
			"familyMembersId" : ObjectId("52818bf26ffc7e6d09000002"),
			"parent" : true,
			"givenName" : "Jane",
			"nickName" : "Momma",
			"secretPin" : 6574
		},
		{
			"familyMembersId" : ObjectId("52818bf26ffc7e6d09000003"),
			"parent" : false,
			"givenName" : "Mandy",
			"nickName" : "",
			"secretPin" : 9917
		},
		{
			"familyMembersId" : ObjectId("52818bf26ffc7e6d09000004"),
			"parent" : false,
			"givenName" : "Manny",
			"nickName" : "",
			"secretPin" : 1170
		}
	],
	"_id" : ObjectId("52818bf26ffc7e6d09000005")
}

This shows all the members of a particular family, whose _id is ObjectId("52818bf26ffc7e6d09000005").

At this point, there is nothing important here except for the _id. Say I need EVERY _id for every document in the collection. So, I query like this:

db.families.find({}, { _id : 1 } )

I get a result like this:

{ "_id" : ObjectId("52818bf26ffc7e6d09000005") }
{ "_id" : ObjectId("52818c017f70f16e09000005") }
{ "_id" : ObjectId("52818c017f70f16e0900000b") }
{ "_id" : ObjectId("52818c017f70f16e09000010") }
{ "_id" : ObjectId("52818c017f70f16e09000016") }
{ "_id" : ObjectId("52818c017f70f16e0900001b") }
{ "_id" : ObjectId("52818c017f70f16e09000021") }
{ "_id" : ObjectId("52818c017f70f16e09000028") }
{ "_id" : ObjectId("52818c017f70f16e0900002d") }
{ "_id" : ObjectId("52818c017f70f16e09000031") }
{ "_id" : ObjectId("52818c017f70f16e09000036") }
{ "_id" : ObjectId("52818c017f70f16e09000039") }
{ "_id" : ObjectId("52818c017f70f16e0900003d") }
{ "_id" : ObjectId("52818c017f70f16e09000041") }
{ "_id" : ObjectId("52818c017f70f16e09000046") }
{ "_id" : ObjectId("52818c017f70f16e0900004c") }
{ "_id" : ObjectId("52818c017f70f16e09000050") }
{ "_id" : ObjectId("52818c017f70f16e09000055") }
{ "_id" : ObjectId("52818c017f70f16e0900005c") }
{ "_id" : ObjectId("52818c017f70f16e09000062") }
...

So, the database has to transmit the field name "_id" along with EVERY single value in the collection. If I'm pulling 50,000 records for some reason, that's a whole lot of repeated overhead.

In MySQL and other RDBMSes, I'd simply do this:

select _id from families

Then I get something like this (NOTE: I removed the ObjectId wrapper because MySQL doesn't use anything like that):

52818bf26ffc7e6d09000005
52818c017f70f16e09000005
52818c017f70f16e0900000b
52818c017f70f16e09000010
52818c017f70f16e09000016
52818c017f70f16e0900001b
52818c017f70f16e09000021
52818c017f70f16e09000028
52818c017f70f16e0900002d
52818c017f70f16e09000031
52818c017f70f16e09000036
52818c017f70f16e09000039
52818c017f70f16e0900003d
52818c017f70f16e09000041
52818c017f70f16e09000046
52818c017f70f16e0900004c
52818c017f70f16e09000050
52818c017f70f16e09000055
52818c017f70f16e0900005c
52818c017f70f16e09000062
...

As you can see, the same INFORMATION was returned, but without all the overhead describing the documents. Again, over a large dataset, that can add up to a lot of bandwidth and hence latency.

I dabbled around and came up with a solution to this problem. It might not be optimal, but it works, and it trims the result down to just the values I want. One caveat: since map() runs client-side on the cursor, I suspect the projected documents, field names and all, still cross the wire, so this may help the application more than the bandwidth. I also don't know what it does to DB performance. Would it be better to bite the bandwidth bullet to avoid bogging down the DB and possibly slowing down other requests? If someone has some good info on this, I'd appreciate it.

My "solution":

db.families.find( {}, { _id : 1 }).map( function( record ) { return record._id });

This results in an array of all the _ids, without the metadata:

[
	ObjectId("52818bf26ffc7e6d09000005"),
	ObjectId("52818c017f70f16e09000005"),
	ObjectId("52818c017f70f16e0900000b"),
	ObjectId("52818c017f70f16e09000010"),
	ObjectId("52818c017f70f16e09000016"),
	ObjectId("52818c017f70f16e0900001b"),
	ObjectId("52818c017f70f16e09000021"),
	ObjectId("52818c017f70f16e09000028"),
	ObjectId("52818c017f70f16e0900002d"),
	ObjectId("52818c017f70f16e09000031"),
	ObjectId("52818c017f70f16e09000036"),
	ObjectId("52818c017f70f16e09000039"),
	ObjectId("52818c017f70f16e0900003d"),
	ObjectId("52818c017f70f16e09000041"),
	ObjectId("52818c017f70f16e09000046"),
	ObjectId("52818c017f70f16e0900004c"),
	ObjectId("52818c017f70f16e09000050"),
	ObjectId("52818c017f70f16e09000055"),
	ObjectId("52818c017f70f16e0900005c"),
	ObjectId("52818c017f70f16e09000062")
]
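For anyone wondering what that map call is doing: the shell's cursor map behaves just like JavaScript's Array map. Here's a plain-JavaScript sketch of the same transformation, with the ObjectIds stubbed out as plain strings purely for illustration:

```javascript
// Stand-in for the documents the { _id : 1 } projection returns;
// real shell results would wrap each value in ObjectId(...).
var docs = [
	{ _id: "52818bf26ffc7e6d09000005" },
	{ _id: "52818c017f70f16e09000005" },
	{ _id: "52818c017f70f16e0900000b" }
];

// Same shape as db.families.find({}, { _id : 1 }).map(...):
// pull the _id out of each document, producing a flat array.
var ids = docs.map(function (record) { return record._id; });
// ids is now just the three id strings
```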

So, some of you might be saying, "That small bit of overhead is no big deal!" or "Why do you need every _id? That's a contrived example." It is a contrived example, but it does demonstrate the problem.

Here's a REAL example:

Suppose I need every familyMembersId for a particular family. Here's the query and result:

db.families.find( { _id : ObjectId("52818c1fa1ed6c7009005c3f") }, { 'familyMembers.familyMembersId' : 1, _id : 0 }).pretty();

Results in:

{
	"familyMembers" : [
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3a")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3b")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3c")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3d")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3e")
		}
	]
}

NOW look at all the overhead. "But it's just a little more metadata overhead, big deal!" True. But imagine that my application runs that query a few million times a day. Now we're talking about some serious overhead.

So, I came up with a solution. Again, I'm not sure of its impact on the database; perhaps it just causes more pain. If anyone knows for sure or has a more elegant solution, please reach out to me via Twitter.

Solution:

db.families.find(
	{ _id : ObjectId("52818c1fa1ed6c7009005c3f") },
	{ 'familyMembers.familyMembersId' : 1, _id : 0 }
).map( function( record ) {
	var a = [];
	record.familyMembers.map( function( member ) { a.push( member.familyMembersId ); });
	return a;
});

Results in:

[
	[
		ObjectId("52818c1fa1ed6c7009005c3a"),
		ObjectId("52818c1fa1ed6c7009005c3b"),
		ObjectId("52818c1fa1ed6c7009005c3c"),
		ObjectId("52818c1fa1ed6c7009005c3d"),
		ObjectId("52818c1fa1ed6c7009005c3e")
	]
]
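To make it clearer what the two maps are doing, here's the same flattening in plain JavaScript, with the matched document stubbed out and the ObjectIds replaced by strings for illustration:

```javascript
// Stand-in for the single document the _id query matches.
var results = [
	{
		familyMembers: [
			{ familyMembersId: "52818c1fa1ed6c7009005c3a" },
			{ familyMembersId: "52818c1fa1ed6c7009005c3b" }
		]
	}
];

// Outer map: one pass per matched document (only one here).
// Inner map: collect each member's familyMembersId into a.
var ids = results.map(function (record) {
	var a = [];
	record.familyMembers.map(function (member) { a.push(member.familyMembersId); });
	return a;
});
// ids is an array holding one inner array of member ids
```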

I really hate the 2 maps in there. I couldn't find a way to avoid them, because the "record" comes back looking like the example below even when using findOne:

{
	"familyMembers" : [
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3a")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3b")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3c")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3d")
		},
		{
			"familyMembersId" : ObjectId("52818c1fa1ed6c7009005c3e")
		}
	]
}

The first map deals with each distinct document returned; in this case, because the query matches on _id, there will only ever be one. If I had used findOne, I'd get back a plain document rather than a cursor, so I'd have no access to further operations such as map and forEach that the MongoDB cursor provides.
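One small tidy-up, sketched here against the same stubbed documents: the inner map already returns a new array, so the temporary a and the pushes aren't needed, and indexing the outer result with [0] unwraps the lone matched document. It's still two maps, just leaner. As far as I can tell, the same .map(...)[0] chaining works directly on a shell cursor, since cursor map returns a plain array, but treat that as an assumption:

```javascript
// Stand-in for the single document the _id query matches.
var results = [
	{
		familyMembers: [
			{ familyMembersId: "52818c1fa1ed6c7009005c3a" },
			{ familyMembersId: "52818c1fa1ed6c7009005c3b" }
		]
	}
];

// Let the inner map build the array directly, then [0] unwraps
// the single matched document from the outer array.
var memberIds = results.map(function (record) {
	return record.familyMembers.map(function (member) {
		return member.familyMembersId;
	});
})[0];
// memberIds is a flat array of the member ids
```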

Please let me know if I've missed some very obvious means of reducing the bandwidth or if my solutions could be improved upon.