Operational Factors and Data Models

Modeling application data for MongoDB should consider various operational factors that impact the performance of MongoDB. For instance, different data models can allow for more efficient queries, increase the throughput of insert and update operations, or distribute activity to a sharded cluster more effectively.

When developing a data model, analyze all of your application’s read and write operations in conjunction with the following considerations.

Atomicity

In MongoDB, a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document. When a single write operation modifies multiple documents (e.g. db.collection.updateMany()), the modification of each document is atomic, but the operation as a whole is not atomic.
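For example, the following mongosh sketch (the accounts collection and its fields are hypothetical) illustrates the distinction:

  // Atomic: both fields change together or not at all,
  // because a single document is modified.
  db.accounts.updateOne(
    { _id: 1 },
    { $inc: { balance: -100 }, $set: { lastDebit: new Date() } }
  )

  // Not atomic as a whole: each matching document is updated
  // atomically, but other operations may interleave between
  // the individual document updates.
  db.accounts.updateMany(
    { status: "overdue" },
    { $set: { flagged: true } }
  )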

Embedded Data Model

The embedded data model combines all related data in a single document instead of normalizing across multiple documents and collections. This data model facilitates atomic operations.

See Model Data for Atomic Operations for an example data model that provides atomic updates for a single document.
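As a minimal sketch of the pattern (a hypothetical patrons collection), an embedded design keeps related data in one document so a single write can update all of it atomically:

  // All related data lives in one document ...
  {
    _id: 1,
    name: "Joe Bookreader",
    address: { street: "123 Fake Street", city: "Faketon" }
  }

  // ... so one atomic write can touch several embedded fields.
  db.patrons.updateOne(
    { _id: 1 },
    { $set: { "address.street": "456 Real Street", "address.city": "Realton" } }
  )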

Multi-Document Transaction

For data models that store references between related pieces of data, the application must issue separate read and write operations to retrieve and modify these related pieces of data.

For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports multi-document transactions:

  • In version 4.0, MongoDB supports multi-document transactions on replica sets.
  • In version 4.2, MongoDB introduces distributed transactions, which add support for multi-document transactions on sharded clusters and incorporate the existing support for multi-document transactions on replica sets.

For details regarding transactions in MongoDB, see the Transactions page.
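As an illustrative sketch, assuming a hypothetical shop database with orders and inventory collections, a multi-document transaction in mongosh looks like the following:

  const session = db.getMongo().startSession();
  const orders = session.getDatabase("shop").orders;
  const inventory = session.getDatabase("shop").inventory;

  session.startTransaction();
  try {
    // Both writes commit together or not at all.
    orders.insertOne({ sku: "abc123", qty: 1 });
    inventory.updateOne({ sku: "abc123" }, { $inc: { qty: -1 } });
    session.commitTransaction();
  } catch (error) {
    session.abortTransaction();
    throw error;
  } finally {
    session.endSession();
  }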

Important

In most cases, a multi-document transaction incurs a greater performance cost than single-document writes, and the availability of multi-document transactions should not be a replacement for effective schema design. For many scenarios, the denormalized data model (embedded documents and arrays) will continue to be optimal for your data and use cases. That is, for many scenarios, modeling your data appropriately will minimize the need for multi-document transactions.

For additional transactions usage considerations (such as runtime limit and oplog size limit), see also Production Considerations.

Sharding

MongoDB uses sharding to provide horizontal scaling. These clusters support deployments with large data sets and high-throughput operations. Sharding allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards.

To distribute data and application traffic in a sharded collection, MongoDB uses the shard key. Selecting the proper shard key has significant implications for performance, and can enable or prevent query isolation and increased write capacity. It is important to consider carefully the field or fields to use as the shard key.

See Sharding and Shard Keys for more information.
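For illustration, assuming a hypothetical mydb.events collection, sharding on a compound key might look like the following:

  // Enable sharding for the database, then shard the collection
  // on a compound shard key (tenant_id groups related documents;
  // ts spreads a tenant's writes over time).
  sh.enableSharding("mydb")
  sh.shardCollection("mydb.events", { tenant_id: 1, ts: 1 })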

Indexes

Use indexes to improve performance for common queries. Build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
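For instance, a compound index on a hypothetical users collection can support both a common query and its sort order:

  // Supports queries that filter on status and sort by score.
  db.users.createIndex({ status: 1, score: -1 })

  // This query can use the index for both the match and the sort.
  db.users.find({ status: "active" }).sort({ score: -1 })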

As you create indexes, consider the following behaviors of indexes:

  • Each index requires at least 8 kB of data space.
  • Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
  • Collections with high read-to-write ratio often benefit from additional indexes. Indexes do not affect un-indexed read operations.
  • When active, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.

See Indexing Strategies for more information on indexes as well as Analyze Query Performance. Additionally, the MongoDB database profiler may help identify inefficient queries.

Large Number of Collections

In certain situations, you might choose to store related information in several collections rather than in a single collection.

Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

  { log: "dev", ts: ..., info: ... }
  { log: "debug", ts: ..., info: ... }

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs_dev and logs_debug. The logs_dev collection would contain only the documents related to the dev environment.
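With this layout, the application writes to and reads from each environment's collection directly instead of filtering one large collection; for example (a sketch using the names above):

  // Write directly to the per-environment collection ...
  db.logs_dev.insertOne({ ts: new Date(), info: "request served" })

  // ... and read it without filtering on an environment field.
  db.logs_dev.find().sort({ ts: -1 }).limit(10)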

Generally, having a large number of collections incurs no significant performance penalty and can result in very good performance. Distinct collections are very important for high-throughput batch processing.

When using models that have a large number of collections, considerthe following behaviors:

  • Each collection has a certain minimum overhead of a few kilobytes.
  • Each index, including the index on _id, requires at least 8 kB of data space.
  • For each database, a single namespace file (i.e. <database>.ns) stores all meta-data for that database, and each index and collection has its own entry in the namespace file. MongoDB places limits on the size of namespace files.

Collection Contains Large Number of Small Documents

You should consider embedding for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider “rolling-up” the small documents into larger documents that contain an array of embedded documents.

“Rolling up” these small documents into logical groupings means that queries to retrieve a group of documents involve sequential reads and fewer random disk accesses. Additionally, “rolling up” documents and moving common fields to the larger document benefit the index on these fields. There would be fewer copies of the common fields and there would be fewer associated key entries in the corresponding index. See Indexes for more information on indexes.
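As a sketch of the pattern, assuming hypothetical per-minute sensor readings, many small documents can be rolled up into one document per sensor per hour and appended to atomically:

  // Instead of one tiny document per reading ...
  // { sensor_id: 12, ts: ISODate("2024-01-01T10:01:00Z"), temp: 21.3 }

  // ... store one document per sensor per hour and append each
  // reading to an embedded array with an atomic update.
  db.sensors_hourly.updateOne(
    { sensor_id: 12, hour: ISODate("2024-01-01T10:00:00Z") },
    { $push: { readings: { ts: ISODate("2024-01-01T10:01:00Z"), temp: 21.3 } } },
    { upsert: true }
  )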

However, if you often only need to retrieve a subset of the documents within the group, then “rolling-up” the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.

Storage Optimization for Small Documents

Each MongoDB document contains a certain amount of overhead. This overhead is normally insignificant but becomes significant if all documents are just a few bytes, as might be the case if the documents in your collection only have one or two fields.

Consider the following suggestions and strategies for optimizing storage utilization for these collections:

  • Use the _id field explicitly.

MongoDB clients automatically add an _id field to each document and generate a unique 12-byte ObjectId for the _id field. Furthermore, MongoDB always indexes the _id field. For smaller documents this may account for a significant amount of space.

To optimize storage use, users can specify a value for the _id field explicitly when inserting documents into the collection. This strategy allows applications to store a value in the _id field that would have occupied space in another portion of the document.

You can store any value in the _id field, but because this value serves as a primary key for documents in the collection, it must uniquely identify them. If the field’s value is not unique, then it cannot serve as a primary key as there would be collisions in the collection.
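For example, if each document already carries a natural unique value, that value can be stored as the _id (the readings collection and key format here are hypothetical):

  // The natural unique key doubles as the primary key, avoiding
  // both an extra field and an auto-generated ObjectId.
  db.readings.insertOne({ _id: "sensor-12:2024-01-01T10:01:00Z", temp: 21.3 })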

  • Use shorter field names.

Note

Shortening field names reduces expressiveness and does not provide considerable benefit for larger documents and where document overhead is not of significant concern. Shorter field names do not reduce the size of indexes, because indexes have a predefined structure.

In general, it is not necessary to use short field names.

MongoDB stores all field names in every document. For most documents, this represents a small fraction of the space used by a document; however, for small documents the field names may represent a proportionally large amount of space. Consider a collection of small documents that resemble the following:

  { last_name : "Smith", best_score: 3.9 }

If you shorten the field named last_name to lname and the field named best_score to score, as follows, you could save 9 bytes per document.

  { lname : "Smith", score : 3.9 }
  • Embed documents.

In some cases you may want to embed documents in other documents and save on the per-document overhead. See Collection Contains Large Number of Small Documents.

Data Lifecycle Management

Data modeling decisions should take data lifecycle management into consideration.

The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
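A TTL index is created with createIndex and the expireAfterSeconds option; for example, with a hypothetical log_events collection:

  // Documents expire one hour after the value of their createdAt field.
  db.log_events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })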

Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.
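For example, a capped collection can be created with a fixed size (the name and size here are illustrative):

  // A fixed-size collection: once the 1 MB limit is reached,
  // the oldest documents are overwritten in insertion order.
  db.createCollection("recent_events", { capped: true, size: 1048576 })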