This note provides a brief, slightly abridged, introduction to the MarkLogic Data Model.
Documents
The basic unit of organization in MarkLogic is a Document, encoded, for example, in JSON or XML. The set of JSON keys, objects, and arrays, or XML elements and attributes, that you use in your documents is up to you. MarkLogic does not require adherence to any schemas.
MarkLogic also supports documents encoded in binary form (e.g., image files, Word, Excel, PowerPoint, executables, and so on) or as plain text. We refer to this encoding (JSON, XML, text, or binary) as the document's Format.
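To make the two text formats concrete, here is a minimal sketch that encodes one record as both JSON and XML using only the Python standard library. The record, its field names, and the `beer` element name are invented for illustration; they are not taken from MarkLogic or from this note.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration; no schema is required.
record = {"name": "Example Ale", "style": "pale ale", "abv": 5.2}

# JSON encoding of the record.
json_doc = json.dumps(record)

# XML encoding of the same record: one child element per field.
root = ET.Element("beer")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_doc = ET.tostring(root, encoding="unicode")

print(json_doc)
print(xml_doc)
```

Either string could be stored as a document; which keys or elements appear is entirely the author's choice.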
URIs
A document's URI is a key that you choose when you insert a document into the database.
Each document has a unique URI. You use this URI to retrieve or refer to the document later. Typically, document URIs begin with a slash, like /beer.
Organization
How does MarkLogic organize documents in the database? Logically, MarkLogic provides two concepts: Collections and Directories. You can think of collections as unordered sets; if you are familiar with tags, that analogy may help too. A collection can hold multiple documents, and a document can belong to multiple collections.
Directories are similar in concept to directories or folders in file systems. They are hierarchical, and membership is implicit, based on the path syntax of URIs.
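The relationship between URIs, collections, and directories can be sketched with a toy in-memory model. This is not MarkLogic's API: the `TinyStore` class, its method names, and the sample URIs are all hypothetical, chosen only to show that collection membership is explicit while directory membership follows from the URI path.

```python
from collections import defaultdict


class TinyStore:
    """Toy model of URIs, collections, and directories (not MarkLogic's API)."""

    def __init__(self):
        self.docs = {}                       # URI -> document content
        self.collections = defaultdict(set)  # collection name -> set of URIs

    def insert(self, uri, content, collections=()):
        # The URI is the unique key; inserting again replaces the content.
        self.docs[uri] = content
        for name in collections:
            self.collections[name].add(uri)

    def in_collection(self, name):
        # Collection membership is explicit, like a tag on the document.
        return sorted(self.collections.get(name, set()))

    def in_directory(self, directory):
        # Directory membership is implied by the URI path prefix,
        # so documents in nested directories are included as well.
        prefix = directory if directory.endswith("/") else directory + "/"
        return sorted(uri for uri in self.docs if uri.startswith(prefix))


store = TinyStore()
store.insert("/beer/ale.json", {"name": "ale"}, collections=["drinks", "favorites"])
store.insert("/beer/stout.json", {"name": "stout"}, collections=["drinks"])
store.insert("/wine/rioja.json", {"name": "rioja"}, collections=["drinks"])

print(store.in_collection("favorites"))
print(store.in_directory("/beer"))
```

Note that no call ever adds a document to a directory: the two `/beer/...` documents belong to `/beer` purely because of how their URIs are written.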
Under the covers
Under the covers, MarkLogic stores Documents as compressed trees, based on the well-known XPath Data Model. This model is sufficiently featured to represent all sorts of documents, including plain text and JSON. To look a little deeper, take the following XML document:
It is represented in MarkLogic as a tree structure:
The advantage of storing documents in XML format is that you can query the tree structure using XPath expressions such as /doc/title. And with MarkLogic, you can perform searches that are aware of, and can be qualified by, this tree structure.
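As a rough illustration of querying a document tree by path, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document is hypothetical; only the /doc/title path comes from the text above, and this is not how MarkLogic itself evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the /doc/title path comes from the text above.
xml_doc = "<doc><title>Data Model</title><body>Documents are trees.</body></doc>"

root = ET.fromstring(xml_doc)  # root is the <doc> element

# ElementTree paths are relative to the element they are called on,
# so the absolute path /doc/title becomes "title" relative to <doc>.
title = root.find("title")
print(title.text)
```

The same idea carries over to search: because the title is a distinct node in the tree, a query can be restricted to text inside the title rather than anywhere in the document.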
So, how is a text document, like this one, represented?
Well, since every document is a tree, this one is too (albeit a simple one):
Binary documents are also (trivial) trees, like text documents. But in this case, MarkLogic extends the XPath model to include binary nodes. These are special and can only occur as singular children of the document node. For example, the storage of a JPG file would look like this:
MarkLogic stores binary data as is (without additional compression) and also provides a mechanism for storing binary data externally, outside of the database.
What about JSON? When you insert JSON into MarkLogic via the REST or Java APIs, the JSON is converted to an XML representation that is designed to be indexed for efficient search. In general, writing queries or working with JSON documents doesn't require you to know any of the details of this XML representation.
Data Modeling: Documents Are Like Rows
When modeling data for MarkLogic, think of documents more like rows than tables. In other words, if you have a thousand items, model them as a thousand separate documents, not as a single document holding a thousand child elements. This is for two reasons:
- Locks are managed at the document level. A separate document for each item avoids lock contention.
- All index, retrieval, and update actions happen at the document level. Finding, retrieving, or updating a single item therefore works best when each item is in its own document.
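The "documents are like rows" guideline can be sketched as a small transformation that turns a list of items into one JSON document per item, each with its own URI. The `to_documents` helper, the `/items/` directory, and the `id` and `name` fields are hypothetical, and the sketch only prepares the documents; it does not talk to a MarkLogic server.

```python
import json


def to_documents(items, directory="/items/"):
    """Map each item to its own URI -> JSON document entry."""
    return {f"{directory}{item['id']}.json": json.dumps(item) for item in items}


# A thousand hypothetical items, modeled as a thousand separate documents.
items = [{"id": i, "name": f"item-{i}"} for i in range(1000)]
docs = to_documents(items)

print(len(docs))
print(docs["/items/42.json"])
```

With this layout, updating item 42 touches only `/items/42.json`, so it would not contend for a lock with an update to any other item.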
More about Data
Those are the basics. You may find the following helpful as well:
- Updates and Transactions: what's on disk and how concurrent reads and writes work
- Query, Search, and Indexing: how queries work, including details about MarkLogic's indexes