Analytics

"Analytics" is used to describe a class of functionality in MarkLogic that relates to retrieving values and frequencies of values across a large number of documents. With search/query, we're interested in finding documents themselves. With analytics, we're interested in extracting all the unique values that appear within a particular context (such as an XML element or JSON key), as well as the number of times each value occurs. An example of analytics in a MarkLogic application is the message traffic chart on MarkMail.org:

The above chart portrays ranges of email message dates bucketed by month, as well as the number of messages that appear each month. Since MarkMail hosts over 50 million messages, it does not read all of them every time you load the page. Instead, whenever a new document (email message) is loaded into the database, its date is added to a sorted, in-memory list of message dates (values), each associated with a count (frequency). This is achieved through an administrator-defined index called a range index.

A range index is one kind of lexicon. Whenever you want to perform analytics, you need to have a lexicon configured. In addition to range indexes, other lexicons include the URI lexicon and the collection lexicon. Each of these must be explicitly configured in the database.

Retrieve all collection tags

For this example, you need to have the collection lexicon enabled. Fortunately, we already took care of that at the beginning when we set up the database. Run the following command:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"values":{"name":"tag","collection":{"prefix":""}}}}' \
  'http://localhost:8011/v1/config/query/tutorial'

This option exposes the database's collection tags as a set of values we're naming "tag":

JSON
XML

{
  "options": {
    "values": {
      "name": "tag",
      "collection": {
        "prefix": ""
      }
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="tag">
    <collection prefix=""/>
  </values>
</options>

The name of the values option is the name we'll be using when we fetch the values ("tag"). The child of the values option defines the source of those values. In this case, "collection" indicates the collection lexicon as the source.

Now that you've configured the values, retrieve them with a GET request to the /values/name endpoint:
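A minimal Python sketch of that request, using only the standard library. It assumes the values are served at /v1/values/tag on the same host and port as the command above, that the options and format query parameters select the options set and the response format, and that the same digest credentials apply. The names values_url and fetch_values are illustrative, not part of any MarkLogic client.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: same host, port, and credentials as the curl commands above.
BASE = "http://localhost:8011/v1"
USER, PASSWORD = "rest-admin", "x"


def values_url(name, options="tutorial", fmt="json"):
    """Build the URL for a named values configuration."""
    return f"{BASE}/values/{name}?" + urlencode({"options": options, "format": fmt})


def fetch_values(name):
    """GET the values with HTTP digest authentication and decode the JSON body."""
    url = values_url(name)
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, url, USER, PASSWORD)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(passwords)
    )
    with opener.open(url) as response:
        return json.load(response)


# Printing the URL needs no server; fetch_values("tag") needs a running one.
print(values_url("tag"))
```

The same helper would cover the other values names configured in this section by passing a different name.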

The results show all the distinct values of collection tags and their frequency of usage (in other words, how many documents are in each collection):

JSON
XML

{
  "values-response": {
    "metrics": {
      "aggregate-resolution-time": "PT0.000017S",
      "total-time": "PT0.001675S",
      "values-resolution-time": "PT0.000189S"
    },
    "distinct-value": [
      {
        "_value": "mlw2012",
        "frequency": 88
      },
      {
        "_value": "photos",
        "frequency": 140
      },
      {
        "_value": "shakespeare",
        "frequency": 22
      }
    ],
    "type": "xs:string",
    "name": "tag"
  }
}

<values-response name="tag" type="xs:string" xmlns="http://marklogic.com/appservices/search">
  <distinct-value frequency="88">mlw2012</distinct-value>
  <distinct-value frequency="140">photos</distinct-value>
  <distinct-value frequency="22">shakespeare</distinct-value>
  <metrics>
    <values-resolution-time>PT0.000195S</values-resolution-time>
    <aggregate-resolution-time>PT0.000017S</aggregate-resolution-time>
    <total-time>PT0.001873S</total-time>
  </metrics>
</values-response>
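To show how a client might consume the JSON form of this response, here is a small Python sketch that turns the distinct values into a tag-to-count mapping and lists the largest collections first. The response text is copied from the sample above with the metrics omitted.

```python
import json

# The JSON response shown above, trimmed to the fields used here.
response_text = """
{
  "values-response": {
    "name": "tag",
    "type": "xs:string",
    "distinct-value": [
      {"_value": "mlw2012", "frequency": 88},
      {"_value": "photos", "frequency": 140},
      {"_value": "shakespeare", "frequency": 22}
    ]
  }
}
"""

values = json.loads(response_text)["values-response"]["distinct-value"]

# Map each collection tag to the number of documents carrying it.
counts = {entry["_value"]: entry["frequency"] for entry in values}

# Largest collections first.
by_size = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for tag, frequency in by_size:
    print(f"{tag}: {frequency}")
```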

Retrieve all document URIs

This example requires the URI lexicon to be enabled. It's enabled by default since MarkLogic 6, so here too we're ready to go. This example is almost identical to the previous one except that we're choosing a different values name ("uri") and a different values source (the URI lexicon):

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/xml" \
  -d'<options xmlns="http://marklogic.com/appservices/search"><values name="uri"><uri/></values></options>' \
  'http://localhost:8011/v1/config/query/tutorial'

The uri element or property indicates the URI lexicon as the source:

JSON
XML

{
  "options": {
    "values": {
      "name": "uri",
      "uri": null
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="uri">
    <uri/>
  </values>
</options>

Retrieve the values using a GET request:

This will return all the document URIs in the database, as well as how many documents they're each associated with (the frequency). For all the JSON and XML document URIs, the answer of course is just one per document. But you might be surprised to see that each image document URI yields a count of 2. That's because each image document has an associated properties document which shares the same URI.
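The following Python sketch builds the request URL and picks out the URIs shared by more than one document. It assumes the values are served at /v1/values/uri with options and format query parameters, and that the JSON response has the same shape as the collection example. The URIs in the sample list are made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: values are served at /v1/values/<name> on the tutorial's REST instance.
url = "http://localhost:8011/v1/values/uri?" + urlencode(
    {"options": "tutorial", "format": "json"}
)
print(url)

# Hypothetical excerpt of the "distinct-value" list; these URIs are invented.
distinct_values = [
    {"_value": "/example/talk.json", "frequency": 1},
    {"_value": "/example/play.xml", "frequency": 1},
    {"_value": "/example/photo.jpg", "frequency": 2},
]

# A frequency above 1 means the URI is shared by more than one document,
# here an image and its properties document.
shared = [value["_value"] for value in distinct_values if value["frequency"] > 1]
print(shared)
```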

Set up some range indexes

Before we can run the remaining examples in this section, we need to enable some range indexes in our database. Since we have a small number of documents, it won't take long for MarkLogic to re-index everything. At a much larger scale, you'd want to be careful about what indexes you enable and when you enable them. That's why such changes require database administrator access.

We're going to set up the following range indexes:

scalar type	namespace uri	localname
string	empty	SPEAKER
string	empty	affiliation
int	empty	contentRating
unsignedLong	http://marklogic.com/filter	size
string	http://marklogic.com/filter	Exposure_Time

We'll configure these by sending a single PUT request to the Management API, using the /manage/v2/databases/[db name or id]/properties endpoint. The range element indexes we send replace the database's current range element index configuration, so we must send all the desired indexes in one message. To remove an index, we send a PUT request listing all current indexes except the one we want to remove. Here's the command to add the indexes we need:
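Because the request replaces the whole list, removing an index amounts to filtering the current list and sending what remains. A minimal Python sketch of that step follows. The helper without_index is a hypothetical name, the two entries stand in for whatever the database currently has configured, and the sketch identifies an index by namespace and localname only, which is a simplification.

```python
import json

# Two of the tutorial's indexes, standing in for the current configuration.
current_indexes = [
    {"scalar-type": "string", "namespace-uri": "", "localname": "SPEAKER",
     "collation": "http://marklogic.com/collation/",
     "range-value-positions": False, "invalid-values": "reject"},
    {"scalar-type": "int", "namespace-uri": "", "localname": "contentRating",
     "collation": "", "range-value-positions": False,
     "invalid-values": "reject"},
]


def without_index(indexes, namespace_uri, localname):
    """Return the index list minus the one identified by namespace and localname."""
    return [
        index for index in indexes
        if (index["namespace-uri"], index["localname"]) != (namespace_uri, localname)
    ]


# The PUT body carries the complete list that should remain afterwards.
remaining = without_index(current_indexes, "", "contentRating")
payload = json.dumps({"range-element-index": remaining}, indent=2)
print(payload)
```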

curl -X PUT  --anyauth --user admin:admin --header "Content-Type:application/json" \
-d '{"word-positions": true,
     "element-word-positions": true,
     "range-element-index":
    [ { "scalar-type": "string",
        "namespace-uri": "",
        "localname": "SPEAKER",
        "collation": "http://marklogic.com/collation/",
        "range-value-positions": false,
        "invalid-values": "reject"
      }, 
      { "scalar-type": "string",
        "namespace-uri": "",
        "localname": "affiliation",
        "collation": "http://marklogic.com/collation/",
        "range-value-positions": false,
        "invalid-values": "reject"
      }, 
      { "scalar-type": "int",
        "namespace-uri": "",
        "localname": "contentRating",
        "collation": "",
        "range-value-positions": false,
        "invalid-values": "reject"
      }, 
      { "scalar-type": "unsignedLong",
        "namespace-uri": "http://marklogic.com/filter",
        "localname": "size",
        "collation": "",
        "range-value-positions": false,
        "invalid-values": "reject"
      }, 
      { "scalar-type": "string",
        "namespace-uri": "http://marklogic.com/filter",
        "localname": "Exposure_Time",
        "collation": "http://marklogic.com/collation/",
        "range-value-positions": false,
        "invalid-values": "reject"
      }]}' \
http://localhost:8002/manage/v2/databases/TutorialDB/properties

Now that we have the indexes configured, let's put them to use.

Retrieve values of a JSON key

We're now ready to make use of some range indexes. Run the following command:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"values":{"name":"company","range":{"type":"xs:string","collation":"http://marklogic.com/collation/","json-property":"affiliation"},"values-option":["frequency-order"]}}}' \
  'http://localhost:8011/v1/config/query/tutorial'

As with collection and URI values, we start by choosing a name ("company"). This time, instead of "uri" or "collection", we use a "range" field to indicate that a range index is the source of the values. We identify the range index by the name of the JSON key ("affiliation") and the type of the indexed values (string, using the default collation). Here we must make sure that the configuration lines up exactly with the range index that's configured in the database. Otherwise, we'll get an "index not found" error when we try to retrieve the values.
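The fields that have to agree can be sketched as a local check in Python. This is only an illustration of the matching rule described above, not how MarkLogic resolves indexes. The function name matches is hypothetical, and the sketch assumes a json-property option corresponds to a range element index with that localname, as in the index configuration used in this tutorial.

```python
def matches(range_option, index):
    """True when a values "range" option lines up with a range element index."""
    # The option writes types as "xs:string"; the index config uses "string".
    same_type = range_option["type"] == "xs:" + index["scalar-type"]
    # Non-string indexes carry an empty collation, so a missing one compares as "".
    same_collation = range_option.get("collation", "") == index["collation"]
    same_name = range_option["json-property"] == index["localname"]
    return same_type and same_collation and same_name


# The "affiliation" index as configured in the database.
index = {
    "scalar-type": "string",
    "namespace-uri": "",
    "localname": "affiliation",
    "collation": "http://marklogic.com/collation/",
}

# The "range" part of the "company" values option.
option = {
    "type": "xs:string",
    "collation": "http://marklogic.com/collation/",
    "json-property": "affiliation",
}

print(matches(option, index))                       # option and index agree
print(matches({**option, "type": "xs:int"}, index))  # type mismatch
```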

JSON
XML

{
  "options": {
    "values": {
      "name": "company",
      "range": {
        "collation": "http://marklogic.com/collation/",
        "json-property": "affiliation",
        "type": "xs:string"
      },
      "values-option": [
        "frequency-order"
      ]
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="company">
    <range type="xs:string" collation="http://marklogic.com/collation/">
      <json-property>affiliation</json-property>
    </range>
    <values-option>frequency-order</values-option>
  </values>
</options>

The last thing to point out above is that, rather than return the values in alphabetical (collation) order, we want to get them in "frequency order" (using the corresponding "values-option"). In other words, return the most commonly mentioned companies first. That's what the "frequency-order" option lets you do.

Retrieve the values by making a GET request:

Unsurprisingly, you'll see that MarkLogic was the most common company affiliation at the MarkLogic World conference.

Retrieve values of an element

In this example, we'll use an element range index to indicate the source of our "speaker" values:

JSON
XML

{
  "options": {
    "values": {
      "range": {
        "collation": "http://marklogic.com/collation/",
        "type": "xs:string",
        "element": {
          "ns": "",
          "name": "SPEAKER"
        }
      },
      "name": "speaker",
      "values-option": [
        "frequency-order"
      ]
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="speaker">
    <values-option>frequency-order</values-option>
    <range type="xs:string" collation="http://marklogic.com/collation/">
      <element ns="" name="SPEAKER"/>
    </range>
  </values>
</options>

Run the following command to upload the values configuration:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"values":{"name":"speaker","range":{"type":"xs:string","collation":"http://marklogic.com/collation/","element":{"name":"SPEAKER","ns":""}},"values-option":["frequency-order"]}}}' \
  'http://localhost:8011/v1/config/query/tutorial'

To make use of the new "speaker" values, retrieve them using a GET request:

Send the request to see all the unique speakers in the Shakespeare plays, starting with the most garrulous.

Compute aggregates on values

Not only can we retrieve values and their frequencies; we can also perform aggregate math on the server. MarkLogic provides a series of built-in aggregate functions such as avg, max, count, and covariance, as well as the ability to construct user-defined functions (UDFs) in C++ for close-to-the-database computations.

In this example, we're going to access an integer index on the "contentRating" JSON key, exposing it as "rating" values:

JSON
XML

{
  "options": {
    "values": {
      "range": {
        "json-property": "contentRating",
        "type": "xs:int"
      },
      "name": "rating"
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="rating">
    <range type="xs:int">
      <json-property>contentRating</json-property>
    </range>
  </values>
</options>

Run this command to upload the new option:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"values":{"name":"rating","range":{"type":"xs:int","json-property":"contentRating"}}}}' \
  'http://localhost:8011/v1/config/query/tutorial'

This time, in our GET request, we'll also request the mean and median averages by using the aggregate parameter. And to specify that we want the results in descending order (highest ratings first), we can use the direction parameter:
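A Python sketch of how such a URL could be assembled. It assumes the aggregate parameter may be repeated once per function, that the function names are avg and median (the names that appear in the response), and that direction accepts descending.

```python
from urllib.parse import urlencode

# A list of pairs keeps the repeated "aggregate" parameter intact.
params = [
    ("options", "tutorial"),
    ("aggregate", "avg"),
    ("aggregate", "median"),
    ("direction", "descending"),
    ("format", "json"),
]

# Assumption: the "rating" values are served at /v1/values/rating.
url = "http://localhost:8011/v1/values/rating?" + urlencode(params)
print(url)
```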

Fetch the results to see how many conference talks scored 5 stars, how many scored 4 stars, etc.—as well as the mean and median rating for all conference talks:

JSON
XML

"aggregate-result": [
  {
    "_value": "3.71839080459770115",
    "name": "avg"
  },
  {
    "_value": "4",
    "name": "median"
  }
]

<!--...-->
<aggregate-result name="avg">3.71839080459770115</aggregate-result>
<aggregate-result name="median">4</aggregate-result>
<!--...-->
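To make the two aggregates concrete, this Python sketch recomputes them from value and frequency pairs. The counts are the "rating" facet counts listed in the faceted search results later in this section, on the assumption that they cover the same documents as this values request.

```python
from statistics import median

# Rating -> number of talks, from the "rating" facet counts later in this section.
frequencies = {5: 61, 4: 54, 3: 34, 2: 11, 1: 2, 0: 12}

total = sum(frequencies.values())

# Weighted mean: each rating counts once per talk that received it.
mean = sum(rating * count for rating, count in frequencies.items()) / total

# Expand to one entry per talk, then take the middle value.
ratings = [rating for rating, count in frequencies.items() for _ in range(count)]
middle = median(ratings)

print(total, mean, middle)
```

With these counts the mean works out to 647 / 174, which agrees with the avg result above, and the median is 4.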

Constrain the values returned using a query

This example starts to hint at the real power of MarkLogic: combining analytics with search. Rather than retrieve all the values of a given key, we're going to retrieve only the values from documents meeting a certain criterion. In this case, we'll get all the ratings for conference talks given by employees of a certain organization. To configure this, we first need to supply a "rating" values option, backed by the "contentRating" key index:

JSON
XML

{
  "options": {
    "values": {
      "range": {
        "json-property": "contentRating",
        "type": "xs:int"
      },
      "name": "rating"
    }
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <values name="rating">
    <range type="xs:int">
      <json-property>contentRating</json-property>
    </range>
  </values>
</options>

Run the following command to upload the "rating" values option:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"values":{"name":"rating","range":{"type":"xs:int","json-property":"contentRating"}}}}' \
  'http://localhost:8011/v1/config/query/tutorial'

Now, when you retrieve the values, restrict the values to come only from those documents matching a particular query by using the q parameter:
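A short Python sketch of the request URL, assuming the values are served at /v1/values/rating and that q takes a string query. Note that the colon in the query is percent-encoded in the URL and decodes back to the original text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

query = urlencode(
    {"options": "tutorial", "q": "company:marklogic", "format": "json"}
)
url = "http://localhost:8011/v1/values/rating?" + query
print(url)

# Decoding the URL recovers the string query exactly as written.
decoded_q = parse_qs(urlsplit(url).query)["q"]
print(decoded_q)
```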

Send the request to see the ratings of all talks given by MarkLogic employees (documents matching the "company:marklogic" string query).

Retrieving tuples of values (co-occurrences)

In addition to retrieving values from a single source, you can also retrieve co-occurrences of values from two different value sources. In other words, you can perform analytics on multi-dimensional data sets. The following JSON document configures tuples (named "size-exposure") backed by two different range indexes. In particular, it will enable you to get all the unique pairings of photo size and exposure time in image metadata:

JSON
XML

{
  "options": {
    "tuples": [
      {
        "name": "size-exposure",
        "range": [
          {
            "type": "xs:unsignedLong",
            "element": {
              "ns": "http://marklogic.com/filter",
              "name": "size"
            }
          },
          {
            "collation": "http://marklogic.com/collation/",
            "type": "xs:string",
            "element": {
              "ns": "http://marklogic.com/filter",
              "name": "Exposure_Time"
            }
          }
        ]
      }
    ]
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <tuples name="size-exposure">
    <range type="xs:unsignedLong">
      <element ns="http://marklogic.com/filter" name="size"/>
    </range>
    <range type="xs:string" collation="http://marklogic.com/collation/">
      <element ns="http://marklogic.com/filter" name="Exposure_Time"/>
    </range>
  </tuples>
</options>

Run the following command to upload the options:

curl -v -X POST \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"tuples":[{"name":"size-exposure","range":[{"type":"xs:unsignedLong","element":{"name":"size","ns":"http://marklogic.com/filter"}}, {"type":"xs:string","collation":"http://marklogic.com/collation/", "element":{"name":"Exposure_Time","ns":"http://marklogic.com/filter"} } ]}]}}' \
  'http://localhost:8011/v1/config/query/tutorial'

To view the tuples, make a GET request:

The results include unique pairings of distinct values:

JSON
XML

"tuple": [
  {
    "distinct-value": [
      {
        "_value": "60641",
        "type": "xs:unsignedLong"
      },
      {
        "_value": "1/100",
        "type": "xs:string"
      }
    ],
    "frequency": 1
  }
]

<!--...-->
<tuple frequency="1">
  <distinct-value xsi:type="xs:unsignedLong" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">60641</distinct-value>
  <distinct-value xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">1/100</distinct-value>
</tuple>
<!--...-->
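A Python sketch of reading the JSON form of these results into typed pairs. The excerpt is the one shown above, wrapped in braces so it parses on its own. The request URL in the comment is an assumption based on the values endpoint used earlier, and decode is an illustrative helper.

```python
import json

# Assumed request:
# GET http://localhost:8011/v1/values/size-exposure?options=tutorial&format=json

# The tuple excerpt shown above, wrapped in braces so it parses as a JSON object.
excerpt = """
{
  "tuple": [
    {
      "distinct-value": [
        {"_value": "60641", "type": "xs:unsignedLong"},
        {"_value": "1/100", "type": "xs:string"}
      ],
      "frequency": 1
    }
  ]
}
"""


def decode(value):
    """Convert a typed distinct value to a Python value."""
    if value["type"] == "xs:unsignedLong":
        return int(value["_value"])
    return value["_value"]


# One ((size, exposure), frequency) entry per co-occurrence.
pairs = [
    (tuple(decode(value) for value in entry["distinct-value"]), entry["frequency"])
    for entry in json.loads(excerpt)["tuple"]
]
print(pairs)
```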

As mentioned earlier, MarkLogic's real power lies in the combination of search and analytics. A couple of examples ago we saw how a query could be used to constrain a values retrieval. What we haven't seen yet is how the /search endpoint can also return lists of values (called "facet values") along with its search results. These facets can then be used to interactively explore your data. In this case, we're not calling /values at all, just /search.

But before we can run a faceted search, we need to define one or more constraints that are backed by a lexicon or range index. The following options configure two range-index-backed constraints:

JSON
XML

{
  "options": {
    "constraint": [
      {
        "range": {
          "json-property": "contentRating",
          "type": "xs:int",
          "facet-option": "descending"
        },
        "name": "rating"
      },
      {
        "range": {
          "collation": "http://marklogic.com/collation/",
          "json-property": "affiliation",
          "type": "xs:string",
          "facet-option": "frequency-order"
        },
        "name": "company"
      }
    ]
  }
}

<options xmlns="http://marklogic.com/appservices/search">
  <!-- expose the "contentRating" JSON key range index as "rating" values -->
  <constraint name="rating">
    <range type="xs:int" facet="true">
      <json-property>contentRating</json-property>
      <!-- highest ratings first -->
      <facet-option>descending</facet-option>
    </range>
  </constraint>
  <!-- expose the "affiliation" JSON key range index as "company" values -->
  <constraint name="company">
    <range type="xs:string" facet="true" collation="http://marklogic.com/collation/">
      <json-property>affiliation</json-property>
      <!-- most common values first -->
      <facet-option>frequency-order</facet-option>
    </range>
  </constraint>
</options>

For this example, we'll use a different options set instead of the "tutorial" options. Run the following command to upload the new options set ("tutorial2"):

curl -v -X PUT \
  --digest --user rest-admin:x \
  -H "Content-type: application/json" \
  -d'{"options":{"constraint":[{"name":"rating","range":{"type":"xs:int","json-property":"contentRating","facet-option":"descending"}}, {"name":"company","range":{"type":"xs:string","collation":"http://marklogic.com/collation/","json-property":"affiliation","facet-option":"frequency-order"}} ]}}' \
  'http://localhost:8011/v1/config/query/tutorial2'

GET the options to verify they've been uploaded:
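A Python sketch of that check. It assumes the options can be read back from the same /v1/config/query/tutorial2 path with a format parameter, and that the JSON returned mirrors what was uploaded. The local document below is the payload from the PUT command, used in place of a live response, and constraint_names is an illustrative helper.

```python
import json
from urllib.parse import urlencode

# Assumption: the options set is readable at the path it was uploaded to.
url = "http://localhost:8011/v1/config/query/tutorial2?" + urlencode({"format": "json"})
print(url)

# The payload from the PUT command above, standing in for the server's response.
uploaded = json.loads("""
{"options": {"constraint": [
  {"name": "rating",
   "range": {"type": "xs:int", "json-property": "contentRating",
             "facet-option": "descending"}},
  {"name": "company",
   "range": {"type": "xs:string",
             "collation": "http://marklogic.com/collation/",
             "json-property": "affiliation",
             "facet-option": "frequency-order"}}
]}}
""")


def constraint_names(options_doc):
    """List the constraint names defined in an options document."""
    return [constraint["name"] for constraint in options_doc["options"]["constraint"]]


print(constraint_names(uploaded))
```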

The above configuration makes the "rating" and "company" constraints available for users to type in their query search string. You may be thinking "Isn't that only going to be useful for power users? Most users aren't going to bother learning a search grammar." That's true, but with a UI that supports faceted navigation, they won't need to. All they'll have to do is click a link to get the results constrained by a particular value. For example, the screenshot below from MarkMail shows four facets: month, list, sender, and attachment type:

Each of these is a facet, whose values are retrieved from a range index. Moreover, users can drill down and pick various combinations of facets simply by clicking a link, or in the case of the histogram, swiping their mouse pointer.

MarkLogic's REST API gives you everything you need to construct a model for faceted navigation. We're not building any UI in this tutorial, but we can simulate faceted search by trying out different links representing a series of searches a user might make.

Find all conference talks (and list all facets):

Find and list facets for only the talks given by MarkLogic employees:

Find and list facets for MarkLogic talks that garnered a 5-star rating:

Find talks mentioning "java" that were rated 4 or higher:
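The four searches above could be expressed as URLs along these lines. This Python sketch rests on several assumptions: that /v1/search accepts q, options, and format parameters, that omitting q matches every document, that constraint:value terms and the AND and GE operators are available in the string query grammar in use, and that the query strings shown are reasonable renderings of each search rather than the exact ones intended.

```python
from urllib.parse import urlencode

SEARCH = "http://localhost:8011/v1/search"

# Label -> assumed string query for each search listed above.
searches = {
    "all talks": "",
    "MarkLogic talks": "company:marklogic",
    "five-star MarkLogic talks": "company:marklogic AND rating:5",
    "java talks rated 4 or higher": "java AND rating GE 4",
}


def search_url(q):
    """Build a search URL against the "tutorial2" options; omit q when empty."""
    params = {"options": "tutorial2", "format": "json"}
    if q:
        params["q"] = q
    return SEARCH + "?" + urlencode(params)


for label, q in searches.items():
    print(f"{label}: {search_url(q)}")
```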

In addition to the normal search results listing documents and their matching snippets, the results of a faceted search include lists of facets:

JSON
XML

  "facets": {
    "rating": {
      "facetValues": [
        {
          "count": 61,
          "name": "5"
        },
        {
          "count": 54,
          "name": "4"
        },
        {
          "count": 34,
          "name": "3"
        },
        {
          "count": 11,
          "name": "2"
        },
        {
          "count": 2,
          "name": "1"
        },
        {
          "count": 12,
          "name": "0"
        }
      ],
      "type": "xs:int"
    },
    "company": {
      "facetValues": [
        {
          "count": 38,
          "name": "MarkLogic"
        },
        {
          "count": 2,
          "name": "Avalon Consulting, LLC"
        },
        {
          "count": 2,
          "name": "Overstory, Ltd."
        }
      ],
      "type": "xs:string"
    }
  }

<search:facet name="rating" type="xs:int">
  <search:facet-value name="5" count="61">5</search:facet-value>
  <search:facet-value name="4" count="54">4</search:facet-value>
  <search:facet-value name="3" count="34">3</search:facet-value>
  <search:facet-value name="2" count="11">2</search:facet-value>
  <search:facet-value name="1" count="2">1</search:facet-value>
  <search:facet-value name="0" count="12">0</search:facet-value>
</search:facet>
<search:facet name="company" type="xs:string">
  <search:facet-value name="MarkLogic" count="38">MarkLogic</search:facet-value>
  <search:facet-value name="Avalon Consulting, LLC" count="2">Avalon Consulting, LLC</search:facet-value>
  <search:facet-value name="Overstory, Ltd." count="2">Overstory, Ltd.</search:facet-value>
  <!--...-->
</search:facet>

These values can be used to drive a faceted navigation UI. We saw earlier how the results structure maps to the search results on this website. Now we can see how it maps to facet results. One facet ("Category") is represented by a <facet> element (or a named property of the "facets" object in JSON):

And its values are modeled by <facet-value> elements (or the facetValues array in JSON):

When a user clicks on one of these values, it takes them to a new automatically constrained search results page. For example, if they click "Blog posts," it will re-run their search with the additional constraint "category:blog".
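The click-to-constrain step can be sketched as a small string operation in Python. The function add_constraint is a hypothetical helper. It assumes that whitespace-separated terms are combined as a conjunction and that double quotes group a multi-word value, and it does not handle values that themselves contain double quotes.

```python
def add_constraint(query, facet, value):
    """Append a facet:value term to a string query, quoting values with spaces."""
    term = f'{facet}:"{value}"' if " " in value else f"{facet}:{value}"
    return f"{query} {term}".strip()


# Clicking the "5" value of the "rating" facet while searching for "java".
print(add_constraint("java", "rating", "5"))

# Clicking a company whose name contains spaces, starting from an empty search.
print(add_constraint("", "company", "Avalon Consulting, LLC"))
```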
