Analytics

"Analytics" is used to describe a class of functionality in MarkLogic that relates to retrieving values and frequencies of values across a large number of documents. With search/query, we're interested in finding documents themselves. With analytics, we're interested in extracting all the unique values that appear within a particular context (such as an XML element or JSON key), as well as the number of times each value occurs. An example of analytics in a MarkLogic application is the message traffic chart on MarkMail.org:

[Screenshot: the MarkMail.org home page, showing the message traffic chart]

The above chart portrays ranges of email message dates bucketed by month, as well as the number of messages that appear each month. Since MarkMail hosts over 50 million messages, it does not read through all of them each time you load the page. Instead, whenever a new document (email message) is loaded into the database, its date is added to a sorted, in-memory list of message dates (values), each associated with a count (frequency). This is achieved through an administrator-defined index (called a range index).

A range index is one kind of lexicon. Whenever you want to perform analytics, you need to have a lexicon configured. In addition to range indexes, other lexicons include the URI lexicon and the collection lexicon. Each of these must be explicitly configured in the database.

Retrieve all collection tags

For this example, you need to have the collection lexicon enabled. Fortunately, we already took care of that at the beginning when we set up the database. Run the following command:

This option exposes the database's collection tags as a set of values we're naming "tag":

The name of the values option is the name we'll be using when we fetch the values ("tag"). The child of the values option defines the source of those values. In this case, "collection" indicates the collection lexicon as the source.
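As a rough sketch of what such an upload could look like, the Python below builds the "tag" values option and prepares the HTTP request without sending it. The host and port (localhost:8011), the /v1/config/query/tutorial/values path, and the use of null to express the empty collection source are assumptions rather than details taken from this tutorial, so adjust them to match your REST instance and MarkLogic version.

```python
import json
import urllib.request

# Assumed location of the REST instance and name of the options set.
BASE_URL = "http://localhost:8011"
OPTIONS_NAME = "tutorial"

# A values option named "tag" whose source is the collection lexicon.
# Expressing the empty source as null (None) is an assumption.
tag_values = {"values": {"name": "tag", "collection": None}}

# Prepare the request only; nothing is sent over the network here.
request = urllib.request.Request(
    url=f"{BASE_URL}/v1/config/query/{OPTIONS_NAME}/values",
    data=json.dumps(tag_values).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Actually sending the request would also require credentials; urllib.request.HTTPDigestAuthHandler is one way to supply them if your server uses digest authentication.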

Now that you've configured the values, retrieve them using a GET request using the /values/name endpoint:
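The retrieval URL can be sketched as follows. The helper name values_url, the localhost:8011 address, and the /v1 path prefix are assumptions made for illustration; "tutorial" is the options set used elsewhere in this tutorial.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8011"  # assumed REST instance address


def values_url(name, **params):
    """Build a /v1/values/{name} URL with optional query parameters."""
    url = f"{BASE_URL}/v1/values/{quote(name)}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Request the "tag" values defined in the "tutorial" options, as JSON.
print(values_url("tag", options="tutorial", format="json"))
```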

The results show all the distinct values of collection tags and their frequency of usage (in other words, how many documents are in each collection):

Retrieve all document URIs

This example requires the URI lexicon to be enabled. It's enabled by default since MarkLogic 6, so here too we're ready to go. This example is almost identical to the previous one except that we're choosing a different values name ("uri") and a different values source (the URI lexicon):

The uri element or property indicates the URI lexicon as the source:

Retrieve the values using a GET request:

This will return all the document URIs in the database, as well as how many documents they're each associated with (the frequency). For all the JSON and XML document URIs, the answer of course is just one per document. But you might be surprised to see that each image document URI yields a count of 2. That's because each image document has an associated properties document which shares the same URI.

Set up some range indexes

Before we can run the remaining examples in this section, we need to enable some range indexes in our database. Since we have a small number of documents, it won't take long for MarkLogic to re-index everything. At a much larger scale, you'd want to be careful about what indexes you enable and when you enable them. That's why such changes require database administrator access.

We're going to set up the following range indexes:

scalar type     namespace uri                  localname
string          (empty)                        SPEAKER
string          (empty)                        affiliation
int             (empty)                        contentRating
unsignedLong    http://marklogic.com/filter    size
string          http://marklogic.com/filter    Exposure_Time

We'll configure each of these by sending a PUT request to the Management API, using the /manage/v2/databases/[db name or id]/properties endpoint. When we send information about range element indexes, what we send will replace the current range element index configuration. That means we will send all the desired indexes in one message. To remove an index, we simply send a PUT request with all current indexes except the one we want to remove. Here's the command to add the indexes we need:
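A minimal sketch of such a request body, assuming the Management API listens on localhost:8002 and the database is called "TutorialDB" (a hypothetical name). The five indexes listed above are turned into one "range-element-index" array so that they are sent together. The "range-value-positions" and "invalid-values" settings and the collation URI are assumed defaults, not values given in this tutorial. The request is prepared but not sent.

```python
import json
import urllib.request

MANAGE_URL = "http://localhost:8002"  # assumed Management API address
DATABASE = "TutorialDB"               # hypothetical database name

# (scalar type, namespace URI, localname) for each desired index.
INDEX_SPECS = [
    ("string", "", "SPEAKER"),
    ("string", "", "affiliation"),
    ("int", "", "contentRating"),
    ("unsignedLong", "http://marklogic.com/filter", "size"),
    ("string", "http://marklogic.com/filter", "Exposure_Time"),
]


def range_index(scalar_type, namespace_uri, localname):
    """Describe one element range index; extra settings are assumed defaults."""
    index = {
        "scalar-type": scalar_type,
        "namespace-uri": namespace_uri,
        "localname": localname,
        "range-value-positions": False,
        "invalid-values": "reject",
    }
    if scalar_type == "string":
        # Only string indexes carry a collation; this is the default one.
        index["collation"] = "http://marklogic.com/collation/"
    return index


# All indexes go in a single message, since the PUT replaces the whole list.
payload = {"range-element-index": [range_index(*spec) for spec in INDEX_SPECS]}

request = urllib.request.Request(
    url=f"{MANAGE_URL}/manage/v2/databases/{DATABASE}/properties",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
```

Removing an index later would amount to rebuilding the payload from INDEX_SPECS minus the unwanted entry and sending the PUT again.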

Now that we have the indexes configured, let's jump back over to the command line.

Retrieve values of a JSON key

We're now ready to make use of some range indexes. Run the following command:

As with collection and URI values, we start by choosing a name ("company"). This time, instead of "uri" or "collection", we use a "range" field to indicate that a range index is the source of the values. We identify the range index by the name of the JSON key ("affiliation") and the type of the indexed values (string, using the default collation). Here we must make sure that the configuration lines up exactly with the range index that's configured in the database. Otherwise, we'll get an "index not found" error when we try to retrieve the values.

The last thing to point out is that, rather than returning the values in alphabetical (collation) order, we ask for them in frequency order by including "frequency-order" as a "values-option". In other words, the most commonly mentioned companies are returned first.
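The "company" values option described here might look roughly like the structure below. The property name "json-property" is an assumption (older MarkLogic releases used "json-key"), as is the exact JSON shape; the collation URI shown is MarkLogic's default collation.

```python
import json

# A sketch of the "company" values option, assuming this JSON shape.
company_values = {
    "values": {
        "name": "company",
        "range": {
            "type": "xs:string",
            # Assumed property name; older releases used "json-key".
            "json-property": "affiliation",
            # Default collation; must match the configured range index.
            "collation": "http://marklogic.com/collation/",
        },
        # Return the most frequent values first.
        "values-option": ["frequency-order"],
    }
}

print(json.dumps(company_values, indent=2))
```

If the type or collation here differs from the range index configured in the database, the lookup fails, which is why the two must line up exactly.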

Retrieve the values by making a GET request:

Unsurprisingly, you'll see that MarkLogic was the most common company affiliation at the MarkLogic World conference.

Retrieve values of an element

In this example, we'll use an element range index to indicate the source of our "speaker" values:

Run the following command to upload the values configuration:

To make use of the new "speaker" values, retrieve them using a GET request:
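For comparison with the JSON key case, here is a hedged sketch of a "speaker" values option backed by an element range index, together with the URL used to fetch it. The "element" structure with "ns" and "name", the localhost:8011 address, and the /v1 prefix are assumptions; the SPEAKER element and its empty namespace come from the index list earlier in this section.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:8011"  # assumed REST instance address

# Values option sourced from the SPEAKER element range index.
speaker_values = {
    "values": {
        "name": "speaker",
        "range": {
            "type": "xs:string",
            # An element in no namespace; this JSON shape is an assumption.
            "element": {"ns": "", "name": "SPEAKER"},
            "collation": "http://marklogic.com/collation/",
        },
        "values-option": ["frequency-order"],
    }
}

# URL for retrieving the values once the option has been uploaded.
speaker_url = f"{BASE_URL}/v1/values/speaker?{urlencode({'options': 'tutorial'})}"

print(json.dumps(speaker_values, indent=2))
print(speaker_url)
```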

Run the request to see all the unique speakers in the Shakespeare plays, starting with the most garrulous.

Compute aggregates on values

Not only can we retrieve values and their frequencies; we can also perform aggregate math on the server. MarkLogic provides a series of built-in aggregate functions such as avg, max, count, and covariance, as well as the ability to construct user-defined functions (UDFs) in C++ for close-to-the-database computations.

In this example, we're going to access an integer index on the "contentRating" JSON key, exposing it as "rating" values:

Run this command to upload the new option:

This time, in our GET request, we'll also request the mean and median averages by using the aggregate parameter. And to specify that we want the results in descending order (highest ratings first), we can use the direction parameter:
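A sketch of how those parameters could be combined into one URL. It assumes the mean is requested with the built-in "avg" aggregate, that the aggregate parameter may be repeated to ask for more than one function, and that the REST instance is at localhost:8011.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8011"  # assumed REST instance address

params = {
    "options": "tutorial",
    # Repeating "aggregate" for each function is an assumption.
    "aggregate": ["avg", "median"],
    # Highest ratings first.
    "direction": "descending",
}

# doseq=True expands the list into repeated aggregate=... parameters.
rating_url = f"{BASE_URL}/v1/values/rating?{urlencode(params, doseq=True)}"
print(rating_url)
```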

Fetch the results to see how many conference talks scored 5 stars, how many scored 4 stars, etc.—as well as the mean and median rating for all conference talks:

Constrain the values returned using a query

This example starts to hint at the real power of MarkLogic: combining analytics with search. Rather than retrieve all the values of a given key, we're going to retrieve only the values from documents meeting a certain criterion. In this case, we'll get all the ratings for conference talks given by employees of a certain organization. To configure this, we first need to supply a "rating" values option, backed by the "contentRating" key index:

Run the following command to upload the "rating" values option:

Now, when you retrieve the values, restrict the values to come only from those documents matching a particular query by using the q parameter:
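A short sketch of such a constrained request, assuming the REST instance is at localhost:8011 and that the "tutorial" options define a "company" constraint for the string query to use. Note that the colon in the query must be URL-encoded, which urlencode handles.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://localhost:8011"  # assumed REST instance address

params = {
    "options": "tutorial",
    # Only consider documents matching this string query.
    "q": "company:marklogic",
}

constrained_url = f"{BASE_URL}/v1/values/rating?{urlencode(params)}"
print(constrained_url)

# Decoding the query string recovers the original query text.
print(parse_qs(urlsplit(constrained_url).query)["q"][0])
```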

Run the request to see the ratings of all talks given by MarkLogic employees (documents matching the "company:marklogic" string query).

Retrieving tuples of values (co-occurrences)

In addition to retrieving values from a single source, you can also retrieve co-occurrences of values from two different value sources. In other words, you can perform analytics on multi-dimensional data sets. The following JSON document configures tuples (named "size-exposure") backed by two different range indexes. In particular, it will enable you to get all the unique pairings of photo size and exposure time in image metadata:
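A configuration along those lines might be sketched as below. The exact JSON shape (a "range" array with "element" entries using "ns" and "name") is an assumption; the element names, namespace, and scalar types come from the range indexes set up earlier in this section.

```python
import json

FILTER_NS = "http://marklogic.com/filter"

# Tuples named "size-exposure", pairing values from two range indexes.
size_exposure = {
    "tuples": {
        "name": "size-exposure",
        "range": [
            {
                "type": "xs:unsignedLong",
                "element": {"ns": FILTER_NS, "name": "size"},
            },
            {
                "type": "xs:string",
                "element": {"ns": FILTER_NS, "name": "Exposure_Time"},
                # Default collation for the string index (assumed).
                "collation": "http://marklogic.com/collation/",
            },
        ],
    }
}

print(json.dumps(size_exposure, indent=2))
```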

Run the following command to upload the options:

To view the tuples, make a GET request:

The results include unique pairings of distinct values:

Searching with facets

As mentioned earlier, MarkLogic's real power lies in the combination of search and analytics. A couple of examples ago we saw how a query could be used to constrain a values retrieval. What we haven't seen yet is how the /search endpoint can also return lists of values (called "facet values") along with its search results. These facets can then be used to interactively explore your data. In this case, we're not calling /values at all, just /search.

But before we can run a faceted search, we need to define one or more constraints that are backed by a lexicon or range index. The following XML options configure two range-index-backed constraints:
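One way to generate such an options document is sketched below with Python's standard ElementTree module. The helper range_constraint is hypothetical. The sketch assumes that the "rating" and "company" constraints are backed by the contentRating and affiliation indexes and that the child element is named json-property (older releases used json-key). The facet="true" attribute asks for facet values to be returned for each constraint.

```python
import xml.etree.ElementTree as ET

SEARCH_NS = "http://marklogic.com/appservices/search"
ET.register_namespace("search", SEARCH_NS)


def range_constraint(parent, name, scalar_type, json_property, collation=None):
    """Append a facet-enabled, range-index-backed constraint to parent."""
    constraint = ET.SubElement(parent, f"{{{SEARCH_NS}}}constraint", {"name": name})
    attrs = {"type": scalar_type, "facet": "true"}
    if collation:
        attrs["collation"] = collation
    range_el = ET.SubElement(constraint, f"{{{SEARCH_NS}}}range", attrs)
    # Assumed element name; older releases used json-key instead.
    ET.SubElement(range_el, f"{{{SEARCH_NS}}}json-property").text = json_property
    return constraint


options = ET.Element(f"{{{SEARCH_NS}}}options")
range_constraint(options, "rating", "xs:int", "contentRating")
range_constraint(options, "company", "xs:string", "affiliation",
                 collation="http://marklogic.com/collation/")

print(ET.tostring(options, encoding="unicode"))
```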

For this example, we'll use a different options set instead of the "tutorial" options. Run the following command to upload the new options set ("tutorial2"):

GET the options to verify they've been uploaded:

The above configuration makes the "rating" and "company" constraints available for users to type into their search query string. You may be thinking "Isn't that only going to be useful for power users? Most users aren't going to bother learning a search grammar." That's true, but with a UI that supports faceted navigation, they won't need to. All they'll have to do is click a link to get the results constrained by a particular value. For example, the screenshot below from MarkMail shows four facets: month, list, sender, and attachment type:

[Screenshot: MarkMail search page showing a messages-per-month histogram and facet lists for mailing list, sender, and attachment type, each value with its message count]

Each of these is a facet, whose values are retrieved from a range index. Moreover, users can drill down and pick various combinations of facets simply by clicking a link, or in the case of the histogram, swiping their mouse pointer.

MarkLogic's REST API gives you everything you need to construct a model for faceted navigation. We're not building any UI in this tutorial, but we can simulate faceted search by trying out different links representing a series of searches a user might make.

Find all conference talks (and list all facets):

Find and list facets for only the talks given by MarkLogic employees:

Find and list facets for MarkLogic talks that garnered a 5-star rating:

Find talks mentioning "java" that were rated 4 or higher:
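The four searches above could be expressed as /search URLs along these lines. The query strings are assumptions based on the "company" and "rating" constraints and the default search grammar (including the GE comparison and an empty query to match everything); the localhost:8011 address and /v1 prefix are likewise assumed. "tutorial2" is the options set uploaded for this example.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8011"  # assumed REST instance address

# Hypothetical query strings for each step of the simulated session.
QUERIES = {
    "all talks": "",
    "MarkLogic talks": "company:marklogic",
    "5-star MarkLogic talks": "company:marklogic AND rating:5",
    "java talks rated 4+": "java AND rating GE 4",
}


def search_url(query, options="tutorial2"):
    """Build a /v1/search URL for a string query and an options set."""
    return f"{BASE_URL}/v1/search?{urlencode({'q': query, 'options': options})}"


for label, query in QUERIES.items():
    print(f"{label}: {search_url(query)}")
```

Each URL differs only in its q parameter, which is what a faceted navigation UI would rewrite when a user clicks a facet value.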

In addition to the normal search results listing documents and their matching snippets, the results of a faceted search include lists of facets:

These values can be used to drive a faceted navigation UI. We saw earlier how the results structure maps to the search results on this website. Now we can see how it maps to facet results. One facet ("Category") is represented by a <facet> element (or the "facets" array in JSON):

[Screenshot: the "Categories" facet on the search results page, labeled as the <facet> element]

And its values are modeled by <facet-result> elements (or the facet-results array in JSON):

[Screenshot: the "Categories" facet values (All categories [87], Function pages [50], User guides [12], Blog posts [11]), labeled as <facet-result> elements]

When a user clicks on one of these values, it takes them to a new automatically constrained search results page. For example, if they click "Blog posts," it will re-run their search with the additional constraint "category:blog".

Custom search
