Loading Data

In this exercise, you'll load some triples for use in future exercises.

Prerequisites

This exercise assumes you have mlcp installed and a relatively clean MarkLogic server from which to start. (We assume no server port or name conflicts with those used here). The loading scripts also assume you have the mlcp bin directory in your system PATH environment variable.

Specifically, in this exercise, you will create:

A content database called tutsem-content, with its triple index enabled.
A modules database called tutsem-modules.
A HTTP REST instance called tutsem-rest on port 9910

Create the Database and App servers

MarkLogic provides several ways to create and configure Databases and App servers. For this exercise, we will use the Management API.

Open a text editor and save this JSON as tutsem-server.json. If you already have something running on port 9910, change it in this file (and remember to update the port through the rest of this tutorial): { "rest-api": { "name": "tutsem-rest", "database": "tutsem-content", "modules-database": "tutsem-modules", "port": "9910", "xdbc-enabled": true } }

Save this one as tutsem-content.json: { "collection-lexicon": "true", "triple-index": "true" }

Use curl to send tutsem-server.json to the Manage API. This instructs MarkLogic to create the application server, content database, and modules database: curl --anyauth --user admin:admin -X POST \ --data-binary @tutsem-server.json \ -H "Content-type: application/json" \ http://localhost:8002/v1/rest-apis

Using MarkLogic 8, you can do the same for tutsem-content.json to turn on the triples index: curl --anyauth --user admin:admin -X PUT \ --data-binary @tutsem-content.json \ -H "Content-type: application/json" \ http://localhost:8002/manage/v2/databases/tutsem-content/properties Using MarkLogic 9+, the triple index is turned on by default for new databases. For an alternative to running the curl command, point your browser to the Admin UI (http://localhost:8001) and click on the tutsem-content database (under Databases). Find "triple index" and click the radio button to set it to true; do the same to set "collection lexicon" to true, then scroll back up and click the "ok" button.

Load data from Hello World

You can manually load the 3 triples from our Hello World exercise to your new tutsem-content database or you can use the provided script: load-livesIn.bat (or load-livesIn.sh) as follows.

(NB: the following instructions apply to each of the loading scripts referenced below as well):

If you haven't yet, download and install mlcp and add the mlcp bin sub-directory to your operating system PATH environment variable.
Download the entire semantics-exercises.zip and unzip it.
In a shell, change directories to the load-scripts directory that was inside the zip
If you are on Windows, edit each .bat script and update the admin username and password as needed. If you are on Linux/OSX, you can set the MLUSER and MLPASS environment variables read by the shell scripts, or you can simply edit the scripts.
If you are not running on localhost, you will also need to edit the hostname in the URL in the script.
Run the appropriate script for your operating system.

After running load-livesIn.bat or load-livesIn.sh you'll see output from mlcp like: 13/11/26 15:24:08 INFO contentpump.ContentPump: Hadoop library version: 1.2.0 13/11/26 15:24:08 INFO contentpump.LocalJobRunner: Content type: XML 13/11/26 15:24:08 INFO input.FileInputFormat: Total input paths to process : 1 13/11/26 15:24:10 INFO contentpump.LocalJobRunner: completed 100% 13/11/26 15:24:10 INFO contentpump.LocalJobRunner: com.marklogic.contentpump.ContentPumpStats: 13/11/26 15:24:10 INFO contentpump.LocalJobRunner: ATTEMPTED_INPUT_RECORD_COUNT: 1 13/11/26 15:24:10 INFO contentpump.LocalJobRunner: SKIPPED_INPUT_RECORD_COUNT: 0 13/11/26 15:24:10 INFO contentpump.LocalJobRunner: Total execution time: 1 sec

Load Triples from DBPedia

We have provided a collection of 60k triples taken from DBpedia 3.8, available under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. DBPedia is a crowd-sourced, community effort to extract structured information from Wikipedia.

This provided collection includes 10k triples each from:

Ontology Infobox types	http://downloads.dbpedia.org/3.8/en/instance_types_en.nt.bz2
Ontology Infobox properties	http://downloads.dbpedia.org/3.8/en/mappingbased_properties_en.nt.bz2
Ontology Infobox properties (specific)	http://downloads.dbpedia.org/3.8/en/specific_mappingbased_properties_en.nt.bz2
Short Abstracts	http://downloads.dbpedia.org/3.8/en/short_abstracts_en.nt.bz2
Geographic Coordinates	http://downloads.dbpedia.org/3.8/en/geo_coordinates_en.nt.bz2
Persondata	http://downloads.dbpedia.org/3.8/en/persondata_en.nt.bz2

To load the data, run the provided load-dbpedia.bat or load-dbpedia.sh script. See above for how to run the script.

What have you got so far?

When you load RDF triples into MarkLogic, the triples are stored in MarkLogic-managed XML documents. Below are some questions you can answer by examining the database. You can import the Query Console Workspace, ts-loading-data.xml (available in the qc-workspaces directory within the unarchived semantics-exercises.zip as well.) After importing the workspace, for MarkLogic 8, set the 'Content Source' for each buffer to "tutsem-content (tutsem-modules: /)". For MarkLogic 9+, set the 'Database' for each buffer to "tutsem-content".

Q. What documents got created? Under what URIs?

A. You should see one document under /triplestore directly (the Hello World triples) and 600 under /triplestore/dbpedia/

Q. How many triples are in the database?

A. 60003

Q. How many distinct triples are in the database?

A. 58517

Hints:

Download the Query Console Workspace ts-loading-data.xml and import it into Query Console.
Point your browser to http://localhost:9910/v1/graphs/things to browse triples (Replace localhost as needed).

Load data from BBC News

In this step, you will load a set of articles from the BBC News that we enriched using the OpenCalais Web Service.

We started with each article as a single XHTML document. We used OpenCalais to analyze the articles and find the entities (real-world things) within them. OpenCalais spotted entities like people, their roles, places (cities and countries) and organizations. Additionally, it linked individuals with their role(s) and also determined the subject headings (categories) of the documents. For example, for one news article, OpenCalais generated triples for us that indicated the item was about war, identified the places mentioned in the article, and provided geo-location information for those places.

Our enrichment process generated modified copies of these source documents and an associated set of triples for us, too. To load the modified articles, run the load-news-content.bat or load-news-content.sh script. (See above for how to run the script).

To load the associated triples, run the load-news-graph.bat or load-news-graph.sh script. (See above for how to run the script). You can ignore the errors about lexical forms. (As you will discover, it is not uncommon for triple data to be encoded out of spec. In this data set, the triples with such issues will still be loaded. But, the "dates" that are incorrectly formatted will be treated as strings.)

Before we move on, let's talk a little about some of the modifications we made during enrichment. Specifically, during enrichment, each article was assigned an IRI and that IRI was embedded in the article itself. Each document's IRI was also linked to an OpenCalais identifier using the common owl:sameAs predicate. There’s also a triple that links the document’s assigned IRI to the document’s database URI as well. This enables you to say things about the document, such as "database-document-X mentions IBM" by joining two triples (e.g., something like "assigned-URI mentions IBM" and "assigned-URI isDocument database-document-X").

Below is a SPARQL query that will show you all the owl:sameAs triples in the database. In particular, this will show you all the IRIs linked to OpenCalais identifiers.

PREFIX owl: <http://www.w3.org/2002/07/owl#> SELECT * FROM <http://www.bbc.co.uk/news/graph> WHERE { ?s owl:sameAs ?o } # Note: # Each document has an IRI e.g. http://www.bbc.co.uk/news/entertainment-arts-23094281 # The document's IRI is linked to an OpenCalais identifier via owl:sameAs # The document's IRI is also embedded the document itself

We can also see how an individual document's IRI is embedded by looking at one via XQuery or JavaScript:

XQuery
JavaScript

declare namespace html = "http://www.w3.org/1999/xhtml";

fn:collection()//html:head[@resource="http://www.bbc.co.uk/news/entertainment-arts-23094281"]

xdmp.xqueryEval('//*:head[@resource="http://www.bbc.co.uk/news/entertainment-arts-23094281"]');

There are other way to link triples and documents. For example, you can embed triples in the documents themselves. Such triples could include metadata such as "thisDocument publishDate today"; subjects or topics mentioned in the document such as "thisDocument mentionsCity 'New York'"; or events such "John wentTo China" or "Jack metWith Joe".

Verify your data

You should now see

155,980 distinct triples
2,021 documents containing triples
179,288 total triples

More?

Want to do more, before moving on to the next exercise? Using mlcp or sem:rdf-load() or a REST Endpoint, you can load any file containing triples that you can find on the Semantic Web! Also, because MarkLogic is a database, you can add, edit, or delete triples (in fully-ACID compliant transactions) as well!

References

Query Console export: ts-loading-data.xml (also available inside the semantics-exercises.zip).
mlcp
Semantics Developer Guide section: Loading Triples

Introducing SPARQL

Contents

Loading Data

References