Adding a Map Datatype to XQuery

Introduction

The "standard" way of passing a complex data structure in XQuery is to create an XML fragment and later query it using XPath. This approach works well most of the time, but sometimes you just can't use it:

  • wrapping stored data into an XML fragment will create a new copy of the data in memory (even though eXist-db will defer this until the data is actually used). Wrapping a large query result into an XML structure may thus use a considerable amount of memory.
  • the reference to the original document gets lost.
  • the power of higher-order functions in XQuery 3.0 makes me wish I could create data structures containing function items as values.

Maps provide a solution to the problems above. Michael Kay has posted a well thought out proposal for maps, which I decided to implement a few weeks ago.

Let's have a quick look at the map datatype as proposed by Michael and implemented in the current trunk of eXist-db. Note that this is not part of the XQuery 3.0 specification - though it is considered for later inclusion - and may be subject to change.

Creating a Map

You create a new map through either the literal syntax or the functions map:new and map:entry. Here's the literal syntax:

    let $daysOfWeek := map {
        "Sunday" := 1,
        "Monday" := 2,
        "Tuesday" := 3,
        "Wednesday" := 4,
        "Thursday" := 5,
        "Friday" := 6,
        "Saturday" := 7
    }

The keys are arbitrary atomic values while any sequence can be used as value. You are thus not limited to string keys: dates, numbers or QNames will work as well. Keys are compared for equality using the eq operator under the map's collation.

map:entry creates a map with a single key/value pair. Use this to create map items programmatically in combination with map:new (see map:new below):

map:entry("Sunday", 1)

map:new creates either an empty map or a new map from a sequence of maps. It accepts an optional collation string as second parameter:

    let $daysOfWeek := (
        "Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"
    )
    let $map := map:new(
        for $day at $pos in $daysOfWeek
        return map:entry($day, $pos),
        "?strength=primary"
    )
    return $map

As you can see, the only way to create a map from a sequence programmatically is to merge single-item maps into the new map. The map implementation in eXist-db makes sure this is not too expensive (by using a lightweight wrapper for single key/value pairs).

In this example, the collation string "?strength=primary" causes keys to be compared in a case-insensitive way.
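As a quick check of the collation behaviour, the following sketch builds a small map with the same collation string and looks up keys written in a different case. It assumes the draft map functions described above, as implemented in eXist-db trunk; the two-entry map is made up for illustration.

```xquery
xquery version "3.0";

(: Hypothetical two-entry map using the collation string from the example above :)
let $map := map:new(
    (map:entry("Monday", 1), map:entry("Tuesday", 2)),
    "?strength=primary"
)
return (
    (: with primary strength, differences in case should be ignored :)
    $map("tuesday"),
    map:contains($map, "MONDAY")
)
```

Without the collation argument, the same lookups would be expected to find nothing, since the keys would then be compared case-sensitively.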

Look Up

To look up a key, use map:get:

map:get($map, "Tuesday")

But wait, there's a really cool shortcut for lookups: a map is also a function item, which means you can call it directly as a function, passing the key to retrieve as the single parameter:

$map("Tuesday")

Calling the map as a function item otherwise just behaves like map:get.

Because the empty sequence is allowed as a value, map:get does not tell you for sure if a key exists in a map or not. You can use map:contains to see if a key is present in the map:

map:contains($map, "Tuesday")
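The difference matters as soon as a key is mapped to the empty sequence. A minimal sketch, assuming the draft syntax used above (the keys "empty" and "missing" are invented for the example):

```xquery
xquery version "3.0";

let $map := map { "empty" := (), "one" := 1 }
return (
    (: both lookups yield the empty sequence, so map:get cannot tell them apart :)
    empty(map:get($map, "empty")),
    empty(map:get($map, "missing")),
    (: map:contains distinguishes the existing key from the missing one :)
    map:contains($map, "empty"),
    map:contains($map, "missing")
)
```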

map:keys retrieves all keys in the map as a sequence:

map:keys($map)

Please note that the order in which keys are returned is implementation-defined, so don't rely on it. In fact, eXist-db uses two different map implementations for better performance, depending on collation settings and key types.

Here's a complete example which combines the functions to access a map:

    xquery version "3.0";

    let $workDays := map {
        "Monday" := 2, "Tuesday" := 3, "Wednesday" := 4,
        "Thursday" := 5, "Friday" := 6
    }
    let $daysOfWeek := map:new(($workDays, map { "Sunday" := 1, "Saturday" := 7 }))
    for $day in map:keys($daysOfWeek)
    order by map:get($daysOfWeek, $day)
    return
        <day n="{$daysOfWeek($day)}" atWork="{map:contains($workDays, $day)}">{$day}</day>

Maps are Immutable

To remove a key/value pair, call map:remove, passing the map and the key to drop:

    let $newMap := map:remove($daysOfWeek, "Sunday")

At this point we definitely need to talk about an important feature: maps are immutable! Adding or removing a key/value pair will result in a new map. To illustrate this with an example:

    let $daysOfWeek := map {
        "Sunday" := 1, "Monday" := 2, "Tuesday" := 3, "Wednesday" := 4,
        "Thursday" := 5, "Friday" := 6, "Saturday" := 7
    }
    let $workDays := map:remove($daysOfWeek, "Sunday")
    return (
        map:contains($daysOfWeek, "Sunday") (: Still there :),
        map:contains($workDays, "Sunday") (: Nope :)
    )
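Adding an entry works the same way in reverse: since there is no function that modifies a map in place, one way to "add" a key is to merge the old map with a single-entry map via map:new. This is only a sketch with made-up data, using the functions introduced above:

```xquery
xquery version "3.0";

let $workDays := map { "Monday" := 2, "Friday" := 6 }
(: "adding" an entry means building a new map from the old one plus one entry :)
let $extended := map:new(($workDays, map:entry("Saturday", 7)))
return (
    map:contains($workDays, "Saturday"),  (: the original map is unchanged :)
    map:contains($extended, "Saturday")   (: only the new map has the key :)
)
```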

Internally, eXist-db uses an efficient implementation of persistent immutable maps and hash tables taken from Clojure, a Lisp-like functional language for the Java VM.

Use Cases

So far I found maps to be useful in a number of scenarios:

  1. in my HTML templating framework for passing around application data between templates. In this case the sequences stored in the map can potentially be very large, e.g. if they include the result of queries into the database. Wrapping the data into an in-memory fragment would thus be a bad idea.
  2. to pass optional configuration parameters into a library module.
  3. to introduce additional levels of abstraction when working with heterogeneous data sets.
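The second scenario, optional configuration parameters, could look roughly like the sketch below. The function local:paginate and its option keys are hypothetical, and the code assumes that when map:new merges maps with duplicate keys, the entry from the later map wins.

```xquery
xquery version "3.0";

(: Hypothetical library function: $options may override any default :)
declare function local:paginate($items as item()*, $options as map(*)) as item()* {
    let $defaults := map { "start" := 1, "count" := 10 }
    (: assumption: entries from later maps replace earlier ones with the same key :)
    let $config := map:new(($defaults, $options))
    return
        subsequence($items, $config("start"), $config("count"))
};

(: the caller overrides "count" only; "start" falls back to the default :)
local:paginate(1 to 100, map { "count" := 3 })
```

The caller only needs to mention the settings it cares about, and new options can be added later without changing the function signature.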

Function Items as Values

To understand the last scenario, we have to take a closer look at an important feature of maps: one can use function items as map values! For example, a library module may allow the calling module to register an optional function for resolving a resource, which only the calling module can know how to find:

    let $configuration := map {
        "resolve" := function($relPath as xs:string) {
            (: resolve resource; this placeholder returns the empty sequence :)
            ()
        }
    }

You can even use maps and function items to simulate "objects". For example, one of my library modules has to display a short summary of documents using two different schemas: DocBook and TEI. It thus needs to extract common metadata like title or author from the documents. Using maps, I could create a wrapper around the documents, which provides functions to access the data in object-oriented style:

    xquery version "3.0";

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    declare namespace db="http://docbook.org/ns/docbook";

    declare function local:tei($root as element()) as map(xs:string, function(*)) {
        map {
            "title" := function() as xs:string {
                $root//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title/string()
            }
        }
    };

    declare function local:docbook($root as element()) as map(xs:string, function(*)) {
        map {
            "title" := function() as xs:string {
                $root//db:info/db:title/string()
            }
        }
    };

    declare function local:wrap($root as element()) as map(xs:string, function(*))? {
        typeswitch ($root)
            case element(tei:TEI) return
                local:tei($root)
            case element(db:article) return
                local:docbook($root)
            default return
                ()
    };

    <ul>
    {
        for $doc in (doc("/db/db-test.xml")/*, doc("/db/tei-test.xml")/*)
        let $wrapped := local:wrap($doc)
        return
            <li>{$wrapped("title")()}</li>
    }
    </ul>

This approach has its limitations. There's no guarantee that the maps returned by local:wrap do indeed have a "title" function. XQuery is not - and was not designed to be - an object-oriented language. However, I can see that the technique could improve reusability of code libraries.
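One way to soften the first limitation is to check for the key before calling it. The helper below is a hypothetical sketch, not part of the module described above; it falls back to a placeholder string when no wrapper or no "title" function is available.

```xquery
xquery version "3.0";

(: Hypothetical helper: call the "title" function only if the wrapper provides one :)
declare function local:title($wrapped as map(*)?) as xs:string {
    if (empty($wrapped)) then
        "(untitled)"
    else if (map:contains($wrapped, "title")) then
        $wrapped("title")()
    else
        "(untitled)"
};

local:title(map { "title" := function() as xs:string { "Sample" } }),
local:title(map { "author" := function() as xs:string { "Unknown" } }),
local:title(())
```

This does not give compile-time guarantees, but it keeps a missing function from turning into a runtime error deep inside the rendering code.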

Availability

Maps as a data type are currently available in eXist-db trunk and will likely go into the final 2.0 release (only minor additions to the query engine were required). If you would like to test them right now, feel free to check out trunk.

Higher-Order Functions in XQuery 3.0

Introduction

Higher-order functions are probably the most notable addition to the XQuery language in version 3.0 of the specification. While it may take some time to understand their full impact, higher-order functions certainly open a wide range of new possibilities, and are a key feature in all functional languages.

As of April 2012, eXist-db completely supports higher-order functions, including features like inline functions, closures and partial function application. This article will quickly walk through each feature before we put them all together in a practical example.

Function References

A higher-order function is a function which takes another function as parameter or returns a function. So the first thing you'll need in order to pass a function around is a way to obtain a reference to a function.

In older versions of eXist-db we had an extension function for this, called util:function, which expected a name as first argument, and the arity of the function as second. The arity corresponds to the number of parameters the target function takes. Name and arity are required to uniquely identify a function within a module.

XQuery 3.0 now provides a literal syntax for referencing a function statically. It also consists of the name and the arity of the function to look up, separated by a hash sign:

let $f := my:func#2

This assumes you already know the function when writing the query. However, there's also a dynamic variant, a function called "function-lookup", which is part of the standard library:

    let $f := function-lookup(xs:QName("my:func"), 2)

Dynamic Function Calls

Now that we have a reference, we can actually call the function. In XQuery 3.0, a dynamic function call is a primary expression followed by an argument list in parentheses:

let $f := upper-case#1 return $f("Hello")

In this case the function reference is obtained from variable $f, but we could actually use any primary expression here, even another function call. Now we're able to write our first higher-order function, i.e. one which takes another function as parameter:

    xquery version "3.0";

    declare namespace ex="http://exist-db.org/xquery/ex";

    declare function ex:map($func, $list) {
        for $item in $list
        return $func($item)
    };

    (: Obtain a reference to the built-in upper-case function and assign it to $f :)
    let $f := upper-case#1
    return
        ex:map($f, ("Hello", "world!"))

Our function, ex:map, simply applies the passed function to every item in the list it received. This is a very common operation, so the XQuery 3.0 standard library already provides a function for it, fn:map (renamed fn:for-each, with the argument order swapped, in the final specification), and we don't need to write our own.

Inline Functions and Closures

Instead of explicitly declaring a function in the module, we may also use an inline function (or anonymous function), which we can pass around directly. To put it simply: an inline function is a function without a name:

map(function($x) { $x * $x }, 1 to 5)

Inline functions have an important feature: within the function body, you can not only access the function parameters, but also any variable which was in scope at the time the inline function was created. This is commonly called a closure in functional languages. The closure allows the function body to access variables outside its immediate lexical scope. We'll see concrete examples in the larger example following below.

Inline functions also inherit the static context from the module in which they were created.
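A small illustration of a closure, using only standard XQuery 3.0 syntax (the function name local:multiplier is made up for this sketch): the inline function returned here still sees $factor after local:multiplier has returned.

```xquery
xquery version "3.0";

(: Returns a new function which multiplies its argument by $factor :)
declare function local:multiplier($factor as xs:integer)
        as function(xs:integer) as xs:integer {
    (: $factor is captured from the enclosing scope: this is the closure :)
    function($x as xs:integer) as xs:integer { $x * $factor }
};

let $triple := local:multiplier(3)
return
    $triple(14)
```

Each call to local:multiplier produces an independent function item carrying its own value of $factor.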

Partial Function Applications

Partial function applications are a powerful feature in functional programming languages. They allow us to fill in parts of a function call, leaving some parameters open. This is useful if information has to be passed across a hierarchy of function calls. The calling function can fill in some parameters and leave it to the called function to provide the rest.

A partial function application looks like a normal (dynamic or static) function call using a question mark (?) as placeholder for some parameters:

    declare function local:add($lhs as xs:integer, $rhs as xs:integer) as xs:integer {
        $lhs + $rhs
    };

    let $f := local:add(10, ?)
    return
        $f(32)

The partial function application actually creates a new function with the same name, but with all fixed parameters removed from the parameter list.
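Partial application is handy for deriving specialised helpers from a general function. In this sketch, the hypothetical function local:tag builds an element with a given name; fixing the first argument yields a one-parameter function that can be handed to any code expecting a simple callback.

```xquery
xquery version "3.0";

(: Hypothetical general-purpose function: wrap $text into an element named $name :)
declare function local:tag($name as xs:string, $text as xs:string) as element() {
    element { $name } { $text }
};

(: Fix the element name, leave the text open :)
let $li := local:tag("li", ?)
return
    <ul>{ for $text in ("one", "two") return $li($text) }</ul>
```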

Putting it all Together

For a practical exercise we choose a frequent task when working with eXist: scan a collection hierarchy recursively and apply some operation to every collection or resource encountered. This can be written as a one-off solution using a simple recursive function. However, we would rather design a set of functions which can be reused for arbitrary tasks on collections and resources.

To start with, we create a function to recursively scan the collection hierarchy, calling a provided function for each collection encountered:

    xquery version "3.0";

    module namespace dbutil="http://exist-db.org/xquery/dbutil";

    (:~
     : Scan a collection tree recursively starting at $root. Call
     : $func once for each collection found
     :)
    declare function dbutil:scan-collections($root as xs:anyURI, $func as function(xs:anyURI) as item()*) {
        $func($root),
        for $child in xmldb:get-child-collections($root)
        return
            dbutil:scan-collections(xs:anyURI($root || "/" || $child), $func)
    };

The function takes two parameters:

  • the path to the root collection
  • a function which accepts one parameter: the path to the current collection

The callback function is called once for every collection we encounter while scanning the collection tree. We then walk each child collection, calling dbutil:scan-collections recursively. The || operator is also new in XQuery 3.0 and concatenates strings. It is a shorthand for the fn:concat function.
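To get a feeling for how the callback is used, here is a minimal sketch which lists all collections below /db. It assumes the module is stored at the location used in the import statements later in this article; the wrapper elements are made up for the example.

```xquery
xquery version "3.0";

import module namespace dbutil="http://exist-db.org/xquery/dbutil"
    at "xmldb:exist:///db/codebin/dbutils.xql";

(: The callback simply returns the path it receives, so the result is a flat list of paths :)
let $paths := dbutil:scan-collections(xs:anyURI("/db"), function($path as xs:anyURI) { $path })
return
    <collections count="{count($paths)}">
    { for $path in $paths return <collection>{$path}</collection> }
    </collections>
```

Because the callback may return any sequence, the same scan can just as well collect sizes, permissions or anything else derived from the collection path.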

Next, let's add another function for scanning all the resource paths stored in a given collection and pass them to another callback function. This is straightforward:

    (:~
     : List all resources contained in a collection and call the
     : supplied function once for each resource with the complete
     : path to the resource as parameter.
     :)
    declare function dbutil:scan-resources($collection as xs:anyURI, $func as function(xs:anyURI) as item()*) {
        for $child in xmldb:get-child-resources($collection)
        return
            $func(xs:anyURI($collection || "/" || $child))
    };

We're now ready to combine the two operations: create a function which recursively walks the collection tree and calls a provided function for each collection and resource encountered.

    (:~
     : Scan a collection tree recursively starting at $root. Call
     : the supplied function once for each resource encountered.
     : The first parameter to $func is the collection URI, the
     : second the resource path (including the collection part).
     :)
    declare function dbutil:scan($root as xs:anyURI, $func as function(xs:anyURI, xs:anyURI?) as item()*) {
        dbutil:scan-collections($root, function($collection as xs:anyURI) {
            $func($collection, ()),
            (: scan-resources expects a function with one parameter, so we use
               a partial application to fill in the collection parameter :)
            dbutil:scan-resources($collection, $func($collection, ?))
        })
    };

The callback function for the collection/resource scan takes two parameters:

  • the path to the collection
  • the full resource path if the current item is a resource, or the empty sequence if it is a collection

We start by calling dbutil:scan-collections with an inline function. This gets the path to the current collection and first calls the user provided callback with the collection path and an empty sequence for the resource parameter. Note that the variable $func, which we received as parameter to the outer call, is visible within the inline function due to the closure!

Now to the interesting code: for each collection, we call dbutil:scan-resources. Remember that it expects a callback function with just one parameter (the resource path), but the user-supplied $func takes two parameters. We could now work around this by changing dbutil:scan-resources, but there's an easier way: we pass a partial function application, providing $collection as a fixed parameter while leaving the resource parameter open!

We now have all our library functions in place and are ready to use them. For a first test, here's a function which scans the collection hierarchy for resources having a given mime type:

    declare function dbutil:find-by-mimetype($collection as xs:anyURI, $mimeType as xs:string) {
        dbutil:scan($collection, function($collection, $resource) {
            if (exists($resource) and xmldb:get-mime-type($resource) = $mimeType) then
                $resource
            else
                ()
        })
    };

If our inline function receives a $resource parameter, we check its mime type and return the resource path if it matches. In all other cases, the empty sequence is returned. We can now test if our library works as expected:

    import module namespace dbutil="http://exist-db.org/xquery/dbutil"
        at "xmldb:exist:///db/codebin/dbutils.xql";

    dbutil:find-by-mimetype(xs:anyURI("/db/demo"), "application/xquery")

You can view the code for the complete module.

Another operation you will probably need when migrating your database from eXist-db version 1.4.x to the forthcoming 2.0: traverse the collection tree and fix permissions on all collections and XQuery resources. Contrary to previous versions, eXist-db 2.0 closely follows the Unix security model. To access a collection, it is no longer sufficient to have read permissions on it: you also need the execute permission in order to see the collection contents. Likewise, stored XQuery modules need the execute flag to be set in order to be executed through eXist-db.

Here's a piece of code to globally reset permissions:

    xquery version "3.0";

    import module namespace dbutil="http://exist-db.org/xquery/dbutil"
        at "xmldb:exist:///db/codebin/dbutils.xql";

    (: You need to run this as an admin user, so it won't work on the demo server :)
    let $perms := "g+x,u+x,o+x"
    return
        dbutil:scan(xs:anyURI("/db"), function($collection, $resource) {
            if ($resource and xmldb:get-mime-type($resource) = "application/xquery") then
                sm:chmod($resource, $perms)
            else
                sm:chmod($collection, $perms)
        })

Availability and Backwards Compatibility

Complete support for XQuery 3.0 higher-order functions is available in eXist-db trunk (2.1dev) starting with revision 16248 (April 14, 2012). It will also be part of the final 2.0 release. We're currently working hard to complete the last bits and pieces for the release.

eXist-db has provided ways to call functions dynamically for several years, based on the two extension functions util:call and util:function. They are used in several modules (e.g. the KWIC module).

We made sure the new XQuery 3.0 higher-order function implementation is backwards compatible (though a lot more powerful), so old code will continue to work and migration is smooth.

1.4.2 Maintenance Release

While we're working hard on polishing version 2.0 for the final release, we should not forget that there have been quite a few important bug fixes in the 1.4.x branch as well. We are thus happy to announce another production quality release in the stable branch: 1.4.2. Version 1.4.2 is the last release in the 1.4.x branch before 2.0. After 2.0 has been released in May, it will become the new stable branch.

Stable releases only contain hand-selected bug fixes which were ported back from the development trunk. There are no new features. Notable changes include:

  • Deadlock fixes
  • Lots of updates to the rewritten WebDAV interface
  • Performance improvements when using the Lucene index
  • Fix for backups to the file system producing a bad directory structure
  • Support circular module imports
  • XQuery processing fixes

eXist-db version 1.4.2 is now available on SourceForge.

Content Extraction and Binary Resource Indexing

One of the outstanding new features in the development version of eXist is the ability to index and query the content of binary resources, which are stored inside the database. This is made possible through a new XQuery module for content extraction and some basic, but powerful additions to the Lucene-based full text indexing. Our goal was to provide as much flexibility to the user as possible. The new functions are thus rather generic and can be used in a wide range of scenarios.

Content Extraction

Prerequisites

The content extraction module is based on Apache Tika. Tika understands a large variety of formats, ranging from PDF documents to spreadsheets and image metadata. It thus requires a number of helper libraries, which will be installed automatically when you build the module.

To get started, you need a recent checkout of eXist from SVN trunk. Enable the content extraction module by editing EXIST_HOME/extensions/build.properties and set the corresponding property to true:

    # Binary Content and Metadata Extraction Module
    include.feature.contentextraction = true

Next, call build.sh/build.bat from eXist's top directory to build the module. You should see in the output how the various libraries are downloaded and installed.

Usage

To import the module use an import statement as follows:

import module namespace content="http://exist-db.org/xquery/contentextraction" at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

The module provides three functions:

    content:get-metadata($binary as xs:base64Binary) as document-node()

    content:get-metadata-and-content($binary as xs:base64Binary) as document-node()

    content:stream-content($binary as xs:base64Binary, $paths as xs:string*, $callback as function, $namespaces as element()?, $userData as item()*) as empty()

The first two functions don't need much of an explanation: get-metadata just returns some metadata extracted from the resource, while get-metadata-and-content will also provide the text body of the resource - if there is any. The third function is a streaming variant of the other two and is used to process larger resources, whose content may not fit into memory.

All functions produce XHTML. The metadata will be contained in the HTML head, the contents go into the body. The structure of the body HTML varies a lot, depending on the media type you parse. For PDFs, the body is just a sequence of divs, one for each page. One can use this feature to extract page numbers, as I do in my example application (see below). However, in most cases the HTML structure will be mostly flat.
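As a rough sketch of what working with the result might look like, the query below lists the metadata extracted from a PDF. The path /db/Guidelines.pdf is only a placeholder for any binary stored in the database, and the code assumes that the metadata shows up as meta elements with name and content attributes in the XHTML head; the actual structure depends on what Tika returns for the media type.

```xquery
xquery version "3.0";

declare namespace xhtml="http://www.w3.org/1999/xhtml";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: placeholder path: replace with a binary resource in your own database :)
let $doc := content:get-metadata(util:binary-doc("/db/Guidelines.pdf"))
return
    <metadata>
    {
        (: assumption: metadata appears as meta elements with name/content attributes :)
        for $meta in $doc//xhtml:head/xhtml:meta
        return
            <entry name="{$meta/@name}">{string($meta/@content)}</entry>
    }
    </metadata>
```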

Indexing

While you could decide to just store the HTML returned by the content extraction functions as an XML resource in the database, this is not very efficient, in particular for larger documents. You would need to maintain both the binary and the extracted HTML.

We have thus added a feature to the existing full text indexing module, which allows users to associate additional text indexes with a binary resource (or actually: any resource, binary or xml). The index will be linked to the resource, meaning that the same permissions apply and if the resource is deleted, the index will be removed as well.

To create an index, call the index function with the following arguments:

  • The path of the resource to which the index should be linked as a string.
  • An XML fragment describing the fields you want to add and the text content to index.

For example, to associate an index with the document test.txt one may call index as follows:

    ft:index("/db/demo/test.txt",
        <doc>
            <field name="title" store="yes">Indexing</field>
            <field name="para" store="yes">This is the first paragraph.</field>
            <field name="para" store="yes">And a second paragraph.</field>
        </doc>
    )

This creates a Lucene index document, indexes the content using the configured analyzers, and links it to the eXist document with the given path. You may link more than one Lucene document to the same eXist resource.

The field elements map to Lucene fields. You can use as many fields as you want or add multiple fields with the same name. The store="yes" attribute tells the indexer to also store the text string, so you can retrieve it later.

It is also possible to configure the analyzers used by Lucene for indexing a given field, as well as other options, in the collection configuration.

To query the created index, use the search function:

    ft:search("/db/demo/test.txt", "para:paragraph AND title:indexing")

The first parameter is the path to the resource or collection to query, the second specifies a lucene query string. Note how we prefix the query term by the name of the field. Executing this query returns:

    <results>
        <search uri="/db/demo/test.txt" score="6.3111067">
            <field name="para">This is the first <exist:match>paragraph</exist:match>.</field>
            <field name="para">And a second <exist:match>paragraph</exist:match>.</field>
            <field name="title"><exist:match>Indexing</exist:match></field>
        </search>
    </results>

Each matching resource is described by a search element. The score attribute expresses the relevance Lucene computed for the resource (the higher the better). Within the search element, every field which contributed to the query result is returned, but only if store="yes" was defined for this field at indexing time (if not, the field content won't be available). Note how the matches in the text are enclosed in match elements, just as if you did a full text query on an XML document. This makes it easy to post-process the query result, for example to create a keywords-in-context display using eXist's standard KWIC module.

The document the index is linked to does not need to be a binary resource. One can also create additional indexes on XML documents. This is a useful feature, because it allows us to index and query information which is not directly contained in the XML itself. For example, one could add metadata fields and retrieve them later using get-field. Or we could use fields to pre-process and normalize information already present in the XML to speed up later access.

Combining content extraction and indexing

The following example extracts metadata and content from a PDF (I chose the TEI guidelines) and creates a field for each page. Please note that extracting content can take a while and is a memory-intensive process. For larger PDFs, you will want to use stream-content. We do not cover this here, but you may have a look at the sample application (see below).

    xquery version "1.0";

    declare namespace xhtml="http://www.w3.org/1999/xhtml";

    import module namespace content="http://exist-db.org/xquery/contentextraction"
        at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

    let $path := "/db/Guidelines.pdf"
    let $binary := util:binary-doc($path)
    let $content := content:get-metadata-and-content($binary)
    let $indexDef :=
        <doc>
            <field name="title" store="yes">{ $content//xhtml:title/text() }</field>
            {
                for $page in $content//xhtml:div[@class = "page"]/xhtml:p
                return
                    <field name="page" store="yes">{ $page/text() }</field>
            }
        </doc>
    return
        ft:index($path, $indexDef)

We can now query the index, using summarize to get just the immediate context of the match in the text:

    xquery version "1.0";

    import module namespace kwic="http://exist-db.org/xquery/kwic"
        at "resource:org/exist/xquery/lib/kwic.xql";

    for $result in ft:search("/db/Guidelines.pdf", 'page:"page layout"')/search
    for $field in $result/field
    return
        kwic:summarize($field, <config width="40"/>)

Demo app

To see queries on binary documents in action and study a complete code example, please head over to my sample application. This application is available as an installable package, so you can play with the code locally.

To install the package into your own eXist instance watch our screencast or follow the steps below:

  • Open the admin page in the web application and log in as admin
  • Select the "Package Repository" link from the sidebar
  • Switch to the "Public Repo" tab and click on "Retrieve packages"
  • You should see a list of packages available on the server
  • Click on the package "eXist-db Demo Apps (0.1)"
  • Click on the install icon
  • After the installation has finished, the package should show up in the "Installed" tab
  • Click the installed package. You should see a link "Local URL". Click it to get to the application

Demo of the new app repository

eXist's development version (to become v1.5) provides a number of new features to simplify the process of creating, deploying and distributing XQuery-based apps. An "app" in this context is a self-contained package, which can be downloaded from a public or private repository and installed into any instance of eXist-db with a few mouse clicks. The app may just package a bunch of XQuery library modules or (REST-style) interfaces, or it may contain an entire, complex web application.

There are many different paths to create an application with eXist, which is good. But this also makes it difficult for new users to find their way. The new app repository as well as eXide try to simplify the process for people to get started (just keep in mind that not every app will fit into this framework).

Upon request, I created a short screencast to demonstrate how simple it is to use the package repository to install entire applications into eXist. This is just a teaser and does not explain how to actually create app packages. I have a longer video in the pipeline which explains just that (eXide actually handles most of the setup work for you).

For the next release of eXist, we plan to ship all example code and parts of the documentation as apps, which can be installed on demand. This will lead to a cleaner installation and make it easier for people to find their way through the examples.
