XPath and Layout Functions (new XPath module)

This page describes the new (default) XPath parser, called “lxpath”. There is also the old XPath module called “luxor”. To switch between these two, you can set the default by setting the xpath configuration, for example on the command line with

sp --xpath luxor

(the old one)

or

sp --xpath lxpath

for the new (default) one. You can also set this in the configuration file.

What is XPath and why are there two different implementations?

The speedata Publisher’s input is data encoded in the XML format. XML is a hierarchical data structure where there is a root element and each element can have attributes and children, that are either text or other elements. For example you can have an element articlegroup which contains ten article elements. Now the idea of XPath is to navigate through the XML tree and ask questions like: give me all articles that have a certain attribute. Or how many articles do I have in this article group?

Up to now the speedata Publisher uses an ad-hoc implementation of an XML parser that has proven to work, but which is not very robust and does not report many errors with incorrect data. Also the XPath parser has worked mostly with regular expressions and global state, which is also not very robust.

The new implementation uses the XML parser from Go’s standard library and an XPath implementation that is rewritten with the XPath grammar in mind. It aims to be fully XPath 2 compliant.

There are some benefits when using the new XPath parser:

  • The overall implementation is faster.
  • There are better error messages (for example line numbers are part of error messages).
  • You can have your own function definitions.
  • The new XPath parser will be the default in speedata Publisher version 4.18 and above.

XPath expressions

The speedata Publisher accepts XPath expressions in some attributes. These attributes are called test or select or documented as such. In all other attributes XPath expressions can be used via curly braces ({ and }). In the following example XPath expressions are used in the attribute width and in the attribute select of the element Value. The width of the textblock is taken from the variable $width, the contents of the paragraph is the value of the current node (whatever this is).

<Textblock width="{ $width }">
  <Paragraph>
    <Value select="."/>
  </Paragraph>
</Textblock>

The following XPath expressions are handled by the software:

  • Number: Return the value without change: "5"
  • Text: Return the text without change: 'hello world'
  • Arithmetic operation (*, div, idiv, +, -, mod). Example: ( 6 + 4.5 ) * 2
  • Variables. Example: $column + 2
  • Access to the current node (dot operator). Example: . + 2
  • Access to subelements. Examples: productdata, *, foo/bar, node(),
  • Parent nodes: ../
  • Filter in square brackets, for example article[1] selects the first article.
  • Access to subelements. Examples: productdata, *, foo/bar
  • Attribute access in the current node. Example @a
  • Attribute access in subnodes, for example foo/@bar
  • Boolean expressions: <, >, <=, >=, =, !=. Attention, the less than symbol < must be written in XML as &lt;, the symbol > may be written as &gt;. Example: $number > 6. Can be used in tests.
  • if/then/else expressions: if (…​) then …​ else …​.
  • for clauses for example: for $i in (1,2,3) return $i * 2 or for $i in 1 to 3 return $i * 2.
  • The well known axis expressions like following-sibling, parent, preceding-sibling.

See the test file of lxpath if in doubt.

If you are still uncertain about XPath, please follow the good tutorial at W3Schools. See the test file of lxpath if in doubt

The following XPath functions are known to the system:

There are three classes of XPath functions: standard XPath functions, the speedata Publisher specific ones and user defined functions. The layout specific functions are in the namespace urn:speedata:2009/publisher/functions/en (denoted by sd: below). The standard functions should behave like documented by the XPath 2.0 standard.

sd layout functions

sd:allocated(x,y,<areaname>,<framenumber>)

Return true if the grid cell is allocated, false otherwise (since 2.3.71).

sd:alternating(<type>, <text>,<text>,.. )

On each call the next element will be returned. You can define more alternating sequences by using distinct type values. Example: sd:alternating('tbl', 'White','Gray') can be used for alternating color of table rules. To reset the state, use sd:reset-alternating(<type>).

sd:aspectratio(<imagename>,[pagenumber],[pdfbox])

Return the result of the division width by height of the given image. (< 1 for portrait images, > 1 for landscape). The arguments can contain a page number and a PDF box, see Specifying the page for layout functions for details.

sd:attr(<name>, …​)

Is the same as @name, but can be used to dynamically construct the attribute name. See example at sd:variable().

sd:count-saved-pages(<name>)

Return the number of saved pages from <SavePages>.

sd:current-column(<name>)

Return the current column. If name is given, return the column of the given frame.

sd:current-framenumber(<name>)

Return the current frame number of given positioning area.

sd:current-page()

Return the current page number.

sd:current-row(<name>)

Return the current row. If name is given, return the row of the given frame.

sd:decode-base64(<contents>)

Expect a string encoded in base64 and return the binary contents.

sd:decode-html(<node>)

Change text such as &lt;i&gt;italic&lt;/i&gt; into HTML markup (<i>italic</i> in this case).

sd:dimexpr(<Unit>,<Expression>)

Interprets the expression as a calculation and return the value as a scalar in the unit. Interprets variables. For example, say that $twocm is set to the string 2cm, sd:dimexpr('cm',' (40mm + $twocm) / 2 ') returns the number 3.0.

sd:dummytext([<count>])

Return a “Lorem ispum…​ ” dummy text. The count defaults to 1.

sd:even(<number>)

True if number is even. Example: sd:even(sd:current-page()).

sd:file-exists(<filename or URI schema>)

True if file exists in the current search path. Otherwise it returns false.

sd:filecontents(<binarycontent>)

Save the given contents into a file and return the file name.

sd:firstmark(pagenumber)

The first marker of the given page number. Useful for headings in dictionaries where the first and the last entry of a page is given.

sd:first-free-row(<name>)

Return the first free row of this area (experimental).

sd:format-number(Number or string, thousands separator, comma separator)

Format the number and insert thousands separators and change comma separator. Example: sd:format-number(12345.67, ',','.') returns the string 12,345.67.

sd:format-string(object, object, …​ ,formatting instructions)

Return a text string with the objects formatted as given by the formatting instructions. These instructions are the same as the instructions by the C function printf().

sd:group-height(<string>[, <unit>])

Return the given group’s height (in gridcells). See sd:group-width(…​) If provided with an optional second argument, it returns the height of the group in multiples of this unit. For example sd:group-height('mygroup', 'in') returns the group height in inches.

sd:group-width(<string>[, <unit>])

Return the number of gridcells of the given group’s width. The argument must be the name of an existing group. Example: sd:group-width('My group'). See sd:group-height() for description of the second parameter.

sd:imageheight(<filename or URI schema>,[pagenumber],[pdfbox],[<unit>])

Natural height of the image in grid cells. Attention: if the image is not found, the height of the file-not-found placeholder will be returned. Therefore you need to check in advance if the image exists. If provided with an optional second argument, it returns the height of the image in multiples of this unit. For example sd:imageheight('myimage.pdf', 'in') returns the height of 'myimage.pdf' in inches. The arguments can contain a page number and a PDF box, see Specifying the page for layout functions for details.

sd:imagewidth(<filename or URI schema>,[pagenumber],[pdfbox],[<unit>])

Natural width of the image in grid cells. Attention: if the image is not found, the width of the file-not-found placeholder will be returned. Therefore you need to check in advance if the image exists. If provided with an optional second argument, it returns the width of the image in multiples of this unit. For example sd:imagewidth('myimage.pdf', 'in') returns the width of myimage.pdf in inches. The arguments can contain a page number and a PDF box, see Specifying the page for layout functions for details.

sd:keep-alternating(<type>)

Use the current value of sd:alternating(<type>) without changing the value.

sd:lastmark(pagenumber)

The first marker of the given page number. Useful for headings in dictionaries where the first and the last entry of a page is given.

sd:loremipsum()

Same as sd:dummytext().

`sd:markdown(<text>)

Renders the text as markdown. See Markdown.

sd:md5(<value>,<value>, …)

Return the MD5 sum of the concatenation of each value as a hex string. Example: sd:md5('hello ', 'world') gives the string 5eb63bbbe01eeed093cb22bb8f5acdc3.

sd:merge-pagenumbers(<pagenumbers>,<separator for range>,<separator for space>, [hyperlinks])

Merge page numbers. For example the numbers "1, 3, 4, 5" are merged into 1, 3–5. Defaults for the separator for the range is an en-dash (–), default for the spacing separator is ', ' (comma, space). This function sorts the page numbers and removes duplicates. When the separator for range is empty, the page numbers are separated each with the separator for the space. If hyperlinks is set to true(), the page numbers become active. The default is false(). The function will show the user visible page numbers, which correspond to the logical page numbers by default.

sd:mode(<string>[,<string>…​])

Returns true (true()) if one of the specified modes is set. A mode can be set from the command line or from the configuration file. See Control of the layout when calling the Publisher

sd:number-of-columns()

Number of columns in the current grid.

sd:number-of-pages(<filename or URI schema>)

Determines the number of pages of a (PDF-)file.

sd:number-of-rows()

Number of rows in the current grid.

sd:odd(<number>)

True if number is odd.

sd:pagenumber(<string>)

Get the number of the page where the given mark is placed on. See the command <Mark>.

sd:pageheight(<unit>)

Similar to sd:pagewidth(), just for the height.

sd:pagewidth(<unit>)

Get the width of the page in number of units (but without the unit). For example a page with width 210mm sd:pagewidth("mm") returns 210. This function initializes a page. (Since version 4.13.8.)

sd:romannumeral(<number>)

Convert the number into a lowercase Roman numeral.

sd:randomitem(<Value>,<Value>, …)

Return one of the values.

sd:reset-alternating(<type>)

Reset alternating so the next sd:alternating() starts again from the first element.

sd:sha1(<value>,<value>, …)

Return the SHA-1 sum of the concatenation of each value as a hex string. Example: sd:sha1('hello ', 'world') gives the string 2aae6c35c94fcfb415dbe95f408b9ce91ee846ed.

sd:sha256(<value>,<value>, …)

Return the SHA-256 sum of the concatenation of each value as a hex string. Example: sd:sha256('hello ', 'world') gives the string b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9.

sd:sha512(<value>,<value>, …)

Return the SHA-512 sum of the concatenation of each value as a hex string. Example: sd:sha512('hello ', 'world') gives the string 309ecc489c12d6eb4cc40f50c902f2b4d0ed77ee511a7c7a9bcd3ca86d4cd86f989dd35bc5ff499670da34255b45b0cfd830e81f605dcf7dc5542e93ae9cd76f.

sd:tounit(<string>,<string>[,<number>])

Return a scalar of the unit given in the second argument converted to the unit given in the first argument rounded to the digits in the third argument (defaults to 0 - return integer values). Example: sd:tounit('pt','1pc') returns 12, because there are 12pt in 1pc (pica point).

sd:variable(<name>, …​)

The same as $name. This function allows variable names to be constructed dynamically. Example: sd:variable('myvar',$num) – if $num contains the number 3, the resulting variable name is myvar3.

sd:variable-exists(<name>)

True if variable name exists. Example: sd:variable-exists('my_bar') checks whether $my_bar is defined (variable names in this function have to be enclosed in single quotation marks if double quotation marks are used to delimit the XPath attribute).

sd:visible-pagenumber(<number>)

Return the user visible page number (as defined by matters) for the given real page number.

XPath functions

abs(<number>)

Return the positive value of the number.

boolean(<value>)

Return the effective boolean value of the argument.

codepoints-to-string( <codepoints> )

Convert the sequence of code points to a string.

ceiling()

Round to the higher integer. ceiling(-1.34) returns 1, ceiling(1.34) returns 2.

concat( <value>,<value>, … )

Create a new text value by concatenating the arguments.

contains(<haystack>,<needle>)

True if haystack contains needle. contains('bana','na') returns true().

count(<text>)

Counts all child elements with the given name. Example: count(article) counts how many child elements with the name article exist.

empty( <sequence> )

Checks, if the sequence is empty. For example non existing child elements or non existing attributes are “empty”.

false()

Return false.

floor()

Returns the largest number with no fractional part that is not greater than the value of the argument.

last()

Return the number of elements of the same named sibling elements. Not yet XPath conform.

local-name()

Return the local name (without namespace) of the current element.

lower-case(<text>)

Return the text in lowercase letters.

matches(<text>,<regexp>[,<flags>])

Return true if the regexp matches the text. Flags can be one of sim and are described in the spec: https://www.w3.org/TR/xpath-functions-31/#flags. Example: matches("banana", "^(.a)+$") returns true.

max()

Return the maximum value. max(1.1,2.2,3.3,4.4) returns 4.4.

min()

Return the minimum value. min(1.1,2.2,3.3,4.4) returns 1.1.

number(<value>)

Convert the argument to a number. Return “not a number” if the value cannot be converted.

not()

Negates the value of the argument. Example: not(true()) returns false().

normalize-space(<text>)

Return the text without leading and trailing spaces. All newlines will be changed to spaces. Multiple spaces/newlines will be changed to a single space.

position()

Return the position of the current node.

replace(<input>,<regexp>, <replacement>)

Replace the input using the regular expression with the given replacement text. Example: replace('banana', 'a', 'o') yields bonono.

root(element)

Return the root element of the element.

round(<number>,<number>)

Return the argument in the first parameter rounded to number of decimal places in the second parameter. The second parameter defaults to 0.

ends-with ( <string>, <string>)

Return true if the first string ends with the second string. Example: ends-with ( "tattoo", "too") returns true.

starts-with ( <string>, <string>)

Return true if the first string starts with the second string. Example: starts-with ( "tattoo", "tat") returns true.

string(<sequence>)

Return the text value of the sequence e.g. the contents of the elements.

string-join(<sequence>,separator)

Return the string value of the sequence, where each element is separated by the separator.

string-length(<string>)

Return the length of the string in characters. Multi-byte UTF-8 sequences are counted as 1.

substring(<input>,<start>,<length>)

Return the part of the string input that starts at start and optionally has the given length. start can be (in contrast to the XPath specification) negative which counts from the end of the input.

substring-after(<string>,<string>])

Return the contents of the first string, that comes after the second string. Example: substring-after ( "tattoo", "tat") returns "too".

substring-before(<string>,<string>])

Return the contents of the first string, that comes before the second string. Example: substring-before ( "tattoo", "attoo") returns "t".

tokenize(<input>,<regexp>)

This function returns a sequence of strings. The input text is read from left to right. When the regular expression matches the current position, the text read so far from the last match is returned. Example (from the great XPath / XSLT book by M. Key): tokenize("Go home, Jack!", "\W+") returns the sequence "Go", "home", "Jack", "".

true()

Return true.

unparsed-text(<filename>)

Returns the contents of the file without interpretation.

upper-case()

Converts the text to capital letters: upper-case('text') results in TEXT.