Accessibility / tagged PDF

What is tagged PDF?

Tagged PDF is an extension of the PDF file with a logical structure of the content. The tags are used to determine the role of the object to be output, e.g. level 1 heading, a paragraph or an image. This is mainly used to create accessible PDFs, i.e. PDFs that are optimized for screen readers or similar output devices.

For accessible PDFs, not only the technical side must be taken into account (this is what the speedata Publisher is for), but also the logical structure must be created sensibly. For example, images must be output with an alternative text, otherwise their use is restricted.

Example

In the following example, the document is not further divided into sub structures. A heading (H1) and an image (Figure) are output in this section.

<Layout xmlns="urn:speedata.de:2009/publisher/en"
	xmlns:sd="urn:speedata:2009/publisher/functions/en">

  <PDFOptions format="PDF/UA" />

  <StructureElement role="Document" id="doc" />

  <Record element="data">
    <PlaceObject>
      <Textblock>
        <Bookmark level="1" select="'My image collection'"/>
        <Paragraph role="H1" parent="doc">
          <Value>My image collection</Value>
        </Paragraph>
      </Textblock>
    </PlaceObject>

    <PlaceObject>
      <Image parent="doc"
             description="an ocean"
             file="ocean.pdf"
             width="4cm" />
    </PlaceObject>
  </Record>
</Layout>

Hierarchical structure

By default, the role "Document" is defined with the ID doc. This corresponds to the following structure:

<StructureElement role="Document" id="doc" />

There are two ways to define further structures. Either these are defined in a nested manner (both examples create the same structure):

<StructureElement role="Document">
	<StructureElement role="Art">
		<StructureElement role="Sect" id="sect1a1"/>
		<StructureElement role="Sect" id="sect2a1"/>
	</StructureElement>
	<StructureElement role="Art">
		<StructureElement role="Sect" id="sect1a2"/>
		<StructureElement role="Sect" id="sect2a2"/>
	</StructureElement>
</StructureElement>

or you can use the parent attribute:

<StructureElement role="Document" id="doc" />
<StructureElement role="Art" id="art1" parent="doc" />
<StructureElement role="Art" id="art2" parent="doc" />
<StructureElement role="Sect" id="sect1a1" parent="art1"/>
<StructureElement role="Sect" id="sect2a1" parent="art1"/>
<StructureElement role="Sect" id="sect1a2" parent="art2"/>
<StructureElement role="Sect" id="sect2a2" parent="art2"/>

This allows the document to be structured dynamically based on the data.

The id attribute is necessary to specify the parent structure for paragraphs and images. In the example above, both elements are output in the Sect section.

Roles (tag names)

The following role names (tags) are defined in the speedata Publisher. Others can be added on request:

Art, Artifact, Div, Document, Figure, H1, H2, H3, H4, H5, H6, Lbl, Link, P, Part, Sect, Span, Table, TD, TH, TOC, TOCI, TR.

Order of the structure elements

The hierarchical structure is created in the order in which they are output in the Publisher. This means that objects that are output first appear higher in the structure. However, if you want to output an H1 heading after other objects, for example, it will appear further down in the document structure. To change the order, you can specify the position in the command (attribute structpos). The specification top (or equivalent 1) inserts the paragraph at the first position in the structure, below the specified parent ID (section in the example below). Alternatively, a different position can be specified, but it must be smaller than the number of current child elements.

‘Unimportant’ texts and images

Texts and images that should not appear in the hierarchy can be marked as an artifact (role).

Alternative texts and image descriptions

Images should be provided with a description so that this appears in the hierarchy instead of the image. For paragraphs, it is also possible to insert the ‘real text’ (actual text). This is displayed in the hierarchy instead of the text within <Value>.

From the examples repository

This layout can also be found in the examples repository and creates a PDF/UA-compliant PDF that fulfils the requirements for accessibility. Below the top level Document there is a section Sect, which contains a heading, a paragraph without special features and a paragraph with a hyperlink, as well as an image.

(Run the speedata Publisher with sp --dummy, as variable data is not used here).

<Layout xmlns="urn:speedata.de:2009/publisher/en"
	xmlns:sd="urn:speedata:2009/publisher/functions/en">
  <PDFOptions format="PDF/UA" />

  <StructureElement role="Document">
    <StructureElement role="Sect" id="section" />
  </StructureElement>

  <Record element="data">
    <PlaceObject>
      <Textblock>
        <Paragraph role="H1" parent="section">
          <B>
            <Value>A very short story</Value>
          </B>
        </Paragraph>
        <Paragraph role="P" parent="section">
          <Value>Once upon a time....</Value>
        </Paragraph>
        <Paragraph role="P" parent="section">
          <Value>This is a </Value>
          <A href="https://www.speedata.de"
             description="link to speedata.de">
            <Value>link to speedata.de</Value>
          </A>
          <Value>.</Value>
        </Paragraph>
      </Textblock>
    </PlaceObject>
    <PlaceObject>
      <Image
          width="8"
          file="ocean.pdf"
          parent="section"
          description="An image of an ocean" />
    </PlaceObject>
  </Record>
</Layout>

The output from the layout above is as expected.

ay11output

Various tools can be used to check the structure of the document:

ay11structure
The accessibility checker outputs exactly the specified structure. The b-tag in the heading is not displayed in the structure.
ay11acrobat
In addition to a detailed review, Adobe Acrobat also provides a visual view of the structure.

You can use pdfuaanalyze to display the structure as an XML tree.

<Document>
  <Sect>
    <H1></H1>
    <P></P>
    <P>
      <Link></Link>
    </P>
    <Figure></Figure>
  </Sect>
</Document>

Checking the document

The following programs can be used to check accessibility:

Known limitations

The following limitations are known and will be fixed as soon as possible:

  • Output/Text does not support accessibility
  • SavePages/InsertPages creates an incorrect structure hierarchy

The current issues can be viewed on GitHub.