Accessibility / tagged PDF
What is tagged PDF?
Tagged PDF is an extension of the PDF file with a logical structure of the content. The tags are used to determine the role of the object to be output, e.g. level 1 heading, a paragraph or an image. This is mainly used to create accessible PDFs, i.e. PDFs that are optimized for screen readers or similar output devices.
For accessible PDFs, not only the technical side must be taken into account (this is what the speedata Publisher is for), but also the logical structure must be created sensibly. For example, images must be output with an alternative text, otherwise their use is restricted.
Example
In the following example, the document is not further divided into sub structures. A heading (H1) and an image (Figure) are output in this section.
<Layout xmlns="urn:speedata.de:2009/publisher/en"
xmlns:sd="urn:speedata:2009/publisher/functions/en">
<PDFOptions format="PDF/UA" />
<StructureElement role="Document" id="doc" />
<Record element="data">
<PlaceObject>
<Textblock>
<Bookmark level="1" select="'My image collection'"/>
<Paragraph role="H1" parent="doc">
<Value>My image collection</Value>
</Paragraph>
</Textblock>
</PlaceObject>
<PlaceObject>
<Image parent="doc"
description="an ocean"
file="ocean.pdf"
width="4cm" />
</PlaceObject>
</Record>
</Layout>
Hierarchical structure
By default, the role "Document" is defined with the ID doc
. This corresponds to the following structure:
<StructureElement role="Document" id="doc" />
There are two ways to define further structures. Either these are defined in a nested manner (both examples create the same structure):
<StructureElement role="Document">
<StructureElement role="Art">
<StructureElement role="Sect" id="sect1a1"/>
<StructureElement role="Sect" id="sect2a1"/>
</StructureElement>
<StructureElement role="Art">
<StructureElement role="Sect" id="sect1a2"/>
<StructureElement role="Sect" id="sect2a2"/>
</StructureElement>
</StructureElement>
or you can use the parent attribute:
<StructureElement role="Document" id="doc" />
<StructureElement role="Art" id="art1" parent="doc" />
<StructureElement role="Art" id="art2" parent="doc" />
<StructureElement role="Sect" id="sect1a1" parent="art1"/>
<StructureElement role="Sect" id="sect2a1" parent="art1"/>
<StructureElement role="Sect" id="sect1a2" parent="art2"/>
<StructureElement role="Sect" id="sect2a2" parent="art2"/>
This allows the document to be structured dynamically based on the data.
The id
attribute is necessary to specify the parent structure for paragraphs and images. In the example above, both elements are output in the Sect
section.
Roles (tag names)
The following role names (tags) are defined in the speedata Publisher. Others can be added on request:
Art, Artifact, Div, Document, Figure, H1, H2, H3, H4, H5, H6, Lbl, Link, P, Part, Sect, Span, Table, TD, TH, TOC, TOCI, TR.
Order of the structure elements
The hierarchical structure is created in the order in which they are output in the Publisher. This means that objects that are output first appear higher in the structure. However, if you want to output an H1 heading after other objects, for example, it will appear further down in the document structure. To change the order, you can specify the position in the command (attribute structpos
). The specification top
(or equivalent 1
) inserts the paragraph at the first position in the structure, below the specified parent ID (section in the example below). Alternatively, a different position can be specified, but it must be smaller than the number of current child elements.
‘Unimportant’ texts and images
Texts and images that should not appear in the hierarchy can be marked as an artifact (role).
Alternative texts and image descriptions
Images should be provided with a description so that this appears in the hierarchy instead of the image. For paragraphs, it is also possible to insert the ‘real text’ (actual text). This is displayed in the hierarchy instead of the text within <Value>
.
From the examples repository
This layout can also be found in the examples repository and creates a PDF/UA-compliant PDF that fulfils the requirements for accessibility. Below the top level Document there is a section Sect, which contains a heading, a paragraph without special features and a paragraph with a hyperlink, as well as an image.
(Run the speedata Publisher with sp --dummy
, as variable data is not used here).
<Layout xmlns="urn:speedata.de:2009/publisher/en"
xmlns:sd="urn:speedata:2009/publisher/functions/en">
<PDFOptions format="PDF/UA" />
<StructureElement role="Document">
<StructureElement role="Sect" id="section" />
</StructureElement>
<Record element="data">
<PlaceObject>
<Textblock>
<Paragraph role="H1" parent="section">
<B>
<Value>A very short story</Value>
</B>
</Paragraph>
<Paragraph role="P" parent="section">
<Value>Once upon a time....</Value>
</Paragraph>
<Paragraph role="P" parent="section">
<Value>This is a </Value>
<A href="https://www.speedata.de"
description="link to speedata.de">
<Value>link to speedata.de</Value>
</A>
<Value>.</Value>
</Paragraph>
</Textblock>
</PlaceObject>
<PlaceObject>
<Image
width="8"
file="ocean.pdf"
parent="section"
description="An image of an ocean" />
</PlaceObject>
</Record>
</Layout>
The output from the layout above is as expected.
Various tools can be used to check the structure of the document:
You can use pdfuaanalyze to display the structure as an XML tree.
<Document>
<Sect>
<H1></H1>
<P></P>
<P>
<Link></Link>
</P>
<Figure></Figure>
</Sect>
</Document>
Checking the document
The following programs can be used to check accessibility:
- PAC (PDF accessibility checker)
- Adobe Acrobat
- Vera PDF
- pdfuaanalyze shows the structure of the document as XML.
Known limitations
The following limitations are known and will be fixed as soon as possible:
- Output/Text does not support accessibility
- SavePages/InsertPages creates an incorrect structure hierarchy
The current issues can be viewed on GitHub.