To PDF, or not to PDF, that is the question.
On first consideration it is an easy answer. Content should be published in an open and accessible form, and that means publishing content as HTML and avoiding binary, potentially inaccessible formats such as PDF. The issue, however, is more nuanced, as there are circumstances where a PDF may still be appropriate.
This article looks at those nuances by reviewing the accessibility context and then offers some recommended approaches to authoring content.
Do Australian Government sites need to be accessible?
Yes. The Disability Discrimination Act requires that people with a disability be given equal access where it can reasonably be provided. Further, Australia is a signatory to the Convention on the Rights of Persons with Disabilities, which asserts rights for people with disability. Signatories need to “take all appropriate measures to ensure that persons with disabilities can exercise the right to freedom of expression and opinion, including the freedom to seek, receive and impart information and ideas on an equal basis with others and through all forms of communication of their choice”.
What does the Australian Government Style Manual say about PDFs?
The Style Manual is very firm: HTML should be used rather than PDFs. It notes that PDFs often create problems for users, including:
- Scalability – PDFs don't reflow to fit the user's screen or browser windows.
- Speed – PDF files are much larger than optimised HTML pages. They can be harder to use by people with slow internet access.
- Navigation – you can't link from a page to a section of a PDF. PDFs can also cause disorientation when they open in a new tab or different tool.
- Search engine optimisation – very often PDFs aren't tagged. This makes it hard for search engines to find the content.
- Maintenance – people download and share PDFs, making it hard for you to ensure that people are using up-to-date content.
Are PDFs accessible?
It is worth noting that PDFs are not necessarily inaccessible. It is possible to produce accessible PDFs if a structured approach is taken and tools are used to add accessibility features. See for example documentation from the Australian Government Style Manual and Adobe.
Whilst PDFs can be accessible, the vast majority of them will not be. This is generally because the PDF is the end result of a publishing pipeline which doesn’t support the accessibility options. For example, the Print to PDF function of browsers generally will not produce a fully accessible PDF, and likewise most automated PDF generation tools will not produce accessible PDFs.
Practically, editors generally will not be working on the PDF to make it accessible post-production. Thought should therefore be given as to whether editing PDFs for accessibility is feasible.
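One way to gauge feasibility is to triage an existing PDF collection first. The following is a rough sketch only, not a conformance check: it assumes a hypothetical helper, looks_tagged, that scans the raw bytes of a file for the two markers a tagged PDF declares in its catalog (a /StructTreeRoot entry and /Marked true). PDFs that store the catalog in a compressed object stream would be missed, and a tagged PDF is not automatically an accessible one.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes contain both tagged-PDF markers.

    This is an illustrative helper, not part of any PDF library.
    It does not decompress object streams, so it can report False
    for a PDF that is in fact tagged.
    """
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked


def looks_tagged_file(path: str) -> bool:
    """Convenience wrapper that reads a file from disk and applies the heuristic."""
    with open(path, "rb") as handle:
        return looks_tagged(handle.read())
```

Files that fail this check are likely candidates for conversion to HTML or for remediation in a dedicated PDF tool; files that pass still need a manual accessibility review.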
Common scenarios where PDFs are being used
PDF usage appears to be prominent in a few common scenarios.
DTP documents
Desktop publishing solutions can be used for reports and manuals when the documents are:
- Large
- Complex
- Impactful.
These are the most difficult documents to move away from PDF, as their size and complexity make them hard to convert. They are strong candidates for remaining as PDFs; however, consideration should be given to how they could be converted to HTML.
Binary documents
Many organisations arrange their content in binary files in a Document Management System, located on an internal server or in the cloud. These documents are closely revisioned and highly secure. They are often simply exported out to the CMS and used in their binary format (PDF, Word, Excel). In this paradigm there is no concept of creating accessible formats as the binary file is seen as the canonical form. This is probably the most serious scenario for organisations looking to move from PDF or other binary formats.
Offline documents
In some cases an HTML format is available, but there is a perceived need for the user to download an “offline” version. Users may have poor internet access or may prefer to rely on a “static” version on their own file system for review, annotation or bookmarking. This can happen with large manuals which have traditionally been printed, where users are accustomed to relying on their own copy.
Publishing pipeline
The publishing pipeline is the core consideration for how content is represented and presented to the end user. The decision to publish as PDF or HTML will often be predetermined by the publishing pipeline that is already in place. That is, the technology and processes will drive the available end format.
Canonical content format
What is the source of the document? Common formats include HTML, text and Word. More structured but less common formats include DocBook or some other XML variant. More presentational formats include DTP file types. The type of file will naturally determine how it may be published. In some cases (HTML, text, Word, DocBook) the content could be transformed to HTML with varying degrees of success. In others (Word, DTP), the natural output format will be a PDF. In the latter case there is no viable alternative to publishing a PDF.
Content storage
The source content will tend to be stored in one of two places: the Content Management System (CMS) or an internal Document Management System (DMS). In the case of the CMS, web editors will generally be in control of content editing and flow. For the DMS, it is more likely that domain experts will be in control of the documents. The DMS solution is much more likely to lead to the republishing of binary formats.
The pain points
Pain points tend to arise in the following areas.
CMS web team become the transformers
Oftentimes the web team is the sole gatekeeper to content on the web. Content is authored by domain experts, reviewed by decision-makers and then passed to the content team to add to the site on a specific date. Publishing may actually mean manually converting from the binary format (Word) to HTML. This is painful for the content team and leads to effort, cost and potential mistakes. Content moderation then becomes mostly a matter of reviewing the conversion, rather than the substance.
In these cases, publishing the PDF or Word file can appear to be more cost-effective, as there is no need to expend effort on conversion and review.
CMS domain experts do not feel comfortable with the tools
An alternative approach is to have the domain experts author the content in the CMS. This approach is much rarer in government mainly due to the expense of training domain experts. There is also the risk of inconsistency in outcomes where best practices are not followed.
Once again, this situation will lead to PDF being a favoured option from an expense perspective.
The CMS is a poor reflection of the DMS
The final pain point is where content is not managed in the CMS at all and primacy is handed over to the DMS. The binary file is king. Users will download PDFs as the sole means of accessing the content. In this scenario, the CMS just provides a thin wrapper around the file as Media or perhaps a Resource content type.
This solution is cost-effective but antagonistic to users and fails the required accessibility standards.
Take a painkiller: Define the flow
There are two practical ways to resolve the pain cleanly, plus a workaround for when it is needed.
Make the CMS the canonical source
The most direct and effective solution is to make the CMS the canonical source of content. Workflow is used for revisions and proofing. No PDF option is provided. This approach can work for sites with an effective editorial team or when there is a tight set of domain experts who understand the tools.
Invest in a semantic source format with transformations
For more serious content publishing needs, the organisation should consider a content publishing pipeline built around a single source document, with toolchains that transform that source for each channel (DTP, web, marketing). A simple approach would be to use Word with a tight semantic style guide. More advanced tools may use XML or some other format to define the structure needed by the organisation.
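The transformation step for the Word approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a complete converter: it assumes the paragraphs have already been extracted from the document as (style name, text) pairs, and the STYLE_TO_TAG mapping and to_html function are hypothetical names standing in for an organisation's own style guide and tooling.

```python
from html import escape

# Hypothetical style guide: the only paragraph styles authors may use,
# and the HTML element each one maps to.
STYLE_TO_TAG = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Heading 3": "h3",
    "Normal": "p",
    "List Bullet": "li",
}


def to_html(paragraphs):
    """Convert (style name, text) pairs to semantic HTML.

    Raises ValueError for any style outside the style guide, so that
    ad hoc formatting is caught before it reaches the site.
    """
    out = []
    in_list = False
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            raise ValueError(f"Style not in the style guide: {style!r}")
        # Wrap consecutive bullet paragraphs in a single <ul>.
        if tag == "li" and not in_list:
            out.append("<ul>")
            in_list = True
        elif tag != "li" and in_list:
            out.append("</ul>")
            in_list = False
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)
```

Rejecting unknown styles, rather than guessing at them, is the design choice that makes the style guide "tight": the conversion either succeeds cleanly or tells the author what to fix.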
Resort to PDF when needed
There will be times when a transformation to HTML is not feasible, or “reasonable”. In these cases, using PDF is an option.
Take a vitamin: Better HTML
If the decision has been made to drop the use of PDFs then there are several ways to keep things simple inside the CMS.
Semantic HTML
Keeping things simple is the best approach. The content needs to be reviewed for structural aspects so that they can be mapped across to the standard HTML elements: headings, paragraphs, lists, links, tables and embedding of media. This part is easy and a no-brainer for most content needs.
Difficulties can arise with more demanding content such as:
- large tables
- fancy tables
- footnotes and references
- components (with multiple elements and classes).
Paragraphs or WYSIWYG?
In Drupal, the Paragraphs module allows for the embedding of complex components with their own data, presentation and functionality. They are generally used for things such as lists of cards, complex layouts, galleries, sliders and the like.
These components can help tell a story; however, the use of Paragraphs does require a mental shift for editors. Content becomes a series of components, rather than a flow of content and markup. This change in perspective can be challenging for editors, especially for those with a “document” background where the written word is the prime focus, as opposed to the presentation.
If document flow is a priority, as it will be in most cases, a WYSIWYG-based solution is preferred. Drupal supports the concept of WYSIWYG templates, which define an HTML structure with slots into which the editor places content. Using templates in this way allows the editor to stay in the flow and also ensures that all content remains in the body field.
The Paragraphs module is still a viable alternative, especially for presentation-rich landing pages. However, for content-focused pages, staying in the WYSIWYG is the preferred option.
Navigation
In Drupal, the Book module is a much-loved solution for complex documents. It allows editors to produce deeply nested “book pages” for a single book. Typical cases include annual reports, manuals and other forms of demanding content. This kind of content is also typically produced as PDF because of its length and complexity.
In these cases, navigation becomes a problem. The Book module supplies its own hierarchy, and this naturally means that the content is not all available in one place. It is possible to concatenate Book pages into one page; however, this solution can be difficult to execute well.
If content can be consolidated into a single, long page with helpful in-page navigation links, the challenge can be overcome. For example, JavaScript libraries such as Anchorific can offer a solution. More advanced solutions would include moving the navigation to a side panel with its own scrollbar.
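The same in-page navigation can also be generated on the server before the page is rendered. The sketch below is an illustration only: add_in_page_nav and slugify are hypothetical helpers, and the code assumes section headings appear as plain <h2> elements with no attributes. A production version would use a proper HTML parser rather than regular expressions.

```python
import re

# Assumption: section headings are written as <h2>...</h2> with no attributes.
HEADING = re.compile(r"<h2>(.*?)</h2>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def slugify(text):
    """Turn heading text into a lowercase, hyphen-separated fragment id."""
    plain = TAG.sub("", text).lower()
    slug = re.sub(r"[^a-z0-9]+", "-", plain).strip("-")
    return slug or "section"


def add_in_page_nav(body_html):
    """Give each <h2> an id and prepend a list of links to those ids."""
    seen = {}
    links = []

    def add_id(match):
        inner = match.group(1)
        slug = slugify(inner)
        # Number repeated headings so their ids do not clash.
        seen[slug] = seen.get(slug, 0) + 1
        if seen[slug] > 1:
            slug = f"{slug}-{seen[slug]}"
        label = TAG.sub("", inner).strip()
        links.append(f'<li><a href="#{slug}">{label}</a></li>')
        return f'<h2 id="{slug}">{inner}</h2>'

    body = HEADING.sub(add_id, body_html)
    nav = '<nav aria-label="On this page"><ul>' + "".join(links) + "</ul></nav>"
    return nav + "\n" + body
```

Because the links are ordinary anchors, they keep working when JavaScript is unavailable, and the heading ids also let other pages link directly to a section.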
Print styles
HTML content needs to be presented well for users who wish to print it. The Print dialogue in the browser will allow users to print or save as a PDF. This is by far the most practical way for users to capture the presentation of HTML content. Site themers should therefore pay close attention to the way the print stylesheet has been implemented.
Best practice approaches include:
- Removing extraneous elements such as header, footer and sidebar.
- Adjusting font sizes to be suitable for print.
- Ensuring hidden elements (accordions, captions) are shown.
- Ensuring all elements have correct colour contrast, especially content that was styled as light-on-dark or dark-on-light on screen.
This is something to be aware of during QA, as it is an often-neglected part of site testing.
Conclusion
This article has made a strong case for presenting content as HTML and has provided several recommendations as to how to achieve a positive outcome for all forms of content, including content that is long, complex and presentation-rich. Taking an HTML approach will ensure that content remains as accessible as possible to the majority of users.
It is not always going to be possible to easily produce the HTML content. In situations where the internal publishing pipeline is predominantly based around binary files, there may be challenges in delivering HTML content to the site. In cases where it is possible to transform documents from one format to another, there is a good chance that a publishing pipeline will ease the pain for content editors. For example, a Word document using only semantic markup could be used as a source for the web.
Ultimately the decision to use a PDF as the prime format should not be taken lightly. There are several considerations to take into account including the publishing pipeline, editor training and editor resourcing. Each of these will have consequences for costs and organisational functions.
To PDF, or not to PDF? Hopefully, the answer is a little clearer.