HTML to PDF with IText 2

The collection of open-source libraries written in Java that can convert HTML templates into HTML documents is quite scarce. It shrinks further, especially after filtering out dependencies on restrictive (A)GPL licenses. One of the most popular libraries in this context is IText. In the dated version 4.2.2, the license change from MPL to GPL gave rise to many alternatives, like OpenPDF under the less restrictive (when used as a library) LGPL license.

Despite its old age, you will still find the use of the older version of IText 2.1.7, for example, in the popular JasperReports library. In legacy projects, you will often have it in your dependencies list. Below you will see how to generate simple PDF documents from HTML form and common problems of the dated version of IText.

IText 2.1.7 and HTML to PDF conversion

Two alternative ways to generate a PDF document from a template is either a com.lowagie.text.html.simpleparser.HTMLWorker (HTML4) or com.lowagie.text.html.HtmlParser (XHTML). The base model for both parsers is the IText Document. With its creation, you can define a document size and standard margins.

To generate a PDF file, you need to provide a document and the output stream via the getInstance(Document, OutputStream) factory of the PdfWriter class. Then the HTML template can be passed through one of the parsers, which finishes writing when you close the document.

package dev.termian.itextdemo;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.PageSize;
import com.lowagie.text.html.HtmlParser;
import com.lowagie.text.html.simpleparser.HTMLWorker;
import com.lowagie.text.pdf.PdfWriter;

import java.awt.Desktop;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.Objects;

public class Main {
    public static void main(String[] args) throws IOException, DocumentException {
        Document pdfDocument = new Document(PageSize.A4);
        File outputPdf = File.createTempFile("output-%s".formatted(String.valueOf(System.currentTimeMillis())), ".pdf");
        try (InputStream inputHtmlStream = Objects.requireNonNull(Thread.currentThread().getContextClassLoader().getResourceAsStream("input.html"));
             Reader inputHtmlReader = new InputStreamReader(inputHtmlStream);
             OutputStream outputPdfStream = new FileOutputStream(outputPdf)) {
            PdfWriter.getInstance(pdfDocument, outputPdfStream);
            pdfDocument.open();
            HTMLWorker htmlWorker = new HTMLWorker(pdfDocument);
            htmlWorker.parse(inputHtmlReader);
            pdfDocument.close();
        }
        Desktop.getDesktop().open(outputPdf);
    }
}

In the example above, I convert the input.html file from the classpath path (e.g., added as a resource from src/main/resources/input.html). Alternatively, you can use java.io.StringReader to pass the HTML template as a String. You can make it completely in-memory by replacing the output file stream with java.io.ByteArrayOutputStream.

Instead of HTML4 conversion via HTMLWorker, you can invoke XHTML conversion through the HtmlParser.parse(pdfDocument, inputHtmlReader) method. Going this way, you will have more styling options, which I will describe in a moment.

CVE-2017-9096

The IText 2.1.7 version is affected by the CVE-2017-9096 vulnerability that belongs to the OWASP security misconfiguration category. With standard settings, the library is vulnerable to an XXE (XML External Entity) injection attack. To illustrate the attack, let's assume that the user can provide any XHTML template, e.g.:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd" >]>
<html>
<body>
&xxe;
</body>
</html>

The DTD evaluation will cause the resulting document to contain the contents of the /etc/passwd Unix system file:

Presentation of the XXE attack using IText 2 – a document containing data from a system file

To protect against this attack, let's check what interfaces are used to parse the XML. The answer can be found in the HtmlParser class that inherits from an XmlParser. It's the SAXParser. Using OWASP cheat sheets that suggest how to create a secure parser, verify the guidelines in conjunction with the implementation used in your project. You will find it out by calling SAXParserFactory.newInstance().getClass().getName().

Now, instead of using the HtmlParser, we adapt the solution to getSecureSAXParser().parse(inputHtmlStream, new SAXmyHtmlHandler(pdfDocument)). As a result, for this attack, we expect a parsing error, e.g.:

Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Whitespaces in the table and the "java.lang.ClassCastException: com.lowagie.text.Table"

Another problem with the old version of IText is its sensitivity to whitespaces in XHTML templates. The error manifests when generating tables containing these characters, as reported for OpenPDF. If you're not going to stick to proper formatting, take a look at the fix to the SAXiTextHandler class. It is the base for the previously used SAXmyHtmlHandler.

A workaround solution that doesn't require you to modify the library code is to ignore empty elements. Add this logic by overriding the handleStartingTags and handleEndingTags methods of the handler:

public class Main {
    public static void main(String[] args) throws Exception {
        //...
        getSecureSAXParser().parse(inputHtmlStream, new SAXmyHtmlHandler(pdfDocument) {
            @Override
            public void handleStartingTags(String name, Properties attributes) {
                if (currentChunk != null && currentChunk.getContent() != null && currentChunk.getContent().trim().isEmpty()) {
                    currentChunk = null;
                }
                super.handleStartingTags(name, attributes);
            }

            @Override
            public void handleEndingTags(String name) {
                if (currentChunk != null && currentChunk.getContent() != null && currentChunk.getContent().trim().isEmpty()) {
                    currentChunk = null;
                }
                super.handleEndingTags(name);
            }
        });
        //...
    }
}

BouncyCastle dependency update

BouncyCastle is a library that extends cryptographic features available in Java. IText uses it to create secured documents. Unfortunately, older versions of BouncyCastle have several security vulnerabilities detected.

To protect against potential attacks, I recommend getting a patched library from Jaspersoft (e.g., 2.1.7.js10 or newer). This version is used, among others, in the JasperReports report generation library. However, you will not find it in the central repository. Some maven configuration will be necessary:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <!--...-->
    <repositories>
        <repository>
            <id>jaspersoft</id>
            <url>https://jaspersoft.jfrog.io/jaspersoft/third-party-ce-artifacts/</url>
            <releases/>
        </repository>
        <!--...-->
    </repositories>
    <dependencies>
        <dependency>
            <groupId>com.lowagie</groupId>
            <artifactId>itext</artifactId>
            <version>2.1.7.js10</version>
        </dependency>
        <!--...-->
    </dependencies>
    <!--...-->
</project>

In the case of the standard distribution, the update path is quite complicated. It consists of excluding transitive dependencies (groupId has been changed), adding current BouncyCastle libraries, and adapting the code of a few IText classes.

Alternatively, consider whether you really need the functionality of PDF encryption (and reading). Maybe the solution is to remove dependencies or add some ArchUnit test that verifies you'll not use insecure classes.


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <!--...-->
    <dependencies>
        <dependencies>
            <dependency>
                <groupId>com.lowagie</groupId>
                <artifactId>itext</artifactId>
                <version>2.1.7</version>
                <exclusions>
                    <exclusion>
                        <groupId>*</groupId>
                        <artifactId>*</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
        </dependencies>
        <!--...-->
    </dependencies>
    <!--...-->
</project>

Elements and styles

The main disadvantage of both HTML parsers is the lack of support for the <style></style> tag. It gets instead rendered as-is in the document. In fact, this version has only limited support for built-in styles. Such functionality is available in newer versions, notably under the GPL license. Out of other libraries, the Flying Saucer (LGPL) based on OpenPDF implements a CSS 2.1 styling.

As for elements, both mentioned parsers implement similar subsets of HTML 4 tags. In this context, let's take a look at the supported attributes. They are the means of styling the elements in this version.

HTMLWorker

HTMLWorker supports the following elements (tags): ol, ul, li, a, pre, font, span, br, p, div, body, table, td, th, tr, i, b, u, sub, sup, em, strong, s, strike, h1, h2, h3, h4, h5, h6, img, hr.

Implements attributes:

align, size, before, after, encoding, face — text elements;
width, height, src, image_path — img;

The image_path attribute allows you to load an image from a system file (empty src="" attribute is required).

indent — ul, or;
width — table, hr;
align, valign, border, cellpadding, bgcolor, colspan, extraparaspace — tr, td.

Styles: font-family, font-size, font-style, font-weight, text-decoration, color, line-height, text-align, padding-left.

The face attribute and the font-family style allow you to use one of the 14 built-in fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats) or indicate a system true type font, e.g., style='font-family: "/System/Library/Fonts/Supplemental/Times New Roman.ttf"'. To list installed fonts on a Unix system, try using the fc-list command. Standard encoding is Cp1252. Unfortunately, font sizes are quite limited for the HTML4 parser.

You can also handle your own attributes with some additional work. Custom font factory or image factory (e.g., base64) implementations can be configured using the setInterfaceProps method.

SAXiTextHandler

SAXiTextHandler supports a very similar, yet bigger, set of tags: ol, ul, li, a, code, font, span, br, p, div, html, table, td, th, tr, i, b, u, sup, sub, em, strong, s, h1, h2, h3, h4, h5, h6, var, img, hr, annotation, itext, chapter, section, chunk, newpage.

The implemented attributes are much more extensive, the most important of them are:

encoding, embedded, font, size, fontstyle, red, green, blue, color, leading, itext, generictag — text elements;
pagesize, orientation, left, right, top, bottom — itext;
itext, localgoto, remotegoto, page, destination, localdestination, subscript, generictag, backgroundcolor — chunk;
name, href — a;
numbered, lettered, lowercase, autoindent, alignindent, first, listsymbol, indentationright, indentationleft, symbolindent — ol, ul;
horizontalalign, verticalalign, width, colspan, rowspan, leading, header, nowrap — td, th;
widths, columns, lastHeaderRow, align, cellspacing, cellpadding, offset, width, tablefitspage, cellsfitpage, convert2pdfp, borderwidth, left, right, top, bottom, red, green, blue, bordercolor, bgred, bggreen, bgblue, bgcolor, grayfill — table;
numberdepth, indent, indentationleft, indentationright — chapter, section;
src, align, underlying, textwrap, alt, absolutex, absolutey, plainwidth, plainheight, rotation — img;
llx, lly, urx, ury, title, content, url, named, file, destination, page — annotation.

Styles: font-family, font-size, font-style, font-weight, color.

This parser makes it easier to set up the document and new pages, style tables, arrange columns or cells and create links and annotations. On the other hand, the lack of specification requires reviewing the implementation to understand the use of non-standard attributes.

Summary

Having the IText library version 2.1.7 at your disposal, you can easily convert some simple HTML4 templates to PDF. However, this version lacks the implementation of cascading style sheets. Generating a satisfactory document is difficult, especially with the simpler HTMLWorker parser.

The XHTML template parser SAXmyHtmlHandler resolves (after securing) some flaws and allows styling with non-standard attributes. However, if you need a stable conversion or HTML5/CSS3 support, consider the latest version of the library, or look for other alternatives with a satisfactory license.