How and why I am building Multiformat

Published Feb 13, 2018. Last updated May 23, 2018.

The problem I wanted to solve

When documents are generated dynamically as PDFs, thumbnail or preview images are sometimes required as well. Tools like ReportLab generate excellent vector and text-based PDF files but cannot generate the same document in an image format. Imaging packages like Pillow can generate both images and PDF files, but those PDF files are in a low-quality raster format.

Creating thumbnails or preview images for documents often requires either generating a PDF and then converting it to an image format, or building two separate files in parallel. Both approaches are inefficient, and it is often difficult to create images that match the PDF version of the document.

The goal of this project is to create a single class that can be used to generate both high-quality vector and text-based PDF files and identical images.

What is Multiformat?

Multiformat is a Python package that can generate documents in multiple formats, including PDF, PNG, GIF, and JPEG. The PDF files it creates are high-quality vector and text-based files. Documents in every format are built using a single set of methods found in the “Document” class. After elements are added to the document, any number of files in multiple formats can be created from that single document object.

Here is an example that creates a simple two-page PDF and a PNG of the first page:

from multiformat.multiformat import Document

# Create a new document
document = Document(document_size='A4', layout='portrait')
# Add a green background
document.draw_rectangle(
    x=0,
    y=0,
    w=document.w,
    h=document.h,
    fill_color=(29, 179, 97),
    border_color=(0, 0, 0),
    border_width=0)
# Add "Hello World" to the bottom left.
document.draw_string(
    string="Hello World",
    x=100,
    y=document.h - 100,
    alignment="left",
    font="OpenSans-Bold",
    size=100,
    color=(255, 255, 255))
# Add an orange line from the top-left to the bottom-right
document.draw_line(
    x=0, y=0, x1=document.w, y1=document.h, width=50, color=(251, 176, 64))
# Insert page break
document.insert_page_break()
# Add an orange line from the top-right to the bottom-left
document.draw_line(
    x=document.w, y=0, x1=0, y1=document.h, width=50, color=(251, 176, 64))
# Generate the document as a PDF
document.generate_pdf(file_name="example")
# Generate page 1 of the document as a PNG no larger than 1000,1000
document.generate_image(
    file_name="example", image_format="PNG", size=(1000, 1000), page=1)

Tech stack

Multiformat is written in Python 3. It uses the ReportLab open-source PDF Toolkit to generate the text and vector-based PDF files. Pillow is used for generating the image files.

Tests are written with the pytest framework. Travis CI runs the test suite automatically for the repository, and codecov.io generates the code coverage reports.

Files can be created in a directory or returned as a file object that should be compatible with web frameworks like Django.

The process of building Multiformat

After deciding on the technology, the next step was to decide how text and other page elements would be added to a document before the files are generated. After trying other options, I determined that storing document element objects in a list was the best method. This permits files to be created many times, in multiple formats, from a single document object.
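The list-of-elements idea can be illustrated with a minimal sketch. This is not Multiformat's actual implementation: `LineElement`, `SketchDocument`, `render`, and the two `describe_*` callbacks are hypothetical names chosen for illustration, and the callbacks return strings instead of calling ReportLab or Pillow.

```python
from dataclasses import dataclass, field


@dataclass
class LineElement:
    """Hypothetical record of one line, stored in the document's own units."""
    x: float
    y: float
    x1: float
    y1: float
    width: float
    color: tuple


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document that records elements in a list."""
    elements: list = field(default_factory=list)

    def draw_line(self, x, y, x1, y1, width, color):
        # Nothing is drawn yet; the request is only recorded.
        self.elements.append(LineElement(x, y, x1, y1, width, color))

    def render(self, draw_element):
        # Replay every stored element through the given backend callback.
        return [draw_element(element) for element in self.elements]


def describe_for_pdf(element):
    # Placeholder for a real PDF backend call.
    return f"pdf line to ({element.x1}, {element.y1})"


def describe_for_image(element):
    # Placeholder for a real image backend call.
    return f"image line to ({element.x1}, {element.y1})"


document = SketchDocument()
document.draw_line(x=0, y=0, x1=2100, y1=2970, width=50, color=(251, 176, 64))

# The same stored elements can be rendered as often as needed.
pdf_ops = document.render(describe_for_pdf)
image_ops = document.render(describe_for_image)
```

Because drawing is deferred until `render` is called, the stored list can be replayed once per output format without asking the caller to describe the document again.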

The next step was to develop a unit and coordinate system. Elements need to occur at exactly the same relative position on images and PDF files, but ReportLab and Pillow use different coordinate systems. A top-down coordinate system was selected, with the origin at the top-left of the page and each unit representing 0.01 centimeters on a printed page or 1 pixel on an image. Image resolution can be increased or decreased when the images are created.
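To make the unit system concrete, here is a rough sketch of the conversions it implies. The function names are hypothetical and this is not necessarily how Multiformat does it internally. It assumes ReportLab's default canvas, which measures in points (1/72 inch) from the bottom-left corner, and an A4 page of 21.0 cm by 29.7 cm, which is 2100 by 2970 units.

```python
# One document unit is 0.01 cm; one inch is 2.54 cm and holds 72 points.
POINTS_PER_UNIT = 72 / 2.54 / 100


def to_pdf_point(x, y, page_height):
    """Convert top-down document units to bottom-up PDF points."""
    return (x * POINTS_PER_UNIT, (page_height - y) * POINTS_PER_UNIT)


def fit_scale(page_width, page_height, max_size):
    """Largest scale at which the page still fits inside max_size pixels."""
    max_width, max_height = max_size
    return min(max_width / page_width, max_height / page_height)


def to_image_pixel(x, y, scale=1.0):
    """Convert document units to image pixels; both are already top-down."""
    return (round(x * scale), round(y * scale))


A4_WIDTH, A4_HEIGHT = 2100, 2970  # A4 portrait in 0.01 cm units

# The top-left corner of the page sits at the top of the PDF canvas.
top_left_pdf = to_pdf_point(0, 0, A4_HEIGHT)

# Scale needed for an image no larger than 1000 x 1000 pixels.
scale = fit_scale(A4_WIDTH, A4_HEIGHT, (1000, 1000))
bottom_right_px = to_image_pixel(A4_WIDTH, A4_HEIGHT, scale)
```

Only the PDF conversion needs to flip the y axis. The image conversion is a plain scale, which is one reason a top-down system is convenient here.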

From this point, each page element method added to Multiformat needs to be built so that it creates an identical element in Pillow and ReportLab. This involves selecting the best set of methods to use in each of those packages. Next, coordinates, sizes, and other values are converted so that both packages produce identical elements. Methods have been created for text strings, lines, circles, and rectangles.

Challenges

Many of the challenges in this project are the same ones developers have always faced when generating multiple formats of the same document in parallel. Each method must produce the same result whether the document is generated as an image or as a PDF, so the many differences between imaging and PDF generation packages must be compensated for. These differences include coordinate systems, units, color codes, text alignments, fonts, and likely many more that I have not yet encountered.
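Color codes are a small example of such a difference. The example earlier in this post passes colors as 0-255 integer tuples, which Pillow accepts directly, while ReportLab's RGB color setters take floats between 0 and 1. A conversion along these lines would bridge the two. The helper name is hypothetical and this is only a sketch, not Multiformat's code.

```python
def to_reportlab_rgb(color):
    """Convert a 0-255 integer RGB tuple to 0-1 floats."""
    return tuple(channel / 255 for channel in color)


white = to_reportlab_rgb((255, 255, 255))
orange = to_reportlab_rgb((251, 176, 64))
```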

Final thoughts and next steps

The Multiformat package currently only supports a few page elements, including single-line text, rectangles, and lines. Additional elements will be required for it to meet the needs of most projects.
