Collection of Recipes

A collection of recipes in “How-To” format for using PyMuPDF. We aim to extend this section over time. Where appropriate we will refer to the corresponding Wiki pages, but some duplication may still occur.


Images


How to Make Images from Document Pages

This little script will take a document filename and generate a PNG file from each of its pages.

The document can be any supported type like PDF, XPS, etc.

The script works as a command line tool which expects the filename to be supplied as a parameter. The generated image files (1 per page) are stored in the current working directory:

    import sys, fitz  # import the bindings
    fname = sys.argv[1]  # get filename from command line
    doc = fitz.open(fname)  # open document
    for page in doc:  # iterate through the pages
        pix = page.get_pixmap()  # render page to an image
        pix.save("page-%i.png" % page.number)  # store image as a PNG

The current working directory will now contain PNG image files named page-0.png, page-1.png, etc. The pictures have the dimensions of their pages with width and height rounded to integers, e.g. 595 x 842 pixels for an A4 portrait sized page. They will have a resolution of 96 dpi in x and y dimension and have no transparency. You can change all of that; the next sections explain how.


How to Increase Image Resolution

The image of a document page is represented by a Pixmap, and the simplest way to create a pixmap is via method Page.get_pixmap().

This method has many options to influence the result. The most important among them is the Matrix, which lets you zoom, rotate, distort or mirror the outcome.

Page.get_pixmap() by default will use the Identity matrix, which does nothing.

In the following, we apply a zoom factor of 2 to each dimension, which generates an image with four times as many pixels (and therefore about 4 times the size):

    zoom_x = 2.0  # horizontal zoom
    zoom_y = 2.0  # vertical zoom
    mat = fitz.Matrix(zoom_x, zoom_y)  # zoom factor 2 in each dimension
    pix = page.get_pixmap(matrix=mat)  # use 'mat' instead of the identity matrix

Since version 1.19.2 there is a more direct way to set the resolution: the parameter "dpi" (dots per inch) can be used in place of "matrix". To create a 300 dpi image of a page, specify pix = page.get_pixmap(dpi=300). Apart from notational brevity, this approach has the additional advantage that the dpi value is saved with the image file, which does not happen automatically when using the Matrix notation.
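The relationship between the two notations can be sketched in plain Python. This is only an illustration: it assumes the usual PDF convention that page coordinates are points of 1/72 inch, and the helper names zoom_for_dpi and pixel_size are hypothetical, not part of PyMuPDF. The library's own rounding of pixel dimensions may differ slightly.

```python
POINTS_PER_INCH = 72  # assumption: PDF page units are points (1/72 inch)


def zoom_for_dpi(dpi):
    """Return the zoom factor that corresponds to a target resolution."""
    return dpi / POINTS_PER_INCH


def pixel_size(page_width, page_height, dpi):
    """Estimate the image size in pixels for a page measured in points."""
    zoom = zoom_for_dpi(dpi)
    return round(page_width * zoom), round(page_height * zoom)


# An A4 portrait page measures about 595 x 842 points.
print(zoom_for_dpi(144))          # the zoom factor 2.0 used in the matrix example
print(pixel_size(595, 842, 300))  # estimated pixel size at 300 dpi
```

Under these assumptions, fitz.Matrix(zoom_for_dpi(300), zoom_for_dpi(300)) would request roughly the same pixel dimensions as dpi=300, but without the dpi value being stored in the image file.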


How to Create Partial Pixmaps (Clips)

You do not always need or want the full image of a page. This is the case e.g. when you display the image in a GUI and would like to fill the respective window with a zoomed part of the page.

Let’s assume your GUI window has room to display a full document page, but you now want to fill this room with the bottom right quarter of your page, thus using a four times better resolution.

To achieve this, define a rectangle equal to the area you want to appear in the GUI and call it “clip”. One way of constructing rectangles in PyMuPDF is by providing two diagonally opposite corners, which is what we are doing here.

_images/img-clip.jpg

    mat = fitz.Matrix(2, 2)  # zoom factor 2 in each direction
    rect = page.rect  # the page rectangle
    mp = (rect.tl + rect.br) / 2  # its middle point, becomes top-left of clip
    clip = fitz.Rect(mp, rect.br)  # the area we want
    pix = page.get_pixmap(matrix=mat, clip=clip)

In the above we construct clip by specifying two diagonally opposite points: the middle point mp of the page rectangle, and its bottom right, rect.br.


How to Zoom a Clip to a GUI Window

Please also read the previous section. This time we want to compute the zoom factor for a clip, such that its image best fits a given GUI window. This means that the image’s width or height (or both) will equal the window dimension. For the following code snippet you need to provide the WIDTH and HEIGHT of your GUI’s window that should receive the page’s clip rectangle.

    # WIDTH: width of the GUI window
    # HEIGHT: height of the GUI window
    # clip: a subrectangle of the document page
    # compare width/height ratios of image and window

    if clip.width / clip.height < WIDTH / HEIGHT:
        # clip is narrower: zoom to window HEIGHT
        zoom = HEIGHT / clip.height
    else:  # clip is broader: zoom to window WIDTH
        zoom = WIDTH / clip.width
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, clip=clip)

For the other way round, now assume you have the zoom factor and need to compute the fitting clip.

In this case we have zoom = HEIGHT/clip.height = WIDTH/clip.width, so we must set clip.height = HEIGHT/zoom and clip.width = WIDTH/zoom. Choose the top-left point tl of the clip on the page to compute the right pixmap:

    width = WIDTH / zoom
    height = HEIGHT / zoom
    clip = fitz.Rect(tl, tl.x + width, tl.y + height)
    # ensure we still are inside the page
    clip &= page.rect
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat, clip=clip)  # render the clip of this page
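The clip arithmetic can be checked with concrete numbers. The following is a minimal sketch using plain tuples instead of Rect objects; the function name clip_for_zoom and the sample values are illustrative assumptions, not PyMuPDF API.

```python
def clip_for_zoom(tl, win_width, win_height, zoom, page_rect):
    """Return (x0, y0, x1, y1) of the clip that fills the window at 'zoom'.

    tl is the chosen top-left point (x, y); page_rect is (x0, y0, x1, y1).
    The result is intersected with the page rectangle.
    """
    x0, y0 = tl
    x1 = x0 + win_width / zoom   # clip width is WIDTH / zoom
    y1 = y0 + win_height / zoom  # clip height is HEIGHT / zoom
    px0, py0, px1, py1 = page_rect
    return (max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1))


page_rect = (0, 0, 595, 842)  # A4 portrait page in points
print(clip_for_zoom((100, 200), 800, 600, 2.0, page_rect))
print(clip_for_zoom((400, 700), 800, 600, 2.0, page_rect))
```

In the second call the requested clip extends beyond the page and is cut at the page border, so the resulting image would be smaller than the window.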

How to Create or Suppress Annotation Images

Normally, the pixmap of a page also shows the page’s annotations. Occasionally, this may not be desirable.

To suppress the annotation images on a rendered page, just specify annots=False in Page.get_pixmap().

You can also render annotations separately: they have their own Annot.get_pixmap() method. The resulting pixmap has the same dimensions as the annotation rectangle.


How to Extract Images: Non-PDF Documents

In contrast to the previous sections, this section deals with extracting images contained in documents, so they can be displayed as part of one or more pages.

If you want to recreate the original image in file form or as a memory area, you have basically two options:

  1. Convert your document to a PDF, and then use one of the PDF-only extraction methods. This snippet will convert a document to PDF:

        >>> pdfbytes = doc.convert_to_pdf()  # this is a bytes object
        >>> pdf = fitz.open("pdf", pdfbytes)  # open it as a PDF document
        >>> # now use 'pdf' like any PDF document

  2. Use Page.get_text() with the “dict” parameter. This works for all document types. It will extract all text and images shown on the page, formatted as a Python dictionary. Every image will occur in an image block, containing meta information and the binary image data. For details of the dictionary’s structure, see TextPage. The method works equally well for PDF files. This creates a list of all images shown on a page:

        >>> from pprint import pprint
        >>> d = page.get_text("dict")
        >>> blocks = d["blocks"]  # the list of block dictionaries
        >>> imgblocks = [b for b in blocks if b["type"] == 1]
        >>> pprint(imgblocks[0])
        {'bbox': (100.0, 135.8769989013672, 300.0, 364.1230163574219),
         'bpc': 8,
         'colorspace': 3,
         'ext': 'jpeg',
         'height': 501,
         'image': b'\xff\xd8\xff\xe0\x00\x10JFIF\...',  # CAUTION: LARGE!
         'size': 80518,
         'transform': (200.0, 0.0, -0.0, 228.2460174560547, 100.0, 135.8769989013672),
         'type': 1,
         'width': 439,
         'xres': 96,
         'yres': 96}

How to Extract Images: PDF Documents

Like any other “object” in a PDF, images are identified by a cross reference number (xref, an integer). If you know this number, you have two ways to access the image’s data:

  1. Create a Pixmap of the image with instruction pix = fitz.Pixmap(doc, xref). This method is very fast (single digit micro-seconds). The pixmap’s properties (width, height, …) will reflect the ones of the image. In this case there is no way to tell which image format the embedded original has.

  2. Extract the image with img = doc.extract_image(xref). This is a dictionary containing the binary image data as img["image"]. A number of metadata items are also provided, mostly the same as you would find in the pixmap of the image. The major difference is the string img["ext"], which specifies the image format: apart from "png", strings like "jpeg", "bmp", "tiff", etc. can also occur. Use this string as the file extension if you want to store the image to disk. The execution speed of this method should be compared to the combined speed of the statements pix = fitz.Pixmap(doc, xref); pix.tobytes(). If the embedded image is in PNG format, the speed of Document.extract_image() is about the same (and the binary image data are identical). Otherwise, this method is thousands of times faster, and the image data is much smaller.

The question remains: “How do I know those ‘xref’ numbers of images?”. There are two answers to this:

  1. “Inspect the page objects:” Loop through the items of Page.get_images(). It is a list of lists, and its items look like [xref, smask, …], containing the xref of an image. This xref can then be used with one of the above methods. Use this method for valid (undamaged) documents. Be aware, however, that the same image may be referenced multiple times (by different pages), so you might want to provide a mechanism that avoids multiple extracts.

  2. “No need to know:” Loop through the list of all xrefs of the document and perform a Document.extract_image() for each one. If the returned dictionary is empty, then continue – this xref is no image. Use this method if the PDF is damaged (unusable pages). Note that a PDF often contains “pseudo-images” (“stencil masks”) with the special purpose of defining the transparency of some other image. You may want to provide logic to exclude those from extraction. Also have a look at the next section.
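A mechanism for avoiding multiple extracts of the same image can be as simple as remembering the xrefs already seen. The sketch below works on plain Python data shaped like the items described above, i.e. sequences starting with (xref, smask, ...); the function name and the sample values are hypothetical.

```python
def unique_image_xrefs(pages_images):
    """Collect each image xref once, preserving first-seen order.

    pages_images: one list per page, each item shaped like
    (xref, smask, ...) as described above.
    """
    seen = set()
    ordered = []
    for images in pages_images:
        for item in images:
            xref = item[0]
            if xref not in seen:  # skip images already encountered
                seen.add(xref)
                ordered.append(xref)
    return ordered


# Hypothetical sample data: image 12 is shown on both pages.
sample = [
    [(12, 0), (15, 16)],
    [(12, 0), (20, 0)],
]
print(unique_image_xrefs(sample))
```

In a real script, the per-page lists would come from calling Page.get_images() for every page, and each xref in the result would then be extracted exactly once.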

For both extraction approaches, there exist ready-to-use general purpose scripts:

extract-imga.py extracts images page by page:

_images/img-extract-imga.jpg

and extract-imgb.py extracts images by xref table:

_images/img-extract-imgb.jpg


How to Handle Image Masks

Some images in PDFs are accompanied by image masks. In their simplest form, masks represent alpha (transparency) bytes stored as separate images. In order to reconstruct the original of an image, which has a mask, it must be “enriched” with transparency bytes taken from its mask.

Whether an image does have such a mask can be recognized in one of two ways in PyMuPDF:

  1. An item of Document.get_page_images() has the general format (xref, smask, ...), where xref is the image’s xref and smask, if positive, is the xref of a mask.

  2. The (dictionary) results of Document.extract_image() have a key “smask”, which also contains any mask’s xref if positive.

If smask == 0 then the image encountered via xref can be processed as it is.

To recover the original image using PyMuPDF, the procedure depicted as follows must be executed:

_images/img-stencil.jpg

    >>> pix1 = fitz.Pixmap(doc.extract_image(xref)["image"])  # (1) pixmap of image w/o alpha
    >>> mask = fitz.Pixmap(doc.extract_image(smask)["image"])  # (2) mask pixmap
    >>> pix = fitz.Pixmap(pix1, mask)  # (3) copy of pix1, image mask added

Step (1) creates a pixmap of the basic image. Step (2) does the same with the image mask. Step (3) adds an alpha channel and fills it with transparency information.
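What step (3) means at the byte level can be illustrated in plain Python. This is a conceptual sketch only, not PyMuPDF's implementation: it assumes an 8-bit RGB image and a mask with one byte per pixel and the same pixel count, and the function name add_alpha is hypothetical.

```python
def add_alpha(rgb_samples, alpha_samples):
    """Interleave RGB pixel bytes with one alpha byte per pixel."""
    if len(rgb_samples) != 3 * len(alpha_samples):
        raise ValueError("image and mask must have the same pixel count")
    rgba = bytearray()
    for i, alpha in enumerate(alpha_samples):
        rgba += rgb_samples[3 * i : 3 * i + 3]  # copy R, G, B of pixel i
        rgba.append(alpha)  # append its transparency byte
    return bytes(rgba)


rgb = bytes([255, 0, 0, 0, 255, 0])  # two pixels: red, green
mask = bytes([255, 0])  # first pixel opaque, second fully transparent
print(add_alpha(rgb, mask))
```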

The scripts extract-imga.py and extract-imgb.py above also contain this logic.


How to Make one PDF of all your Pictures (or Files)

We show here three scripts that take a list of (image and other) files and put them all in one PDF.

Method 1: Inserting Images as Pages

The first one converts each image to a PDF page with the same dimensions. The result will be a PDF with one page per image. It will only work for supported image file formats:

    import os, fitz
    import PySimpleGUI as psg  # for showing a progress bar

    doc = fitz.open()  # PDF with the pictures
    imgdir = "D:/2012_10_05"  # where the pics are
    imglist = os.listdir(imgdir)  # list of them
    imgcount = len(imglist)  # pic count

    for i, f in enumerate(imglist):
        img = fitz.open(os.path.join(imgdir, f))  # open pic as document
        rect = img[0].rect  # pic dimension
        pdfbytes = img.convert_to_pdf()  # make a PDF stream
        img.close()  # no longer needed
        imgPDF = fitz.open("pdf", pdfbytes)  # open stream as PDF
        page = doc.new_page(width=rect.width,  # new page with ...
                            height=rect.height)  # pic dimension
        page.show_pdf_page(rect, imgPDF, 0)  # image fills the page
        psg.EasyProgressMeter("Import Images",  # show our progress
                              i + 1, imgcount)

    doc.save("all-my-pics.pdf")

This will generate a PDF only marginally larger than the combined pictures’ size. Some numbers on performance:

The above script needed about 1 minute on my machine for 149 pictures with a total size of 514 MB (and about the same resulting PDF size).

_images/img-import-progress.jpg

Look here for a more complete source code: it offers a directory selection dialog and skips unsupported files and non-file entries.

Note

We might have used Page.insert_image() instead of Page.show_pdf_page(), and the result would have been a similar looking file. However, depending on the image type, it may store images uncompressed. Therefore, the save option deflate = True must be used to achieve a reasonable file size, which hugely increases the runtime for large numbers of images. So this alternative cannot be recommended here.

Method 2: Embedding Files

The second script embeds arbitrary files – not only images. The resulting PDF will have just one (empty) page, required for technical reasons. To later access the embedded files again, you would need a suitable PDF viewer that can display and / or extract embedded files:

    import os, fitz
    import PySimpleGUI as psg  # for showing progress bar

    doc = fitz.open()  # PDF with the embedded files
    imgdir = "D:/2012_10_05"  # where my files are
    imglist = os.listdir(imgdir)  # list of pictures
    imgcount = len(imglist)  # pic count
    imglist.sort()  # nicely sort them

    for i, f in enumerate(imglist):
        with open(os.path.join(imgdir, f), "rb") as infile:
            img = infile.read()  # make pic stream
        doc.embfile_add(f, img, filename=f,  # embed it: entry name first, then content
                        ufilename=f, desc=f)
        psg.EasyProgressMeter("Embedding Files",  # show our progress
                              i + 1, imgcount)

    page = doc.new_page()  # at least 1 page is needed
    doc.save("all-my-pics-embedded.pdf")

_images/img-embed-progress.jpg

This is by far the fastest method, and it also produces the smallest possible output file size. The above pictures needed 20 seconds on my machine and yielded a PDF size of 510 MB. Look here for a more complete source code: it offers a directory selection dialog and skips non-file entries.

Method 3: Attaching Files

A third way to achieve this task is attaching files via page annotations; see here for the complete source code.

This performs similarly to the previous script and also produces a similar file size. It will produce PDF pages which show a ‘FileAttachment’ icon for each attached file.

_images/img-attach-result.jpg

Note

Both the embed and the attach methods can be used for arbitrary files – not just images.

Note

We strongly recommend using the awesome package PySimpleGUI to display a progress meter for tasks that may run for an extended time span. It’s pure Python, uses Tkinter (no additional GUI package) and requires just one more line of code!


How to Create Vector Images

The usual way to create an image from a document page is Page.get_pixmap(). A pixmap represents a raster image, so you must decide on its quality (i.e. resolution) at creation time. It cannot be changed later.

PyMuPDF also offers a way to create a vector image of a page in SVG format (scalable vector graphics, defined in XML syntax). SVG images remain precise across zooming levels (of course with the exception of any raster graphic elements embedded therein).

Instruction svg = page.get_svg_image(matrix=fitz.Identity) delivers a UTF-8 string svg which can be stored with extension “.svg”.


How to Convert Images

Among its other features, PyMuPDF makes image conversion easy. In many cases this avoids the need for other graphics packages like PIL/Pillow.

That said, interfacing with Pillow is almost trivial.

=============  ==============  ================================
Input Formats  Output Formats  Description
=============  ==============  ================================
BMP            .               Windows Bitmap
JPEG           .               Joint Photographic Experts Group
JXR            .               JPEG Extended Range
JPX/JP2        .               JPEG 2000
GIF            .               Graphics Interchange Format
TIFF           .               Tagged Image File Format
PNG            PNG             Portable Network Graphics
PNM            PNM             Portable Anymap
PGM            PGM             Portable Graymap
PBM            PBM             Portable Bitmap
PPM            PPM             Portable Pixmap
PAM            PAM             Portable Arbitrary Map
.              PSD             Adobe Photoshop Document
.              PS              Adobe Postscript
=============  ==============  ================================

The general scheme is just the following two lines:

    pix = fitz.Pixmap("input.xxx")  # any supported input format
    pix.save("output.yyy")  # any supported output format

Remarks

  1. The input argument of fitz.Pixmap(arg) can be a file or a bytes / io.BytesIO object containing an image.

  2. Instead of an output file, you can also create a bytes object via pix.tobytes("yyy") and pass this around.

  3. As a matter of course, input and output formats must be compatible in terms of colorspace and transparency. The Pixmap class has batteries included if adjustments are needed.

Note

Convert JPEG to Photoshop:

    pix = fitz.Pixmap("myfamily.jpg")
    pix.save("myfamily.psd")

Note

Save to JPEG using PIL/Pillow:

    pix = fitz.Pixmap(...)
    pix.pil_save("output.jpg")

Note

Convert JPEG to Tkinter PhotoImage. Any RGB / no-alpha image works exactly the same. Conversion to one of the Portable Anymap formats (PPM, PGM, etc.) does the trick, because they are supported by all Tkinter versions:

    import tkinter as tk
    pix = fitz.Pixmap("input.jpg")  # or any RGB / no-alpha image
    tkimg = tk.PhotoImage(data=pix.tobytes("ppm"))

Note

Convert PNG with alpha to Tkinter PhotoImage. This requires removing the alpha bytes, before we can do the PPM conversion:

    import tkinter as tk
    pix = fitz.Pixmap("input.png")  # may have an alpha channel
    if pix.alpha:  # we have an alpha channel!
        pix = fitz.Pixmap(pix, 0)  # remove it
    tkimg = tk.PhotoImage(data=pix.tobytes("ppm"))

How to Use Pixmaps: Glueing Images

This shows how pixmaps can be used for purely graphical, non-document purposes. The script reads an image file and creates a new image which consists of 3 * 4 tiles of the original:

    import fitz

    src = fitz.Pixmap("img-7edges.png")  # create pixmap from a picture
    col = 3  # tiles per row
    lin = 4  # tiles per column
    tar_w = src.width * col  # width of target
    tar_h = src.height * lin  # height of target

    # create target pixmap
    tar_pix = fitz.Pixmap(src.colorspace, (0, 0, tar_w, tar_h), src.alpha)

    # now fill target with the tiles
    for i in range(col):
        for j in range(lin):
            src.set_origin(src.width * i, src.height * j)
            tar_pix.copy(src, src.irect)  # copy input to new loc

    tar_pix.save("tar.png")

This is the input picture:

_images/img-7edges.png

Here is the output:

_images/img-target.png


How to Use Pixmaps: Making a Fractal

Here is another Pixmap example that creates Sierpinski’s Carpet – a fractal generalizing the Cantor Set to two dimensions. Given a square carpet, mark its 9 sub-squares (3 times 3) and cut out the one in the center. Treat each of the remaining eight sub-squares in the same way, and continue ad infinitum. The end result is a set with area zero and fractal dimension 1.8928…

This script creates an approximate image of it as a PNG, by going down to one-pixel granularity. To increase the image precision, change the value of n (precision):

    import fitz, time

    if not list(map(int, fitz.VersionBind.split("."))) >= [1, 14, 8]:
        raise SystemExit("need PyMuPDF v1.14.8 for this script")
    n = 6  # depth (precision)
    d = 3**n  # edge length

    t0 = time.perf_counter()
    ir = (0, 0, d, d)  # the pixmap rectangle

    pm = fitz.Pixmap(fitz.csRGB, ir, False)
    pm.set_rect(pm.irect, (255, 255, 0))  # fill it with some background color

    color = (0, 0, 255)  # color to fill the punch holes

    # alternatively, define a 'fill' pixmap for the punch holes
    # this could be anything, e.g. some photo image ...
    fill = fitz.Pixmap(fitz.csRGB, ir, False)  # same size as 'pm'
    fill.set_rect(fill.irect, (0, 255, 255))  # put some color in


    def punch(x, y, step):
        """Recursively "punch a hole" in the central square of a pixmap.

        Arguments are top-left coords and the step width.

        Some alternative punching methods are commented out.
        """
        s = step // 3  # the new step
        # iterate through the 9 sub-squares
        # the central one will be filled with the color
        for i in range(3):
            for j in range(3):
                if i != j or i != 1:  # this is not the central square
                    if s >= 3:  # recursing needed?
                        punch(x + i * s, y + j * s, s)  # recurse
                else:  # punching alternatives are:
                    pm.set_rect((x + s, y + s, x + 2 * s, y + 2 * s), color)  # fill with a color
                    # pm.copy(fill, (x + s, y + s, x + 2 * s, y + 2 * s))  # copy from fill
                    # pm.invert_irect((x + s, y + s, x + 2 * s, y + 2 * s))  # invert colors
        return


    # ==============================================================================
    # main program
    # ==============================================================================
    # now start punching holes into the pixmap
    punch(0, 0, d)
    t1 = time.perf_counter()
    pm.save("sierpinski-punch.png")
    t2 = time.perf_counter()
    print("%g sec to create / fill the pixmap" % round(t1 - t0, 3))
    print("%g sec to save the image" % round(t2 - t1, 3))

The result should look something like this:

_images/img-sierpinski.png
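The numbers quoted for the carpet can be cross-checked without any imaging library. The following sketch is independent of the script above and uses a different, non-recursive technique: a pixel is removed if, at some level, both of its base-3 coordinate digits equal 1. The function names are illustrative.

```python
import math


def is_hole(x, y):
    """True if pixel (x, y) lies in a removed central square at any level."""
    while x > 0 or y > 0:
        if x % 3 == 1 and y % 3 == 1:  # central sub-square at this level
            return True
        x //= 3
        y //= 3
    return False


def carpet_rows(n):
    """Return the carpet of depth n as text rows ('#' kept, ' ' removed)."""
    d = 3 ** n  # edge length in pixels
    return ["".join(" " if is_hole(x, y) else "#" for x in range(d))
            for y in range(d)]


rows = carpet_rows(2)
print("\n".join(rows))
kept = sum(row.count("#") for row in rows)
print(kept, "of", 9 ** 2, "pixels remain")  # each level keeps 8 of 9 sub-squares
print("fractal dimension:", math.log(8) / math.log(3))
```

Each level keeps 8 of 9 sub-squares, so the remaining area shrinks by a factor of 8/9 per level and tends to zero, while the dimension log 8 / log 3 is the value of about 1.8928 mentioned above.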


How to Interface with NumPy

This shows how to create a PNG file from a numpy array (several times faster than most other methods):

    import numpy as np
    import fitz

    # ==============================================================================
    # create a fun-colored width * height PNG with fitz and numpy
    # ==============================================================================
    height = 150
    width = 100
    bild = np.ndarray((height, width, 3), dtype=np.uint8)

    for i in range(height):
        for j in range(width):
            # one pixel (some fun coloring)
            bild[i, j] = [(i + j) % 256, i % 256, j % 256]

    samples = bytearray(bild.tobytes())  # get plain pixel data from numpy array
    pix = fitz.Pixmap(fitz.csRGB, width, height, samples, False)  # last argument: no alpha
    pix.save("test.png")
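To make the layout of the pixel buffer explicit, the same bytes can be assembled in plain Python. This sketch assumes the buffer is row-major with 3 bytes (R, G, B) per pixel, which is what the (height, width, 3) array above yields when serialized; it is an illustration of the layout and not a faster alternative.

```python
height = 150
width = 100

samples = bytearray()
for i in range(height):  # rows, top to bottom
    for j in range(width):  # pixels within a row, left to right
        samples += bytes(((i + j) % 256, i % 256, j % 256))  # R, G, B


def pixel_at(row, col):
    """Return the (R, G, B) bytes of one pixel from the flat buffer."""
    offset = (row * width + col) * 3
    return tuple(samples[offset : offset + 3])


print(len(samples))    # width * height * 3 bytes
print(pixel_at(1, 2))  # pixel in row 1, column 2
```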

How to Add Images to a PDF Page

There are two methods to add images to a PDF page: Page.insert_image() and Page.show_pdf_page(). Both methods have things in common, but there also exist differences.

============================  ===================================  ========================================
Criterion                     Page.insert_image()                  Page.show_pdf_page()
============================  ===================================  ========================================
displayable content           image file, image in memory, pixmap  PDF page
display resolution            image resolution                     vectorized (except raster page content)
rotation                      0, 90, 180 or 270 degrees            any angle
clipping                      no (full image only)                 yes
keep aspect ratio             yes (default option)                 yes (default option)
transparency (water marking)  depends on the image                 depends on the page
location / placement          scaled to fit target rectangle       scaled to fit target rectangle
performance                   automatic prevention of duplicates   automatic prevention of duplicates
multi-page image support      no                                   yes
ease of use                   simple, intuitive                    simple, intuitive; usable for all
                                                                   document types (including images!)
                                                                   after conversion to PDF via
                                                                   Document.convert_to_pdf()
============================  ===================================  ========================================

Basic code pattern for Page.insert_image(). Exactly one of the parameters filename / stream / pixmap must be given, if not re-inserting an existing image:

    page.insert_image(
        rect,                  # where to place the image (rect-like)
        filename=None,         # image in a file
        stream=None,           # image in memory (bytes)
        pixmap=None,           # image from pixmap
        mask=None,             # specify alpha channel separately
        rotate=0,              # rotate (int, multiple of 90)
        xref=0,                # re-use existing image
        oc=0,                  # control visibility via OCG / OCMD
        keep_proportion=True,  # keep aspect ratio
        overlay=True,          # put in foreground
    )

Basic code pattern for Page.show_pdf_page(). Source and target PDF must be different Document objects (but may be opened from the same file):

    page.show_pdf_page(
        rect,                  # where to place the image (rect-like)
        src,                   # source PDF
        pno=0,                 # page number in source PDF
        clip=None,             # only display this area (rect-like)
        rotate=0,              # rotate (float, any value)
        oc=0,                  # control visibility via OCG / OCMD
        keep_proportion=True,  # keep aspect ratio
        overlay=True,          # put in foreground
    )

Text


How to Extract all Document Text

This script will take a document filename and generate a text file from all of its text.

The document can be any supported type like PDF, XPS, etc.

The script works as a command line tool which expects the document filename to be supplied as a parameter. It generates one text file whose name is the document filename with “.txt” appended. Text of pages is separated by a form feed character:

    import sys, fitz
    fname = sys.argv[1]  # get document filename
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output
    for page in doc:  # iterate the document pages
        text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()

The output will be plain text as it is coded in the document. No effort is made to prettify in any way. Specifically for PDF, this may mean output not in usual reading order, unexpected line breaks and so forth.

You have many options to cure this – see the appendix on the details of text extraction. Among them are:

  1. Extract text in HTML format and store it as an HTML document, so it can be viewed in any browser.

  2. Extract text as a list of text blocks via Page.get_text("blocks"). Each item of this list contains position information for its text, which can be used to establish a convenient reading order.

  3. Extract a list of single words via Page.get_text("words"). Its items are words with position information. Use it to determine text contained in a given rectangle – see next section.

See the following two sections for examples and further explanations.
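How position information can establish a reading order is sketched below in plain Python. It assumes block items that start with their bounding box (x0, y0, x1, y1) followed by the block text, as returned by the "blocks" option; the function name and the sample data are hypothetical. The simple sort key ignores multi-column layouts.

```python
def blocks_in_reading_order(blocks):
    """Sort text blocks top-to-bottom, then left-to-right.

    Each block is assumed to start with its bounding box
    (x0, y0, x1, y1), followed by the block's text.
    """
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Hypothetical blocks in the order a PDF creator might have stored them.
blocks = [
    (50, 300, 500, 340, "second paragraph"),
    (50, 100, 500, 140, "first paragraph"),
    (50, 40, 500, 60, "page header"),
]
for block in blocks_in_reading_order(blocks):
    print(block[4])
```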

How to Extract Text from within a Rectangle

There is now (v1.18.0) more than one way to achieve this. We therefore have created a folder in the PyMuPDF-Utilities repository specifically dealing with this topic.
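One simple approach, sketched here in plain Python, selects the words whose bounding boxes lie fully inside the rectangle. It assumes word items shaped like those of Page.get_text("words"), which start with (x0, y0, x1, y1, word); the function name and the sample data are illustrative and not taken from the repository mentioned above.

```python
def words_in_rect(words, rect):
    """Return the text of all words whose bbox lies fully inside rect."""
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    inside.sort(key=lambda w: (w[3], w[0]))  # by bottom edge, then left edge
    return " ".join(w[4] for w in inside)


# Hypothetical word list: (x0, y0, x1, y1, word)
words = [
    (200, 100, 240, 112, "world"),
    (100, 100, 140, 112, "Hello"),
    (100, 300, 150, 312, "outside"),
]
print(words_in_rect(words, (90, 90, 300, 120)))
```

Requiring full containment is a design choice; testing for intersection instead would also pick up words that only partly overlap the rectangle.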


How to Extract Text in Natural Reading Order

One of the common issues with PDF text extraction is that text may not appear in any particular reading order.

Responsible for this effect is the PDF creator (software or a human). For example, page headers may have been inserted in a separate step – after the document had been produced. In such a case, the header text will appear at the end of a page text extraction (although it will be correctly shown by PDF viewer software). For example, the following snippet will add some header and footer lines to an existing PDF:

    doc = fitz.open("some.pdf")
    header = "Header"  # text in header
    footer = "Page %i of %i"  # text in footer
    for page in doc:
        page.insert_text((50, 50), header)  # insert header
        page.insert_text(  # insert footer 50 points above page bottom
            (50, page.rect.height - 50),
            footer % (page.number + 1, doc.page_count),
        )

The text sequence extracted from a page modified in this way will look like this:

  1. original text

  2. header line

  3. footer line

PyMuPDF has several means to re-establish some reading sequence or even to re-generate a layout close to the original:

  1. Use the sort parameter of Page.get_text(). It will sort the output from top-left to bottom-right (ignored for XHTML, HTML and XML output).

  2. Use the fitz module in CLI: python -m fitz gettext ..., which produces a text file where text has been re-arranged in layout-preserving mode. Many options are available to control the output.

You can also use the above mentioned script with your modifications.


How to Extract Tables from Documents

If you see a table in a document, you are not normally looking at something like an embedded Excel or other identifiable object. It usually is just text, formatted to appear as appropriate.

Extracting tabular data from such a page area therefore means that you must find a way to (1) graphically indicate table and column borders, and (2) then extract text based on this information.

The wxPython GUI script wxTableExtract.py strives to do exactly that. You may want to have a look at it and adjust it to your liking.


How to Mark Extracted Text

There is a standard search function to search for arbitrary text on a page: Page.search_for(). It returns a list of Rect objects which surround a found occurrence. These rectangles can for example be used to automatically insert annotations which visibly mark the found text.

This method has advantages and drawbacks. Pros are

  • The search string can contain blanks and wrap across lines

  • Upper or lower case characters are treated equal

  • Word hyphenation at line ends is detected and resolved

  • The return value may also be a list of Quad objects to precisely locate text that is not parallel to either axis – using Quad output is also recommended when page rotation is not zero.

But you also have other options:

    import sys
    import fitz


    def mark_word(page, text):
        """Underline each word that contains 'text'."""
        found = 0
        wlist = page.get_text("words")  # make the word list
        for w in wlist:  # scan through all words on page
            if text in w[4]:  # w[4] is the word's string
                found += 1  # count
                r = fitz.Rect(w[:4])  # make rect from word bbox
                page.add_underline_annot(r)  # underline
        return found


    fname = sys.argv[1]  # filename
    text = sys.argv[2]  # search string
    doc = fitz.open(fname)

    print("underlining words containing '%s' in document '%s'" % (text, doc.name))

    new_doc = False  # indicator if anything found at all

    for page in doc:  # scan through the pages
        found = mark_word(page, text)  # mark the page's words
        if found:  # if anything found ...
            new_doc = True
            print("found '%s' %i times on page %i" % (text, found, page.number + 1))

    if new_doc:
        doc.save("marked-" + doc.name)

This script uses Page.get_text("words") to look for a string, handed in via a CLI parameter. This method separates a page’s text into “words” using spaces and line breaks as delimiters. Therefore the words in this list do not contain these characters. Further remarks:

  • If found, the complete word containing the string is marked (underlined) – not only the search string.

  • The search string may not contain spaces or other white space.

  • As shown here, upper / lower cases are respected. But this can be changed by using the string method lower() (or even regular expressions) in function mark_word.

  • There is no upper limit: all occurrences will be detected.

  • You can use anything to mark the word: ‘Underline’, ‘Highlight’, ‘StrikeThrough’ or ‘Square’ annotations, etc.

  • Here is an example snippet of a page of this manual, where “MuPDF” has been used as the search string. Note that all strings containing “MuPDF” have been completely underlined (not just the search string).

_images/img-markedpdf.jpg
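The case-insensitive variant mentioned in the remarks can be sketched as a plain Python filter over word items. It assumes items shaped like those of Page.get_text("words"), with the word string at index 4; the function name and the sample data are hypothetical.

```python
def matching_words(words, text, ignore_case=True):
    """Return the word items whose string contains 'text'."""
    needle = text.lower() if ignore_case else text
    hits = []
    for w in words:
        candidate = w[4].lower() if ignore_case else w[4]
        if needle in candidate:
            hits.append(w)
    return hits


# Hypothetical word list: (x0, y0, x1, y1, word)
words = [
    (72, 100, 130, 112, "PyMuPDF"),
    (135, 100, 170, 112, "uses"),
    (175, 100, 215, 112, "mupdf"),
]
print([w[4] for w in matching_words(words, "MuPDF")])
print([w[4] for w in matching_words(words, "MuPDF", ignore_case=False)])
```

Each returned item still carries its bounding box in the first four entries, so it can be turned into a rectangle and annotated in the same way as in the script above.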


How to Mark Searched Text

This script searches for text and marks it:

    # -*- coding: utf-8 -*-
    import fitz

    # the document to annotate
    doc = fitz.open("tilted-text.pdf")

    # the text to be marked
    t = "¡La práctica hace el campeón!"

    # work with first page only
    page = doc[0]

    # get list of text locations
    # we use "quads", not rectangles because text may be tilted!
    rl = page.search_for(t, quads=True)

    # mark all found quads with one annotation
    page.add_squiggly_annot(rl)

    # save to a new PDF
    doc.save("a-squiggly.pdf")

The result looks like this:

_images/img-textmarker.jpg


How to Mark Non-horizontal Text

The previous section already shows an example for marking non-horizontal text, that was detected by text searching.

But text extraction with the “dict” / “rawdict” options of Page.get_text() may also return text with a non-zero angle to the x-axis. This is indicated by the value of the line dictionary’s "dir" key: it is the tuple (cosine, sine) for that angle. If line["dir"] != (1, 0), then the text of all its spans is rotated by (the same) angle != 0.
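To make the meaning of the "dir" tuple concrete, here is a small sketch that converts it to an angle and tests for horizontal text with a tolerance. The helper names are hypothetical, and whether a positive angle appears clockwise or counter-clockwise on the page depends on the coordinate system in use, so the sketch makes no claim about that.

```python
import math


def line_angle(direction):
    """Return the angle in degrees encoded by a (cosine, sine) tuple."""
    cosine, sine = direction
    return math.degrees(math.atan2(sine, cosine))


def is_horizontal(direction, tol=1e-6):
    """True if the direction equals (1, 0) within a small tolerance."""
    cosine, sine = direction
    return abs(cosine - 1.0) <= tol and abs(sine) <= tol


print(line_angle((1, 0)))        # horizontal text
print(is_horizontal((0.0, 1.0)))  # text at a right angle to the x-axis
```

Comparing with a tolerance instead of testing line["dir"] != (1, 0) directly can be useful when the values carry floating point noise.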

The “bboxes” returned by the method however are rectangles only – not quads. So, to mark span text correctly, its quad must be recovered from the data contained in the line and span dictionary. Do this with the following utility function (new in v1.18.9):

    span_quad = fitz.recover_quad(line["dir"], span)
    annot = page.add_highlight_annot(span_quad)  # this will mark the complete span text

If you want to mark the complete line or a subset of its spans in one go, use the following snippet (works for v1.18.10 or later):

    line_quad = fitz.recover_line_quad(line, spans=line["spans"][1:-1])
    page.add_highlight_annot(line_quad)

_images/img-linequad.jpg

The spans argument above may specify any sub-list of line["spans"]. In the example above, the second through the second-to-last spans are marked. If omitted, the complete line is taken.


How to Analyze Font Characteristics

To analyze the characteristics of text in a PDF use this elementary script as a starting point:

    import fitz


    def flags_decomposer(flags):
        """Make font flags human readable."""
        l = []
        if flags & 2 ** 0:
            l.append("superscript")
        if flags & 2 ** 1:
            l.append("italic")
        if flags & 2 ** 2:
            l.append("serifed")
        else:
            l.append("sans")
        if flags & 2 ** 3:
            l.append("monospaced")
        else:
            l.append("proportional")
        if flags & 2 ** 4:
            l.append("bold")
        return ", ".join(l)


    doc = fitz.open("text-tester.pdf")
    page = doc[0]

    # read page text as a dictionary, suppressing extra spaces in CJK fonts
    blocks = page.get_text("dict", flags=11)["blocks"]
    for b in blocks:  # iterate through the text blocks
        for l in b["lines"]:  # iterate through the text lines
            for s in l["spans"]:  # iterate through the text spans
                print("")
                font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                    s["font"],  # font name
                    flags_decomposer(s["flags"]),  # readable font flags
                    s["size"],  # font size
                    s["color"],  # font color
                )
                print("Text: '%s'" % s["text"])  # simple print of text
                print(font_properties)

Here is the PDF page and the script output:

_images/img-pdftext.jpg
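The script prints the span color with the format "#%06x", i.e. it treats s["color"] as an integer holding the red, green and blue bytes. Under that assumption, the following sketch splits such an integer into its components. The function names are hypothetical and not part of PyMuPDF.

```python
def srgb_to_rgb(color):
    """Split an integer like 0xRRGGBB into a tuple (r, g, b) of ints 0..255."""
    r = (color >> 16) & 0xFF
    g = (color >> 8) & 0xFF
    b = color & 0xFF
    return r, g, b


def srgb_to_unit(color):
    """Same as above, but with float components in the range 0..1."""
    return tuple(c / 255 for c in srgb_to_rgb(color))


print(srgb_to_rgb(0xFF8000))
print(srgb_to_unit(0xFF0000))
```

The float form corresponds to color tuples such as red = (1, 0, 0) used in the text insertion examples of this chapter.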


How to Insert Text

PyMuPDF provides ways to insert text on new or existing PDF pages with the following features:

  • choose the font, including built-in fonts and fonts that are available as files

  • choose text characteristics like bold, italic, font size, font color, etc.

  • position the text in multiple ways:

    • either as simple line-oriented output starting at a certain point,

    • or fitting text in a box provided as a rectangle, in which case text alignment choices are also available,

  • choose whether text should be put in foreground (overlay existing content),

  • all text can be arbitrarily “morphed”, i.e. its appearance can be changed via a Matrix, to achieve effects like scaling, shearing or mirroring,

  • independently from morphing and in addition to that, text can be rotated by integer multiples of 90 degrees.

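All of the above is provided by three basic Page, resp. Shape methods:

  • Page.insert_font() – install a font for the page for later reference.

  • Page.insert_text() resp. Shape.insert_text() – write some lines of text.

  • Page.insert_textbox() resp. Shape.insert_textbox() – fit text into a given rectangle.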

Note

Both text insertion methods automatically install the font as necessary.

How to Write Text Lines

Output some text lines on a page:

    import fitz

    doc = fitz.open(...)  # new or existing PDF
    page = doc.new_page()  # new or existing page via doc[n]
    p = fitz.Point(50, 72)  # start point of 1st line
    text = "Some text,\nspread across\nseveral lines."
    # the same result is achievable by
    # text = ["Some text", "spread across", "several lines."]
    rc = page.insert_text(
        p,  # bottom-left of 1st char
        text,  # the text (honors '\n')
        fontname="helv",  # the default font
        fontsize=11,  # the default font size
        rotate=0,  # also available: 90, 180, 270
    )
    print("%i lines printed on page %i." % (rc, page.number))
    doc.save("text.pdf")

With this method, only the number of lines is checked against the page height. Surplus lines are not written, and the number of lines actually written is returned. The calculation uses a line height derived from the fontsize and 36 points (0.5 inches) as bottom margin.

Line width is ignored. The surplus part of a line will simply be invisible.

However, for built-in fonts there are ways to calculate the line width beforehand - see get_text_length().
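The line limit described above can be estimated before writing anything. This is a rough sketch only and not PyMuPDF's actual computation: the line height factor of 1.2, the function name and its parameters are assumptions, and only unrotated text (rotate=0) is considered.

```python
import math


def estimate_line_capacity(page_height, start_y, fontsize=11,
                           bottom_margin=36, lineheight_factor=1.2):
    """Estimate how many lines fit between start_y and the bottom margin.

    start_y is the y coordinate of the first line's insertion point.
    lineheight_factor is an assumed value, not taken from the library.
    """
    line_height = fontsize * lineheight_factor
    limit = page_height - bottom_margin  # lowest allowed insertion y
    if start_y > limit:
        return 0
    return 1 + math.floor((limit - start_y) / line_height)


# A4 portrait page (842 points high), first line at y = 72
print(estimate_line_capacity(842, 72))
```

Comparing this estimate with the return value of the insertion method is a simple way to detect that lines were dropped.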

Here is another example. It inserts 4 text strings using the four different rotation options, and thereby shows how the text insertion point must be chosen to achieve the desired result:

    import fitz

    doc = fitz.open()
    page = doc.new_page()
    # the text strings, each having 3 lines
    text1 = "rotate=0\nLine 2\nLine 3"
    text2 = "rotate=90\nLine 2\nLine 3"
    text3 = "rotate=-90\nLine 2\nLine 3"
    text4 = "rotate=180\nLine 2\nLine 3"
    red = (1, 0, 0)  # the color for the red dots
    # the insertion points, each with a 25 pix distance from the corners
    p1 = fitz.Point(25, 25)
    p2 = fitz.Point(page.rect.width - 25, 25)
    p3 = fitz.Point(25, page.rect.height - 25)
    p4 = fitz.Point(page.rect.width - 25, page.rect.height - 25)
    # create a Shape to draw on
    shape = page.new_shape()
    # draw the insertion points as red, filled dots
    shape.draw_circle(p1, 1)
    shape.draw_circle(p2, 1)
    shape.draw_circle(p3, 1)
    shape.draw_circle(p4, 1)
    shape.finish(width=0.3, color=red, fill=red)
    # insert the text strings
    shape.insert_text(p1, text1)
    shape.insert_text(p3, text2, rotate=90)
    shape.insert_text(p2, text3, rotate=-90)
    shape.insert_text(p4, text4, rotate=180)
    # store our work to the page
    shape.commit()
    doc.save(...)

This is the result:

_images/img-inserttext.jpg


How to Fill a Text Box

This script fills 4 different rectangles with text, each time choosing a different rotation value:

    import fitz

    doc = fitz.open(...)  # new or existing PDF
    page = doc.new_page()  # new page, or choose doc[n]
    r1 = fitz.Rect(50, 100, 100, 150)  # a 50x50 rectangle
    disp = fitz.Rect(55, 0, 55, 0)  # add this to get more rects
    r2 = r1 + disp  # 2nd rect
    r3 = r1 + disp * 2  # 3rd rect
    r4 = r1 + disp * 3  # 4th rect
    t1 = "text with rotate = 0."  # the texts we will put in
    t2 = "text with rotate = 90."
    t3 = "text with rotate = -90."
    t4 = "text with rotate = 180."
    red = (1, 0, 0)  # some colors
    gold = (1, 1, 0)
    blue = (0, 0, 1)
    """We use a Shape object (something like a canvas) to output the text and
    the rectangles surrounding it for demonstration.
    """
    shape = page.new_shape()  # create Shape
    shape.draw_rect(r1)  # draw rectangles
    shape.draw_rect(r2)  # giving them
    shape.draw_rect(r3)  # a yellow background
    shape.draw_rect(r4)  # and a red border
    shape.finish(width=0.3, color=red, fill=gold)
    # Now insert text in the rectangles. Font "Helvetica" will be used
    # by default. A return code rc < 0 indicates insufficient space (not checked here).
    rc = shape.insert_textbox(r1, t1, color=blue)
    rc = shape.insert_textbox(r2, t2, color=blue, rotate=90)
    rc = shape.insert_textbox(r3, t3, color=blue, rotate=-90)
    rc = shape.insert_textbox(r4, t4, color=blue, rotate=180)
    shape.commit()  # write all stuff to page /Contents
    doc.save("...")

Several default values were used above: font “Helvetica”, font size 11 and text alignment “left”. The result will look like this:

_images/img-textbox.jpg


How to Use Non-Standard Encoding

Since v1.14, MuPDF allows Greek and Russian encoding variants for the Base14_Fonts. In PyMuPDF this is supported via an additional encoding argument. Effectively, this is relevant for Helvetica, Times-Roman and Courier (and their bold / italic forms) and characters outside the ASCII code range only. Elsewhere, the argument is ignored. Here is how to request Russian encoding with the standard font Helvetica:

    page.insert_text(point, russian_text, encoding=fitz.TEXT_ENCODING_CYRILLIC)

The valid encoding values are TEXT_ENCODING_LATIN (0), TEXT_ENCODING_GREEK (1), and TEXT_ENCODING_CYRILLIC (2, Russian) with Latin being the default. Encoding can be specified by all relevant font and text insertion methods.

By the above statement, the fontname helv is automatically connected to the Russian font variant of Helvetica. Any subsequent text insertion with this fontname will use the Russian Helvetica encoding.

If you change the fontname just slightly, you can also achieve an encoding “mixture” for the same base font on the same page:

    import fitz

    doc = fitz.open()
    page = doc.new_page()
    shape = page.new_shape()
    t = "Sômé tèxt wìth nöñ-Lâtîn characterß."
    shape.insert_text((50, 70), t, fontname="helv", encoding=fitz.TEXT_ENCODING_LATIN)
    shape.insert_text((50, 90), t, fontname="HElv", encoding=fitz.TEXT_ENCODING_GREEK)
    shape.insert_text((50, 110), t, fontname="HELV", encoding=fitz.TEXT_ENCODING_CYRILLIC)
    shape.commit()
    doc.save("t.pdf")

The result:

_images/img-encoding.jpg

The snippet above indeed leads to three different copies of the Helvetica font in the PDF. Each copy is uniquely identified (and referenceable) by using the correct upper-lower case spelling of the reserved word “helv”:

    for f in doc.get_page_fonts(0): print(f)

    [6, 'n/a', 'Type1', 'Helvetica', 'helv', 'WinAnsiEncoding']
    [7, 'n/a', 'Type1', 'Helvetica', 'HElv', 'WinAnsiEncoding']
    [8, 'n/a', 'Type1', 'Helvetica', 'HELV', 'WinAnsiEncoding']

Annotations

In v1.14.0, annotation handling has been considerably extended:

  • New annotation type support for ‘Ink’, ‘Rubber Stamp’ and ‘Squiggly’ annotations. Ink annots simulate handwriting by combining one or more lists of interconnected points. Stamps are intended to visually inform about a document’s status or intended usage (like “draft”, “confidential”, etc.). ‘Squiggly’ is a text marker annot, which underlines selected text with a zigzagged line.

  • Extended ‘FreeText’ support:

    1. all characters from the Latin character set are now available,

    2. colors of text, rectangle background and rectangle border can be independently set

    3. text in rectangle can be rotated by either +90 or -90 degrees

    4. text is automatically wrapped (made multi-line) in available rectangle

    5. all Base-14 fonts are now available (normal variants only, i.e. no bold, no italic).

  • MuPDF now supports line end icons for ‘Line’ annots (only). PyMuPDF supported that in v1.13.x already – and for (almost) the full range of applicable types. So we adjusted the appearance of ‘Polygon’ and ‘PolyLine’ annots to closely resemble the one of MuPDF for ‘Line’.

  • MuPDF now provides its own annotation icons where relevant. PyMuPDF switched to using them (for ‘FileAttachment’ and ‘Text’ [“sticky note”] so far).

  • MuPDF now also supports ‘Caret’, ‘Movie’, ‘Sound’ and ‘Signature’ annotations, which we may include in PyMuPDF at some later time.

How to Add and Modify Annotations

In PyMuPDF, new annotations can be added via Page methods. Once an annotation exists, it can be modified to a large extent using methods of the Annot class.

In contrast to many other tools, initial insert of annotations happens with a minimum number of properties. We leave it to the programmer to e.g. set attributes like author, creation date or subject.

As an overview for these capabilities, look at the following script that fills a PDF page with most of the available annotations. Look in the next sections for more special situations:

    # -*- coding: utf-8 -*-
    """
    -------------------------------------------------------------------------------
    Demo script showing how annotations can be added to a PDF using PyMuPDF.

    It contains the following annotation types:
    Caret, Text, FreeText, text markers (underline, strike-out, highlight,
    squiggle), Circle, Square, Line, PolyLine, Polygon, FileAttachment, Stamp
    and Redaction.
    There is some effort to vary appearances by adding colors, line ends,
    opacity, rotation, dashed lines, etc.

    Dependencies
    ------------
    PyMuPDF v1.17.0
    -------------------------------------------------------------------------------
    """
    from __future__ import print_function
    import gc
    import sys
    import fitz

    print(fitz.__doc__)
    # compare version components numerically, not as strings
    if tuple(map(int, fitz.VersionBind.split("."))) < (1, 17, 0):
        sys.exit("PyMuPDF v1.17.0+ is needed.")

    gc.set_debug(gc.DEBUG_UNCOLLECTABLE)

    highlight = "this text is highlighted"
    underline = "this text is underlined"
    strikeout = "this text is striked out"
    squiggled = "this text is zigzag-underlined"
    red = (1, 0, 0)
    blue = (0, 0, 1)
    gold = (1, 1, 0)
    green = (0, 1, 0)

    displ = fitz.Rect(0, 50, 0, 50)
    r = fitz.Rect(72, 72, 220, 100)
    t1 = u"têxt üsès Lätiñ charß,\nEUR: €, mu: µ, super scripts: ²³!"


    def print_descr(annot):
        """Print a short description to the right of each annot rect."""
        annot.parent.insert_text(
            annot.rect.br + (10, -5), "%s annotation" % annot.type[1], color=red
        )


    doc = fitz.open()
    page = doc.new_page()

    page.set_rotation(0)

    annot = page.add_caret_annot(r.tl)
    print_descr(annot)

    r = r + displ
    annot = page.add_freetext_annot(
        r,
        t1,
        fontsize=10,
        rotate=90,
        text_color=blue,
        fill_color=gold,
        align=fitz.TEXT_ALIGN_CENTER,
    )
    annot.set_border(width=0.3, dashes=[2])
    annot.update(text_color=blue, fill_color=gold)
    print_descr(annot)

    r = annot.rect + displ
    annot = page.add_text_annot(r.tl, t1)
    print_descr(annot)

    # Adding text marker annotations:
    # first insert a unique text, then search for it, then mark it
    pos = annot.rect.tl + displ.tl
    page.insert_text(
        pos,  # insertion point
        highlight,  # inserted text
        morph=(pos, fitz.Matrix(-5)),  # rotate around insertion point
    )
    rl = page.search_for(highlight, quads=True)  # need a quad b/o tilted text
    annot = page.add_highlight_annot(rl[0])
    print_descr(annot)

    pos = annot.rect.bl  # next insertion point
    page.insert_text(pos, underline, morph=(pos, fitz.Matrix(-10)))
    rl = page.search_for(underline, quads=True)
    annot = page.add_underline_annot(rl[0])
    print_descr(annot)

    pos = annot.rect.bl
    page.insert_text(pos, strikeout, morph=(pos, fitz.Matrix(-15)))
    rl = page.search_for(strikeout, quads=True)
    annot = page.add_strikeout_annot(rl[0])
    print_descr(annot)

    pos = annot.rect.bl
    page.insert_text(pos, squiggled, morph=(pos, fitz.Matrix(-20)))
    rl = page.search_for(squiggled, quads=True)
    annot = page.add_squiggly_annot(rl[0])
    print_descr(annot)

    pos = annot.rect.bl
    r = fitz.Rect(pos, pos.x + 75, pos.y + 35) + (0, 20, 0, 20)
    annot = page.add_polyline_annot([r.bl, r.tr, r.br, r.tl])  # 'Polyline'
    annot.set_border(width=0.3, dashes=[2])
    annot.set_colors(stroke=blue, fill=green)
    annot.set_line_ends(fitz.PDF_ANNOT_LE_CLOSED_ARROW, fitz.PDF_ANNOT_LE_R_CLOSED_ARROW)
    annot.update(fill_color=(1, 1, 0))
    print_descr(annot)

    r += displ
    annot = page.add_polygon_annot([r.bl, r.tr, r.br, r.tl])  # 'Polygon'
    annot.set_border(width=0.3, dashes=[2])
    annot.set_colors(stroke=blue, fill=gold)
    annot.set_line_ends(fitz.PDF_ANNOT_LE_DIAMOND, fitz.PDF_ANNOT_LE_CIRCLE)
    annot.update()
    print_descr(annot)

    r += displ
    annot = page.add_line_annot(r.tr, r.bl)  # 'Line'
    annot.set_border(width=0.3, dashes=[2])
    annot.set_colors(stroke=blue, fill=gold)
    annot.set_line_ends(fitz.PDF_ANNOT_LE_DIAMOND, fitz.PDF_ANNOT_LE_CIRCLE)
    annot.update()
    print_descr(annot)

    r += displ
    annot = page.add_rect_annot(r)  # 'Square'
    annot.set_border(width=1, dashes=[1, 2])
    annot.set_colors(stroke=blue, fill=gold)
    annot.update(opacity=0.5)
    print_descr(annot)

    r += displ
    annot = page.add_circle_annot(r)  # 'Circle'
    annot.set_border(width=0.3, dashes=[2])
    annot.set_colors(stroke=blue, fill=gold)
    annot.update()
    print_descr(annot)

    r += displ
    annot = page.add_file_annot(
        r.tl, b"just anything for testing", "testdata.txt"  # 'FileAttachment'
    )
    print_descr(annot)  # annot.rect

    r += displ
    annot = page.add_stamp_annot(r, stamp=10)  # 'Stamp'
    annot.set_colors(stroke=green)
    annot.update()
    print_descr(annot)

    r += displ + (0, 0, 50, 10)
    rc = page.insert_textbox(
        r,
        "This content will be removed upon applying the redaction.",
        color=blue,
        align=fitz.TEXT_ALIGN_CENTER,
    )
    annot = page.add_redact_annot(r)
    print_descr(annot)

    doc.save(__file__.replace(".py", "-%i.pdf" % page.rotation), deflate=True)

This script should lead to the following output:

_images/img-annots.jpg
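A note on version checks like the one at the top of the script: dotted version strings such as fitz.VersionBind are best compared numerically, because as text "9" sorts after "17". The following sketch shows the difference. The helper names are hypothetical, and the code assumes that every version component is a plain number (no suffixes like "rc1").

```python
def version_tuple(version):
    """Convert a dotted version string like '1.17.0' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# numeric comparison: 1.9.0 is older than 1.17.0
print(meets_minimum("1.9.0", "1.17.0"))
# comparing lists of strings gets this case wrong: it reports "not older"
print("1.9.0".split(".") < "1.17.0".split("."))
```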


How to Use FreeText

This script shows a couple of ways to deal with ‘FreeText’ annotations:

    # -*- coding: utf-8 -*-
    import fitz

    # some colors
    blue = (0, 0, 1)
    green = (0, 1, 0)
    red = (1, 0, 0)
    gold = (1, 1, 0)

    # a new PDF with 1 page
    doc = fitz.open()
    page = doc.new_page()

    # 3 rectangles, same size, above each other
    r1 = fitz.Rect(100, 100, 200, 150)
    r2 = r1 + (0, 75, 0, 75)
    r3 = r2 + (0, 75, 0, 75)

    # the text, Latin alphabet
    t = "¡Un pequeño texto para practicar!"

    # add 3 annots, modify the last one somewhat
    a1 = page.add_freetext_annot(r1, t, text_color=red)
    a2 = page.add_freetext_annot(r2, t, fontname="Ti", text_color=blue)
    a3 = page.add_freetext_annot(r3, t, fontname="Co", text_color=blue, rotate=90)
    a3.set_border(width=0)
    a3.update(fontsize=8, fill_color=gold)

    # save the PDF
    doc.save("a-freetext.pdf")

The result looks like this:

_images/img-freetext.jpg


Using Buttons and JavaScript

Since MuPDF v1.16, ‘FreeText’ annotations no longer support bold or italic versions of the Times-Roman, Helvetica or Courier fonts.

A big thank you to our user @kurokawaikki, who contributed the following script to circumvent this restriction.

    """
    Problem: Since MuPDF v1.16 a 'Freetext' annotation font is restricted to the
    "normal" versions (no bold, no italics) of Times-Roman, Helvetica, Courier.
    It is impossible to use PyMuPDF to modify this.

    Solution: Using Adobe's JavaScript API, it is possible to manipulate properties
    of Freetext annotations. Check out these references:
    https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/js_api_reference.pdf,
    or https://www.adobe.com/devnet/acrobat/documentation.html.

    Function 'this.getAnnots()' will return all annotations as an array. We loop
    over this array to set the properties of the text through the 'richContents'
    attribute.
    There is no explicit property to set text to bold, but it is possible to set
    fontWeight=800 (400 is the normal size) of richContents.
    Other attributes, like color, italics, etc. can also be set via richContents.

    If we have 'FreeText' annotations created with PyMuPDF, we can make use of this
    JavaScript feature to modify the font - thus circumventing the above restriction.

    Use PyMuPDF v1.16.12 to create a push button that executes a Javascript
    containing the desired code. This is what this program does.
    Then open the resulting file with Adobe reader (!).
    After clicking on the button, all Freetext annotations will be bold, and the
    file can be saved.
    If desired, the button can be removed again, using free tools like PyMuPDF or
    PDF XChange editor.

    Note / Caution:
    ---------------
    The JavaScript will **only** work if the file is opened with Adobe Acrobat reader!
    When using other PDF viewers, the reaction is unforeseeable.
    """
    import sys

    import fitz

    # this JavaScript will execute when the button is clicked:
    jscript = """
    var annt = this.getAnnots();
    annt.forEach(function (item, index) {
        try {
            var span = item.richContents;
            span.forEach(function (it, dx) {
                it.fontWeight = 800;
            })
            item.richContents = span;
        } catch (err) {}
    });
    app.alert('Done');
    """

    i_fn = sys.argv[1]  # input file name
    o_fn = "bold-" + i_fn  # output filename
    doc = fitz.open(i_fn)  # open input
    page = doc[0]  # get desired page

    # ------------------------------------------------
    # make a push button for invoking the JavaScript
    # ------------------------------------------------
    widget = fitz.Widget()  # create widget

    # make it a 'PushButton'
    widget.field_type = fitz.PDF_WIDGET_TYPE_BUTTON
    widget.field_flags = fitz.PDF_BTN_FIELD_IS_PUSHBUTTON

    widget.rect = fitz.Rect(5, 5, 20, 20)  # button position

    widget.script = jscript  # fill in JavaScript source text
    widget.field_name = "Make bold"  # arbitrary name
    widget.field_value = "Off"  # arbitrary value
    widget.fill_color = (0, 0, 1)  # make button visible

    annot = page.add_widget(widget)  # add the widget to the page
    doc.save(o_fn)  # output the file

How to Use Ink Annotations

Ink annotations are used to contain freehand scribbling. A typical example may be an image of your signature consisting of first name and last name. Technically an ink annotation is implemented as a list of lists of points. Each point list is regarded as a continuous line connecting the points. Different point lists represent independent line segments of the annotation.

The following script creates an ink annotation with two mathematical curves (sine and cosine function graphs) as line segments:

    import math
    import fitz

    #------------------------------------------------------------------------------
    # preliminary stuff: create function value lists for sine and cosine
    #------------------------------------------------------------------------------
    w360 = math.pi * 2  # go through full circle
    deg = w360 / 360  # 1 degree as radians
    rect = fitz.Rect(100, 200, 300, 300)  # use this rectangle
    first_x = rect.x0  # x starts from left
    first_y = rect.y0 + rect.height / 2.  # rect middle means y = 0
    x_step = rect.width / 360  # rect width means 360 degrees
    y_scale = rect.height / 2.  # rect height means 2
    sin_points = []  # sine values go here
    cos_points = []  # cosine values go here
    for x in range(362):  # now fill in the values
        x_coord = x * x_step + first_x  # current x coordinate
        y = -math.sin(x * deg)  # sine
        p = (x_coord, y * y_scale + first_y)  # corresponding point
        sin_points.append(p)  # append
        y = -math.cos(x * deg)  # cosine
        p = (x_coord, y * y_scale + first_y)  # corresponding point
        cos_points.append(p)  # append

    #------------------------------------------------------------------------------
    # create the document with one page
    #------------------------------------------------------------------------------
    doc = fitz.open()  # make new PDF
    page = doc.new_page()  # give it a page

    #------------------------------------------------------------------------------
    # add the Ink annotation, consisting of 2 curve segments
    #------------------------------------------------------------------------------
    annot = page.add_ink_annot((sin_points, cos_points))
    # let it look a little nicer
    annot.set_border(width=0.3, dashes=[1,])  # line thickness, some dashing
    annot.set_colors(stroke=(0, 0, 1))  # make the lines blue
    annot.update()  # update the appearance
    page.draw_rect(rect, width=0.3)  # only to demonstrate we did OK
    doc.save("a-inktest.pdf")

This is the result:

_images/img-inkannot.jpg
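The “list of lists of points” structure can be prepared and inspected without PyMuPDF. The sketch below generalizes the sampling loop of the script to any periodic function. The helper name sample_curve is hypothetical, and the rectangle is given as a plain tuple (x0, y0, x1, y1) instead of a Rect.

```python
import math


def sample_curve(func, rect, steps=360):
    """Sample func over one full period and map the values into rect.

    Returns one stroke: a list of (x, y) tuples. Function values in the
    range -1..1 are scaled to the rectangle's height, with larger values
    placed nearer to the rectangle's top (smaller y).
    """
    x0, y0, x1, y1 = rect
    width, height = x1 - x0, y1 - y0
    mid_y = y0 + height / 2
    points = []
    for i in range(steps + 1):
        angle = 2 * math.pi * i / steps
        x = x0 + width * i / steps
        y = mid_y - func(angle) * height / 2
        points.append((x, y))
    return points


rect = (100, 200, 300, 300)
# two independent strokes: this is the shape an ink annotation expects
ink_list = [sample_curve(math.sin, rect), sample_curve(math.cos, rect)]
print(len(ink_list), len(ink_list[0]))
```

A list built this way has the same shape as the (sin_points, cos_points) argument used in the script above.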


Drawing and Graphics

PDF files support elementary drawing operations as part of their syntax. This includes basic geometrical objects like lines, curves, circles, rectangles including specifying colors.

The syntax for such operations is defined in “A Operator Summary” on page 643 of the Adobe PDF References. Specifying these operators for a PDF page happens in its contents objects.

PyMuPDF implements a large part of the available features via its Shape class, which is comparable to notions like “canvas” in other packages (e.g. reportlab).

A shape is always created as a child of a page, usually with an instruction like shape = page.new_shape(). The class defines numerous methods that perform drawing operations on the page’s area. For example, last_point = shape.draw_rect(rect) draws a rectangle along the borders of a suitably defined rect = fitz.Rect(…).

The returned last_point always is the Point where drawing operation ended (“last point”). Every such elementary drawing requires a subsequent Shape.finish() to “close” it, but there may be multiple drawings which have one common finish() method.

In fact, Shape.finish() defines a group of preceding draw operations to form one – potentially rather complex – graphics object. PyMuPDF provides several predefined graphics in shapes_and_symbols.py which demonstrate how this works.

If you import this script, you can also directly use its graphics as in the following example:

    # -*- coding: utf-8 -*-
    """
    Created on Sun Dec 9 08:34:06 2018

    @author: Jorj
    @license: GNU AFFERO GPL V3

    Create a list of available symbols defined in shapes_and_symbols.py

    This also demonstrates an example usage: how these symbols could be used
    as bullet-point symbols in some text.
    """
    import os

    import fitz
    import shapes_and_symbols as sas

    # list of available symbol functions and their descriptions
    tlist = [
        (sas.arrow, "arrow (easy)"),
        (sas.caro, "caro (easy)"),
        (sas.clover, "clover (easy)"),
        (sas.diamond, "diamond (easy)"),
        (sas.dontenter, "do not enter (medium)"),
        (sas.frowney, "frowney (medium)"),
        (sas.hand, "hand (complex)"),
        (sas.heart, "heart (easy)"),
        (sas.pencil, "pencil (very complex)"),
        (sas.smiley, "smiley (easy)"),
    ]

    r = fitz.Rect(50, 50, 100, 100)  # first rect to contain a symbol
    d = fitz.Rect(0, r.height + 10, 0, r.height + 10)  # displacement to next rect
    p = (15, -r.height * 0.2)  # starting point of explanation text
    rlist = [r]  # rectangle list

    for i in range(1, len(tlist)):  # fill in all the rectangles
        rlist.append(rlist[i-1] + d)

    doc = fitz.open()  # create empty PDF
    page = doc.new_page()  # create an empty page
    shape = page.new_shape()  # start a Shape (canvas)

    for i, r in enumerate(rlist):
        tlist[i][0](shape, rlist[i])  # execute symbol creation
        shape.insert_text(rlist[i].br + p,  # insert description text
                          tlist[i][1], fontsize=r.height/1.2)

    # store everything to the page's /Contents object
    shape.commit()

    scriptdir = os.path.dirname(__file__)
    doc.save(os.path.join(scriptdir, "symbol-list.pdf"))  # save the PDF

This is the script’s outcome:

_images/img-symbols.jpg


Extracting Drawings

  • New in v1.18.0

The drawing commands issued by a page can be extracted. Interestingly, this is possible for all supported document types – not just PDF: so you can use it for XPS, EPUB and others as well.

The Page method Page.get_drawings() accesses draw commands and converts them into a list of Python dictionaries. Each dictionary – called a “path” – represents a separate drawing. It may be simple, like a single line, or a complex combination of lines and curves representing one of the shapes of the previous section.

The path dictionary has been designed such that it can easily be used by the Shape class and its methods. Here is an example for a page with one path, that draws a red-bordered yellow circle inside rectangle Rect(100, 100, 200, 200):

    >>> pprint(page.get_drawings())
    [{'closePath': True,
      'color': [1.0, 0.0, 0.0],
      'dashes': '[] 0',
      'even_odd': False,
      'fill': [1.0, 1.0, 0.0],
      'items': [('c',
                 Point(100.0, 150.0),
                 Point(100.0, 177.614013671875),
                 Point(122.38600158691406, 200.0),
                 Point(150.0, 200.0)),
                ('c',
                 Point(150.0, 200.0),
                 Point(177.61399841308594, 200.0),
                 Point(200.0, 177.614013671875),
                 Point(200.0, 150.0)),
                ('c',
                 Point(200.0, 150.0),
                 Point(200.0, 122.385986328125),
                 Point(177.61399841308594, 100.0),
                 Point(150.0, 100.0)),
                ('c',
                 Point(150.0, 100.0),
                 Point(122.38600158691406, 100.0),
                 Point(100.0, 122.385986328125),
                 Point(100.0, 150.0))],
      'lineCap': (0, 0, 0),
      'lineJoin': 0,
      'opacity': 1.0,
      'rect': Rect(100.0, 100.0, 200.0, 200.0),
      'width': 1.0}]
    >>>

Note

You need (at least) 4 Bézier curves (of 3rd order) to draw a circle with acceptable precision. See the Wikipedia article on Bézier curves (https://en.wikipedia.org/wiki/B%C3%A9zier_curve) for some background.
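The control points in the output above can be reproduced with the commonly used quarter-circle approximation, which places each control point at a distance of radius * 4/3 * (sqrt(2) - 1) from the curve's end point. The sketch below is an illustration under that assumption and not necessarily the exact computation used by MuPDF. The function name is hypothetical, and plain tuples stand in for Point and Rect objects.

```python
import math

# distance factor for approximating a quarter circle by one cubic Bézier curve
KAPPA = 4 / 3 * (math.sqrt(2) - 1)  # about 0.5523


def circle_beziers(rect):
    """Return 4 cubic Bézier curves approximating the ellipse inside rect.

    rect is (x0, y0, x1, y1). Each curve is a tuple of 4 points:
    (start, control 1, control 2, end).
    """
    x0, y0, x1, y1 = rect
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center
    kx = (x1 - x0) / 2 * KAPPA  # horizontal control point offset
    ky = (y1 - y0) / 2 * KAPPA  # vertical control point offset
    p_left, p_right = (x0, cy), (x1, cy)  # middle of left / right edge
    p_low, p_high = (cx, y1), (cx, y0)  # points with largest / smallest y
    return [
        (p_left, (x0, cy + ky), (cx - kx, y1), p_low),
        (p_low, (cx + kx, y1), (x1, cy + ky), p_right),
        (p_right, (x1, cy - ky), (cx + kx, y0), p_high),
        (p_high, (cx - kx, y0), (x0, cy - ky), p_left),
    ]


curves = circle_beziers((100, 100, 200, 200))
for curve in curves:
    print(curve)
```

For the rectangle (100, 100, 200, 200), the computed control points agree with the values in the extracted path to about three decimal places.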

The following is a code snippet which extracts the drawings of a page and re-draws them on a new page:

    import fitz

    doc = fitz.open("some.file")
    page = doc[0]
    paths = page.get_drawings()  # extract existing drawings
    # this is a list of "paths", which can directly be drawn again using Shape
    # -------------------------------------------------------------------------
    #
    # define some output page with the same dimensions
    outpdf = fitz.open()
    outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
    shape = outpage.new_shape()  # make a drawing canvas for the output page
    # --------------------------------------
    # loop through the paths and draw them
    # --------------------------------------
    for path in paths:
        # ------------------------------------
        # draw each entry of the 'items' list
        # ------------------------------------
        for item in path["items"]:  # these are the draw commands
            if item[0] == "l":  # line
                shape.draw_line(item[1], item[2])
            elif item[0] == "re":  # rectangle
                shape.draw_rect(item[1])
            elif item[0] == "qu":  # quad
                shape.draw_quad(item[1])
            elif item[0] == "c":  # curve
                shape.draw_bezier(item[1], item[2], item[3], item[4])
            else:
                raise ValueError("unhandled drawing", item)
        # ------------------------------------------------------
        # all items are drawn, now apply the common properties
        # to finish the path
        # ------------------------------------------------------
        shape.finish(
            fill=path["fill"],  # fill color
            color=path["color"],  # line color
            dashes=path["dashes"],  # line dashing
            even_odd=path.get("even_odd", True),  # control color of overlaps
            closePath=path["closePath"],  # whether to connect last and first point
            lineJoin=path["lineJoin"],  # how line joins should look like
            lineCap=max(path["lineCap"]),  # how line ends should look like
            width=path["width"],  # line width
            stroke_opacity=path.get("stroke_opacity", 1),  # same value for both
            fill_opacity=path.get("fill_opacity", 1),  # opacity parameters
        )
    # all paths processed - commit the shape to its page
    shape.commit()
    outpdf.save("drawings-page-0.pdf")

As can be seen, there is a high congruence level with the Shape class, with one exception: for technical reasons, lineCap is a tuple of 3 numbers here, whereas it is an integer in Shape (and in PDF). So we simply take the maximum value of that tuple.

Here is a comparison between input and output of an example page, created by the previous script:

_images/img-getdrawings.png

Note

The reconstruction of graphics as shown here is not perfect. The following aspects will not be reproduced as of this version:

  • Page definitions can be complex and may include instructions that hide certain areas to keep them invisible. Things like this are ignored by Page.get_drawings() - it will always return all paths.

Note

You can use the path list to make your own lists of e.g. all lines or all rectangles on the page, subselect them by criteria like color or position on the page etc.
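
As a minimal sketch of such a sub-selection, the helper below filters a path list by draw command type and by stroke color. The function name select_paths and the sample data are made up for illustration. The sample uses plain tuples in place of the point and rectangle objects a real page delivers, and it relies only on the "items" and "color" keys that the script above also uses.

```python
def select_paths(paths, item_type=None, color=None):
    """Return the paths that contain a given draw command and / or stroke color."""
    selected = []
    for path in paths:
        if color is not None and path.get("color") != color:
            continue  # stroke color does not match
        if item_type is not None and not any(
            item[0] == item_type for item in path["items"]
        ):
            continue  # no draw command of the requested type
        selected.append(path)
    return selected


# Stand-in data shaped like the path dictionaries used above (hypothetical values).
sample_paths = [
    {"items": [("l", (0, 0), (10, 10))], "color": (1, 0, 0)},
    {"items": [("re", (0, 0, 50, 20))], "color": (0, 0, 1)},
    {"items": [("l", (5, 5), (5, 50)), ("re", (1, 1, 2, 2))], "color": (0, 0, 1)},
]

red_paths = select_paths(sample_paths, color=(1, 0, 0))
rect_paths = select_paths(sample_paths, item_type="re")
blue_line_paths = select_paths(sample_paths, item_type="l", color=(0, 0, 1))
print(len(red_paths), len(rect_paths), len(blue_line_paths))
```

With a real document you would pass the list returned by page.get_drawings() instead of sample_paths. Filtering by position would work the same way, by testing the coordinates stored in the items.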


Multiprocessing

MuPDF has no integrated support for threading - they call themselves “threading-agnostic”. While there do exist tricky possibilities to still use threading with MuPDF, the baseline consequence for PyMuPDF is:

No Python threading support.

Using PyMuPDF in a Python threading environment will lead to blocking effects for the main thread.

However, there exists the option to use Python’s multiprocessing module in a variety of ways.

If you are looking to speed up page-oriented processing for a large document, use this script as a starting point. It should be at least twice as fast as the corresponding sequential processing.

    """
    Demonstrate the use of multiprocessing with PyMuPDF.
    Depending on the number of CPUs, the document is divided in page ranges.
    Each range is then worked on by one process.
    The type of work would typically be text extraction or page rendering. Each
    process must know where to put its results, because this processing pattern
    does not include inter-process communication or data sharing.
    Compared to sequential processing, speed improvements in range of 100% (ie.
    twice as fast) or better can be expected.
    """
    from __future__ import print_function, division
    import sys
    import os
    import time
    from multiprocessing import Pool, cpu_count
    import fitz

    # choose a version specific timer function (bytes == str in Python 2)
    mytime = time.clock if str is bytes else time.perf_counter


    def render_page(vector):
        """Render a page range of a document.

        Notes:
            The PyMuPDF document cannot be part of the argument, because that
            cannot be pickled. So we are being passed in just its filename.
            This is no performance issue, because we are a separate process and
            need to open the document anyway.
            Any page-specific function can be processed here - rendering is just
            an example - text extraction might be another.
            The work must however be self-contained: no inter-process communication
            or synchronization is possible with this design.
            Care must also be taken with which parameters are contained in the
            argument, because it will be passed in via pickling by the Pool class.
            So any large objects will increase the overall duration.
        Args:
            vector: a list containing required parameters.
        """
        # recreate the arguments
        idx = vector[0]  # this is the segment number we have to process
        cpu = vector[1]  # number of CPUs
        filename = vector[2]  # document filename
        mat = vector[3]  # the matrix for rendering
        doc = fitz.open(filename)  # open the document
        num_pages = doc.page_count  # get number of pages

        # pages per segment: make sure that cpu * seg_size >= num_pages!
        seg_size = int(num_pages / cpu + 1)
        seg_from = idx * seg_size  # our first page number
        seg_to = min(seg_from + seg_size, num_pages)  # last page number

        for i in range(seg_from, seg_to):  # work through our page segment
            page = doc[i]
            # page.get_text("rawdict")  # use any page-related type of work here, eg
            pix = page.get_pixmap(alpha=False, matrix=mat)
            # store away the result somewhere ...
            # pix.save("p-%i.png" % i)
        print("Processed page numbers %i through %i" % (seg_from, seg_to - 1))


    if __name__ == "__main__":
        t0 = mytime()  # start a timer
        filename = sys.argv[1]
        mat = fitz.Matrix(0.2, 0.2)  # the rendering matrix: scale down to 20%
        cpu = cpu_count()

        # make vectors of arguments for the processes
        vectors = [(i, cpu, filename, mat) for i in range(cpu)]
        print("Starting %i processes for '%s'." % (cpu, filename))

        pool = Pool()  # make pool of 'cpu_count()' processes
        pool.map(render_page, vectors, 1)  # start processes passing each a vector

        t1 = mytime()  # stop the timer
        print("Total time %g seconds" % round(t1 - t0, 2))
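
The page range arithmetic in render_page can be checked in isolation. The following is a minimal sketch with made-up helper names (segment_bounds, all_segments). It deliberately uses ceiling division instead of the script's int(num_pages / cpu + 1), so it is a variation on the idea rather than a copy of the script's formula. Surplus segments come out as empty ranges.

```python
import math


def segment_bounds(idx, n_segments, num_pages):
    """Return (first, stop) page numbers of segment idx as a half-open range."""
    seg_size = math.ceil(num_pages / n_segments)  # pages per segment, rounded up
    first = min(idx * seg_size, num_pages)
    stop = min(first + seg_size, num_pages)
    return first, stop


def all_segments(n_segments, num_pages):
    """List the bounds of every segment, in order."""
    return [segment_bounds(i, n_segments, num_pages) for i in range(n_segments)]


segments = all_segments(4, 10)  # 10 pages spread over 4 processes
print(segments)

# every page number must occur exactly once across all segments
covered = [p for first, stop in segments for p in range(first, stop)]
print(covered == list(range(10)))
```

Each worker would then loop over range(first, stop). An empty range simply means that the process has nothing to do, which happens when there are more processes than pages.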

Here is a more complex example involving inter-process communication between a main process (showing a GUI) and a child process doing PyMuPDF access to a document.

    """
    Created on 2019-05-01
    @author: yinkaisheng@live.com
    @copyright: 2019 yinkaisheng@live.com
    @license: GNU AFFERO GPL 3.0
    Demonstrate the use of multiprocessing with PyMuPDF
    -----------------------------------------------------
    This example shows some more advanced use of multiprocessing.
    The main process shows a Qt GUI and establishes a 2-way communication with
    another process, which accesses a supported document.
    """
    import os
    import sys
    import time
    import multiprocessing as mp
    import queue
    import fitz

    ''' PyQt and PySide namespace unifier shim
    https://www.pythonguis.com/faq/pyqt6-vs-pyside6/
    simple "if 'PyQt6' in sys.modules:" test fails for me, so the more complex pkgutil use
    overkill for most people who might have one or the other, why both?
    '''
    from pkgutil import iter_modules


    def module_exists(module_name):
        return module_name in (name for loader, name, ispkg in iter_modules())


    if module_exists("PyQt6"):
        # PyQt6
        from PyQt6 import QtGui, QtWidgets, QtCore
        from PyQt6.QtCore import pyqtSignal as Signal, pyqtSlot as Slot
        wrapper = "PyQt6"
    elif module_exists("PySide6"):
        # PySide6
        from PySide6 import QtGui, QtWidgets, QtCore
        from PySide6.QtCore import Signal, Slot
        wrapper = "PySide6"

    my_timer = time.clock if str is bytes else time.perf_counter


    class DocForm(QtWidgets.QWidget):
        def __init__(self):
            super().__init__()
            self.process = None
            self.queNum = mp.Queue()
            self.queDoc = mp.Queue()
            self.page_count = 0
            self.curPageNum = 0
            self.lastDir = ""
            self.timerSend = QtCore.QTimer(self)
            self.timerSend.timeout.connect(self.onTimerSendPageNum)
            self.timerGet = QtCore.QTimer(self)
            self.timerGet.timeout.connect(self.onTimerGetPage)
            self.timerWaiting = QtCore.QTimer(self)
            self.timerWaiting.timeout.connect(self.onTimerWaiting)
            self.initUI()

        def initUI(self):
            vbox = QtWidgets.QVBoxLayout()
            self.setLayout(vbox)

            hbox = QtWidgets.QHBoxLayout()
            self.btnOpen = QtWidgets.QPushButton("OpenDocument", self)
            self.btnOpen.clicked.connect(self.openDoc)
            hbox.addWidget(self.btnOpen)

            self.btnPlay = QtWidgets.QPushButton("PlayDocument", self)
            self.btnPlay.clicked.connect(self.playDoc)
            hbox.addWidget(self.btnPlay)

            self.btnStop = QtWidgets.QPushButton("Stop", self)
            self.btnStop.clicked.connect(self.stopPlay)
            hbox.addWidget(self.btnStop)

            self.label = QtWidgets.QLabel("0/0", self)
            self.label.setFont(QtGui.QFont("Verdana", 20))
            hbox.addWidget(self.label)

            vbox.addLayout(hbox)

            self.labelImg = QtWidgets.QLabel("Document", self)
            sizePolicy = QtWidgets.QSizePolicy(
                QtWidgets.QSizePolicy.Policy.Preferred, QtWidgets.QSizePolicy.Policy.Expanding
            )
            self.labelImg.setSizePolicy(sizePolicy)
            vbox.addWidget(self.labelImg)

            self.setGeometry(100, 100, 400, 600)
            self.setWindowTitle("PyMuPDF Document Player")
            self.show()

        def openDoc(self):
            path, _ = QtWidgets.QFileDialog.getOpenFileName(
                self,
                "Open Document",
                self.lastDir,
                "All Supported Files (*.pdf;*.epub;*.xps;*.oxps;*.cbz;*.fb2);;PDF Files (*.pdf);;EPUB Files (*.epub);;XPS Files (*.xps);;OpenXPS Files (*.oxps);;CBZ Files (*.cbz);;FB2 Files (*.fb2)",
                #options=QtWidgets.QFileDialog.Options(),
            )
            if path:
                self.lastDir, self.file = os.path.split(path)
                if self.process:
                    self.queNum.put(-1)  # use -1 to notify the process to exit
                self.timerSend.stop()
                self.curPageNum = 0
                self.page_count = 0
                self.process = mp.Process(
                    target=openDocInProcess, args=(path, self.queNum, self.queDoc)
                )
                self.process.start()
                self.timerGet.start(40)
                self.label.setText("0/0")
                self.queNum.put(0)
                self.startTime = time.perf_counter()
                self.timerWaiting.start(40)

        def playDoc(self):
            self.timerSend.start(500)

        def stopPlay(self):
            self.timerSend.stop()

        def onTimerSendPageNum(self):
            if self.curPageNum < self.page_count - 1:
                self.queNum.put(self.curPageNum + 1)
            else:
                self.timerSend.stop()

        def onTimerGetPage(self):
            try:
                ret = self.queDoc.get(False)
                if isinstance(ret, int):
                    self.timerWaiting.stop()
                    self.page_count = ret
                    self.label.setText("{}/{}".format(self.curPageNum + 1, self.page_count))
                else:  # tuple, pixmap info
                    num, samples, width, height, stride, alpha = ret
                    self.curPageNum = num
                    self.label.setText("{}/{}".format(self.curPageNum + 1, self.page_count))
                    fmt = (
                        QtGui.QImage.Format.Format_RGBA8888
                        if alpha
                        else QtGui.QImage.Format.Format_RGB888
                    )
                    qimg = QtGui.QImage(samples, width, height, stride, fmt)
                    self.labelImg.setPixmap(QtGui.QPixmap.fromImage(qimg))
            except queue.Empty as ex:
                pass

        def onTimerWaiting(self):
            self.labelImg.setText(
                'Loading "{}", {:.2f}s'.format(
                    self.file, time.perf_counter() - self.startTime
                )
            )

        def closeEvent(self, event):
            self.queNum.put(-1)
            event.accept()


    def openDocInProcess(path, queNum, quePageInfo):
        start = my_timer()
        doc = fitz.open(path)
        end = my_timer()
        quePageInfo.put(doc.page_count)
        while True:
            num = queNum.get()
            if num < 0:
                break
            page = doc.load_page(num)
            pix = page.get_pixmap()
            quePageInfo.put(
                (num, pix.samples, pix.width, pix.height, pix.stride, pix.alpha)
            )
        doc.close()
        print("process exit")


    if __name__ == "__main__":
        app = QtWidgets.QApplication(sys.argv)
        form = DocForm()
        sys.exit(app.exec())

General

How to Open with a Wrong File Extension

If you have a document with a wrong file extension for its type, you can still correctly open it.

Assume that “some.file” is actually an XPS. Open it like so:

    >>> doc = fitz.open("some.file", filetype="xps")

Note

MuPDF itself does not try to determine the file type from the file contents. You are responsible for supplying the filetype info in some way: either implicitly via the file extension, or explicitly as shown. There are pure Python packages like filetype that help you do this. Also consult the Document chapter for a full description.

If MuPDF encounters a file with an unknown or missing extension, it will try to open it as a PDF. So in these cases there is no need for additional precautions. Similarly, for memory documents, you can just specify doc=fitz.open(stream=mem_area) to open it as a PDF document.
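
If you would rather not add a dependency, a rough content check can be written by hand. The sketch below is only a heuristic, and the function name sniff_filetype is made up. It recognizes a PDF header near the start of the data and the fixed layout that EPUB prescribes for the first entry of its ZIP container. That the returned strings can be passed as the filetype argument is an assumption based on the usage shown above.

```python
def sniff_filetype(data):
    """Guess a filetype string from the leading bytes; return None if unknown."""
    # PDF: the header comment should occur near the start of the file
    if b"%PDF-" in data[:1024]:
        return "pdf"
    # EPUB: a ZIP archive whose first entry is an uncompressed file named
    # "mimetype"; the 30-byte local file header is followed by name and content
    if data.startswith(b"PK\x03\x04") and data[30:58] == b"mimetypeapplication/epub+zip":
        return "epub"
    return None  # XPS, CBZ etc. are ZIP-based too, but need a deeper look


pdf_head = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
epub_head = b"PK\x03\x04" + b"\x00" * 26 + b"mimetypeapplication/epub+zip"
print(sniff_filetype(pdf_head), sniff_filetype(epub_head), sniff_filetype(b"hello"))
```

A result of None means "undecided", in which case the file extension or an explicit choice by the user still has to settle the question.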


How to Embed or Attach Files

PDF supports incorporating arbitrary data. This can be done in one of two ways: “embedding” or “attaching”. PyMuPDF supports both options.

  1. Attached Files: data are attached to a page by way of a FileAttachment annotation with this statement: annot = page.add_file_annot(pos, …), for details see Page.add_file_annot(). The first parameter “pos” is the Point, where a “PushPin” icon should be placed on the page.

  2. Embedded Files: data are embedded on the document level via method Document.embfile_add().

The basic differences between these options are (1) you need edit permission to embed a file, but only annotation permission to attach, (2) like all annotations, attachments are visible on a page, embedded files are not.

There exist several example scripts: embedded-list.py, new-annots.py.

Also look at the sections above and at chapter Appendix 3.


How to Delete and Re-Arrange Pages

With PyMuPDF you have all options to copy, move, delete or re-arrange the pages of a PDF. Intuitive methods exist that allow you to do this on a page-by-page level, like the Document.copy_page() method.

Alternatively, you can prepare a complete new page layout in the form of a Python sequence that contains the page numbers you want, in the order you want, and as many times as you want each page. The following illustrates what can be done with Document.select():

doc.select([1, 1, 1, 5, 4, 9, 9, 9, 0, 2, 2, 2])

Now let’s prepare a PDF for double-sided printing (on a printer not directly supporting this):

The number of pages is given by len(doc) (equal to doc.page_count). The following lists represent the even and the odd page numbers, respectively:

    >>> p_even = [p for p in range(doc.page_count) if p % 2 == 0]
    >>> p_odd = [p for p in range(doc.page_count) if p % 2 == 1]
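
The same two lists can also be produced with stepped ranges. This small sketch uses a made-up page count of 7 so that it runs without a document. Keep in mind that page numbers are 0-based, so the "even" list holds the first, third, fifth, ... sheet sides.

```python
page_count = 7  # hypothetical document length

p_even = list(range(0, page_count, 2))  # 0-based page numbers 0, 2, 4, ...
p_odd = list(range(1, page_count, 2))  # 0-based page numbers 1, 3, 5, ...

print(p_even)
print(p_odd)
```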

This snippet creates the respective sub documents which can then be used to print the document:

    >>> doc.select(p_even)  # only the even pages left over
    >>> doc.save("even.pdf")  # save the "even" PDF
    >>> doc.close()  # recycle the file
    >>> doc = fitz.open(doc.name)  # re-open
    >>> doc.select(p_odd)  # and do the same with the odd pages
    >>> doc.save("odd.pdf")

For more information also have a look at this Wiki article.

The following example will reverse the order of all pages (extremely fast: sub-second time for the 756 pages of the Adobe PDF References):

    >>> lastPage = doc.page_count - 1
    >>> for i in range(lastPage):
    ...     doc.move_page(lastPage, i)  # move current last page to the front
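
To see why this loop reverses the page order, it can be replayed on a plain list. The function below is only a stand-in that mimics the effect of moving a page in front of another one; it is not PyMuPDF code and only covers the case used here, where the moved page comes after the target position.

```python
def move_page(pages, pno, to):
    """Mimic moving page 'pno' in front of page 'to' (valid for pno > to)."""
    page = pages.pop(pno)
    pages.insert(to, page)


pages = list(range(6))  # stand-in for a 6-page document
last_page = len(pages) - 1
for i in range(last_page):
    move_page(pages, last_page, i)  # the current last page goes to position i

print(pages)
```

In every pass, the page that currently sits at the end is placed directly behind the pages already moved, so the original order is consumed from back to front.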

This snippet duplicates the PDF with itself so that it will contain the pages 0, 1, …, n, 0, 1, …, n (extremely fast and without noticeably increasing the file size!):

    >>> page_count = len(doc)
    >>> for i in range(page_count):
    ...     doc.copy_page(i)  # copy this page to after last page

How to Join PDFs

It is easy to join PDFs with method Document.insert_pdf(). Given open PDF documents, you can copy page ranges from one to the other. You can select the point where the copied pages should be placed, reverse the page sequence and also change the page rotation. This Wiki article contains a full description.

The GUI script PDFjoiner.py uses this method to join a list of files while also joining the respective table of contents segments. It looks like this:

_images/img-pdfjoiner.jpg


How to Add Pages

There are two methods for adding new pages to a PDF: Document.insert_page() and Document.new_page() (and they share a common code base).

new_page

Document.new_page() returns the created Page object. Here is the method call showing its defaults:

    >>> doc = fitz.open(...)  # some new or existing PDF document
    >>> page = doc.new_page(to = -1,  # insertion point: end of document
    ...                     width = 595,  # page dimension: A4 portrait
    ...                     height = 842)

The above could also have been achieved with the short form page = doc.new_page(). The to parameter specifies the document’s page number (0-based) in front of which to insert.

To create a page in landscape format, just exchange the width and height values.

Use this to create the page with another pre-defined paper format:

    >>> w, h = fitz.paper_size("letter-l")  # 'Letter' landscape
    >>> page = doc.new_page(width = w, height = h)

The convenience function paper_size() knows over 40 industry standard paper formats to choose from. To see them, inspect dictionary paperSizes. Pass the desired dictionary key to paper_size() to retrieve the paper dimensions. Upper and lower case is supported. If you append “-L” to the format name, the landscape version is returned.
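
The lookup convention described here is easy to picture with a tiny stand-in. The table and the function lookup_paper_size below are hypothetical and cover only two formats. They illustrate the case-insensitive key and the "-l" landscape suffix, not how PyMuPDF implements paper_size() internally.

```python
# Hypothetical miniature table: (width, height) in points, portrait orientation.
PAPER_SIZES = {
    "a4": (595, 842),
    "letter": (612, 792),
}


def lookup_paper_size(name):
    """Return (width, height); a trailing '-l' selects the landscape version."""
    key = name.lower()  # upper and lower case are treated alike
    landscape = key.endswith("-l")
    if landscape:
        key = key[:-2]  # strip the suffix to get the plain format name
    width, height = PAPER_SIZES[key]
    return (height, width) if landscape else (width, height)


print(lookup_paper_size("A4"))
print(lookup_paper_size("letter-L"))
```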

Note

Here is a 3-liner that creates a PDF with one empty page. Its file size is 470 bytes:

    >>> doc = fitz.open()
    >>> doc.new_page()
    >>> doc.save("A4.pdf")

insert_page

Document.insert_page() also inserts a new page and accepts the same parameters to, width and height. But it also lets you insert arbitrary text into the new page and returns the number of inserted lines:

    >>> doc = fitz.open(...)  # some new or existing PDF document
    >>> n = doc.insert_page(to = -1,  # default insertion point
    ...                     text = None,  # string or sequence of strings
    ...                     fontsize = 11,
    ...                     width = 595,
    ...                     height = 842,
    ...                     fontname = "Helvetica",  # default font
    ...                     fontfile = None,  # any font file name
    ...                     color = (0, 0, 0))  # text color (RGB)

The text parameter can be a (sequence of) string (assuming UTF-8 encoding). Insertion will start at Point (50, 72), which is one inch below top of page and 50 points from the left. The number of inserted text lines is returned. See the method definition for more details.


How To Dynamically Clean Up Corrupt PDFs

This shows a potential use of PyMuPDF with another Python PDF library (the excellent pure Python package pdfrw is used here as an example).

If a clean, non-corrupt / decompressed PDF is needed, one could dynamically invoke PyMuPDF to recover from many problems like so:

    import sys
    from io import BytesIO
    from pdfrw import PdfReader
    import fitz

    #---------------------------------------
    # 'Tolerant' PDF reader
    #---------------------------------------
    def reader(fname, password = None):
        with open(fname, "rb") as f:  # read the PDF into memory and
            idata = f.read()
        ibuffer = BytesIO(idata)  # convert to stream
        if password is None:
            try:
                return PdfReader(ibuffer)  # if this works: fine!
            except Exception:
                pass

        # either we need a password or it is a problem-PDF
        # create a repaired / decompressed / decrypted version
        doc = fitz.open("pdf", ibuffer)
        if password is not None:  # decrypt if password provided
            rc = doc.authenticate(password)
            if not rc > 0:
                raise ValueError("wrong password")
        c = doc.tobytes(garbage=3, deflate=True)
        del doc  # close & delete doc
        return PdfReader(BytesIO(c))  # let pdfrw retry

    #---------------------------------------
    # Main program
    #---------------------------------------
    pdf = reader("pymupdf.pdf", password = None)  # include a password if necessary
    print(pdf.Info)
    # do further processing

With the command line utility pdftk (available for Windows only, but reported to also run under Wine) a similar result can be achieved, see here. However, you must invoke it as a separate process via subprocess.Popen, using stdin and stdout as communication vehicles.

How to Split Single Pages

This deals with splitting up pages of a PDF in arbitrary pieces. For example, you may have a PDF with Letter format pages which you want to print with a magnification factor of four: each page is split up in 4 pieces which each go to a separate PDF page in Letter format again:

    """
    Create a PDF copy with split-up pages (posterize)
    ---------------------------------------------------
    License: GNU AFFERO GPL V3
    (c) 2018 Jorj X. McKie

    Usage
    ------
    python posterize.py input.pdf

    Result
    -------
    A file "poster-input.pdf" with 4 output pages for every input page.

    Notes
    -----
    (1) Output file is chosen to have page dimensions of 1/4 of input.
    (2) Easily adapt the example to make n pages per input, or decide per each
        input page or whatever.

    Dependencies
    ------------
    PyMuPDF 1.12.2 or later
    """
    import fitz, sys

    infile = sys.argv[1]  # input file name
    src = fitz.open(infile)
    doc = fitz.open()  # empty output PDF

    for spage in src:  # for each page in input
        r = spage.rect  # input page rectangle
        d = fitz.Rect(spage.cropbox_position,  # CropBox displacement if not
                      spage.cropbox_position)  # starting at (0, 0)
        #--------------------------------------------------------------------------
        # example: cut input page into 2 x 2 parts
        #--------------------------------------------------------------------------
        r1 = r / 2  # top left rect
        r2 = r1 + (r1.width, 0, r1.width, 0)  # top right rect
        r3 = r1 + (0, r1.height, 0, r1.height)  # bottom left rect
        r4 = fitz.Rect(r1.br, r.br)  # bottom right rect
        rect_list = [r1, r2, r3, r4]  # put them in a list

        for rx in rect_list:  # run thru rect list
            rx += d  # add the CropBox displacement
            page = doc.new_page(-1,  # new output page with rx dimensions
                                width = rx.width,
                                height = rx.height)
            page.show_pdf_page(
                page.rect,  # fill all new page with the image
                src,  # input document
                spage.number,  # input page number
                clip = rx,  # which part to use of input page
            )

    # that's it, save output file
    doc.save("poster-" + src.name,
             garbage=3,  # eliminate duplicate objects
             deflate=True,  # compress stuff where possible
    )

This shows what happens to an input page:

_images/img-posterize.png
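
Note (2) in the script suggests adapting the split to other layouts. As a minimal sketch of that idea, the function below (a made-up helper named split_rect) computes the tiles of an arbitrary rows x cols grid using plain coordinate tuples, so it runs without PyMuPDF. Each tuple could be turned into a clip rectangle for the loop above.

```python
def split_rect(x0, y0, x1, y1, rows, cols):
    """Cut a rectangle into rows x cols equal tiles, listed row by row."""
    tile_w = (x1 - x0) / cols
    tile_h = (y1 - y0) / rows
    tiles = []
    for row in range(rows):
        for col in range(cols):
            left = x0 + col * tile_w
            top = y0 + row * tile_h
            tiles.append((left, top, left + tile_w, top + tile_h))
    return tiles


letter = (0, 0, 612, 792)  # Letter portrait in points
tiles = split_rect(*letter, rows=2, cols=2)
for tile in tiles:
    print(tile)
```

For the 2 x 2 case the tiles come out in the same order as r1 to r4 in the script: top left, top right, bottom left, bottom right.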


How to Combine Single Pages

This deals with joining PDF pages to form a new PDF with pages each combining two or four original ones (also called “2-up”, “4-up”, etc.). This could be used to create booklets or thumbnail-like overviews:

    '''
    Copy an input PDF to output combining every 4 pages
    ---------------------------------------------------
    License: GNU AFFERO GPL V3
    (c) 2018 Jorj X. McKie

    Usage
    ------
    python 4up.py input.pdf

    Result
    -------
    A file "4up-input.pdf" with 1 output page for every 4 input pages.

    Notes
    -----
    (1) Output file is chosen to have A4 portrait pages. Input pages are scaled
        maintaining side proportions. Both can be changed, e.g. based on input
        page size. However, note that not all pages need to have the same size, etc.
    (2) Easily adapt the example to combine just 2 pages (like for a booklet) or
        make the output page dimension dependent on input, or whatever.

    Dependencies
    -------------
    PyMuPDF 1.12.1 or later
    '''
    import fitz, sys

    infile = sys.argv[1]
    src = fitz.open(infile)
    doc = fitz.open()  # empty output PDF

    width, height = fitz.paper_size("a4")  # A4 portrait output page format
    r = fitz.Rect(0, 0, width, height)

    # define the 4 rectangles per page
    r1 = r / 2  # top left rect
    r2 = r1 + (r1.width, 0, r1.width, 0)  # top right
    r3 = r1 + (0, r1.height, 0, r1.height)  # bottom left
    r4 = fitz.Rect(r1.br, r.br)  # bottom right

    # put them in a list
    r_tab = [r1, r2, r3, r4]

    # now copy input pages to output
    for spage in src:
        if spage.number % 4 == 0:  # create new output page
            page = doc.new_page(-1,
                                width = width,
                                height = height)
        # insert input page into the correct rectangle
        page.show_pdf_page(r_tab[spage.number % 4],  # select output rect
                           src,  # input document
                           spage.number)  # input page number

    # by all means, save new file using garbage collection and compression
    doc.save("4up-" + infile, garbage=3, deflate=True)

Example effect:

_images/img-4up.png
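
Two small calculations sit behind this script: which output page and cell an input page lands in, and how much a page shrinks when it is fitted into a cell while keeping its proportions. The sketch below spells them out with made-up helper names (placement, fit_scale). The scale formula is the usual "largest uniform factor that still fits" rule and is meant as an illustration of the note in the docstring, not as PyMuPDF's internal code.

```python
def placement(page_number, per_sheet=4):
    """Return (output page index, cell index) for a 0-based input page number."""
    return divmod(page_number, per_sheet)


def fit_scale(src_w, src_h, cell_w, cell_h):
    """Largest uniform scale factor at which the source still fits the cell."""
    return min(cell_w / src_w, cell_h / src_h)


for n in range(6):
    print(n, placement(n))

# a Letter page (612 x 792) placed into one quarter of an A4 sheet
cell_w, cell_h = 595 / 2, 842 / 2
scale = fit_scale(612, 792, cell_w, cell_h)
print(round(612 * scale, 1), round(792 * scale, 1))
```

Changing per_sheet to 2 gives the mapping for a 2-up layout. The list of cell rectangles would have to be adapted accordingly.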


How to Convert Any Document to PDF

Here is a script that converts any PyMuPDF supported document to a PDF. These include XPS, EPUB, FB2, CBZ and all image formats, including multi-page TIFF images.

It features maintaining any metadata, table of contents and links contained in the source document:

    """
    Demo script: Convert input file to a PDF
    -----------------------------------------
    Intended for multi-page input files like XPS, EPUB etc.

    Features:
    ---------
    Recovery of table of contents and links of input file.
    While this works well for bookmarks (outlines, table of contents),
    links will only work if they are not of type "LINK_NAMED".
    This link type is skipped by the script.

    For XPS and EPUB input, internal links however **are** of type "LINK_NAMED".
    Base library MuPDF does not resolve them to page numbers.

    Anyone expert enough to know the internal structure of these
    document types can further interpret and resolve these link types.

    Dependencies
    --------------
    PyMuPDF v1.14.0+
    """
    import sys
    import fitz

    if not (list(map(int, fitz.VersionBind.split("."))) >= [1,14,0]):
        raise SystemExit("need PyMuPDF v1.14.0+")

    fn = sys.argv[1]
    print("Converting '%s' to '%s.pdf'" % (fn, fn))

    doc = fitz.open(fn)
    b = doc.convert_to_pdf()  # convert to pdf
    pdf = fitz.open("pdf", b)  # open as pdf

    toc = doc.get_toc()  # table of contents of input
    pdf.set_toc(toc)  # simply set it for output

    meta = doc.metadata  # read and set metadata
    if not meta["producer"]:
        meta["producer"] = "PyMuPDF v" + fitz.VersionBind
    if not meta["creator"]:
        meta["creator"] = "PyMuPDF PDF converter"
    meta["modDate"] = fitz.get_pdf_now()
    meta["creationDate"] = meta["modDate"]
    pdf.set_metadata(meta)

    # now process the links
    link_cnti = 0
    link_skip = 0
    for pinput in doc:  # iterate through input pages
        links = pinput.get_links()  # get list of links
        link_cnti += len(links)  # count how many
        pout = pdf[pinput.number]  # read corresp. output page
        for l in links:  # iterate through the links
            if l["kind"] == fitz.LINK_NAMED:  # we do not handle named links
                print("named link page", pinput.number, l)
                link_skip += 1  # count them
                continue
            pout.insert_link(l)  # simply output the others

    # save the conversion result
    pdf.save(fn + ".pdf", garbage=4, deflate=True)

    # say how many named links we skipped
    if link_cnti > 0:
        print("Skipped %i named links of a total of %i in input." % (link_skip, link_cnti))
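
The version check at the top of the script converts the version string into numbers before comparing. The short sketch below shows why: compared as plain strings, "1.9.2" would wrongly count as newer than "1.14.0". The helper name version_tuple is made up, and it assumes purely numeric, dot-separated version strings.

```python
def version_tuple(version):
    """Turn a dotted, purely numeric version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


print("1.9.2" >= "1.14.0")  # character-wise string comparison
print(version_tuple("1.9.2") >= version_tuple("1.14.0"))  # numeric comparison
```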

How to Deal with Messages Issued by MuPDF

Since PyMuPDF v1.16.0, error messages issued by the underlying MuPDF library are redirected to the Python standard device sys.stderr. So you can handle them like any other output going to this device.

In addition, these messages go to the internal buffer together with any MuPDF warnings – see below.

We always prefix these messages with an identifying string “mupdf:”. If you prefer to not see recoverable MuPDF errors at all, issue the command fitz.TOOLS.mupdf_display_errors(False).

MuPDF warnings continue to be stored in an internal buffer and can be viewed using Tools.mupdf_warnings().

Please note that MuPDF errors may or may not lead to Python exceptions. In other words, you may see error messages from which MuPDF can recover and continue processing.
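
Since the messages are ordinary sys.stderr output, the standard tools for capturing that stream should apply. The sketch below shows the pattern with a stand-in function, emit_messages, that merely imitates a library call writing to sys.stderr; no PyMuPDF call is involved. Whether a given PyMuPDF version routes every message through Python's sys.stderr object is an assumption taken from the description above.

```python
import contextlib
import io
import sys


def emit_messages():
    """Stand-in for a library call that reports problems on sys.stderr."""
    print("mupdf: cannot find startxref", file=sys.stderr)
    print("some unrelated message", file=sys.stderr)


buffer = io.StringIO()
with contextlib.redirect_stderr(buffer):  # collect stderr output in memory
    emit_messages()

# keep only the lines carrying the identifying prefix
mupdf_lines = [
    line for line in buffer.getvalue().splitlines() if line.startswith("mupdf:")
]
print(mupdf_lines)
```

The captured lines could then be written to a log file or inspected by the program instead of appearing on the console.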

Example output for a recoverable error. We are opening a damaged PDF, but MuPDF is able to repair it and gives us some information on what happened. Then we illustrate how to find out whether the document can later be saved incrementally. Checking the Document.is_dirty attribute at this point also indicates that the open had to repair the document:

    >>> import fitz
    >>> doc = fitz.open("damaged-file.pdf")  # leads to a sys.stderr message:
    mupdf: cannot find startxref
    >>> print(fitz.TOOLS.mupdf_warnings())  # check if there is more info:
    cannot find startxref
    trying to repair broken xref
    repairing PDF document
    object missing 'endobj' token
    >>> doc.can_save_incrementally()  # this is to be expected:
    False
    >>> # the following indicates whether there are updates so far
    >>> # this is the case because of the repair actions:
    >>> doc.is_dirty
    True
    >>> # the document has nevertheless been created:
    >>> doc
    fitz.Document('damaged-file.pdf')
    >>> # we now know that any save must occur to a new file

Example output for an unrecoverable error:

    >>> import fitz
    >>> doc = fitz.open("does-not-exist.pdf")
    mupdf: cannot open does-not-exist.pdf: No such file or directory
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        doc = fitz.open("does-not-exist.pdf")
      File "C:\Users\Jorj\AppData\Local\Programs\Python\Python37\lib\site-packages\fitz\fitz.py", line 2200, in __init__
        _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
    RuntimeError: cannot open does-not-exist.pdf: No such file or directory
    >>>

How to Deal with PDF Encryption

Starting with version 1.16.0, PDF decryption and encryption (using passwords) are fully supported. You can do the following:

Note

A PDF document may have two different passwords:

  • The owner password provides full access rights, including changing passwords, encryption method, or permission detail.

  • The user password provides access to document content according to the established permission details. If present, opening the PDF in a viewer will require providing it.

Method Document.authenticate() will automatically establish access rights according to the password used.

The following snippet creates a new PDF and encrypts it with separate user and owner passwords. Permissions are granted to print, copy and annotate, but no changes are allowed to someone authenticating with the user password:

    import fitz

    text = "some secret information"  # keep this data secret
    perm = int(
        fitz.PDF_PERM_ACCESSIBILITY  # always use this
        | fitz.PDF_PERM_PRINT  # permit printing
        | fitz.PDF_PERM_COPY  # permit copying
        | fitz.PDF_PERM_ANNOTATE  # permit annotations
    )
    owner_pass = "owner"  # owner password
    user_pass = "user"  # user password
    encrypt_meth = fitz.PDF_ENCRYPT_AES_256  # strongest algorithm
    doc = fitz.open()  # empty pdf
    page = doc.new_page()  # empty page
    page.insert_text((50, 72), text)  # insert the data
    doc.save(
        "secret.pdf",
        encryption=encrypt_meth,  # set the encryption method
        owner_pw=owner_pass,  # set the owner password
        user_pw=user_pass,  # set the user password
        permissions=perm,  # set permissions
    )

Opening this document with some viewer (Nitro Reader 5) reflects these settings:

_images/img-encrypting.jpg

As before, decryption happens automatically on save when no encryption parameters are provided.

To keep the encryption method of a PDF, save it using encryption=fitz.PDF_ENCRYPT_KEEP. If doc.can_save_incrementally() == True, an incremental save is also possible.

To change the encryption method specify the full range of options above (encryption, owner_pw, user_pw, permissions). An incremental save is not possible in this case.
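
The permissions value is a bit field, so individual rights are combined with "|" and tested with "&". The sketch below models this with a made-up IntFlag class named Perm. Its bit values follow the user access permission bits of the PDF specification and are assumed, not verified here, to equal the corresponding fitz.PDF_PERM_* constants.

```python
import enum


class Perm(enum.IntFlag):
    # Bit values as listed for user access permissions in the PDF specification;
    # assumed here to match the fitz.PDF_PERM_* constants of the same name.
    PRINT = 1 << 2
    MODIFY = 1 << 3
    COPY = 1 << 4
    ANNOTATE = 1 << 5
    ACCESSIBILITY = 1 << 9


# the same combination as in the encryption example: print, copy, annotate
perm = Perm.ACCESSIBILITY | Perm.PRINT | Perm.COPY | Perm.ANNOTATE

print(int(perm))  # the integer that would be stored
print(bool(perm & Perm.COPY))  # copying is permitted
print(bool(perm & Perm.MODIFY))  # modifying is not
```

The same "&" test can be applied to a permissions integer read from an existing document to find out which rights a user password grants.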


Common Issues and their Solutions

Changing Annotations: Unexpected Behaviour

Problem

There are two scenarios:

  1. Updating an annotation with PyMuPDF which was created by some other software.

  2. Creating an annotation with PyMuPDF and later changing it with some other software.

In both cases you may experience unintended changes, like a different annotation icon or text font, the fill color or line dashing have disappeared, line end symbols have changed their size or even have disappeared too, etc.

Cause

Annotation maintenance is handled differently by each PDF maintenance application. Some annotation types may not be supported, or not be supported fully or some details may be handled in a different way than in another application. There is no standard.

Almost always a PDF application also comes with its own icons (file attachments, sticky notes and stamps) and its own set of supported text fonts. For example:

  • (Py-) MuPDF only supports these 5 basic fonts for ‘FreeText’ annotations: Helvetica, Times-Roman, Courier, ZapfDingbats and Symbol – no italics / no bold variations. When changing a ‘FreeText’ annotation created by some other app, its font will probably not be recognized nor accepted and be replaced by Helvetica.

  • PyMuPDF supports all PDF text markers (highlight, underline, strikeout, squiggly), but these types cannot be updated with Adobe Acrobat Reader.

In most cases there also exists limited support for line dashing which causes existing dashes to be replaced by straight lines. For example:

  • PyMuPDF fully supports all line dashing forms, while other viewers only accept a limited subset.

Solutions

Unfortunately there is not much you can do in most of these cases.

  1. Stay with the same software for creating and changing an annotation.

  2. When using PyMuPDF to change an “alien” annotation, try to avoid Annot.update(). The following methods can be used without it, so that the original appearance should be maintained:

Misplaced Item Insertions on PDF Pages

Problem

You inserted an item (like an image, an annotation or some text) on an existing PDF page, but later you find it being placed at a different location than intended. For example an image should be inserted at the top, but it unexpectedly appears near the bottom of the page.

Cause

The creator of the PDF has established a non-standard page geometry without keeping it “local” (as they should!). Most commonly, the PDF standard point (0,0) at bottom-left has been changed to the top-left point. So top and bottom are reversed – causing your insertion to be misplaced.

The visible image of a PDF page is controlled by commands coded in a special mini-language. For an overview of this language consult “Operator Summary” on page 643 of the Adobe PDF References. These commands are stored in contents objects as strings (bytes in PyMuPDF).

Some commands in that language change the coordinate system of the page for all commands that follow. To keep the scope of such a command local, it must be wrapped in the command pair q (“save graphics state”, or “stack”) and Q (“restore graphics state”, or “unstack”).

So the PDF creator did this:

    stream
    1 0 0 -1 0 792 cm    % <=== change of coordinate system:
    ...                  % letter page, top / bottom reversed
    ...                  % remains active beyond these lines
    endstream

where they should have done this:

    stream
    q                    % put the following in a stack
    1 0 0 -1 0 792 cm    % <=== scope of this is limited by Q command
    ...                  % here, a different geometry exists
    Q                    % after this line, geometry of outer scope prevails
    endstream

Note

  • In the mini-language’s syntax, spaces and line breaks are equally accepted token delimiters.

  • Multiple consecutive delimiters are treated as one.

  • Keywords “stream” and “endstream” are inserted automatically – not by the programmer.
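To see why the cm command shown above reverses top and bottom, it helps to apply its six numbers to a point by hand. The sketch below uses plain Python; apply_cm is a hypothetical helper (not a PyMuPDF function) that applies a PDF matrix a b c d e f as x' = a*x + c*y + e and y' = b*x + d*y + f.

```python
def apply_cm(matrix, point):
    """Apply a PDF transformation matrix (a, b, c, d, e, f) to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


flip = (1, 0, 0, -1, 0, 792)  # the matrix from the stream above (letter page)
# a point 72 units above the bottom edge ends up 72 units below the top edge
print(apply_cm(flip, (72, 72)))  # (72, 720)
```

If this matrix stays active, every later insertion is mapped the same way, which is the misplacement described in this section.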

Solutions

Since v1.16.0, there is the property Page.is_wrapped, which lets you check whether a page’s contents are wrapped in that string pair.

If it is False or if you want to be on the safe side, pick one of the following:

  1. The easiest way: in your script, do a Page.clean_contents() before you do your first item insertion.

  2. Pre-process your PDF with the MuPDF command line utility mutool clean -c … and work with its output file instead.

  3. Directly wrap the page’s contents with the stacking commands before you do your first item insertion.

Solutions 1. and 2. use the same technical basis and do a lot more than what is required in this context: they also clean up other inconsistencies or redundancies that may exist, multiple /Contents objects will be concatenated into one, and much more.

Note

For incremental saves, solution 1. has an unpleasant implication: it will bloat the update delta, because it changes so many things and, in addition, stores the cleaned contents uncompressed. So, if you use Page.clean_contents() you should consider saving to a new file with (at least) garbage=3 and deflate=True.

Solution 3. is completely under your control and only does the minimum corrective action. There exists a handy low-level utility function which you can use for this. Suggested procedure:

  • Prepend the missing stacking command by executing fitz.TOOLS._insert_contents(page, b"q\n", False).

  • Append an unstacking command by executing fitz.TOOLS._insert_contents(page, b"\nQ", True).

  • Alternatively, just use Page._wrap_contents(), which executes the previous two functions.

Note

If small incremental update deltas are a concern, this approach is the most effective. Other contents objects are not touched. The utility method creates two new PDF stream objects and inserts them before and after the page’s other contents, respectively. We therefore recommend the following snippet to get this situation under control:

    >>> if not page.is_wrapped:
    ...     page.wrap_contents()
    >>> # start inserting text, images or annotations here

Missing or Unreadable Extracted Text

Fairly often, text extraction does not work as you would expect: text may be missing entirely, may not appear in the reading sequence visible on your screen, or may contain garbled characters (like a ? or a “TOFU” symbol), etc. This can be caused by a number of different problems.

Problem: no text is extracted

Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing.

Cause

  1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF).

  2. The PDF creator used no font, but simulated text by painting it, using little lines and curves. E.g. a capital “D” could be painted by a line “|” and a left-open semi-circle, an “o” by an ellipse, and so on.

Solution

Use OCR software such as OCRmyPDF to insert a hidden text layer underneath the visible page. The resulting PDF should behave as expected.

Problem: unreadable text

Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled.

Cause

  1. The single characters are readable as such (no “<?>” symbols), but the sequence in which the text is coded in the file deviates from the reading order. The motivation behind this may be technical, or it may be to protect the data against unwanted copying.

  2. Many “<?>” symbols occur, indicating that MuPDF could not interpret these characters. The font may indeed be unsupported by MuPDF, or the PDF creator may have used a font that displays readable text but purposely obfuscates the corresponding Unicode characters.

Solution

  1. Use layout preserving text extraction: python -m fitz gettext file.pdf.

  2. If other text extraction tools also don’t work, then the only solution again is OCRing the page.
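The causes described in this section can be told apart roughly by a small check before deciding on OCR. This is only a sketch under stated assumptions: diagnose_page is a hypothetical helper, the 10% threshold is arbitrary, and it assumes that uninterpretable characters are delivered as U+FFFD (the “<?>” symbol).

```python
def diagnose_page(page, threshold=0.1):
    """Classify a page's text extraction result (hypothetical helper).

    'page' is expected to be a fitz.Page.
    """
    text = page.get_text()
    if not text.strip():
        if page.get_images():
            return "scan"    # no text, but images: probably a scanned page
        return "vector"      # no text, no images: text may be drawn as lines
    bad = text.count("\ufffd")  # assumed marker for uninterpretable characters
    if bad / len(text) > threshold:
        return "garbled"     # many replacement characters: consider OCR
    return "ok"
```

Pages reported as "scan", "vector" or "garbled" are candidates for the OCR solution described above.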


Low-Level Interfaces

Numerous methods are available to access and manipulate PDF files on a fairly low level. Admittedly, a clear distinction between “low level” and “normal” functionality is not always possible or subject to personal taste.

It may also happen that functionality previously deemed low-level is later assessed as being part of the normal interface. This happened in v1.14.0 for the class Tools, which you now find as an item in the Classes chapter.

Anyway – it is a matter of documentation only: in which chapter of the documentation do you find what. Everything is available always and always via the same interface.


How to Iterate through the xref Table

A PDF’s xref table is a list of all objects defined in the file. This table may easily contain many thousand entries – the manual Adobe PDF References for example has 127’000 objects. Table entry “0” is reserved and must not be touched. The following script loops through the xref table and prints each object’s definition:

    >>> xreflen = doc.xref_length()  # length of objects table
    >>> for xref in range(1, xreflen):  # skip item 0!
    ...     print("")
    ...     print("object %i (stream: %s)" % (xref, doc.is_stream(xref)))
    ...     print(doc.xref_object(xref, compressed=False))

This produces the following output:

    object 1 (stream: False)
    <<
      /ModDate (D:20170314122233-04'00')
      /PXCViewerInfo (PDF-XChange Viewer;2.5.312.1;Feb 9 2015;12:00:06;D:20170314122233-04'00')
    >>
    object 2 (stream: False)
    <<
      /Type /Catalog
      /Pages 3 0 R
    >>
    object 3 (stream: False)
    <<
      /Kids [ 4 0 R 5 0 R ]
      /Type /Pages
      /Count 2
    >>
    object 4 (stream: False)
    <<
      /Type /Page
      /Annots [ 6 0 R ]
      /Parent 3 0 R
      /Contents 7 0 R
      /MediaBox [ 0 0 595 842 ]
      /Resources 8 0 R
    >>
    ...
    object 7 (stream: True)
    <<
      /Length 494
      /Filter /FlateDecode
    >>
    ...

A PDF object definition is an ordinary ASCII string.
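Instead of printing every definition, the same loop can produce a compact overview of a file. The sketch below counts objects by their /Type entry; count_object_types is a hypothetical helper, and it assumes that Document.xref_get_key() returns ('null', 'null') for objects without that key.

```python
from collections import Counter


def count_object_types(doc):
    """Count the objects of a PDF by their /Type entry (hypothetical helper)."""
    counts = Counter()
    for xref in range(1, doc.xref_length()):  # skip item 0!
        kind, value = doc.xref_get_key(xref, "Type")
        # objects without a /Type name are collected under one label
        counts[value if kind == "name" else "(no /Type)"] += 1
    return counts
```

For the two-page file shown above, the result would be expected to contain an entry such as '/Page': 2.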


How to Handle Object Streams

Some object types contain additional data apart from their object definition. Examples are images, fonts, embedded files or commands describing the appearance of a page.

Objects of these types are called “stream objects”. PyMuPDF allows reading an object’s stream via method Document.xref_stream() with the object’s xref as an argument. It is also possible to write back a modified version of a stream using Document.update_stream().

Assume that the following snippet wants to read all streams of a PDF for whatever reason:

    >>> xreflen = doc.xref_length()  # number of objects in file
    >>> for xref in range(1, xreflen):  # skip item 0!
    ...     if stream := doc.xref_stream(xref):
    ...         # do something with it (it is a bytes object or None)
    ...         # e.g. just write it back:
    ...         doc.update_stream(xref, stream)

Document.xref_stream() automatically returns a stream decompressed as a bytes object – and Document.update_stream() automatically compresses it if beneficial.


How to Handle Page Contents

A PDF page can have zero or multiple contents objects. These are stream objects describing what appears where and how on a page (like text and images). They are written in a special mini-language described e.g. in chapter “APPENDIX A - Operator Summary” on page 643 of the Adobe PDF References.

Every PDF reader application must be able to interpret the contents syntax to reproduce the intended appearance of the page.

If multiple contents objects are provided, they must be interpreted in the specified sequence, exactly as if they were provided as one concatenated object.

There are good technical arguments for having multiple contents objects:

  • It is a lot easier and faster to just add new contents objects than maintaining a single big one (which entails reading, decompressing, modifying, recompressing, and rewriting it for each change).

  • When working with incremental updates, a modified big contents object will bloat the update delta and can thus easily negate the efficiency of incremental saves.

For example, PyMuPDF adds new, small contents objects in methods Page.insert_image(), Page.show_pdf_page() and the Shape methods.

However, there are also situations when a single contents object is beneficial: it is easier to interpret and better compressible than multiple smaller ones.

Here are two ways of combining multiple contents of a page:

    >>> # method 1: use the MuPDF clean function
    >>> page.clean_contents()  # cleans and combines multiple Contents
    >>> xref = page.get_contents()[0]  # only one /Contents now!
    >>> cont = doc.xref_stream(xref)
    >>> # this has also reformatted the PDF commands

    >>> # method 2: extract concatenated contents
    >>> cont = page.read_contents()
    >>> # the /Contents source itself is unmodified

The clean function Page.clean_contents() does a lot more than just glueing contents objects: it also corrects and optimizes the PDF operator syntax of the page and removes any inconsistencies with the page’s object definition.


How to Access the PDF Catalog

This is a central (“root”) object of a PDF. It serves as a starting point to reach important other objects and it also contains some global options for the PDF:

    >>> import fitz
    >>> doc=fitz.open("PyMuPDF.pdf")
    >>> cat = doc.pdf_catalog()  # get xref of the /Catalog
    >>> print(doc.xref_object(cat))  # print object definition
    <<
        /Type/Catalog                 % object type
        /Pages 3593 0 R               % points to page tree
        /OpenAction 225 0 R           % action to perform on open
        /Names 3832 0 R               % points to global names tree
        /PageMode /UseOutlines        % initially show the TOC
        /PageLabels<</Nums[0<</S/D>>2<</S/r>>8<</S/D>>]>> % labels given to pages
        /Outlines 3835 0 R            % points to outline tree
    >>

Note

Indentation, line breaks and comments are inserted here for clarification purposes only and will not normally appear. For more information on the PDF catalog see section 7.7.2 on page 71 of the Adobe PDF References.


How to Access the PDF File Trailer

The trailer of a PDF file is a dictionary located towards the end of the file. It contains special objects, and pointers to important other information. See Adobe PDF References p. 42. Here is an overview:

    Key      Type        Value
    -------  ----------  ------------------------------------------------------------------
    Size     int         Number of entries in the cross-reference table + 1.
    Prev     int         Offset to previous xref section (indicates incremental updates).
    Root     dictionary  (indirect) Pointer to the catalog. See previous section.
    Encrypt  dictionary  Pointer to encryption object (encrypted files only).
    Info     dictionary  (indirect) Pointer to information (metadata).
    ID       array       File identifier consisting of two byte strings.
    XRefStm  int         Offset of a cross-reference stream. See Adobe PDF References p. 49.

Access this information via PyMuPDF with Document.pdf_trailer() or, equivalently, via Document.xref_object() using -1 instead of a valid xref number.

    >>> import fitz
    >>> doc=fitz.open("PyMuPDF.pdf")
    >>> print(doc.xref_object(-1))  # or: print(doc.pdf_trailer())
    <<
      /Type /XRef
      /Index [ 0 8263 ]
      /Size 8263
      /W [ 1 3 1 ]
      /Root 8260 0 R
      /Info 8261 0 R
      /ID [ <4339B9CEE46C2CD28A79EBDDD67CC9B3> <4339B9CEE46C2CD28A79EBDDD67CC9B3> ]
      /Length 19883
      /Filter /FlateDecode
    >>
    >>>
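Individual trailer keys can also be queried one by one with Document.xref_get_key() and xref -1. The following sketch collects the keys from the overview table that are actually present; trailer_summary is a hypothetical helper name, and the sketch assumes that a missing key is reported as ('null', 'null').

```python
def trailer_summary(doc):
    """Return the trailer keys of the overview table that are present.

    'doc' is expected to be a fitz.Document (hypothetical helper).
    """
    keys = ("Size", "Prev", "Root", "Encrypt", "Info", "ID", "XRefStm")
    found = {}
    for key in keys:
        kind, value = doc.xref_get_key(-1, key)  # -1 addresses the trailer
        if kind != "null":  # key exists in the trailer
            found[key] = (kind, value)
    return found
```

A "Prev" entry in the result hints at earlier incremental updates, and an "Encrypt" entry at an encrypted file.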

How to Access XML Metadata

A PDF may contain XML metadata in addition to the standard metadata format. In fact, most PDF viewer or modification software adds this type of information when saving the PDF (Adobe, Nitro PDF, PDF-XChange, etc.).

PyMuPDF has no way to interpret or change this information directly, because it contains no XML features. XML metadata is however stored as a stream object, so it can be read, modified with appropriate software and written back.

    >>> xmlmetadata = doc.get_xml_metadata()
    >>> print(xmlmetadata)
    <?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-702">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    ...
    omitted data
    ...
    <?xpacket end="w"?>

Using some XML package, the XML data can be interpreted and / or modified and then stored back. The following also works, if the PDF previously had no XML metadata:

    >>> # write back modified XML metadata:
    >>> doc.set_xml_metadata(xmlmetadata)
    >>>
    >>> # XML metadata can be deleted like this:
    >>> doc.del_xml_metadata()
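As an illustration of “some XML package”, Python's standard xml.etree.ElementTree is sufficient for simple lookups. The sketch below reads the title from the string returned by doc.get_xml_metadata(). It rests on assumptions: xmp_title is a hypothetical helper, the packet is wrapped in an element written as x:xmpmeta, and the title is stored in the usual Dublin Core form (dc:title containing rdf:li items).

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_title(xmlmetadata):
    """Return the first dc:title entry of an XMP packet, or None."""
    # cut out the <x:xmpmeta> element, dropping the xpacket wrappers
    start = xmlmetadata.find("<x:xmpmeta")
    end = xmlmetadata.rfind("</x:xmpmeta>")
    if start < 0 or end < 0:
        return None  # assumed wrapper element not found
    root = ET.fromstring(xmlmetadata[start:end + len("</x:xmpmeta>")])
    title = root.find(f".//{{{DC}}}title")
    if title is None:
        return None
    for item in title.iter(f"{{{RDF}}}li"):
        return item.text  # first language alternative
    return None
```

A modified tree could be serialized with ET.tostring() and stored back via doc.set_xml_metadata(); keeping the xpacket wrappers intact is then your responsibility.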

How to Extend PDF Metadata

Attribute Document.metadata is designed so it works for all supported document types in the same way: it is a Python dictionary with a fixed set of key-value pairs. Correspondingly, Document.set_metadata() only accepts standard keys.

However, PDFs may contain items not accessible like this. Also, there may be reasons to store additional information, like copyrights. Here is a way to handle arbitrary metadata items by using PyMuPDF low-level functions.

As an example, look at this standard metadata output of some PDF:

    # ---------------------
    # standard metadata
    # ---------------------
    pprint(doc.metadata)
    {'author': 'PRINCE',
     'creationDate': "D:2010102417034406'-30'",
     'creator': 'PrimoPDF http://www.primopdf.com/',
     'encryption': None,
     'format': 'PDF 1.4',
     'keywords': '',
     'modDate': "D:20200725062431-04'00'",
     'producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
                 'AppendMode 1.1',
     'subject': '',
     'title': 'Full page fax print',
     'trapped': ''}

Use the following code to see all items stored in the metadata object:

    # ----------------------------------
    # metadata including private items
    # ----------------------------------
    metadata = {}  # make my own metadata dict
    what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
    if what != "xref":
        pass  # PDF has no metadata
    else:
        xref = int(value.replace("0 R", ""))  # extract the metadata xref
        for key in doc.xref_get_keys(xref):
            metadata[key] = doc.xref_get_key(xref, key)[1]
        pprint(metadata)
    {'Author': 'PRINCE',
     'CreationDate': "D:2010102417034406'-30'",
     'Creator': 'PrimoPDF http://www.primopdf.com/',
     'ModDate': "D:20200725062431-04'00'",
     'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb 9 '
                      "2015;12:00:06;D:20200725062431-04'00'",
     'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
                 'AppendMode 1.1',
     'Title': 'Full page fax print'}
    # ---------------------------------------------------------------
    # note the additional 'PXCViewerInfo' key - ignored in standard!
    # ---------------------------------------------------------------

Vice versa, you can also store private metadata items in a PDF. It is your responsibility to make sure that these items conform to PDF specifications; in particular, they must be (unicode) strings. Consult section 14.3 (p. 548) of the Adobe PDF References for details and caveats:

    what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
    if what != "xref":
        raise ValueError("PDF has no metadata")
    xref = int(value.replace("0 R", ""))  # extract the metadata xref
    # add some private information
    doc.xref_set_key(xref, "mykey", fitz.get_pdf_str("北京 is Beijing"))
    #
    # after executing the previous code snippet, we will see this:
    pprint(metadata)
    {'Author': 'PRINCE',
     'CreationDate': "D:2010102417034406'-30'",
     'Creator': 'PrimoPDF http://www.primopdf.com/',
     'ModDate': "D:20200725062431-04'00'",
     'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb 9 '
                      "2015;12:00:06;D:20200725062431-04'00'",
     'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
                 'AppendMode 1.1',
     'Title': 'Full page fax print',
     'mykey': '北京 is Beijing'}

To delete selected keys, use doc.xref_set_key(xref, "mykey", "null"). As explained in the next section, string “null” is the PDF equivalent to Python’s None. A key with that value will be treated like being not specified – and physically removed in garbage collections.
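The snippets in this section extract the xref number with value.replace("0 R", ""), which relies on the reference having generation number 0. A slightly more defensive variant could look like the sketch below; xref_from_ref is a hypothetical helper, not a PyMuPDF function.

```python
import re

# an indirect reference looks like "<object number> <generation> R"
_REF = re.compile(r"^\s*(\d+)\s+(\d+)\s+R\s*$")


def xref_from_ref(value):
    """Extract the xref number from a reference string like '8261 0 R'."""
    match = _REF.match(value)
    if match is None:
        raise ValueError("not an indirect reference: %r" % value)
    return int(match.group(1))


print(xref_from_ref("8261 0 R"))  # 8261
```

It can replace the int(value.replace("0 R", "")) expression wherever the type returned by Document.xref_get_key() is "xref".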


How to Read and Update PDF Objects

There also exist granular, elegant ways to access and manipulate selected PDF dictionary keys.

  • Document.xref_get_keys() returns the PDF keys of the object at xref:

      In [1]: import fitz
      In [2]: doc = fitz.open("pymupdf.pdf")
      In [3]: page = doc[0]
      In [4]: from pprint import pprint
      In [5]: pprint(doc.xref_get_keys(page.xref))
      ('Type', 'Contents', 'Resources', 'MediaBox', 'Parent')

  • Compare with the full object definition:

      In [6]: print(doc.xref_object(page.xref))
      <<
        /Type /Page
        /Contents 1297 0 R
        /Resources 1296 0 R
        /MediaBox [ 0 0 612 792 ]
        /Parent 1301 0 R
      >>

  • Single keys can also be accessed directly via Document.xref_get_key(). The value always is a string together with type information, that helps interpreting it:

      In [7]: doc.xref_get_key(page.xref, "MediaBox")
      Out[7]: ('array', '[0 0 612 792]')

  • Here is a full listing of the above page keys:

      In [9]: for key in doc.xref_get_keys(page.xref):
         ...:     print("%s = %s" % (key, doc.xref_get_key(page.xref, key)))
         ...:
      Type = ('name', '/Page')
      Contents = ('xref', '1297 0 R')
      Resources = ('xref', '1296 0 R')
      MediaBox = ('array', '[0 0 612 792]')
      Parent = ('xref', '1301 0 R')

  • An undefined key inquiry returns ('null', 'null') – PDF object type null corresponds to None in Python. Similar for the booleans true and false.

  • Let us add a new key to the page definition that sets its rotation to 90 degrees (you are aware that there actually exists Page.set_rotation() for this?):

      In [11]: doc.xref_get_key(page.xref, "Rotate")  # no rotation set:
      Out[11]: ('null', 'null')
      In [12]: doc.xref_set_key(page.xref, "Rotate", "90")  # insert a new key
      In [13]: print(doc.xref_object(page.xref))  # confirm success
      <<
        /Type /Page
        /Contents 1297 0 R
        /Resources 1296 0 R
        /MediaBox [ 0 0 612 792 ]
        /Parent 1301 0 R
        /Rotate 90
      >>

  • This method can also be used to remove a key from the xref dictionary by setting its value to null. The following will remove the rotation specification from the page: doc.xref_set_key(page.xref, "Rotate", "null"). Similarly, to remove all links, annotations and fields from a page, use doc.xref_set_key(page.xref, "Annots", "null"). Because Annots by definition is an array, setting an empty array with the statement doc.xref_set_key(page.xref, "Annots", "[]") would do the same job in this case.

  • PDF dictionaries can be hierarchically nested. In the following page object definition, both Font and XObject are subdictionaries of Resources:

      In [15]: print(doc.xref_object(page.xref))
      <<
        /Type /Page
        /Contents 1297 0 R
        /Resources <<
          /XObject <<
            /Im1 1291 0 R
          >>
          /Font <<
            /F39 1299 0 R
            /F40 1300 0 R
          >>
        >>
        /MediaBox [ 0 0 612 792 ]
        /Parent 1301 0 R
        /Rotate 90
      >>

  • The above situation is supported by methods Document.xref_set_key() and Document.xref_get_key(): use a path-like notation to point at the required key. For example, to retrieve the value of key Im1 above, specify the complete chain of dictionaries “above” it in the key argument: "Resources/XObject/Im1":

      In [16]: doc.xref_get_key(page.xref, "Resources/XObject/Im1")
      Out[16]: ('xref', '1291 0 R')

  • The path notation can also be used to directly set a value: use the following to let Im1 point to a different object:

      In [17]: doc.xref_set_key(page.xref, "Resources/XObject/Im1", "9999 0 R")
      In [18]: print(doc.xref_object(page.xref))  # confirm success:
      <<
        /Type /Page
        /Contents 1297 0 R
        /Resources <<
          /XObject <<
            /Im1 9999 0 R
          >>
          /Font <<
            /F39 1299 0 R
            /F40 1300 0 R
          >>
        >>
        /MediaBox [ 0 0 612 792 ]
        /Parent 1301 0 R
        /Rotate 90
      >>

    Be aware that no semantic checks whatsoever will take place here: if the PDF has no xref 9999, it won’t be detected at this point.

  • If a key does not exist, it will be created by setting its value. Moreover, if any intermediate keys do not exist either, they will also be created as necessary. The following creates an array D several levels below the existing dictionary A. Intermediate dictionaries B and C are automatically created:

      In [5]: print(doc.xref_object(xref))  # some existing PDF object:
      <<
        /A <<
        >>
      >>
      In [6]: # the following will create 'B', 'C' and 'D'
      In [7]: doc.xref_set_key(xref, "A/B/C/D", "[1 2 3 4]")
      In [8]: print(doc.xref_object(xref))  # check out what happened:
      <<
        /A <<
          /B <<
            /C <<
              /D [ 1 2 3 4 ]
            >>
          >>
        >>
      >>

  • When setting key values, basic PDF syntax checking will be done by MuPDF. For example, new keys can only be created below a dictionary. The following tries to create some new string item E below the previously created array D:

      In [9]: # 'D' is an array, no dictionary!
      In [10]: doc.xref_set_key(xref, "A/B/C/D/E", "(hello)")
      mupdf: not a dict (array)
      --- ... ---
      RuntimeError: not a dict (array)

  • It is also not possible to create a key if some higher-level key is an “indirect” object, i.e. an xref. In other words, xrefs can only be modified directly and not implicitly via other objects referencing them:

      In [13]: # the following object points to an xref
      In [14]: print(doc.xref_object(4))
      <<
        /E 3 0 R
      >>
      In [15]: # 'E' is an indirect object and cannot be modified here!
      In [16]: doc.xref_set_key(4, "E/F", "90")
      mupdf: path to 'F' has indirects
      --- ... ---
      RuntimeError: path to 'F' has indirects

Caution

These are expert functions! There are no validations as to whether valid PDF objects, xrefs, etc. are specified. As with other low-level methods, there is a risk of rendering the PDF, or parts of it, unusable.
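Because Document.xref_get_key() always returns a pair of strings, read-only scripts often want to turn the result into native Python values. The sketch below is one possible mapping; to_python is a hypothetical helper that only covers the simple cases shown in this section (flat numeric arrays, for instance) and returns everything else unchanged.

```python
def to_python(kind, value):
    """Convert a (type, value) pair from xref_get_key() to a Python value."""
    if kind == "null":
        return None
    if kind == "bool":
        return value == "true"
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    if kind == "xref":
        return int(value.split()[0])  # "1297 0 R" -> 1297
    if kind == "array":
        try:  # only flat arrays of numbers are converted
            return [float(item) for item in value.strip("[]").split()]
        except ValueError:
            return value  # e.g. arrays of references stay as strings
    return value  # names, strings, dicts: keep the PDF source text


print(to_python("array", "[0 0 612 792]"))  # [0.0, 0.0, 612.0, 792.0]
```

A call like to_python(*doc.xref_get_key(page.xref, "Rotate")) would then yield None for a page without a rotation entry.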


Journalling

Starting with version 1.19.0, journalling is possible when updating PDF documents.

Journalling is a logging mechanism which permits either reverting or re-applying changes to a PDF. Similar to LUWs (“Logical Units of Work”) in modern database systems, one can group a set of updates into an “operation”. In MuPDF journalling, an operation plays the role of a LUW.

Note

In contrast to LUW implementations found in database systems, MuPDF journalling happens on a per document level. There is no support for simultaneous updates across multiple PDFs: one would have to establish one’s own logic here.

  • Journalling must be enabled via a document method. Journalling is possible for existing or new documents. Journalling can be disabled only by closing the file.

  • Once enabled, every change must happen inside an operation; otherwise an exception is raised. An operation is started and stopped via document methods. Updates happening between these two calls form an LUW and can thus collectively be rolled back or re-applied, or, in MuPDF terminology, “undone” or “redone”, respectively.

  • At any point, the journalling status can be queried: whether journalling is active, how many operations have been recorded, whether “undo” or “redo” is possible, the current position inside the journal, etc.

  • The journal can be saved to or loaded from a file. These are document methods.

  • When loading a journal file, compatibility with the document is checked and journalling is automatically enabled upon success.

  • For an existing PDF being journalled, a special new save method is available: Document.save_snapshot(). This performs a special incremental save that includes all journalled updates so far. If its journal is saved at the same time (immediately after the document snapshot), then document and journal are in sync and can later be used together to undo or redo operations or to continue journalled updates, just as if there had been no interruption.

  • The snapshot PDF is a valid PDF in every aspect and fully usable. If the document is however changed in any way without using its journal file, then a desynchronization will take place and the journal is rendered unusable.

  • Snapshot files are structured like incremental updates. Nevertheless, the internal journalling logic requires that saving happens to a new file. So the user should develop a file naming convention to support recognizable relationships between an original PDF, like original.pdf, and its snapshot sets, like original-snap1.pdf / original-snap1.log, original-snap2.pdf / original-snap2.log, etc.

Example Session 1

Description:

  • Make a new PDF and enable journalling. Then add a page and some text lines – each as a separate operation.

  • Navigate within the journal, undoing and redoing these updates and displaying status and file results:

      >>> import fitz
      >>> doc=fitz.open()
      >>> doc.journal_enable()

      >>> # try update without an operation:
      >>> page = doc.new_page()
      mupdf: No journalling operation started
      ... omitted lines
      RuntimeError: No journalling operation started

      >>> doc.journal_start_op("op1")
      >>> page = doc.new_page()
      >>> doc.journal_stop_op()

      >>> doc.journal_start_op("op2")
      >>> page.insert_text((100,100), "Line 1")
      >>> doc.journal_stop_op()

      >>> doc.journal_start_op("op3")
      >>> page.insert_text((100,120), "Line 2")
      >>> doc.journal_stop_op()

      >>> doc.journal_start_op("op4")
      >>> page.insert_text((100,140), "Line 3")
      >>> doc.journal_stop_op()

      >>> # show position in journal
      >>> doc.journal_position()
      (4, 4)
      >>> # 4 operations recorded - positioned at bottom
      >>> # what can we do?
      >>> doc.journal_can_do()
      {'undo': True, 'redo': False}
      >>> # currently only undos are possible. Print page content:
      >>> print(page.get_text())
      Line 1
      Line 2
      Line 3

      >>> # undo last insert:
      >>> doc.journal_undo()
      >>> # show combined status again:
      >>> doc.journal_position();doc.journal_can_do()
      (3, 4)
      {'undo': True, 'redo': True}
      >>> print(page.get_text())
      Line 1
      Line 2

      >>> # our position is now second to last
      >>> # last text insertion was reverted
      >>> # but we can redo / move forward as well:
      >>> doc.journal_redo()
      >>> # our combined status:
      >>> doc.journal_position();doc.journal_can_do()
      (4, 4)
      {'undo': True, 'redo': False}
      >>> print(page.get_text())
      Line 1
      Line 2
      Line 3
      >>> # line 3 has appeared again!
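Since every update must sit between journal_start_op() and journal_stop_op(), scripts may find it convenient to pair the two calls automatically. The following is a minimal sketch of such a wrapper; journal_op is a hypothetical helper and not part of PyMuPDF.

```python
from contextlib import contextmanager


@contextmanager
def journal_op(doc, name):
    """Run the enclosed updates as one named journalling operation."""
    doc.journal_start_op(name)
    try:
        yield doc
    finally:
        # always close the operation, even if an update raised an exception
        doc.journal_stop_op()
```

With this helper, an update from the session above could be written as: with journal_op(doc, "op2"): page.insert_text((100, 100), "Line 1"). Whether a failed operation should additionally be undone is a design decision left to the caller.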

Example Session 2

Description:

  • Similar to previous, but after undoing some operations, we now add a different update. This will cause:

    • permanent removal of the undone journal entries

    • the new update operation will become the new last entry.

      >>> doc=fitz.open()
      >>> doc.journal_enable()
      >>> doc.journal_start_op("Page insert")
      >>> page=doc.new_page()
      >>> doc.journal_stop_op()
      >>> for i in range(5):
      ...     doc.journal_start_op("insert-%i" % i)
      ...     page.insert_text((100, 100 + 20*i), "text line %i" %i)
      ...     doc.journal_stop_op()

      >>> # combined status info:
      >>> doc.journal_position();doc.journal_can_do()
      (6, 6)
      {'undo': True, 'redo': False}

      >>> for i in range(3):  # revert last three operations
      ...     doc.journal_undo()
      >>> doc.journal_position();doc.journal_can_do()
      (3, 6)
      {'undo': True, 'redo': True}

      >>> # now do a different update:
      >>> doc.journal_start_op("Draw some line")
      >>> page.draw_line((100,150), (300,150))
      Point(300.0, 150.0)
      >>> doc.journal_stop_op()
      >>> doc.journal_position();doc.journal_can_do()
      (4, 4)
      {'undo': True, 'redo': False}

      >>> # this has changed the journal:
      >>> # previous last 3 text line operations were removed, and
      >>> # we have only 4 operations: drawing the line is the new last one