Appendix 3: Assorted Technical Information

This section deals with various technical topics that are not necessarily related to each other.


Image Transformation Matrix

Starting with version 1.18.11, the image transformation matrix is returned by some methods for text and image extraction: Page.get_text() and Page.get_image_bbox().

The transformation matrix contains information about how an image was transformed to fit into the rectangle (its “bounding box” = “bbox”) on some document page. By inspecting the image’s bbox on the page and this matrix, one can determine, for example, whether and how the image is displayed scaled or rotated on a page.

The relationship between image dimension and its bbox on a page is the following:

  1. Using the original image’s width and height,

    • define the image rectangle imgrect = fitz.Rect(0, 0, width, height)

    • define the “shrink matrix” shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0).

  2. Transforming the image rectangle with its shrink matrix results in the unit rectangle: imgrect * shrink = fitz.Rect(0, 0, 1, 1).

  3. Using the image transformation matrix “transform”, the following steps will compute the bbox:

    imgrect = fitz.Rect(0, 0, width, height)
    shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0)
    bbox = imgrect * shrink * transform

  4. Inspecting the matrix product shrink * transform will reveal all information about what happened to the image rectangle to make it fit into the bbox on the page: rotation, scaling of its sides and translation of its origin. Let us look at an example:

    >>> imginfo = page.get_images()[0] # get an image item on a page
    >>> imginfo
    (5, 0, 439, 501, 8, 'DeviceRGB', '', 'fzImg0', 'DCTDecode')
    >>> #------------------------------------------------
    >>> # define image shrink matrix and rectangle
    >>> #------------------------------------------------
    >>> shrink = fitz.Matrix(1 / 439, 0, 0, 1 / 501, 0, 0)
    >>> imgrect = fitz.Rect(0, 0, 439, 501)
    >>> #------------------------------------------------
    >>> # determine image bbox and transformation matrix:
    >>> #------------------------------------------------
    >>> bbox, transform = page.get_image_bbox("fzImg0", transform=True)
    >>> #------------------------------------------------
    >>> # confirm equality - permitting rounding errors
    >>> #------------------------------------------------
    >>> bbox
    Rect(100.0, 112.37525939941406, 300.0, 287.624755859375)
    >>> imgrect * shrink * transform
    Rect(100.0, 112.375244140625, 300.0, 287.6247253417969)
    >>> #------------------------------------------------
    >>> shrink * transform
    Matrix(0.0, -0.39920157194137573, 0.3992016017436981, 0.0, 100.0, 287.6247253417969)
    >>> #------------------------------------------------
    >>> # the above shows:
    >>> # image sides are scaled by same factor ~0.4,
    >>> # and the image is rotated by 90 degrees clockwise
    >>> # compare this with fitz.Matrix(-90) * 0.4
    >>> #------------------------------------------------
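
The inspection of shrink * transform can also be reproduced without PyMuPDF. The following is a minimal sketch in plain Python, assuming the usual PDF convention that a matrix (a, b, c, d, e, f) maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The helper names apply_matrix, transformed_bbox and decompose are made up for this illustration and are not part of PyMuPDF. The decomposition is only meaningful for matrices without shearing, as in the example above.

```python
import math

def apply_matrix(matrix, point):
    """Transform point (x, y) with a 6-element matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)

def transformed_bbox(matrix, rect):
    """Smallest axis-parallel rectangle around the transformed corners of rect."""
    x0, y0, x1, y1 = rect
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    xs, ys = zip(*(apply_matrix(matrix, p) for p in corners))
    return (min(xs), min(ys), max(xs), max(ys))

def decompose(matrix):
    """Split a shear-free matrix into side scale factors, angle and translation."""
    a, b, c, d, e, f = matrix
    scale_x = math.hypot(a, b)              # scaling of the image width
    scale_y = math.hypot(c, d)              # scaling of the image height
    angle = math.degrees(math.atan2(b, a))  # same sign convention as fitz.Matrix(deg)
    return scale_x, scale_y, angle, (e, f)

# values of "shrink * transform" and the image rectangle from the session above
product = (0.0, -0.39920157194137573, 0.3992016017436981, 0.0, 100.0, 287.6247253417969)
imgrect = (0, 0, 439, 501)

bbox = transformed_bbox(product, imgrect)
scale_x, scale_y, angle, origin = decompose(product)
```

With these inputs, bbox agrees with the rectangle reported by Page.get_image_bbox() up to rounding, both scale factors are about 0.4, and the angle comes out as -90, which matches the comparison with fitz.Matrix(-90) * 0.4.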

PDF Base 14 Fonts

The following 14 built-in font names must be supported by every PDF viewer application. They are available as a dictionary, which maps their full names and their abbreviations in lower case to the full font basename. Wherever a fontname must be provided in PyMuPDF, any key or value from the dictionary may be used:

  In [2]: fitz.Base14_fontdict
  Out[2]:
  {'courier': 'Courier',
  'courier-oblique': 'Courier-Oblique',
  'courier-bold': 'Courier-Bold',
  'courier-boldoblique': 'Courier-BoldOblique',
  'helvetica': 'Helvetica',
  'helvetica-oblique': 'Helvetica-Oblique',
  'helvetica-bold': 'Helvetica-Bold',
  'helvetica-boldoblique': 'Helvetica-BoldOblique',
  'times-roman': 'Times-Roman',
  'times-italic': 'Times-Italic',
  'times-bold': 'Times-Bold',
  'times-bolditalic': 'Times-BoldItalic',
  'symbol': 'Symbol',
  'zapfdingbats': 'ZapfDingbats',
  'helv': 'Helvetica',
  'heit': 'Helvetica-Oblique',
  'hebo': 'Helvetica-Bold',
  'hebi': 'Helvetica-BoldOblique',
  'cour': 'Courier',
  'coit': 'Courier-Oblique',
  'cobo': 'Courier-Bold',
  'cobi': 'Courier-BoldOblique',
  'tiro': 'Times-Roman',
  'tibo': 'Times-Bold',
  'tiit': 'Times-Italic',
  'tibi': 'Times-BoldItalic',
  'symb': 'Symbol',
  'zadb': 'ZapfDingbats'}

Despite this obligation, not all PDF viewers support these fonts correctly and completely – this is especially true for Symbol and ZapfDingbats. Also, the glyph (visual) images will be specific to every reader.

To see how these fonts can be used – including the CJK built-in fonts – look at the table in Page.insert_font().
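
To illustrate the “any key or value” rule, here is a small sketch in plain Python. It rebuilds the dictionary shown above from the 14 abbreviations instead of importing it, so that it runs without PyMuPDF. The helper resolve_base14 is hypothetical and only mimics the lookup idea; it says nothing about how PyMuPDF resolves font names internally.

```python
# Abbreviations and full names, copied from the dictionary listing above.
ABBREVIATIONS = {
    "helv": "Helvetica", "heit": "Helvetica-Oblique",
    "hebo": "Helvetica-Bold", "hebi": "Helvetica-BoldOblique",
    "cour": "Courier", "coit": "Courier-Oblique",
    "cobo": "Courier-Bold", "cobi": "Courier-BoldOblique",
    "tiro": "Times-Roman", "tibo": "Times-Bold",
    "tiit": "Times-Italic", "tibi": "Times-BoldItalic",
    "symb": "Symbol", "zadb": "ZapfDingbats",
}

# Lower-case full names plus abbreviations, both mapping to the full basename.
BASE14 = {full.lower(): full for full in ABBREVIATIONS.values()}
BASE14.update(ABBREVIATIONS)

def resolve_base14(fontname):
    """Return the full basename for a dictionary key or value, else None."""
    if fontname in BASE14:            # a key: lower-case full name or abbreviation
        return BASE14[fontname]
    if fontname in BASE14.values():   # a value: already a full basename
        return fontname
    return None

bold = resolve_base14("tibo")        # abbreviation
symbol = resolve_base14("Symbol")    # full basename
unknown = resolve_base14("Arial")    # not a Base 14 font
```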


Adobe PDF References

This PDF Reference manual published by Adobe is frequently quoted throughout this documentation. It can be viewed and downloaded from here.

Note

For a long time, an older version was also available under this link. It appears to have been taken off the website in October 2021. Earlier (pre 1.19.*) versions of the PyMuPDF documentation used to refer to this document. We have made an effort to replace those references with references to the current specification above.


Using Python Sequences as Arguments in PyMuPDF

When PyMuPDF objects and methods require a Python list of numerical values, other Python sequence types are also allowed. Python classes are said to implement the sequence protocol if they have a __getitem__() method.

This basically means that you can use Python list, tuple, or even array.array, numpy.array and bytearray types interchangeably in these cases.

For example, specifying a sequence "s" in any of the following ways

  • s = [1, 2] – a list

  • s = (1, 2) – a tuple

  • s = array.array("i", (1, 2)) – an array.array

  • s = numpy.array((1, 2)) – a numpy array

  • s = bytearray((1, 2)) – a bytearray

will make it usable in the following example expressions:

  • fitz.Point(s)

  • fitz.Point(x, y) + s

  • doc.select(s)

The same holds for all geometry objects: Rect, IRect, Matrix and Point.

Because all PyMuPDF geometry classes themselves are special cases of sequences, they (with the exception of Quad – see below) can be freely used where numerical sequences can be used, e.g. as arguments for functions like list(), tuple(), array.array() or numpy.array(). Look at the following snippet to see this work.

  >>> import fitz, array, numpy as np
  >>> m = fitz.Matrix(1, 2, 3, 4, 5, 6)
  >>>
  >>> list(m)
  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
  >>>
  >>> tuple(m)
  (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
  >>>
  >>> array.array("f", m)
  array('f', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  >>>
  >>> np.array(m)
  array([1., 2., 3., 4., 5., 6.])
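
The sequence protocol itself can be demonstrated without PyMuPDF. The sketch below defines a hypothetical class Pair (not a PyMuPDF class) that only provides __len__() and __getitem__(), and a hypothetical function as_xy that accepts any two-item sequence. It is meant to show why such objects work with list() and tuple(), not how the geometry classes are actually implemented.

```python
import array

class Pair:
    """Hypothetical two-number object that implements the sequence protocol."""

    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

    def __len__(self):
        return 2

    def __getitem__(self, index):
        # An IndexError for index >= 2 is what ends iteration in list() / tuple().
        return (self.x, self.y)[index]

def as_xy(seq):
    """Accept any sequence of two numbers and return them as floats."""
    if len(seq) != 2:
        raise ValueError("need a sequence of length 2")
    return float(seq[0]), float(seq[1])

# All of these are interchangeable as input.
results = [
    as_xy([1, 2]),
    as_xy((1, 2)),
    as_xy(array.array("i", (1, 2))),
    as_xy(bytearray((1, 2))),
    as_xy(Pair(1, 2)),
]

# Conversely, the custom class can be fed to functions expecting a sequence.
as_list = list(Pair(1, 2))
as_tuple = tuple(Pair(1, 2))
```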

Note

Quad is a Python sequence object as well and has a length of 4. Its items however are point_like – not numbers. Therefore, the above remarks do not apply.


Ensuring Consistency of Important Objects in PyMuPDF

PyMuPDF is a Python binding for the C library MuPDF. While a lot of effort has been invested by MuPDF’s creators to approximate some sort of an object-oriented behavior, they certainly could not overcome basic shortcomings of the C language in that respect.

Python on the other hand implements the OO-model in a very clean way. The interface code between PyMuPDF and MuPDF consists of two basic files: fitz.py and fitz_wrap.c. They are created by the excellent SWIG tool for each new version.

When you use one of PyMuPDF’s objects or methods, this will result in execution of some code in fitz.py, which in turn will call some C code compiled with fitz_wrap.c.

Because SWIG goes a long way to keep the Python and the C level in sync, everything works fine as long as a certain set of rules is strictly followed. For example: never access a Page object after you have closed (or deleted or set to None) the owning Document. Or, less obviously: never access a page or any of its children (links or annotations) after you have executed one of the document methods select(), delete_page(), insert_page() … and more.

However, merely refraining from accessing invalidated objects is not enough: they should also be actively deleted, so that their C-level resources (meaning allocated memory) are freed.

The reason for these rules lies in the fact that there is a hierarchical 2-level one-to-many relationship between a document and its pages and also between a page and its links / annotations. To maintain a consistent situation, any of the above actions must lead to a complete reset – in Python and, synchronously, in C.

SWIG cannot know about this and consequently does not do it.

The required logic has therefore been built into PyMuPDF itself in the following way.

  1. If a page “loses” its owning document or is being deleted itself, all of its currently existing annotations and links will be made unusable in Python, and their C-level counterparts will be deleted and deallocated.

  2. If a document is closed (or deleted or set to None) or if its structure has changed, then similarly all currently existing pages and their children will be made unusable, and corresponding C-level deletions will take place. “Structure changes” include methods like select(), delete_page(), insert_page(), insert_pdf() and so on: all of these will result in a cascade of object deletions.

The programmer will normally not notice any of this. If they try to access invalidated objects, however, exceptions will be raised.

Invalidated objects cannot be directly deleted as with Python statements like del page or page = None, etc. Instead, their __del__ method must be invoked.

All pages, links and annotations have the property parent, which points to the owning object. This is the property that can be checked on the application level: if obj.parent is None, then the object’s parent is gone, and any reference to its properties or methods will raise an exception informing about this “orphaned” state.

A sample session:

  >>> page = doc[n]
  >>> annot = page.first_annot
  >>> annot.type # everything works fine
  [5, 'Circle']
  >>> page = None # this turns 'annot' into an orphan
  >>> annot.type
  <... omitted lines ...>
  RuntimeError: orphaned object: parent is None
  >>>
  >>> # same happens, if you do this:
  >>> annot = doc[n].first_annot # deletes the page again immediately!
  >>> annot.type # so, 'annot' is 'born' orphaned
  <... omitted lines ...>
  RuntimeError: orphaned object: parent is None

This shows the cascading effect:

  >>> doc = fitz.open("some.pdf")
  >>> page = doc[n]
  >>> annot = page.first_annot
  >>> page.rect
  fitz.Rect(0.0, 0.0, 595.0, 842.0)
  >>> annot.type
  [5, 'Circle']
  >>> del doc # or doc = None or doc.close()
  >>> page.rect
  <... omitted lines ...>
  RuntimeError: orphaned object: parent is None
  >>> annot.type
  <... omitted lines ...>
  RuntimeError: orphaned object: parent is None
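
The cascade can be modelled in a few lines of plain Python. This is a simplified sketch of the idea only, assuming a plain list of children per owner. The classes Document and Owned are hypothetical stand-ins, and the sketch does not reflect PyMuPDF’s actual implementation or its C-level cleanup.

```python
class Owned:
    """Hypothetical page or annotation: usable only while it has a parent."""

    def __init__(self, parent, name):
        self.parent = parent
        self._name = name
        self._children = []
        parent._children.append(self)   # register with the owning object

    @property
    def name(self):
        if self.parent is None:
            raise RuntimeError("orphaned object: parent is None")
        return self._name

    def invalidate(self):
        # Invalidate own children first, then drop the link to the parent.
        for child in self._children:
            child.invalidate()
        self._children.clear()
        self.parent = None

class Document:
    """Hypothetical top-level owner."""

    def __init__(self):
        self._children = []

    def close(self):
        for child in self._children:
            child.invalidate()
        self._children.clear()

doc = Document()
page = Owned(doc, "page")
annot = Owned(page, "annot")
annot_name_before = annot.name      # works: the whole chain is intact

doc.close()                         # cascades down to page and annot
try:
    annot.name
    message = None
except RuntimeError as exc:
    message = str(exc)
```

After doc.close(), both page.parent and annot.parent are None, which is the same check that the parent property allows at the application level.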

Note

Objects outside the above relationship are not included in this mechanism. If, for example, you created a table of contents by toc = doc.get_toc(), and later close or change the document, then this cannot and does not change variable toc in any way. It is your responsibility to refresh such variables as required.


Design of Method Page.show_pdf_page()

Purpose and Capabilities

The method displays an image of a (“source”) page of another PDF document within a specified rectangle of the current (“containing”, “target”) page.

  • In contrast to Page.insert_image(), this display is vector-based and hence remains accurate across zooming levels.

  • Just like Page.insert_image(), the size of the display is adjusted to the given rectangle.

The following variations of the display are currently supported:

  • Bool parameter keep_proportion controls whether to maintain the aspect ratio (default) or not.

  • Rectangle parameter clip restricts the visible part of the source page rectangle. Default is the full page.

  • Float parameter rotate rotates the display by an arbitrary angle (degrees). If the angle is not an integer multiple of 90 and keep_proportion is also true, only 2 of the 4 corners may be positioned on the target border.

  • Bool parameter overlay controls whether to put the image on top (foreground, default) of current page content or not (background).

Use cases include (but are not limited to) the following:

  1. “Stamp” a series of pages of the current document with the same image, like a company logo or a watermark.

  2. Combine arbitrary input pages into one output page to support “booklet” or double-sided printing (known as “4-up”, “n-up”).

  3. Split up (large) input pages into several arbitrary pieces. This is also called “posterization”, because you can, for example, split an A4 page horizontally and vertically, print the 4 pieces enlarged to separate A4 pages, and end up with an A2 version of your original page.
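
For the posterization use case, the clip rectangles can be computed up front. The following sketch uses plain tuples (x0, y0, x1, y1) and a hypothetical helper split_rect. It only shows the arithmetic and assumes a regular grid of equally sized pieces.

```python
def split_rect(rect, rows, cols):
    """Split rect = (x0, y0, x1, y1) into rows * cols equal pieces, row by row."""
    x0, y0, x1, y1 = rect
    width = (x1 - x0) / cols
    height = (y1 - y0) / rows
    return [
        (x0 + c * width, y0 + r * height,
         x0 + (c + 1) * width, y0 + (r + 1) * height)
        for r in range(rows)
        for c in range(cols)
    ]

# An A4 page of 595 x 842 points, split horizontally and vertically.
pieces = split_rect((0, 0, 595, 842), 2, 2)
```

Each of the four tuples could then serve as the clip argument of one Page.show_pdf_page() call, with the full rectangle of a new A4 page as the target.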

Technical Implementation

This is done using PDF “Form XObjects”, see section 8.10 on page 217 of Adobe PDF References. On execution of a Page.show_pdf_page(rect, src, pno, …), the following things happen:

  1. The resources and contents objects of page pno in document src are copied over to the current document, jointly creating a new Form XObject with the following properties. The PDF xref number of this object is returned by the method.

    1. /BBox equals /MediaBox of the source page

    2. /Matrix equals the identity matrix [1 0 0 1 0 0]

    3. /Resources equals that of the source page. This involves a “deep-copy” of hierarchically nested other objects (including fonts, images, etc.). The complexity involved here is covered by MuPDF’s grafting technique functions (see footnote 1).

    4. This is a stream object type, and its stream is an exact copy of the combined data of the source page’s /Contents objects.

    This step is only executed once per shown source page. Subsequent displays of the same page only create pointers (done in next step) to this object.

  2. A second Form XObject is then created which the target page uses to invoke the display. This object has the following properties:

    1. /BBox equals the /CropBox of the source page (or clip).

    2. /Matrix represents the mapping of /BBox to the target rectangle.

    3. /XObject references the previous XObject via the fixed name fullpage.

    4. The stream of this object contains exactly one fixed statement: /fullpage Do.

  3. The resources and contents objects of the target page are now modified as follows.

    1. Add an entry to the /XObject dictionary of /Resources with the name fzFrm<n> (with n chosen such that this entry is unique on the page).

    2. Depending on overlay, prepend or append a new object to the page’s /Contents array, containing the statement q /fzFrm<n> Do Q.
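
Two details of the steps above can be sketched in plain Python: the /Matrix that maps /BBox onto the target rectangle, and the choice of a unique fzFrm<n> name. The functions fit_matrix, unique_form_name and invocation are hypothetical. The sketch ignores rotation and the difference between PDF and MuPDF coordinate systems, and it assumes that the display is centered in the target rectangle when the aspect ratio is kept. None of this is taken from PyMuPDF’s source code.

```python
def fit_matrix(bbox, target, keep_proportion=True):
    """Matrix (a, b, c, d, e, f) mapping rectangle bbox into rectangle target."""
    bx0, by0, bx1, by1 = bbox
    tx0, ty0, tx1, ty1 = target
    sx = (tx1 - tx0) / (bx1 - bx0)
    sy = (ty1 - ty0) / (by1 - by0)
    if keep_proportion:
        sx = sy = min(sx, sy)        # same factor for both sides
    # Translate so that the center of bbox lands on the center of target.
    e = (tx0 + tx1) / 2 - sx * (bx0 + bx1) / 2
    f = (ty0 + ty1) / 2 - sy * (by0 + by1) / 2
    return (sx, 0.0, 0.0, sy, e, f)

def unique_form_name(existing_names):
    """Pick the first name fzFrm<n> not yet used in the page's /XObject entries."""
    n = 0
    while f"fzFrm{n}" in existing_names:
        n += 1
    return f"fzFrm{n}"

def invocation(name):
    """Content stream statement that displays the named Form XObject."""
    return f"q /{name} Do Q"

half_size = fit_matrix((0, 0, 595, 842), (0, 0, 297.5, 421))
name = unique_form_name({"fzFrm0", "fzFrm1"})
statement = invocation(name)
```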

Redirecting Error and Warning Messages

Since MuPDF version 1.16, error and warning messages can be redirected via an official plugin.

PyMuPDF will put error messages to sys.stderr prefixed with the string “mupdf:”. Warnings are internally stored and can be accessed via fitz.TOOLS.mupdf_warnings(). There is also a function to empty this store: fitz.TOOLS.reset_mupdf_warnings().
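
Because the error messages go to sys.stderr with a fixed prefix, an application can collect them with standard library tools. The sketch below is one possible approach, assuming the messages are written through Python’s sys.stderr object. The helper capture_mupdf_errors is hypothetical, and the function noisy with its message text is made up to simulate the output so that the example runs without PyMuPDF.

```python
import contextlib
import io
import sys

def capture_mupdf_errors(func, *args, **kwargs):
    """Run func and return (result, list of stderr lines prefixed with 'mupdf:')."""
    buffer = io.StringIO()
    with contextlib.redirect_stderr(buffer):
        result = func(*args, **kwargs)
    errors = [
        line for line in buffer.getvalue().splitlines()
        if line.startswith("mupdf:")
    ]
    return result, errors

def noisy():
    """Stand-in for a PyMuPDF call; the message text is invented."""
    print("mupdf: some error message", file=sys.stderr)
    print("unrelated output", file=sys.stderr)
    return 42

result, errors = capture_mupdf_errors(noisy)
```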

Footnotes

1

MuPDF supports “deep-copying” objects between PDF documents. To avoid duplicate data in the target, it uses so-called “graftmaps”, like a form of scratchpad: for each object to be copied, its xref number is looked up in the graftmap. If found, copying is skipped. Otherwise, the new xref is recorded and the copy takes place. PyMuPDF makes use of this technique in two places so far: Document.insert_pdf() and Page.show_pdf_page(). This process is fast and very efficient, because it prevents multiple copies of typically large and frequently referenced data, like images and fonts. However, you may still want to consider using garbage collection (option 4) in any of the following cases:

  1. The target PDF is not new / empty: grafting does not check for resources that already existed (e.g. images, fonts) in the target document before opening it.

  2. Using Page.show_pdf_page() for more than one source document: each grafting occurs within one source PDF only, not across multiple. So if e.g. the same image exists in pages from different source PDFs, then this will not be detected until garbage collection.
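
The graftmap idea and the second case above can be illustrated with a toy model. In this sketch, a PDF is reduced to a dictionary that maps xref numbers to object data, and the hypothetical class GraftMap records which source xrefs have already been copied. It is a simplified illustration, not MuPDF’s implementation.

```python
class GraftMap:
    """Toy graftmap: remembers which source xrefs were already copied."""

    def __init__(self, target):
        self.target = target    # dict: target xref -> object data
        self.copied = {}        # dict: source xref -> target xref

    def graft(self, source, src_xref):
        """Copy source[src_xref] into the target unless already done."""
        if src_xref in self.copied:
            return self.copied[src_xref]      # copying is skipped
        new_xref = len(self.target) + 1
        self.target[new_xref] = source[src_xref]
        self.copied[src_xref] = new_xref      # record the new xref
        return new_xref

target = {}
source_a = {7: b"logo image data"}
source_b = {3: b"logo image data"}   # the same image in another PDF

map_a = GraftMap(target)             # one graftmap per source document
first = map_a.graft(source_a, 7)
second = map_a.graft(source_a, 7)    # found in the graftmap, no second copy
copies_after_a = len(target)

map_b = GraftMap(target)
third = map_b.graft(source_b, 3)     # not detected: different source PDF
copies_after_b = len(target)
```

After both sources have been processed, the target holds the identical image data twice, which is the kind of duplicate that only garbage collection removes.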