Working with pages
This section deals with how to view and edit the contents of a page.
pikepdf is not an ideal tool for producing new PDFs from scratch – and there are many good tools for that, as mentioned elsewhere. pikepdf is better at inspecting, editing and transforming existing PDFs.
Page objects in PDFs are dictionaries.
In [1]: from pikepdf import Pdf, Page

In [2]: example = Pdf.open('../tests/resources/congress.pdf')

In [3]: pageobj1 = example.pages[0]

In [4]: pageobj1
The page’s /Contents key contains instructions for drawing the page content. This is a content stream, which is a stream object that follows special rules.
Also attached to this page is a /Resources dictionary, which contains a single XObject image. The image is compressed with the /DCTDecode filter, meaning it is encoded with the discrete cosine transform (DCT), so it is a JPEG. pikepdf has special APIs for working with images.
The /MediaBox describes the bounding box of the page in PDF points (1 pt = 1/72 inch, or about 0.35 mm).
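To make the units concrete, the sketch below converts a box into physical dimensions. It assumes the box is given as four numbers, [llx, lly, urx, ury], in points. The helper name box_size_mm and the US Letter values are illustrative and are not part of pikepdf.

```python
# Hypothetical helper, not a pikepdf API: convert a PDF box to millimetres.
PT_PER_INCH = 72      # 1 pt = 1/72 inch
MM_PER_INCH = 25.4


def box_size_mm(box):
    """Return (width, height) in mm for a box [llx, lly, urx, ury] in points."""
    llx, lly, urx, ury = (float(v) for v in box)
    # Corners may be listed in either order, so use absolute differences.
    width_pt = abs(urx - llx)
    height_pt = abs(ury - lly)
    return (width_pt / PT_PER_INCH * MM_PER_INCH,
            height_pt / PT_PER_INCH * MM_PER_INCH)


# Assumed example values: a US Letter page, 612 x 792 pt.
width_mm, height_mm = box_size_mm([0, 0, 612, 792])
print(f"{width_mm:.1f} mm x {height_mm:.1f} mm")
```

The same arithmetic applies to any of the page boxes, since they all use the same coordinate units.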
You can access the page dictionary data structure directly, but it’s fairly complicated. There are a number of rules, optional values and implied values. It’s easier to use page helpers, which ensure that the page is modified in a semantically correct manner.
Page helpers
pikepdf provides a helper class, pikepdf.Page, which offers higher-level functions for manipulating pages than the standard page dictionary used in the previous examples.
Currently pikepdf does not automatically return helper classes. You must initialize them. In a future release, it will return them automatically.
In [5]: from pikepdf import Pdf, Page

In [6]: page = Page(pageobj1)

In [7]: page.trimbox
One advantage of page helpers is that they resolve implicit information. For example, page.trimbox will return an appropriate trim box for this page, which in this case is equal to the media box.
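The kind of fallback a helper resolves can be sketched with plain dictionaries. This is a simplified model, not pikepdf's implementation: it assumes the rule from the PDF specification that a missing /TrimBox defaults to the /CropBox, which in turn defaults to the /MediaBox, and it ignores boxes inherited from parent nodes. The function name effective_trimbox and the box values are hypothetical.

```python
def effective_trimbox(page_dict):
    """Return the first box found among /TrimBox, /CropBox, /MediaBox."""
    for key in ("/TrimBox", "/CropBox", "/MediaBox"):
        if key in page_dict:
            return page_dict[key]
    raise KeyError("page has no /MediaBox")


# A page that defines only /MediaBox (illustrative values).
plain_page = {"/MediaBox": [0, 0, 612, 792]}
# A page that also defines an explicit /TrimBox.
trimmed_page = {"/MediaBox": [0, 0, 612, 792], "/TrimBox": [9, 9, 603, 783]}

print(effective_trimbox(plain_page))    # [0, 0, 612, 792]
print(effective_trimbox(trimmed_page))  # [9, 9, 603, 783]
```

Reading the raw dictionary key directly would fail on the first page, because it has no /TrimBox entry. That is the case the helper is meant to handle.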