Module Pdf

module Pdf: sig .. end

Representing PDF Files in Memory

PDF Objects

type toget 
type stream = 
| Got of Pdfio.bytes
| ToGet of toget

A stream is either in memory, or at a position and of a length in a Pdfio.input.

type pdfobject = 
| Null
| Boolean of bool
| Integer of int
| Real of float
| String of string
| Name of string
| Array of pdfobject list
| Dictionary of (string * pdfobject) list
| Stream of (pdfobject * stream) Stdlib.ref
| Indirect of int

PDF objects. An object is a tree-like structure built from the constructors above. Indirect references link one object to another by object number, so a PDF file is essentially a directed graph of objects.
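To make the tree shape concrete, here is a minimal sketch that builds a page-like dictionary and walks it. It assumes CamlPDF is linked (for instance through the findlib package camlpdf) and that names and dictionary keys are written with their leading slash, as the /gs0 example further down suggests. The helper count_indirect is hypothetical and not part of the library.

```ocaml
(* A page-like dictionary: a tree of pdfobject values. *)
let page_dict =
  Pdf.Dictionary
    [ ("/Type", Pdf.Name "/Page");
      ("/MediaBox",
       Pdf.Array [Pdf.Integer 0; Pdf.Integer 0; Pdf.Real 595.; Pdf.Real 842.]);
      ("/Parent", Pdf.Indirect 2) ]

(* Hypothetical helper: count the Indirect references inside one object tree.
   For a Stream, only its dictionary part is inspected. *)
let rec count_indirect = function
  | Pdf.Indirect _ -> 1
  | Pdf.Array elts ->
      List.fold_left (fun n o -> n + count_indirect o) 0 elts
  | Pdf.Dictionary entries ->
      List.fold_left (fun n (_, o) -> n + count_indirect o) 0 entries
  | Pdf.Stream {contents = (dict, _)} -> count_indirect dict
  | _ -> 0

let links = count_indirect page_dict
```

Each Indirect found this way is an edge to another numbered object, which is what turns the individual trees into a graph.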

The Object map

You should not expect to manipulate these types and functions directly.

type objectdata = 
| Parsed of pdfobject
| ParsedAlreadyDecrypted of pdfobject
| ToParse
| ToParseFromObjectStream of (int, int list) Stdlib.Hashtbl.t * int * int
* (int -> int list -> (int * (objectdata Stdlib.ref * int)) list)

This type represents a possibly-parsed, possibly-decrypted, possibly-read-from-an-object-stream object.

type pdfobjmap_key = int 
type pdfobjmap = (pdfobjmap_key, objectdata Stdlib.ref * int) Stdlib.Hashtbl.t 

The object map maps object numbers (pdfobjmap_key) to a reference to the object data together with the generation number.

val pdfobjmap_empty : unit -> pdfobjmap

Make an empty object map

val pdfobjmap_find : pdfobjmap_key -> pdfobjmap -> objectdata Stdlib.ref * int

Find an object in the object map

type pdfobjects = {
   mutable maxobjnum : int;
   mutable parse : (pdfobjmap_key -> pdfobject) option;
   mutable pdfobjects : pdfobjmap;
   mutable object_stream_ids : (int, int) Stdlib.Hashtbl.t;
}

The objects. Again, you won't normally manipulate this directly. maxobjnum is the biggest object number seen yet. parse is a function to parse a non-object-stream object given its object number. pdfobjects is the object map itself. object_stream_ids is a hash table of (object number, was-stored-in-object-stream-number) pairs, which is used to reconstruct stream objects when preserving them upon write.

The PDF document

type saved_encryption = {
   from_get_encryption_values : Pdfcryptprimitives.encryption * string * string * int32 * string *
string option * string option;
   encrypt_metadata : bool;
   perms : string;
}
type deferred_encryption = {
   crypt_type : Pdfcryptprimitives.encryption;
   file_encryption_key : string option;
   obj : int;
   gen : int;
   key : int array;
   keylength : int;
   r : int;
}
type t = {
   mutable major : int;
   mutable minor : int;
   mutable root : int;
   mutable objects : pdfobjects;
   mutable trailerdict : pdfobject;
   mutable was_linearized : bool;
   mutable saved_encryption : saved_encryption option;
}

A PDF document: major and minor version numbers, the object number of the root, the objects (field objects), and the trailer dictionary as a Dictionary pdfobject.

val empty : unit -> t

The empty document (PDF 1.0, no objects, no root, empty trailer dictionary). Note this is not a well-formed PDF.

Exceptions and errors

exception PDFError of string

This exception is raised when some malformation is found in a PDF. That covers quite a wide range of circumstances, and it may be raised from many functions.

val input_pdferror : Pdfio.input -> string -> string

This function, given a Pdfio.input and an ancillary string, builds an error string which includes the source of the Pdfio.input (filename, string, bytes etc.) so we can trace what it was originally built from.

Useful utilities

val getstream : pdfobject -> unit

Get a stream from disc if it hasn't already been got. The input is a Stream pdfobject.

val getnum : pdfobject -> float

Return a float from either a Real or an Integer.
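As a small illustration, the sketch below converts a mixed list of numeric objects to floats. It assumes CamlPDF is linked and that getnum has the one-argument signature shown above. The name to_floats is hypothetical.

```ocaml
(* Hypothetical helper: normalise a list of numeric pdfobjects to floats. *)
let to_floats (objs : Pdf.pdfobject list) : float list =
  List.map Pdf.getnum objs

(* PDF producers often mix Integer and Real in the same array. *)
let xs = to_floats [Pdf.Integer 3; Pdf.Real 2.5]
```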

val lookup_obj : t -> int -> pdfobject

Look up an object in a document, parsing it if required. Raises Not_found if the object does not exist.

val lookup_fail : string -> t -> string -> pdfobject -> pdfobject

lookup_fail errtext doc key dict looks up a key in a PDF dictionary or the dictionary of a PDF stream. Fails with PDFError errtext if the key is not found. Follows indirect object links.

val lookup_exception : exn -> t -> string -> pdfobject -> pdfobject

Same as lookup_fail, but raises the given exception instead of PDFError.

val lookup_direct : t -> string -> pdfobject -> pdfobject option

lookup_direct doc key dict looks up the key returning an option type.
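The two lookup styles can be compared side by side. This is a sketch that assumes CamlPDF is linked and that keys carry their leading slash. The keys /Size, /Kind and /Absent and the error text are made up for the example.

```ocaml
let pdf = Pdf.empty ()

(* Store a number as its own object so the dictionary can point at it. *)
let size_num = Pdf.addobj pdf (Pdf.Integer 12)

let dict =
  Pdf.Dictionary
    [ ("/Size", Pdf.Indirect size_num);
      ("/Kind", Pdf.Name "/Demo") ]

(* lookup_fail follows the indirect link and returns the target object. *)
let size = Pdf.lookup_fail "no /Size entry" pdf "/Size" dict

(* lookup_direct reports a missing key as None instead of raising. *)
let absent = Pdf.lookup_direct pdf "/Absent" dict

(* lookup_fail raises PDFError with the supplied text when the key is missing. *)
let message =
  try
    ignore (Pdf.lookup_fail "no /Absent entry" pdf "/Absent" dict);
    None
  with Pdf.PDFError text -> Some text
```

The option-returning form suits optional entries, while lookup_fail suits entries whose absence means the file is malformed.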

val indirect_number : t -> string -> pdfobject -> int option

Return the object number of an indirect dictionary object, if it is indirect.

val lookup_direct_orelse : t -> string -> string -> pdfobject -> pdfobject option

Same as lookup_direct, but allow a second, alternative key.

val remove_dict_entry : pdfobject -> string -> pdfobject

Remove a dictionary entry, if it exists.

val replace_dict_entry : pdfobject -> string -> pdfobject -> pdfobject

replace_dict_entry dict key value replaces a dictionary entry, raising Not_found if it's not there.

val add_dict_entry : pdfobject -> string -> pdfobject -> pdfobject

add_dict_entry dict key value adds a dictionary entry, replacing if already there.
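The three dictionary-editing functions return a new pdfobject, as their signatures show, so calls can be chained. The sketch below assumes CamlPDF is linked. The keys /A, /B and /Missing are invented for the example.

```ocaml
let pdf = Pdf.empty ()

let d0 = Pdf.Dictionary [("/A", Pdf.Integer 1)]

(* Adding a new key, then adding an existing key (which replaces its value). *)
let d1 = Pdf.add_dict_entry d0 "/B" (Pdf.Integer 2)
let d2 = Pdf.add_dict_entry d1 "/A" (Pdf.Integer 10)

(* Removing a key that is present. *)
let d3 = Pdf.remove_dict_entry d2 "/B"

(* replace_dict_entry is documented to raise Not_found for a missing key. *)
let replace_or_keep dict key value =
  try Pdf.replace_dict_entry dict key value with Not_found -> dict

let d4 = replace_or_keep d3 "/A" (Pdf.Integer 11)

let a_after_add = Pdf.lookup_direct pdf "/A" d2
let b_after_remove = Pdf.lookup_direct pdf "/B" d3
let a_after_replace = Pdf.lookup_direct pdf "/A" d4
```

replace_or_keep is a hypothetical wrapper. Whether to ignore a missing key or to fall back to add_dict_entry depends on the caller.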

val direct : t -> pdfobject -> pdfobject

Make a PDF object direct -- that is, follow any indirect links.
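A short sketch of what following a link looks like, assuming CamlPDF is linked. The object contents are arbitrary.

```ocaml
let pdf = Pdf.empty ()

(* Put a string in the object map and refer to it by number. *)
let n = Pdf.addobj pdf (Pdf.String "hello")

(* An Indirect is resolved to the object it points at. *)
let resolved = Pdf.direct pdf (Pdf.Indirect n)

(* An object that is already direct comes back as it was. *)
let unchanged = Pdf.direct pdf (Pdf.Integer 5)
```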

val objcard : t -> int

Return the size of the object map.

val removeobj : t -> int -> unit

Remove the given object.

val addobj : t -> pdfobject -> int

Add an object. Returns the number chosen.

val addobj_given_num : t -> int * pdfobject -> unit

Same as addobj, but pick a number ourselves.
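The object-map functions can be combined as in the sketch below, which assumes CamlPDF is linked. The catalog contents and the explicit number 100 are arbitrary choices for the example, and the result is still not a well-formed PDF.

```ocaml
let pdf = Pdf.empty ()

(* Let the library choose the object number. *)
let catalog_num =
  Pdf.addobj pdf (Pdf.Dictionary [("/Type", Pdf.Name "/Catalog")])

(* Record it as the document root (root is a mutable field of Pdf.t). *)
let () = pdf.Pdf.root <- catalog_num

(* Choose the object number ourselves. *)
let () = Pdf.addobj_given_num pdf (100, Pdf.String "scratch")

let count_before = Pdf.objcard pdf

(* Remove the scratch object again. *)
let () = Pdf.removeobj pdf 100

let count_after = Pdf.objcard pdf
let catalog = Pdf.lookup_obj pdf catalog_num
```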

Compound structures

val parse_rectangle : pdfobject -> float * float * float * float

Parse a PDF rectangle structure into min x, min y, max x, max y.
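For instance, a page's media box can be turned into a width and height. This sketch assumes CamlPDF is linked and that parse_rectangle has the one-argument signature shown above. The box values are simply an A4-sized example.

```ocaml
(* A rectangle is an Array of four numbers, which may mix Integer and Real. *)
let mediabox =
  Pdf.Array [Pdf.Integer 0; Pdf.Integer 0; Pdf.Real 595.5; Pdf.Integer 842]

let (minx, miny, maxx, maxy) = Pdf.parse_rectangle mediabox

(* Page size in the same units as the rectangle. *)
let width = maxx -. minx
let height = maxy -. miny
```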

val parse_matrix : t -> string -> pdfobject -> Pdftransform.transform_matrix

Calling parse_matrix pdf name dict parses a PDF matrix found under key name in dictionary dict into a Pdftransform.transform_matrix. If there is no matrix, the identity matrix is returned.

val make_matrix : Pdftransform.transform_matrix -> pdfobject

Build a matrix pdfobject.

val renumber_pdfs : t list -> t list

Make a number of PDF documents contain no mutual object numbers. They can then be merged etc. without clashes.

val unique_key : string -> pdfobject -> string

Given a prefix (e.g. gs) and a dictionary, return a name, starting with the prefix, which is not already a key in the dictionary (e.g. /gs0).
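A typical use is choosing a fresh resource name before adding an entry. The sketch below assumes CamlPDF is linked. The dictionary contents are placeholders, and the exact name returned is left to the library.

```ocaml
let pdf = Pdf.empty ()

(* A dictionary in which /gs0 and /gs1 are already taken. *)
let ext_g_state =
  Pdf.Dictionary [("/gs0", Pdf.Null); ("/gs1", Pdf.Null)]

(* The prefix is given without its slash, as in the description above. *)
let fresh = Pdf.unique_key "gs" ext_g_state

(* The fresh name can be added without overwriting an existing entry. *)
let extended = Pdf.add_dict_entry ext_g_state fresh (Pdf.Dictionary [])
```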


val objiter : (int -> pdfobject -> unit) -> t -> unit

Iterate over the objects in a document. The iterating function receives both object number and object from the object map.

val objiter_inorder : (int -> pdfobject -> unit) -> t -> unit

The same, but in object number order.

val objiter_gen : (int -> int -> pdfobject -> unit) -> t -> unit

Iterate over the objects in a document. The iterating function receives object number, generation number and object from the object map.

val objselfmap : (pdfobject -> pdfobject) -> t -> unit

Map over all pdf objects in a document. Does not include trailer dictionary.

val iter_stream : (pdfobject -> unit) -> t -> unit

Iterate over just the stream objects in a document.
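The iterators above all take a callback and return unit, so results are usually gathered in a reference. A minimal sketch, assuming CamlPDF is linked; the objects added are arbitrary.

```ocaml
let pdf = Pdf.empty ()
let _ = Pdf.addobj pdf (Pdf.Integer 1)
let _ = Pdf.addobj pdf (Pdf.Name "/X")
let _ = Pdf.addobj pdf (Pdf.Array [Pdf.Integer 2; Pdf.Integer 3])

(* Count objects of one kind with objiter (no ordering guarantee is assumed). *)
let arrays = ref 0
let () =
  Pdf.objiter
    (fun _ obj -> match obj with Pdf.Array _ -> incr arrays | _ -> ())
    pdf

(* Collect object numbers with objiter_inorder, then restore visiting order. *)
let seen = ref []
let () = Pdf.objiter_inorder (fun num _ -> seen := num :: !seen) pdf
let in_order = List.rev !seen
```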

Garbage collection

val remove_unreferenced : t -> unit

Garbage-collect a pdf document.


These functions were previously undocumented. They are documented here for now, and in the future will be categorised more sensibly.

val is_whitespace : char -> bool

True if a character is PDF whitespace.

val is_not_whitespace : char -> bool

True if a character is not PDF whitespace.

val is_delimiter : char -> bool

True if a character is a PDF delimiter.

val page_reference_numbers : t -> int list

List, in order, the page reference numbers of a PDF's page tree.

val objnumbers : t -> int list

List the object numbers in a PDF.

val recurse_dict : (pdfobject -> pdfobject) ->
(string * pdfobject) list -> pdfobject

Use the given function on each element of a PDF dictionary.

val recurse_array : (pdfobject -> pdfobject) -> pdfobject list -> pdfobject

Similarly for an Array. The function is applied to each element.

val changes : t -> (int, int) Stdlib.Hashtbl.t

Calculate the changes required to renumber a PDF's objects 1..n.

val renumber : (int, int) Stdlib.Hashtbl.t -> t -> t

Perform the given renumberings on a PDF.

val renumber_object_parsed : t -> (int, int) Stdlib.Hashtbl.t -> pdfobject -> pdfobject

Renumber an object given a change table.

val bigarray_of_stream : pdfobject -> Pdfio.bytes

Fetch a stream, if necessary, and return its contents (with no processing).

val objects_of_list : (int -> pdfobject) option ->
(int * (objectdata Stdlib.ref * int)) list -> pdfobjects

Make an objects entry from a parser and a list of (number, object) pairs.

val objects_referenced : string list ->
(string * pdfobject) list -> t -> pdfobject -> int list

Calling objects_referenced no_follow_entries no_follow_contains pdf pdfobject finds the objects reachable from the given object. Dictionary keys in no_follow_entries are not explored. Dictionaries containing entries in no_follow_contains are not explored.

val generate_id : t -> string -> (unit -> float) -> pdfobject

Generate an ID for a PDF document given its prospective file name (and using the current date and time). If the file name is blank, the ID is still likely to be unique, being based on date and time only. If environment variable CAMLPDF_REPRODUCIBLE_IDS=true is set, the ID will instead be set to a standard value.

val catalog_of_pdf : t -> pdfobject

Return the document catalog.

val find_indirect : string -> pdfobject -> int option

Find the indirect reference given by the value associated with a key in a dictionary.

val nametree_lookup : t -> pdfobject -> pdfobject -> pdfobject option

Calling nametree_lookup pdf k dict looks up the name k in the document's name tree dict.

val contents_of_nametree : t -> pdfobject -> (pdfobject * pdfobject) list

Return an ordered list of the key-value pairs in a given name tree.

val deep_copy : t -> t

Copy a PDF data structure so that nothing is shared with the original.

val change_id : t -> string -> unit

Change the /ID string in a PDF's trailer dictionary.