Read this article that is the first of a series that will teach you about the challenge of processing the PDF file format and how the PdfToText class can be used to extract text and images from it.
By Christian Vigh (wuthering-bytes)

Known Issues

The following is a list of known issues. I'm still working on them, and fixes will normally be implemented in future versions:

- RTL languages, such as Arabic, Hebrew or Syriac, are not correctly processed: they are extracted from left to right.
- Only JPEG images are currently supported.
- There is currently no support for password-protected files. Note that I do not intend to develop a password cracker, just a feature that allows you to extract text contents from a password-encrypted PDF file if you supply the correct password.
- Digitally signed files are not currently supported.
- Text contents may sometimes show badly translated characters. The reason why will be explained in the next series of articles.
- The extracted text contents may not exactly reflect text positioning on the page. This is especially true of PDF files that contain data in tabular format. Again, this issue will be fixed in a future release and explained in one of the future articles about this class.
- CID fonts (Adobe internal fonts, mainly used by eastern languages and developed before the Unicode effort took place) are not yet supported. This will be the subject of another article.

It can be any of the constants defined by the gd library regarding image formats. Note that the association between the constant and corresponding file suffix is automatically handled.
An array of objects inheriting from the PdfImage class. Currently, only the PdfJpegIMage class is implemented. Currently, images stored in proprietary Adobe format are not processed and will not appear in this array.

Number of images found in the supplied PDF file. This number will only take into account the images whose format is recognized by the PdfToText class.

This property is set to true if the PDF file is encrypted through some kind of password protection scheme.

Specifies a maximum execution time in seconds for processing a single file. This allows the script to gracefully handle the error instead of PHP itself. Positive values are indicated in seconds.

Maximum number of images to be extracted.

This static property is the same as MaxExecutionTime, except that it works globally. If you have to process x files, then it will ensure that the global execution time does not exceed the value of this property.
Maximum number of pages to be selected. The default value is 0, meaning that all pages will be selected for output. A value of 1 will extract the contents of the first page only, which can be useful if your PDF file is large and you are only interested in the contents of the first page.
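The page-limit rule can be sketched in a few lines. This is a language-neutral illustration written in Python, not code from the PdfToText class: the function name `select_pages` and its parameters are hypothetical, and it only mirrors the behaviour described here (0 for all pages, a positive count for the first pages) together with the negative case covered in the next paragraph.

```python
def select_pages(page_numbers, max_selected_pages=0):
    """Return the page numbers kept for a given page limit.

    0 keeps every page, a positive n keeps the first n pages,
    and a negative n keeps the last |n| pages.
    """
    if max_selected_pages == 0:
        return list(page_numbers)
    if max_selected_pages > 0:
        return list(page_numbers[:max_selected_pages])
    # Negative limit: count from the end of the document.
    return list(page_numbers[max_selected_pages:])


pages = [1, 2, 3, 4, 5]
print(select_pages(pages, 0))    # all pages
print(select_pages(pages, 1))    # first page only
print(select_pages(pages, -2))   # last two pages
```

A limit whose absolute value exceeds the page count simply returns every page in this sketch; how the class itself behaves in that case is not stated in this article.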
When this number is negative, selection starts from the end of the file: -1 means "extract the last page", -2 means "extract the last two pages", and so on.

For certain ranges of values, when displayed on a graphical device, these consecutive characters appear to be separated by one space or more. Of course, when generating ASCII output, we would like to have some equivalent of such spacing. This is what the MinSpaceWidth property is meant for: insert an ASCII space in the generated output whenever the offset found exceeds MinSpaceWidth text units.
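As a rough illustration of the MinSpaceWidth idea, the sketch below joins text chunks that are interleaved with numeric offsets and emits a space only when an offset exceeds a threshold. It is written in Python for brevity and does not reproduce the class's internals: the function name `join_chunks`, the input layout (strings alternating with positive offsets) and the sample values are all assumptions made for the example.

```python
def join_chunks(chunks, min_space_width):
    """Join text chunks, turning large offsets into a single space.

    `chunks` mixes strings with numeric offsets (assumed here to be
    positive gap widths expressed in text units).
    """
    out = []
    for item in chunks:
        if isinstance(item, str):
            out.append(item)
        elif item > min_space_width:
            # The gap is wide enough to look like a space on screen.
            out.append(" ")
        # Smaller offsets are treated as kerning and ignored.
    return "".join(out)


sample = ["Hel", 20, "lo", 300, "wor", 15, "ld"]
print(join_chunks(sample, 200))  # prints "Hello world"
```

Lowering the threshold makes the output more fragmented: with a threshold of 10, every offset in the sample becomes a space.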
A string containing the last document modification date, in UTC format.

Note that the elements will not be mapped in the output exactly as they appear with Acrobat Reader: elements physically disjoint on the x-axis will be separated by a space by default. The BlockSeparator property can be used to modify this separator. The following text, for example:

Company1    Company2
address1    address2
city1       city2

will be rendered as:

For example, the following text:

Associative array containing individual page contents. The array key is the page number, starting from 1.

String to be used when building the Text property to separate individual pages. The default value is a newline.

A string to be used for separating blocks when a negative offset less than thousands of characters is specified between two sequences of characters specified as an array notation. This trick is often used when a PDF file contains tabular data.

A string containing the whole text extracted from the underlying PDF file. Note that pages are separated with a form feed.

When a Unicode character cannot be correctly recognized, the Utf8Placeholder property will be used as a substitution. The string can contain format specifiers recognized by the sprintf function. The parameter passed to sprintf is the Unicode codepoint that could not be recognized (an integer value).

It can be a combination of any of the following flags:
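To illustrate the Utf8Placeholder substitution described above, here is a small sketch of how a placeholder template and an unrecognized codepoint could be combined. It uses Python's `%` operator as a stand-in for PHP's sprintf; the function name `render_placeholder` and the sample templates are hypothetical, and none of this is taken from the class source.

```python
def render_placeholder(template, codepoint):
    """Format `template` with the unrecognized codepoint (an integer).

    PHP's sprintf ignores surplus arguments, whereas Python raises
    TypeError when the template has no specifier, so that case is
    handled explicitly to mimic the PHP behaviour.
    """
    try:
        return template % codepoint
    except TypeError:
        # Template without any format specifier: use it verbatim.
        return template


snowman = 0x2603  # sample codepoint, chosen arbitrarily
print(render_placeholder("[U+%04X]", snowman))  # prints "[U+2603]"
print(render_placeholder("%d", snowman))        # prints "9731"
print(render_placeholder("?", snowman))         # prints "?"
```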
Current version of the PdfToText class, as a string containing major, minor and release version numbers. For example : "1.

This exception is thrown if an error occurs when decoding a PDF object. Normally, most of these exceptions are thrown only if debug mode is activated.

Thrown when an error is detected while parsing a template file for retrieving form data, or when retrieving form data.

Extracting form data is fairly simple: use the GetFormData method and it will return an object containing all the field values contained in your PDF file, whether they have been filled or not.
Both methods return a new object inheriting from the PdfToTextFormData class, which mainly contains helper functions that are of no interest to the caller. The derived class returned by the GetFormData method has a set of properties that give you access to the form field contents.

The examples given in the following sections are based on the file "sample. It has been taken from a very common form used in the US, located here :

You can open file sample. This is why you may want to spend some time designing a template XML file that maps PDF field names to human-readable ones.

Using an XML template does not require many changes to your existing code; you just need to supply the path of your XML template when calling the GetFormData method:

All of the above have been defined in the template file, and the parent class, PdfToTextFormData, is able to handle any modifications made to any of the properties involved in a grouped property.

String fields within a form are basically specified with the following XML field construct:

They basically contain the same information as string fields, except that the type attribute is set to choice.

Grouped fields allow you to create new properties, coming from the concatenation of existing fields. A typical definition looks like this:

The above example creates an SSN property, which is the result of the concatenation of the specified fields. Note that modifying the value of a property referenced by a grouped field will modify the corresponding grouped field value.
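The relationship between a grouped field and its underlying fields can be pictured with the following sketch. It is a conceptual model in Python, not the PdfToTextFormData implementation: the class name `GroupedField`, the field names `ssn_1` to `ssn_3`, and the "-" separator are all assumptions made for the example. It shows reads flowing from the parts to the group, and also the reverse direction, where writing the group updates the parts.

```python
class GroupedField:
    """Hypothetical grouped property built from several form fields."""

    def __init__(self, fields, names, separator="-"):
        self.fields = fields          # dict shared with the rest of the form
        self.names = names            # ordered names of the member fields
        self.separator = separator    # assumed separator, for illustration

    @property
    def value(self):
        # Always computed from the member fields, so it never goes stale.
        return self.separator.join(self.fields[n] for n in self.names)

    @value.setter
    def value(self, text):
        parts = text.split(self.separator)
        if len(parts) != len(self.names):
            raise ValueError("expected %d parts" % len(self.names))
        for name, part in zip(self.names, parts):
            self.fields[name] = part


form = {"ssn_1": "123", "ssn_2": "45", "ssn_3": "6789"}
ssn = GroupedField(form, ["ssn_1", "ssn_2", "ssn_3"])
print(ssn.value)            # prints "123-45-6789"

form["ssn_2"] = "00"        # changing a member field...
print(ssn.value)            # prints "123-00-6789"

ssn.value = "987-65-4321"   # ...and changing the grouped value
print(form["ssn_1"])        # prints "987"
```

Computing the grouped value on every read, rather than caching it, is what keeps the two views consistent in this sketch.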
Similarly, modifying the grouped field value will modify its associated properties.

Sometimes, it's easier to tell the PdfToText class which area(s) of text you want to retrieve from which page(s), rather than having to struggle with regular expressions to isolate the information you want. This is especially true when you want to retrieve data from tabular reports. Captures are a solution for such needs; they allow you to define shapes of the following types:

Capturing areas of your PDF document will require a few preliminary steps that involve some extra work. This is why you will have to choose between using captures or using regular expressions on the extracted text:

A PDF file uses a coordinate system whose values are more or less expressed in "relative units". The point at coordinates (0,0) is located at the bottom-left corner of the page; the point at coordinates (x,y), where "x" is the page width and "y" the page height, is located at the top-right corner of the page.
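Because the origin sits at the bottom-left corner, measurements taken from the top of a page (as most screen tools report them) have to be flipped before they can be compared with PDF coordinates. The sketch below shows that conversion in Python. The function name `rect_from_top_left`, the convention of anchoring the rectangle at its bottom-left corner, and the 612 x 792 page size (US Letter in points) are assumptions for the example; the capture definition format used by the class may anchor rectangles differently.

```python
def rect_from_top_left(left, top, width, height, page_height):
    """Convert a rectangle measured from the top-left page corner.

    Returns (x, y, width, height) where (x, y) is the rectangle's
    bottom-left corner in a bottom-left-origin coordinate system.
    """
    y_bottom = page_height - top - height
    return (left, y_bottom, width, height)


# A 200 x 30 box located 100 units from the left and 50 from the top
# of a 612 x 792 page.
print(rect_from_top_left(100, 50, 200, 30, page_height=792))
# prints (100, 712, 200, 30)
```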
That's nice, but how do you find the coordinates x, y, width and height of the rectangle that contains the text you want to capture? But if you don't have such a tool, you're stuck. This may be the only occasion you will have to use this option.

The following example script explains how the file sample-report. It gives the page number, its width and its height in graphics coordinates. The second kind of information that appears between square brackets gives size information regarding the block of text immediately following it:

An example is given below, which will capture some contents of the PDF file sample-report. For example:

All the coordinates, widths and heights listed in this definition have been taken from the information contained in file sample-report.

Captures are processed independently of text extraction; the only two things you have to do are:

Call either the SetCaptures or the SetCapturesFromString method to load the capture definitions, as in the following example:

Each tag that has been defined in the XML definition file sample-report. Every capture name specified in the capture definitions file can be accessed as a property name in the object returned by the GetCaptures method:

For Rectangle captures, the property will be an array, since the same capture can be defined on different pages at different locations. To retrieve the Title located on the first page of our PDF file, we will have to write:

Rectangle captures are accessible by their page number. There will be a capture for each page of the document, even if not present in the list of applicable pages.
For lines captures, the situation is a little bit different: the point of capturing report lines together with their column data is to be able to process them all at once. This is why they are grouped together in a single collection, which you can display or process with this kind of loop: