Extract from PDF
This action extracts text and images from a PDF document contained as binary data in a selected binary variable.
Typically, the PDF document has been downloaded into the variable using an Extract Target step.
The output from the "Extract from PDF" action is an HTML page
containing the text and images extracted from the PDF document.
In subsequent steps, the desired information can then be extracted from the page, in the same way as for other HTML pages.
Note that PDF documents do not contain structure information such as tables or paragraphs, only the positions of
text and graphics, which may or may not be arranged to look like tables or paragraphs. This can make
it difficult to extract the desired information from PDF documents. However, the Extract from PDF step applies
some heuristics to group the text into HTML paragraphs based on the available position information.
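To make the idea of position-based grouping concrete, the following is a minimal sketch of one possible heuristic, not the one the step actually uses. It starts a new paragraph whenever the vertical distance to the previous text fragment exceeds a multiple of the line height. The Fragment class, the group_into_paragraphs function and the max_gap_factor parameter are hypothetical names introduced only for this illustration.

```python
import html
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position (hypothetical model of PDF text)."""
    text: str
    x: float       # left edge
    y: float       # top edge, increasing down the page
    height: float  # line height of the fragment


def group_into_paragraphs(fragments, max_gap_factor=1.5):
    """Group fragments into HTML paragraphs using vertical gaps only.

    Assumption: a vertical step larger than max_gap_factor times the
    previous fragment's height marks a paragraph break.
    """
    # Read top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    paragraphs = []
    current = []
    for frag in ordered:
        if current:
            prev = current[-1]
            if frag.y - prev.y > max_gap_factor * prev.height:
                paragraphs.append(current)
                current = []
        current.append(frag)
    if current:
        paragraphs.append(current)
    # Escape the text so it is safe to embed in HTML.
    return [
        "<p>" + " ".join(html.escape(f.text) for f in p) + "</p>"
        for p in paragraphs
    ]


fragments = [
    Fragment("Hello", 0, 0, 10),
    Fragment("world", 40, 0, 10),
    Fragment("next line", 0, 12, 10),
    Fragment("New para", 0, 40, 10),
]
paragraphs = group_into_paragraphs(fragments)
```

A heuristic of this kind only guesses at the structure, which is why the resulting paragraphs may not always match the visual layout of the PDF document.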
The "Extract from PDF" action can be configured using the following properties:
- PDF Variable:
The binary variable containing the PDF document as binary data.
- Include Images:
Specifies whether embedded images should be extracted. Note that not all images and graphics can be extracted
from PDF documents; it depends on the way they have originally been embedded in the document.
- Include Positioning:
Specifies whether the positions of the texts should be extracted. The positions may
be useful to derive the structure of the document.
- Include Formatting:
Specifies whether the formatting (font names, sizes, etc.) of the text should be extracted. Like the positions, the formatting
may be useful to derive the structure of the document.
- Merge Text:
By default, the converter that generates the HTML from the PDF merges text that is on the same line into one HTML element,
even if it is represented as separate pieces of text in the PDF document. Although this is often desirable, in some cases
it causes text that was originally far apart to be merged and appear right next to each other.
A typical case where it is desirable to turn this feature off is when the document contains more than one column. With the feature turned off,
the converter attempts to preserve the column structure.
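The effect of Merge Text on a two-column document can be illustrated with a small sketch. This is not the converter's actual algorithm or output format; the line_to_elements function, the (x, text) tuples and the use of span elements are assumptions made purely for illustration.

```python
def line_to_elements(fragments, merge=True):
    """Turn the text fragments of one line into HTML elements.

    fragments: list of (x, text) tuples that share the same line.
    merge=True  -> one element for the whole line (columns run together).
    merge=False -> one element per fragment (columns stay separate).
    """
    ordered = sorted(fragments)  # left to right by x position
    texts = [text for _, text in ordered]
    if merge:
        return ["<span>" + " ".join(texts) + "</span>"]
    return ["<span>" + text + "</span>" for text in texts]


# One line of a page with two columns: the fragments are far apart.
line = [(300, "Right column text"), (0, "Left column text")]

merged = line_to_elements(line, merge=True)
separate = line_to_elements(line, merge=False)
```

With merging on, the two column fragments end up in a single element and read as one sentence; with merging off, each fragment keeps its own element, so later steps can still tell the columns apart.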