Extract Text from PDF in Java

GroupDocs.Parser (part of Conholdate.Total for Java) is a powerful tool for extracting text, and it is easy to use in simple cases.

This article shows how to write code for the simplest scenario.

To extract text from a PDF, call the getText method:

TextReader getText();

This method returns an instance of the TextReader class that contains the extracted text.

The TextReader class extends java.io.Reader and adds the following members:

Member      Description
readLine    Reads a line of characters from the text reader and returns the data as a string.
readToEnd   Reads all characters from the current position to the end of the text reader and returns them as one string.
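
To illustrate how these two members behave, here is a minimal sketch that uses only java.io. SketchTextReader is a hypothetical stand-in written for this illustration, not the GroupDocs class. It assumes that readLine returns null at the end of the input (as java.io.BufferedReader does) and that readToEnd returns everything after the current position; the real TextReader may differ in such details.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for illustration only; this is NOT the GroupDocs TextReader class.
class SketchTextReader extends BufferedReader {
    SketchTextReader(Reader source) {
        super(source);
    }

    // readLine() is inherited from BufferedReader: it returns the next line
    // without its line terminator, or null at the end of the input.

    // Reads all characters from the current position to the end of the input.
    String readToEnd() throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = read(buffer, 0, buffer.length)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }
}

class SketchTextReaderDemo {
    public static void main(String[] args) throws IOException {
        String sample = "Title\nBody line 1\nBody line 2";
        try (SketchTextReader reader = new SketchTextReader(new StringReader(sample))) {
            // Consume the first line, then everything that follows it
            String firstLine = reader.readLine();
            String remainder = reader.readToEnd();
            System.out.println("First line: " + firstLine);
            System.out.println("Remainder: " + remainder);
        }
    }
}
```

The point of the sketch is that the two members share one position: a call to readLine advances the reader, so a later readToEnd returns only the text that has not been read yet.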

Here are the steps to extract text from a document:

  • Instantiate a Parser object for the source document;
  • Call the getText method to obtain a TextReader object;
  • Check that the reader isn’t null (a null reader means text extraction isn’t supported for the document);
  • Read the text from the reader.

The following example shows how to extract text from a document:

import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

// Create an instance of the Parser class for the sample PDF
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Extract the text into a reader
    try (TextReader reader = parser.getText()) {
        // Print the text from the document
        // If text extraction isn't supported, the reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}
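
Because TextReader extends java.io.Reader, the extracted text can also be passed to code written against the plain Reader type. The sketch below is one possible way to wrap the null check in a reusable helper. The class ReaderTextSketch and the method readAllOrDefault are hypothetical names introduced here, and a StringReader stands in for the reader that parser.getText() would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of GroupDocs.Parser.
class ReaderTextSketch {
    // Returns all remaining text from the reader,
    // or the fallback message when the reader is null.
    static String readAllOrDefault(Reader reader, String fallback) throws IOException {
        if (reader == null) {
            return fallback;
        }
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String fallback = "Text extraction isn't supported";

        // StringReader is an assumption standing in for the result of parser.getText()
        try (Reader reader = new StringReader("Extracted text")) {
            System.out.println(readAllOrDefault(reader, fallback));
        }

        // A null reader models a document without text extraction support
        System.out.println(readAllOrDefault(null, fallback));
    }
}
```

Accepting a Reader rather than a TextReader keeps the helper testable without a document on disk, at the cost of not using readToEnd directly.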