Extract Text from PDF
Extract Text from PDF in Java
GroupDocs.Parser (part of Conholdate.Total for Java) is a powerful tool for extracting text, and it is easy to use in simple cases.
This article shows how to write code for the simplest scenario.
To extract text from a PDF, call the getText method:
TextReader getText();
This method returns an instance of the TextReader class that contains the extracted text.
The TextReader class extends java.io.Reader and adds the following members:
Member | Description |
---|---|
readLine | Reads a line of characters from the text reader and returns the data as a string. |
readToEnd | Reads all characters from the current position to the end of the text reader and returns them as one string. |
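To make the two members in the table concrete, here is a minimal sketch of a reader with the same shape, built only on the Java standard library. SketchReader is a hypothetical stand-in written for illustration. It is not the GroupDocs.Parser TextReader, and details such as returning null from readLine at the end of the input are assumptions borrowed from java.io.BufferedReader.

```java
import java.io.Reader;

class TextReaderSketch {
    // Hypothetical stand-in for illustration; not the GroupDocs.Parser TextReader.
    static class SketchReader extends Reader {
        private final String text;
        private int pos = 0;

        SketchReader(String text) {
            this.text = text;
        }

        // Standard Reader contract: copy up to len chars, return -1 at end of input.
        @Override
        public int read(char[] buf, int off, int len) {
            if (len == 0) {
                return 0;
            }
            if (pos >= text.length()) {
                return -1;
            }
            int n = Math.min(len, text.length() - pos);
            text.getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }

        @Override
        public void close() {
            // Nothing to release for an in-memory string.
        }

        // Returns the next line without its terminator.
        // Assumption: null signals the end of the input, as in BufferedReader.
        String readLine() {
            if (pos >= text.length()) {
                return null;
            }
            int end = pos;
            while (end < text.length() && text.charAt(end) != '\n' && text.charAt(end) != '\r') {
                end++;
            }
            String line = text.substring(pos, end);
            // Skip the terminator: "\r\n", "\n", or "\r".
            if (end < text.length()) {
                boolean crlf = text.charAt(end) == '\r'
                        && end + 1 < text.length()
                        && text.charAt(end + 1) == '\n';
                end += crlf ? 2 : 1;
            }
            pos = end;
            return line;
        }

        // Returns everything from the current position to the end as one string.
        String readToEnd() {
            String rest = text.substring(pos);
            pos = text.length();
            return rest;
        }
    }

    public static void main(String[] args) {
        SketchReader reader = new SketchReader("Title\nFirst paragraph\nSecond paragraph");
        System.out.println(reader.readLine());   // consumes the first line only
        System.out.println(reader.readToEnd());  // returns the remaining two lines
    }
}
```

The point of the sketch is that both members work from the current position: calling readToEnd after readLine returns only the text that has not been read yet.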
Here are the steps to extract text from a document:
- Instantiate a Parser object for the document;
- Call the getText method to obtain a TextReader object;
- Check that the reader isn't null (a null reader means text extraction isn't supported for the document);
- Read the text from the reader.
The following example shows how to extract text from a document:
import com.groupdocs.parser.Parser;
import com.groupdocs.parser.data.TextReader;

// Create an instance of the Parser class
// (Constants.SamplePdf is expected to hold the path to a PDF file)
try (Parser parser = new Parser(Constants.SamplePdf)) {
    // Extract the text into a reader
    try (TextReader reader = parser.getText()) {
        // Print the text from the document
        // If text extraction isn't supported, the reader is null
        System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
    }
}
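Because TextReader extends java.io.Reader, the extracted text can also be handed to any code that accepts a Reader, not only read with readToEnd. The sketch below shows one such consumer that counts non-blank lines. ReaderLineStats and countNonBlankLines are hypothetical names, and a StringReader stands in for the reader that parser.getText() would return, so the example runs without the library. The null check described in the steps above would still be needed before passing a real reader in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderLineStats {
    // Counts the non-blank lines available from any java.io.Reader.
    // The reader is closed when counting finishes.
    static int countNonBlankLines(Reader source) {
        try (BufferedReader buffered = new BufferedReader(source)) {
            int count = 0;
            String line;
            while ((line = buffered.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            // Rethrow unchecked so callers don't have to declare IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the reader returned by parser.getText().
        Reader source = new StringReader("Heading\n\nFirst paragraph\nSecond paragraph\n");
        System.out.println(countNonBlankLines(source)); // prints 3
    }
}
```

Reading line by line this way avoids holding the whole document in one string, which can matter for large PDFs.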