Extract Images from PDF

Extract Images from PDF in Java

GroupDocs.Parser for Java (part of Conholdate.Total for Java) extracts images from PDF documents through the getImages method:

Iterable<PageImageArea> getImages();

This method returns a collection of PageImageArea objects:

Member                          Description
getPage                         The page that contains the image area.
getRectangle                    The rectangular area on the page that contains the image.
getFileType                     The format of the image.
getRotation                     The rotation angle of the image.
getImageStream                  Returns the image stream.
getImageStream(ImageOptions)    Returns the image stream in a different format.
save(String)                    Saves the image to a file.
save(String, ImageOptions)      Saves the image to a file in a different format.
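Once an image stream is available, it can be decoded with standard Java APIs, for example to read the pixel dimensions. The following is a minimal sketch that does not call GroupDocs.Parser at all: it assumes the stream returned by getImageStream is an ordinary java.io.InputStream and substitutes an in-memory PNG for it. The ImageStreamInfo class and its dimensions method are hypothetical names used only for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

public class ImageStreamInfo {
    // Decodes an image from a stream and returns {width, height} in pixels.
    // Assumption: the stream holds a format javax.imageio can decode
    // (for example PNG or JPEG; the JDK ships no WebP reader).
    public static int[] dimensions(InputStream stream) throws IOException {
        BufferedImage decoded = ImageIO.read(stream);
        if (decoded == null) {
            // ImageIO.read returns null when no registered reader accepts the data
            throw new IOException("No ImageIO reader available for this stream");
        }
        return new int[] { decoded.getWidth(), decoded.getHeight() };
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for image.getImageStream(): a 4x3 PNG built in memory
        BufferedImage sample = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ImageIO.write(sample, "png", buffer);

        try (InputStream stream = new ByteArrayInputStream(buffer.toByteArray())) {
            int[] size = dimensions(stream);
            System.out.println(size[0] + "x" + size[1]); // prints "4x3"
        }
    }
}
```

In real use, the in-memory stream would be replaced by the stream obtained from a PageImageArea object, closed with try-with-resources in the same way.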

The ImageOptions class defines the format into which the image is converted. The following image formats are supported:

  • Bmp
  • Gif
  • Jpeg
  • Png
  • WebP

Here are the steps to extract images from the whole document:

  • Instantiate a Parser object for the document;
  • Call the getImages method to obtain a collection of PageImageArea objects;
  • Check that the collection isn't null (a null result means image extraction isn't supported for the document);
  • Iterate over the collection and read each image's rectangle (position and size), type and contents.

The following example shows how to extract all images from the whole document:

// Create an instance of the Parser class
// (Constants.SampleImagesPdf holds the path to a sample PDF that contains images)
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
    // Extract images
    Iterable<PageImageArea> images = parser.getImages();
    // getImages returns null when image extraction isn't supported for the document
    if (images == null) {
        System.out.println("Images extraction isn't supported");
        return;
    }
    // Iterate over images
    for (PageImageArea image : images) {
        // Print the page index, rectangle and image type
        System.out.printf("Page: %d, R: %s, Type: %s%n", image.getPage().getIndex(), image.getRectangle(), image.getFileType());
    }
}
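The loop above only prints metadata. To write each image to disk with save(String), every image needs a distinct file name. The following is a minimal sketch of one possible naming scheme. The ImageFileNames class, the target method and the page-N-image-M pattern are hypothetical, and the extension is supplied by the caller rather than derived from getFileType.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImageFileNames {
    // Builds "<directory>/page-<pageIndex>-image-<imageNumber>.<extension>".
    // Hypothetical helper: the naming pattern is an illustration, not part of GroupDocs.Parser.
    public static Path target(Path directory, int pageIndex, int imageNumber, String extension) {
        String name = "page-" + pageIndex + "-image-" + imageNumber + "." + extension;
        return directory.resolve(name);
    }

    public static void main(String[] args) {
        Path out = target(Paths.get("extracted"), 0, 1, "png");
        System.out.println(out.getFileName()); // prints "page-0-image-1.png"
    }
}
```

Inside the loop, the resulting path could be passed as a string to image.save(String), with a counter incremented per image to supply the image number.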