How to OCR Images & Scanned PDFs using Java

In this article we explain how to use Java and server-based Optical Character Recognition (OCR) to convert image-based files such as TIFF, PNG and scanned PDFs into fully searchable and indexable PDF files. In addition to Document Conversion, Merging, Watermarking, Splitting and Securing of documents, the latest version of the Muhimbi PDF Converter Services also comes with a comprehensive OCR engine.

A .NET (C#) version of this article is available here.

For those not familiar with the product, the Muhimbi PDF Converter Services is a server-based SDK that allows software developers to convert and manipulate typical Office files, including MS-Word, Excel, PowerPoint, Visio, Publisher, AutoCAD, HTML, emails and InfoPath, to PDF and other formats. It exposes a robust, scalable but friendly Web Services interface that can be called from Java, PHP, Ruby and .NET based solutions. If you have any questions or need more information then please let us know.

Please read this post for a high-level overview of our new OCR facilities.

Sample code

The following sample illustrates how to use OCR to convert a file (preferably a scan) into a fully searchable PDF. In this example we use wsimport to generate web service proxy classes, but other web service frameworks are supported as well. We actually prefer Apache Axis2 (see this sample) as it generates cleaner and more flexible code.

The example described below assumes the following:

  1. The JDK has been installed and configured.
  2. The Conversion Service and all prerequisites have been installed in line with the Administration Guide.
  3. The Conversion Service is running in the default anonymous mode. This is not an absolute requirement, but it makes initial experimentation much easier.

As of version 7.1 this sample code is automatically installed alongside the product. The source code, including pre-generated proxy classes for the web service and a sample file, can be downloaded here. If you choose not to download the sample code then please carry out the steps described below.

The first step is to generate proxy classes for the web service by executing the following command:

wsimport http://localhost:41734/Muhimbi.DocumentConverter.WebService/?wsdl -d src -Xnocompile -p com.muhimbi.ws

Feel free to change the package name and destination directory to something more suitable for your organisation. If you are generating proxies from a remote system (a system that doesn’t run the Muhimbi Conversion Service) then please see this Knowledge Base Article.
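
Because wsimport reads the WSDL over HTTP, it can help to confirm that the endpoint responds before generating proxies. The following is a minimal sketch of such a check, not part of the product: the class name ServiceCheckSketch and the helpers wsdlUrl and isReachable are hypothetical, and treating an HTTP 200 response as "reachable" is an assumption. The host, port and path are taken from the wsimport command above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative sketch only: class and method names are assumptions.
class ServiceCheckSketch {

    // Builds the WSDL address used in the wsimport command for a given host and port.
    static String wsdlUrl(String host, int port) {
        return "http://" + host + ":" + port
                + "/Muhimbi.DocumentConverter.WebService/?wsdl";
    }

    // Returns true if an HTTP GET on the given http:// URL answers with status 200.
    // Malformed URLs, timeouts and connection errors all yield false.
    static boolean isReachable(String url, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            connection.setRequestMethod("GET");
            return connection.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException | IllegalArgumentException e) {
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Default address of a locally installed Conversion Service.
        String url = wsdlUrl("localhost", 41734);
        System.out.println(url + (isReachable(url, 5000)
                ? " is reachable" : " is NOT reachable"));
    }
}
```

When generating proxies from a remote system, pass that system's host name to wsdlUrl instead of localhost.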

Wsimport automatically generates the Java class names. Unfortunately some of the generated names are rather long and ugly so you may want to consider renaming some, particularly the Exception classes, to something friendlier. This, however, means that if you ever run wsimport again you will need to re-apply those changes. For more information have a look at the high level overview of the Object Model exposed by the web service.

Once the proxy classes have been created, add the sample code to your project. Run the code, making sure the file to OCR is specified on the command line.
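
The client-side plumbing around the web service call can be sketched as follows. This is a minimal illustration under stated assumptions rather than the shipped sample: the class name OcrInputSketch, the helpers fileExtension and outputNameFor, and the "_ocr.pdf" naming scheme are all hypothetical. The call into the proxy classes generated by wsimport is left as a comment, because the exact class and method names depend on your generated code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Illustrative sketch only: class and helper names are assumptions.
class OcrInputSketch {

    // Returns the lower-case extension without the dot, or "" if there is none.
    static String fileExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Derives an output name by replacing the extension with "_ocr.pdf"
    // (an assumed naming scheme, chosen to avoid overwriting a source PDF).
    static String outputNameFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = (dot <= 0) ? fileName : fileName.substring(0, dot);
        return base + "_ocr.pdf";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("Please specify a single file name on the command line.");
            return;
        }

        // Read the file to OCR into memory; the web service accepts the raw bytes.
        Path source = Paths.get(args[0]);
        byte[] content = Files.readAllBytes(source);
        String name = source.getFileName().toString();
        System.out.println("Read " + content.length + " bytes from " + name
                + " (type: " + fileExtension(name) + ")");

        // Placeholder: pass 'content' to the conversion operation on the proxy
        // classes generated by wsimport, with OCR enabled, and write the returned
        // bytes to the target path below.
        Path target = source.resolveSibling(outputNameFor(name));
        System.out.println("The searchable PDF would be written to " + target);
    }
}
```

Keeping the file handling separate from the proxy call means the plumbing survives unchanged if wsimport is run again and the generated class names change.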

As all this functionality is exposed via a Web Services interface, it works equally well from .NET, PHP, Ruby and other web services enabled environments. Please note that you need the OCR and PDF/A Archiving add-on license in addition to a valid PDF Converter Services or PDF Converter for SharePoint License in order to use this functionality.

This code is merely an example of what is possible. Feel free to adapt it to your own needs.

Any questions or remarks? Leave a message in the comments below or contact us.

Labels: Articles, Java, OCR, pdf, PDF Converter Professional, PDF Converter Services

© Muhimbi Ltd. 2008 - 2024