
Converting scans and images to searchable PDFs using Java and server side OCR

Posted at: 6:57 PM on 17 October 2013 by Muhimbi

In this article we explain how to use Java and server-based Optical Character Recognition (OCR) to convert image-based files such as TIFF, PNG and scanned PDFs into fully searchable and indexable PDF files. In addition to converting, merging, watermarking, splitting and securing documents, the latest version of the Muhimbi PDF Converter Services also comes with a comprehensive OCR engine.

A .NET (C#) version of this article is available here.
 

For those not familiar with the product, the Muhimbi PDF Converter Services is a server-based SDK that allows software developers to convert typical Office files, including MS-Word, Excel, PowerPoint, Visio, Publisher, AutoCAD, HTML, emails and InfoPath, to PDF and other formats, and to manipulate the results. It exposes a robust, scalable but friendly Web Services interface that can be called from Java, PHP, Ruby and .NET based solutions. If you have any questions or need more information then please let us know.

Please read this post for a high level overview of our new OCR facilities.

 

Sample code

The following sample illustrates how to use OCR to convert a file (preferably a scan) into a fully searchable PDF. In this example we use wsimport to generate web service proxy classes, but other web service frameworks are supported as well. We actually prefer Apache Axis2 (see this sample) as it generates cleaner and more flexible code.

The example described below assumes the following:

  1. The JDK has been installed and configured.
  2. The Conversion Service and all prerequisites have been installed in line with the Administration Guide.
  3. The Conversion Service is running in the default anonymous mode. This is not an absolute requirement, but it makes initial experimentation much easier.

 

As of version 7.1 this sample code is automatically installed alongside the product. The source code, including pre-generated proxy classes for the web service and a sample file, can be downloaded here. If you choose not to download the sample code then please carry out the steps described below.

The first step is to generate proxy classes for the web service by executing the following command:

     wsimport http://localhost:41734/Muhimbi.DocumentConverter.WebService/?wsdl -d src -Xnocompile -p com.muhimbi.ws

 
Feel free to change the package name and destination directory to something more suitable for your organisation. If you are generating proxies from a remote system (a system that doesn’t run the Muhimbi Conversion Service) then please see this Knowledge Base Article.

Wsimport automatically generates the Java class names. Unfortunately some of the generated names are rather long and ugly so you may want to consider renaming some, particularly the Exception classes, to something friendlier. This, however, means that if you ever run wsimport again you will need to re-apply those changes. For more information have a look at the high level overview of the Object Model exposed by the web service.

Once the proxy classes have been created add the following sample code to your project. Run the code and make sure the file to OCR is specified on the command line.
 

package com.muhimbi.app;

import com.muhimbi.ws.*;

import java.io.*;
import java.net.URL;
import java.util.List;

import javax.xml.bind.JAXBElement;
import javax.xml.namespace.QName;

public class WsClient {

  private final static String DOCUMENTCONVERTERSERVICE_WSDL_LOCATION =
        "http://localhost:41734/Muhimbi.DocumentConverter.WebService/?wsdl";

  private static ObjectFactory _objectFactory = new ObjectFactory();
 
  public static void main(String[] args) {
    try {
      if (args.length != 1) {
        System.out.println("Please specify a single file name on the command line.");

      } else {
        // ** Process command line parameters
        String sourceDocumentPath = args[0];
        File file = new File(sourceDocumentPath);
        String fileName = getFileName(file);
        String fileExt = getFileExtension(file);

        System.out.println("Processing file " + sourceDocumentPath);

        // ** Initialise Web Service
        DocumentConverterService_Service dcss = new DocumentConverterService_Service(
            new URL(DOCUMENTCONVERTERSERVICE_WSDL_LOCATION),
            new QName("http://tempuri.org/", "DocumentConverterService"));
        DocumentConverterService dcs = dcss.getBasicHttpBindingDocumentConverterService();

        // ** Only call conversion if file extension is supported
        if (isFileExtensionSupported(fileExt, dcs)) {
          // ** Read source file from disk
          byte[] fileContent = readFile(sourceDocumentPath);

          // ** Converting the file
          OpenOptions openOptions = getOpenOptions(fileName, fileExt);
          ConversionSettings conversionSettings = getConversionSettings();
          byte[] convertedFile = dcs.convert(fileContent, openOptions, conversionSettings);

          // ** Writing converted file to file system
          String destinationDocumentPath = getPDFDocumentPath(file);
          writeFile(convertedFile, destinationDocumentPath);
          System.out.println("File converted successfully to " + destinationDocumentPath);

        } else {
          System.out.println("The file extension is not supported.");
        }
      }

    } catch (IOException e) {
      System.out.println(e.getMessage());
    } catch (DocumentConverterServiceGetConfigurationWebServiceFaultExceptionFaultFaultMessage e) {
      printException(e.getFaultInfo());
    } catch (DocumentConverterServiceConvertWebServiceFaultExceptionFaultFaultMessage e) {
      printException(e.getFaultInfo());
    }
  }

  public static OpenOptions getOpenOptions(String fileName, String fileExtension) {
    OpenOptions openOptions = new OpenOptions();
    // ** Set the minimum required open options. Additional options are available
    openOptions.setOriginalFileName(_objectFactory.createOpenOptionsOriginalFileName(fileName));
    openOptions.setFileExtension(_objectFactory.createOpenOptionsFileExtension(fileExtension));
    return openOptions;
  }

  public static ConversionSettings getConversionSettings() {
    ConversionSettings conversionSettings = new ConversionSettings();
    // ** Set the minimum required conversion settings. Additional settings are available
    conversionSettings.setQuality(ConversionQuality.OPTIMIZE_FOR_PRINT);
    conversionSettings.setRange(ConversionRange.ALL_DOCUMENTS);
    conversionSettings.getFidelity().add("Full");
    conversionSettings.setFormat(OutputFormat.PDF);
    conversionSettings.setOCRSettings(_objectFactory.createConversionSettingsOCRSettings(getOCRSettings()));
    return conversionSettings;
  }

  public static OCRSettings getOCRSettings() {
    OCRSettings ocrSettings = new OCRSettings();
    ocrSettings.setLanguage(_objectFactory.createOCRSettingsLanguage("eng"));
    ocrSettings.setPerformance(OCRPerformance.SLOW);
    ocrSettings.setWhiteList(_objectFactory.createOCRSettingsWhiteList(""));
    ocrSettings.setBlackList(_objectFactory.createOCRSettingsBlackList(""));
    return ocrSettings;
  }

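  public static String getFileName(File file) {
    String fileName = file.getName();
    // ** lastIndexOf returns -1 when the name has no extension
    int dot = fileName.lastIndexOf('.');
    return dot < 0 ? fileName : fileName.substring(0, dot);
  }

  public static String getFileExtension(File file) {
    String fileName = file.getName();
    // ** Return an empty extension rather than the whole name when there is no dot
    int dot = fileName.lastIndexOf('.');
    return dot < 0 ? "" : fileName.substring(dot + 1);
  }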

  public static String getPDFDocumentPath(File file) {
    String fileName = getFileName(file);
    String folder = file.getParent();
    if (folder == null) {
      folder = new File(file.getAbsolutePath()).getParent();
    }
    return folder + File.separatorChar + fileName + "_ocr."
        + OutputFormat.PDF.value();
  }

  public static byte[] readFile(String filepath) throws IOException {
    File file = new File(filepath);
    byte[] bytes = new byte[(int) file.length()];

    // ** try-with-resources closes the stream even if reading fails
    try (InputStream is = new FileInputStream(file)) {
      int offset = 0;
      int numRead;
      while (offset < bytes.length
          && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
        offset += numRead;
      }

      if (offset < bytes.length) {
        throw new IOException("Could not completely read file " + file.getName());
      }
    }
    return bytes;
  }

  public static void writeFile(byte[] fileContent, String filepath)
      throws IOException {
    // ** try-with-resources closes the stream even if writing fails
    try (OutputStream os = new FileOutputStream(filepath)) {
      os.write(fileContent);
    }
  }

  public static boolean isFileExtensionSupported(String extension, DocumentConverterService dcs)
      throws DocumentConverterServiceGetConfigurationWebServiceFaultExceptionFaultFaultMessage {
    Configuration configuration = dcs.getConfiguration();
    final JAXBElement<ArrayOfConverterConfiguration> converters = configuration.getConverters();
    final ArrayOfConverterConfiguration ofConverterConfiguration = converters.getValue();
    final List<ConverterConfiguration> cList = ofConverterConfiguration.getConverterConfiguration();

    for (ConverterConfiguration cc : cList) {
      final List<String> supportedExtension = cc.getSupportedFileExtensions()
          .getValue().getString();
      if (supportedExtension.contains(extension)) {
        return true;
      }
    }

    return false;
  }

  public static void printException(WebServiceFaultException serviceFaultException) {
    System.out.println(serviceFaultException.getExceptionType());
    JAXBElement<ArrayOfstring> element = serviceFaultException.getExceptionDetails();
    ArrayOfstring value = element.getValue();
    for (String msg : value.getString()) {
      System.out.println(msg);
    }
  }

}
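
The extension check in isFileExtensionSupported relies on List.contains, which compares strings exactly. A file named SCAN.TIF would therefore only pass if the service reports the extension in that same case. The following is a minimal sketch of a case-insensitive comparison. The class name ExtensionCheck and the sample extension list are assumptions made for illustration; in the real client the list would come from getSupportedFileExtensions().

```java
import java.util.Arrays;
import java.util.List;

public class ExtensionCheck {

  // Returns true if 'extension' matches any entry in 'supported', ignoring case.
  public static boolean isSupported(String extension, List<String> supported) {
    if (extension == null || extension.isEmpty()) {
      return false;
    }
    for (String candidate : supported) {
      if (extension.equalsIgnoreCase(candidate)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical list; the real values are reported by the Conversion Service.
    List<String> supported = Arrays.asList("pdf", "tif", "tiff", "png");
    System.out.println(isSupported("TIF", supported));  // prints true
    System.out.println(isSupported("docx", supported)); // prints false
  }
}
```

Whether this is needed depends on the casing the service actually returns, so treat it as an optional safeguard rather than a required change.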

 
As all this functionality is exposed via a Web Services interface, it works equally well from .NET, PHP, Ruby and other web services enabled environments. Please note that you need the PDF Converter Professional add-on license in addition to a valid PDF Converter Services or PDF Converter for SharePoint License in order to use this functionality.

This code is merely an example of what is possible. Feel free to adapt it to your own needs. The possibilities are endless.

 

Any questions or remarks? Leave a message in the comments below or contact us.


Converting scans and images to searchable PDFs using C# and server side OCR

Posted at: 5:49 PM on 16 October 2013 by Muhimbi

As of version 7.1 Muhimbi’s range of PDF Conversion products offers support for Optical Character Recognition (OCR). Like all other functionality provided by our products, this new OCR facility is available through our friendly Web Services interface as well as our SharePoint Designer and Nintex Workflow Actions.


In this post we’ll provide a simple .NET sample that invokes our Web Services interface to make an image based PDF fully searchable. The code is nearly identical to the code to convert and watermark a simple MS-Word file with the following exceptions:

  1. The code looks for PDF source files (an image based PDF is included in the downloadable sample code).
  2. The conversionSettings.OCRSettings property is populated with relevant OCR settings such as the language.
  3. The client.ProcessChanges() method is invoked rather than client.Convert().
  4. All references to watermarks have been removed as they are not part of this sample.

 
You can apply the same changes to the PHP and Ruby samples to achieve the same result in those languages. A separate Java based OCR sample is available here.

 

Sample Code

Listed below is sample code to carry out OCR processing. You can either copy the code from this blog post, download the Visual Studio Project or open the project from the Sample Code folder in the Windows Start Menu.

The sample code expects the path of the source PDF file on the command line. If the path is omitted then the first PDF file found in the current directory will be used.
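
For readers following the Java sample, which requires a file name on the command line, the same fallback can be sketched in Java. This is an illustration only: the class name SourceFileResolver and the alphabetical ordering are assumptions, not part of the product or of the C# sample.

```java
import java.io.File;
import java.util.Arrays;

public class SourceFileResolver {

  // Returns the path given on the command line, or the first *.pdf file
  // (in sorted order) found in 'folder'. Returns null if none exists.
  public static String resolve(String[] args, File folder) {
    if (args.length > 0) {
      return args[0];
    }
    File[] pdfs = folder.listFiles(
        (dir, name) -> name.toLowerCase().endsWith(".pdf"));
    if (pdfs == null || pdfs.length == 0) {
      return null;
    }
    Arrays.sort(pdfs); // listFiles gives no ordering guarantee
    return pdfs[0].getPath();
  }

  public static void main(String[] args) {
    String source = resolve(args, new File("."));
    if (source == null) {
      System.out.println("Please specify a document to OCR.");
    } else {
      System.out.println("Processing file " + source);
    }
  }
}
```

Sorting makes the choice of "first" file predictable, which is helpful when a folder contains both the source scan and a previously generated _ocr.pdf file.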

  1. Download and install version 7.1 of the Muhimbi PDF Converter Services or PDF Converter for SharePoint.
     
  2. Create a new Visual Studio C# Console application named OCR_PDF.
     
  3. Add a Service Reference to the following URL and specify ConversionService as the namespace. If you are developing on a remote system (a system that doesn’t run the Muhimbi Conversion Service) then please see this Knowledge Base Article.
      
        http://localhost:41734/Muhimbi.DocumentConverter.WebService/?wsdl
     
  4. Paste the following code into Program.cs.
     
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.ServiceModel;
    using OCR_PDF.ConversionService;
     
    namespace OCR_PDF
    {
        class Program
        {
            //** !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! **
            //** This code sample is identical to a normal conversion request except for    **
            //** the part marked with "OCR OCR OCR". For more information see               **
            //** http://blog.muhimbi.com/2013/09/ocr-facilities-provide-by-muhimbis.html    **
            //** !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! **
     
            // ** The URL where the Web Service is located. Amend host name if needed.
            static string SERVICE_URL = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/";
     
            static void Main(string[] args)
            {
                DocumentConverterServiceClient client = null;
     
                try
                {
                    // ** Delete any processed files from a previous run
                    foreach (FileInfo f in new DirectoryInfo(".").GetFiles("*_ocr.pdf"))
                        f.Delete();
     
                    // ** Determine the source file and read it into a byte array.
                    string sourceFileName = null;
                    if (args.Length == 0)
                    {
                      // ** If nothing is specified then read the first PDF file from the folder.
                      string[] sourceFiles = Directory.GetFiles(Directory.GetCurrentDirectory(), "*.pdf");
                      if (sourceFiles.Length > 0)
                          sourceFileName = sourceFiles[0];
                      else
                      {
                          Console.WriteLine("Please specify a document to OCR.");
                          Console.ReadKey();
                          return;
                      }
                    }
                    else
                        sourceFileName = args[0];
     
                    byte[] sourceFile = File.ReadAllBytes(sourceFileName);
     
                    // ** Open the service and configure the bindings
                    client = OpenService(SERVICE_URL);
     
                    //** Set the absolute minimum open options
                    OpenOptions openOptions = new OpenOptions();
                    openOptions.OriginalFileName = Path.GetFileName(sourceFileName);
                    openOptions.FileExtension = Path.GetExtension(sourceFileName);
     
                    // ** Set the absolute minimum conversion settings.
                    ConversionSettings conversionSettings = new ConversionSettings();
     
                    // ** OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR 
                    OCRSettings ocr = new OCRSettings();
                    ocr.Language = OCRLanguage.English.ToString();
                    ocr.Performance = OCRPerformance.Slow;
                    ocr.WhiteList = string.Empty;
                    ocr.BlackList = string.Empty;
                    conversionSettings.OCRSettings = ocr;
                    // ** OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR OCR 
     
                    // ** Carry out the conversion.
                    Console.WriteLine("Processing file " + sourceFileName + ".");
                    byte[] convFile = client.ProcessChanges(sourceFile, openOptions, conversionSettings);
     
                    // ** Write the processed file back to the file system with a PDF extension.
                    string destinationFileName = Path.GetFileNameWithoutExtension(sourceFileName) + "_ocr.pdf";
                    using (FileStream fs = File.Create(destinationFileName))
                    {
                        fs.Write(convFile, 0, convFile.Length);
                        fs.Close();
                    }
     
                    Console.WriteLine("File written to " + destinationFileName);
     
                    // ** Open the generated PDF file in a PDF Reader
                    Console.WriteLine("Launching file in PDF Reader");
                    Process.Start(destinationFileName);
                }
                catch (FaultException<WebServiceFaultException> ex)
                {
                    Console.WriteLine("FaultException occurred: ExceptionType: " +
                                    ex.Detail.ExceptionType.ToString());
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.ToString());
                }
                finally
                {
                    CloseService(client);
                }
                Console.ReadKey();
            }
     
     
            /// <summary>
            /// Configure the Bindings, endpoints and open the service using the specified address.
            /// </summary>
            /// <returns>An instance of the Web Service.</returns>
            public static DocumentConverterServiceClient OpenService(string address)
            {
                DocumentConverterServiceClient client = null;
     
                try
                {
                    BasicHttpBinding binding = new BasicHttpBinding();
                    // ** Use standard Windows Security.
                    binding.Security.Mode = BasicHttpSecurityMode.TransportCredentialOnly;
                    binding.Security.Transport.ClientCredentialType =
                                                                  HttpClientCredentialType.Windows;
                    // ** Increase the client Timeout to deal with (very) long running requests.
                    binding.SendTimeout = TimeSpan.FromMinutes(30);
                    binding.ReceiveTimeout = TimeSpan.FromMinutes(30);
                    // ** Set the maximum document size to 50MB
                    binding.MaxReceivedMessageSize = 50 * 1024 * 1024;
                    binding.ReaderQuotas.MaxArrayLength = 50 * 1024 * 1024;
                    binding.ReaderQuotas.MaxStringContentLength = 50 * 1024 * 1024;
     
                    // ** Specify an identity (any identity) in order to get it past .net3.5 sp1
                    EndpointIdentity epi = EndpointIdentity.CreateUpnIdentity("unknown");
                    EndpointAddress epa = new EndpointAddress(new Uri(address), epi);
     
                    client = new DocumentConverterServiceClient(binding, epa);
     
                    client.Open();
     
                    return client;
                }
                catch (Exception)
                {
                    CloseService(client);
                    throw;
                }
            }
     
            /// <summary>
            /// Check if the client is open and then close it.
            /// </summary>
            /// <param name="client">The client to close</param>
            public static void CloseService(DocumentConverterServiceClient client)
            {
                if (client != null && client.State == CommunicationState.Opened)
                    client.Close();
            }
     
        }
    }

  5. Make sure the output folder contains an image based PDF (e.g. a scan).
     
  6. Compile and execute the application. The processed PDF file will automatically be opened in your system’s PDF reader. Try using your PDF Reader’s search facility to find and highlight the OCRed text.
     

 
As all this functionality is exposed via a Web Services interface, it works equally well from Java, PHP, Ruby and other web services enabled environments. Please note that you need the PDF Converter Professional add-on license in addition to a valid PDF Converter for SharePoint or PDF Converter Services License in order to use this functionality.

This code is merely an example of what is possible. Feel free to adapt it to your own needs. The possibilities are endless.

 

Any questions or remarks? Leave a message in the comments below or contact us.
