OCR Facilities provided by Muhimbi’s server based PDF Conversion products

Posted at: 15:29 on 19 September 2013 by Muhimbi

Starting with the release of version 7.1 of the Muhimbi PDF Converter API and Server Platform and the Muhimbi PDF Converter for SharePoint, we have added support for OCR (Optical Character Recognition). This has been on our roadmap for years, and requested by many customers, but developing such advanced functionality takes some time.

Please read on for details about scenarios that benefit from Optical Character Recognition and the OCR facilities provided by our range of server based PDF Conversion products, as well as some sample files. You may also want to have a look at The How and Why of OCR.

 

OCR Use Case 1 – Making scanned documents searchable

One of the more popular questions our support desk receives is whether converted PDF files are searchable by users and indexable by search engines. The answer to that question has always been Yes, provided the source document consists of real text, as is the case for MS-Word, Excel, MSG, EML, HTML and most of the other file formats we support.

The story is quite different when the source file is a scanned document, which generally just contains a picture of the text. Search engines usually do not understand these image based files and will simply skip them. The solution is to OCR these documents, a process that recognises the text and places it in a hidden layer. The resulting document still looks identical to the original file, but search engines and PDF readers are intelligent enough to retrieve the text. As a result, scanned documents are fully searchable and content can even be copied to the clipboard for pasting into other applications.

 

OCR Use Case 2 – Extracting text from a page region

Another common use for OCR is extracting text from an area on a page. Let’s take an order processing system as an example. Orders arrive via various channels, but they always use a predetermined template. Each order is passed through a scanner and placed in a file repository such as SharePoint. Although the computer happily stores the scanned file, it cannot interpret its content, so it is up to a human to enter meta-data such as the Customer ID and Order Number. Provided this information is always stored in (roughly) the same area, an OCR based solution can extract the details automatically and without human intervention.
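As a rough illustration of the idea, and not of the product's API (text extraction is only planned for a later version), a template could be described as a set of named regions measured in millimetres and converted into pixel boxes for a scan of a given resolution. The field names, coordinates and helper functions below are all hypothetical.

```python
MM_PER_INCH = 25.4


def mm_to_px(mm, dpi):
    """Convert a length in millimetres to whole pixels at the scan resolution."""
    return round(mm / MM_PER_INCH * dpi)


def region_to_pixel_box(region_mm, dpi):
    """Convert (left, top, width, height) in mm to (left, top, right, bottom) in pixels."""
    left, top, width, height = region_mm
    return (
        mm_to_px(left, dpi),
        mm_to_px(top, dpi),
        mm_to_px(left + width, dpi),
        mm_to_px(top + height, dpi),
    )


# Hypothetical order template: where each field sits on the page, in mm.
ORDER_TEMPLATE = {
    "customer_id": (20.0, 30.0, 60.0, 10.0),
    "order_number": (120.0, 30.0, 60.0, 10.0),
}


def template_boxes(template, dpi):
    """Return the pixel box for every named region in the template."""
    return {name: region_to_pixel_box(region, dpi) for name, region in template.items()}
```

Each resulting box could then be cropped from the scanned page and passed to an OCR engine, so only the area that holds the Customer ID or Order Number is recognised.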

 

The use case for making scanned documents searchable is supported by version 7.1 of the PDF Converter. Extracting text is planned for version 7.2.

The key features of the current release are:

  • Server based solution, accessible via a modern Web Service interface (Java, C#, Ruby, PHP etc)
  • Integrates with SharePoint Designer and Nintex workflows.
  • Convert image based files such as TIFF, Scanned PDF, PNG, JPG, BMP, GIF to searchable PDFs.
  • Support for multiple languages (Danish, German, English, Dutch, Finnish, French, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish, with more to come).
  • Additional languages and custom fonts can be added by customers and third parties.
  • OCR is fully integrated with the conversion pipeline allowing a single web service call to Convert, OCR, Watermark, Merge and Secure documents.
  • Whitelist / Blacklist certain characters. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
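To show why whitelisting helps, here is a minimal sketch of the general technique. It is not how the Muhimbi software implements it, and the candidate lists and confidence values are made up: for each glyph the recogniser proposes several characters, and the whitelist or blacklist simply removes the ones that are not allowed before the best one is chosen.

```python
def pick_character(candidates, whitelist=None, blacklist=""):
    """Return the most confident candidate allowed by the lists, or None."""
    allowed = [
        (char, confidence)
        for char, confidence in candidates
        if (whitelist is None or char in whitelist) and char not in blacklist
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda pair: pair[1])[0]


def recognise(glyphs, whitelist=None, blacklist=""):
    """Pick one character per glyph and join the results into a string."""
    chars = (pick_character(c, whitelist, blacklist) for c in glyphs)
    return "".join(ch for ch in chars if ch is not None)


# Made-up recogniser output: (character, confidence) pairs per glyph.
glyphs = [
    [("1", 0.95), ("l", 0.90)],
    [("O", 0.55), ("0", 0.45)],
    [("o", 0.50), ("0", 0.48)],
]

print(recognise(glyphs))                          # prints "1Oo"
print(recognise(glyphs, whitelist="1234567890"))  # prints "100"
```

Without a whitelist the letters narrowly win for the second and third glyph. Restricting recognition to digits leaves the zero as the only valid candidate.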

 

Scanned Document with OCRed text selected

 

Please keep in mind that OCR is not some kind of white (or black) magic. If the source material is of poor quality (a lot of noise, scratches, low resolution, funny fonts) then don’t expect your text to be recognised with a high level of accuracy. However, when the scans use at least 300dpi and the font size is not smaller than 10pt, then you should see good results.
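To get a feel for why resolution matters, the nominal height of a font in pixels can be estimated from its point size (1 point is 1/72 inch) and the scan resolution. This is a back-of-the-envelope sketch; the function name is ours and real glyphs are smaller than the nominal font height.

```python
POINTS_PER_INCH = 72


def font_height_px(font_size_pt, dpi):
    """Nominal font height in pixels for a point size scanned at the given dpi."""
    return font_size_pt / POINTS_PER_INCH * dpi


for dpi in (150, 300, 400):
    print(f"10pt at {dpi}dpi is roughly {font_height_px(10, dpi):.0f} pixels tall")
```

At 300dpi a 10pt font is roughly 42 pixels tall, whereas at 150dpi only about 21 pixels are available to tell similar characters apart.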

The current release is not perfect, but we will continue to improve the OCR facilities over the next few years in the same way that we keep evolving the rest of the product.

The main limitations are currently as follows:

  • The JPXDecode (JPEG2000) image encoding type is currently not supported. As a workaround, use our software to convert the JPEG2000 encoded PDF to a PDF version that uses a different encoding (e.g. PDF 1.4).
  • Performance is not yet as quick as we would like it to be. Note that OCR performance is measured in seconds per page, not milliseconds per page like most of the other operations carried out by our software. We are looking to improve performance significantly though. Multiple OCR tasks can already execute in parallel to make use of multicore systems.
  • The system cannot be used to recognise human handwriting.
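Because several OCR tasks can run in parallel, a client with a batch of scans may want to submit them concurrently rather than one at a time. The sketch below shows that pattern with Python's standard thread pool. `ocr_document` is a hypothetical placeholder for the actual web service call, not part of the Muhimbi API, and the number of workers is an assumption that should be matched to the server's cores.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_document(path):
    """Hypothetical stand-in for a web service call that OCRs one file."""
    return f"{path}.ocr.pdf"


def ocr_all(paths, max_workers=4):
    """Submit all files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_document, paths))


results = ocr_all(["order-001.tif", "order-002.tif", "order-003.tif"])
```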

 

Sample files

 

As they say, a picture is worth a thousand words, so example files are provided below. Both the original scanned image and a PDF that OCR has been carried out on can be downloaded. Open the OCRed PDF and try searching for a word.
 

Download Original, unprocessed, PDF

Download OCRed PDF

 
This test document was created by printing our license agreement on a standard inkjet printer and then scanning it as a 400dpi monochrome image using JBIG2 (lossless) compression. As you can see, text is recognised very well. We left some imperfections in on purpose as we don’t want to provide a doctored example that somehow claims 100% accuracy. No product can do that and neither can ours. This example was prepared using a beta version of the software.

Please note that you need the OCR and PDF/A Archiving add-on license in addition to a valid PDF Converter for SharePoint or PDF Converter API and Server Platform License in order to use this functionality.

 

Any questions or remarks? Leave a message in the comments below or contact us.


RevoDrive3 X2 & Mirrored / RAID-1 Samsung Evo 840 1TB Performance tests

Posted at: 14:17 on 04 September 2013 by Muhimbi

Almost exactly a year ago we published a post about our experience building a nice virtualisation rig from scratch. We included a RevoDrive3 X2 as well, but unfortunately this fancy 480GB SSD drive is running out of space. So we recently purchased 2 Samsung Evo 840 1TB SSD drives to give us some breathing space.

As this is the first time in a year this system has been brought down, we decided to run some performance tests. Read on for details about the performance of a RevoDrive after it has been thrashed for a year, as well as the new Samsung Evo drives in stand-alone and RAID-1 (mirrored) configurations.

This is by no means an extensive performance test, but we find it interesting nevertheless.

The tests were carried out on this system using ATTO Disk Benchmark v2.47 and CrystalDiskMark 3.0.1 x64, both using the default settings and queue depths.

 

RevoDrive3 X2 tests

The following tests were carried out:

  1. New & Empty: Executed after the drive was originally purchased and completely empty.
  2. Full, before optimisation: Executed after a year with the drive largely full.
  3. Empty, before optimisation: Executed after all files were deleted (but the drive was not formatted).
  4. Empty, after optimisation: Executed after the RevoDrive BIOS was updated and a secure erase was carried out.
  5. Full, after optimisation: Executed after all files were copied back to it.

 

Screenshots: ATTO Disk Benchmark results for (1) New & Empty; (2) Full, before optimisation; (3) Empty, before optimisation; (4) Empty, after optimisation; (5) Full, after optimisation.


What does this show us? Not much, as there are too many bars in these graphs. CrystalDiskMark paints a much easier to understand picture.

Screenshots: CrystalDiskMark results for (1) New & Empty; (2) Full, before optimisation; (3) Empty, before optimisation; (4) Empty, after optimisation; (5) Full, after optimisation.

Please note that it is nearly impossible to draw any real conclusions from the CrystalDiskMark results, as Windows Server 2012, the OS the test server runs on, has received a number of patches during the year. Regardless, we can see the following:

  1. After a year, write performance has gone down significantly between scenarios 1 (new & empty) and 3 (1 year old & empty): 540MB/s vs 394MB/s.
  2. Reading from a full RevoDrive appears to be faster compared to reading from an empty drive.
  3. Writing to an empty drive appears to be faster compared to writing to a full drive.
  4. Carrying out a secure erase / upgrading the drive's BIOS makes the drive nice and fast again. (Sequential is actually slightly faster now, random 4K is slightly slower.)
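To put the first observation into perspective, the drop can be expressed as a percentage. The short calculation below uses the two write figures quoted above; the helper name is ours.

```python
def percent_change(before, after):
    """Relative change from 'before' to 'after', as a percentage."""
    return (after - before) / before * 100


new_write_mb_s = 540   # scenario 1: new & empty
aged_write_mb_s = 394  # scenario 3: 1 year old & empty

print(f"{percent_change(new_write_mb_s, aged_write_mb_s):.1f}%")  # prints "-27.0%"
```

In other words, after a year of use the empty drive writes roughly 27% slower than when it was new.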

 

In conclusion, it is worth carrying out a secure erase / BIOS upgrade once a year. Please make sure that any data is copied to another disk before carrying out these steps or you will lose it.

 

Samsung Evo 840 1TB

We have no historical figures for the Samsung drive, but we did execute a number of tests to see the difference between running the drive in stand-alone mode, software RAID-1 and hardware RAID-1. The figures are surprising. Please note that Samsung Rapid mode can only accelerate 1 drive, so it was not tested.

The test motherboard comes with 3 different 6Gbps SATA controllers: an excellent 2 port Intel one, which is unfortunately already in use by the RAID-1 boot volume made up of the previously purchased Samsung 830 drives; a rather pathetically slow LSI controller, which automatically devalues any SSD drive; and a Marvell controller. The tests below were carried out using the Marvell controller.

The following tests were carried out:

  1. Stand alone: A single drive.
  2. Soft mirror: 2 drives connected to the Marvell controller using Windows Server 2012 Software RAID-1.
  3. Hard mirror: 2 drives connected to the Marvell controller using the controller’s internal RAID-1 capabilities.

 

Screenshots: ATTO Disk Benchmark results for (1) Stand alone; (2) Soft mirror; (3) Hard mirror.

Screenshots: CrystalDiskMark results for (1) Stand alone; (2) Soft mirror; (3) Hard mirror.


It appears that we can actually draw a conclusion from this: using RAID-1, writing is significantly slower compared to stand-alone (no RAID) use. Reading from a hardware RAID, however, is significantly faster, at least under test circumstances.

 

Any comments? Did we do something right or wrong? Leave a message below.
