Posted at: 6:00 PM on 28 March 2014 by Muhimbi
With the release of version 7.1 of the PDF Converter for SharePoint we added a fundamental new technology to our Document Conversion and Manipulation platform, Optical Character Recognition (OCR). That initial release was able to process scanned / bitmap based content and generate fully searchable PDFs.
With the introduction of version 7.2 we are adding support for a new OCR related use case: the ability to recognise text on a page, or on part of a page, and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that all share a common template or layout. For example, if a reference number can always be found in the top-right corner of scanned documents, then that text can be extracted and stored in a SharePoint column, from where it can be included in searches or used in further workflow steps… pretty powerful stuff.
This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.
For more details, including an introduction, see these related blog posts.
- OCR Facilities provided by Muhimbi’s server based PDF Conversion products
- Carry out OCR using a web service call (.NET)
- Carry out OCR using a web service call (Java)
- Converting scans and images to searchable PDFs using SharePoint Designer Workflows
- Converting scans and images to searchable PDFs using OCR & Nintex Workflow
Once the Muhimbi PDF Converter for SharePoint is installed you will find a number of new Workflow Activities in SharePoint Designer. One of these activities is named Extract Text using OCR and accepts the following parameters.
- this document: The source document to OCR and extract text from. For most workflows selecting Current Item will suffice, but some scenarios may require looking up a different item.
- OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
- OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
- Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
- Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
- Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, take into account that the image is first converted to PDF internally. This may add margins around the image, but it guarantees that a single, unified UOM is used across all file formats. If you are not sure how the internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader.
- Page: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
- Result: The recognised text will be stored in this variable (type String).
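Because the Region values are expressed in units of 1/72nd of an inch, it can help to work out the numbers before typing them into the activity. The following Python snippet is only a rough sketch of that arithmetic and is not part of the Muhimbi product: the names Region, pixels_to_points and region_from_pixels are made up for this illustration, the scan resolution (DPI) is something you need to know about your own image, and margin_pt is a hypothetical stand-in for whatever margin the internal PDF conversion may add. It also assumes the pixel coordinates are measured from the same corner of the page as the Region values, so verify the result against the converted PDF.

```python
from dataclasses import dataclass

# The activity's unit of measure is 1/72nd of an inch.
POINTS_PER_INCH = 72


@dataclass(frozen=True)
class Region:
    """A region in units of 1/72 inch. Illustrative type, not a Muhimbi API."""
    x: float
    y: float
    width: float
    height: float


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel measurement at the given resolution to 1/72 inch units."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


def region_from_pixels(x_px: float, y_px: float, width_px: float,
                       height_px: float, dpi: float,
                       margin_pt: float = 0.0) -> Region:
    """Build a Region from pixel coordinates measured on the original scan.

    margin_pt is an assumption: it shifts the region by any margin that the
    conversion to PDF added around the image. Measure it in the converted PDF.
    """
    return Region(
        x=pixels_to_points(x_px, dpi) + margin_pt,
        y=pixels_to_points(y_px, dpi) + margin_pt,
        width=pixels_to_points(width_px, dpi),
        height=pixels_to_points(height_px, dpi),
    )


# Example: a 600 x 120 pixel box at pixel position (1800, 150) on a 300 DPI scan.
region = region_from_pixels(1800, 150, 600, 120, dpi=300)
print(region)
```

The resulting x, y, width and height are the kind of values that would go into the Region field. If the converted PDF turns out to have margins around the image, pass the measured margin as margin_pt rather than guessing it.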
Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.
Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.
Any questions or comments? Leave a message below or contact us.