
Extract text from scanned content using OCR and Nintex Workflow

Posted at: 14:53 on 31 March 2014 by Muhimbi

With the release of version 7.1 of the PDF Converter for SharePoint we added a fundamental new technology to our Document Conversion and Manipulation platform: Optical Character Recognition (OCR). That initial release was able to process scanned / bitmap based content and generate fully searchable PDFs.

With the introduction of version 7.2 we are adding support for a new OCR related use case, which is the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that all use a common template or layout. For example, if a reference number can always be found at the top right corner of scanned documents then that text can be extracted and stored in a SharePoint column from where it can be included in searches or be used in further workflow steps… pretty powerful stuff.

This post describes the Nintex Workflow Activity. The SharePoint Designer equivalent can be found here.

For more details, including an introduction, see these related blog posts.

 

Once the Muhimbi PDF Converter for SharePoint is installed, and the Nintex Workflow Integration has been activated, a number of new activities will be added automatically to the list, including the new Extract text using OCR activity. It is compatible with Nintex Workflow 2007, 2010 & 2013 and this is what it looks like.
 

[Screenshot: the Extract text using OCR activity in Nintex Workflow]


Building a full example workflow is outside the scope of this post, as it is relatively easy to do. For details see our generic PDF Conversion for Nintex Workflow example.

The fields supported by this Workflow Activity are as follows:

  1. Language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  2. Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  3. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  4. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  5. Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single, unified UOM is used across all file formats. If you are not sure how the internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader.
  6. Page Number: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
  7. Output Text: The recognised text will be stored in this variable (type String).
  8. Source List ID & List Item: The item that triggered the workflow is processed by default. You can optionally specify the ID of a different List and List Item using workflow variables. Please use data type String for the List ID workflow variable. For the Item ID use type Item ID (in SharePoint 2007) or Integer (in SharePoint 2010 / 2013).
  9. Error Handling: Similar to the way some of Nintex’s own Workflow Activities allow errors to be captured and evaluated by subsequent actions, all of Muhimbi’s Workflow Activities allow the same. By default this facility is disabled, meaning that any error terminates the workflow.
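
The Region field expects values in units of 1/72nd of an inch (points). As a rough illustration of the arithmetic, the Python sketch below converts millimetres to points and works out a region in the top right corner of a page. The helper names, the US Letter page size and the margin are made up for this example. It also assumes that x and y are measured from the top left corner of the page, which this post does not state, so check the result against a test document before relying on it.

```python
POINTS_PER_INCH = 72.0  # the Region unit of measure is 1/72nd of an inch
MM_PER_INCH = 25.4


def mm_to_points(mm: float) -> float:
    """Convert millimetres to 1/72nd-inch units."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def top_right_region(page_width_pt, margin_pt, width_pt, height_pt):
    """Return (x, y, width, height) for a box in the top right corner.

    ASSUMPTION: the origin is the top left corner of the page.
    """
    x = page_width_pt - margin_pt - width_pt
    y = margin_pt
    return (x, y, width_pt, height_pt)


if __name__ == "__main__":
    # Hypothetical example: a 2 x 1 inch box, half an inch in from the
    # top and right edges of a US Letter page (8.5 inches wide).
    letter_width_pt = 8.5 * POINTS_PER_INCH
    print(top_right_region(letter_width_pt, 36, 144, 72))
```

The four values returned are what would be typed into the x, y, width and height boxes of the Region field.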

 

For more details about using the PDF Converter for SharePoint in combination with Nintex Workflow see this Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Extract text from scanned content using OCR and SharePoint Designer Workflows

Posted at: 18:00 on 28 March 2014 by Muhimbi

With the release of version 7.1 of the PDF Converter for SharePoint we added a fundamental new technology to our Document Conversion and Manipulation platform: Optical Character Recognition (OCR). That initial release was able to process scanned / bitmap based content and generate fully searchable PDFs.

With the introduction of version 7.2 we are adding support for a new OCR related use case, which is the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that all use a common template or layout. For example, if a reference number can always be found at the top right corner of scanned documents then that text can be extracted and stored in a SharePoint column from where it can be included in searches or be used in further workflow steps… pretty powerful stuff.

This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.

For more details, including an introduction, see these related blog posts.

 

Once the Muhimbi PDF Converter for SharePoint is installed you will find a number of new Workflow Activities in SharePoint Designer. One of these activities is named Extract Text using OCR and looks as follows. 
[Screenshot: the Extract Text using OCR activity in SharePoint Designer]

 
In typical Muhimbi fashion the workflow sentence is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.

  1. this document: The source document to OCR and extract text from. For most workflows selecting Current Item will suffice, but some scenarios may require looking up a different item.
  2. OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  3. OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  4. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  5. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  6. Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single, unified UOM is used across all file formats. If you are not sure how the internal conversion affects the dimensions of your image or scan, use our software to convert the file to PDF and open it in a PDF reader.
  7. Page: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
  8. Result: The recognised text will be stored in this variable (type String).
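
When the source is a scan rather than a PDF, a region measured in pixels needs converting to 1/72nd-inch units before it can be entered in the Region field. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, the scan resolution (DPI) has to come from your own scanner settings, and margin_pt is only a placeholder for whatever margin the internal PDF conversion adds, which you would need to measure yourself by converting a sample file and opening it in a PDF reader.

```python
def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel distance to 1/72nd-inch units for a given scan DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels / dpi * 72.0


def region_from_pixels(x_px, y_px, width_px, height_px, dpi, margin_pt=0.0):
    """Return (x, y, width, height) in 1/72nd-inch units.

    margin_pt is a hypothetical offset for margins added during the
    image-to-PDF conversion; it defaults to 0 because the real value
    depends on the file and must be measured.
    """
    return (
        pixels_to_points(x_px, dpi) + margin_pt,
        pixels_to_points(y_px, dpi) + margin_pt,
        pixels_to_points(width_px, dpi),
        pixels_to_points(height_px, dpi),
    )


if __name__ == "__main__":
    # Hypothetical 300 DPI scan with a reference number box at pixel
    # position (1800, 150) that is 600 pixels wide and 150 pixels high.
    print(region_from_pixels(1800, 150, 600, 150, dpi=300))
```

Only x and y are shifted by the margin; the width and height of the region do not change when a margin is added around the image.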

 

Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Converting scans and images to searchable PDFs using OCR & Nintex Workflow

Posted at: 15:37 on by Muhimbi

Although it had been years in the planning, we didn’t really make a big deal out of the support for Optical Character Recognition (OCR) when we shipped it as part of version 7.1 of the PDF Converter for SharePoint. We did this for a good reason as – although the underpinnings were working well – the actual integration point with Nintex Workflow wasn’t as nice as we wanted it to be.

With the release of version 7.2 we are adding two new Workflow Activities to both Nintex Workflow and SharePoint Designer. The first activity, described in this post, can be used to convert scanned content into fully searchable PDFs. A separate post will detail the other new OCR Activity, which can extract text from scanned content. For a high level overview of our OCR facilities please read the original announcement.

This post describes the Nintex Workflow version of the Workflow Activity. The SharePoint Designer equivalent can be found here.

 

Optical Character Recognition… sounds quite complex, what would you need that for? Well, most organisations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, invoices or legal documents are scanned and filed away in a file system / SharePoint library or other Document Management System. The problem is that this is ‘dead information’ that cannot be searched or indexed using traditional technology. Content is stored as one big image which cannot be indexed by search crawlers and, as a result, does not show up in search results.

This is where OCR comes in. OCR analyses image based content – e.g. a scanned PDF or an image embedded in an MS-Word file – applies some fancy recognition logic and then embeds the result in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers are clever enough to index this text as well. Confused? Have a look at the screenshot below.

[Screenshot: Scanned Document with OCRed text selected]

 
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which, although powerful, is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. It is compatible with Nintex Workflow 2007, 2010 & 2013 and this is what it looks like.
 
[Screenshot: the Convert to OCRed PDF activity in Nintex Workflow]

 
Building a full example workflow is outside the scope of this post, as it is relatively easy to do. For details see our generic PDF Conversion for Nintex Workflow example.

The fields supported by this Workflow Activity are as follows:

  1. Destination Path: The location to write the generated file to. Leave this field empty to use the same location as the source file. For details about how to specify paths to different libraries / site collections see this blog post.
  2. Output File Name: The name of the generated file. Leave this field empty to use the same name as the source file. Please note that if your source file is already in PDF format, and the Destination Path is the same as the Source Path, then leaving this field empty will overwrite it. 
  3. Meta data: Control if the source file’s SharePoint meta-data is copied to the destination file.
  4. Language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  5. Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  6. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  7. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  8. Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.
  9. Output List ID: If you wish to carry out further actions on the generated file, e.g. send it by email or perform a check-in, then you can optionally store the ID of the List the file was written to in a workflow variable of type String.
  10. PDF List Item ID: Similarly to Output List ID, the Item ID of the generated file can optionally be stored in a workflow variable of type Item ID (in SharePoint 2007) or Integer (in SharePoint 2010 / 2013).
  11. Source List ID & List Item: The item that triggered the workflow is processed by default. You can optionally specify the ID of a different List and List Item using workflow variables. Please use the same data types as used by Output List ID and PDF List Item ID.
  12. Error Handling: Similar to the way some of Nintex’s own Workflow Activities allow errors to be captured and evaluated by subsequent actions, all of Muhimbi’s Workflow Activities allow the same. By default this facility is disabled, meaning that any error terminates the workflow.
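
The overwrite warning under Output File Name can be modelled in a few lines. The Python sketch below only illustrates the rule as described in the list above; it is not how the product is implemented. The function names and example paths are hypothetical, and it assumes that the generated file takes the source file's name with a .pdf extension when Output File Name is left empty.

```python
from pathlib import PurePosixPath


def resolve_output(source: str, destination_path: str = "",
                   output_file_name: str = "") -> str:
    """Work out where the generated PDF would be written.

    Empty fields fall back to the source file's folder and name.
    ASSUMPTION: the fallback name is the source name with a .pdf suffix.
    """
    src = PurePosixPath(source)
    folder = PurePosixPath(destination_path) if destination_path else src.parent
    name = output_file_name or src.with_suffix(".pdf").name
    return str(folder / name)


def overwrites_source(source: str, destination_path: str = "",
                      output_file_name: str = "") -> bool:
    """True when the generated file would land on top of the source file."""
    target = resolve_output(source, destination_path, output_file_name)
    return target.lower() == str(PurePosixPath(source)).lower()


if __name__ == "__main__":
    # Hypothetical library paths, for illustration only.
    print(overwrites_source("/sites/docs/Invoices/scan.pdf"))
    print(overwrites_source("/sites/docs/Invoices/scan.tif"))
    print(overwrites_source("/sites/docs/Invoices/scan.pdf",
                            output_file_name="scan-ocr.pdf"))
```

In this model only the first call reports an overwrite: the source is already a PDF and both fields are empty. Giving the output a different name, or a different Destination Path, avoids it.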

 

For more details about using the PDF Converter for SharePoint in combination with Nintex Workflow see this Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Converting scans and images to searchable PDFs using SharePoint Designer Workflows

Posted at: 13:52 on by Muhimbi

Although it had been years in the planning, we didn’t really make a big deal out of the support for Optical Character Recognition (OCR) when we shipped it as part of version 7.1 of the PDF Converter for SharePoint. We did this for a good reason as – although the underpinnings were working well – the actual integration points with SharePoint, specifically SharePoint Designer Workflows, weren’t as nice as we wanted them to be.

With the release of version 7.2 we are adding two new Workflow Activities to both Nintex Workflow and SharePoint Designer. The first activity, described in this post, can be used to convert scanned content into fully searchable PDFs. A separate post will detail the other new OCR Activity, which can extract text from scanned content. For a high level overview of our OCR facilities please read the original announcement.

This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.

 

Optical Character Recognition… sounds quite complex, what would you need that for? Well, most organisations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, invoices or legal documents are scanned and filed away in a file system / SharePoint library or other Document Management System. The problem is that this is ‘dead information’ that cannot be searched or indexed using traditional technology. Content is stored as one big image which cannot be indexed by search crawlers and, as a result, does not show up in search results.

This is where OCR comes in. OCR analyses image based content – e.g. a scanned PDF or an image embedded in an MS-Word file – applies some fancy recognition logic and then embeds the result in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers are clever enough to index this text as well. Confused? Have a look at the screenshot below.
 

[Screenshot: Scanned Document with OCRed text selected]

 
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which, although powerful, is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. This is what it looks like.
 

[Screenshot: the Convert to OCRed PDF activity in SharePoint Designer]

 
In typical Muhimbi fashion the workflow sentence is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.

  1. this document: The source document to convert and OCR. For most workflows selecting Current Item will suffice, but some scenarios may require looking up a different item.
  2. this file: The name and location to write the generated file to. Leave this field empty to use the same location and name as the source file. Please note that if your source file is already in PDF format then leaving this field empty will overwrite it. For details about how to specify paths to different libraries / site collections see this blog post.
  3. include / exclude meta data: Control if the source file’s SharePoint meta-data is copied to the destination file.
  4. OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  5. OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  6. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  7. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  8. Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.
  9. List ID: The ID of the list the processed file was written to. This can be used later in the workflow to perform additional tasks on the file, such as a check-in or check-out.
  10. Item ID: The ID of the processed file. It can be used together with the List ID.

 

Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Converting MSG based Calendar entries to PDF including embedded content

Posted at: 17:38 on 27 March 2014 by Muhimbi

When we first started investigating the conversion of email based files (MSG, EML) we had a good look at the market to see if there was anything ‘out there’ that we could license. Although there were some development libraries on the market that provided part of the solution, to our surprise there was no single capable MSG to PDF Converter on the market, at least not one that did an acceptable job (we have very high standards).

So we did what we always do: rolled up our sleeves and spent the usual cycle of building, testing, listening to our customers and improving some more. Rinse and repeat for a couple of years, resulting in the most capable, flexible and, dare I say it, prettiest Email to PDF Converter on the market. It wasn’t easy, believe me, but it works great and happily churns through millions of emails, especially in organisations subject to document retention regulations.

Once you have built a great foundation and products like the Muhimbi PDF Converter for SharePoint and PDF Converter Services (for Java, PHP, Ruby, C#, .NET), companies actually start using them, and naturally ask for more. Oh those pesky customers, how much we.. love.. them. One of our customers recently remarked that calendar entries don’t look particularly great when converted to PDF. “Calendar Entries????” We had not realised that people wanted to convert all kinds of MSG files, not just regular emails.

So, once again we sat down and improved our email converter in the following areas:

  • Converting Calendar entries & Meeting requests to PDF.
  • Conversion of embedded content (OLE) such as Excel sheets.
  • Compatibility for MSG files generated by third party (non-Outlook) libraries.
  • Bug fixes for internationalisation.

[Screenshot: Example of a converted Calendar entry]

 
Naturally the new Calendar converter will automatically benefit from all other facilities provided by our MSG & EML converter, including conversion of digitally signed files, conversion (and merging) of attachments, support for rich HTML & RTF content as well as support for many character sets and languages.

For more details about our Email converter see the original announcement.
