
Programmatically Converting and Merging files attached to PDF Documents

Posted at: 1:54 PM on 17 April 2014 by Muhimbi

One of the cool things about having a comprehensive PDF conversion and processing platform such as the Muhimbi PDF Converter Services is that you can add relatively complex facilities you didn’t originally envision with relative ease.

When we originally set out we never thought of converting PDF files to PDF format. Why would anyone do that? Well, it turns out there are several good reasons, including converting PDF to PDF/A (or other PDF versions), changing PDF viewer preferences, or embedding or stripping fonts from a PDF. Starting with version 7.2.1 we are adding another scenario to the mix: the ability to convert files attached to PDF documents.

Similar to emails, a PDF document can have other files attached. Previously we simply ignored these files, but now we actively inspect PDF attachments and offer the option to convert them and merge them into the main PDF, which is ideal for archiving or printing purposes.

This new facility is accessible from our Web Services interface, see below, as well as SharePoint Designer and Nintex Workflows using our XML Override syntax. Conversion of PDF Attachments can globally be controlled using the Conversion Service’s configuration file by modifying the PDF.ConvertAttachments and PDF.ConvertAttachmentMode keys.

Please note that version 7.2.1 of the Muhimbi PDF Converter is needed in order to use this facility. It is currently (April 2014) in beta and available on request only.

ConverterSpecificSettings_PDF 
The syntax is simple. Create a new instance of ConverterSpecificSettings_PDF, set its properties to the appropriate values and assign it to ConversionSettings.ConverterSpecificSettings before kicking off the conversion operation. A brief code example that can easily be plugged into our standard sample code can be found below.

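// Enable conversion of files attached to the source PDF
ConverterSpecificSettings_PDF csc = new ConverterSpecificSettings_PDF();
csc.ConvertAttachments = true;
// Remove only those attachments that are supported by the converter
csc.ConvertAttachmentMode = PDFConvertAttachmentMode.RemoveSupported;
// Pass the PDF specific settings along with the regular conversion settings
conversionSettings.ConverterSpecificSettings = csc;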

 
The syntax for Java, Ruby and PHP is similar, but the code needs to be adapted to syntax specific to those environments.

 

The possible values for ConvertAttachmentMode are as follows:

  • RemoveAll: When a PDF file is processed, all attachments will be converted and merged to the main PDF. All attachments will be removed from the PDF, including those of attachments for which the file type is not recognised by the converter.
  • RemoveSupported: When a PDF file is processed, all attachments will be converted and merged to the main PDF, but only those attachments that are supported by the converter are removed from the PDF. All other attachments remain present in the main file.

Naturally these values are only used when ConvertAttachments is set to True.
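To make the difference between the two modes concrete, here is a minimal Python sketch that models which attachments end up merged into the main PDF and which remain attached. It is a standalone illustration, not a call into the Conversion Service: the function name plan_attachments and the list of supported extensions are assumptions made up for this example, and the real converter decides for itself which file types it recognises.

```python
import os
from enum import Enum


class ConvertAttachmentMode(Enum):
    REMOVE_ALL = "RemoveAll"
    REMOVE_SUPPORTED = "RemoveSupported"


# Hypothetical list of convertible file types, for illustration only.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".msg", ".html"}


def plan_attachments(attachments, convert_attachments, mode,
                     supported=SUPPORTED_EXTENSIONS):
    """Return (merged, remaining) attachment names for a single PDF."""
    # With ConvertAttachments switched off the attachments are left alone.
    if not convert_attachments:
        return [], list(attachments)

    def is_supported(name):
        return os.path.splitext(name)[1].lower() in supported

    # Only attachments the converter recognises can be merged into the PDF.
    merged = [name for name in attachments if is_supported(name)]

    if mode is ConvertAttachmentMode.REMOVE_ALL:
        # Every attachment is removed, recognised or not.
        remaining = []
    else:
        # RemoveSupported: unrecognised attachments stay in the main file.
        remaining = [name for name in attachments if not is_supported(name)]
    return merged, remaining
```

Calling plan_attachments(["notes.docx", "model.xyz"], True, ConvertAttachmentMode.REMOVE_SUPPORTED) merges notes.docx and leaves model.xyz attached, whereas REMOVE_ALL would drop model.xyz from the result as well.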

 

As this behaviour is part of the PDF Conversion Service’s processing pipeline, this new facility can be used in combination with all Merging, Watermarking, OCR, PDF Encryption and PDF/A post processing facilities.

 

Any questions or feedback? Leave a comment in the section below or contact us, we love talking to our customers.

 



PDF Converter Services 7.2 - Extract text using OCR, MSG Improvements

Posted at: 5:55 PM on 09 April 2014 by Muhimbi


We are happy to announce version 7.2 of the popular Muhimbi PDF Converter Services. This new release further extends the OCR facility and MSG improvements introduced in the previous version and adds support for extracting text from bitmap based content and rendering of MSG based calendar entries.

A quick introduction for those not familiar with the product: The Muhimbi PDF Converter Services is an ‘on premises’ server based SDK that allows software developers to convert typical Office files to PDF format using a robust, scalable but friendly Web Services interface from Java, .NET, Ruby & PHP based solutions. It supports a large number of file types including MS-Office and ODF file formats as well as HTML, MSG (email), EML, AutoCAD and Image based files and is used by some of the largest organisations in the world for mission critical document conversions. In addition to converting documents the product ships with a sophisticated watermarking engine, PDF Splitting and Merging facilities, an OCR facility and the ability to secure PDF files. A separate SharePoint specific version is available as well.
 

  Example of a converted Calendar entry with an (OLE) embedded Excel sheet


In addition to the changes listed above, some of the main changes and additions in the new version are as follows. Each entry lists the issue ID, the area, the type of change and a description:

2100 Excel New Optionally scale Excel to page width & height
2059 HTML Fix System.ArgumentException: uri - string can not be empty
1996 HTML Improvement Reduce white space causing occasional extra empty PDF pages at end of file.
1802 Merging Fix Bookmark targets bottom of page
2093 Merging Fix "Unexpected token Unknown before 107448" while merging file
2078 Merging Fix Kernel Error while loading PDF
2073 Merging Fix System.IndexOutOfRangeException while merging
2074 Merging Fix System.NullReferenceException while merging
2075 Merging Fix System.NullReferenceException while merging
2076 Merging Fix Some HTML Converted files cannot be saved in Acrobat Pro after merging
2126 MSG Fix "System.InvalidOperationException: Stack empty" during conversion of 3rd party generated MSG files
2133 MSG Fix "Parameter is not valid" during conversion of 3rd party generated MSG files
2136 MSG Fix Content missing from converted MSG file
2106 MSG Fix Fixed MSG body for 3rd party generated MSG files
2116 MSG Fix Conversion of MSG files with an attached MSG that is signed
2124 MSG Fix "System.IndexOutOfRangeException" Converting German email
2125 MSG Fix Conversion of email never finishes
2105 MSG Fix "Invalid Compressed RTF header" during conversion of 3rd party generated emails
2090 MSG Fix Extra '}' in body text
2058 MSG Fix No bookmark generated for certain attachments
2056 MSG Fix ‘Sent date' not correct on some 3rd party generated emails
2057 MSG Fix Unicode converter issue (also with EML)
2088 MSG Improvement Add support for attendees to meeting invitations
2086 MSG Improvement Optionally throw error if embedded content is encountered that cannot be converted
2013 MSG Improvement From address shows LDAP path
2046 MSG Improvement Web Service support for MSGConverterFullFidelity.EmailAddressDisplayMode and FromEmailAddressDisplayMode
2087 MSG New Convert the visual representation of embedded objects
2068 MSG New Add support for the conversion of Calendar Entries
2050 MSG New Add config value to allow MSG attachments list to be displayed, even when attachments are disabled
2113 MSG/HTML Fix Rendering error in very long emails / HTML pages
2066 MSG/HTML Fix Sometimes content is truncated on systems running IE9, IE10 or IE11
2005 MSG/HTML Fix Fonts look weird in some emails
1786 OCR Fix Handle leak during OCR
2054 OCR Fix Some Mixed content (MS-Word files with scanned images) does not always OCR
1999 OCR Fix Arabic training data causes exception
1788 OCR Improvement Increase OCR Performance
2089 OCR Improvement Update Diagnostics tool to display OCRed text
2081 OCR Improvement In-line images are recognised but text is not placed on it correctly
1998 OCR Improvement Add support for Hebrew
2048 OCR New Support for extracting text from bitmap based content using OCR
2072 Other New Allow timeouts to be specified on web service call
2102 Watermarking Fix Chinese & Japanese fonts are not displayed in watermarks
2103 Watermarking Fix Watermarking some documents causes problem in Adobe Reader 9

 


As always, feel free to contact us using Twitter, our Blog, regular email or subscribe to our newsletter.

Download your free trial here (39MB).



PDF Converter for SharePoint 7.2 - OCR Workflow Activities, MSG Improvements

Posted at: 5:24 PM on by Muhimbi


The new features introduced with version 7.1 of the PDF Converter for SharePoint have proven to be popular with our customers. Today we are happy to announce version 7.2, which takes the existing features and elevates them to the next level while staying compatible with all SharePoint versions including SharePoint 2007, 2010 and 2013.

In addition to a number of bug fixes, the main new features are OCR Workflow Actions for SharePoint Designer and Nintex workflow, the ability to extract text from bitmap based content using OCR as well as further improvements to the MSG and EML based converters, specifically in the area of embedded (OLE) content and calendar entries.

 
For those not familiar with the product, the PDF Converter for SharePoint is a lightweight solution that allows end-users to merge, split, watermark, secure, OCR and convert common document types - including InfoPath, AutoCAD, MSG (email), MS-Office, HTML and images - to PDF as well as other formats from within SharePoint using a friendly user interface, workflows or a web service call without the need to install any client side software or Adobe Acrobat. It integrates at a deep level with SharePoint and leverages facilities such as the Audit log, Nintex Workflow, localisation, security and tracing. It runs on SharePoint 2007, 2010 & 2013 and is available in English, German, Dutch, French, Traditional Chinese and Japanese. For detailed information check out the product page.
 

 
Example of a converted Calendar entry with an (OLE) embedded Excel sheet


In addition to the changes listed above, some of the main changes and additions in the new version are as follows. Each entry lists the issue ID, the area, the type of change and a description:

2100 Excel New Optionally scale Excel to page width & height
2059 HTML Fix System.ArgumentException: uri - string can not be empty
1996 HTML Improvement Reduce white space causing occasional extra empty PDF pages at end of file.
1802 Merging Fix Bookmark targets bottom of page
2093 Merging Fix "Unexpected token Unknown before 107448" while merging file
2078 Merging Fix Kernel Error while loading PDF
2073 Merging Fix System.IndexOutOfRangeException while merging
2074 Merging Fix System.NullReferenceException while merging
2075 Merging Fix System.NullReferenceException while merging
2076 Merging Fix Some HTML Converted files cannot be saved in Acrobat Pro after merging
2126 MSG Fix "System.InvalidOperationException: Stack empty" during conversion of 3rd party generated MSG files
2133 MSG Fix "Parameter is not valid" during conversion of 3rd party generated MSG files
2136 MSG Fix Content missing from converted MSG file
2106 MSG Fix Fixed MSG body for 3rd party generated MSG files
2116 MSG Fix Conversion of MSG files with an attached MSG that is signed
2124 MSG Fix "System.IndexOutOfRangeException" Converting German email
2125 MSG Fix Conversion of email never finishes
2105 MSG Fix "Invalid Compressed RTF header" during conversion of 3rd party generated emails
2090 MSG Fix Extra '}' in body text
2058 MSG Fix No bookmark generated for certain attachments
2056 MSG Fix ‘Sent date' not correct on some 3rd party generated emails
2057 MSG Fix Unicode converter issue (also with EML)
2088 MSG Improvement Add support for attendees to meeting invitations
2086 MSG Improvement Optionally throw error if embedded content is encountered that cannot be converted
2013 MSG Improvement From address shows LDAP path
2046 MSG Improvement Web Service support for MSGConverterFullFidelity.EmailAddressDisplayMode and FromEmailAddressDisplayMode
2087 MSG New Convert the visual representation of embedded objects
2068 MSG New Add support for the conversion of Calendar Entries
2050 MSG New Add config value to allow MSG attachments list to be displayed, even when attachments are disabled
2113 MSG/HTML Fix Rendering error in very long emails / HTML pages
2066 MSG/HTML Fix Sometimes content is truncated on systems running IE9, IE10 or IE11
2005 MSG/HTML Fix Fonts look weird in some emails
1786 OCR Fix Handle leak during OCR
2054 OCR Fix Some Mixed content (MS-Word files with scanned images) does not always OCR
1999 OCR Fix Arabic training data causes exception
1788 OCR Improvement Increase OCR Performance
2089 OCR Improvement Update Diagnostics tool to display OCRed text
2081 OCR Improvement In-line images are recognised but text is not placed on it correctly
1998 OCR Improvement Add support for Hebrew
1975 OCR New SharePoint Designer OCR Workflow Activity for generating searchable PDFs
1975 OCR New SharePoint Designer OCR Workflow Activity for extracting text from bitmaps
1976 OCR New Nintex Workflow OCR Activity for generating searchable PDFs
1976 OCR New Nintex Workflow OCR Workflow Activity for extracting text from bitmaps
2048 OCR New Support for extracting text from bitmap based content using OCR
2072 Other New Allow timeouts to be specified on web service call
2102 Watermarking Fix Chinese & Japanese fonts are not displayed in watermarks
2103 Watermarking Fix Watermarking some documents causes problem in Adobe Reader 9
2049 Watermarking New Add support for USER_NAME in addition to the existing REMOTE_USER and LOGON_USER in watermarks




As always, feel free to contact us using Twitter, our Blog, regular email or subscribe to our newsletter.

Download your free trial here (46MB).



Get Outlook Mail in and out of SharePoint and Convert it to PDF - The Easy Way

Posted at: 3:53 PM on 03 April 2014 by David Radford

At Muhimbi we’re always looking to add great new features and functionality to our products. At the same time, we need to be careful not to start throwing features in just because they’re cool: a full featured product is great, a schizophrenic one isn’t.

A good example of a feature that doesn’t belong in a conversion product is the transferring of documents in and out of the SharePoint environment. There are so many ways to implement this, that it’d really be its own completely separate product… And it is! Mail2Share from Techtra provides a clean interface to SharePoint from within Outlook, allowing easy SharePoint adoption and integration from the familiar Outlook workspace.

The reasons for storing e-mails as PDFs are not always obvious, but with corporate Document Management Strategies becoming more complex and file formats always changing, there are some clear advantages to this:

  • PDF, particularly PDF/A, is the ideal file format for long term archiving.
  • PDF files can be viewed with a high level of fidelity on mobile devices. For example, if you received an AutoCAD file (dxf, dwg) and want to preview or share it with users that do not have an AutoCAD preview handler, with Mail2Share, you would be able to send the attachment to the configured mount point and receive back a converted version in PDF format.
  • The Muhimbi PDF Converter does not require the installation of a PDF writer on the local machine.

So, how can our PDF Converter and Mail2Share help with all this? Well, as it turns out, quite easily! Combining the PDF Converter’s outstanding e-mail conversion, workflow integration and watermarking features with Mail2Share’s innovative Outlook connectivity turns complex scenarios like the following into a few simple steps.

 

The scenario:

You have a number of regional offices that need to send in their current sales forecasts: an Excel spreadsheet with the details and a written summary of the reasoning behind them. When these e-mails arrive, they need to be redirected to various internal groups, but they’re also sensitive, so access needs to be tracked and restricted. How can this all be managed easily in a central manner? How can we do this while also having a single file to move around that contains both the e-mail AND the attachment?


The steps:

  1. Install and configure The Muhimbi PDF Converter for SharePoint following the installation instructions from Chapter 2 of the Administration Guide.
     
  2. Install the Techtra Mail2Share desktop application (there is no server side component to worry about). To configure Mail2Share, just choose your SharePoint server, select the site you want to add libraries from, and then add the libraries you want to see in Outlook (and have SharePoint rights to) and you’re good to go.  
     
     
  3. Once that is done, you will have some additional folders in Outlook. To move an e-mail from Outlook to SharePoint, simply drag and drop the selected e-mail to the library you want from the list.

     


  4. Once the e-mail arrives in the ‘Incoming Sales Projections’ library, it gets picked up by a simple SharePoint workflow that uses our conversion action and is set to run when new files are created in the library. The workflow converts both the body of the e-mail with the reasoning AND the Excel attachment with the details into an easy to manage PDF and then copies it to a different Library.

     

  5. The Library the newly created PDF is sent to has our Watermark on Open feature enabled (you might also want to add our PDF Security on Open as well, or instead of this). This watermarks the PDF with the date, location, and username of the person opening the PDF. In this case we have added the following text to the bottom left of every PDF every time it is opened in SharePoint or through Mail2Share.
     


  6. The library is available to specific users in Outlook, using Mail2Share, based on their SharePoint rights.  In this case, the user only has rights to the ‘Outgoing Sales Projections’ library. The user then browses it like any other folder, selects the e-mail and XLS combined PDF, previews it if required, and then simply right clicks to send it as an attachment. The act of downloading it from SharePoint watermarks the PDF in the background and is seamless to the user.
     


  7. Ah, no- that’s it- you’re done!

 

You can now easily convert e-mails (with attachments!) that need to be shared into PDF and make them available in a central location. Instead of just restricting access, you can track where and when a specific copy of that PDF was created and by whom. All automatically, without users needing to navigate into SharePoint or do anything more than drag-and-drop!

This is just a small sample of how Muhimbi’s PDF Converter for SharePoint and Techtra’s Mail2Share applications can work together to facilitate sharing and collaboration for users, while also becoming valuable tools for corporate Document Management Strategies.

 

 


Extract text from scanned content using OCR and Nintex Workflow

Posted at: 2:53 PM on 31 March 2014 by Muhimbi

With the release of version 7.1 of the PDF Converter for SharePoint we added a fundamental new technology to our Document Conversion and Manipulation platform, Optical Character Recognition (OCR). That initial release was able to process scanned / bitmap based content and generate fully searchable PDFs.

With the introduction of version 7.2 we are adding support for a new OCR related use case, which is the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that all use a common template or layout. For example, if a reference number can always be found at the top right corner of scanned documents then that text can be extracted and stored in a SharePoint column from where it can be included in searches or be used in further workflow steps… pretty powerful stuff.

This post describes the Nintex Workflow Activity. The SharePoint Designer equivalent can be found here.

For more details, including an introduction, see these related blog posts.

 

Once the Muhimbi PDF Converter for SharePoint is installed, and the Nintex Workflow Integration has been activated, a number of new activities will be added automatically to the list, including the new Extract text using OCR activity. It is compatible with Nintex Workflow 2007, 2010 & 2013 and this is what it looks like.
 



Building a full example workflow is out of the scope of this post as it is relatively easy. For details see our generic PDF Conversion for Nintex Workflow example.

The fields supported by this Workflow Activity are as follows:

  1. Language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  2. Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  3. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  4. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  5. Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, then please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single – unified - UOM is used across all file formats. If you are not sure how internal conversion affects the dimensions of your image or scan then use our software to convert the file to PDF and open it in a PDF reader.
  6. Page Number: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
  7. Output Text: The recognised text will be stored in this variable (type String).
  8. Source List ID & List Item: The item that triggered the workflow is processed by default. You can optionally specify the ID of a different List and List Item using workflow variables. Please use data type String for the List ID workflow variable. For the Item ID use type Item ID (in SharePoint 2007) or Integer (in SharePoint 2010 / 2013).
  9. Error Handling: Similar to the way some of Nintex’ own Workflow Activities allow errors to be captured and evaluated by subsequent actions, all of Muhimbi’s Workflow Activities allow the same. By default this facility is disabled meaning that any error terminates the workflow.
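Because the Region field expects values in units of 1/72nd of an inch, it can help to do the conversion from inches or millimetres up front. The Python sketch below shows the arithmetic only. The helper names (inches_to_points, mm_to_points, make_region) are hypothetical and not part of the product, and the sketch makes no assumption about where the origin of the page lies, so check the position of your region against a converted PDF as described above.

```python
POINTS_PER_INCH = 72   # the Region unit of measure is 1/72nd of an inch
MM_PER_INCH = 25.4


def inches_to_points(inches):
    """Convert inches to 1/72nd inch units."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm):
    """Convert millimetres to 1/72nd inch units."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def make_region(x, y, width, height, unit="pt"):
    """Return the four Region values, converted to 1/72nd inch units."""
    converters = {
        "pt": lambda value: value,
        "in": inches_to_points,
        "mm": mm_to_points,
    }
    convert = converters[unit]
    values = {"x": x, "y": y, "width": width, "height": height}
    # Two decimals is plenty of precision for positioning a text region.
    return {name: round(convert(value), 2) for name, value in values.items()}
```

For example, make_region(1, 0.5, 2, 1, unit="in") describes a region that starts 1 inch across and half an inch along the page and measures 2 by 1 inches, expressed as 72, 36, 144 and 72 units.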

 

For more details about using the PDF Converter for SharePoint in combination with Nintex Workflow see this Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.



Extract text from scanned content using OCR and SharePoint Designer Workflows

Posted at: 6:00 PM on 28 March 2014 by Muhimbi

With the release of version 7.1 of the PDF Converter for SharePoint we added a fundamental new technology to our Document Conversion and Manipulation platform, Optical Character Recognition (OCR). That initial release was able to process scanned / bitmap based content and generate fully searchable PDFs.

With the introduction of version 7.2 we are adding support for a new OCR related use case, which is the ability to recognise text on (part of) a page and return the actual text (not a bitmap) to the workflow for further processing. A common use for this functionality is to extract a particular area of text from documents that all use a common template or layout. For example, if a reference number can always be found at the top right corner of scanned documents then that text can be extracted and stored in a SharePoint column from where it can be included in searches or be used in further workflow steps… pretty powerful stuff.

This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.

For more details, including an introduction, see these related blog posts.

 

Once the Muhimbi PDF Converter for SharePoint is installed you will find a number of new Workflow Activities in SharePoint Designer. One of these activities is named Extract Text using OCR and looks as follows. 

 
In typical Muhimbi fashion the workflow sentence is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.

  1. this document: The source document to OCR and extract text from. For most workflows selecting Current Item will suffice, but some scenarios may require the look up of a different item. 
  2. OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  3. OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  4. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  5. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  6. Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, then please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single – unified - UOM is used across all file formats. If you are not sure how internal conversion affects the dimensions of your image or scan then use our software to convert the file to PDF and open it in a PDF reader.
  7. Page: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
  8. Result: The recognised text will be stored in this variable (type String).
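When the extracted text is expected to follow a fixed pattern, such as a numeric reference number, it can be worth checking the Result value before it is stored in a column. The Python sketch below illustrates the idea behind a whitelist by listing any characters that fall outside it. The helper names are hypothetical and the code is not part of the OCR engine. It merely shows the kind of confusion (the letter O instead of a zero) that whitelisting 1234567890 is meant to prevent.

```python
def unexpected_characters(text, whitelist):
    """Return the sorted, unique characters in text that are not whitelisted.

    Whitespace is ignored because line breaks and spaces between
    words are a normal part of recognised text.
    """
    allowed = set(whitelist)
    return sorted({ch for ch in text
                   if ch not in allowed and not ch.isspace()})


def matches_whitelist(text, whitelist):
    """True when every non-whitespace character is on the whitelist."""
    return not unexpected_characters(text, whitelist)
```

A value such as "2O14-04" checked against the digits whitelist reports the letter O and the dash, which is a hint that the text was recognised without the whitelist or that the region is slightly off.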

 

Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.



Converting scans and images to searchable PDFs using OCR & Nintex Workflow

Posted at: 3:37 PM on by Muhimbi

Although it had been years in the planning, we didn’t really make a big deal out of the support for Optical Character Recognition (OCR) when we shipped it as part of version 7.1 of the PDF Converter for SharePoint. We did this for a good reason as – although the underpinnings were working well – the actual integration point with Nintex Workflow wasn’t as nice as we wanted it to be.

With the release of version 7.2 we are adding two new Workflow Activities to both Nintex Workflow and SharePoint Designer. The first activity, described in this post, can be used to convert scanned content into fully searchable PDFs. A separate post will detail the other new OCR Activity, which can extract text from scanned content. For a high level overview of our OCR facilities please read the original announcement.

This post describes the Nintex Workflow version of the Workflow Activity. The SharePoint Designer equivalent can be found here.

 

Optical Character Recognition… sounds quite complex, what would you need that for? Well, most organisations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, invoices or legal documents are scanned and filed away in a file system / SharePoint library or other Document Management System. The problem is that this is ‘dead information’ that cannot be searched or indexed using traditional technology. Content is stored as one big image which cannot be indexed by search crawlers and, as a result, does not show up in search results.

This is where OCR comes in. OCR analyses image based content – e.g. a scanned PDF or an image embedded in an MS-Word file – applies some fancy recognition logic and then embeds the result in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers are clever enough to index this text as well. Confused? Have a look at the screenshot below.

Scanned Document with OCRed text selected

 
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which - although powerful - is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. It is compatible with Nintex Workflow 2007, 2010 & 2013 and this is what it looks like.
 

 
Building a full example workflow is out of the scope of this post as it is relatively easy. For details see our generic PDF Conversion for Nintex Workflow example.

The fields supported by this Workflow Activity are as follows:

  1. Destination Path: The location to write the generated file to. Leave this field empty to use the same location as the source file. For details about how to specify paths to different libraries / site collections see this blog post.
  2. Output File Name: The name of the generated file. Leave this field empty to use the same name as the source file. Please note that if your source file is already in PDF format, and the Destination Path is the same as the Source Path, then leaving this field empty will overwrite it. 
  3. Meta data: Control if the source file’s SharePoint meta-data is copied to the destination file.
  4. Language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  5. Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  6. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  7. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  8. Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.
  9. Output List ID: If you wish to carry out further actions on the generated file, e.g. send it by email or perform a check-in, then you can optionally store the ID of the List the file was written to in a workflow variable of type String.
  10. PDF List Item ID: Similarly to Output List ID, the Item ID of the generated file can optionally be stored in a workflow variable of type Item ID (in SharePoint 2007) or Integer (in SharePoint 2010 / 2013).
  11. Source List ID & List Item: The item that triggered the workflow is processed by default. You can optionally specify the ID of a different List and List Item using workflow variables. Please use the same data types as used by Output List ID and PDF List Item ID.
  12. Error Handling: Similar to the way some of Nintex’ own Workflow Activities allow errors to be captured and evaluated by subsequent actions, all of Muhimbi’s Workflow Activities allow the same. By default this facility is disabled meaning that any error terminates the workflow.
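
To make the Whitelist / Blacklist option more concrete, here is a minimal Python sketch of the idea. It is an illustrative model only, not the Muhimbi OCR engine: the function name pick_characters, the candidate scores and the sample data are all made up for this example. It shows how restricting the permitted characters changes which candidate wins when a zero and a letter O look alike.

```python
# Illustrative model only: this is NOT how the Muhimbi OCR engine is implemented.
# Each scanned glyph is modelled as a dict of candidate characters and the
# (hypothetical) confidence a recogniser might assign to each candidate.


def pick_characters(glyph_candidates, whitelist=None, blacklist=""):
    """Return the best permitted character for every glyph.

    whitelist: when not None, only these characters may be chosen.
    blacklist: characters that may never be chosen.
    """
    result = []
    for candidates in glyph_candidates:
        allowed = {
            ch: score
            for ch, score in candidates.items()
            if (whitelist is None or ch in whitelist) and ch not in blacklist
        }
        if not allowed:
            # Nothing permitted for this glyph: emit a placeholder.
            result.append("?")
            continue
        # Choose the permitted candidate with the highest confidence.
        result.append(max(allowed, key=allowed.get))
    return "".join(result)


# A scanned number "2014" in which the zero looks a lot like the letter O.
scanned = [
    {"2": 0.95, "Z": 0.40},
    {"O": 0.55, "0": 0.52, "o": 0.30},
    {"1": 0.80, "l": 0.78, "I": 0.60},
    {"4": 0.90, "A": 0.35},
]

unrestricted = pick_characters(scanned)                         # no restrictions
digits_only = pick_characters(scanned, whitelist="1234567890")  # whitelist digits
no_letter_o = pick_characters(scanned, blacklist="oO")          # blacklist o / O

print(unrestricted, digits_only, no_letter_o)
```

In this toy model the unrestricted run picks the letter O because it scores slightly higher than the zero, while both the digit whitelist and the o / O blacklist force the zero to be chosen.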

 

For more details about using the PDF Converter for SharePoint in combination with Nintex Workflow see this Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Converting scans and images to searchable PDFs using SharePoint Designer Workflows

Posted at: 1:52 PM on by Muhimbi

Although it had been years in the planning, we didn’t really make a big deal out of the support for Optical Character Recognition (OCR) when we shipped it as part of version 7.1 of the PDF Converter for SharePoint. We did this for a good reason as – although the underpinnings were working well – the actual integration points with SharePoint, specifically SharePoint Designer Workflows, weren’t as nice as we wanted them to be.

With the release of version 7.2 we are adding two new Workflow Activities to both Nintex Workflow and SharePoint Designer. The first activity, described in this post, can be used to convert scanned content into fully searchable PDFs. A separate post will detail the other new OCR Activity, which can extract text from scanned content. For a high level overview of our OCR facilities please read the original announcement.

This post describes the SharePoint Designer Workflow Activity. The Nintex Workflow equivalent can be found here.

 

Optical Character Recognition… sounds quite complex, what would you need that for? Well, most organisations deal with scanned (or other bitmap based) content on a regular basis. Faxes are received in a digital inbox, invoices or legal documents are scanned and filed away in a file system / SharePoint library or other Document Management System. The problem is that this is ‘dead information’ that cannot be searched or indexed using traditional technology. Content is stored as one big image which cannot be indexed by search crawlers and, as a result, does not show up in search results.

This is where OCR comes in. OCR analyses image based content – e.g. a scanned PDF or an image embedded in an MS-Word file – applies some fancy recognition logic and then embeds the result in a PDF. The scanned content still looks the same, but you can now copy text from the document and search crawlers are clever enough to index this text as well. Confused? Have a look at the screenshot below.
 

Scanned Document with OCRed text selected

 
It is possible to carry out OCR using our standard Convert Document workflow activity, but that requires knowledge of our XML syntax, which - although powerful - is less than user friendly. To make life easier we have created a separate Workflow Activity named Convert to OCRed PDF. This is what it looks like.
 

OCR-PDF-Workflow-Activity

 
In typical Muhimbi fashion the workflow sentence is consistent with our other Workflow Activities (e.g. Converting / Watermarking / Merging / Securing) and largely self-describing.

  1. this document: The source document to Convert and OCR. For most workflows selecting Current Item will suffice, but some scenarios may require the look up of a different item. 
  2. this file: The name and location to write the generated file to. Leave this field empty to use the same location and name as the source file. Please note that if your source file is already in PDF format then leaving this field empty will overwrite it. For details about how to specify paths to different libraries / site collections see this blog post.
  3. include / exclude meta data: Control if the source file’s SharePoint meta-data is copied to the destination file.
  4. OCR language: The language the source document is written in. It defaults to English, but we currently (version 7.2) support Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  5. OCR Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default Slow but accurate setting.
  6. Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by whitelisting 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  7. Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
  8. Regions: By default the entire page is OCRed. To limit OCR to certain parts of a page, e.g. a header and/or footer, you can specify one or more regions using our XML syntax. Have a look at this blog post, but only use the part that starts with (and includes) <Regions>…</Regions>.
  9. List ID: The ID of the list the processed file was written to. This can later in the workflow be used to perform additional tasks on the file such as a check-in or out.
  10. Item ID: The ID of the processed file. This can be used together with the List ID to perform additional tasks on the file later in the workflow.
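
The Regions option above limits OCR to parts of a page such as a header and/or footer. Working out the rectangles for such regions is simple arithmetic, sketched below in Python. Please treat this as a hypothetical helper: the Region type, the top-left origin, the use of points as units and the 10% band size are assumptions made for illustration, and the resulting values would still need to be expressed in the XML syntax described in the referenced blog post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle on a page (hypothetical type, origin assumed at top-left)."""
    x: float
    y: float
    width: float
    height: float


def header_footer_regions(page_width, page_height, band_fraction=0.1):
    """Return (header, footer) rectangles spanning the full page width.

    band_fraction is the share of the page height given to each band.
    """
    if not 0 < band_fraction <= 0.5:
        raise ValueError("band_fraction must be greater than 0 and at most 0.5")
    band = page_height * band_fraction
    header = Region(0, 0, page_width, band)
    footer = Region(0, page_height - band, page_width, band)
    return header, footer


# An A4 page measured in points (595 x 842), with 10% bands at top and bottom.
header, footer = header_footer_regions(595, 842, band_fraction=0.1)
print(header)
print(footer)
```

Restricting OCR to small bands like these means less of the page has to be analysed, which is the reason for limiting regions in the first place.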

 

Although creating simple workflows in SharePoint Designer is relatively easy, there is a first time for everything. If the concept of SharePoint Designer workflows is new to you then have a look at our simple Getting Started Knowledge Base article.

Please note that the PDF Converter Professional add-on license is needed in order to use OCR in your production environment.

Any questions or comments? Leave a message below or contact us.


Converting MSG based Calendar entries to PDF including embedded content

Posted at: 5:38 PM on 27 March 2014 by Muhimbi

When we first started investigating the conversion of email based files (MSG, EML) we had a good look at the market to see if there was anything ‘out there’ that we could license. Although there were some development libraries on the market that provided part of the solution, to our surprise there was no single capable MSG to PDF Converter on the market, at least not one that did an acceptable job (we have very high standards).

So we did what we always do, rolled up our sleeves and spent the usual cycle of building, testing, listening to our customers and improving some more. Rinse and repeat for a couple of years resulting in the most capable, flexible and - dare I say it – prettiest Email to PDF Converter on the market. It wasn’t easy, believe me, but it works great and happily churns through millions of emails, especially in organisations subject to document retention regulations.

Once you have built a great foundation and products like the Muhimbi PDF Converter for SharePoint and PDF Converter Services (for Java, PHP, Ruby, C#, .NET), companies actually start using them, and naturally ask for more. Oh those pesky customers, how much we.. love.. them. One of our customers recently remarked that calendar entries don’t look particularly great when converted to PDF. “Calendar Entries????” We had not realised that people wanted to convert all kinds of MSG files, not just regular emails.

So, once again we sat down and improved our email converter in the following areas:

  • Converting Calendar entries & Meeting requests to PDF.
  • Conversion of embedded content (OLE) such as Excel sheets.
  • Compatibility for MSG files generated by third party (non-Outlook) libraries.
  • Bug fixes for internationalisation.

Example of a converted Calendar entry.

 
Naturally the new Calendar converter will automatically benefit from all other facilities provided by our MSG & EML converter, including conversion of digitally signed files, conversion (and merging) of attachments, support for rich HTML & RTF content as well as support for many character sets and languages.

For more details about our Email converter see the original announcement.


PDF Converter Services 7.1 - Optical Character Recognition, EML and MSG overhaul

Posted at: 1:17 PM on 13 November 2013 by Muhimbi


We are happy to announce version 7.1 of our popular Muhimbi PDF Converter Services and PDF Converter Professional. The main new features are support for OCR (Optical Character Recognition) to convert scanned documents into fully searchable and indexable PDF files, and a completely overhauled converter for the EML (email) format that should really benefit those organisations that don’t use MS-Outlook’s MSG format to store email.

 
A quick introduction for those not familiar with the product: The Muhimbi PDF Converter Services is an ‘on premises’ server based SDK that allows software developers to convert typical Office files to PDF format using a robust, scalable but friendly Web Services interface from Java, .NET, Ruby & PHP based solutions. It supports a large number of file types including MS-Office and ODF file formats as well as HTML, MSG (email), EML, AutoCAD and Image based files and is used by some of the largest organisations in the world for mission critical document conversions. In addition to converting documents the product ships with a sophisticated watermarking engine, PDF Splitting and Merging facilities, an OCR facility and the ability to secure PDF files. A separate SharePoint specific version is available as well.
 

Scanned Document with OCRed text selected


In addition to the changes listed above, some of the main changes and additions in the new version are as follows:

ID | Area | Type | Description
1901 | CAD | Fix | CAD Conversion - AccessViolationException
1931 | CAD | Improvement | CAD Converter does not resolve externally referenced files
1850 | CAD | Improvement | Add support for AutoCAD 2013
1916 | Conversion | Fix | TIFF to PDF Conversion uses dimensions of first page for all pages
1853 | Conversion | Fix | Post processing PDF generated from TIF as 'Screen Optimised' scrambles PDF
676 | Conversion | Improvement | Excel Conversion - Add support for PDF/A
1930 | Cross-conversion | Fix | Folder with Temp files cannot be deleted when converting DOC to HTML for some locales / regions
1879 | EML | New | Implement conversion of RFC2045 / RFC5322 based EML files
1965 | HTML | Fix | HTML Converter hangs on 0.5 page margin
1920 | HTML | Fix | Not all URLs are recognised by HTML Converter
1827 | HTML | Fix | HTML to PDF Conversion for some non-Roman languages lose characters
1840 | HTML | Fix | Last line is truncated when converting HTML to PDF
1953 | HTML | Improvement | Mixed fonts in same sentence are vertically offset when converting HTML to PDF
1940 | HTML | Improvement | HTML Conversion doesn't convert unencoded quotes
1884 | HTML | Improvement | Add configurable delay to HTML to PDF conversion for pages heavy on JavaScript / DHTML (e.g. pages containing Google Maps)
2009 | InfoPath | Improvement | Fix InfoPath forms colour being lost on IE10 systems
1939 | InfoPath | Improvement | InfoPath does not export to PDF well on systems with IE10
2010 | Merging | Fix | System.NullReferenceException when saving merged file
2012 | Merging | Fix | Internal hyperlinks are broken when merging documents
1990 | Merging | Fix | Unexpected token DictionaryEnd while merging
1982 | Merging | Fix | System.IndexOutOfRangeException: Index was outside the bounds of the array. while merging PDF
1984 | Merging | Fix | Bookmark targets bottom of page
1968 | Merging | Fix | Nullreference error in PdfLoadedFormFieldCollection.GetFieldType while merging
1978 | Merging | Fix | Error in 'PdfLoadedPageCollection.GetPage' while merging file
1967 | Merging | Fix | Blank pages while merging
1943 | Merging | Fix | Fatal Error at 9670 while merging
1935 | Merging | Fix | Merged file is empty when merging large bitmapped PDFs
1895 | Merging | Fix | Fatal Error when merging
1892 | Merging | Fix | System.NullReferenceException when merging
2007 | MSG | Fix | MSG - Unexpected line break using plain text conversion
2014 | MSG | Fix | MSG - Unicode / character encoding problem in HTML email
2006 | MSG | Fix | MSG - Hyperlink breaks during conversion
1958 | MSG | Fix | MSG - System.Exception: compressed-RTF CRC32 failed
1959 | MSG | Fix | MSG/EML Converter - Last line is missing from some converted emails
1925 | MSG | Fix | MSG to PDF - Plain text email carriage return handling is incorrect
1913 | MSG | Fix | MSG to PDF - RTF HTML MSG - incorrectly converted accents / diacritics
1914 | MSG | Fix | MSG to PDF - RTF HTML MSG - RTL languages not converted in correct order
1904 | MSG | Fix | MSG to PDF - Sometimes Attachment is not processed
1911 | MSG | Fix | MSG to PDF - Possible regression on in-line images
1912 | MSG | Fix | MSG to PDF - RTF HTML MSG - Azerbaijani, Maltese - some unicode characters not converted, left as \uXXXX
1899 | MSG | Fix | MSG to PDF - German special characters are sometimes not properly converted
1882 | MSG | Fix | MSG to PDF - RTF email is missing portion of first line in body text
1885 | MSG | Fix | MSG to PDF - Handle and Memory leak when converting signed MSG files
1862 | MSG | Fix | MSG to PDF - Incorrect font
1863 | MSG | Fix | MSG to PDF - Numbered list items not rendered
1601 | MSG | Improvement | MSG to PDF - Improve line spacing in HTML to PDF Conversion
1660 | MSG | Improvement | MSG to PDF - Test / Implement remaining languages
1917 | MSG | Improvement | MSG to PDF - RTF HTML MSG - some languages causing small fonts
1903 | MSG | Improvement | MSG to PDF - Implement Best Body Algorithm from MS-OXBBODY specification
1881 | MSG | Improvement | MSG to PDF - Text opaque signed MIME messages lose formatting
2015 | MSG | New | MSG to PDF - Include email address in 'To' field
995 | OCR | New | OCR - Add support for OCR of PDF data to allow searchable PDFs
1985 | Other | Fix | Cannot set PDF Creator / Processor meta data for some files
1972 | Other | Fix | Loading a PDF 1.7 document into a PDFDocument resets it to PDF 1.5
1952 | Other | Fix | Certain PDFs do not permit viewerpreferences to be read
1906 | Other | Fix | Occasional Access Denied in Task Monitor on Win2K12 / InfoPath 2015
1799 | Other | Improvement | Upgrade to .net 3.5
2061 | Pro | Fix | Converting between PDF Versions on a locale that uses ',' as a decimal separator sets the PDF Version to 1.1
1945 | Pro | Fix | PDF/A conversion - The DateTime represented by the string is not supported in calendar System.Globalization.GregorianCalendar.
1922 | Pro | Fix | Re-processing existing PDF/A files for PDF/A output fails
1909 | Pro | Fix | PDF/A Conversion fails when certain characters occur in the PDF Title
1888 | Pro | Fix | Improve reliability of PDF/A2b conversions
1849 | Pro | Fix | Linearization in combination with PDF/A fails
1979 | Pro | Improvement | Always post process for PDFA when _outputFormatSpecificSettings.PostProcessFile == true
1843 | Pro | Improvement | Allow transparent content in PDF/A2b documents
1974 | Security | Fix | When security is removed from PDF files its contents still shows as encrypted

 

As always, feel free to contact us using Twitter, our Blog, regular email or subscribe to our newsletter.

Download your free trial here (37MB).
