Automatically convert millions of emails, including all attachments, to PDF

Posted at: 15:07 on 25 March 2020 by Muhimbi

I was talking to one of our customers the other day about an interesting use case that turns out to be more common than I anticipated.

During our discussion it came to light that their regulatory body requires all communication - exchanged with customers - to be stored in a format suitable for long term archiving. In their case PDF/A. The problem is that doing this by hand is an impossible amount of work and difficult to enforce. This is made even more difficult by the fact that attachments need to be converted to PDF as well.

Guess what.... they need to do this for 100,000 emails per month! Doing this by hand is just not an option, which is why they went looking for a third-party solution.

There are a small number of solutions available in the market. A number of service providers and vendors of development libraries claim to be able to convert EML and MSG files to PDF, but few do this in a way that:

  • generates perfect looking PDFs;
  • supports emails written in a multitude of languages and character sets;
  • converts all attachments and merges them into a single PDF;
  • provides many ways to filter and configure these attachments;
  • takes care of rendering delivery receipts;
  • includes calendar entries and contact cards;
  • outputs PDFs in PDF/A-1b, PDF/A-2b and PDF/A-3b formats;
  • allows the process to be fully automated via workflow platforms or an API.

We are generally a modest bunch, but we truly believe we have the best email to PDF converter in the world. We know this because we searched for third-party libraries when we first implemented this facility. Nothing existed that was half decent, so we decided to build our own. Our team has spent an enormous amount of time on this facility, more than on any of our other converters, including our popular and comprehensive InfoPath converter. The results are clearly visible: this works very well.

PDF renditions of regular emails.


So, this customer was set a very difficult task. How did they end up solving it? Their in-house team built a simple solution using Java code in combination with the REST API exposed by our online service. Things just sit quietly in the background, beavering away 24x7 to generate PDFs out of emails.
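To give a feel for what such a background job does per email, here is a minimal sketch in Python (the customer used Java). It only builds the HTTP request and does not send it. The endpoint URL, the header name and the JSON field names are placeholders invented for illustration; they are not taken from this post, so check the REST API documentation for the real ones.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: a placeholder, not the real service URL.
CONVERT_URL = "https://example.invalid/convert"


def build_convert_request(file_name: str, content: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for an email to be converted to PDF."""
    body = {
        # All field names below are assumptions made for this sketch.
        "source_file_name": file_name,
        "source_file_content": base64.b64encode(content).decode("ascii"),
        "output_format": "PDF",
    }
    return urllib.request.Request(
        CONVERT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api_key": api_key},
        method="POST",
    )


# Usage: build a request for one email; a real job would loop over a mailbox or folder
# and pass each request to urllib.request.urlopen().
request = build_convert_request("message.eml", b"Subject: hello\r\n\r\nHi there", "my-key")
print(request.get_method(), request.full_url)
```

Encoding the file as Base64 inside a JSON body is one common way to ship binary content over REST; the actual service may expect something different.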

The REST API approach works well for them. We also support a SOAP API in combination with hosting our software on your own servers, SharePoint Online, SharePoint on-premise, Power Automate (Microsoft Flow), Azure Logic Apps, UiPath, Nintex Workflow, K2, C#, JavaScript, Python, PHP and anything else that is remotely modern.


PDF Rendition of a calendar entry, including embedded content


We could make up fancy ROI figures for this use case, but the fact is that the requirement was nearly impossible. Whatever figure we come up with is bound to be wrong by an order of magnitude. Let's just say it is working out very well for everyone involved.


Many of our customers are sitting on gigabytes of emails that need to be archived for eDiscovery, Freedom Of Information requests and SOX, SEC, FTS, FCC, EPA, NLRB, IRS, EEOC, OSH, OFCOM retention regulations. Being able to access these emails 10, 20 or even 40 years down the line, in a universally accepted format such as PDF (including PDF/A), is absolutely essential. Muhimbi’s range of PDF Conversion products makes this possible for all common file formats as well as some uncommon ones such as MSG, EML and even InfoPath.

If you have any questions or comments, leave a message below or contact our support desk. We love to help.



Merge Files to PDF using custom Merge Settings and Muhimbi's XML Override

Posted at: 18:17 on 12 March 2020 by Muhimbi

At the time of writing, Muhimbi's range of PDF Conversion and Document Manipulation servers and APIs have been in the market for nearly 12 years. It will come as no great surprise that during those 12 years we have received many questions from customers to implement all kinds of arcane features to suit their particular requirements.

When implementing feature requests, we have always applied one simple rule, which is that we are happy to implement new functionality providing it can be used by all our customers and is generic in nature.


Recently, a large international sports organisation approached us to replace their legacy on-premise system with our cloud-based service. Our software ticked most boxes, but some edge cases were identified for functionality that we did not support, specifically:

  1. Create PDF Bookmarks (and therefore a Table of Contents) based on MS-Word styles that are not defined as headings.
  2. Maintain the correct hierarchy of PDF Bookmarks for MS-Word files that don't start with a Heading 1.


Pretty esoteric stuff.... How can we expose niche functionality like this in our system, and user interfaces, without confusing thousands of users that have no interest in this functionality? Well, it turns out we have dealt with this before as we introduced the concept of XML Override to our Convert Document action all the way back in 2012. Using a bit of XML you can set or override almost any setting supported by our comprehensive object model.

So, we added an XML Override facility to our Merge action as well. At the time of writing this new facility is available in Power Automate (Flow) and in our REST based API. In a next release we'll add this to SharePoint Designer and Nintex Workflow actions as well. Naturally all this functionality is available natively on our SOAP API.


Let's take the following example, where we merge documents as normal, but with the following changes:

  1. Only apply the custom rules to MS-Word files that are being merged. To accomplish this we specify a regular expression in the filterValue attribute of the SourceFile element, which is matched against the field named in the filter attribute of the SourceFiles element (in this case the file extension).
  2. Only generate PDF Bookmarks for the first 3 Heading levels and ignore everything else. We achieve this by setting LowerBookmarkLevel to 3.
  3. Map a custom style named 'MyFakeHeadingStyle', which is not defined in MS-Word as a heading style, to heading level 2. We achieve this by defining it in the list of Bookmark Mappings.
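To make the intent of these three rules concrete, here is a small Python sketch that mimics them on plain data. It is only an illustration of the semantics as described above, not how the conversion service is implemented. The function names, the (style, text) input shape and the case-sensitive matching are assumptions made for this sketch.

```python
import re

# Rule 1: the same pattern as used in the filterValue attribute (without the "regex:" prefix).
EXTENSION_FILTER = re.compile(r"^docx$")

# Rule 2: bookmarks are generated down to this heading level only.
LOWER_BOOKMARK_LEVEL = 3

# Rule 3: the custom style is treated as a level 2 heading next to the built-in ones.
STYLE_TO_LEVEL = {
    "Heading 1": 1,
    "Heading 2": 2,
    "Heading 3": 3,
    "Heading 4": 4,
    "MyFakeHeadingStyle": 2,
}


def uses_custom_rules(file_name: str) -> bool:
    """Return True when the file extension matches the filter (case-sensitive in this sketch)."""
    extension = file_name.rsplit(".", 1)[-1] if "." in file_name else ""
    return EXTENSION_FILTER.search(extension) is not None


def bookmarks(paragraphs):
    """Turn (style_name, text) pairs into (level, text) bookmarks, applying rules 2 and 3."""
    result = []
    for style, text in paragraphs:
        level = STYLE_TO_LEVEL.get(style)
        if level is not None and level <= LOWER_BOOKMARK_LEVEL:
            result.append((level, text))
    return result


document = [
    ("Heading 1", "Introduction"),
    ("MyFakeHeadingStyle", "Background"),
    ("Normal", "Some body text"),
    ("Heading 4", "Too deep to be bookmarked"),
]
print(uses_custom_rules("report.docx"), uses_custom_rules("slides.pptx"))
print(bookmarks(document))
```

In the sketch, 'Background' becomes a level 2 bookmark even though its style is not a heading, while the Heading 4 entry is dropped. In the service itself none of this is coded by hand; the three rules are expressed declaratively.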


This results in the following XML.

        <SourceFiles filter="property:SourceFile.OpenOptions.FileExtension">
            <SourceFile filterValue="regex:^docx$">
                    <ConverterSpecificSettings type="ConverterSpecificSettings_WordProcessing">


We can take this XML and paste it in the 'Override settings' field of our Power Automate Merge documents action. A full example of 'iterating over multiple files and compiling a list of files to merge' is beyond the scope of this post. An example can be found here.

Merge XML Override


More details can be found in the Developer Guide. This does require some technical knowledge though.

If you get stuck, leave a comment below or contact our support desk. We love to help.


