Tuesday, September 13, 2016

Move HTML table data from email to database using Pentaho and jsoup

In an earlier post, we extracted data from a data table and then emailed it in a tabular format using HTML.

How about going the other direction, where we are receiving emails that contain data in a tabular format, and we need to get that data into a database?

When I began pulling on this thread, I started here, on the Pentaho forum. To implement this solution you will need to download jsoup, as mentioned in this post on the forum. After downloading jsoup, you'll need to tell Pentaho how to find the jar file, as referenced here (thanks, Reeshu!).

As with most things Pentaho, it's a bit fiddly but it works.

I'll be using the emails that I sent to myself in a previous post as a source of data.
The emails contain simple tables of data about the American Great Lakes.

We'll grab the data from these emails in a Pentaho transformation with four steps. 

Get Emails

Since this is a Transformation and not a Job, I'll be using the Email Messages Input step from the Transformation design tab. The Get Mails step available inside a Job has a few more options, like what to do with the emails after you read them. Check them both out!

You'll have to enter the settings for your email on the first tab, and then you can select the folder from which to get the emails and other options. On the Filters tab you can filter emails by sender, date, and so on.

This step retrieves 20 different fields about each message, but you can pick the ones you want. For this transformation I'm only going to use Received Date and Body.

Convert to XML

This step is where jsoup comes in. If you were able to download jsoup and point Pentaho at the external jar file, this should be a piece of cake.

The output will be a field called xhtml, which you will use in the next step.

Get data from XML

The input for this step is the xhtml field from the previous step.

Getting your output the way you want it might take the most fiddling, depending on the format of the table that you're trying to parse. Mine was pretty simple.
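
If it helps to see the idea outside of Pentaho, here is a minimal Java sketch of what a "loop over the rows, read each cell" XPath setup does to an xhtml string. This is only an illustration under some assumptions, not the configuration of the Pentaho step: the class name XhtmlTableReader, the method extractRows, the XPath expressions, and the sample values in main are all made up for this example, and it assumes the xhtml is well-formed, has no XML namespace, and has no DOCTYPE that would need to be fetched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Pentaho or jsoup.
final class XhtmlTableReader {

    // Returns one String[] per <tr> that has <td> cells.
    // Header rows that only contain <th> cells are skipped.
    static List<String[]> extractRows(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // "Loop" path: every table row that contains at least one data cell.
        NodeList rows = (NodeList) xpath.evaluate(
                "//table//tr[td]", doc, XPathConstants.NODESET);

        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            // Field paths are relative to the current row.
            NodeList cells = (NodeList) xpath.evaluate(
                    "td", rows.item(i), XPathConstants.NODESET);
            String[] values = new String[cells.getLength()];
            for (int j = 0; j < cells.getLength(); j++) {
                values[j] = cells.item(j).getTextContent().trim();
            }
            result.add(values);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample table; the values are placeholders, not real data.
        String xhtml = "<html><body><table>"
                + "<tr><th>Lake</th><th>Value</th></tr>"
                + "<tr><td>Superior</td><td>1</td></tr>"
                + "<tr><td>Erie</td><td>2</td></tr>"
                + "</table></body></html>";
        for (String[] row : extractRows(xhtml)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

If your table is laid out differently (nested tables, cells in th instead of td, and so on), the two XPath expressions are the part you would most likely need to adjust.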

Select Values

If you were doing this for real, you'd probably use something other than a Select step. But here is our data, from the five emails that I sent myself last week. It's ready to be imported into a database or spreadsheet, depending on your needs.

