Thursday, September 22, 2016

Schedule a Pentaho job to run automatically

Got a job that needs to run every morning at the same time, or even every half hour? Pentaho Data Integration Server does have a built-in scheduling view, but if you are using Pentaho CE you may end up using Windows Task Scheduler to call a .bat file that will in turn call kitchen.bat to run your .kjb file. Clear as mud? Here we go ...

I have the world's simplest job called main.kjb. All it does is create a file called hello_world.txt, with an added timestamp before the file extension.

The batch file (.bat)

If I want to use Windows Task Scheduler to run this task, I need a very simple batch file. It needs to call kitchen.bat, and then point Kitchen at the job file that I want to run. It can also create a job log for me, with any level of verbosity you choose. Kitchen has good documentation at http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation.



If you've never run Kitchen from the command line on your machine before, you will probably have to add the path to your Path system environment variable. Search for these variables in the Settings menu. You'll need the full path to where your kitchen.bat lives.


This may look different depending on your version of Windows.

Once you've done this, you can test your .bat file by double-clicking it. This is a good check before setting it up in the task scheduler. 


The task scheduler

The scheduler can be found by searching for it on your machine. If you use the wizard to set up your schedule, you want to select the action of "Start a program". But the program you are starting isn't Spoon, it's the .bat file that you created. 

In the wizard, you can pick the time that the job will run. Elsewhere in the utility you can select fancier options, like having the task run every half hour between 9 am and 5 pm on weekdays.

Let's start a program...

Here's where the .bat file goes...


And here's how the task looks in the schedule list, along with its task siblings.



If you enable history logging, the task scheduler shows you a rough history of every action taken on your task.

The task scheduler also allows you to run the task on demand, if you would like. This means that you don't have to open your job file and risk inadvertently changing something.

Tuesday, September 13, 2016

Move html table data from email to database using Pentaho, jsoup

In an earlier post, we extracted data from a data table and then emailed it in a tabular format using HTML.

How about going the other direction, where we are receiving emails that contain data in a tabular format, and we need to get that data into a database?

When I started pulling on this thread I started here, on the Pentaho forum. In order to implement this solution you will need to download jsoup, as mentioned in this post on the forum. After downloading jsoup you'll need to tell Pentaho how to find it the jar file, as referenced here (thanks, Reeshu!).

As with most things Pentaho, it's a bit fiddly but it works.

I'll be using the emails that I sent to myself in a previous post as a source of data.
The emails contain simple tables of data about the American Great Lakes.

We'll grab the data from these emails in a Pentaho transformation with four steps. 


Get Emails

Since this is a Transformation and not a Job, I'll be using the Email Messages Input step from the Transformation design tab. The Get Mails step available inside a Job has a few more options, like what to do with the emails after you read them. Check them both out!

You'll have to enter the settings for your email on the first tab, and then you can select the folder from which to get the emails and other options. On the Filters tab you can filter emails by sender, date, and so on.


This step retrieves 20 different fields about each message, but you can pick the ones you want. For this job I'm only going to use Received Date and Body.


Convert to XML

This step is where the jsoup comes in. If you were able to download jsoup and point Pentaho at the external jar file, this should be a piece of cake.

The output will be a field called xhtml, which you will use in the next step.

Get data from XML

The input for this step is the xhtml field from the previous step.

Getting your output the way you want it might take the most fiddling, depending on the format of the table that you're trying to parse. Mine was pretty simple, so:

Select Values

If  you were doing this for real, you'd probably use something other than a Select step. But here is our data, from the five emails that I sent myself last week. It's ready to be imported into database or spreadsheet, depending on your needs.








Tuesday, September 6, 2016

Send simple tabular data emails from a database

It's possible to send emails containing tabular data using Pentaho. The setup is a bit labyrinthine and it requires a bit of noodling with HTML/XSL, but once it's set up it works very well.

There are three main steps, plus cleanup:

  1. Create an XML file from the data that you want to send (Make XML transformation)
  2. Use an XSL file to transform the XML into HTML (XSL Transformation job entry)
  3. Place the HTML into a single field which will form the body of the email, and send the email (HTML to email transformation)
  4. Delete files






Make XML

This transformation outputs an XML file called raw_xml.xml, containing data from the Data Grid step. You could substitute a database query for the Data Grid step.



For this example, I made a simple dataset consisting of information about the five American Great Lakes.







Here's what the XML looks like:

XSL Transformation


The XSL transformation converts the XML file into an HTML file. For this step to work, you need an XSL file. Here's what my html_structure.xsl looks like:



Setting up the XSL transformation is easy, using the XML file created in the previous transformation:




HTML to email

The HTML has to be input into a single field (called "html" here) using the "Group By" step. Then you can add any constants needed to send the email, such as login info and subject line.




The html field will be the field used for the Comment entry in the Mail step.



Delete files

And then we do cleanup!


Here's how the output looks in my gmail account.

It's not pretty, but it works, and someone with HTML chops can make it look fantastic.

If anyone knows how to do this without creating the intermediate XSL and HTML files, I'm all ears. I played around with passing the rows between entries and didn't get anywhere.

Stay tuned for part 2, which will show how to get tabular data out of emails and put it into a database.