Tuesday, November 8, 2016

Denormalize data in Pentaho

The Row Denormalizer step in Spoon/PDI/Pentaho is powerful but can be clunky to implement. Let's walk through an example using data from the MySQL Sakila sample database, which I have already installed on my local MySQL instance.

We'll denormalize the data to aggregate one column by creating new columns out of one set of values in the rows.

We need three steps to do this, since we need to sort the rows being sent to the denormalizer.

Table Input

I'm grabbing a set of data concerning movie rating, rental duration, category, and length from the Sakila database. Our task will be to obtain the average length of movie for every combination of rating, rental duration, and category.

We're going to aggregate this data so that we have a row for every combination of rental_duration and category. We'll have columns for each of the ratings.

Sort Rows

The Row Denormalizer step is quite adamant that you sort the rows that it receives, and for good reason. Since we're going to be averaging the length, we don't need to sort it. We need to sort by all of the other fields, but in what order?

Since we're going to have ratings form the columns, it will go last. The other two columns must be sorted first.

Just for giggles, try removing this step between Table Input and Row Denormalizer, and see what the output is. Or sort the rating column before the other two.

Row Denormalizer

This is the step that can be perplexing. Let's try to clear it up.

We've got four fields of data to play with.

Rating is the "key field", since it's going to determine what gets sorted into which column (in yellow).

Length field goes into the "Value fieldname", since it's going to form the new values that Spoon is going to calculate (in green).

That leaves the other two, which will make up the grouping (in pink).

And here's the exciting output!

Notice that we only have output for movies rated G, PG, and R, even though our dataset has other ratings. We didn't specify these other ratings in the Target fields section, and the transformation was still able to run. Feel free to add them as an exercise.

We can see that there is a <null> for some values, like PG-rated Action movies with duration of 5. If we spy on our sorting step, we can see why.

We can definitely do more complicated things with Row Denormalizer. Stay tuned!

Tuesday, October 4, 2016

Exit a batch script when an error occurs

Through Pentaho's shell script step, we can integrate into our ETL file manipulation, Confluence interaction, Tableau server interaction, and just about any other application for which we have a command line interface.

There is a dark side of the shell script that may leave you thinking that your job ran flawlessly, when in reality it just bailed out early.

I've got a very simple test job here. It calls a script to rename the file "foo.bar" to "bar.foo". However, that file doesn't exist, so the shell script step exits with a failure status.

I add a second statement to my script -- this time, to rename a file that does exist. What's going to happen?

If the first statement fails, the second one still runs. In this case, it ran successfully and the "hello.world" file was renamed. This means that according to Pentaho, the success condition for this shell script step was met, we see a nice green check-mark, and if this job contained more steps that depended on the success of the shell script step, everything would proceed as normal. Which probably isn't OK.

Fortunately there's a way around this that doesn't involve having to break these commands up into multiple job steps. As we see on this Microsoft page, the double-pipe gives us a way to execute certain commands only if the preceding command fails.

If we add "|| exit 1" after every command, we can force the job to stop if any command in the script fails. Not only does the script exit, but it exits with a return code of 1, which is a failure.

There are other ways to accomplish a similar result. For example, the same page referenced above indicates that we can use &&: "Use to run the command following && only if the command preceding the symbol is successful."

However, this would require that we put all of our commands on the same line, which is not good for readability if your commands are longer or more complex, or if you have more than a couple of them.

Thursday, September 22, 2016

Schedule a Pentaho job to run automatically

Got a job that needs to run every morning at the same time, or even every half hour? Pentaho Data Integration Server does have a built-in scheduling view, but if you are using Pentaho CE you may end up using Windows Task Scheduler to call a .bat file that will in turn call kitchen.bat to run your .kjb file. Clear as mud? Here we go ...

I have the world's simplest job called main.kjb. All it does is create a file called hello_world.txt, with an added timestamp before the file extension.

The batch file (.bat)

If I want to use Windows Task Scheduler to run this task, I need a very simple batch file. It needs to call kitchen.bat, and then point Kitchen at the job file that I want to run. It can also create a job log for me, with any level of verbosity you choose. Kitchen has good documentation at http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation.

If you've never run Kitchen from the command line on your machine before, you will probably have to add the path to your Path system environment variable. Search for these variables in the Settings menu. You'll need the full path to where your kitchen.bat lives.

This may look different depending on your version of Windows.

Once you've done this, you can test your .bat file by double-clicking it. This is a good check before setting it up in the task scheduler. 

The task scheduler

The scheduler can be found by searching for it on your machine. If you use the wizard to set up your schedule, you want to select the action of "Start a program". But the program you are starting isn't Spoon, it's the .bat file that you created. 

In the wizard, you can pick the time that the job will run. Elsewhere in the utility you can select fancier options, like having the task run every half hour between 9 am and 5 pm on weekdays.

Let's start a program...

Here's where the .bat file goes...

And here's how the task looks in the schedule list, along with its task siblings.

If you enable history logging, the task scheduler shows you a rough history of every action taken on your task.

The task scheduler also allows you to run the task on demand, if you would like. This means that you don't have to open your job file and risk inadvertently changing something.

Tuesday, September 13, 2016

Move html table data from email to database using Pentaho, jsoup

In an earlier post, we extracted data from a data table and then emailed it in a tabular format using HTML.

How about going the other direction, where we are receiving emails that contain data in a tabular format, and we need to get that data into a database?

When I started pulling on this thread I started here, on the Pentaho forum. In order to implement this solution you will need to download jsoup, as mentioned in this post on the forum. After downloading jsoup you'll need to tell Pentaho how to find it the jar file, as referenced here (thanks, Reeshu!).

As with most things Pentaho, it's a bit fiddly but it works.

I'll be using the emails that I sent to myself in a previous post as a source of data.
The emails contain simple tables of data about the American Great Lakes.

We'll grab the data from these emails in a Pentaho transformation with four steps. 

Get Emails

Since this is a Transformation and not a Job, I'll be using the Email Messages Input step from the Transformation design tab. The Get Mails step available inside a Job has a few more options, like what to do with the emails after you read them. Check them both out!

You'll have to enter the settings for your email on the first tab, and then you can select the folder from which to get the emails and other options. On the Filters tab you can filter emails by sender, date, and so on.

This step retrieves 20 different fields about each message, but you can pick the ones you want. For this job I'm only going to use Received Date and Body.

Convert to XML

This step is where the jsoup comes in. If you were able to download jsoup and point Pentaho at the external jar file, this should be a piece of cake.

The output will be a field called xhtml, which you will use in the next step.

Get data from XML

The input for this step is the xhtml field from the previous step.

Getting your output the way you want it might take the most fiddling, depending on the format of the table that you're trying to parse. Mine was pretty simple, so:

Select Values

If  you were doing this for real, you'd probably use something other than a Select step. But here is our data, from the five emails that I sent myself last week. It's ready to be imported into database or spreadsheet, depending on your needs.

Tuesday, September 6, 2016

Send simple tabular data emails from a database

It's possible to send emails containing tabular data using Pentaho. The setup is a bit labyrinthine and it requires a bit of noodling with HTML/XSL, but once it's set up it works very well.

There are three main steps, plus cleanup:

  1. Create an XML file from the data that you want to send (Make XML transformation)
  2. Use an XSL file to transform the XML into HTML (XSL Transformation job entry)
  3. Place the HTML into a single field which will form the body of the email, and send the email (HTML to email transformation)
  4. Delete files

Make XML

This transformation outputs an XML file called raw_xml.xml, containing data from the Data Grid step. You could substitute a database query for the Data Grid step.

For this example, I made a simple dataset consisting of information about the five American Great Lakes.

Here's what the XML looks like:

XSL Transformation

The XSL transformation converts the XML file into an HTML file. For this step to work, you need an XSL file. Here's what my html_structure.xsl looks like:

Setting up the XSL transformation is easy, using the XML file created in the previous transformation:

HTML to email

The HTML has to be input into a single field (called "html" here) using the "Group By" step. Then you can add any constants needed to send the email, such as login info and subject line.

The html field will be the field used for the Comment entry in the Mail step.

Delete files

And then we do cleanup!

Here's how the output looks in my gmail account.

It's not pretty, but it works, and someone with HTML chops can make it look fantastic.

If anyone knows how to do this without creating the intermediate XSL and HTML files, I'm all ears. I played around with passing the rows between entries and didn't get anywhere.

Stay tuned for part 2, which will show how to get tabular data out of emails and put it into a database.

Thursday, August 11, 2016

Delete files from Confluence by date

Deleting files from Confluence can be tedious if you have to do it one at a time, especially if you have potentially dozens of files to be deleted and you need to only delete files with specific attributes.

Using Confluence CLI (Command Line Interface) and Pentaho PDI, this can be easily accomplished. You can even schedule the Pentaho job to run at specified time intervals using Windows Scheduler or your favorite scheduling tool.

This is another small job, with one transformation in the middle that can pass multiple rows to the .bat file in the last step. In this example I'm deleting all files from a space that were posted before yesterday.

get list of files

The first entry in this job is a Shell script step that gets a list of attachments from space YOURSPACE on the page called Your Title. The list will be a .csv file with file attributes, and it will land in the location specified in the "General" tab.

select files to delete

At the heart of the job is a transformation that takes in a list of file attributes in a Confluence space, obtains yesterday's date from the System Info step, performs filtering, and then passes the rows back to the main job.

Here's a list of attributes that are in the attachmentlist.csv. There are plenty to use for filtering. We are going to be filtering on the Created field.

We get Yesterday from the System:

And compare it with Created, sending the rows where we want them. The "Select values" steps that are the targets of the Filter step are there mainly for troubleshooting; they don't do anything.


The last entry in the job is a call to a batch file called removeAttachment.bat, which contains this command:

In order to get this to work, we have to copy previous results to the arguments, and execute for every row.
And that's it! Confluence CLI is full of handy tools, and Pentaho makes them even more flexible.

Tuesday, August 9, 2016

Find latest file in directory with Pentaho

Need to grab the latest file from a directory and do something with it? Or the largest, or the one with the longest name, or any other superlative?

Within Pentaho this is easy to do within a three-step transformation without using any variables, and the filename can be passed to the next step in a job.

D:\MyFiles contains 11 files that all have timestamps in their names, and we could theoretically use those to do a string comparison with today's date, or any other date. But that is cumbersome, and there's always the possibility that we want to grab more than one file, or just the latest file even if we don't know when it arrived.

The first thing we need to do is to have Pentaho get all of the filenames. In the "Get File Names" step I've used the RegExp wildcard of ".*", which will get me everything in the directory. I could also have used ".*.csv" to get all .csv files, or "test.*.csv" to get all csv files that start with "test".

If we preview the rows, we see that Pentaho retrieves other useful information in the Get File Names step, like size.

The next step is a "Sort rows" step, where we sort by the lastmodifiedtime field. Make sure to change the option for ascending vs descending order.

Finally, we sample the rows coming from the sort step. I've chosen rows 1 through 3, but you could easily pick just one, or even pass a variable for the number of rows.

If we execute the transformation and preview the output, we can see that the three latest files are selected.

Now the filename or set of filenames is ready for whatever processing is needed.