Tuesday, November 8, 2016

Denormalize data in Pentaho

The Row Denormalizer step in Spoon/PDI/Pentaho is powerful but can be clunky to implement. Let's walk through an example using data from the MySQL Sakila sample database, which I have already installed on my local MySQL instance.

We'll denormalize the data to aggregate one column by creating new columns out of one set of values in the rows.

We need three steps to do this, since we need to sort the rows being sent to the denormalizer.

Table Input

I'm grabbing a set of data concerning movie rating, rental duration, category, and length from the Sakila database. Our task will be to obtain the average length of movie for every combination of rating, rental duration, and category.

We're going to aggregate this data so that we have a row for every combination of rental_duration and category. We'll have columns for each of the ratings.

Sort Rows

The Row Denormalizer step is quite adamant that you sort the rows that it receives, and for good reason. Since we're going to be averaging the length, we don't need to sort it. We need to sort by all of the other fields, but in what order?

Since we're going to have ratings form the columns, it will go last. The other two columns must be sorted first.

Just for giggles, try removing this step between Table Input and Row Denormalizer, and see what the output is. Or sort the rating column before the other two.

Row Denormalizer

This is the step that can be perplexing. Let's try to clear it up.

We've got four fields of data to play with.

Rating is the "key field", since it's going to determine what gets sorted into which column (in yellow).

Length field goes into the "Value fieldname", since it's going to form the new values that Spoon is going to calculate (in green).

That leaves the other two, which will make up the grouping (in pink).

And here's the exciting output!

Notice that we only have output for movies rated G, PG, and R, even though our dataset has other ratings. We didn't specify these other ratings in the Target fields section, and the transformation was still able to run. Feel free to add them as an exercise.

We can see that there is a <null> for some values, like PG-rated Action movies with duration of 5. If we spy on our sorting step, we can see why.

We can definitely do more complicated things with Row Denormalizer. Stay tuned!

1 comment:

  1. Hello to all, since I am genuinely keen of
    reading this blog's post to be updated regularly.
    It contains fastidious data.