Tuesday, August 9, 2016

Find latest file in directory with Pentaho

Need to grab the latest file from a directory and do something with it? Or the largest, or the one with the longest name, or any other superlative?

Pentaho makes this easy: a three-step transformation does it without any variables, and the resulting filename can be passed to the next step in a job.

In this example, D:\MyFiles contains 11 files that all have timestamps in their names. We could theoretically compare those strings with today's date, or any other date, but that is cumbersome. We might also want to grab more than one file, or just the latest file even if we don't know when it arrived.

The first thing we need to do is have Pentaho get all of the filenames. In the "Get File Names" step I've used the RegExp wildcard ".*", which matches everything in the directory. I could also have used ".*\.csv" to get all .csv files, or "test.*\.csv" to get all .csv files that start with "test". The dot before "csv" is escaped because an unescaped dot in a regular expression matches any character, not just a literal period.
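To see how these wildcards behave, here is a small sketch in Python rather than Pentaho. It assumes the pattern has to match the whole filename, and the sample filenames are made up for illustration. Pentaho uses Java regular expressions, but for patterns this simple Python's `re` module should behave the same way.

```python
import re

# Hypothetical filenames, invented for this illustration.
names = ["test_20160809.csv", "report.csv", "notes_csv", "test.txt"]


def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]


all_files = matching(r".*", names)                # everything
csv_files = matching(r".*\.csv", names)           # names ending in ".csv"
test_csv_files = matching(r"test.*\.csv", names)  # ".csv" names starting with "test"

# With an unescaped dot, the "." before "csv" matches any character,
# so a name like "notes_csv" slips through as well.
unescaped = matching(r".*.csv", names)

print(csv_files)
print(unescaped)
```

The unescaped form is usually harmless in a tidy directory, but the escaped form says exactly what is meant.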

If we preview the rows, we see that the Get File Names step also retrieves other useful information about each file, such as its size and last modified time.

The next step is a "Sort rows" step, where we sort by the lastmodifiedtime field. Make sure to set the sort order to descending rather than ascending, so that the most recently modified files come first.

Finally, we sample the rows coming from the sort step. I've chosen rows 1 through 3, but you could easily pick just one, or even pass a variable for the number of rows.

If we execute the transformation and preview the output, we can see that the three latest files are selected.
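For comparison, the same three-step logic (get the filenames, sort by last modified time in descending order, keep the first few rows) can be sketched in plain Python. This is only an illustration of the idea, not what Pentaho does internally. The function name `latest_files` and the demo filenames are made up, and `Path.glob` takes a shell-style pattern rather than a regular expression.

```python
import os
import tempfile
from pathlib import Path


def latest_files(directory, pattern="*", count=3):
    """Return up to `count` files in `directory`, most recently modified first."""
    # Step 1: collect the files (the role of "Get File Names").
    files = [p for p in Path(directory).glob(pattern) if p.is_file()]
    # Step 2: sort by last modified time, newest first (the role of "Sort rows").
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    # Step 3: keep only the first `count` rows (the role of the sampling step).
    return files[:count]


# Demo on a temporary directory with five files whose modified times are
# set explicitly, so the ordering does not depend on how fast they are written.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        path = Path(tmp) / f"file_{i}.csv"
        path.write_text("demo")
        os.utime(path, (1_470_000_000 + i, 1_470_000_000 + i))
    newest = [p.name for p in latest_files(tmp)]

print(newest)
```

Changing the sort key to `p.stat().st_size` or `len(p.name)` would give the largest file or the one with the longest name instead.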

Now the filename or set of filenames is ready for whatever processing is needed.
