Thursday, August 11, 2016

Delete files from Confluence by date

Deleting files from Confluence can be tedious if you have to do it one at a time, especially when there are potentially dozens of files to delete and only those with specific attributes should be removed.

Using the Confluence CLI (Command Line Interface) and Pentaho PDI, this can be accomplished easily. You can even schedule the Pentaho job to run at specified intervals using Windows Task Scheduler or your favorite scheduling tool.
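
For example, a nightly run might be registered with a Task Scheduler command along these lines; the paths, task name, and start time are placeholders, and Kitchen.bat is PDI's command-line job runner:

    schtasks /create /tn "DeleteConfluenceAttachments" /sc daily /st 02:00 /tr "C:\pentaho\data-integration\Kitchen.bat /file:C:\jobs\delete_by_date.kjb"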

This is another small job, with one transformation in the middle that can pass multiple rows to the .bat file in the last step. In this example I'm deleting all files from a space that were posted before yesterday.


get list of files

The first entry in this job is a Shell script step that gets a list of attachments from space YOURSPACE on the page called Your Title. The list will be a .csv file with file attributes, and it will land in the location specified in the "General" tab.
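
The shell script boils down to a one-line command along these lines, sketched here assuming Bob Swift's Confluence CLI with the server URL and credentials configured in the bundled confluence.bat wrapper; action and parameter names can vary between CLI versions, and the output path is a placeholder:

    confluence --action getAttachmentList --space YOURSPACE --title "Your Title" --file "C:\data\attachmentlist.csv"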


select files to delete

At the heart of the job is a transformation that takes in the list of attachment attributes from the Confluence space, obtains yesterday's date from the Get System Info step, performs the filtering, and then passes the surviving rows back to the main job.

Here's a list of attributes that are in the attachmentlist.csv. There are plenty to use for filtering. We are going to be filtering on the Created field.


We get Yesterday from the Get System Info step:

And compare it with Created, sending the rows where we want them. The "Select values" steps that are the targets of the Filter step are there mainly for troubleshooting; they don't do anything.


removeAttachment.bat

The last entry in the job is a call to a batch file called removeAttachment.bat, which contains this command:

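A sketch of that command, again assuming Bob Swift's Confluence CLI; %1 is the attachment name handed in by the job, and parameter names may differ between CLI versions:

    rem removeAttachment.bat -- %1 is the attachment file name passed in by Pentaho
    confluence --action removeAttachment --space YOURSPACE --title "Your Title" --name %1
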
In order to get this to work, we have to set the job entry to copy previous results to the arguments and to execute for every row.
And that's it! Confluence CLI is full of handy tools, and Pentaho makes them even more flexible.


Tuesday, August 9, 2016

Find latest file in directory with Pentaho

Need to grab the latest file from a directory and do something with it? Or the largest, or the one with the longest name, or any other superlative?

In Pentaho this is easy to do with a three-step transformation, without using any variables, and the filename can be passed to the next step in a job.

D:\MyFiles contains 11 files that all have timestamps in their names, and we could theoretically use those to do a string comparison with today's date, or any other date. But that is cumbersome, and there's always the possibility that we want to grab more than one file, or just the latest file even if we don't know when it arrived.

The first thing we need to do is to have Pentaho get all of the filenames. In the "Get File Names" step I've used the RegExp wildcard of ".*", which will get me everything in the directory. I could also have used ".*\.csv" to get all .csv files, or "test.*\.csv" to get all .csv files whose names start with "test".

If we preview the rows, we see that Pentaho retrieves other useful information in the Get File Names step, like size.

The next step is a "Sort rows" step, where we sort by the lastmodifiedtime field. Make sure to set the sort direction to descending, so the newest files come out first.

Finally, we sample the rows coming from the sort step. I've chosen rows 1 through 3, but you could easily pick just one, or even pass a variable for the number of rows.

If we execute the transformation and preview the output, we can see that the three latest files are selected.

Now the filename or set of filenames is ready for whatever processing is needed.
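
Outside of Pentaho, the same "newest N files" idea can be sketched as a plain batch file for comparison; the directory and the count of three are just this example's values:

    @echo off
    rem List the files in D:\MyFiles newest-first and keep the first three.
    setlocal enabledelayedexpansion
    set count=0
    for /f "delims=" %%f in ('dir /b /a-d /o-d "D:\MyFiles"') do (
        set /a count+=1
        if !count! leq 3 echo %%f
    )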

Wednesday, August 3, 2016

"Copy rows to result" twice in Pentaho PDI

I have two different datasets, and I need to do two different command-line operations with them in the same job. This seems simple enough, but if the second set happens to be empty, strange things can happen.

Here's set 1, in the Input_Field1 transformation:
And here's set 2 in the Input_Field2 transformation, filtered to be empty for this example. I have a list of words that start with "z", and I'm filtering out any words that don't start with "a". 

Think of this as a list of files in a directory, possibly filtered to get back any that were created before yesterday. But what if none fit those criteria, because today is Monday?

The setup:

Main.kjb has two job steps, Echo_Field1 and Echo_Field2. For this example, each job has the simple task of echoing back the fields it receives.
The Echo_Field1 job (Echo_Field2 looks the same):
... where echoField1.bat is a batch file with the single statement "echo %1".
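
As a complete file, that is just the following; the @echo off line is an optional addition here to keep the command itself out of the output:

    @echo off
    echo %1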


When I initially ran this job, I received the "bat, cat, hat" output for both the Field1 and Field2 steps. What to do? How can I stop the Field1 output from interfering with the step for Field2, if Field2 has nothing to say?

The trick is to check the "Clear list of result rows before execution" box on the Advanced tab of the job entry details for the Input_Field2 transformation.

Now Field1 and Field2 can express themselves freely, even if that means not saying anything.

And it works if Field2 has entries as well: