Monday, July 24, 2017

Debugging Kettle tasks in MapReduce: Sane WriteToLog


Finally started to play with big data and Pentaho. In my specific case, Cloudera CDH3u4. At Mozilla we have a few clusters of over 80 machines that we're using to back up a bunch of services.

Debugging MapReduce tasks


It took me a while to get my head around how Kettle integrates with the MapReduce tasks. When I did, the first thing I noticed is how hard it is to know what's happening. Until Matt Casters and friends get the chance to implement PDI-9148, we need to do things manually - as in inspecting logs, etc.

My first approach was writing to text files. I tested direct output to HDFS, but for some reason it didn't work. Writing straight to the local file system means the output gets spread across all the cluster nodes. This approach generally sucks.
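Just to make clear what I mean by "direct output to HDFS": something along the lines of the sketch below, using the standard Hadoop FileSystem API. The path and file name are made up for illustration; in a real task you would also want a per-task file name (e.g. based on the task attempt id) to avoid collisions.

    import java.io.PrintWriter;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDebugWriter {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath;
            // inside a task you'd use the job's Configuration instead.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Illustrative path only.
            Path debugFile = new Path("/tmp/kettle-debug/sample-output.txt");

            // Create (overwrite) the file and dump one debug line into it.
            FSDataOutputStream out = fs.create(debugFile, true);
            PrintWriter writer = new PrintWriter(out);
            writer.println("key=someKey\tvalue=someValue");
            writer.close();
        }
    }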

I also thought about using some hand-made logic in a JavaScript step, but then looked at the Write To Log step. This step generally works, but it has one big flaw: there's no way to limit its output. If we have millions of rows, we'll end up generating a huge log - and that's not good.


An improved Write To Log step


If it's not there, just do it yourself; the code is open. So I did. I added the ability to specify a limit on the step's output. This is very useful for inspecting what the dataset looks like inside a map or reduce task. Once I deployed this change to my cluster, this is what my TaskTracker log looks like (I ran this with a previous WriteToLog version and ended up with a crashed browser and almost half a gigabyte of log files). It shows the first 5 rows of our dataset, with the key and value of each row:


I'll work with the Kettle team in order to get this into the main code line; hopefully it will be in 4.4.1 and 5.0. This is PDI-9195.
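The idea behind the change is really simple. Here's a rough sketch of it - not the actual patch, and the class and method names are made up - just a row counter that stops writing to the log once the configured limit is reached:

    // Illustrative sketch only - not the real WriteToLog code.
    public class LimitedRowLogger {
        private final long limit;   // 0 or negative means "no limit"
        private long rowsSeen = 0;

        public LimitedRowLogger(long limit) {
            this.limit = limit;
        }

        // Called once per row; writes it out only while we're under the limit.
        public void maybeLog(String rowAsText) {
            rowsSeen++;
            if (limit <= 0 || rowsSeen <= limit) {
                System.out.println("row #" + rowsSeen + ": " + rowAsText);
            }
        }
    }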




