Using Oozie to Process Daily Logs

At Edmunds we are moving our existing data warehouse to a new system based on Hadoop and Netezza. Our data warehouse team chose ad impression data from DoubleClick DART as the first production deliverable. Last week we reached a major milestone: DART data is now delivered nightly through the new Hadoop/Netezza infrastructure. Once in Hadoop, the data is scrubbed and written to files that mirror our Netezza table structure. To handle the DART processing, we wrote a fair amount of code to deal with daily log rotation and related chores such as file naming and date calculations.

Toward the end of the project we started using Oozie to coordinate our processing workflows. We chose Oozie because we wanted a system that would let us schedule jobs, coordinate workflows, and see more clearly what is running and when.

Recently I took on the task of processing some of the logs we generate internally, which are rotated daily. Given what Oozie provides, we thought it would be great to remove our own code for log rotation, file names, and date calculations and let Oozie do that work. As powerful as Oozie is for date-based processing, getting it to work was another story.

I wrote a blog post describing the configuration I used to get our jobs running in Oozie. It also covers some of the mistakes I made, so that others can save time and effort when using Oozie.
