Open Sourcing Databricks Integration Tools at Edmunds

What is Databricks and How is it Useful for Edmunds?

Databricks is a fully managed, cloud-based big data and analytics platform built on Apache Spark and the JVM. The big selling point of the Databricks Unified Analytics Platform is that it unifies big data processing and machine learning.

Databricks lowers both the barrier to entry and the time needed to work with large quantities of data in several ways:

  1. Databricks Notebooks
  2. Cluster and Job Creation
  3. Polyglot Programming

Finally, Databricks is operated by active Apache Spark committers, including many of the original creators. Because of this, Edmunds gets access not only to excellent Spark support but also to the newest versions and features the framework has to offer.

Databricks Notebooks

Databricks notebooks are similar to Jupyter notebooks: essentially a REPL with a polished web UI for Spark. They allow for quick development cycles, interactive testing of code, and even visualization of datasets in the terabyte (or beyond!) range. Code can be written directly in the web UI, or you can build JARs or Python eggs in an IDE and run them on Databricks. (Note: the promise of connecting directly from an IDE is in the works through a feature called Databricks Connect, currently in private preview.) With the recent push to make notebooks a standard tool in data engineering organizations, we are excited to see what a “notebook first” approach makes possible.
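To make this concrete, here is a minimal sketch of the kind of cell you might write in a notebook. The dataset path is a made-up example, and `spark` and `display()` are provided automatically by the Databricks notebook environment.

```python
# Hypothetical notebook cell: explore a large Parquet dataset interactively.
# The S3 path below is an illustrative placeholder, not a real dataset.
events = spark.read.parquet("s3://example-bucket/events/")

daily_counts = (
    events
    .groupBy("event_date")
    .count()
    .orderBy("event_date")
)

# display() is a Databricks notebook helper that renders tables and charts inline.
display(daily_counts)
```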

Easy Cluster and Job Creation

Databricks has an easy-to-use interface for creating clusters to run both interactive and scheduled jobs. This eliminates many of the operational hurdles of our in-house ticketing system: a request that could easily take days now takes only a matter of minutes. Other solutions, like EMR, are moving in this direction as well via the Amazon Console, but they still have a long way to go before they are as easy to use as Databricks. A big part of this is Databricks’ better UI experience, which allows users without much software background to quickly hit the ground running on their own Spark cluster. It even provides easy-to-configure auto-scaling and auto-termination so precious dollars can be saved. Another benefit unique to Databricks is that it offers the most up-to-date version of Spark as well as superior Spark support and tuning. As a company, we strive to write all new data engineering projects using the Spark framework, so this is a key consideration for us.
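For a sense of how little configuration this involves under the hood, here is a rough sketch that creates an auto-scaling, auto-terminating cluster with the Databricks REST API directly (plain `requests`, not our open-sourced client). The host, token, Spark version, and node type are placeholder values.

```python
import requests

# Placeholder values; substitute your own workspace host and API token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Create an auto-scaling cluster that terminates itself after 30 idle minutes.
cluster_spec = {
    "cluster_name": "example-autoscaling-cluster",
    "spark_version": "5.3.x-scala2.11",   # example runtime version string
    "node_type_id": "i3.xlarge",          # example AWS instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```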

Polyglot Programming

Databricks gives you the freedom to choose from five popular languages to achieve results (note: Java is supported only in JAR form). Notebooks support Scala, Python, R, SQL, and Markdown. This opens up access to people from a wide swath of backgrounds. No longer must you be a Scala or Java expert with experience in distributed programming to develop a scalable job.
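As a small illustration of how the languages interoperate, the sketch below builds a DataFrame in Python, registers it as a temporary view, and queries it with SQL. The dataset path and table name are hypothetical; in a notebook the SQL could also live in its own cell.

```python
# Read data with PySpark and expose it to SQL as a temporary view.
inventory = spark.read.parquet("s3://example-bucket/vehicle_inventory/")
inventory.createOrReplaceTempView("inventory")

# Query the same data with SQL; a separate %sql cell could run this query too.
top_makes = spark.sql("""
    SELECT make, COUNT(*) AS listings
    FROM inventory
    GROUP BY make
    ORDER BY listings DESC
    LIMIT 10
""")
display(top_makes)
```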

Evolving on Databricks

As with most software stacks, software engineers require a set of tools, libraries, and processes that allow them to follow software engineering best practices. We are no different and need solid ways of developing and deploying production-ready code to the Databricks runtime environment. So, why not build such a system up front? When we introduced Databricks at Edmunds, we focused on making adoption of the framework easy in order to persuade users that Databricks could play an important role in their work. This meant an emphasis on achieving the fastest development velocity possible.

You are probably wondering why we would want to institute policies that could take away from that velocity, which is consistently viewed as one of the most important factors in a company’s technological health. As with almost all things in technology, there are trade-offs to consider. Having no operational checks and few standards for our Databricks environment meant that every project deployed to production on Databricks became technical debt. This debt adds up over time and lowers the velocity not only of the team’s deployment cycle but of every Databricks user as well. We needed to address requirements across four categories:

So What Did We Build?

First, I want to call out that we were not the first team to try to solve this problem. Our advertising solutions team had already developed a strong Python solution, which includes scripts for syncing notebooks to and from Databricks as well as a standardized workspace layout that allows for logical separation of workflows such as services, configuration management, and local dev testing. We referred to this process when creating our framework for JVM projects.
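Their scripts aren’t reproduced here, but to give a flavor of how such a sync can work, here is a rough sketch against the Databricks Workspace REST API. The host, token, and paths are placeholders, and our colleagues’ actual implementation may differ.

```python
import base64
import requests

# Placeholder values; substitute your own workspace host and API token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def export_notebook(workspace_path: str, local_path: str) -> None:
    """Download a notebook's source from the Databricks workspace to a local file."""
    resp = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers=HEADERS,
        params={"path": workspace_path, "format": "SOURCE"},
    )
    resp.raise_for_status()
    # The API returns the notebook source base64-encoded in the "content" field.
    with open(local_path, "wb") as f:
        f.write(base64.b64decode(resp.json()["content"]))

def import_notebook(local_path: str, workspace_path: str, language: str = "PYTHON") -> None:
    """Upload a local notebook source file back into the Databricks workspace."""
    with open(local_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json={
            "path": workspace_path,
            "format": "SOURCE",
            "language": language,
            "content": content,
            "overwrite": True,
        },
    )
    resp.raise_for_status()
```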

The goal of this project was to extend our capabilities on Databricks and accommodate the following additional requirements:

Here is what we built:

What Does It Roughly Look Like?

What This Delivers for Us at a High Level

Why Did We Open Source the REST Client and Maven Plugin?

At Edmunds, we believe in open source technology. Contributing what we’ve done will hopefully allow others to improve how they work with Databricks, and we hope contributors will return the favor. Are you interested? Submit pull requests for one of the “good first issues” on GitHub.

Are you interested in solving big data problems or building tools to empower software engineers and data scientists? If so, check out our open positions!

Future Work

Interactive Cluster Management - we plan to add functionality to our Maven plugin for users who want to automate the management of interactive clusters and codify their creation.
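As a rough illustration of what that automation could cover (this is not the plugin’s actual interface, just a sketch against the Databricks REST API with placeholder host and token), codifying an interactive cluster might amount to an idempotent “ensure” step:

```python
import requests

# Placeholder values; substitute your own workspace host and API token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def ensure_interactive_cluster(spec: dict) -> str:
    """Create the interactive cluster described by `spec` if one with the same
    name does not already exist; otherwise reuse it. Returns the cluster id."""
    resp = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS)
    resp.raise_for_status()
    for cluster in resp.json().get("clusters", []):
        if cluster["cluster_name"] == spec["cluster_name"]:
            return cluster["cluster_id"]
    resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=spec)
    resp.raise_for_status()
    return resp.json()["cluster_id"]
```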

Sam Shuster is Staff Engineer on the Data Engineering team at Edmunds.

Shaun Elliott is Technical Lead on the Data Engineering team at Edmunds.

At Edmunds we’re not just about making car buying easier, we're also passionate about technology!

As with any website that has millions of unique visitors, it's a necessity that we maintain a scalable and highly-available infrastructure with reliable services.

We are excited by software design and strive to create engaging experiences while using coding practices that promote streamlined website production and experimentation. We embrace continuous delivery and DevOps, and are constantly improving our processes so we can stay lean and innovative. We also prioritize giving back to the community by providing open APIs for our automotive data and open sourcing projects whenever possible.
