A guide to optimizing cost for Azure Databricks

Faiz Chachiya
Mar 15, 2024


Azure Databricks is one of the most widely used services for data-related activities, serving data engineers, analysts, and ML engineers alike. A common concern I hear from customers is that cost keeps growing as business requirements and workloads on their Databricks instances increase. In this blog we will cover a step-by-step approach to configuring clusters so that analyzing cost utilization becomes easier for teams as well as administrators.

Custom tags

Tags are nothing but metadata that you apply to your Azure resources such as VMs, databases, etc. They make it easy to identify the purpose, usage, and owner of these resources. For example, if you want to track the deployment environment of your resources, add a tag with the key Environment.

Whenever you create a cluster within Azure Databricks, certain default tags (see the snapshot below) are added as part of the cluster creation process.

As you may be aware, there are different kinds of clusters (job clusters, SQL warehouses, pools, etc.) that do not all carry the same default tags, so narrowing down the cost of a specific cluster can become very difficult.

For example, in the following snapshot the cost analysis is grouped by ‘clustername’, and there are many untagged workloads.

To circumvent this problem, it is better to add your own custom tags (for example, a team, project, or cost-center identifier) to your clusters.

Ensure that you add them to all the compute types (job clusters, SQL warehouses, pools, etc.).

The advantage of adding these custom tags is a clear segregation of the different job clusters, making it easy for you to identify the expensive operations.
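As a rough sketch of what this looks like in practice, the snippet below creates a cluster with custom tags through the Databricks Clusters REST API. The workspace URL, cluster name, and tag keys (team, project, environment) are purely illustrative; pick names that match your own chargeback model.

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # example workspace URL
TOKEN = "<personal-access-token>"

# Custom tags are propagated to the underlying Azure VMs and disks,
# so they show up in Azure Cost Management for grouping and filtering.
cluster_spec = {
    "cluster_name": "etl-orders-daily",            # illustrative name
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "custom_tags": {                               # example tag keys; adjust to your standard
        "team": "data-engineering",
        "project": "orders",
        "environment": "dev",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```

Job clusters and instance pools accept the same custom_tags field in their respective APIs, so one tagging convention can be applied across all compute types.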

Cost Optimization

Once we have identified the clusters that are the major cost contributors, the next step is to look at each cluster’s CPU/memory utilization, and the developers also need to closely analyze the Spark DAG for further optimization.

Analyze the CPU/memory utilization

Whether you are analyzing a job or a particular compute, go to the respective job/cluster and open its metrics.

Spark scheduled job

The metrics used to link to the Ganglia UI, but this has recently changed, and we now have a clear view of CPU, memory, and other Spark-related metrics. For CPU and memory utilization we are interested in the following graphs.

Metrics from Azure Databricks

If we look carefully at the CPU utilization graph, this cluster is underutilized, probably running at just 10–15%, so there is ample opportunity to optimize further by either reducing the minimum number of nodes or scaling down the compute SKU.

Similarly, we also need to look at the memory utilization graph, because some workloads are memory-intensive rather than compute-intensive, and the compute SKU should be sized accordingly.
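As a minimal sketch of the resizing step, assuming the underutilized cluster above is the one we want to right-size, the snippet below uses the Clusters edit API to lower the autoscale floor and move to a different node type. The cluster ID, name, and SKU are placeholders, and note that clusters/edit replaces the full cluster definition, so every attribute you want to keep must be resent.

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # example workspace URL
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"  # the underutilized cluster identified in cost analysis

# Resend the whole spec with a smaller autoscale range and node type.
new_spec = {
    "cluster_id": CLUSTER_ID,
    "cluster_name": "etl-orders-daily",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_D4ds_v5",                 # example of a smaller/cheaper SKU
    "autoscale": {"min_workers": 1, "max_workers": 2},  # lower floor for a ~10-15% utilized cluster
    "custom_tags": {
        "team": "data-engineering",
        "project": "orders",
        "environment": "dev",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=new_spec,
)
resp.raise_for_status()
```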

It is very important that this activity be performed together with the application/data team that owns the cluster, as they have a better understanding of their workload.

Spark DAG

Analyzing the DAG can surface many areas of optimization, such as the following:

  1. Ensure the data is evenly distributed across the tasks; if a few tasks are overloaded, the overall job execution time increases.
  2. Reading and writing too many small files also impacts execution time, and the higher number of read/write operations increases the transaction requests against blob storage, driving cost up further.
  3. Ensure Spark best practices are followed, such as maximizing parallelism, taking advantage of lazy evaluation, caching, reducing shuffles, etc. (a small PySpark sketch illustrating points 2 and 3 follows this list).
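The sketch below shows two of these ideas in PySpark: caching a DataFrame that is reused by several actions, and compacting output into fewer, larger files before writing. The Delta paths and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative source path; replace with your own table or location.
orders = spark.read.format("delta").load("/mnt/raw/orders")

# Cache a DataFrame that feeds several downstream actions instead of
# recomputing it each time (lazy evaluation means nothing runs until an action).
completed = orders.filter("status = 'COMPLETED'").cache()

daily = completed.groupBy("order_date").count()
by_region = completed.groupBy("region").count()

# Compact the output into fewer, larger files so downstream jobs do not pay
# for listing and reading thousands of tiny blobs.
(daily
    .coalesce(8)
    .write.format("delta")
    .mode("overwrite")
    .save("/mnt/curated/orders_daily"))

(by_region
    .coalesce(8)
    .write.format("delta")
    .mode("overwrite")
    .save("/mnt/curated/orders_by_region"))
```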

Summary

Cost optimization for Azure Databricks instances can be a complex activity. The approaches mentioned in this blog are best performed after you have explored other optimizations on the storage side; they will also help you standardize the way clusters are configured and simplify the overall activity.

(The opinions expressed here represent my own and not those of my current or any previous employers.)


Faiz Chachiya

Faiz Chachiya is a software architect, coder, technophile, and newbie writer who loves learning languages. He currently works at Microsoft as a Cloud Solution Architect.