
Databricks cost optimization: Practical tips for performance and savings

In a recent Databricks cost-optimization engagement I led, the outcomes were substantial: reduced unnecessary compute usage, faster workloads, and less developer time spent managing infrastructure.

OCT. 3, 2025
9 Min Read
By Dilorom Abdullah
These gains weren’t just about tweaking configuration knobs. They came from rethinking how cluster resources align with workload behavior.
Over time, I’ve led multiple cost-efficiency initiatives across different organizations, often returning to the same environment months later. That’s because optimization isn’t a one-off event; it’s an iterative process. As cloud capabilities expand, workloads evolve, and Databricks releases new features, new savings opportunities always emerge. Continuous evaluation is key to keeping performance high while maintaining cost efficiency.
When managing Databricks environments, there are several straightforward, practical ways to trim costs, starting with a clear understanding of how clusters operate. Spending a little time learning about cluster settings, available instance families, and default configurations goes a long way. That awareness empowers teams to make informed choices that improve performance and control spending, rather than launching clusters without consideration.

Cluster policies in action (UI + Terraform)

Let’s start with one of the simplest but most impactful tactics: implementing cluster policies. Whether your cluster is used for interactive notebooks or automated jobs, assigning a policy ensures consistency and prevents misuse. In our workspace, we require every user to apply one. These policies define limits and guardrails on the clusters that can be created, helping us avoid unnecessary or costly setups.
The main goal here is cost containment. For instance, we only make a small set of machine learning cluster types available by default, and open access to others only when justified. This prevents users from accidentally spinning up expensive configurations without realizing the impact. Through these policy-driven constraints, we keep the environment efficient and sustainable.
Below are two examples: one from the Databricks UI and another from Terraform, showing how cluster policies are configured.
  • UI: When you create a cluster, you’ll see a “Policy” dropdown. Select from approved policies. For example, when creating a personal test cluster, I chose Personal Compute.
  • Terraform: You can apply a cluster policy using the policy_id attribute. For scheduled job clusters, reference the policy by its ID (for example, Job Compute). In the UI, you’ll see readable names, but Terraform requires the actual policy ID string.
Selecting cluster policy screenshot: UI and Terraform script
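Here’s a rough Terraform sketch of this pattern. The policy name, instance types, and limits below are illustrative placeholders, not our actual policy:

resource "databricks_cluster_policy" "job_compute" {
  name = "Job Compute"   # assumption: matches the policy name referenced in the UI

  definition = jsonencode({
    node_type_id = {
      type   = "allowlist"
      values = ["m5.xlarge", "m5.2xlarge"]   # only approved, cost-effective instance types
    }
    autotermination_minutes = {
      type   = "fixed"
      value  = 30
      hidden = true
    }
  })
}

resource "databricks_cluster" "scheduled_job" {
  cluster_name  = "scheduled-job-cluster"   # hypothetical name
  policy_id     = databricks_cluster_policy.job_compute.id
  spark_version = "15.4.x-scala2.12"        # assumption: see the runtime section below
  node_type_id  = "m5.xlarge"
  num_workers   = 2
}

Because the policy restricts node types and fixes autotermination, every cluster created under it inherits those guardrails automatically.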

Choosing the right Spark version

Databricks frequently updates its runtime versions, each of which bundles a specific Apache Spark release. We currently use Databricks Runtime 15.4. Each major version (like 15) includes multiple sub-releases, such as 15.1 and 15.3, plus one marked as LTS (Long-Term Support).
  • LTS versions are stable and fully supported for production workloads.
  • Non-LTS versions may introduce experimental features and lack long-term guarantees.
Upgrading to the most recent runtime typically improves performance and cost efficiency. During our last review, I made sure all default clusters pointed to the latest LTS runtime. Exceptions exist, for example when certain libraries only support an older 14.x release, but using much older versions (like 11 or 12) risks compatibility issues and degraded performance.
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}
Screenshot of using the latest Spark version: UI and Terraform script
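The data source’s id can then feed cluster definitions, so rebuilt clusters automatically pick up the current LTS runtime. The resource below is a minimal illustration, not our production config:

resource "databricks_cluster" "etl" {
  cluster_name  = "example-etl"                                  # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id   # always the latest LTS runtime
  node_type_id  = "m5.xlarge"                                    # assumption: pick per workload
  num_workers   = 2
}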

Driver node: always use on-demand

Each Spark job runs with one Driver node and multiple Worker nodes. The Driver orchestrates the workload, manages task execution, and holds application state. If the Driver fails, the entire job fails regardless of worker stability.
That’s why the Driver node should always run on on-demand infrastructure. Spot instances are cheaper but unreliable. If a spot Driver instance is preempted or unavailable, your entire application can stop or restart, causing unnecessary downtime.
Screenshot of selecting Driver node on demand
When setting up clusters in the UI:
  • Open Advanced Options > Spark.
  • Configure the Driver node to use on-demand instances.
  • For development environments, Worker nodes can be spot or spot-with-fallback.
  • In production, it’s safest to run both Driver and Worker nodes on on-demand instances to ensure reliability.
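In Terraform (AWS), the equivalent guardrail lives in the aws_attributes block; setting first_on_demand to at least 1 keeps the Driver on on-demand capacity even when Workers run on spot. The values here are illustrative:

resource "databricks_cluster" "dev_cluster" {
  cluster_name  = "dev-spot-workers"                             # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"
  num_workers   = 2

  aws_attributes {
    availability    = "SPOT_WITH_FALLBACK"   # Workers use spot, falling back to on-demand
    first_on_demand = 1                      # the first node (the Driver) always runs on-demand
  }
}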

Spot vs. on-demand: cost vs. reliability

Spot instances can lower compute costs by up to 60% compared to on-demand. However, since they rely on unused cloud capacity, they can be reclaimed at any time, interrupting jobs.
Table comparing use cases and recommendations
Databricks also provides a Spot with Fallback to On-Demand option. The system first attempts to launch spot instances but automatically switches to on-demand if none are available. This hybrid approach balances reliability with savings, though it reduces cost benefits once the fallback kicks in.
Example: For streaming workloads (like Airbyte ingestion), I avoid pure spot instances. Continuous pipelines can’t tolerate unexpected termination, which can result in lost data or failed batches. In those cases, on-demand or fallback modes are preferable.
For non-critical dev jobs, pure spot instances can be a smart way to save money. For production or latency-sensitive pipelines, always stick with on-demand.

Spot bidding strategy

Spot instances run on a bidding system. In our configuration, the bid price is set to 100% of the on-demand rate, meaning we’re willing to pay full price if spot capacity is available. This helps maintain high availability while still benefiting from discounted rates.
If you’re okay with a lower success rate in exchange for higher savings, reduce the bid percentage — say, 80% or less — depending on workload tolerance.
In the UI, scroll right under Worker Type and Driver Type to toggle between on-demand and spot options for each node type.
  • For production or streaming jobs, keep both the Driver and Workers on on-demand.
  • For dev or test clusters, use spot or spot with fallback, where interruptions are acceptable.
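In Terraform, the bid percentage maps to spot_bid_price_percent inside the same aws_attributes block. Again, a sketch with illustrative values:

resource "databricks_cluster" "dev_spot_bid" {
  cluster_name  = "dev-spot-bid"                                 # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"
  num_workers   = 2

  aws_attributes {
    availability           = "SPOT_WITH_FALLBACK"
    first_on_demand        = 1
    spot_bid_price_percent = 100   # bid up to the full on-demand price; lower it (e.g. 80) to trade availability for savings
  }
}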

EBS volume optimization

Databricks automatically provisions Elastic Block Store (EBS) volumes when clusters are created. Without customization, these default volumes can quickly inflate costs.
Example default: 100GB × 3 volumes = 300GB per cluster. Even if unused, those volumes incur storage charges.
Most of our jobs didn’t require that much local disk. Unless you’re running workloads with:
  • Heavy data shuffling
  • Memory spillover
  • Large temporary files
…you likely don’t need large EBS volumes. We changed our defaults to:
  • Volume size: 32GB
  • Count: 1
In some lightweight workloads, we removed EBS volumes entirely. Always verify actual usage before accepting defaults. It’s an easy way to eliminate silent cost creep.
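In Terraform, the EBS defaults are overridden in aws_attributes as well. A sketch matching the 1 × 32GB setup described above (the instance type is illustrative):

resource "databricks_cluster" "light_etl" {
  cluster_name  = "light-etl"                                    # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"
  num_workers   = 2

  aws_attributes {
    ebs_volume_type  = "GENERAL_PURPOSE_SSD"
    ebs_volume_count = 1    # one volume instead of the default three
    ebs_volume_size  = 32   # GB per volume; enough for light shuffle and spill
  }
}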

Choose auto-scale

Another simple but high-impact setting is auto-scaling. Enabling this in your cluster configuration allows Databricks to automatically adjust the number of worker nodes depending on job demand.
When workloads spike, the cluster scales out; when idle, it scales back in. You only pay for the compute you actively use, rather than provisioning for peak load that rarely occurs. This setting is ideal for fluctuating data pipelines or variable ETL workloads and reduces manual oversight, freeing engineers from constant tuning.
Screenshot of setting auto-scale: UI and Terraform
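In Terraform, autoscaling replaces a fixed num_workers with an autoscale block; the bounds below are illustrative:

resource "databricks_cluster" "pipeline" {
  cluster_name  = "variable-etl"                                 # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"

  autoscale {
    min_workers = 2   # baseline capacity while idle or under light load
    max_workers = 8   # ceiling for peak demand, which also caps the worst-case cost
  }
}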

Availability zones: use auto for flexibility

My Databricks workspace runs in US East (Ohio), which includes three availability zones — us-east-2a, us-east-2b, and us-east-2c. Locking your cluster to a single zone can backfire: if that zone runs out of capacity, cluster creation will fail or be delayed.
To prevent that, I configured the “auto” setting in our cluster policies. This lets Databricks select any available zone dynamically, improving reliability and startup speed.
While this doesn’t directly reduce cost, it enhances cluster performance and avoids unnecessary wait times.
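In Terraform, this is a one-line setting in aws_attributes; a minimal sketch:

resource "databricks_cluster" "zone_flexible" {
  cluster_name  = "zone-flexible"                                # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"
  num_workers   = 2

  aws_attributes {
    zone_id = "auto"   # let Databricks pick whichever zone currently has capacity
  }
}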

Selecting the right cluster family

Databricks offers several cluster families across major cloud providers (AWS, Azure, GCP), each designed for different workload types:
  • General Purpose: Balanced CPU and memory
  • Compute Optimized (C): Best for ETL and transformation-heavy jobs
  • Memory Optimized (R): Ideal for joins, caching, and memory-intensive operations
  • Storage Optimized (I): Designed for high I/O throughput
Using the Databricks Pricing Calculator helps estimate monthly costs and compare instance families easily.
If you’re unsure where to start, choose General Purpose. Once you understand workload characteristics, fine-tune from there.
In one case, a team was running a memory-heavy workload on compute-optimized nodes. Switching to a memory-optimized cluster both improved performance and cut monthly costs by 75%.
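If you manage clusters with Terraform, the databricks_node_type data source can select the smallest instance that satisfies a requirement instead of hard-coding a family. The filters below are illustrative, assuming a memory-heavy join workload:

# Smallest available node type with at least 64GB RAM and 8 cores
data "databricks_node_type" "memory_heavy" {
  min_memory_gb = 64
  min_cores     = 8
  local_disk    = true
}

resource "databricks_cluster" "joins" {
  cluster_name  = "memory-heavy-joins"                           # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = data.databricks_node_type.memory_heavy.id
  num_workers   = 2
}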

Real-world example: Google Sheets job optimization

Here’s an example from a Google Sheets ingestion pipeline.
Before Optimization:
  • Node type: c4.4xlarge (compute-optimized)
  • Job behavior: Reads data from multiple Google Sheets into DataFrames and writes them downstream; the work is primarily memory-bound, not compute-heavy.
  • Runtime: ~7 minutes per job, ~5 hours total per month
Cost comparison of the cluster family switch
Switching to a memory-optimized r3.xlarge instance reduced costs by about 75% monthly while maintaining the same output.
Memory metrics before and after
Before switching, memory utilization regularly hit 100%, which is risky. Ideally, you want it under 80%. After the change, usage stabilized with additional headroom, confirming that caching worked more efficiently.
Cluster memory usage before switching cluster family type
Cluster memory usage after switching cluster family type
This single adjustment eliminated memory-related failures and improved job reliability.
There are more ways to refine memory use, such as better caching logic and code optimizations, which I’ll cover in future posts.

Add cluster tags

While tags don’t directly affect cost or performance, they’re invaluable for visibility and governance. Tagging clusters by owner, project, environment, or cost center makes it easier to track spending and analyze usage patterns.
They also help with monitoring, alerting, and debugging across large shared workspaces. When an unusual cost spike appears, tags make it immediately clear which team or workflow is responsible. It’s a small best practice that pays off in operational transparency.
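In Terraform, tags are a simple custom_tags map on the cluster resource; the keys and values below are examples of what we find useful, not a required schema:

resource "databricks_cluster" "tagged" {
  cluster_name  = "ingest-orders"                                # hypothetical name
  spark_version = data.databricks_spark_version.latest_lts.id
  node_type_id  = "m5.xlarge"
  num_workers   = 2

  custom_tags = {
    "Owner"       = "data-platform-team"
    "Project"     = "orders-ingestion"
    "Environment" = "production"
    "CostCenter"  = "analytics"
  }
}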

Unlocking speed and savings with Databricks serverless clusters

Whenever possible, I recommend adopting Databricks Serverless clusters, especially while discounted through April 30. Serverless reduces compute costs and developer overhead by handling configuration, scaling, and cluster management automatically.
For development teams, this eliminates idle spin-up time and allows engineers to start coding immediately. Serverless clusters are fast, efficient, and reduce the maintenance burden that traditional cluster setups require.
If you’re running large workloads on a specific cluster type, it’s also worth negotiating preferred rates with your Databricks account team. I’ll share a separate post soon on migrating jobs to Serverless, a process I’ve already implemented with strong results.
Screenshot of a Databricks workflow runtime on a job cluster and serverless cluster

Final thoughts & resources

Many of these recommendations align with Databricks’ own Best Practices for Cost Optimization guidance. Of course, not every suggestion applies universally. What matters most is understanding your workload patterns and balancing cost against performance. Avoid using defaults blindly, and regularly review configurations for inefficiencies.
With deliberate configuration choices, cluster policies, and a little experimentation, Databricks can deliver exceptional performance without overspending.