What is Databricks Liquid Clustering?

Learn how Databricks Liquid Clustering improves query performance, reduces costs, and simplifies data optimization for large, dynamic datasets.

Databricks has introduced liquid clustering, a new approach to organizing data in Databricks. It replaces traditional optimization techniques such as partitioning and Z-Ordering and overcomes their limitations. With liquid clustering, organizations can improve query performance and reduce the cost of running queries.

In this blog, we will delve into Databricks liquid clustering, its advantages, implementation, and real-world impact.

How is Data Organized in Databricks?

Before diving into liquid clustering, let’s understand the ways in which data is organized in Databricks.

1. Partitioning: The First Level of Organization

Partitioning has been the foundation of organizing data in data warehouses and data lakes alike. It involves:

However, it has a few limitations. For example, too many partitions can lead to the small file problem, which drastically degrades query performance. Partitioning is also only effective for queries that filter on the partition columns.
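Both limitations can be illustrated with a small, self-contained Python sketch. It is a toy model, not how Databricks stores data: the column names `txn_date` and `merchant` and the helper functions are assumptions made up for illustration.

```python
from collections import defaultdict


def build_partitions(rows, partition_col):
    """Group rows into partitions keyed by the value of partition_col."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_col]].append(row)
    return dict(partitions)


def partitions_scanned(partitions, partition_col, filter_col, value):
    """Count the partitions a query `filter_col == value` has to read."""
    if filter_col == partition_col:
        # Pruning applies: only the matching partition is read.
        return 1 if value in partitions else 0
    # The filter is not on the partition column, so nothing can be pruned.
    return len(partitions)


# Hypothetical data: one row per day, so every partition holds a single tiny "file".
rows = [
    {"txn_date": f"2024-01-{day:02d}", "merchant": f"m{day % 3}"}
    for day in range(1, 31)
]
parts = build_partitions(rows, "txn_date")

print(len(parts))  # 30
print(partitions_scanned(parts, "txn_date", "txn_date", "2024-01-15"))  # 1
print(partitions_scanned(parts, "txn_date", "merchant", "m1"))  # 30
```

Thirty rows end up in thirty partitions, which is the small file problem in miniature, and the query on `merchant` gets no benefit from the partitioning at all.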

2. Z-Ordering

Z-Ordering is a technique that colocates related data in the same set of files. It is often considered a better approach than partitioning.

Delta tables in Delta Lake automatically take advantage of this colocation through data-skipping algorithms, which drastically reduce the amount of data that needs to be read.
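To make the idea concrete, here is a minimal Python sketch of colocation plus min/max data skipping. It uses a plain bit-interleaving (Morton) curve on a 4x4 grid of points and should be read as an illustration of the general technique, not as Delta Lake's actual implementation; all function names are hypothetical.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(points, sort_key, rows_per_file):
    """Sort points, split them into files, and keep min/max stats per file."""
    ordered = sorted(points, key=sort_key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        xs = [p[0] for p in chunk]
        ys = [p[1] for p in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def files_read(files, column, value):
    """Count files whose min/max range for `column` could contain `value`."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])


points = [(x, y) for x in range(4) for y in range(4)]
linear = write_files(points, sort_key=lambda p: p, rows_per_file=4)
zorder = write_files(points, sort_key=lambda p: z_value(p[0], p[1]), rows_per_file=4)

# Files read for the filters x == 0 and y == 0 under each layout.
print(files_read(linear, "x", 0), files_read(linear, "y", 0))  # 1 4
print(files_read(zorder, "x", 0), files_read(zorder, "y", 0))  # 2 2
```

Sorting on `x` alone skips well for `x` but not at all for `y`; the Z-order layout skips half of the files for a filter on either column.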

Although it is a good technique, it comes with its own limitations like:

3. Liquid Clustering: Latest Feature

Liquid Clustering and Automatic liquid clustering are the latest advancements from Databricks which overcome the limitations of traditional approaches mentioned above.

Note: Databricks recommends using Databricks Runtime 15.2 and above for all tables that are liquid clustering enabled.

Liquid clustering allows you to redefine clustering keys without rewriting existing data. It can be beneficial for the following use cases:

To enable liquid clustering, add a CLUSTER BY clause to the CREATE TABLE statement.
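As a rough sketch, the statements involved can be assembled in Python as shown here. The table name `transactions_demo`, the column names, and both helper functions are hypothetical; only the `CREATE TABLE ... CLUSTER BY (...)` and `ALTER TABLE ... CLUSTER BY (...)` SQL forms come from Databricks. In a Databricks notebook the resulting strings would typically be passed to `spark.sql(...)`.

```python
MAX_CLUSTERING_KEYS = 4  # Databricks documents a limit of four clustering keys.


def _check_keys(cluster_by):
    """Reject empty or oversized clustering key lists."""
    if not 1 <= len(cluster_by) <= MAX_CLUSTERING_KEYS:
        raise ValueError(f"Expected 1 to {MAX_CLUSTERING_KEYS} clustering keys")


def create_clustered_table_sql(table, columns, cluster_by):
    """Build a CREATE TABLE statement with a CLUSTER BY clause."""
    _check_keys(cluster_by)
    unknown = [c for c in cluster_by if c not in columns]
    if unknown:
        raise ValueError(f"Clustering keys not in schema: {unknown}")
    column_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return f"CREATE TABLE {table} ({column_defs}) CLUSTER BY ({', '.join(cluster_by)})"


def alter_clustering_sql(table, cluster_by):
    """Build the statement that redefines clustering keys on an existing table."""
    _check_keys(cluster_by)
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_by)})"


# Hypothetical schema used only for this illustration.
columns = {
    "transaction_id": "BIGINT",
    "customer_id": "BIGINT",
    "transaction_date": "DATE",
    "amount": "DOUBLE",
}
create_stmt = create_clustered_table_sql(
    "transactions_demo", columns, ["customer_id", "transaction_date"]
)
alter_stmt = alter_clustering_sql("transactions_demo", ["transaction_date"])

print(create_stmt)
print(alter_stmt)
```

The second statement changes the keys used for newly written data without forcing a rewrite of what is already stored.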

How LumenData Implemented Databricks Liquid Clustering for a FinTech client: Use Case

Recently, LumenData had an opportunity to implement Databricks liquid clustering for one of our FinTech clients. The feature helped us significantly improve query performance and reduce compute cost.

Client Challenge:

Our client processes millions of records daily and stores them in Delta Lake on Databricks. Multiple teams fetch data from multiple tables and build gold-layer views for further analysis.

One team focuses on risk management and takes a deep dive into regulatory compliance. Other teams analyze customer behaviour by examining spending patterns.

A few issues the client faced with the traditional approaches (partitioning and Z-Ordering)

Steps for Implementing Liquid Clustering

Step 1: Prepare Sample Transaction Data

For our example, we prepared similar dummy transaction data (~5M records) in Databricks with the help of Python libraries.
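A minimal sketch of how such dummy data could be generated with the Python standard library follows. The column names, value ranges, and the `generate_transactions` helper are illustrative assumptions, not the client's schema or the exact code used for this demo.

```python
import random
from datetime import date, timedelta


def generate_transactions(n, seed=42):
    """Return n dummy transaction rows as dictionaries (hypothetical columns)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    start = date(2024, 1, 1)
    txn_types = ["purchase", "refund", "transfer", "withdrawal"]
    rows = []
    for i in range(n):
        rows.append({
            "transaction_id": i,
            "customer_id": rng.randint(1, 10_000),
            "merchant_id": rng.randint(1, 500),
            "transaction_date": start + timedelta(days=rng.randint(0, 90)),
            "transaction_type": rng.choice(txn_types),
            "amount": round(rng.uniform(1.0, 5000.0), 2),
        })
    return rows


# A small sample; the demo in this post used roughly 5M records.
sample = generate_transactions(1_000)
print(len(sample))
print(sorted(sample[0].keys()))
```

In a notebook, rows like these could then be loaded into a Spark DataFrame and written out as a table.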

Below is the schema for the data

Step 2: Traditional Approach - Partitioning and Z-Ordering

We created a partitioned table and then applied Z-Ordering to it.

Table name: financial_transactions_partitioned_lc_demo

We have highlighted the num_files value, which shows that 728 files were generated for the table above.

After applying Z-Ordering, let’s see the effect on optimization and data organization for the partitioned table.

The number of files was reduced to 91, with 728 partitions optimized. This is how Z-Ordering helps with data organization and query performance.
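The workflow in this step can be sketched as a sequence of SQL statements assembled in Python. The table name is the one used above; the column names, the choice of partition and Z-Order columns, and the helper function are assumptions for illustration. `DESCRIBE DETAIL` is included because it reports the file count of a Delta table.

```python
def partition_and_zorder_statements(table, columns, partition_col, zorder_cols):
    """Build the statements for a partitioned table followed by Z-Ordering."""
    if partition_col in zorder_cols:
        # Delta Lake rejects Z-Ordering on a partition column.
        raise ValueError("Z-Order columns must not include the partition column")
    column_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return [
        f"CREATE TABLE {table} ({column_defs}) PARTITIONED BY ({partition_col})",
        f"DESCRIBE DETAIL {table}",  # file count before optimization
        f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})",
        f"DESCRIBE DETAIL {table}",  # file count after optimization
    ]


# Hypothetical columns; the real schema may differ.
columns = {
    "transaction_id": "BIGINT",
    "customer_id": "BIGINT",
    "merchant_id": "BIGINT",
    "transaction_date": "DATE",
    "amount": "DOUBLE",
}
statements = partition_and_zorder_statements(
    "financial_transactions_partitioned_lc_demo",
    columns,
    partition_col="transaction_date",
    zorder_cols=["customer_id", "merchant_id"],
)
for stmt in statements:
    print(stmt)
```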

Step 3: Implementing Liquid Clustering

Now, let’s implement Liquid Clustering on the same dataset:

We created the table below and specified the clustering columns with CLUSTER BY.
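One way to build such a table from the existing partitioned one is a CTAS statement followed by OPTIMIZE, sketched here in Python. The target table name `financial_transactions_liquid_lc_demo`, the clustering columns, and the helper function are hypothetical. Note that liquid clustering cannot be combined with partitioning or ZORDER on the same table, which is why a separate table is created.

```python
def liquid_clustering_ctas_statements(source_table, target_table, cluster_by):
    """Build a CTAS statement with CLUSTER BY, plus the OPTIMIZE that clusters the data."""
    if not cluster_by:
        raise ValueError("At least one clustering key is required")
    keys = ", ".join(cluster_by)
    return [
        # CLUSTER BY goes after the table name, not inside the SELECT.
        f"CREATE TABLE {target_table} CLUSTER BY ({keys}) AS SELECT * FROM {source_table}",
        # OPTIMIZE takes no ZORDER BY clause on a liquid clustered table.
        f"OPTIMIZE {target_table}",
    ]


lc_statements = liquid_clustering_ctas_statements(
    source_table="financial_transactions_partitioned_lc_demo",
    target_table="financial_transactions_liquid_lc_demo",  # hypothetical name
    cluster_by=["customer_id", "transaction_date"],  # hypothetical keys
)
for stmt in lc_statements:
    print(stmt)
```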

Step 4: Query Performance Comparison

Let’s run some sample queries to compare performance:

We have created a set of queries which mimic the original client queries.

Query 1 -> Fraud Detection Query

Query 2 -> Regulatory Compliance query

Query 3 -> Customer behaviour analysis query

We compared traditional Z-Ordering with liquid clustering for each of the queries above.

In this implementation, we achieved a significant improvement of 40% for the third query.
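A comparison of this kind can be sketched with a small timing harness. The helpers below are hypothetical, and the two workloads are stand-ins so the snippet runs anywhere; in a notebook each callable would instead run the same query against the Z-Ordered table and the liquid clustered table, for example by wrapping `spark.sql(query).collect()`.

```python
import time
from statistics import median


def time_query(run_query, repeats=3):
    """Run a callable several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return median(timings)


def improvement_pct(baseline_seconds, candidate_seconds):
    """Percentage reduction in runtime relative to the baseline."""
    return (baseline_seconds - candidate_seconds) * 100 / baseline_seconds


# Stand-in workloads so the sketch is runnable without a cluster.
zorder_seconds = time_query(lambda: sum(range(200_000)))
liquid_seconds = time_query(lambda: sum(range(100_000)))
print(f"Z-Order: {zorder_seconds:.4f}s, liquid clustering: {liquid_seconds:.4f}s")

# A query that drops from 10 s to 6 s is a 40% improvement.
print(improvement_pct(10.0, 6.0))  # 40.0
```

Taking the median over several runs reduces the influence of caching and cluster warm-up on any single measurement.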

In real-world deployments, Liquid Clustering typically provides:

How Liquid Clustering Works

Liquid Clustering enables:

How to Choose Between Partitioning, Z-Ordering, and Liquid Clustering

Use Partitioning When:

Use Z-Ordering When:

Use Liquid Clustering When:

Conclusion

Databricks liquid clustering is a significant advancement in underlying data organization for complex analytical workloads.

For financial institutions that handle billions of records along with complex datasets and diverse query patterns, liquid clustering is a compelling solution that offers reduced costs and minimal operational overhead.

Author

Ritesh Chidrewar

Senior Consultant
