June 30, 2023

Optimize Costs by Migrating ELT from Cloud Data Warehouses to Data Lakehouses

Many customers have turned to popular cloud warehouse engines like Snowflake, Amazon Redshift, and Google BigQuery for resource-intensive and expensive ELT (extract, load, and transform) and SQL merge operations. However, they may not be aware of the potential cost savings achievable by adopting an interoperable data lakehouse approach instead.

In this blog, we will explore the advantages and feasibility of offloading ELT operations from a cloud data warehouse, highlighting the potential benefits of flexibility and cost savings that can be realized by performing ELT in a data lakehouse.

ELT in cloud data warehouse/lakehouse

Cloud Data Warehouse vs. Data Lakehouse 

The need to leverage unstructured and semi-structured data has grown significantly in recent years, with IDC estimating that 80% of data will be unstructured by 2025. Data lakes were originally designed to accommodate such data types, along with structured data. Cloud storage systems like Amazon S3, Google Cloud Storage (GCS), and Azure Data Lake Storage (ADLS) have made data lakes more accessible and scalable, enabling organizations to store vast amounts of data for predictive analytics, machine learning (ML), and AI applications.

Like on-premises data warehouses, cloud data warehouses have served as centralized repositories for structured and semi-structured data, offering features such as data loading, transformation, and analytics through business intelligence (BI) tools. Cloud data warehouses like Snowflake, Amazon Redshift, and Google BigQuery leverage cloud storage solutions to facilitate complex, parallel processing and transformations of large volumes of data. While cloud data warehouses often outperform data lakes for traditional BI workloads and provide an excellent SQL experience, they are generally more expensive and often lack built-in support for machine learning and AI use cases.

Consequently, many organizations resort to a hybrid architecture, maintaining multiple copies of data in both data warehouses and data lakes. However, this approach introduces challenges of data fragmentation, making it difficult to keep data synchronized and to onboard new use cases, resulting in increased maintenance and storage costs.

Cloud data architecture is broken

The modern data lakehouse architecture, pioneered by Uber in 2016, is becoming highly popular for managing large-scale real-time data pipelines. It combines the strengths of data lakes and data warehouses into a unified platform, enabling storage and analysis of the full range of data types - unstructured, semi-structured, and structured. This architecture supports various analytics capabilities, including business intelligence (BI), predictive analytics, machine learning (ML), and AI, while leveraging the scalability and cost-effective storage of data lakes.

The lakehouse architecture is typically built upon one of three key open source projects: Apache Hudi, Apache Iceberg, and Delta Lake. These projects have contributed to the rapid growth and adoption of the data lakehouse concept. Although each project offers unique benefits, they share core attributes that define the lakehouse architecture. These include ACID transaction support, the ability to apply SQL on top of the data lake, role-based access control (RBAC), and a unified storage layer for handling various data types, including both structured and less structured data.
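To make these shared attributes a bit more concrete, here is a minimal PySpark sketch of “SQL on top of the data lake.” It assumes the Apache Hudi Spark bundle is available on the classpath; the table name, columns, and S3 path are purely illustrative, and the exact configuration varies by Hudi and Spark version:

```python
# Minimal sketch: an ACID table on cloud object storage, queried with plain SQL.
# Assumes the Apache Hudi Spark bundle is available; names and paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sql-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    # Depending on the Spark version, Hudi may also require its catalog configuration.
    .getOrCreate()
)

# Create a transactional (ACID) table directly on cloud object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE,
        updated_at  TIMESTAMP
    )
    USING hudi
    LOCATION 's3://my-bucket/lakehouse/orders'
    TBLPROPERTIES (primaryKey = 'order_id', preCombineField = 'updated_at')
""")

# Plain SQL analytics on the same open table, with no separate warehouse copy required.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
""").show()
```

Because the table lives in an open format on object storage, any engine that understands that format can work with the same copy of the data.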

Ultimately, the choice between a cloud data warehouse and a data lakehouse depends on specific business needs. Factors such as the volume and variety of data, the desired level of data organization and governance, and the need for real-time analytics should all be considered. Hybrid approaches that combine elements of both architectures in deliberate ways are also gaining traction, as they allow organizations to leverage the strengths of each approach while mitigating their limitations.

What is ELT?

ELT (extract, load, and transform) is a new take on an old approach. ETL (extract, transform, and load) is the decades-old technique used to take data from one source, transform it, then write it to a target. An entire industry of companies has been founded simply to handle the “T” component: transforming data. ELT is a variation of ETL where data is extracted and loaded before undergoing any further transformation.

In ELT, raw data is extracted and loaded directly into the target data warehouse, where transformations such as data cleansing, de-duping, surrogate key generation, multi-table joins, and data enrichment take place. Raw data is stored in the staging area in the data warehouse for an extended period of time, enabling data validation, repeated runs of the ELT pipelines, and data lineage tracking, without rewinding and reprocessing from an upstream source like a relational database (RDBMS), which can become operationally burdensome and risk production outages.
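As a rough sketch of this pattern, the hypothetical PySpark example below first lands raw change records in a staging table as-is, then de-dupes them and merges the latest record per key into a curated table with a SQL MERGE. The table names, columns, and paths are made up, and it assumes a MERGE-capable table format (such as Apache Hudi) and that the staging and curated databases already exist:

```python
# Hypothetical ELT sketch: load raw data first, transform afterwards with SQL.
# Names and paths are illustrative; assumes a MERGE-capable table format (e.g., Apache Hudi)
# and that the "staging" and "curated" databases already exist.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-merge-sketch").getOrCreate()

# 1) Extract + Load: land the raw change records in a staging table, untransformed.
raw = spark.read.json("s3://my-bucket/raw/orders/2023-06-30/")
raw.write.mode("append").saveAsTable("staging.raw_orders")

# 2) Transform: keep only the latest change per key, then merge into the curated table.
spark.sql("""
    MERGE INTO curated.orders AS t
    USING (
        SELECT order_id, customer_id, amount, updated_at
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
            FROM staging.raw_orders
        ) ranked
        WHERE rn = 1
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

The same two steps run whether the target is a cloud data warehouse or a lakehouse table; what differs is where (and at what cost) the merge executes.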

The cost of ELT on a cloud data warehouse

While pay-as-you-go pricing is a notable advantage of cloud data warehouses, using them as ELT tools can be more expensive than alternatives such as Apache Spark or Apache Flink. The cost of cloud data warehouses heavily depends on the ELT usage pattern. ELT workloads typically involve high data volumes, complex streaming merges, and critical service level agreement (SLA) requirements, since they affect data freshness across the entire data architecture. Cloud data warehouses were not engineered for such demanding workloads. On the other hand, ELT workloads may not require the advanced SQL features of a premium data warehouse engine, presenting a potential for cost savings by choosing judiciously.

Numerous public blogs and whitepapers have addressed this topic, such as The Modern Data Cloud: Warehouse vs Lakehouse, Data Warehouse vs Data Lake vs Data Lakehouse, Cloud Data Warehouses and Data Lakehouses side-by-side evaluation, and Why Lakehouse over Data Warehouse. Many of these resources only offer high-level discussions without supporting evidence. In this blog, we aim to solidify this argument by presenting a real customer use case that exemplifies the cost benefit of performing ELT in a data lakehouse vs. a cloud data warehouse.

Cost of performing ELT in a cloud data warehouse

As an example, a Onehouse customer utilized a cloud ingestion tool to load approximately 25-30 TB of mutable data per year into a popular cloud data warehouse (CDW_X), followed by executing SQL merge operations within CDW_X into the staging/raw layer. This process incurred an annual total cost of around $1.16 million, with the third-party ingestion tool accounting for approximately $160K and the merge operations costing approximately $1.0 million in CDW_X.

The case for performing ELT on a data lakehouse

Loading data into a data lakehouse brings scalability, ease of access, low compute costs, and operational efficiency. The storage and processing capabilities of today’s cloud have led to a shift in approach, favoring transforming data after it has been ingested and replicated into the cloud.

In the aforementioned customer use case, by replacing the cloud ingestion tool and the SQL merge process in CDW_X with a Onehouse data lakehouse powered by Apache Hudi, the estimated total cost could be reduced from $1.16 million to approximately $200K per year, resulting in a massive cost saving of $960K, roughly 80%. It also significantly improved performance and enhanced data processing capabilities with continuous incremental updates, reducing data latency from 12-24 hours to minutes. This achievement is remarkable, as it showcases a rare scenario where cost reduction and improved performance are achieved simultaneously.
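To give a sense of what this looks like mechanically on the lakehouse side, below is an illustrative PySpark sketch, not the customer’s actual pipeline; the paths, columns, and table names are assumptions. It upserts a batch of change records into an Apache Hudi table on object storage, the kind of continuous incremental merge that would otherwise run as SQL MERGE statements inside the warehouse:

```python
# Illustrative sketch only, not the customer's pipeline: upsert a batch of change
# records into an Apache Hudi table on cloud storage instead of merging in CDW_X.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Hypothetical batch of CDC records, extracted and loaded to the lake as-is.
changes = spark.read.parquet("s3://my-bucket/cdc/orders/latest/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",       # merge/de-dup key
    "hoodie.datasource.write.precombine.field": "updated_at",    # latest record wins
    "hoodie.datasource.write.partitionpath.field": "order_date", # assumed partition column
    "hoodie.datasource.write.operation": "upsert",               # merge semantics
}

(
    changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lakehouse/orders")
)
```

Run continuously on small incremental batches, this is the kind of pipeline that replaces the 12-24 hour batch merges described above.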

Furthermore, the transition to the new managed pipeline in the data lakehouse powered by Onehouse offers more than just cost savings. 

  1. Open & Interoperable: The new architecture unlocks a unified data architecture, where different use cases like BI, stream processing, OLTP CDC ingestion, data science, machine learning, and AI can all sit on top of the same open data layer, without needing duplicate data copies for each.
  2. Reliable Source of Truth: Data entering the cloud through ELT typically acts as the “source of truth” for the entire business. Having a single entry point feeds the same data to various downstream data consumers like data analysts, analytics engineers, and data scientists, giving them a traceable, well-trusted baseline for resolving data discrepancies.
  3. Future Proof: Managing data in the ELT layer in open formats, decoupled from any specific data vendor, ensures the business can bring any new query engine to it, without being beholden to vendor lock-in or vendor roadmaps. The data stored in any downstream system can be entirely recomputed if a vendor ceases to operate or a new vendor needs to be onboarded.

Conclusion

We are very excited about the opportunity to offload costly ELT operations from the cloud data warehouse to a data lakehouse, as it presents numerous advantages for customers. This approach lays the foundation for a successful implementation of a flexible and cost-effective tiered data lakehouse architecture, which has already gained popularity among numerous large-scale, data-driven organizations. Leveraging the Onehouse Managed Lakehouse platform, customers can efficiently ingest and transform data within the data lakehouse, ensuring solid performance and cost optimization throughout the entire data lifecycle.

If you want to learn more about Onehouse and would like to give it a try, please visit the Onehouse listing on the AWS Marketplace or contact gtm@onehouse.ai.
