The Field Guide to Databricks: Embracing the Lakehouse

Data, Analytics, and AI are at the center of innovation. What does it take to become an AI-driven organization?

Olya Lukashina
6 min read · Oct 31, 2022

The Field Guide to Databricks is a concise, readable and complete handbook for all Data Practitioners, presented as a series of blogs and created by the DNB Field Engineers. The Field Guide provides the tools necessary to discover and understand how the Databricks Lakehouse Platform can help organizations of all sizes leverage available data to accelerate and improve decision-making. Let’s begin!

A man reading a Field Guide to identify birds.
Joshua Heath/Getty Images, for The New York Times

Two months ago, I joined Databricks as an Associate Solutions Architect on the Digital Natives Team, working with some of the most sophisticated data engineering and data science teams in the world. I truly believe that Databricks is the most innovative, strategic company in the data space. Databricks is helping organizations simplify and democratize their most valuable asset — data — in order to tackle some of the world’s toughest problems.

The Customer Stories are inspiring! With the Databricks Lakehouse Platform, AT&T decreased fraud by 70%-80%, Amgen accelerated the drug discovery process to cure severe illnesses, H&M improved operational efficiency on their journey to sustainable fashion, and Shell underwent a digital transformation to deliver cleaner energy solutions. These successful organizations are leveraging data and AI to create a tremendous amount of business value, and you can too!

What is the Impact of Data?

data (noun) — facts or information, especially when examined and used to find out things or to make decisions

According to McKinsey, “By 2025, smart workflows and seamless interactions among humans and machines will likely be as standard as the corporate balance sheet, and most employees will use data to optimize nearly every aspect of their work.” The amount of data in our world has been exploding across every industry and business function. The ability to extract usable and timely information from many diverse data sources is quickly becoming the key competitive advantage for organizations. Big Data, a term for data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods, has the potential for Big impact.

Organizations must use data to power both Business Intelligence and Artificial Intelligence, two complementary capabilities that are both necessary to paint a full picture of the business. By bridging AI and BI, organizations can unify the insights each one produces.

Business Intelligence (BI) — an umbrella term for the technology that enables data preparation, data mining, data management, and data visualization. Business intelligence tools and processes allow end users to identify actionable information from raw data, facilitating data-driven decision-making within organizations across various industries.

Artificial Intelligence (AI) — a field which combines computer science and robust datasets, to enable problem-solving. It also encompasses sub-fields of machine learning and deep learning, which are frequently mentioned in conjunction with artificial intelligence.

Business Intelligence uses analytics to draw conclusions about historical and current performance. For example, a company can use BI to track Key Performance Indicators (KPIs) and create simple-to-use reports and dashboards for decision makers. Artificial Intelligence is the science of building machines that perform tasks normally requiring human intelligence. For example, recommendation engines use past consumption data to serve real-time ads on e-commerce platforms. Simply put, Business Intelligence is Descriptive, while Artificial Intelligence is Prescriptive.
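
To make the distinction concrete, here is a minimal sketch in plain Python (not a Databricks API). The order data, the function names `revenue_by_category` and `recommend_next`, and the co-purchase heuristic are all made up for illustration: the first function describes what already happened, while the second suggests what to do next.

```python
from collections import Counter, defaultdict

# Hypothetical order history: (customer, product category, amount in USD).
ORDERS = [
    ("ana", "shoes", 80.0),
    ("ana", "socks", 10.0),
    ("ben", "shoes", 95.0),
    ("ben", "socks", 12.0),
    ("cy", "shoes", 70.0),
    ("cy", "hats", 25.0),
    ("dee", "shoes", 60.0),
]


def revenue_by_category(orders):
    """Descriptive (BI-style): summarize what already happened."""
    totals = defaultdict(float)
    for _customer, category, amount in orders:
        totals[category] += amount
    return dict(totals)


def recommend_next(orders, customer):
    """Prescriptive (AI-style, heavily simplified): suggest the category
    bought most often by customers who share a purchase with this one."""
    baskets = defaultdict(set)
    for cust, category, _amount in orders:
        baskets[cust].add(category)
    mine = baskets[customer]
    votes = Counter()
    for other, theirs in baskets.items():
        if other != customer and mine & theirs:
            # Count only categories this customer has not bought yet.
            votes.update(theirs - mine)
    return votes.most_common(1)[0][0] if votes else None
```

A real recommendation engine would use a trained model rather than a vote count, but the shape is the same: the KPI function looks backward, and the recommender proposes an action.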

A chart of Data and AI Maturity: the journey of data from descriptive to prescriptive.
Property of Databricks Inc.

Data maturity - a measure of an organization’s ability to utilize data to inform decisions.

According to the chart above, as organizations move to the right on the Data and AI Maturity curve, their competitive advantage in the industry grows sharply. Few can dispute the immense influence data has on a company’s success, but just how can organizations effectively begin to use data, analytics and AI to inform strategic and operational decisions?

The Journey to Data and AI Maturity

Before companies can make meaningful strides using data, they must first set up the fundamental building blocks of a successful, unified BI and AI practice. As discussed previously, the confluence of Business Intelligence and Artificial Intelligence is critically important. The challenge is that most companies continue to implement two different platforms: data warehouses for BI and data lakes for AI. The tool stacks on top of these platforms are fundamentally different, which creates an immense amount of complexity and causes most AI efforts to fail.

Diagram of the challenges of disjoint Data Warehouse and Data Lake.
Property of Databricks Inc.

Data Warehouses are very useful for structured data, but they weren’t designed for unstructured or semi-structured data, or for data with high variety and volume. According to IDC, about 80% of the world’s data will be unstructured by 2025, which makes Data Warehouses a poor fit for most data. On the other hand, Data Lakes are great for storing unstructured data, but they neither support transactions nor enforce data quality. Traditionally, a common approach has been to use several data warehouses, a data lake, and other specialized tools. However, this creates complexity, data duplication, and increased costs. On top of that, organizations lose governance over data, which is essential considering recent privacy regulations.
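
The data-quality gap is easier to see in code. The sketch below is plain Python with a hypothetical `SCHEMA` and hypothetical helper functions (none of them come from Databricks or from any specific warehouse product). It mimics the schema check that a warehouse applies on write and that a plain data lake does not.

```python
# Hypothetical table schema: column name -> required Python type.
SCHEMA = {"order_id": int, "amount": float, "country": str}


def validate_record(record, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for column, expected in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            actual = type(record[column]).__name__
            problems.append(f"{column}: expected {expected.__name__}, got {actual}")
    for column in record:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


def write_with_enforcement(table, records):
    """Warehouse-style write: reject the whole batch if any record is invalid."""
    for record in records:
        problems = validate_record(record)
        if problems:
            raise ValueError(f"rejected {record!r}: {problems}")
    table.extend(records)


def write_without_enforcement(table, records):
    """Plain-lake-style write: everything lands, valid or not."""
    table.extend(records)
```

With the unenforced write, a record such as `{"order_id": "2", "amount": 9.5}` lands silently and only causes trouble later, when a report or a model reads it.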

In order to support Artificial Intelligence and Business Intelligence directly on the same data, a new architecture emerged: the Data Lakehouse.

Databricks Lakehouse: one platform to unify all your Analytics and AI workloads

Data Lake plus Data Warehouse equals Databricks Data Lakehouse.
Property of Databricks Inc.

The Databricks Lakehouse is a platform that combines the ACID transactions and data governance of data warehouses with the flexibility and cost-efficiency of data lakes to enable business intelligence (BI) and machine learning (ML) on all data. The Databricks Lakehouse keeps your data in your massively scalable cloud object storage in open-source data standards, allowing you to use your data however and wherever you want.
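
What ACID transactions buy you is easiest to show with a failed write. The sketch below uses SQLite from the Python standard library purely as a stand-in (it is not how the Lakehouse stores data, and the `sales` table and `load_batch` helper are hypothetical). The point is only the guarantee: a batch either lands completely or not at all.

```python
import sqlite3


def load_batch(conn, rows):
    """Insert all rows atomically; on an integrity error nothing is kept."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO sales (order_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

first_ok = load_batch(conn, [(1, 10.0), (2, 20.0)])   # clean batch
second_ok = load_batch(conn, [(3, 30.0), (1, 99.0)])  # duplicate order_id 1

# The failed batch left no partial rows behind, so order 3 is absent too.
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Without this guarantee, a job that crashes halfway through a write leaves readers with half a batch and no easy way to tell.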

In short, a Data Lakehouse is an architecture that enables efficient and secure AI and BI directly on vast amounts of data stored in Data Lakes. The Lakehouse combines the best elements of data lakes and data warehouses to deliver the reliability, data governance, and performance of data warehouses with the openness and flexibility of data lakes. The Lakehouse is a perfect option for modern data companies that want open, direct access to data stored in standard data formats, indexing protocols optimized for machine learning and data science, as well as low query latency and high reliability for BI and advanced analytics. The Databricks Lakehouse Platform is Simple, Open, and Multi-Cloud. If you have further questions about how the Data Lakehouse differs from a Data Warehouse or a Data Lake, please refer to this FAQ document.

The Databricks Lakehouse Platform architecture supports many different workloads. Data Engineering on the Lakehouse allows teams to unify batch and streaming operations with an end-to-end ETL platform that automates the complexity of building and maintaining pipelines and running workloads. The Lakehouse also provides a unified platform for running all streaming workloads, from ingestion to event processing. For Data Science and Machine Learning, the Lakehouse allows teams to effortlessly process and manage data, and standardize the ML lifecycle from experimentation to production.
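
The idea of unifying batch and streaming can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Databricks API: `clean_events` and `event_stream` are hypothetical names, and a generator stands in for a real streaming source. What it shows is one transformation, written once, serving both a finite batch and events that arrive one at a time.

```python
from typing import Iterable, Iterator


def clean_events(events: Iterable[dict]) -> Iterator[dict]:
    """One transformation, written once: drop events without a user_id
    and normalize the country code to upper case."""
    for event in events:
        if event.get("user_id") is None:
            continue
        yield {**event, "country": event.get("country", "").upper()}


def event_stream() -> Iterator[dict]:
    """Hypothetical stand-in for an unbounded source such as a message queue."""
    yield {"user_id": 1, "country": "no"}
    yield {"user_id": None, "country": "se"}
    yield {"user_id": 2, "country": "dk"}


# Batch: a finite list processed in one go.
batch = [{"user_id": 7, "country": "us"}, {"user_id": None, "country": "ca"}]
batch_result = list(clean_events(batch))

# Streaming: the same function consumes events one at a time as they arrive.
stream_result = [event for event in clean_events(event_stream())]
```

Keeping a single definition of the logic is what removes the usual drift between a nightly batch job and its streaming counterpart.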

Diagram of the Data Lakehouse, one platform to unify all of your data, analytics and AI workloads.
Property of Databricks Inc. — From eBook Intro to the Databricks Lakehouse Platform

Most importantly, the Databricks Lakehouse Platform is open and provides the flexibility to continue using existing infrastructure, to easily share data, and to build your modern data stack. Databricks has thousands of customers around the world, from unicorns to Fortune 500 companies, across all kinds of industries. Databricks has already created easy-to-use solutions, purpose-built for various industries, from Financial Services to Healthcare to Retail. The Databricks Lakehouse is the key to becoming an AI-driven organization!

I hope that you found this gentle introduction to the Lakehouse useful and valuable!

Check out my comment for links to some useful resources!

Disclaimer: The views expressed in this post represent my own opinions and not those of my employer.


Written by Olya Lukashina

Solutions Architect — Digital Natives, Databricks 🧡 Data Science & Engineering, Machine Learning, Analytics & BI. https://www.linkedin.com/in/olukashina/
