
A brief introduction to Data Warehouse components.

Published Jan 07, 2020

What is a Data Warehouse?

A data warehouse is a central repository where raw data is transformed and stored in queryable form. It is an information system that contains historical and cumulative data from single or multiple sources, and it simplifies an organization's reporting and analysis processes. It also serves as a single version of the truth for decision making and forecasting. Without a data warehouse, data scientists and data analysts have to pull data straight from the production database, which can lead to different answers to the same question, delays, and even outages.

According to Bill Inmon, recognized by many as the father of the data warehouse and the author of several data warehousing books, “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process”.

Technically, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. The DWH simplifies a data analyst’s job, allowing them to manipulate all data from a single interface and derive analytics, visualizations, and statistics.

Data warehouses are generally updated less frequently and are designed to give a historical, long-range view of data. The analytics run in a data warehouse therefore usually provide a comprehensive view of your company’s history rather than a snapshot of the current state of your business. Since the data in a data warehouse is already integrated and transformed, you can easily compare older, historical data and track marketing and sales trends. These historical comparisons can be used to track successes and failures and to predict how best to proceed with your business ventures to increase profit and long-term ROI.

Characteristics of Data Warehouse

A data warehouse has the following characteristics:

Subject-Oriented
Integrated
Time-variant
Non-volatile

Subject-Oriented
A data warehouse is subject-oriented: it offers information organized around a theme, such as sales, marketing, or distribution, rather than around the company’s ongoing operations.

Integrated
Integrated means establishing a common unit of measure for all similar data from dissimilar sources. A data warehouse is developed by integrating data from varied sources such as mainframes, relational databases, and flat files. It must also keep consistent naming conventions, formats, and coding.

Time-Variant
The data collected in a data warehouse is identified with a particular time period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly.

Non-volatile
A data warehouse is also non-volatile, meaning previous data is not erased when new data is entered into it.

Components of a data warehouse


OLTP data

OLTP (Online Transactional Processing) is a category of data processing focused on transaction-oriented tasks. OLTP typically involves inserting, updating, and/or deleting small amounts of data in a database, and it mainly deals with a large number of transactions performed by a large number of users.

OLTP databases are generally the main source of data for the data warehouse.
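
To make the transactional pattern concrete, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for a production OLTP database; the orders table and its columns are hypothetical.

```python
# A minimal OLTP-style sketch using Python's built-in sqlite3 module
# as a stand-in for a production transactional database.
# The orders table and its columns are hypothetical.
import sqlite3

conn = sqlite3.connect("shop.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Each OLTP transaction touches a small amount of data and commits quickly.
with conn:  # opens a transaction; commits on success, rolls back on error
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 ("alice", 42.50))
    conn.execute("UPDATE orders SET amount = amount * 0.9 WHERE customer = ?",
                 ("alice",))

conn.close()
```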

ETL Process

ETL is short for extract, transform, load: three database functions combined into one tool or pipeline that pulls data out of one database and places it into another.

Pipeline infrastructure varies depending on the use case and scale. However, it always implements the same set of ETL operations (a minimal sketch in Python follows the list):

  1. Extracting data from source databases
  2. Transforming data to match a unified format for specific business purposes
  3. Loading reformatted data to the data warehouse
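
Below is a minimal, hedged sketch of these three steps, using sqlite3 for both a hypothetical source database (source.db) and the warehouse (warehouse.db); all table and column names are made up for illustration.

```python
# A minimal sketch of the three ETL steps, with sqlite3 standing in
# for both the source OLTP database and the warehouse. The orders and
# fact_orders tables are hypothetical.
import sqlite3

def extract(source_path):
    """Extract raw order rows from the source database."""
    with sqlite3.connect(source_path) as src:
        return src.execute(
            "SELECT customer, amount, created_at FROM orders").fetchall()

def transform(rows):
    """Transform rows into a unified format: trimmed, uppercased
    customer names and amounts converted to integer cents."""
    return [(customer.strip().upper(), int(round(amount * 100)), created_at)
            for customer, amount, created_at in rows]

def load(rows, warehouse_path):
    """Load the reformatted rows into the warehouse fact table."""
    with sqlite3.connect(warehouse_path) as dwh:
        dwh.execute("CREATE TABLE IF NOT EXISTS fact_orders ("
                    "customer TEXT, amount_cents INTEGER, created_at TEXT)")
        dwh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

load(transform(extract("source.db")), "warehouse.db")
```

In production, orchestration tools such as those listed below typically schedule, retry, and monitor these steps rather than running them as a single script.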

ETL tools perform many functions; some of them are:

  1. Anonymizing data as per regulatory stipulations.
  2. Preventing unwanted data in operational databases from being loaded into the data warehouse.
  3. Searching for and replacing common names and definitions for data arriving from different sources.
  4. Calculating summaries and derived data.
  5. Populating missing data with defaults.
  6. De-duplicating repeated data arriving from multiple data sources (items 5 and 6 are sketched below).
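
As an illustration of items 5 and 6, here is a small pandas sketch of default filling and de-duplication; the column names and values are invented for the example.

```python
# A hedged sketch of two common ETL cleansing steps: de-duplication
# and populating missing values with defaults. Columns are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "region":   ["EU", "EU", None, "US"],
    "amount":   [10.0, 10.0, 25.0, None],
})

clean = (
    raw.drop_duplicates()              # de-duplicate repeated rows from multiple sources
       .fillna({"region": "UNKNOWN",   # populate missing values with defaults
                "amount": 0.0})
)
print(clean)
```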

Some tools for writing ETL pipelines:

  1. Airflow.
  2. Cloud Dataflow.
  3. Kafka.
  4. Informatica.
  5. Talend.

Data Marts

A data mart is a subset of a data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a data mart, data can be derived from an enterprise-wide data warehouse or data can also be collected directly from sources.

A data mart is an access layer used to get data out to the users. It is often presented as an alternative to a full-size data warehouse, as it takes less time and money to build. However, there is no standard definition of a data mart; its meaning differs from person to person.

In simple terms, a data mart is a subsidiary of a data warehouse: a partition of data created for a specific group of users. Data marts can be created in the same database as the data warehouse or in a physically separate database.
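
When the mart lives in the same database as the warehouse, one simple way to carve it out is a filtered view. The sketch below assumes a hypothetical fact_orders table and builds a sales_mart view over it with sqlite3; all names are illustrative.

```python
# A minimal sketch of a data mart implemented as a view inside the
# same database as the warehouse. fact_orders and sales_mart are
# hypothetical names.
import sqlite3

with sqlite3.connect("warehouse.db") as dwh:
    dwh.execute("CREATE TABLE IF NOT EXISTS fact_orders ("
                "customer TEXT, department TEXT, amount_cents INTEGER)")
    # The mart exposes only the slice of data the sales team needs.
    dwh.execute("""
        CREATE VIEW IF NOT EXISTS sales_mart AS
        SELECT customer, amount_cents
        FROM fact_orders
        WHERE department = 'sales'
    """)
```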

Reporting and Analytics

Once the data warehouse and data marts are built, the base work is done: data has been prepared, ingested, modelled, and cleaned. Now we can provide accurate reports and analytics over our data. This is where the reporting layer comes in.

Reporting is the process of organizing data into informational summaries to monitor how different areas of a business are performing. Measuring core metrics and presenting them, whether in an email, a slide deck, or an online dashboard, falls under this category.

Analytics is the process of exploring data and reports to extract meaningful insights, which can be used to better understand and improve business performance.
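
As a small illustration of the reporting side, the pandas sketch below summarizes made-up order data into a monthly revenue-per-region report, the kind of informational summary that might feed a dashboard or email.

```python
# An illustrative reporting query over warehouse-style data:
# organizing rows into an informational summary (monthly revenue
# per region). The data here is invented for the example.
import pandas as pd

orders = pd.DataFrame({
    "month":  ["2020-01", "2020-01", "2020-02"],
    "region": ["EU", "US", "EU"],
    "amount": [100.0, 250.0, 175.0],
})

report = orders.groupby(["month", "region"])["amount"].sum().reset_index()
print(report)  # the core metric (revenue), summarized per month and region
```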

Things to Keep in Mind While Creating a Data Warehouse

Setting up a secure and reliable data flow is a challenging task. There are many things that can go wrong during data transportation: data can be corrupted, hit bottlenecks causing latency, or data sources may conflict, generating duplicate or incorrect data. Getting data into one place requires careful planning and testing: you need to filter out junk data, eliminate duplicates and incompatible data types, and obfuscate sensitive information, all while not missing critical data.

a. Data warehouse platform: cloud vs. on-premise
The first decision that one has to make when choosing a data warehouse is between cloud and on-premise data warehouse systems, and there are certainly advantages and disadvantages to both.

Cloud services offer scalability, low entry cost, connectivity, security, and availability/reliability.

On-premise solutions, meanwhile, offer speed: since data sits on local servers, speed and latency can be better managed, at least for businesses based in one geographical location. Security and availability/reliability can also be achieved if you have good staff and exceptional hardware.

b. Vendor selection
Once you choose between cloud and on-premise systems, there are various vendors to choose from.

For on-premise systems, some options include:

  1. Oracle Database
  2. Microsoft SQL Server
  3. IBM DB2
  4. MySQL
  5. PostgreSQL

For cloud systems, top solutions include:

  1. Amazon Redshift
  2. Microsoft Azure SQL Data Warehouse
  3. Google BigQuery
  4. Snowflake Computing

c. Don’t become overly enamoured with technology and data; focus instead on the business’s requirements and goals.

d. Never presume that the business, its requirements and analytics, and the underlying data and the supporting technology are static.

Data Lake

While talking about the DWH, it is important to mention the data lake too. A data lake is a vast pool for saving data in its native, unprocessed form. A data lake is not limited to relational data: it can store structured, semi-structured, and unstructured data. A data lake stands out for its high agility, as it isn’t limited to a warehouse’s fixed configuration.

A data lake uses the ELT approach, swapping the transform and load operations in the standard ETL sequence. Supporting large storage and scalable computing, a data lake starts loading data immediately after extracting it, which allows increasing volumes of data to be processed. A data lake is very convenient, for instance, when the purpose of the data hasn’t been determined yet, since the lake stores it now and processes it on demand later.
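
A minimal sketch of the ELT pattern follows, with a local directory standing in for real object storage; the directory layout, file names, and record fields are all invented for illustration.

```python
# A hedged sketch of ELT: raw data is loaded into the lake immediately
# after extraction and transformed later, on demand. The local "lake/"
# directory stands in for real object storage.
import json
import pathlib

lake = pathlib.Path("lake/raw/events")
lake.mkdir(parents=True, exist_ok=True)

# Load step: persist records in their native, unprocessed form.
raw_events = [{"user": "alice", "action": "click", "ts": "2020-01-07T10:00:00"}]
(lake / "batch_001.json").write_text(json.dumps(raw_events))

# Transform step: runs later, only once a use for the data is found.
events = json.loads((lake / "batch_001.json").read_text())
clicks = [e for e in events if e["action"] == "click"]
print(len(clicks))
```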

Data warehouse Architecture Best Practices

To design a data warehouse architecture, you should follow the best practices given below:

  1. Use a data model optimized for information retrieval, which can be a dimensional model, a denormalized model, or a hybrid approach (see the star schema sketch after this list).
  2. Ensure that data is processed quickly and accurately. At the same time, take an approach that consolidates data into a single version of the truth.
  3. Carefully design the data acquisition and cleansing process for the data warehouse.
  4. Design a metadata architecture that allows the sharing of metadata between components of the data warehouse.
  5. Consider implementing an ODS (operational data store) model when information retrieval needs sit near the bottom of the data abstraction pyramid or when multiple operational sources need to be accessed.
  6. Make sure the data model is integrated and not just consolidated; in that case, consider the 3NF data model. It is also ideal when acquiring ETL and data cleansing tools.
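
As referenced in practice 1 above, here is a hedged sketch of a dimensional (star schema) layout: a central fact table keyed to denormalized dimension tables, with sqlite3 again standing in for a production warehouse. All table and column names are illustrative.

```python
# A minimal sketch of a star schema: one fact table referencing
# denormalized dimension tables. Names are illustrative.
import sqlite3

with sqlite3.connect("warehouse.db") as dwh:
    dwh.executescript("""
        CREATE TABLE IF NOT EXISTS dim_date (
            date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE IF NOT EXISTS dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE IF NOT EXISTS fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            amount_cents INTEGER);
    """)
```

Queries then join the fact table to whichever dimensions a report needs, which keeps reads fast and simple at the cost of some redundancy in the dimensions.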