
Data Engineering in a Nutshell


For many years, companies have been trying to adopt customer-centricity, so why do so many of them struggle to get it right?
The volume, velocity, and variety of customer data that now exist overwhelm many organizations. Most companies don't have a data orchestration strategy for segmenting and profiling customers, or they lack the process and operational capabilities to target them with personalized communications and experiences. As analytics becomes progressively more important, data engineering has become a competitive edge and is central to the technology initiatives that help companies successfully implement a customer-centric strategy.

Data Engineering Overview
As you develop data pipelines, remember the ultimate goal: to turn your data into useful information such as actionable analytics for business users and predictive models for data scientists. To do so, you must think about the journey your data will take through your data pipelines. Start by answering some fundamental questions:

  • What business questions do you want to answer?
  • What types of data will you be analyzing?
  • What kinds of schema do you need to define?
  • What types of data quality problems do you have?
  • What is the acceptable latency of your data?
  • Will you transform your data as you ingest it, or maintain it in a raw state and transform it later, for specific use cases?

Once you have answered these questions, you can determine what type of data pipeline you need, how frequently you need to update your data, and whether you should use data lakes, data warehouses, integration tools, or a cloud data platform that simplifies the process of creating database interfaces, data ingestion procedures, and data transformation logic.

[Figure: data lake vs. data warehouse. Image by Guru99]

Data engineering involves extracting data from various applications, devices, event streams, and databases. Where will that data be stored?
For many companies, the answer is a data warehouse or a data lake. Both are widely used for storing big data, but they are not interchangeable terms.

What is a Data Lake?

  • Data lakes are scalable repositories that can store many types of data in raw and native forms, especially for semi-structured and unstructured data. To be truly useful, they must facilitate user-friendly exploration via popular methods such as SQL, automate routine data management activities, and support a range of analytics use cases.

What is a Data Warehouse?

  • Data warehouses typically ingest and store only structured data, usually defined by a relational database schema. Raw data often needs to be transformed to conform to the schema. Modern data warehouses are optimized for processing thousands or even millions of queries per day, and can support specific business uses.

The distinction between the two is important because they serve different purposes and require different actions to be properly optimized.

Understanding Data Latency

Data latency is the time delay between when data is generated and when it is available for use. In the past, data was periodically loaded into analytic repositories in batches, generally daily, weekly, or monthly. Today, more analytic workloads require data that is updated in near real time, such as every five minutes, as well as streaming data that may be loaded continuously.

Consider a financial services company with a data warehouse designed to store and analyze core banking data. As the company placed more emphasis on private banking services and brokerage accounts, financial advisors needed reports that reflected recent transactions. So the firm created a data pipeline that loads new transactions every few minutes, paired with a predictive model that enables advisors to make timely decisions based on current activity.

Data latency is not something to fear. The acceptable level of latency depends on micro- and macro-environmental factors that differ for every organization and industry. Once companies understand the types of latency and their uses and purposes, they can work toward zero-latency data, or whatever level of freshness best fits their particular case.

Change data capture (CDC) capabilities simplify data pipelines by recognizing the changes that have occurred since the last data load and incrementally processing or ingesting that data. For example, in the case of the financial services company, a bulk upload from the banking system refreshes the data warehouse each night, while CDC adds new transactions every five minutes. This type of process allows analytic databases to stay current without reloading the entire data set.
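
To make the idea concrete, here is a minimal sketch of a watermark-based incremental load in Python, a simple way to approximate CDC when the source table carries an update timestamp. The table and column names (transactions, updated_at) are assumptions for illustration, not part of any specific product.

```python
# A minimal sketch of watermark-based incremental ingestion, a simple way to
# approximate CDC when the source table carries an update timestamp.
# The table and column names (transactions, updated_at) are illustrative.
import sqlite3

def load_new_transactions(conn: sqlite3.Connection, last_watermark: str):
    """Return rows changed since last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM transactions "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Each pipeline run processes only the rows that changed since the previous run,
# so the analytic database stays current without reloading the entire data set.
```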

In the case of streaming data, be aware that event time and processing time are not always the same. You can’t simply follow the timestamp in the data, since some transactions may be delayed in transit, which could cause them to be recorded in the wrong order.
If you need to work with streaming data, you may need to create a pipeline that can verify precisely when each packet, record, or transaction occurred, and ensure they are recorded only once, and in the right order, according to your business requirements. Adding event time to the record ensures that processing delays do not cause incorrect results due to an earlier change overwriting a later change.
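
As a rough illustration, the sketch below keeps only the latest version of each record by comparing event times and ignores duplicate deliveries, so a delayed earlier change cannot overwrite a later one. The event fields (event_id, key, event_time) are hypothetical.

```python
# A minimal sketch of event-time-aware, idempotent updates for streaming records.
# The event fields (event_id, key, event_time) are hypothetical; event_time is
# assumed to be comparable (for example, an ISO-8601 string or a datetime).
state: dict[str, dict] = {}    # latest known record per key
seen_events: set[str] = set()  # event ids already applied (guards against duplicates)

def apply_event(event: dict) -> None:
    if event["event_id"] in seen_events:
        return  # duplicate delivery: record it only once
    seen_events.add(event["event_id"])
    current = state.get(event["key"])
    # Overwrite only if this event is newer in event time, not arrival order,
    # so a delayed earlier change cannot clobber a later one.
    if current is None or event["event_time"] > current["event_time"]:
        state[event["key"]] = event
```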


From Business Requirements to Storage Systems
Business requirements are the starting point for choosing a data storage system. Data engineers use different types of storage systems for different purposes, and the right choice is determined, in large part, by the stage of the data lifecycle the system needs to serve.

The data lifecycle consists of four stages:

  1. Ingest
  2. Store
  3. Process and analyze
  4. Explore and visualize

Ingestion is the first stage of the data lifecycle; it entails acquiring data and bringing it into your systems. The storage stage is about persisting data to a storage system from which it can be accessed during the later stages of the lifecycle. The process and analyze stage begins with transforming data into a format usable by analysis applications. Explore and visualize is the final stage, in which insights are derived from the analysis and presented in tables, charts, and other visualizations for use by others.

Ingest
The three broad ingestion modes with which data engineers typically work are as follows:

  • Application data
  • Streaming data
  • Batch data

Application Data
Application data is generated by applications, including mobile apps, and pushed to backend services. This data includes user-generated data, like a name and shipping address collected as part of a sales transaction. It also includes data generated by the application, such as log data. Event data, like clickstream data, is also a type of application-generated data.

The volume of this kind of data depends on the number of users of the application, the types of data the application generates, and how long the application is in use. The size of the application data sent in a single operation can also vary widely: a clickstream event may contain less than 1 KB of data, whereas an image upload could be multiple megabytes.
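
For a sense of scale, here is a hypothetical clickstream event; the field names are illustrative, and a payload like this serializes to well under 1 KB, while an image upload in the same pipeline could be several megabytes.

```python
# A hypothetical clickstream event; the field names are illustrative only.
import json

click_event = {
    "event_id": "e-1024",
    "user_id": "u-42",
    "event_type": "page_view",
    "page": "/products/coffee-grinder",
    "event_time": "2021-02-26T10:15:30Z",
    "user_agent": "Mozilla/5.0",
}

# Serialized, an event like this is well under 1 KB.
print(len(json.dumps(click_event).encode("utf-8")), "bytes")
```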

Streaming Data
Streaming data is typically sent in small messages that are transmitted continuously from the data source. It may be sensor data, which is generated at regular intervals, or event data, which is generated in response to a particular event. Examples of streaming data include the following:

  • Virtual machine monitoring data, such as CPU utilization rates and memory consumption data
  • An IoT device that sends temperature, humidity, and pressure data every minute
  • A customer adding an item to an online shopping cart, which then generates an event with data about the customer and the item

Streaming data often includes a timestamp indicating the time that the data was generated. This is often called the event time. Some applications will also track the time that data arrives at the beginning of the ingestion pipeline. This is known as the process time.

Time-series data may require some additional processing early in the ingestion process. If a stream of data needs to be in time order for processing, then late arriving data will need to be inserted in the correct position in the stream. This can require buffering of data for a short period of time in case the data arrives out of order.
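
One common approach, sketched below under the assumption that events are dicts carrying a numeric event time (for example, epoch seconds), is a small ordering buffer that holds events for an allowed-lateness window and releases them in event-time order.

```python
# A minimal sketch of an ordering buffer: hold events briefly so late arrivals can
# be slotted into event-time order before downstream processing. Events are assumed
# to be dicts with a numeric "event_time" (for example, epoch seconds).
import heapq
from itertools import count

class OrderingBuffer:
    def __init__(self, allowed_lateness_s: float = 60.0):
        self.allowed_lateness_s = allowed_lateness_s
        self._heap: list[tuple[float, int, dict]] = []
        self._seq = count()  # tiebreaker when two events share a timestamp
        self._max_event_time = float("-inf")

    def add(self, event: dict) -> list[dict]:
        """Buffer an event and return any events now safe to emit, in order."""
        heapq.heappush(self._heap, (event["event_time"], next(self._seq), event))
        self._max_event_time = max(self._max_event_time, event["event_time"])
        ready = []
        # Release events that are at least allowed_lateness_s older than the
        # newest event time seen so far.
        while self._heap and self._heap[0][0] <= self._max_event_time - self.allowed_lateness_s:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```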

Batch Data
Batch data is ingested in bulk, typically in files; for example, files of data exported from one application are uploaded to be processed by another. Examples of batch data include the following:

  1. Transaction data that is collected from applications may be stored in a relational database and later exported for use by a machine learning pipeline
  2. Archiving data in long-term storage to comply with data retention regulations
  3. Migrating an application from on-premises to the cloud by uploading files of exported data
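
As a simple illustration of the first example above, the following sketch exports transaction rows from a relational database to a CSV file that a downstream pipeline could ingest in bulk. The table and column names are assumptions.

```python
# A minimal sketch of a batch export: dump transaction rows from a relational
# database to a CSV file that a downstream pipeline ingests in bulk.
# The table and column names are assumptions.
import csv
import sqlite3

def export_transactions(db_path: str, out_path: str) -> None:
    with sqlite3.connect(db_path) as conn, open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "customer_id", "amount", "created_at"])
        for row in conn.execute(
            "SELECT id, customer_id, amount, created_at FROM transactions"
        ):
            writer.writerow(row)
```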

Store
The focus of the storage stage of the data lifecycle is to make data available for transformation and analysis. Several factors influence the choice of storage system, including:

  • How the data is accessed—by individual record (row) or by an aggregation of columns across many records (rows).
  • The way access controls need to be implemented, at the schema or database level or finer-grained level.
  • How long the data will be stored.

These three characteristics are the minimum that should be considered when choosing a storage system; there may be additional criteria for some use cases.

Time to Store
Consider how long data will be stored when choosing a data store. Some data is transient. For example, data that is needed only temporarily by an application running on a Compute Engine instance could be stored on a local solid-state drive (SSD) on the instance. As long as the data can be lost when the instance shuts down, this could be a reasonable option.

Data is often needed for longer than the lifetime of a virtual machine instance, so other options are better fits in those cases. Data that is frequently accessed is often well suited to either relational or NoSQL databases. As data ages, it becomes less likely to be accessed; at that point, it can be deleted, or exported and archived. If the data is not likely to be used for other purposes, such as machine learning, and no regulations require you to keep it, then deleting it may be the best option.

In cases where the data can be useful for other purposes, or you are required to retain it, exporting it and storing it in Cloud Storage is an option. If the data later needs to be accessed, it can be imported back into the database and queried there.
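
Here is a minimal sketch of that ageing-out pattern, assuming a hypothetical orders table: rows older than the retention cutoff are exported to a file (which could then be uploaded to an archive bucket) and removed from the hot database.

```python
# A minimal sketch of ageing data out of an operational table. The orders table
# and its columns are assumptions; the exported file could then be uploaded to
# object storage for long-term retention.
import csv
import sqlite3

def archive_old_orders(conn: sqlite3.Connection, cutoff_date: str, archive_path: str) -> None:
    old_rows = conn.execute(
        "SELECT id, customer_id, total, created_at FROM orders WHERE created_at < ?",
        (cutoff_date,),
    ).fetchall()
    with open(archive_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "customer_id", "total", "created_at"])
        writer.writerows(old_rows)
    conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff_date,))
    conn.commit()
```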

Process and Analyze
During the process and analyze stage, data is transformed into forms that make the data readily available to ad hoc querying or other forms of analysis.

Data Transformations
Transformations include data cleansing, which is the process of detecting erroneous data and correcting it. Some cleansing operations are based on the expected data type of a column.

For example, a column of data containing only numeric data should not have alphabetic characters in the column. The cleansing process could delete rows of data that have alphabetic characters in that column. It could alternatively keep the row and substitute another value, such as a zero, or treat the value as NULL.
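
A minimal cleansing sketch with pandas, using a made-up quantity column: invalid values are coerced to NULL, and you can then choose between dropping the affected rows or substituting a default such as zero.

```python
# A minimal cleansing sketch with pandas; the quantity column is made up.
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "quantity": ["10", "abc", "7"]})

# Coerce to numeric; values that are not numbers become NaN (treated as NULL).
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

dropped = df.dropna(subset=["quantity"])   # option 1: delete the offending rows
substituted = df.fillna({"quantity": 0})   # option 2: keep the rows, substitute zero
```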

In other cases, business logic is applied to determine incorrect data. Some business logic rules may be simple, such as that an order date cannot be earlier than the date that the business began accepting orders. An example of a more complex rule is not allowing an order total to be greater than the credit limit assigned to a customer.

The decision to keep the row or delete it will depend on the particular use case. A set of telemetry data arriving at one-minute intervals may include an invalid value. In that case, the invalid value may be dropped without significantly affecting hour-level aggregates. A customer order that violates a business rule, however, might be kept because orders are significant business events. In this case, the order should be processed by an exception-handling process.
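
The sketch below illustrates that routing decision with pandas: orders that violate a credit-limit or order-date rule are separated into an exceptions set for follow-up rather than deleted. The column names and rule thresholds are assumptions.

```python
# A minimal sketch of business-rule validation: violating orders are routed to an
# exception-handling set rather than deleted. Column names and thresholds are
# assumptions for illustration.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102],
    "order_total": [250.0, 9000.0],
    "credit_limit": [1000.0, 5000.0],
    "order_date": pd.to_datetime(["2021-02-01", "2021-02-02"]),
})

business_start = pd.Timestamp("2010-01-01")
violates = (orders["order_total"] > orders["credit_limit"]) | (orders["order_date"] < business_start)

valid_orders = orders[~violates]  # flows on through the pipeline
exceptions = orders[violates]     # handed to an exception-handling process
```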

Data transformation may be constructive (adding, copying, and replicating data), destructive (deleting fields and records), aesthetic (standardizing salutations or street names), or structural (renaming, moving, and combining columns in a database).

Data Analysis
In the analyze stage, a variety of techniques may be used to extract useful information from data. Statistical techniques are often used with numeric data to do the following:

  • Describe characteristics of a dataset, such as a mean and standard deviation of the dataset.
  • Generate histograms to understand the distribution of values of an attribute.
  • Find correlations between variables, such as customer type and average revenue per sales order.
  • Make predictions using regression models, which allow you to estimate one attribute based on the value of another. In statistical terms, regression models generate predictions of a dependent variable based on the value of an independent variable.
  • Cluster subsets of a dataset into groups of similar entities. For example, a retail sales dataset may yield groups of customers who purchase similar types of products and spend similar amounts over time.
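
The following sketch shows several of these techniques (descriptive statistics, correlation, regression, and clustering) applied to a small synthetic retail dataset with pandas and scikit-learn; the data and column names are invented for illustration.

```python
# A sketch of several of these techniques on a small synthetic retail dataset,
# using pandas and scikit-learn. The data and column names are invented.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"orders_per_month": rng.integers(1, 20, size=200)})
df["avg_order_value"] = 30 + 2.5 * df["orders_per_month"] + rng.normal(0, 5, size=200)

print(df.describe())  # descriptive statistics: mean, standard deviation, quartiles
print(df.corr())      # correlation between order frequency and order value

# Regression: estimate average order value from order frequency.
reg = LinearRegression().fit(df[["orders_per_month"]], df["avg_order_value"])

# Clustering: group customers with similar purchasing behavior.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    df[["orders_per_month", "avg_order_value"]]
)
```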

Explore and Visualize
Often when working with new datasets, you'll find it helpful to explore the data and test a hypothesis. Jupyter Notebook (http://jupyter.org) is an open-source tool for exploring, analyzing, and visualizing datasets. You also have a wide range of data science and machine learning libraries, such as pandas, scikit-learn, and Matplotlib, that can be used according to your needs.

Data visualization in data exploration leverages familiar visual cues such as shapes, dimensions, colors, lines, points, and angles so that data analysts can effectively visualize and define the metadata, and then perform data cleansing. Performing the initial step of data exploration enables data analysts to better understand and visually identify anomalies and relationships that might otherwise go undetected.
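
Here is a short exploration sketch of the kind you might run in a Jupyter notebook with pandas and Matplotlib; the file name and columns are assumptions.

```python
# A short exploration of the kind you might run in a Jupyter notebook, using
# pandas and Matplotlib. The file name and columns are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["order_date"])  # hypothetical dataset

df.info()             # column types and missing values
print(df.describe())  # summary statistics to spot anomalies

# A quick visual check for outliers and skew in order totals.
df["order_total"].hist(bins=50)
plt.xlabel("Order total")
plt.ylabel("Count")
plt.show()
```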

Let's Recap

  1. Know the four stages of the data lifecycle: ingest, store, process and analyze, and explore and visualize. They provide an organizing framework for understanding the broad context of data engineering and machine learning.

  2. Understand the characteristics of streaming data. Streaming data is a set of data that is sent in small messages that are transmitted continuously from the data source.

  3. Understand the characteristics of batch data. Batch data is ingested in bulk, typically in files. Under the batch processing model, a set of data is collected over time and then fed into an analytics system.

  4. Know the technical factors to consider when choosing a data store. These factors include the volume and velocity of data, the type of structure of the data, access control requirements, and data access patterns.

  5. Know the three levels of structure of data. These levels are structured, semi-structured, and unstructured. Structured data has a fixed schema, such as a relational database table. Semi-structured data has a schema that can vary; the schema is stored with data. Unstructured data has internal structure but is not structured via pre-defined data models or schema.

  6. Know the difference between relational and NoSQL databases. Relational databases are used for structured data whereas NoSQL databases are used for semi-structured data. The four types of NoSQL databases are key-value, document, wide-column, and graph databases.

Final takeaways

Data engineering encompasses a broad set of procedures, tools, and skill sets that govern and facilitate the flow of data. Data engineers are focused primarily on building and maintaining data pipelines that transport data through different steps and put it into a usable state.

Last but not least, you must be attentive to data quality practices at all stages of the data orchestration process. As data is, de facto, the core of every business operation, the quality of the data that is gathered, stored, and consumed during business processes will determine the success of doing business today and tomorrow.

"In God we trust; all others must bring data." (W. Edwards Deming)
