ETL vs ELT Basics
ETL
Extract, Transform, and Load refers to the process of curating data from multiple sources, conforming it to a unified data format or structure, and loading the transformed data into its new environment.
Extract
Data extraction is the first stage of the ETL process, where data is acquired from various source systems.
- the data may be completely raw, such as sensor data from IoT devices
- perhaps it is unstructured data from scanned medical documents or company emails
- it may be streaming data coming from a social media network
- near real-time stock market buy/sell transactions
- it may come from existing enterprise databases and data warehouses
- extraction is normally an automated process, using methods such as web scraping or APIs that connect to the data and query it
- source data can be static, such as a data archive (in which case the extraction could be done in a batch process)
- could be dynamic/streaming live from many locations (weather data, social network feeds, IoT devices…)
Techniques
- OCR - Optical Character Recognition, used to interpret and digitize text scanned from paper documents
- ADC - Analog-to-Digital Converters, used to digitize analog audio recordings and signals
- CCD - Charge-Coupled Devices that capture and digitize images
- Opinions, questionnaires, vital statistical data
- Cookies, user logs, and other methods to track human or system behavior
- Web scraping
- APIs for extracting data from all online data repositories and feeds
- SQL for querying relational databases
- NoSQL for querying document, key-value, graph, or other non-relational data repositories
- Edge computing to reduce data volume by extracting raw data we are interested in, rather than transmitting all the data collected
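As a minimal sketch of two of these extraction techniques (an API call and a SQL query), consider the Python snippet below. It is illustrative only: the endpoint URL, query parameters, table, and column names are hypothetical, and the `requests` library is assumed to be installed.

```python
import sqlite3

import requests  # third-party HTTP client, assumed installed

# Hypothetical REST endpoint; the URL and parameters are illustrative only.
API_URL = "https://api.example.com/v1/sensor-readings"

def extract_from_api(url: str) -> list:
    """Pull JSON records from an online feed or repository."""
    response = requests.get(url, params={"since": "2024-01-01"}, timeout=30)
    response.raise_for_status()
    return response.json()

def extract_from_database(db_path: str) -> list:
    """Query a relational source system with SQL."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute("SELECT id, amount, created_at FROM transactions")
        return cursor.fetchall()
```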
Transform
The transformation stage (wrangle stage) is where rules and processes are applied to the data to prepare it for loading into the target system.
This is normally done in an intermediate working environment called a “staging area.” Here, the data are cleaned to ensure reliability and conformed to ensure compatibility with the target system and with its intended use cases.
Transformation is about formatting the data to suit the application
Many other transformations may be applied, including:
- Typing: casting data to appropriate types: float, integer, string…
- Cleaning: fixing any errors or missing values
- Filtering: selecting only what is needed
- Joining: merging disparate data sources
- Normalizing: converting data to common units
- Data Structuring: converting one data format to another, such as JSON, XML, or CSV to database tables
- Feature Engineering: creating KPIs for dashboards or machine learning
- Anonymizing and Encrypting: ensuring privacy and security
- Sorting: ordering the data to improve search performance
- Aggregating: summarizing granular data
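The pandas sketch below strings several of these transformations together (cleaning, typing, normalizing, filtering, joining, and aggregating). The data frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw extracts; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 20, 20, 30],
    "amount": ["19.99", "5.00", "5.00", None],   # strings with a missing value
    "weight_lb": [2.0, 1.5, 1.5, 3.0],
})
customers = pd.DataFrame({"customer_id": [10, 20, 30], "region": ["EU", "US", "US"]})

orders = orders.drop_duplicates()                     # cleaning: remove repeated rows
orders["amount"] = pd.to_numeric(orders["amount"])    # typing: cast strings to floats
orders["amount"] = orders["amount"].fillna(0.0)       # cleaning: fill missing values
orders["weight_kg"] = orders["weight_lb"] * 0.4536    # normalizing: convert to common units
orders = orders[orders["amount"] > 0]                 # filtering: keep only what is needed
enriched = orders.merge(customers, on="customer_id")  # joining: merge disparate sources
summary = enriched.groupby("region")["amount"].sum()  # aggregating: summarize granular data
print(summary)
```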
Schema-on-write
The conventional approach used in ETL pipelines, where the data must be conformed to a defined schema prior to loading into a destination, such as a relational database.
The idea is to have the data consistently structured for stability and to make subsequent queries much faster. But this comes at the cost of limiting the versatility of the data: once you filter, group, or aggregate the data, it has already moved beyond its original raw state.
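The toy example below illustrates that cost with hypothetical sales data: once the rows are aggregated to match a predefined target schema, the row-level detail never reaches the destination and cannot be recovered from it.

```python
import pandas as pd

# Hypothetical raw transactions; under schema-on-write, only the
# pre-aggregated form that matches the target schema gets loaded.
raw = pd.DataFrame({
    "store": ["A", "A", "B"],
    "sale_amount": [10.0, 15.0, 7.5],
})

# Transform to the target schema (one row per store) before loading.
daily_totals = raw.groupby("store", as_index=False)["sale_amount"].sum()

# The destination only ever sees `daily_totals`; the individual sales are gone.
print(daily_totals)
```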
Load
The load phase is all about writing the transformed data to a target system, its new environment.
- The system can be as simple as a comma-separated file, which is essentially just a table of data like an Excel spreadsheet
- A database, which may be part of a much more elaborate system, such as a data warehouse, a data mart, data lake, or some other unified, centralized data store
- The purpose of loading is to make the data available for analysis, modeling, and data-driven decision making by business analysts, managers, executives, data scientists, and users at all levels of the enterprise
In most cases, as data is being loaded into a database, the constraints defined by its schema must be satisfied for the workflow to run successfully. Thus, such requirements imposed on the loading phase help ensure overall data quality.
- The schema, a set of rules called integrity constraints, includes rules such as:
- uniqueness
- referential integrity
- mandatory fields
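A minimal sketch of such constraints, using SQLite and hypothetical table names, shows how a load that violates referential integrity is rejected:

```python
import sqlite3

# Hypothetical schema illustrating the three kinds of integrity constraints.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite must be told to enforce FKs
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- uniqueness
    email       TEXT NOT NULL UNIQUE   -- mandatory field + uniqueness
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)  -- referential integrity
);
""")

# A load that violates referential integrity fails, protecting data quality.
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    conn.execute("INSERT INTO orders VALUES (100, 999)")  # no such customer
except sqlite3.IntegrityError as err:
    print("Load rejected:", err)
```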
Full loading
Full loading is used when you want to start tracking transactions in a new data warehouse or when you want to load an initial history into a database.
To reiterate, there is no existing content when you use full loading.
Incremental Loading
After full loading is complete, you can use incremental loading to insert data that has changed since the previous loading.
With an incremental loading strategy, data is appended in the database and not overwritten.
It is useful for accumulating transaction history. You can categorize incremental loading into stream loading and batch loading, depending on the volume and velocity of data.
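A minimal incremental-loading sketch might look like the following. The table, columns, and the ISO-timestamp “high-water mark” convention are assumptions for illustration, and the target table is assumed to already exist.

```python
import sqlite3

def incremental_load(source_rows: list, target_db: str, last_loaded: str) -> str:
    """Append rows that changed since the previous load; nothing is overwritten.

    `last_loaded` is an ISO-8601 timestamp acting as the high-water mark.
    """
    new_rows = [r for r in source_rows if r["updated_at"] > last_loaded]
    with sqlite3.connect(target_db) as conn:
        conn.executemany(
            "INSERT INTO transactions (id, amount, updated_at) "
            "VALUES (:id, :amount, :updated_at)",
            new_rows,
        )
    # Advance the high-water mark so the next run starts where this one stopped.
    return max((r["updated_at"] for r in new_rows), default=last_loaded)
```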
Stream loading
Stream loading is used when the data is to be loaded in real time. It refers to continuous updates performed on the data warehouse or other storage systems as new data arrives.
It is usually triggered by events, such as real-time data arriving from sensors (thermostats, motion sensors), social media feeds, and IoT devices; by measures, such as when a certain amount of data has been collected or a threshold value is reached; or by user requests for data, such as online videos, music, or web pages.
Batch loading
Batch loading refers to periodic updates pushed to the data warehouse or other storage systems, such as daily, hourly, or weekly updates. Batches can be scheduled, and batch loading is used when it is efficient and effective to load data in batches.
Typical examples include jobs scheduled with Windows Task Scheduler or Linux cron, such as a daily stock update.
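A hypothetical batch-load driver might look like the sketch below; the file paths, table name, and the cron line in the docstring are illustrative only.

```python
"""Hypothetical daily batch-load driver, e.g. scheduled with cron:

    0 2 * * * /usr/bin/python3 /opt/etl/daily_batch_load.py
"""
import sqlite3

import pandas as pd

def run_daily_batch(extract_path: str, warehouse_db: str) -> None:
    # Extract: read everything accumulated since the last scheduled run.
    batch = pd.read_csv(extract_path)
    # Transform: a minimal cleanup step, for illustration only.
    batch = batch.dropna()
    # Load: push the whole batch into the warehouse table in one pass.
    with sqlite3.connect(warehouse_db) as conn:
        batch.to_sql("daily_sales", conn, if_exists="append", index=False)

if __name__ == "__main__":
    run_daily_batch("/data/exports/sales_today.csv", "/data/warehouse.db")
```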
Push loading
A push method is used when the source pushes data into the data warehouse or other storage. While the push method can be used for batch loading, it is best suited to stream loading involving real-time data.
Pull loading
A pull method is used when the data warehouse pulls the data from the source by subscribing to receive the data. It is useful for scheduled transactions and batch loading. Loading can be serial or parallel.
Serial loading
Serial loading is when the data is copied sequentially, one after the other. This is how data loads in the data warehouse by default.
Parallel loading
You can use parallel loading when you need to load data from different sources in parallel, or to split data from one source into chunks and load the chunks in parallel.
- Compared with serial loading, parallel loading is a faster, more optimized approach.
- Parallel loading can be employed on multiple data streams to boost loading efficiency, particularly when the data is big or has to travel long distances.
- Similarly, by splitting a single file into smaller chunks, the chunks can be loaded simultaneously.
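The sketch below splits one large source file into chunks and loads them with a thread pool. The connection string, file name, and table name are hypothetical, and a warehouse that tolerates concurrent writes (plus an installed database driver) is assumed.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical target warehouse; each thread gets its own pooled connection.
engine = create_engine("postgresql://user:pass@warehouse-host/analytics")

def load_chunk(chunk: pd.DataFrame) -> int:
    """Load one chunk; several chunks are loaded at the same time."""
    chunk.to_sql("readings", engine, if_exists="append", index=False)
    return len(chunk)

# Split a single large source file into chunks and load them in parallel.
chunks = pd.read_csv("big_extract.csv", chunksize=50_000)
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = sum(pool.map(load_chunk, chunks))
print(f"Loaded {loaded} rows")
```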
Pipelines
Pipelines are meant to facilitate processing data (or digitizing analog media) and moving it from OLTP systems to OLAP systems.
OLTP & OLAP
- Online Transaction Processing (OLTP) systems don't retain historical data; ETL processes capture the transaction history and prepare it for subsequent analysis in Online Analytical Processing (OLAP) systems, dashboards, and similar tools.
ELT
ELT is a specific, automated data pipeline engineering methodology, similar to ETL except that the order in which the stages are performed differs. ELT is an evolution of ETL.
- Data is extracted and loaded as is into its new environment, where modern analytic tools can be used directly
- It is more dynamic than ETL
- It is used more often in the big data, high-performance computing realm, with cases including:
- massive swings and scale associated with big data
- calculating real-time analytics on streaming big data
- bringing together data sources that are distributed around the globe
- Moving data is a bigger bottleneck than processing it; the less you move it, the better
Why is ELT emerging
- The emergence of cloud computing platforms, with their focus on big data and their ability to handle large amounts of asynchronous data
- Cloud computing resources are virtually unlimited and can be scaled on demand
- On-demand costs are lower than fixed on-premises hardware
- You can separate moving data from transforming it, which avoids information loss because the raw data is copied over as is
Schema-on-read
Schema-on-read relates to the modern ELT approach, where the schema is applied to the raw data after reading it from the raw data storage. This approach is versatile since it can obtain multiple views of the same source data using ad hoc schemas.
Users potentially have access to more data, since it does not need to go through a rigorous preprocessing step. Whether intentional or accidental, there are many ways in which information can be lost in transformation. We can visualize this loss as follows:
- Raw data is normally much bigger than transformed data.
- Since data usually contains noise and redundancy, the information content can be viewed as a proper subset of the data.
- Correspondingly, we can see that shrinking the data volume can also mean shrinking the information content.
For ETL processes, any lost information may or may not be recoverable, whereas with ELT, all the original information content is left intact because the data is simply copied over as is.
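A small schema-on-read sketch: the same raw event file (a hypothetical JSON-lines landing zone with assumed field names) yields two different ad hoc views without any upfront transformation.

```python
import pandas as pd

# The raw events were landed as-is (e.g. JSON lines in a data lake); structure
# is only imposed when we read them, and it can differ per use case.
raw_events = pd.read_json("landing_zone/events.jsonl", lines=True)

# View 1: a narrow clickstream schema for product analytics.
clicks = raw_events[["user_id", "page", "timestamp"]]

# View 2: a different ad hoc schema from the same raw data, for ops monitoring.
errors = raw_events.loc[raw_events["status"] >= 500, ["timestamp", "status", "url"]]
```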
ETL vs ELT
So what are the key differences between the two, aside from the obvious?

| Process | ETL | ELT |
|---|---|---|
| Transformation | within the data pipeline | in the destination environment |
| Flexibility | rigid, engineered to user specs | flexible, users build their own views |
| Big Data | relational data | structured and unstructured |
| Location | on-premises, with scalability constraints | scalable on demand in the cloud |
| Time-to-insight | teams wait on others to specify and develop | self-serve, interactive analytics in real time |
| Intermediate storage | a private staging area where raw data is processed prior to loading | a data lake where raw data is shared with everyone |
| Pain points | lengthy time-to-insight, challenges of Big Data, siloed information | addresses those issues |
| Source & destination databases | different | the same |
| Data transformations handled | complex | simple |
| Data size handled | small relative to ELT | huge |
| Latency | high | low |
| Type of data | static, archived | IoT, live sensor data |
| Processing type | MapReduce | Spark Streaming |
| Activity type | scheduled | real-time |
ETL Workflow
Generally, an ETL workflow is a well thought out process that is carefully engineered to meet technical and end-user requirements.
Traditionally, the overall accuracy of the ETL workflow has been a more important requirement than speed, although efficiency is usually an important factor in minimizing resource costs.
To boost efficiency, data is fed through a data pipeline in smaller packets. While one packet is being extracted, an earlier packet is being transformed, and another is being loaded. In this way, data can keep moving through the workflow without interruption. Any remaining bottlenecks within the pipeline can often be handled by parallelizing slower tasks.
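One way to picture this packet-by-packet flow is a chain of Python generators, where each record moves through extract, transform, and load without waiting for a full batch. The file names and the cleanup step are purely illustrative.

```python
from typing import Iterable, Iterator

def extract(paths: Iterable[str]) -> Iterator[dict]:
    """Yield one raw record (packet) at a time instead of reading everything first."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield {"raw": line.rstrip("\n"), "source": path}

def transform(records: Iterator[dict]) -> Iterator[dict]:
    """Clean each packet as it arrives from the extract stage."""
    for record in records:
        record["clean"] = record["raw"].strip().lower()
        yield record

def load(records: Iterator[dict], out_path: str) -> None:
    """Write each packet as soon as it is transformed; no stage waits for a full batch."""
    with open(out_path, "a") as out:
        for record in records:
            out.write(record["clean"] + "\n")

# The three stages interleave as one streaming workflow.
load(transform(extract(["file_a.txt", "file_b.txt"])), "curated.txt")
```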
With conventional ETL pipelines, data is processed in batches, usually on a repeating schedule that ranges from hours to days apart. For example, records accumulating in an Online Transaction Processing System (OLTP) can be moved as a daily batch process to one or more Online Analytics Processing (OLAP) systems where subsequent analysis of large volumes of historical data is carried out.
Batch processing intervals need not be periodic and can be triggered by events, such as
- when the source data reaches a certain size, or
- when an event of interest occurs and is detected by a system, such as an intruder alert, or
- on-demand, with web apps such as music or video streaming services
Staging Areas
ETL pipelines use staging areas to integrate data from disparate and usually siloed systems within the enterprise.
These systems can be from different vendors, locations, and divisions of the company, which can add significant operational complexity. As an example, a cost accounting OLAP system might retrieve data from distinct OLTP systems utilized by separate payroll, sales, and purchasing departments.
Airflow & ETL Workflow
ETL workflows can involve considerable complexity. By breaking down the details of the workflow into individual tasks and dependencies between those tasks, one can gain better control over that complexity. Workflow orchestration tools such as Apache Airflow do just that.
Airflow represents your workflow as a directed acyclic graph (DAG); a simple example is sketched after the list of operators below. Airflow tasks can be expressed using predefined templates, called operators.
Popular operators include
- Bash operators, for running Bash code, and
- Python operators for running Python code, which makes them extremely versatile for deploying ETL pipelines and many other kinds of workflows into production.
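A minimal Airflow DAG using both operators might look like the sketch below. The DAG id, task names, and commands are hypothetical, and exact import paths and parameters can vary with the Airflow version.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_data():
    # Placeholder for the transform logic of this hypothetical pipeline.
    print("transforming extracted data")

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting source data'",
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_data,
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'loading into the warehouse'",
    )

    extract >> transform >> load  # the dependencies form a directed acyclic graph
```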
Popular ETL Tools
There are many ETL tools available today. Modern enterprise grade ETL tools will typically include the following features:
- Automation: Fully automated pipelines
- Ease of use: ETL rule recommendations
- Drag-and-drop interface: “no-code” rules and data flows
- Transformation support: Assistance with complex calculations
- Security and Compliance: Data encryption and HIPAA, GDPR compliance
Some well-known ETL tools are listed below, along with some of their key features. Both commercial and open-source tools are included in the list.
Talend Open Studio
Open-source data pipeline development and deployment platform.
- Talend Open Studio supports big data migration, data warehousing, and profiling, and it includes collaboration, monitoring, and scheduling capabilities.
- It also has an interactive drag-and-drop GUI, which allows you to create ETL pipelines.
- There is no need to write code as Java code is automatically generated.
- It also connects to many data sources and warehouses, such as Google Sheets, RDBMSs, IBM DB2, and Oracle.
AWS Glue
An enterprise tool: AWS Glue is a fully managed ETL service that makes it easy to prepare and load your data for analytics.
- Glue crawls your data sources to discover data formats and suggests schemas for storing your data, and you can quickly create and run an ETL job using the AWS console.
IBM InfoSphere DataStage
IBM InfoSphere DataStage is a data integration tool for designing, developing, and running both ETL and ELT pipelines.
InfoSphere DataStage is the data integration component of IBM InfoSphere Information Server.
- Like many other platforms, it also provides a drag-and-drop framework for developing workflows.
- InfoSphere DataStage also uses parallel processing and enterprise connectivity to provide a truly scalable platform.
IBM Streams
- IBM Streams is a streaming data pipeline technology that enables you to build real-time analytical applications using the Streams Processing Language (SPL), plus Java, Python, or C++.
- You can use it to blend data in motion with data at rest to deliver continuous intelligence in real time.
- Streams powers a streaming analytics service that allows you to ingest and analyze millions of events per second with sub-millisecond latency.
- IBM Streams comes packaged with IBM Streams Flows, a tool that allows you to drag and drop operators onto a canvas and modify parameters from built-in settings panels.
Alteryx
Alteryx is a highly versatile self-service data analytics platform with multiple products.
- It gives you drag-and-drop access to built-in ETL tools,
- and you don't need to know SQL or programming to create and maintain a complex data pipeline.
Apache Airflow and Python
Apache Airflow, another package based on the Python programming language, is
- a highly versatile and well-known example of an open-source "configuration as code" data pipeline platform.
- Apache Airflow was open-sourced by Airbnb and was created to programmatically author, schedule, and monitor data pipeline workflows.
- It was designed to be scalable and can handle an arbitrary number of parallel compute nodes.
- Airflow integrates with most cloud platforms, including AWS, IBM, Google Cloud, and Microsoft Azure.
Pandas Python Library
Python, along with the Pandas library, is a very popular and highly versatile programming environment for building data pipelines.
- Pandas uses a data structure called a data frame to handle Excel or CSV-style tabular data.
- It’s a great tool for prototyping ETL pipelines and for exploratory data analysis but
- it can be challenging to scale to big data since data frame manipulations must be carried out in memory.
Libraries with similar data frame APIs include Dask, Vaex, and Apache Spark, which can all help you scale up to big data. For scalability, you can also consider SQL-based alternatives to data frame APIs, such as PostgreSQL.
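As a sketch of why pandas is handy for prototyping (and where memory becomes the constraint), the snippet below runs a tiny extract-transform-load round trip, followed by a chunked variant that avoids holding the whole file in memory at once. File paths and column names are hypothetical.

```python
import pandas as pd

# Hypothetical prototype: the whole extract-transform-load round trip in a few
# lines, which is what makes pandas attractive for prototyping ETL pipelines.
frame = pd.read_csv("sales_extract.csv")              # extract
frame["total"] = frame["price"] * frame["quantity"]   # transform
frame.to_csv("curated/sales.csv", index=False)        # load

# Because data frames live in memory, a common workaround for larger files is
# to stream the source in chunks rather than reading it all at once.
for chunk in pd.read_csv("sales_extract.csv", chunksize=100_000):
    chunk["total"] = chunk["price"] * chunk["quantity"]
    chunk.to_csv("curated/sales_big.csv", mode="a", header=False, index=False)
```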
Panoply
Panoply is another enterprise solution, but its focus is on ELT rather than ETL.
- It handles data connection and integration without code and
- comes with SQL functionality so you can generate views of your data.
- This frees your time to focus on data analysis, rather than optimizing your data pipeline.
- Panoply also integrates with many dashboard and BI tools, including Tableau and Power BI.