Data Repositories
Data repositories store data for reporting, analysis, and deriving insights, but they differ in purpose, in the types of data they store, and in how data is accessed. Here is a summary of what we’ll cover in this document:
Repository | Description | Optimized for | Limitations | Use Cases |
---|---|---|---|---|
RDBMS | Built on principles of flat files; tabular format with rows and columns; well-defined structure and schema; ACID compliant; powerful when joining tables | Data operations and querying with SQL and other languages; storage, retrieval, and processing of large volumes of data; uses keys to relate tables; querying multiple tables at once | Not for semi-structured and unstructured data; migrating between two RDBMSes is only possible if both have identical schemas; restricted to input values specified by the schema | OLTP; data warehouses (OLAP); IoT solutions |
NoSQL (non-relational DBs) | Stores any data: key-value (collections of key-value pairs), document-based (record stored in one document), column-based (cells grouped in columns rather than rows, then grouped into column families), graph-based (data stored in a graphical model) | Speed, flexibility, scale; schema-less; handles large volumes of structured, semi-structured, and unstructured data; easily scales out; great for Big Data; built for specific data models | | |
Data Warehouse | Central repository merging data from disparate sources via ETL; stores historical, cleansed, categorized data | Bottom tier: database servers (relational, non-relational); middle tier: OLAP server; top tier: client front-end for querying, reporting, analyzing | | |
Data Mart | Subsection of a data warehouse, built for a specific business function | Provides users with the most relevant data | | |
Data Lake | Stores ALL types of data in their original format; no schema; repository of raw data | Scales fast; no schema; ability to repurpose any data; Big Data capable | | |
Data Lakehouse | Combines a warehouse and a lake | Warehouse: cost-effective, great performance, structure, governance, management, machine learning at high loads, fast. Lake: any type of data at low cost | | |
Big Data Stores | Computational and storage infrastructure to store, scale, and process very large data sets | | | |
Overall, data repositories help to isolate data and make reporting and analytics more efficient and credible while also serving as a data archive. Several factors influence the choice of database, such as
- the data type and structure
- querying mechanisms
- latency requirements
- transaction speeds, and
- intended use of the data.
Relational Databases (RDBMS)
- Built on the organizational principles of flat files, with data organized in a tabular format of rows and columns following a well-defined structure and schema
- Unlike flat files, RDBMSes are optimized for data operations and for querying with SQL and other languages
- Optimized for the storage, retrieval, and processing of large volumes of data, unlike spreadsheets
- RDBMSes use primary and foreign keys to relate tables to each other
- This allows you to query multiple tables at once
- Data can be extracted from an RDBMS using SQL; other languages such as R or Python work too, but SQL remains the most popular (a minimal sketch follows this list)
- SQL can retrieve millions of records in seconds
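Here is that minimal sketch, using Python’s built-in sqlite3 module; the customers and orders tables and their columns are invented purely for illustration:

```python
import sqlite3

# An in-memory SQLite database stands in for any RDBMS; the table and
# column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- key relating the tables
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Query both tables at once through the key relationship.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 65.0), ('Grace', 15.0)]
conn.close()
```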
Use Cases
- OLTP - Online Transaction Processing applications
- Data Warehouses - OLAP
- IoT Solutions
Limitations
- NOT suited for semi-structured and unstructured data
- Migration between two RDBMSes is possible only when the source and destination tables have identical schemas and data types
- Entering a value greater than the defined length of a data field results in loss of information
NoSQL - Non-Relational DBs
- “Not Only SQL” databases emerged in response to the volume, diversity, and speed at which data is being generated today, driven largely by advances in cloud computing, the Internet of Things, and the proliferation of social media.
- Built for speed, flexibility, and scale, non-relational databases make it possible to store data in a schema-less or free-form fashion.
- NoSQL is widely used for processing big data, and NoSQL databases are built for specific data models
- They have flexible schemas and do not use the traditional row/column/table database design
- They do not typically use SQL to query data
- Their advantage is the ability to handle large volumes of structured, semi-structured, and unstructured data
- They can run as distributed systems scaled across multiple data centers
- They easily scale out with the addition of new nodes
- Their simple design, better control over availability, and improved scalability make them agile and flexible and able to support quick iterations (a sketch of the four data models follows at the end of this section)
- Any data can be stored in NoSQL databases
- Key-value data - stored as a collection of key-value pairs. Great for storing user session data and user preferences
- Redis, Memcached, and DynamoDB are examples
- Document-based - store each record and its associated data within a single document. They enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents
- Preferred for eCommerce platforms, medical records, CRM, analytics platforms
- Not the best for complex search queries and multi-operation transactions
- MongoDB, DocumentDB, CouchDB, Cloudant
- Column-based - store data in cells grouped as columns of data instead of rows. A logical grouping of columns, that is, columns that are usually accessed together, is called a column family.
- For example, a customer’s name and profile information will most likely be accessed together but not their purchase history.
- So, customer name and profile information can be grouped into a column family.
- Since column databases store all cells corresponding to a column as a continuous disk entry, accessing and searching the data becomes very fast.
- Column databases can be great for systems that require heavy write requests, storing time-series data, weather data, and IoT data.
- Not a good fit for complex queries or for workloads where querying patterns change frequently.
- The most popular column databases are Cassandra and HBase.
- Graph-based - use a graphical model to represent and store data.
- They are particularly useful for visualizing, analyzing, and finding connections between different pieces of data.
- The circles are nodes, which contain the data.
- The arrows represent relationships.
- Graph databases are an excellent choice for working with connected data, which is data that contains lots of interconnected relationships.
- Graph databases are great for social networks, real-time product recommendations, network diagrams, fraud detection, and access management.
- Not good if you want to process high volumes of transactions because graph databases are not optimized for large-volume analytics queries.
- Neo4J and CosmosDB are some of the more popular graph databases.
- Some non-relational databases come with their own querying tools, such as CQL for Cassandra and Cypher for Neo4j
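Here is that sketch of the four data models, using plain Python structures rather than any particular database’s API; all names and values are hypothetical:

```python
# Hypothetical shapes of the same customer data under each NoSQL model.

# Key-value: an opaque value looked up by its key (e.g. session data).
key_value = {"session:42": '{"user": "ada", "theme": "dark"}'}

# Document: the record and its associated data live in one document.
document = {
    "_id": "cust-1",
    "name": "Ada",
    "orders": [{"id": 10, "amount": 25.0}, {"id": 11, "amount": 40.0}],
}

# Column family: columns that are usually accessed together are grouped;
# profile data and purchase history sit in separate families.
column_families = {
    "cust-1": {
        "profile": {"name": "Ada", "email": "ada@example.com"},
        "purchases": {"2024-01-03": 25.0, "2024-02-11": 40.0},
    }
}

# Graph: nodes hold the data, edges hold the relationships.
nodes = {"cust-1": {"name": "Ada"}, "prod-7": {"title": "Notebook"}}
edges = [("cust-1", "BOUGHT", "prod-7")]
```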
Data Warehouse
Traditionally, data warehouses are known to store relational data from transactional systems and operational databases such as CRM, ERP, HR, and Finance applications.
But with the emergence of NoSQL technologies and new data sources, non-relational data repositories are also being used for data warehousing.
A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence.
Data warehouses store current and historical data that has been cleansed, conformed, and categorized. The data has already been modeled and structured for a specific purpose, so it is analysis-ready.
At a very high-level, the ETL process helps you to extract data from different data sources, transform the data into a clean and usable state, and load the data into the enterprise’s data repository
Examples include:
- Teradata Enterprise Data Warehouse
- Oracle Exadata
- IBM Db2 Warehouse on Cloud
- IBM Netezza Performance Server
- Amazon RedShift
- BigQuery by Google
- Cloudera’s Enterprise Data Hub
- Snowflake Cloud Data Warehouse
Tiered Architecture
Typically, a data warehouse has a three-tier architecture:
- The bottom tier includes the database servers (relational, non-relational, or both) that extract data from different sources
- The middle tier is the OLAP server, a category of software that allows users to process and analyze information coming from multiple database servers
- The top tier is the client front-end layer, which includes all the tools and applications used for querying, reporting, and analyzing data
Data Mart
A data mart is a subsection of the data warehouse, built specifically for a particular business function, purpose, or community of users.
Its purpose is to provide users with data that is most relevant to them when they need it.
- for example, sales or finance groups in an organization
- There are 3 types of Data Marts:
- Dependent - a sub-section of an enterprise data warehouse.
- They pull data from an enterprise data warehouse where it has already been cleaned and transformed
- Independent - created from sources other than an enterprise data warehouse, such as internal operational systems or external data.
- They carry out the transformation themselves, since the data is coming from operational systems and external sources.
- Hybrid - combine inputs from data warehouses, operational systems, and external systems. The three types also differ in how data is extracted from the source systems, the transformations that need to be applied, and how the data is transported into the mart
Data Lakes
Data warehouses and data marts have historically been relational, since much of the traditional enterprise data has resided in RDBMSes.
However, with the emergence of NoSQL technologies and new sources of data, non-relational data repositories are also now being used for Data Warehousing.
A data lake is a data repository that can store large amounts of structured, semi-structured, and unstructured data in its native format. Think of it as the storage area of a large restaurant.
- While a data warehouse stores data that has been cleaned, processed, and transformed for a specific need, you do not need to define the structure and schema of data before loading into the data lake.
- You do not even need to know all of the use cases for which you will ultimately be analyzing the data.
- A data lake exists as a repository of raw data in its native format, straight from the source, to be transformed based on the use case for which it needs to be analyzed.
- This does not mean that a data lake is a place where data can be dumped without governance. While in the data lake, the data is appropriately classified, protected, and governed. A data lake is a reference architecture that is independent of technology.
- Data lakes combine a variety of technologies that come together to facilitate agile data exploration for analysts and data scientists.
- Data lakes can be deployed using
- Cloud Object Storage, such as Amazon S3, or
- large-scale distributed systems such as Apache Hadoop, used for processing Big Data.
- They can also be deployed on different relational database management systems,
- as well as NoSQL data repositories that can store very large amounts of data.
- Data lakes offer a number of benefits, such as:
- The ability to store all types of data – unstructured data such as documents, emails, PDFs, semi-structured data such as JSON, XML, CSV, and logs, as well as structured data from relational databases
- The agility to scale based on storage capacity – growing from terabytes to petabytes of data
- Saving time in defining structures, schemas, and transformations since data is imported in its original format and
- The ability to repurpose data in several different ways and for wide-ranging use cases. This is extremely beneficial, as it is hard for businesses to foresee all the different ways in which they could potentially leverage their data in the future.
- Some of the vendors that provide technologies, platforms, and reference architectures for data lakes include
- Amazon
- Cloudera
- IBM
- Informatica
- Microsoft
- Oracle
- SAS
- Snowflake
- Teradata
- Zaloni
Data Lakehouse
A data lakehouse combines a data warehouse and a data lake, and focuses on the strengths of each:
- we get the cost effectiveness of a data lake,
- and we get the performance and structure of a data warehouse.
- the lakehouse lets us store data from the exploding number of new sources in a low-cost way, comparable to a data lake, and
- then leverages built-in data management and governance layers to power both business intelligence and high-performance machine learning workloads quickly, comparable to a data warehouse
Big Data Stores
Big data stores include the distributed computational and storage infrastructure needed to store, scale, and process very large data sets.
ETL & ELT
Now we come to the process that is at the heart of gaining value from data.
- The Extract, Transform, and Load process, or ETL.
- ETL is how raw data is converted into analysis-ready data.
- It is an automated process in which you gather raw data from identified sources, extract the information that aligns with your reporting and analysis needs, clean, standardize, and transform that data into a format that is usable in the context of your organization; and
- load it into a data repository
While ETL is a generic process, the actual job can vary greatly in usage, utility, and complexity (a minimal sketch of the three steps follows the list below).
- Extract is the step where data from source locations is collected for transformation. Data extraction could be through
- Batch processing, meaning source data is moved in large chunks from the source to the target system at scheduled intervals.
- Tools for batch processing include Stitch and Blendo.
- Stream processing, which means source data is pulled in real-time from the source and transformed while it is in transit and before it is loaded into the data repository.
- Tools for stream processing include Apache Samza, Apache Storm, and Apache Kafka.
- Transform involves the execution of rules and functions that convert raw data into data that can be used for analysis.
- Load is the step where processed data is transported to a destination system or data repository. It could be:
- Initial loading, that is, populating all the data in the repository
- Incremental loading, that is, applying ongoing updates and modifications as needed periodically or
- Full refresh, that is, erasing contents of one or more tables and reloading with fresh data
- Load verification, which includes checking for missing or null values, monitoring server performance, and monitoring load failures, is an important part of this process step.
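Here is that minimal sketch of the three steps in Python; the CSV content, field names, and target table are invented for illustration:

```python
import csv, io, sqlite3

# Extract: read raw records (a CSV string stands in for a real source file).
raw = "name,amount\n Ada ,25\nGrace,40\n,15\n"
records = list(csv.DictReader(io.StringIO(raw)))

# Transform: standardize fields and drop rows that fail a basic check.
cleaned = [
    {"name": r["name"].strip().title(), "amount": float(r["amount"])}
    for r in records
    if r["name"].strip()  # skip rows with a missing name
]

# Load: write the analysis-ready rows into the target repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:name, :amount)", cleaned)
print(conn.execute("SELECT * FROM sales").fetchall())
# -> [('Ada', 25.0), ('Grace', 40.0)]
```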
We’ll cover ETL & ELT in another section of the site.
Data Platform
The architecture of a data platform consists of five layers, each layer responsible for a specific set of tasks. These five layers are
- Data Ingestion or Data Collection Layer
- Data Storage and Integration Layer
- Data Processing Layer
- Analysis and User Interface Layer
- Data Pipeline Layer
Data Integration
In the field of analytics and data science, data integration includes
- accessing, queueing, or extracting data from operational systems
- transforming and merging extracted data, either logically or physically
- data quality and governance, and
- delivering data through an integrated approach for analytics purposes
For example, to make customer data available for analytics, you would need to extract individual customers’ information from operational systems such as sales, marketing, and finance. You would then need to provide a unified view of the combined data so that your users can access, query, and manipulate this data from a single interface to derive statistics, analytics, and visualizations.
How does a data integration platform relate to ETL and data pipelines?
- While data integration combines disparate data into a unified view of the data,
- a data pipeline covers the entire data movement journey from source to destination systems. In that sense, you use a data pipeline to perform data integration, while ETL is a process within data integration.
Data integration solutions typically support the following capabilities:
- An extensive catalog of pre-built connectors and adapters that help you connect and build integration flows with a wide variety of data sources, such as databases, flat files, social media data, APIs, and CRM and ERP applications.
- Open-source architecture that provides greater flexibility and avoids vendor lock-in.
- Optimization for batch processing of large-scale data, for continuous data streams, or for both.
- Integration with Big Data sources.
- Specific demands around data quality and governance, compliance, and security.
- Portability, which ensures that as businesses move to cloud models, they can run their data integration platforms anywhere.
- And data integration tools that are able to work natively in a single cloud, multi-cloud, or hybrid cloud environment.
There are many data integration platforms and tools available in the market, ranging from commercial off-the-shelf tools to open-source frameworks.
- IBM offers a host of data integration tools targeting a range of enterprise integration scenarios, such as
- Information Server for IBM Cloud Pak for Data
- IBM Cloud Pak for Integration
- IBM Data Replication
- IBM Data Virtualization Manager
- IBM InfoSphere Information Server on Cloud
- IBM InfoSphere DataStage
- Talend’s data integration tools include
- Talend Data Fabric
- Talend Cloud
- Talend Data Catalog
- Talend Data Management
- Talend Big Data
- Talend Data Services
- Talend Open Studio
- SAP, Oracle, Denodo, SAS, Microsoft, Qlik, and TIBCO are some of the other vendors that offer data integration tools and platforms.
- Other data integration platforms include
- Dell Boomi
- Jitterbit
- SnapLogic.
There are a significant number of vendors offering cloud-based Integration Platform as a Service (iPaaS) as a hosted service via virtual private cloud or hybrid cloud, such as:
- Adeptia Integration Suite
- Google Cloud’s Cooperation 534
- IBM’s Application Integration Suite on Cloud
- Informatica’s Integration Cloud
Data Extraction
SQL
Data can be extracted from an RDBMS using SQL; other languages such as R or Python can also be used, but SQL remains the most popular.
CQL & Cypher
Some non-relational databases come with their own querying tools, such as CQL for Cassandra and Cypher for Neo4j.
API
Application Programming Interfaces (or APIs) are also popularly used for extracting data from a variety of data sources. APIs are invoked from applications that require the data and access an end-point containing the data.
End-points can include databases, web services, and data marketplaces. APIs are also used for data validation.
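For example, here is a minimal sketch using the requests package (assumed to be installed); the end-point URL, query parameter, and response shape are hypothetical:

```python
import requests

# Hypothetical REST end-point; in practice this would be the URL of a
# database service, web service, or data marketplace API.
url = "https://api.example.com/v1/customers"
response = requests.get(url, params={"country": "CA"}, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors
customers = response.json()   # parsed JSON payload
print(len(customers), "records extracted")
```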
Web Scraping
Web scraping, also known as screen scraping or web harvesting, is used for downloading specific data from web pages based on defined parameters.
Among other things, web scraping is used to extract data such as text, contact information, images, videos, podcasts, and product items from a web property.
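As a minimal sketch using the requests and BeautifulSoup packages (both assumed to be installed); the URL and CSS selectors are placeholders:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors -- adjust both for the page being scraped.
page = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")

# Pull product names and prices out of the parsed HTML.
items = []
for tag in soup.select(".product"):
    name = tag.select_one(".name")
    price = tag.select_one(".price")
    if name and price:
        items.append({"name": name.get_text(strip=True),
                      "price": price.get_text(strip=True)})
print(items)
```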
RSS Feeds
RSS feeds are another source typically used for capturing updated data from online forums and news sites where data is refreshed on an ongoing basis.
Data Streams
Data streams are a popular source for aggregating constant streams of data flowing from sources such as instruments, IoT devices and applications, and GPS data from cars.
Data streams and feeds are also used for extracting data from social media sites and interactive platforms.
Data Exchange Platforms
Data Exchange platforms allow the exchange of data between data providers and data consumers. Data Exchanges have a set of well-defined exchange standards, protocols, and formats relevant for exchanging data.
Data Wrangling
Data wrangling, also known as data munging, is an iterative process that involves data exploration, transformation, validation, and making data available for a credible and meaningful analysis.
Some of the tasks involved in wrangling are:
Structuring
This task includes actions that change the form and schema of your data. The incoming data can be in varied formats. You might, for example, have some data coming from a relational database and some data from Web APIs.
In order to merge them, you will need to change the form or schema of your data. This change may be as simple as changing the order of fields within a record or dataset or as complex as combining fields into complex structures.
- Joins and Unions are the most common structural transformations used to combine data from one or more tables. How they combine the data is different.
- Joins combine columns. When two tables are joined together, columns from the first source table are combined with columns from the second source table—in the same row. So, each row in the resultant table contains columns from both tables.
- Unions combine rows. Rows of data from the first source table are combined with rows of data from the second source table into a single table. Each row in the resultant table is from one source table or another.
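A quick way to see the difference is with pandas (assumed to be installed); merge combines columns on a key, while concat stacks rows. The small DataFrames are invented for illustration:

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})
orders = pd.DataFrame({"id": [1, 1, 2], "amount": [25.0, 40.0, 15.0]})

# Join: columns from both tables end up in the same row, matched on the key.
joined = customers.merge(orders, on="id", how="inner")

# Union: rows from two like-shaped tables are stacked into one table.
q1 = pd.DataFrame({"id": [1], "amount": [25.0]})
q2 = pd.DataFrame({"id": [2], "amount": [15.0]})
unioned = pd.concat([q1, q2], ignore_index=True)

print(joined)
print(unioned)
```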
Normalization
Normalization focuses on cleaning the database of unused data and reducing redundancy and inconsistency.
Data coming from transactional systems, for example, where a number of insert, update, and delete operations are performed on an ongoing basis, is highly normalized.
Denormalization
Denormalization is used to combine data from multiple tables into a single table so that it can be queried faster.
For example, normalized data coming from transactional systems is typically denormalized before running queries for reporting and analysis.
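As a small sketch of the idea in SQLite, the normalized customers and orders tables below are materialized into one wide reporting table so later analytical queries avoid repeated joins; the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized source tables, as a transactional system might keep them.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada', 'East'), (2, 'Grace', 'West');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 15.0);

    -- Denormalized reporting table: one wide row per order, no join needed later.
    CREATE TABLE order_report AS
    SELECT o.id AS order_id, c.name, c.region, o.amount
    FROM orders o JOIN customers c ON c.id = o.customer_id;
""")
print(conn.execute("SELECT * FROM order_report").fetchall())
```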
Cleaning
Cleaning tasks are actions that fix irregularities in data in order to produce a credible and accurate analysis.
The first step in the data cleaning workflow is to detect the different types of issues and errors that your dataset may have. You can use scripts and tools that allow you to define specific rules and constraints and validate your data against these rules and constraints. You can also use data profiling and data visualization tools for inspection.
Data Profiling
Data profiling helps you inspect the source data to understand the structure, content, and interrelationships in your data. It uncovers anomalies and data quality issues.
For example, blank or null values, duplicate data, or whether the value of a field falls within the expected range.
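As a small, hypothetical sketch of such checks with pandas (assumed to be installed); the demo DataFrame is invented:

```python
import pandas as pd

# Tiny invented dataset with the kinds of issues profiling should surface.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, 29, 420],      # a null and an out-of-range value
    "country": ["CA", "CA", "CA", "US"],
})

print(df.isnull().sum())                          # blank/null values per column
print(df.duplicated(subset="customer_id").sum())  # duplicate keys
print(df.describe())                              # ranges reveal the implausible age
print(df["age"].between(0, 120).value_counts())   # expected-range check
```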
Visualizing
Visualizing the data using statistical methods can help you to spot outliers. For example, plotting income values in a demographic dataset can reveal entries that fall far outside the expected range. That brings us to the actual cleaning of the data.
Tools
- Excel Power Query / Spreadsheets and Add-ins
- OpenRefine - an open-source tool that allows you to import and export data in a wide variety of formats: TSV, CSV, XLS, XML, JSON
- Google DataPrep - a cloud service that allows you to visually explore, clean, and prepare both structured and unstructured data. It suggests what the next step should be, and detects schemas, data types, and anomalies.
- IBM Data Refinery (part of Watson Studio) - allows you to discover, cleanse, and transform data with built-in operations
- Trifacta Wrangler - a cloud-based service known for its collaboration features
- Python - has a large number of libraries and tools for processing data: Jupyter Notebook, NumPy, Pandas
- R - offers a large number of packages: dplyr, data.table, jsonlite, …
DataOps Methodology
The DataOps methodology enables an organization to use a repeatable process to build and deploy analytics and data pipelines. Successful implementation of this methodology allows an organization to know, trust, and use data to drive value.
It ensures that the data used in problem-solving and decision making is relevant, reliable, and traceable and improves the probability of achieving desired business outcomes. And it does so by tackling the challenges associated with inefficiencies in accessing, preparing, integrating, and making data available.
In a nutshell, the DataOps Methodology consists of three main phases:
- The Establish DataOps Phase provides guidance on how to set up the organization for success in managing data.
- The Iterate DataOps Phase delivers the data for one defined sprint.
- The Improve DataOps Phase ensures that learnings from each sprint are channeled back to continually improve the DataOps process.
The figure below presents a high-level overview of these phases and the key activities within each of these phases.
Benefits of DataOps
Adopting the DataOps methodology helps organizations to organize their data and make it more trusted and secure. Using the DataOps methodology, organizations can:
- Automate metadata management and catalog data assets, making them easy to access.
- Trace data lineage to establish its credibility and for compliance and audit purposes.
- Automate workflows and jobs in the data lifecycle to ensure data integrity, relevancy, and security.
- Streamline the workflow and processes to ensure data access and delivery needs can be met at optimal speed.
- Ensure a business-ready data pipeline that is always available for all data consumers and business stakeholders.
- Build a data-driven culture in the organization through automation, data quality, and governance.
Governance & Compliance
Governance
Data Governance is a collection of principles, practices, and processes that help maintain the security, privacy, and integrity of data through its lifecycle.
Personal Information and Sensitive Personal Information, that is, data that can be traced back to an individual or can be used to identify or cause harm to an individual, needs to be protected through governance regulations.
General Data Protection Regulation, or GDPR, is one such regulation that protects the personal data and privacy of EU citizens for transactions that occur within EU member states. Regulations, such as HIPAA (Health Insurance Portability and Accountability Act) for Healthcare, PCI DSS (Payment Card Industry Data Security Standard) for retail, and SOX (Sarbanes Oxley) for financial data are some of the industry-specific regulations.
Data Acquisition
In the Data Acquisition stage, you need to establish
- What data needs to be collected and the contracts and consent that give you a legal basis for procuring this data.
- The intended use of this data, published as a privacy policy and communicated internally and with individuals whose data is being collected.
- The amount of data you need to meet your defined purposes. For example, does the email address meet your purpose, or do you also need to have the phone numbers and pin codes?
Data Processing
In the Data Processing stage, you will be required to establish
- The details of how you are going to process personal data.
- And your legal basis for the processing of personal data, such as a contract or consent.
Data Storage
In the Data Storage stage, you will be required to establish
- Where the data will be stored, including specific measures that will be taken to prevent internal and external security breaches.
Data Sharing
In the Data Sharing stage, you will establish
- Which third-party vendors in your supply chain may have access to the data you are collecting.
- as well as how you will hold them contractually accountable to the same regulations you are liable for.
Data Retention
In the Data Retention and Disposal stages, you will be required to
- establish what policies and processes you will follow for the retention and deletion of personal data after a designated time
- and how you will ensure that, when data is deleted, it is removed from all locations, including third-party systems.
At each of these stages, you will be required to maintain an auditable trail of personal data acquisition, processing, storage, access, retention, and deletion.
Compliance
Compliance covers the processes and procedures through which an organization adheres to regulations and conducts its operations in a legal and ethical manner.
Compliance requires organizations to maintain an auditable trail of personal data through its lifecycle, which includes acquisition, processing, storage, sharing, retention, and disposal of data.
Now let’s look at some of the controls made available through different tools and technologies for ensuring compliance to governance regulations.
Authentication and Access Control
Today’s platforms offer layered authentication processes, such as a combination of passwords, tokens, and biometrics, to protect against unauthorized access to data.
- Authentication systems are designed to verify that you are who you say you are.
- Access control systems ensure that authorized users have access to resources, both systems and data, based on their user group and role.
Encryption
- Using encryption, data is converted to an encoded format that is only legible once it is decrypted via a secure key.
- Encryption of data is available both for data at rest, as it resides in the storage systems, and data in transit, as it moves through browsers, services, applications, and storage systems.
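As a minimal sketch of encrypting data at rest with the third-party cryptography package (assumed to be installed); in practice the key would live in a key management service:

```python
from cryptography.fernet import Fernet

# Generate a secret key; real systems store this in a key management service.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt before writing to storage (data at rest) ...
token = cipher.encrypt(b"email=ada@example.com")

# ... and decrypt only when an authorized process needs the plain value.
print(cipher.decrypt(token))  # b'email=ada@example.com'
```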
Data Masking
Data masking provides anonymization of data for downstream processing, as well as pseudonymization of data.
- Using anonymization, the presentation layer is abstracted without changing the data in the database itself, for example by replacing characters with symbols when they are displayed on the screen.
- Pseudonymization of data is a de-identification process where personally identifiable information is replaced with artificial identifiers so that a data set cannot be traced back to an individual’s identity—for example, replacing the name with a random value from the names dictionary.
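Here is a small standard-library sketch of the two ideas; the masking rule is simplified, and the pseudonym is produced with a salted hash rather than a dictionary lookup:

```python
import hashlib

def mask(value: str, visible: int = 2) -> str:
    """Anonymize for display: keep a couple of characters, hide the rest."""
    return value[:visible] + "*" * (len(value) - visible)

def pseudonymize(value: str, salt: str = "s3cret") -> str:
    """Replace an identifier with a stable artificial one."""
    # The salt would normally come from secure configuration, not the code.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

print(mask("ada.lovelace@example.com"))         # 'ad' followed by asterisks
print(pseudonymize("ada.lovelace@example.com")) # deterministic 12-character token
```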
Monitoring and Alerting functionalities
Security monitoring helps to proactively monitor, track, and react to security violations across infrastructure, applications, and platforms.
- Monitoring systems also provide detailed audit reports that track access and other operations on the data.
- Alerting functionalities flag security breaches as they occur so that immediate remedial actions can be triggered. The alerts are based on the severity and urgency level of the breach, which is pre-defined in the system.
Data Erasure
Data erasure is a software-based method of permanently clearing data from a system by overwriting. This is different from a simple deletion of data since deleted data can still be retrieved.
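Here is a simplified sketch of overwrite-before-delete for a single file; real erasure tools also account for filesystem and storage-hardware behavior that this toy example ignores, and the file path is a placeholder:

```python
import os, secrets

def overwrite_and_delete(path: str, passes: int = 3) -> None:
    """Overwrite a file's contents with random bytes before removing it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            f.flush()
            os.fsync(f.fileno())   # push the overwrite to disk
    os.remove(path)

# Example usage (placeholder file name):
# overwrite_and_delete("customer_export.csv")
```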