Overview
Data Types
Data is unorganized information that is processed to make it meaningful. Generally, data comprises facts, observations, perceptions, numbers, characters, symbols, and images that can be interpreted to derive meaning.
One of the ways in which data can be categorized is by its structure. Data can be:
- Structured
- Semi-structured
- Unstructured
Structured data
Data that is well organized in formats that can be stored in databases and lends itself to standard data analysis methods and tools.
- Has a well-defined structure or adheres to a specified data model.
- Can be stored in well-defined schemas such as databases and, in many cases, can be represented in a tabular manner with rows and columns.
- Consists of objective facts and numbers that can be collected, exported, stored, and organized in typical databases.
- Some of the sources of structured data could include:
- SQL Databases and Online Transaction Processing (or OLTP) Systems that focus on business transactions
- Spreadsheets such as Excel and Google Spreadsheets
- Online forms
- Sensors such as Global Positioning Systems (or GPS) and Radio Frequency Identification (or RFID) tags
- Network and Web server logs
- You can typically store structured data in relational or SQL databases.
- You can also easily examine structured data with standard data analysis methods and tools.
Semi-structured data
Data that is somewhat organized and relies on meta tags for grouping and hierarchy.
- Has some organizational properties but lacks a fixed or rigid schema.
- Cannot be stored directly in the form of rows and columns as in relational databases.
- It contains tags and elements, or metadata, which is used to group data and organize it in a hierarchy.
- Some of the sources of semi-structured data could include:
- E-mails
- XML and other markup languages
- Binary executables
- TCP/IP packets
- Zipped files
- Integration of data from different sources
- XML and JSON allow users to define tags and attributes to store data in a hierarchical form and are used widely to store and exchange semi-structured data.
Unstructured data
Data that is not conventionally organized in the form of rows and columns in a particular format.
- Does not have an easily identifiable structure and, therefore, cannot be organized in a mainstream relational database in the form of rows and columns.
- Does not follow any particular format, sequence, semantics, or rules.
- Unstructured data can deal with the heterogeneity of sources and has a variety of business intelligence and analytics applications
- Some of the sources of unstructured data could include:
- Web pages
- Social media feeds
- Images in varied file formats (such as JPEG, GIF, and PNG)
- Video and Audio files
- Documents and PDF files
- PowerPoint presentations
- Media logs
- Surveys
- Unstructured data can be stored in files and documents (such as a Word doc) for manual analysis, or in NoSQL databases that have their own analysis tools for examining this type of data.
File Formats
Standard
Standard file formats include:
- Delimited text file formats
- Microsoft Excel Open XML Spreadsheet, or XLSX
- Extensible Markup Language, or XML
- Portable Document Format, or PDF
- JavaScript Object Notation, or JSON
Delimited text files
Text files used to store data as text, in which each line, or row, has values separated by a delimiter. A delimiter is a sequence of one or more characters that specifies the boundary between independent entities or values.
- Any character can be used to separate the values, but the most common delimiters are the comma, tab, colon, vertical bar, and space.
- Comma-separated values (or CSVs) and tab-separated values (or TSVs) are the most commonly used file types in this category.
- In CSVs, the delimiter is a comma, while in TSVs, the delimiter is a tab.
- When literal commas are present in text data and therefore cannot be used as delimiters, TSVs serve as an alternative to the CSV format, because tab stops are infrequent in running text.
- Each row, or horizontal line, in the text file has a set of values separated by the delimiter and represents a record.
- The first row works as a column header, where each column can have a different type of data.
- Delimited files allow field values of any length and are considered a standard format for providing straightforward information schema.
- They can be processed by almost all existing applications.
- Delimiters also represent one of various means to specify boundaries in a data stream.
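As a minimal sketch of the points above, the following reads the same two records once as CSV and once as TSV using Python's standard `csv` module. The sample data, the column names, and the helper `read_delimited` are hypothetical and exist only for illustration. Note that the CSV version has to quote the name that contains a literal comma, while the TSV version does not.

```python
import csv
import io

# Hypothetical sample data: the same two records as CSV and as TSV.
# In the CSV text, the value containing a comma must be quoted.
csv_text = 'name,city\n"Smith, Jane",Boston\nLee,Austin\n'
tsv_text = "name\tcity\nSmith, Jane\tBoston\nLee\tAustin\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

print(csv_rows[0]["name"])   # the quoted value keeps its comma
print(csv_rows == tsv_rows)  # both files describe the same records
```

The first row is treated as the column header, and every following row becomes one record.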
Excel XML or XLSX
- Microsoft Excel Open XML Spreadsheet, or XLSX, is a spreadsheet file format.
- It is an XML-based file format created by Microsoft.
- An .XLSX file, also known as a workbook, can contain multiple worksheets.
- Each worksheet is organized into rows and columns, at the intersection of which is the cell.
- XLSX uses the open file format, which means it is generally accessible to most other applications.
XML
Extensible Markup Language, or XML, is a markup language with set rules for encoding data.
- The XML file format is readable by both humans and machines.
- It is a self-descriptive language designed for sending information over the internet.
- XML is similar to HTML in some respects but also differs from it; for example, XML does not use predefined tags like HTML does.
- XML is platform independent and programming language independent and therefore simplifies data sharing between various systems.
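To illustrate user-defined tags, here is a small sketch that parses an XML document with Python's standard `xml.etree.ElementTree` module. The `catalog`, `product`, `name`, and `price` tags and their values are invented for this example; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all tag and attribute names are user-defined.
xml_text = """
<catalog>
  <product id="p1">
    <name>Widget</name>
    <price currency="USD">9.99</price>
  </product>
  <product id="p2">
    <name>Gadget</name>
    <price currency="USD">24.50</price>
  </product>
</catalog>
""".strip()

root = ET.fromstring(xml_text)

# Walk the hierarchy and flatten each <product> into a dict.
products = [
    {
        "id": p.get("id"),
        "name": p.findtext("name"),
        "price": float(p.findtext("price")),
    }
    for p in root.findall("product")
]

for product in products:
    print(product)
```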
PDF
Portable Document Format, or PDF, is a file format developed by Adobe to present documents independent of application software, hardware, and operating systems, which means it can be viewed the same way on any device.
- This format is frequently used in legal and financial documents and can also be used for fillable forms.
JSON
JavaScript Object Notation, or JSON, is a text-based open standard designed for transmitting structured data over the web.
- The file format is a language-independent data format that can be read in any programming language.
- JSON is easy to use, is compatible with a wide range of browsers, and is considered one of the best tools for sharing data of any size and type, even audio and video (binary content such as audio or video has to be encoded as text first, for example with Base64).
- That is one reason many APIs and Web Services return data as JSON.
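A short sketch of a JSON round trip with Python's standard `json` module follows. The `record` structure and its field names are hypothetical; the point is only that nested data is serialized to text and read back unchanged.

```python
import json

# Hypothetical record with nested, hierarchical data.
record = {
    "customer": "Jane",
    "active": True,
    "orders": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

payload = json.dumps(record)   # serialize to a text string for transmission
decoded = json.loads(payload)  # parse the text back into Python objects

total_qty = sum(order["qty"] for order in decoded["orders"])
print(payload)
print(total_qty)
```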
Sources of Data
Some of the most common sources of data are:
Relational Databases
- Most businesses use relational databases such as SQL Server, Oracle, MySQL, IBM DB2, or others to store data in a structured way.
Flat Files
- Store data in plain text format, with one record or row per line, and each value separated by delimiters such as commas, semi-colons, or tabs.
- Data in a flat file maps to a single table, unlike relational databases that contain multiple tables.
- One of the most common flat-file formats is CSV, in which values are separated by commas.
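The sketch below shows one way a flat file can map to a single table: a small CSV text is loaded into one table of an in-memory SQLite database and then queried with SQL. The file contents, the `orders` table, and its column names are assumptions made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical flat file: one record per line, values separated by commas.
flat_file = "order_id,customer,amount\n1,Jane,20.5\n2,Lee,7.25\n3,Jane,4.0\n"
rows = list(csv.DictReader(io.StringIO(flat_file)))

# The whole file maps to a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows
)

# Once loaded, the data can be examined with standard SQL.
totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
)
print(totals)
```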
Purchased Datasets
- There are companies that sell specific data, for example, point-of-sale data, financial data, or weather data, which businesses can use to define strategy, predict demand, and make decisions related to distribution or marketing promotions, among other things. Such data sets are typically made available as flat files, spreadsheet files, or XML documents.
APIs & Web Services
- APIs and Web Services typically listen for incoming requests, which can be in the form of web requests from users or network requests from applications, and return data in plain text, XML, HTML, JSON, or media files.
Web Scraping
- Web scraping is used to extract relevant data from unstructured sources.
- Also known as screen scraping, web harvesting, and web data extraction, web scraping makes it possible to download specific data from web pages based on defined parameters.
- Web scrapers can, among other things, extract text, contact information, images, videos, product items, and much more from a website.
- Some popular uses of web scraping include
- collecting product details from retailers, manufacturers, and eCommerce websites to provide price comparisons;
- generating sales leads through public data sources;
- extracting data from posts and authors on various forums and communities; and
- collecting training and testing datasets for machine learning models.
- Some of the popular web scraping tools include BeautifulSoup, Scrapy, Pandas, and Selenium.
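As a rough illustration of extracting specific data from a page based on defined parameters, the sketch below pulls product names and prices out of an HTML string. It deliberately uses Python's standard `html.parser` module rather than any of the tools named above, so it runs without extra installs. The HTML snippet, the CSS class names, and the `ProductParser` class are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
html_text = """
<html><body>
  <div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collect the text of name/price spans inside product divs."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = ProductParser()
parser.feed(html_text)
products = parser.products
print(products)
```

Dedicated scraping libraries handle malformed markup, navigation, and rate limiting far more robustly than this sketch does.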
Data Streams & Feeds
- Data streams are another widely used source, aggregating constant streams of data flowing from instruments, IoT devices, applications, GPS data from cars, computer programs, websites, and social media posts.
- This data is generally timestamped and also geo-tagged for geographical identification.
- Some of the data streams and ways in which they can be leveraged include:
- stock and market tickers for financial trading;
- retail transaction streams for predicting demand and supply chain management;
- surveillance and video feeds for threat detection;
- social media feeds for sentiment analysis;
- sensor data feeds for monitoring industrial or farming machinery;
- web click feeds for monitoring web performance and improving design; and
- real-time flight events for rebooking and rescheduling.
- Some popular applications used to process data streams include Apache Kafka, Apache Spark Streaming, and Apache Storm.
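The idea of processing timestamped events as they arrive can be sketched in plain Python with a generator and a fixed-size window. This is only an in-process illustration, not how the streaming platforms above are used; the readings, `sensor_stream`, and `rolling_average` are hypothetical names.

```python
from collections import deque


def sensor_stream():
    """Yield hypothetical timestamped sensor readings one at a time."""
    readings = [(0, 20.0), (1, 21.0), (2, 22.0), (3, 30.0), (4, 23.0)]
    for timestamp, value in readings:
        yield {"ts": timestamp, "value": value}


def rolling_average(events, window_size):
    """Yield (timestamp, average of the last window_size values) per event."""
    window = deque(maxlen=window_size)  # oldest value drops out automatically
    for event in events:
        window.append(event["value"])
        yield event["ts"], sum(window) / len(window)


averages = list(rolling_average(sensor_stream(), window_size=3))
for ts, avg in averages:
    print(ts, round(avg, 2))
```

Each event is handled once, in arrival order, and only the most recent values are kept in memory, which is the property that makes this style suitable for unbounded streams.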
RSS
RSS (or Really Simple Syndication) feeds are another popular data source.
Metadata
Metadata is data that provides information about other data.
We have to consider the following three main types of metadata:
- Technical metadata
- Process metadata, and
- Business metadata
Technical Metadata
Can include tables and a data catalog:
Tables
Tables that record information about the tables stored in a database, like:
- each table’s name
- the number of columns and rows each table has
Data Catalog
An inventory of tables that contain information like:
- the name of each database in the enterprise data warehouse
- the name of each column present in each database
- the names of every table that each column is contained in
- the type of data that each column contains
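As a small-scale sketch of technical metadata, the following builds a tiny catalog for an in-memory SQLite database: table names, column names and types, and row counts. The `customers` and `orders` tables are hypothetical; `sqlite_master` and `PRAGMA table_info` are SQLite's own mechanisms for describing a database, and other database systems expose similar information differently.

```python
import sqlite3

# Hypothetical database with two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Jane"), (2, "Lee")])

# sqlite_master is SQLite's built-in table describing the stored tables.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

catalog = {}
for table in table_names:
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    catalog[table] = {
        "columns": {col[1]: col[2] for col in columns},
        "rows": row_count,
    }

print(catalog)
```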
Process Metadata
Process metadata describes the processes that operate behind business systems such as data warehouses, accounting systems, or customer relationship management tools.
Many important enterprise systems are responsible for collecting and processing data from various sources. Such critical systems need to be monitored for failures and any performance anomalies that arise. Process metadata for such systems includes tracking things like:
- process start and end times
- disk usage
- where data was moved from and to, and
- how many users access the system at any given time
This sort of data is invaluable for troubleshooting and optimizing workflows and ad hoc queries.
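One possible shape for a process metadata record is sketched below: it captures start and end times and where data was moved from and to. The `ProcessRun` class, its fields, and the job, source, and target names are all hypothetical; real systems typically write such records to a log or monitoring store.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessRun:
    """Hypothetical record of one run of a data-moving process."""

    name: str
    source: str  # where data was moved from
    target: str  # where data was moved to
    started_at: float = field(default_factory=time.time)
    ended_at: Optional[float] = None
    rows_moved: int = 0

    def finish(self, rows_moved):
        """Mark the run as complete and record how much data it moved."""
        self.rows_moved = rows_moved
        self.ended_at = time.time()

    @property
    def duration(self):
        """Elapsed seconds, or None while the run is still in progress."""
        if self.ended_at is None:
            return None
        return self.ended_at - self.started_at


run = ProcessRun("nightly_load", source="crm_export.csv", target="warehouse.customers")
run.finish(rows_moved=1200)
print(run)
```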
Business Metadata
Users who want to explore and analyze data within and outside the enterprise are typically interested in data discovery. They need to be able to find data which is meaningful and valuable to them and know where that data can be accessed from.
Business metadata is information about the data described in readily interpretable ways, such as:
- how the data is acquired
- what the data is measuring or describing
- the connection between the data and other data sources
Business metadata also serves as documentation for the entire data warehouse system.
Metadata Management
Managing metadata includes developing and administering policies and processes to ensure information can be accessed and integrated from various sources and appropriately shared across the entire enterprise.
Data Catalog
Creation of a reliable, user-friendly data catalog is a primary objective of a metadata management model. The data catalog is a core component of a modern metadata management system, serving as the main asset around which metadata management is administered.
- It serves as the basis by which companies can inventory and efficiently organize their data systems.
- A modern metadata management model will include a web-based user interface that enables engineers and business users to easily search for and find information on key attributes such as CustomerName or ProductType.
- This kind of model is central to any Data Governance initiative.
- Having access to a well-implemented data catalog greatly enhances data discovery, repeatability, and governance, and can also facilitate access to data.
- Well managed metadata helps you to understand both the business context associated with the enterprise data and the data lineage, which helps to improve data governance.
- Data lineage provides information about the origin of the data and how it gets transformed and moved, and thus it facilitates tracing of data errors back to their root cause.
Data Governance
Data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives.
The key focus areas of data governance include availability, usability, consistency, data integrity, and data security. Data governance also includes establishing processes to ensure effective data management throughout the enterprise, such as accountability for the adverse effects of poor data quality and ensuring that the data an enterprise holds can be used by the entire organization.
Popular tools
Popular metadata management tools include:
- IBM InfoSphere Information Server
- CA Erwin Data Modeler
- Oracle Warehouse Builder
- SAS Data Integration Server
- Talend Data Fabric
- Alation Data Catalog
- SAP Information Steward
- Microsoft Azure Data Catalog
- IBM Watson Knowledge Catalog
- Oracle Enterprise Metadata Management (OEMM)
- Adaptive Metadata Manager
- Unifi Data Catalog
- data.world
- Informatica Enterprise Data Catalog