SQL Server + Hadoop + Azure Data = PolyBase
In this blog I will introduce you to PolyBase. Instead of writing a paragraph, I tried to make it more interesting by converting them in sequence of questions. Please read all questions alphabetically.
Fig 1.0 PolyBase
A). What is PolyBase?
PolyBase query engine integrates SQL Server with external data in Hadoop or Azure Blob storage or Data Lake. You can import and export data as well as executing query.
B). What is Hadoop?
Hadoop is a solution, a tool for managing Big Data.
C). What is Big Data?
Big Data describes a massive volume of structured, semi-structured, and unstructured data collected within an organization. This data is so large that it is difficult to process using common database management tools or traditional data processing applications
D). Small Data vs Big Data?
|Data from traditional enterprise systems like Enterprise resource planning, Customer relationship management(CRM), Financial Data like general ledger data, Payment transaction data from website etc.. are small data.||Big data generates from the system like Purchase data from point-of-sale, Click stream data from public websites, GPS Mobility stream data, Social media – facebook, twitter etc..|
|Size of data range is in tens or hundreds of GB.||Size of Data range is in TBs ( 1 TB=1000 GB), PB or ZB.|
|Manually Entered Data||Machine or Action generated Data|
|Small Data is used in Business Intelligence, Analysis and Reporting.||Complex data mining for prediction, recommendation, pattern finding etc.|
|Structured Data||Semi – Structured Data, Non – Structured Data|
|Quality Data||Quality of data not guaranteed|
E). What are the Types of Big Data Formats?
- Structured data within a database management system (DBMS)
- Documents (Word, Excel, PowerPoint, PDF, etc.)
- Email Messages
- Text Messages
- Text Files
- HTML Files
- Audio / Voice
- Log Files
- Data Files
- XML Files
- JSON Files
- Binary Files
F). What is structured data, semi-structured data and non-structured data?
|Structured Data||Structured Data is defined in fixed data model, data is formatted and relational||Low||Database, Data warehouse, Data marts, Manually Entered Data|
|Semi Structured||Semi Structured Data structure with a probable pattern defined in variable data model||Medium||Excel, XML|
|Quasi Structured||Textual data with arbitrary format. This data needs tool and effort to format||High||Public Web, Click data, GPS Navigation, Logs|
|Unstructured||Data has no integral structure and usually stored in files.||Very High||Sensor Data, PDFs, Text Files, Video, Audio, Images|
Fig 2.0 Spectrum of Data
G). What is the need of Big Data?
The eventual objective for data analysis to get timely insights to support decision-making.
Small Data helps find answers to questions you already know i.e known dimensions. Big Data helps you find the questions beyond your dimensions, which you want to ask.
The way we analyze data is crystallizing day by day.
|Static Report||Interactive Report||Advance Machine Learning and Forecasting reports|
H). Where the Big Data is stored?
Generally, 80-90% of Big Data which is unstructured data. Stored where schema definition is not required at write time (and can be defined at code level at read time) – generally NoSQL databases.
Some of the NoSQL databases are MongoDB, Cassandra, HBase, Oracle NoSQL, Amazon DynamoDB, Couchbase, CouchDB
I). How to process and read Big Data?
Hadoop is a set of open source programs and procedures. Its Java-based programming framework that supports the processing and storage of excessively large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
J). How Big can be Big Data?
An organization that has Big Data may have billions to trillions of records stored their organization’s databases, file systems, and storage units which could contain zettabytes (1000 petabytes), exabytes (1000 petabytes), petabytes (1000 terabytes), terabytes (1000 gigabytes/GB) of data.
K). What is Azure Data Lake and Blob Storage?
Azure Data Lakes and Blob storage is to store an unlimited amount of structured, semi-structured or unstructured data from a variety of sources. The service does not impose limits on account sizes, file sizes, or the amount of data that can be stored.
Blob storage is good at non-text based files – database backups, photos, videos and audio files.
Whereas data lake bit better at large volumes of text data.
L). SQL Vs Hadoop Tools?
Big data is difficult to work with using most relational database management systems, business intelligence and analytics applications, statistics tool and visualization packages. When dealing with extremely large datasets, organizations face complications in being able to create, manipulate, manage, transfer and query the data.
SQL cannot read Hadoop data directly as Data structure, Data integrity, Schema, Query Language and Method to store data is different.
M). How SQL can read Hadoop Data, Azure Data Lake Data and Blob Storage Data?
In SQL Server Parallel Data warehouse (APS), you can query in T-SQL across structured and unstructured data with data stored on Hadoop Clusters, Azure Data Lake and Blob Storage.
Just imagine, being able to query Big Data in a single statement, all based on the T-SQL Syntax you are familiar with.
The same query can also access relational tables in your SQL Server. PolyBase enables the same query to also join the data from Hadoop and SQL Server.
Fig 3.0 PolyBase Flow.
With PolyBase, if you use SQL Server 2016 as a data source, you also have access to Hadoop and Azure Blob Storage regardless of your environment.
Thanks for reading 🙂
Keep reading, share your thoughts, experiences. Feel free to contact us to discuss more. If you have any suggestion / feedback / doubt, you are most welcome.
Stay tuned on Knowledge-Junction, will come up with more such articles