Azure HDInsight

What is Azure HDInsight?

Azure HDInsight is a Microsoft Azure platform-as-a-service (PaaS) offering that deploys and provisions Apache Hadoop clusters in the cloud. It provides a software framework designed to manage, analyze, and report on Big Data.

Why is it needed?

Azure HDInsight comes into the picture when our data grows so quickly from day to day that it becomes difficult to process with on-hand database management tools or traditional data processing applications.

For example, consider a cloud-based management platform that captures water meter readings and processes millions of records every day, as shown below:

The volume of data being processed is therefore as follows:

Because the data grows every day, the system has to manage it as Big Data. The challenges in managing this kind of data are capture, storage, search, analysis, and visualization.

This is where Azure HDInsight comes in: it brings enterprise-ready Apache Hadoop to the cloud. Users can gain insights through Excel, while developers can work in .NET, Java, and more.

How to manage this Big Data?

We can build a Hadoop cluster in minutes when we need it and delete it when our work is done.

We can run our Big Data applications on Windows Server or Linux, because Azure HDInsight provides the Hortonworks Data Platform (HDP) and lets us take advantage of existing Apache Hadoop projects, including Apache Pig, Hive, and Sqoop.
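To give a feel for the kind of job a Hadoop cluster runs, here is a minimal sketch in plain Python of the map-and-reduce pattern that Hadoop is built around, applied to the water meter example. It does not call HDInsight or any Hadoop API. The record layout (meter ID, litres), the sample values, and the function names `map_reading`, `reduce_readings`, and `total_per_meter` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample readings as (meter_id, litres) pairs.
# In a real deployment these records would be read from cluster storage.
READINGS = [
    ("meter-1", 120),
    ("meter-2", 80),
    ("meter-1", 30),
    ("meter-3", 55),
    ("meter-2", 20),
]


def map_reading(reading):
    """Map step: emit a (key, value) pair for one raw reading."""
    meter_id, litres = reading
    return meter_id, litres


def reduce_readings(pairs):
    """Reduce step: sum the values that share the same key."""
    totals = defaultdict(int)
    for meter_id, litres in pairs:
        totals[meter_id] += litres
    return dict(totals)


def total_per_meter(readings):
    """Apply the map step to every reading, then reduce the results."""
    return reduce_readings(map_reading(r) for r in readings)


if __name__ == "__main__":
    for meter_id, total in sorted(total_per_meter(READINGS).items()):
        print(meter_id, total)
```

On a cluster, the same two steps run in parallel across many machines, which is what makes millions of records per day manageable.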

We can analyze unstructured data from Azure HDInsight in Microsoft Excel and use Excel features such as PowerPivot, Power Query, and Power View to perform dynamic analysis on the combined data set. We can also map our data with Power Map, a 3D mapping tool in Excel.

Azure HDInsight lets us write code on the platform of our choice, such as .NET or Java.

Azure HDInsight builds on the four key drivers of Big Data, as shown in the figure below:

Who needs it?

Organizations or users facing the following challenges:

Getting real-time data

Data is being created faster than ever before. Users want to analyze this data without advanced technical knowledge.

Storing unstructured data

Today’s data doesn’t fit easily into relational databases, because up to 85% of newly captured data is unstructured: text files, videos, images, sensor data, and web logs. Customers need to manage all of this data alongside the relational data in their databases and data warehouses.
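As a small illustration of the gap between raw data and relational rows, the sketch below turns one web log line into a structured record using Python's standard `re` module. The log layout (Common Log Format), the sample line, and the `parse_log_line` helper are assumptions for illustration. They are not something HDInsight prescribes.

```python
import re

# Assumed layout: Common Log Format, for example
# 203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] "GET /meters/42 HTTP/1.1" 200 512
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; treat it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [10/Oct/2014:13:55:36 +0000] '
        '"GET /meters/42 HTTP/1.1" 200 512'
    )
    print(parse_log_line(sample))
```

Lines that do not match the assumed layout come back as `None`, so a caller can count or set aside malformed input instead of failing on it.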

Lack of skills

Many businesses today have not implemented Hadoop because they lack in-house expertise.

Poor integration with BI tools

Customers cannot bring in data from Hadoop clusters because their BI tools are not integrated with Hadoop.