Building Machine Learning solutions using Spark ML on Azure

by Bhushan Patil

A manufacturing company running Spark machine learning solutions on its local servers wanted to move its data and ML workloads to the Microsoft Azure cloud for better scalability and easier maintenance. The company engaged Saviant to build a solution that would help it overcome these challenges. Our team used Azure HDInsight Spark, an Azure-hosted offering of open-source Apache Spark that supports common data science tasks such as data ingestion, feature engineering, modeling, and model evaluation. HDInsight Spark clusters also provide Jupyter notebooks with PySpark, the Python API for Spark, which can run interactive Spark SQL queries to transform, filter, and visualize data stored in Azure Blob storage (WASB).
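As a small illustration of how data in Blob storage is addressed from Spark, WASB paths follow the pattern wasb[s]://<container>@<account>.blob.core.windows.net/<path>. The helper below is only a sketch: the function name and the container, account, and file names in the usage comment are hypothetical and do not come from the project described here.

```python
def wasb_url(container: str, account: str, blob_path: str, secure: bool = True) -> str:
    """Build a WASB URL for a blob in an Azure Storage account.

    All argument values are supplied by the caller; nothing here is
    specific to the project described in this article.
    """
    # "wasbs" is the TLS-encrypted variant of the "wasb" scheme.
    scheme = "wasbs" if secure else "wasb"
    # Drop any leading slash so the URL does not contain a double slash.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{blob_path.lstrip('/')}"


# Hypothetical usage:
# wasb_url("sensordata", "examplestorage", "input/readings.csv")
```

A URL built this way can be passed to Spark's readers in the same way as any other file path.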

To migrate the machine learning solutions to Azure, we used Spark MLlib, Spark's scalable machine learning library. MLlib provides common ML algorithms for regression, classification, collaborative filtering, clustering, and dimensionality reduction, along with the underlying optimization primitives.


Implementation Steps to configure Spark Cluster on HDInsight

  1. Created a new Spark cluster on Azure HDInsight using the Azure portal.
  2. Created a Jupyter notebook containing the Python code to run the ML experiment.
  3. Read the CSV file from Blob storage to create the input data frame.
  4. Trained a linear model on the input data frame.
  5. Predicted severity and certainty using the model.
  6. Returned a data frame containing the predicted severity and certainty from the Python code.

Benefits of using Apache Spark on Azure

  • Spark is often described as the Swiss Army knife of big data analytics.
  • Spark on HDInsight fills a gap and makes "big data" more approachable for Microsoft customers.
  • Customers can easily connect to a Spark cluster from Power BI.
  • Azure makes it easy to work with data housed in Azure Storage containers.