Implementing Predictive Analytics with Spark in Azure HDInsight
This course is part of the Microsoft Professional Program Certificate in Data Science.
The course will start on Jul 1, 2017, and is scheduled to end on Sep 30, 2017 at 23:59 UTC.
What you will learn:
- Use Spark to explore data and prepare it for modeling
- Build supervised machine learning models
- Evaluate and optimize models
- Build recommenders and unsupervised machine learning models
Further Action Item: learn and use Spark.
Further Reading
The lessons in this module have provided you with an introduction to Spark in Azure HDInsight, and should help you get started with Spark clusters. Use the following resources to learn more about working with Spark:
- Documentation for Microsoft Azure HDInsight, including Spark clusters, is at https://azure.microsoft.com/en-us/documentation/services/hdinsight.
- Documentation and getting started guidance for programming with Scala is at http://www.scala-lang.org/documentation/.
- Documentation and getting started guidance for programming with Python is at https://www.python.org/doc/.
- You can view the Spark SQL and DataFrames Programming Guide at https://spark.apache.org/docs/latest/sql-programming-guide.html.
- Review the Spark Machine Learning Programming Guide at https://spark.apache.org/docs/latest/ml-guide.html, including:
  - Classification and Regression: https://spark.apache.org/docs/latest/ml-classification-regression.html
  - Pipelines: https://spark.apache.org/docs/latest/ml-pipeline.html
  - Model Selection and Tuning: https://spark.apache.org/docs/latest/ml-tuning.html
  - Collaborative Filtering: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
  - Clustering: https://spark.apache.org/docs/latest/ml-clustering.html
7/10: Intro to DS with Spark > Explore Data with Spark
Spark MLlib has two APIs:
- spark.mllib: the original RDD-based API
- spark.ml: the newer DataFrame-based API
df = spark.read.csv('wasb:///...')    # create a DataFrame from CSV
df.count()                            # count rows
df = df.dropDuplicates()              # remove duplicate rows (returns a new DataFrame)
df = df.fillna(0, subset="arrDelay")  # replace missing arrDelay values with 0
df.select('name', 'price')            # project columns
df.filter('price > 2.00')             # filter rows
df.createOrReplaceTempView('t1')      # register a temp view so the DataFrame can be queried with SQL
df = spark.sql('select * from t1')
In Jupyter, you can write inline SQL against a temp view using the %%sql magic:
%%sql
SELECT * FROM t1
flights = spark.read.csv('wasb:///data/raw-flight-data.csv', schema=flightSchema, header=True)
flights.show(truncate=False) # don't truncate string field
airports = spark.read.csv('wasb:///data/airports.csv', header=True, inferSchema=True)
airports.printSchema()
Join and group:
flightOrigin = flights.join(airports, flights.OriginAirportId == airports.airport_id).groupBy("city").count()
flightOrigin.show()
Pipeline: chains DataFrame transformations and model fitting into a single, reusable workflow.
Done with videos; work on labs next.