Implementing Predictive Analytics with Spark in Azure HDInsight
This course is part of the Microsoft Professional Program Certificate in Data Science.
The course will start on Jul 1, 2017, and is scheduled to end on Sep 30, 2017 at 23:59 UTC.
What you will learn:
- Use Spark to explore data and prepare it for modeling
- Build supervised machine learning models
- Evaluate and optimize models
- Build recommenders and unsupervised machine learning models
Further Action Item: learn and use Spark.
Further Reading
The lessons in this module have provided you with an introduction to Spark in Azure HDInsight, and should help you get started with Spark clusters. Use the following resources to learn more about working with Spark:
- Documentation for Microsoft Azure HDInsight, including Spark clusters, is at https://azure.microsoft.com/en-us/documentation/services/hdinsight.
- Documentation and getting started guidance for programming with Scala is at http://www.scala-lang.org/documentation/.
- Documentation and getting started guidance for programming with Python is at https://www.python.org/doc/.
- You can view the Spark SQL and DataFrames Programming Guide at https://spark.apache.org/docs/latest/sql-programming-guide.html.
- Review the Spark Machine Learning Programming Guide at https://spark.apache.org/docs/latest/ml-guide.html, including:
  - Classification and Regression: https://spark.apache.org/docs/latest/ml-classification-regression.html
  - Pipelines: https://spark.apache.org/docs/latest/ml-pipeline.html
  - Model Selection and Tuning: https://spark.apache.org/docs/latest/ml-tuning.html
  - Collaborative Filtering: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
  - Clustering: https://spark.apache.org/docs/latest/ml-clustering.html
7/10: Intro to DS with Spark > Explore Data with Spark
Spark MLlib has two APIs:
- spark.mllib: the original RDD-based API
- spark.ml: the newer DataFrame-based API
df = spark.read.csv('wasb:///...')    # create a DataFrame from CSV
df.count()                            # count rows
df = df.dropDuplicates()              # remove duplicate rows (returns a new DataFrame)
df = df.fillna(0, subset="arrDelay")  # replace missing arrDelay values with 0
df.select('name', 'price')            # project columns
df.filter('price > 2.00')             # filter rows
df.createOrReplaceTempView('t1')      # register a temp view so the DataFrame can be queried with SQL
df = spark.sql('select * from t1')
In Jupyter, you can write inline SQL against a temp view using the %%sql magic:
%%sql
SELECT * FROM t1
flights = spark.read.csv('wasb:///data/raw-flight-data.csv', schema=flightSchema, header=True)
flights.show(truncate=False) # don't truncate string field
airports = spark.read.csv('wasb:///data/airports.csv', header=True, inferSchema=True)
airports.printSchema()
Join and group:
flightOrigin = flights.join(airports, flights.OriginAirportId == airports.airport_id).groupBy("city").count()
flightOrigin.show()
Pipeline: chains DataFrame transformations and model fitting into a single, reusable workflow.
Done with videos; work on labs next.