Spark

Quick Start with Spark


Local paths:
D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python
D:\machine_learn\0-AI_Learning-Map\2.1-Tech-Spark\pyspark-Tutorial

7/11: tried https://spark.apache.org/docs/latest/quick-start.html

How to run pyspark in Jupyter Notebook

PySpark Tutorial - Learn to use Apache Spark with Python
http://jonamjar.com/2016/07/13/run-pyspark-on-jupyter-notebook-for-windows-users/


CMD> spark-shell   (enters the Scala shell)
CMD> pyspark       (enters the Python shell)
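
Once in the pyspark shell, a quick smoke test along the lines of the quick-start guide (assuming the shell is started from SPARK_HOME, where the distribution's README.md lives):

>>> textFile = spark.read.text("README.md")    # DataFrame with one row per line
>>> textFile.count()                           # number of lines in the file
>>> textFile.filter(textFile.value.contains("Spark")).count()    # lines mentioning Spark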


To run PySpark in Jupyter, set two new user environment variables
(the value must be the executable name, lowercase jupyter):
Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook

Open a command prompt and enter
CMD> pyspark   ### a Jupyter notebook server opens up
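
For reference, the two variables can also be set from a command prompt with setx (the new values are picked up by newly opened consoles, not the current one):

CMD> setx PYSPARK_DRIVER_PYTHON jupyter
CMD> setx PYSPARK_DRIVER_PYTHON_OPTS notebook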


How to set up a Jupyter/IPython profile for PySpark

CMD> ipython profile create pyspark

[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_config.py'
[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_kernel_config.py'

create/modify C:\Users\WGONG\.ipython\profile_pyspark\startup\00-pyspark-setup.py
(put it under profile_pyspark, the profile just created, so the script runs when that profile is loaded)

import os
import sys

# Locate the Spark installation via SPARK_HOME
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# Put PySpark and py4j on the Python path
# (the py4j version must match the Spark distribution)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

# Spark 1.4+ requires "pyspark-shell" in PYSPARK_SUBMIT_ARGS;
# this must be set BEFORE shell.py launches the JVM below
spark_release_file = os.path.join(spark_home, 'RELEASE')
if os.path.exists(spark_release_file) and "Spark 2.1.1" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run PySpark's shell.py to create the SparkContext (sc) and SparkSession (spark)
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
        
        
CMD> jupyter notebook --profile=pyspark
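
If the startup script ran, the first notebook cell should already have sc and spark defined; a minimal sanity check (expected values assuming Spark 2.1.1):

In [1]: sc.version
Out[1]: '2.1.1'

In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950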


Issue #1
after setting the two driver variables above, spark-submit no longer works and fails with the following error:

D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python>spark-submit pi.py
Traceback (most recent call last):
  File "C:\Anaconda3\Scripts\Jupyter-script.py", line 5, in <module>
    sys.exit(jupyter_core.command.main())
  File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 186, in main
    _execvp(command, sys.argv[1:])
  File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 104, in _execvp
    raise OSError('%r not found' % cmd, errno.ENOENT)

OSError: [Errno None not found] 2

Following https://stackoverflow.com/questions/42263691/jupyter-notebook-interferes-with-spark-submit:
spark-submit also honors PYSPARK_DRIVER_PYTHON, so with that variable set globally it tries to
run pi.py with jupyter instead of python. The issue was resolved by removing the two user variables

Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook

and creating D:\spark-2.1.1-bin-hadoop2.7\bin\pyspark-jupyter.bat,
so the notebook is now launched by entering
CMD> pyspark-jupyter
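
The contents of pyspark-jupyter.bat are not recorded above; a minimal sketch of what such a wrapper might contain, assuming it only sets the two driver variables for the current console session:

@echo off
REM Point the PySpark driver at Jupyter for this session only (a sketch, not the recorded file)
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
REM pyspark now starts a Jupyter notebook server instead of the plain shell
pyspark %*

Because set (unlike setx) changes only the current console session, the variables disappear when the window closes, and spark-submit keeps working everywhere else.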


