Quick Start with Spark
Local paths:
D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python
D:\machine_learn\0-AI_Learning-Map\2.1-Tech-Spark\pyspark-Tutorial
7/11: tried https://spark.apache.org/docs/latest/quick-start.html
http://jonamjar.com/2016/07/13/run-pyspark-on-jupyter-notebook-for-windows-users/
CMD> spark-shell    (enters the Scala shell)
CMD> pyspark        (enters the Python shell)
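As a first sanity check inside the pyspark shell, the opening example from the quick-start page can be run as below (assuming the shell was started from a directory containing Spark's README.md):

    # Spark 2.x quick start: load a text file as a DataFrame
    textFile = spark.read.text("README.md")
    print(textFile.count())    # number of lines in the file
    print(textFile.first())    # first line, as a Row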
To run PySpark in a Jupyter notebook, set two new user environment variables:
Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook
Then open a command prompt and enter:
CMD> pyspark    ### a Jupyter notebook opens up
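Once the notebook opens, a minimal check confirms the kernel is actually wired to Spark; pyspark's shell startup predefines sc and spark, so no imports are needed:

    # sc and spark are created by pyspark/shell.py at startup
    print(sc.version)                           # should print 2.1.1
    print(sc.parallelize(range(100)).sum())     # 4950 => jobs run fine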
How to set up an IPython profile for the Jupyter notebook:
CMD> ipython profile create pyspark
[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_config.py'
[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_kernel_config.py'
Then create/modify C:\Users\WGONG\.ipython\profile_default\startup\00-pyspark-setup.py with the following startup script:
import os
import sys

# locate the Spark installation from SPARK_HOME
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

# Spark 2.x needs "pyspark-shell" in PYSPARK_SUBMIT_ARGS; set it
# BEFORE shell.py runs, or the SparkContext will not start
spark_release_file = os.path.join(spark_home, 'RELEASE')
if os.path.exists(spark_release_file) and "Spark 2.1.1" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# run pyspark's shell.py, which creates the sc and spark objects
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
CMD> jupyter notebook --profile=pyspark
Issue #1
After launching the notebook this way, spark-submit no longer works; it fails with the following error:
D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python>spark-submit pi.py
Traceback (most recent call last):
File "C:\Anaconda3\Scripts\Jupyter-script.py", line 5, in <module>
sys.exit(jupyter_core.command.main())
File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 186, in main
_execvp(command, sys.argv[1:])
File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 104, in _execvp
raise OSError('%r not found' % cmd, errno.ENOENT)
OSError: [Errno None not found] 2
Following https://stackoverflow.com/questions/42263691/jupyter-notebook-interferes-with-spark-submit, this was resolved by removing the two user variables:
Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook
and creating D:\spark-2.1.1-bin-hadoop2.7\bin\pyspark-jupyter.bat, so that the Jupyter notebook is started by entering:
CMD> pyspark-jupyter
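The contents of the batch file are not recorded above; based on the Stack Overflow answer, a minimal sketch would set the two driver variables only for that console session and then launch pyspark (the %* pass-through is an assumption here):

    @echo off
    REM Set the driver variables for this session only, so that
    REM spark-submit run from other consoles stays unaffected.
    set PYSPARK_DRIVER_PYTHON=jupyter
    set PYSPARK_DRIVER_PYTHON_OPTS=notebook
    pyspark %*

With the permanent user variables removed, spark-submit pi.py runs normally again.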