Spark

Quick Start with Spark


Local paths:
D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python
D:\machine_learn\0-AI_Learning-Map\2.1-Tech-Spark\pyspark-Tutorial

7/11: tried https://spark.apache.org/docs/latest/quick-start.html

How to run pyspark in Jupyter Notebook

PySpark Tutorial - Learn to use Apache Spark with Python
http://jonamjar.com/2016/07/13/run-pyspark-on-jupyter-notebook-for-windows-users/


CMD> spark-shell   (enters the Scala shell)
CMD> pyspark       (enters the Python shell)
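
Once in the pyspark shell, a quick smoke test along the lines of the quick-start guide (assuming the shell is started from SPARK_HOME, where the distribution's README.md lives):

>>> textFile = spark.read.text("README.md")    # DataFrame with one row per line
>>> textFile.count()                           # number of lines in the file
>>> textFile.filter(textFile.value.contains("Spark")).count()    # lines mentioning Spark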


To run PySpark in Jupyter, set two new user environment variables
(the value must be the executable name, lowercase jupyter):
Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook

Open a command prompt and enter
CMD> pyspark   ### a Jupyter notebook server opens up
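
For reference, the two variables can also be set from a command prompt with setx (the new values are picked up by newly opened consoles, not the current one):

CMD> setx PYSPARK_DRIVER_PYTHON jupyter
CMD> setx PYSPARK_DRIVER_PYTHON_OPTS notebook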


How to set up a Jupyter/IPython profile for PySpark

CMD> ipython profile create pyspark

[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_config.py'
[ProfileCreate] Generating default config file: 'C:\\Users\\WGONG\\.ipython\\profile_pyspark\\ipython_kernel_config.py'

create/modify C:\Users\WGONG\.ipython\profile_pyspark\startup\00-pyspark-setup.py
(put it under profile_pyspark, the profile just created, so the script runs when that profile is loaded)

import os
import sys

# Locate the Spark installation via SPARK_HOME
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError("SPARK_HOME environment variable is not set")

# Put PySpark and py4j on the Python path
# (the py4j version must match the Spark distribution)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

# Spark 1.4+ requires "pyspark-shell" in PYSPARK_SUBMIT_ARGS;
# this must be set BEFORE shell.py launches the JVM below
spark_release_file = os.path.join(spark_home, 'RELEASE')
if os.path.exists(spark_release_file) and "Spark 2.1.1" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run PySpark's shell.py to create the SparkContext (sc) and SparkSession (spark)
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
        
        
CMD> jupyter notebook --profile=pyspark
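
If the startup script ran, the first notebook cell should already have sc and spark defined; a minimal sanity check (expected values assuming Spark 2.1.1):

In [1]: sc.version
Out[1]: '2.1.1'

In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950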


Issue #1
after setting the two driver variables above, spark-submit no longer works and fails with the following error:

D:\spark-2.1.1-bin-hadoop2.7\examples\src\main\python>spark-submit pi.py
Traceback (most recent call last):
  File "C:\Anaconda3\Scripts\Jupyter-script.py", line 5, in <module>
    sys.exit(jupyter_core.command.main())
  File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 186, in main
    _execvp(command, sys.argv[1:])
  File "C:\Anaconda3\lib\site-packages\jupyter_core\command.py", line 104, in _execvp
    raise OSError('%r not found' % cmd, errno.ENOENT)

OSError: [Errno None not found] 2

Following https://stackoverflow.com/questions/42263691/jupyter-notebook-interferes-with-spark-submit:
spark-submit also honors PYSPARK_DRIVER_PYTHON, so with that variable set globally it tries to
run pi.py with jupyter instead of python. The issue was resolved by removing the two user variables

Variable Name: PYSPARK_DRIVER_PYTHON, Value: jupyter
Variable Name: PYSPARK_DRIVER_PYTHON_OPTS, Value: notebook

and creating D:\spark-2.1.1-bin-hadoop2.7\bin\pyspark-jupyter.bat,
so the notebook is now launched by entering
CMD> pyspark-jupyter
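
The contents of pyspark-jupyter.bat are not recorded above; a minimal sketch of what such a wrapper might contain, assuming it only sets the two driver variables for the current console session:

@echo off
REM Point the PySpark driver at Jupyter for this session only (a sketch, not the recorded file)
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
REM pyspark now starts a Jupyter notebook server instead of the plain shell
pyspark %*

Because set (unlike setx) changes only the current console session, the variables disappear when the window closes, and spark-submit keeps working everywhere else.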


