1. Create the kernel directory (pyspark is the kernel name):
mkdir -p ~/.ipython/kernels/pyspark
2. Create the kernel file:
touch ~/.ipython/kernels/pyspark/kernel.json
3. Add the following JSON to the file created above:
{
  "display_name": "pySpark (Spark 1.6.0)",
  "language": "python",
  "argv": [
    "/opt/anaconda2/bin/python2.7",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark/",
    "PYTHONPATH": "/opt/cloudera/parcels/CDH/lib/spark/python/:/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell"
  }
}
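Instead of typing the JSON by hand, you can generate it with a short Python script. This is a sketch that assumes the same CDH and Anaconda paths used in the article; adjust them for your own cluster.

```python
import json
import os

# Paths below mirror the article's CDH / Anaconda layout -- adjust as needed.
spark_home = "/opt/cloudera/parcels/CDH/lib/spark/"
kernel_dir = os.path.expanduser("~/.ipython/kernels/pyspark")

kernel_spec = {
    "display_name": "pySpark (Spark 1.6.0)",
    "language": "python",
    "argv": [
        "/opt/anaconda2/bin/python2.7",
        "-m", "IPython.kernel",
        "-f", "{connection_file}",
    ],
    "env": {
        "SPARK_HOME": spark_home,
        "PYTHONPATH": spark_home + "python/:"
                      + spark_home + "python/lib/py4j-0.9-src.zip",
        "PYTHONSTARTUP": spark_home + "python/pyspark/shell.py",
        "PYSPARK_SUBMIT_ARGS": "--master yarn --deploy-mode client pyspark-shell",
    },
}

# Create the kernel directory if it does not exist, then write kernel.json.
if not os.path.isdir(kernel_dir):
    os.makedirs(kernel_dir)
with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(kernel_spec, f, indent=2)
```

Writing the file this way also guarantees the JSON is syntactically valid, which a hand-edited file does not.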
4. Create a new IPython profile, then create a startup script inside it:
ipython profile create pyspark
touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
5. Add the following code to the startup script:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
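The hard-coded py4j-0.9-src.zip path in the startup script breaks whenever a Spark upgrade bundles a different py4j version. A glob-based variant sidesteps that; this is a sketch, with a fallback SPARK_HOME assumed for illustration (the final execfile line from the script above would still follow it on Python 2).

```python
import glob
import os
import sys

# Fall back to the article's CDH path if SPARK_HOME is unset (illustrative only).
spark_home = os.environ.get('SPARK_HOME',
                            '/opt/cloudera/parcels/CDH/lib/spark/')

# Find whatever py4j zip ships with this Spark build instead of
# hard-coding the py4j version number.
py4j_zips = glob.glob(os.path.join(spark_home, 'python', 'lib',
                                   'py4j-*-src.zip'))

sys.path.insert(0, os.path.join(spark_home, 'python'))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])
```

With this change the same startup script survives a Spark parcel upgrade without editing.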
6. Launch Jupyter Notebook and select the pySpark kernel:
jupyter notebook