My Jupyter Docker Full Stack
William Trigos
Introduction
There is little question that big data analytics, data science, artificial
intelligence (AI), and machine learning (ML), a subcategory of AI,
have all experienced a tremendous surge in popularity over the last
few years. Behind the hype curves and marketing buzz, these
technologies are having a significant influence on all aspects of our
modern lives. Due to their popularity and potential benefits,
academic institutions and commercial enterprises are rushing to train
large numbers of Data Scientists and ML and AI Engineers.
Featured Technologies
The following technologies are featured prominently in this post.
Jupyter Notebooks
According to Project Jupyter, the Jupyter Notebook, formerly known
as the IPython Notebook, is an open-source web application that
allows users to create and share documents that contain live code,
equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical
modeling, data visualization, machine learning, and much more. The
word, Jupyter, is a loose acronym for Julia, Python, and R, but today, Jupyter supports many programming languages. Interest in Jupyter Notebooks has grown dramatically.
Apache Spark
According to Apache, Spark is a unified analytics engine for large-
scale data processing, used by well-known, modern enterprises, such
as Netflix, Yahoo, and eBay. With speeds up to 100x faster than
Hadoop, Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG (Directed Acyclic
Graph) scheduler, a query optimizer, and a physical execution engine.
PySpark
The Spark Python API, PySpark, exposes the Spark programming model to Python. PySpark is built on top of Spark's Java API. Data is processed in Python and cached and shuffled in the JVM.
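As a minimal sketch of that model (assuming it is run inside the Jupyter container, where PySpark is pre-installed), the few lines below start a Spark session from Python and run a simple aggregation; the sample data is purely illustrative:

from pyspark.sql import SparkSession

# Start a local Spark session; the driver code runs in Python,
# while execution happens inside the JVM (via Py4J).
spark = SparkSession.builder \
    .appName('pyspark_intro') \
    .getOrCreate()

# A small, purely illustrative dataset created in Python...
data = [('Bread', 3), ('Coffee', 5), ('Muffin', 2)]
df = spark.createDataFrame(data, ['item', 'quantity'])

# ...is converted to a Spark DataFrame and processed by the JVM.
df.groupBy('item').sum('quantity').show()

spark.stop()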
Docker
According to Docker, their technology gives developers and IT the freedom to build, manage, and secure business-critical applications without the fear of technology or infrastructure lock-in. Although Kubernetes is now the leading open-source container orchestration platform, Docker is still the predominant underlying container engine technology. For this post, I am using Docker Desktop Community for macOS.
Docker Swarm
Current versions of Docker include both a Kubernetes and a Swarm orchestrator for deploying and managing containers. We will choose Swarm for this demonstration. According to Docker, the cluster management and orchestration features embedded in the Docker Engine are built using SwarmKit. SwarmKit is a separate project that implements Docker's orchestration layer and is used directly within Docker.
PostgreSQL
PostgreSQL is a powerful, open source object-relational database
system. According to their website, PostgreSQL comes with many features that help developers build applications, help administrators protect data integrity and build fault-tolerant environments, and help manage data no matter how big or small the dataset.
Demonstration
To show the capabilities of the Jupyter development environment, I
will demonstrate a few typical use cases, such as executing Python
scripts, submitting PySpark jobs, working with Jupyter Notebooks,
and reading and writing data to and from files of different formats and to a database. We will be using the jupyter/all-spark-notebook Docker
Image. This image includes Python, R, and Scala support for Apache
Spark, using Apache Toree.
Architecture
As shown below, we will stand up a Docker stack consisting of Jupyter All-Spark-Notebook, PostgreSQL 11.3, and Adminer containers. The Docker stack will have local directories bind-mounted
into the containers. Files from our GitHub project will be shared with
the Jupyter application container through a bind-mounted directory.
Our PostgreSQL data will also be persisted through a bind-mounted
directory. This allows us to persist data external to the ephemeral
containers.
Source Code
All open-sourced code for this post can be found on GitHub. Use the
following command to clone the project. The post and project code were updated on 6/7/2019.
git clone \
--branch master --single-branch --depth 1 --no-tags \
https://github.com/garystafford/pyspark-setup-demo.git
Source code samples are displayed as GitHub Gists, which may not
display correctly on some mobile and social media browsers.
The Docker stack file used for this demonstration is shown below (gist):

version: "3.7"
services:
  pyspark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888/tcp"
      - "4040:4040/tcp"
    networks:
      - pyspark-net
    working_dir: /home/$USER/work
    environment:
      CHOWN_HOME: "yes"
      GRANT_SUDO: "yes"
      NB_UID: 1000
      NB_GID: 100
      NB_USER: $USER
      NB_GROUP: staff
    user: root
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - $PWD/work:/home/$USER/work
  postgres:
    image: postgres:11.3
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres1234
      POSTGRES_DB: demo
    ports:
      - "5432:5432/tcp"
    networks:
      - pyspark-net
    volumes:
      - $HOME/data/postgres:/var/lib/postgresql/data
    deploy:
      restart_policy:
        condition: on-failure
  adminer:
    image: adminer:latest
    ports:
      - "8080:8080/tcp"
    networks:
      - pyspark-net
    deploy:
      restart_policy:
        condition: on-failure
networks:
  pyspark-net:
Deploying the stack with the docker stack deploy command creates the stack's pyspark-net network and the three containers. To confirm the stack deployed successfully, you can list its services and tasks, for example with docker stack ps. The Jupyter container's log output includes the URL and access token for the notebook server.
Using the URL and token shown in the log output, you will be able to
access the Jupyter web-based user interface on localhost port 8888.
Once there, from Jupyter dashboard landing page, you should see all
the files in the project’s work/ directory.
Also shown below, note the types of files you are able to create from the dashboard, including Python 3, R, Scala (using Toree or the spylon-kernel), and text. You can also open a Jupyter Terminal or create a new Folder. To start with something simple, we will run the following basic Python script, 01_simple_script.py (gist).
#!/usr/bin/python

import random

technologies = ['PySpark', 'Python', 'Spark', 'Scala', 'JVM',
                'Project Jupyter', 'PostgreSQL']
print("Technologies: %s" % technologies)

technologies.sort()
print("Sorted: %s" % technologies)

print("I'm interested in learning %s." % random.choice(technologies))
Run the script from within the Jupyter container, from a Jupyter
Terminal window:
python ./01_simple_script.py
Kaggle Datasets
To explore the features of the Jupyter Notebook container and
PySpark, we will use a publicly available dataset from Kaggle. Kaggle is a fantastic open resource for datasets used for big data and ML applications. Their tagline is 'Kaggle is the place to do data science projects'.
The bakery transactions dataset, BreadBasket_DMS.csv, contains 21,294 rows, each with four columns of data. Although certainly nowhere near 'big data', the dataset is large enough to test out the Jupyter container functionality (gist).
Date        Time      Transaction  Item
2016-10-30  09:58:11  1            Bread
2016-10-30  10:05:34  2            Scandinavian
2016-10-30  10:05:34  2            Scandinavian
2016-10-30  10:07:57  3            Hot chocolate
2016-10-30  10:07:57  3            Jam
2016-10-30  10:07:57  3            Cookies
2016-10-30  10:08:41  4            Muffin
2016-10-30  10:13:03  5            Coffee
2016-10-30  10:13:03  5            Pastry
2016-10-30  10:13:03  5            Bread
The following PySpark script, 02_bakery_dataframes.py, reads the bakery CSV file into a Spark DataFrame, using an explicit schema, and displays the first ten rows (gist):

#!/usr/bin/python

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .appName('pyspark_demo_app') \
    .config('spark.driver.extraClassPath',
            'postgresql-42.2.5.jar') \
    .getOrCreate()

sc = spark.sparkContext

bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df3 = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('BreadBasket_DMS.csv', schema=bakery_schema)

df3.show(10)

Run the script from a Jupyter Terminal window:
python ./02_bakery_dataframes.py
An example of the output of the Spark job is shown below. At the time
of this post, the latest jupyter/all-spark-notebook Docker Image
runs Spark 2.4.3, Scala 2.11.12, and Java 1.8.0_191 using the
OpenJDK.
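If you want to confirm these versions yourself from within the container, a quick, hedged check is shown below; it simply creates (or reuses) a SparkSession and queries the runtime:

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('version_check').getOrCreate()

# Spark version and master, as reported by the running session
print('Spark version: %s' % spark.version)
print('Spark master: %s' % spark.sparkContext.master)

# Java version, as reported by the JVM on the container's path
# (java -version writes to stderr, so redirect it to stdout)
print(subprocess.check_output(['java', '-version'],
                              stderr=subprocess.STDOUT).decode())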
More typically, you would submit the Spark job using the spark-submit command:
$SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py
Below, we see the beginning of the output from Spark, using the
spark-submit command.
Below, we see the scheduled tasks executing and the output of the show() call, displaying the first 10 rows of bakery data.
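Beyond displaying rows, the same DataFrame lends itself to quick analytics. Below is a minimal, hypothetical sketch (assuming the df3 DataFrame from the script above, and that the dataset's placeholder 'NONE' items should be excluded) that lists the most frequently purchased bakery items:

# Assumes df3 was loaded from BreadBasket_DMS.csv as in the script above
from pyspark.sql.functions import col, count

top_items = df3 \
    .filter(col('item') != 'NONE') \
    .groupBy('item') \
    .agg(count('transaction').alias('count')) \
    .orderBy(col('count').desc())

# Display the ten most frequently purchased items
top_items.show(10)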
The next script uses the psycopg2 library to execute a SQL file against the PostgreSQL database. First, install the package into the running Jupyter container:
# using pip
docker exec -it \
$(docker ps | grep pyspark_pyspark | awk '{print $NF}') \
pip install psycopg2-binary
#!/usr/bin/python

import psycopg2

# source: https://stackoverflow.com/questions/45805871/python3-psycopg2-execute-sql-file

connect_str = 'host=postgres port=5432 dbname=demo user=postgres password=postgres1234'
conn = psycopg2.connect(connect_str)
conn.autocommit = True
cursor = conn.cursor()

sql_file = open('bakery_sample.sql', 'r')
sqlFile = sql_file.read()
sql_file.close()
sqlCommands = sqlFile.split(';')

for command in sqlCommands:
    print(command)
    if command.strip() != '':
        cursor.execute(command)
python ./03_load_sql.py
PostgreSQL Driver
The only notebook document dependency not natively part of the Jupyter image is the PostgreSQL JDBC driver. The driver, postgresql-42.2.5.jar, is included in the project and referenced in the Spark session's spark.driver.extraClassPath configuration. This ensures the JAR is available to Spark (written in Scala) when the job is run.
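As a hedged sketch of how the driver might be used from PySpark (the connection details come from the stack file and scripts above, while the bakery table name is an illustrative assumption), the bakery DataFrame could be written to, and read back from, the demo PostgreSQL database over JDBC:

# Assumes the SparkSession was created with spark.driver.extraClassPath
# pointing at postgresql-42.2.5.jar, and df3 holds the bakery data,
# as in the earlier script.
connection_properties = {
    'user': 'postgres',
    'password': 'postgres1234',
    'driver': 'org.postgresql.Driver'
}
jdbc_url = 'jdbc:postgresql://postgres:5432/demo'

# Write the DataFrame to a (hypothetical) bakery table
df3.write.jdbc(url=jdbc_url, table='bakery',
               mode='overwrite', properties=connection_properties)

# Read it back to confirm the round trip
df_back = spark.read.jdbc(url=jdbc_url, table='bakery',
                          properties=connection_properties)
df_back.show(5)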
PyCharm
Since the working directory for the project is shared with the
container, you can also edit files, including notebook documents, in
your favorite IDE, such as JetBrains PyCharm. PyCharm has built-in
language support for Jupyter Notebooks, as shown below.
Plotly, which we use in the notebook to visualize the bakery data, also provides the Chart Studio Online Chart Maker. Plotly describes Chart Studio as the world's most sophisticated editor for creating D3.js and WebGL charts. Shown below, we have the ability to enhance, stylize, and share our bakery data visualization using the free version of Chart Studio Cloud.
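For context, a minimal, hypothetical sketch of such a visualization is shown below; it assumes the per-item counts have already been collected from Spark into a pandas DataFrame named top_items_pd (for example, via top_items.toPandas()), and that a recent version of Plotly is installed:

import plotly.graph_objects as go

# top_items_pd is assumed to be a pandas DataFrame
# with 'item' and 'count' columns
fig = go.Figure(data=[
    go.Bar(x=top_items_pd['item'], y=top_items_pd['count'])
])
fig.update_layout(title='Bakery Items by Number of Transactions',
                  xaxis_title='Item',
                  yaxis_title='Transactions')
fig.show()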
nbviewer
Notebooks can also be viewed using Jupyter nbviewer, a free, publicly hosted service that renders notebook documents directly from GitHub repositories and other public URLs.
While a Spark job is running, we can monitor it in the Spark application UI, which the stack exposes on localhost port 4040. There, we can review the timing of each event that occurs as part of the stages of the Spark job.
We can also use the Spark interface to review and confirm the
runtime environment, including versions of Java, Scala, and Spark, as
well as packages available on the Java classpath.
Spark Performance
Spark, running on a single node within the Jupyter container on your development system, is not a substitute for a full Spark cluster running on bare metal or robust virtualized hardware, with YARN, Mesos, or Kubernetes. In my opinion, you should adjust the resources allocated to Docker, such as CPUs and memory, to achieve acceptable performance for the stack when running modest workloads. You can observe the containers' resource consumption with the docker stats command:
docker stats \
  --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
Below, we see the stats from the stack’s three containers immediately
after being deployed, showing little or no activity. Here, Docker has
been allocated 2 CPUs, 3 GB of RAM, and 2 GB of swap space from the host machine.
Compare the stats above with the same three containers, while the
example notebook document is running on Spark. The CPU shows a
spike, but memory usage appears to be within acceptable ranges.
With htop , we can observe individual CPU activity. The two CPUs at
the top left of the htop window are the two CPUs assigned to Docker.
We get insight into the way Docker is using each CPU, as well as other
basic performance metrics, like memory and swap.
Conclusion
In this brief post, we have seen how easy it is to get started learning
and developing applications for big data analytics, using Python,
Spark, and PySpark, thanks to the Jupyter Docker Stacks. We could
use the same stack to learn and develop for machine learning, using
Python, Scala, and R. Extending the stack's capabilities is as simple as swapping out this Jupyter image for another with a different set of tools, or adding additional containers to the stack, such as Apache Kafka or Apache Cassandra.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.