Scala and Spark Training: Objective


SCALA AND SPARK TRAINING

Objective:
By the end of this training, participants will be able to:
• Demonstrate a thorough knowledge of Scala
• Understand Spark
• Write data processing programs using Scala and Spark

Duration: 5-6 days (40 to 48 hours)


• SCALA – 2-3 days
• SPARK – 3 days

Prerequisite:
• Java programming (good to know)
• Familiarity with the Hadoop ecosystem: Hive, HDFS, etc. (good to know)
• Familiarity with Unix systems: basic directory navigation and file creation commands,
environment variables
• A system with sufficient RAM: a minimum of 8 GB
• A 64-bit machine with VT-x enabled, to allow running the virtual machines

Hands-On:
• This is a hands-on course; each section has a relevant lab
• We will use both
o a base Ubuntu VM
o Cloudera's QuickStart VM

SCALA – 3 days
Introduction to Scala
1. Why Scala?
2. What makes Scala tick
3. Scala interpreter
4. Variables
5. Functions
6. Control statements (if, else, for, foreach)
7. Basics of lists, tuples, sets, maps, arrays
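The basics in this module can be sketched in a few lines, runnable directly in the Scala interpreter:

```scala
// Variables: val is immutable, var is mutable.
val greeting = "hello"
var counter = 0

// A function with an explicit return type.
def square(n: Int): Int = n * n

// Control statements: if/else is an expression; for iterates; foreach takes a function.
val parity = if (counter % 2 == 0) "even" else "odd"
for (i <- 1 to 3) counter += i
List(1, 2, 3).foreach(n => counter += n)

// The basic collections.
val xs = List(1, 2, 3)
val tup = ("spark", 2)          // a tuple
val s = Set(1, 2, 2)            // duplicates collapse: two elements
val m = Map("a" -> 1)
val arr = Array(1, 2, 3)
```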

Introduction to Classes and Objects
1. Classes, fields, methods
2. Singleton objects
3. A Scala application
4. Application trait
5. Case classes
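A minimal sketch of classes with fields and methods, a singleton object, and a case class (the App trait is covered in the labs):

```scala
// A class with a private field and a method.
class Counter {
  private var count = 0
  def increment(): Int = { count += 1; count }
}

// A singleton object: exactly one instance, created on first use.
object Registry {
  private var names = List.empty[String]
  def register(n: String): Unit = names = n :: names
  def size: Int = names.length
}

// A case class gets equality, toString and copy for free.
case class Point(x: Int, y: Int)
val p = Point(1, 2)
val q = p.copy(y = 5)
```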

Functional programming in Scala
1. Treating functions as first class citizens
2. Closures and higher order functions
3. Tail recursion
4. Function literals
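The four ideas in this module fit in one short sketch:

```scala
// A function literal stored in a value: functions are first-class citizens.
val double: Int => Int = _ * 2

// A higher-order function takes a function as a parameter.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// A closure captures a variable from its surrounding scope.
var factor = 3
val scale = (x: Int) => x * factor

// Tail recursion: the annotation asks the compiler to verify
// the recursive call is in tail position (so it compiles to a loop).
@annotation.tailrec
def sumTo(n: Int, acc: Int = 0): Int =
  if (n == 0) acc else sumTo(n - 1, acc + n)
```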

Packages and Imports
1. Putting code in packages
2. Imports
3. Access modifiers
4. Package objects
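Package declarations live in compiled source files rather than interpreter sessions, so this sketch focuses on the import forms: wildcard, selective, and rename.

```scala
// Wildcard import of a package, then a selective import with a rename:
// mutable.Map is brought in under the alias MutableMap.
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, Map => MutableMap}

val buf = ArrayBuffer(1, 2)
buf += 3                        // mutable: append in place

val mm = MutableMap("a" -> 1)
mm("b") = 2                     // mutable: update in place
```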

Collections in details
1. Lists
2. Sequences
3. Sets
4. Maps
5. Tuples
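A quick tour of the collection operations the labs exercise:

```scala
val xs = List(10, 20, 30)
val doubled = xs.map(_ * 2)               // transform each element
val big = xs.filter(_ > 15)               // keep matching elements
val total = xs.foldLeft(0)(_ + _)         // aggregate to a single value

val seq: Seq[Int] = Vector(1, 2, 3)       // Seq is the general interface
val union = Set(1, 2) ++ Set(2, 3)        // sets ignore duplicates
val ages = Map("ann" -> 30, "bob" -> 25)
val older = ages.filter { case (_, age) => age > 26 }
val zipped = xs.zip(List("a", "b", "c"))  // a list of tuples
```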

Combining Scala and Java
1. Using Scala from Java
2. Annotations
3. Existential types
4. Compiling Scala and Java together
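A small interop sketch: Scala calls Java APIs directly, and converters bridge the two collection libraries. The converter import shown is for Scala 2.13+; older releases use scala.collection.JavaConverters.

```scala
// Java classes are used directly from Scala.
val jlist = new java.util.ArrayList[String]()
jlist.add("scala")
jlist.add("java")

// Converters turn a Java collection into a Scala one (and back with asJava).
import scala.jdk.CollectionConverters._
val slist = jlist.asScala.toList
```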

SBT (Scala build tool)
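A minimal build.sbt sketch of the kind built in the labs; the project name and version numbers are illustrative only:

```scala
// build.sbt -- a minimal sketch; versions are illustrative.
name := "spark-training"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided"
```

The "provided" scope keeps Spark out of the packaged jar, since the cluster supplies it at runtime.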
SPARK – 3 days
Introduction to Spark
1. What is Spark?
2. The Spark stack
3. Where does Spark fit in the Hadoop stack?
4. Use cases for Spark

Spark setup
1. Downloading Spark
2. Starting the Spark shell
a. Python
b. Scala
3. SparkContext and SparkSession
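A minimal sketch of creating a SparkSession (Spark 2.x+), which wraps the older SparkContext; this assumes Spark is on the classpath and runs in local mode:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("setup-demo")
  .master("local[*]")     // local mode; on a cluster this comes from spark-submit
  .getOrCreate()

// The underlying SparkContext is still available for RDD work.
val sc = spark.sparkContext
```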

Resilient Distributed Datasets (RDDs)
1. RDD Basics
2. Creating RDDs
3. Transformations
4. Actions
5. Lazy evaluation of RDDs
6. Persistence
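The RDD lifecycle above in one sketch, assuming Spark on the classpath in local mode:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("rdd-demo"))

val nums = sc.parallelize(1 to 10)         // creating an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)        // transformation: lazy, nothing runs yet
val doubled = evens.map(_ * 2)             // another lazy transformation
doubled.persist(StorageLevel.MEMORY_ONLY)  // keep results cached across actions
val total = doubled.reduce(_ + _)          // action: this triggers the computation
```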

Pair RDDs
1. Creating Pair RDDs
2. Transformations on Pair RDDs
3. Data Partitioning
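A pair-RDD sketch in the word-count style used throughout the labs (local mode assumed):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pair-demo"))

val words = sc.parallelize(Seq("spark", "scala", "spark"))
val pairs = words.map(w => (w, 1))          // a pair RDD: (key, value)
val counts = pairs.reduceByKey(_ + _)       // per-key aggregation
val wordCounts = counts.collectAsMap()

// Explicit data partitioning: hash keys into 4 partitions.
val partitioned = counts.partitionBy(new HashPartitioner(4))
val numParts = partitioned.partitions.length
```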

Loading and Saving Data
1. Working with Local File System
2. Working with HDFS
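The same textFile/saveAsTextFile API covers both cases; only the URI scheme changes (file:// for the local file system, hdfs://namenode:port/path for HDFS). A local round-trip sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("io-demo"))

// saveAsTextFile writes to a directory that must not already exist,
// so we put the output under a fresh temp directory.
val out = java.nio.file.Files.createTempDirectory("spark-io").toString + "/lines"
sc.parallelize(Seq("line one", "line two")).saveAsTextFile(out)

// Reading back: textFile accepts the same path (an HDFS URI would work identically).
val back = sc.textFile(out)
val lineCount = back.count()
```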

Running Spark on a cluster
1. Spark runtime architecture
a. Driver
b. Executor
c. Cluster Manager
2. Deploying applications using spark-submit
3. Packaging code and dependencies
4. Building a spark application through sbt
5. Intro to cluster managers
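A typical spark-submit invocation looks like the following sketch; the class name, jar path and input path are illustrative only:

```shell
# Submit a packaged application to a YARN cluster; names are illustrative.
spark-submit \
  --class com.example.WordCount \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  target/scala-2.12/wordcount_2.12-0.1.jar hdfs:///input/data.txt
```

The driver and executors described above are launched by the cluster manager named in --master (yarn, a Spark standalone URL, or local[*] for testing).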

Spark SQL, DataFrames and Datasets
1. Using DataFrames
2. Using Spark SQL
3. Using Datasets
4. Loading and saving data
a. JSON
b. Parquet
c. Apache Hive
d. RDDs
5. User-defined functions (UDFs)
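A DataFrame, SQL and UDF sketch; the data comes from a local Seq here, where the labs would load JSON, Parquet or Hive tables (typed Datasets via .as[T] follow the same pattern):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()
import spark.implicits._

// A DataFrame from a local collection.
val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")

// Spark SQL over a temporary view.
df.createOrReplaceTempView("people")
val adultNames = spark.sql("SELECT name FROM people WHERE age > 30")
  .collect().map(_.getString(0)).toList

// A user-defined function applied as a new column.
val shout = udf((s: String) => s.toUpperCase)
val shouted = df.withColumn("NAME", shout($"name"))
  .select("NAME").collect().map(_.getString(0)).toSet
```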

Spark Streaming
1. Understanding DStreams
2. Architecture
3. Transformations
4. Performance Considerations
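A DStream is a sequence of RDDs, one per batch interval. A local sketch using queueStream, which feeds pre-built RDDs into the stream and is handy for experiments without a live source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val sparkConf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// One queued RDD becomes one micro-batch.
val queue = mutable.Queue(ssc.sparkContext.makeRDD(Seq("to be or not to be")))
val words = ssc.queueStream(queue).flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Collect results on the driver so we can inspect them after stopping.
val results = mutable.Map.empty[String, Int]
counts.foreachRDD(rdd => rdd.collect().foreach { case (w, n) => results(w) = n })

ssc.start()
ssc.awaitTerminationOrTimeout(5000)  // let a few 1-second batches run
ssc.stop()
```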

Machine Learning Overview
1. Machine learning basics
2. Data Types
3. Algorithms
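The MLlib data types start with vectors, which feed the algorithms covered here; a minimal sketch (requires the spark-mllib module on the classpath):

```scala
import org.apache.spark.ml.linalg.Vectors

// A dense vector stores every entry; a sparse one stores (index, value) pairs.
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
```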

Tuning and Debugging Spark
1. Jobs
2. Tasks
3. Stages
4. Finding the right information
5. Parallelism
6. Serialization format
7. Memory Management
8. Hardware provisioning
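Several of these knobs are set through SparkConf; a sketch with illustrative values (the right numbers depend on the workload and hardware):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("tuned-app")
  // Serialization format: Kryo is typically faster and more compact than Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Parallelism: default number of partitions for shuffles on RDDs.
  .set("spark.default.parallelism", "16")
  // Memory management: fraction of heap used for execution and storage.
  .set("spark.memory.fraction", "0.6")
```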
