How to Get Started with Real-Time Traffic Speed Analysis with Apache Spark and Kafka

Introduction

Monitoring road traffic at scale poses a significant big data challenge, necessitating a robust big data platform. This article outlines the process of conducting real-time analysis of road traffic flows using a data lake and Apache Spark.

This blog post will explore how Apache Spark can be leveraged to perform real-time traffic speed analysis, providing valuable insights for traffic management and decision-making.

Problem Statement

Develop a real-time traffic speed analysis system that can ingest and analyze vast amounts of data with speed and ease. The system should be scalable and offer practical information about traffic conditions.

Architecture

Let's dive into how a real-time traffic analysis system works.

A real-time traffic analysis system typically has four main components:

  1. Traffic event-generating IoT devices: These devices collect traffic data, such as the number of vehicles per minute or the average speed of vehicles.
  2. Data collection queues using Apache Kafka: Apache Kafka is a distributed streaming platform that can handle large volumes of data in real time. It is a good choice for collecting traffic data because it is scalable, reliable, and fault-tolerant.
  3. Apache Spark used as a processing engine: Apache Spark is a distributed processing framework that can quickly and efficiently analyze large datasets.
  4. Apache Iceberg as the data warehouse: Apache Iceberg is an open table format that provides scalable, reliable storage for large analytic datasets.
Traffic Data Monitoring Application Architecture Diagram

The system works as follows:

  1. Traffic event-generating IoT devices collect traffic data.
  2. The data is sent to Apache Kafka data collection queues.
  3. Apache Spark streaming jobs read the data from Apache Kafka and process it.
  4. The processed data is stored in Apache Iceberg.
  5. The data in Apache Iceberg can be used to generate real-time traffic analysis reports using any BI tool, such as Power BI.
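
As a taste of step 5, a report query over the result table might look like the sketch below. The SparkSession, catalog, and table name (local.db.traffic_speed) are assumptions; the table itself is built in the sections that follow.

# Sketch: latest per-direction average speeds from the Iceberg result table.
# Catalog, table, and column names are illustrative; see the sections below.
spark.sql("""
    SELECT window.start AS minute, direction, average_speed
    FROM local.db.traffic_speed
    ORDER BY minute DESC
    LIMIT 10
""").show()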

Event Generation

IoT devices generate an event each time they detect a vehicle. An event can carry anything from the vehicle's license plate number to its speed and direction of travel. The events are then sent to Kafka for downstream processing.

Here is an example of a JSON message that an IoT device might send to Kafka:

{
  "event_type": "vehicle_detected",
  "vehicle_id": "1234567890",
  "license_plate": "ABCD",
  "speed": 60,
  "direction": "north"
}
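
For local experimentation, a minimal producer sketch like the following can simulate such events. It assumes the kafka-python package, a broker on localhost:9092, and a topic named traffic_events.

import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

# Producer that serializes each event as JSON; the broker address is illustrative
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit one simulated vehicle detection per second
while True:
    event = {
        "event_type": "vehicle_detected",
        "vehicle_id": str(random.randint(1, 999999)),
        "license_plate": "ABCD",
        "speed": random.randint(20, 120),
        "direction": random.choice(["north", "south", "east", "west"]),
    }
    producer.send("traffic_events", event)
    time.sleep(1)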

Once the event is in Kafka, it can be processed by the Apache Spark processing engine.


Processing

To process the events and calculate the average speed of vehicles every minute, we can use PySpark Structured Streaming:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for the IoT events
schema = StructType([
    StructField("event_type", StringType(), True),
    StructField("vehicle_id", StringType(), True),
    StructField("license_plate", StringType(), True),
    StructField("speed", IntegerType(), True),
    StructField("direction", StringType(), True)
])

# Create a SparkSession
spark = SparkSession.builder.appName("traffic_events").getOrCreate()

# Read the IoT events from Kafka and parse them with the defined schema.
# The Kafka message timestamp serves as the event time for windowing.
iot_df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "traffic_events") \
    .option("startingOffsets", "latest") \
    .load() \
    .select(
        from_json(col("value").cast("string"), schema).alias("data"),
        col("timestamp").alias("event_time")
    ) \
    .select("data.*", "event_time")

# Average speed per direction over one-minute tumbling windows.
# The watermark lets Spark finalize each window so it can be appended downstream.
average_speed_df = iot_df \
    .filter("event_type = 'vehicle_detected'") \
    .withWatermark("event_time", "1 minute") \
    .groupBy(window("event_time", "1 minute"), "direction") \
    .agg(avg("speed").alias("average_speed"))

# Write the windowed aggregates to an Iceberg table
average_speed_df \
    .writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .option("path", "/path/to/result/table") \
    .start()

spark.streams.awaitAnyTermination()

This Spark streaming code runs continuously, listening for events arriving in Kafka. It reads the events, calculates the average speed of vehicles within one-minute intervals, and stores the results in Iceberg tables for further analysis.

  1. Schema Definition: A schema is defined using the StructType class. It describes the structure and data types of the fields in the streaming data: each field has a name, a data type, and a flag indicating whether it is nullable. The schema ensures data consistency and allows proper interpretation of the incoming events.
  2. Reading IoT Events from Kafka: Using the specified Kafka bootstrap servers and topic, the code reads IoT events as a stream and parses them with the defined schema. The result is a streaming DataFrame, enabling real-time processing and analysis of the incoming data.
  3. Data Processing and Average Speed Calculation: The code filters the stream to events with an event_type of "vehicle_detected", then groups them by a one-minute window and the direction of travel. Within each group, the average speed is computed with the avg aggregation function.

A window in Spark is a logical grouping of rows that are processed together. Windows can be used to perform aggregations, calculations, and other operations on groups of rows.

There are two main types of windows in Spark: tumbling windows and sliding windows.

Tumbling windows are fixed-size, non-overlapping windows: every row that arrives within a given time period belongs to exactly one window.

Sliding windows, on the other hand, overlap: a new window starts at every slide interval, so a single row can belong to more than one window.
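
As a minimal sketch (assuming a streaming DataFrame named events with an event_time timestamp column and a speed column), the optional third argument of Spark's window function sets the slide interval:

from pyspark.sql.functions import avg, window

# Tumbling: one-minute windows that never overlap
# (the slide interval defaults to the window length)
tumbling = events.groupBy(window("event_time", "1 minute")) \
    .agg(avg("speed").alias("average_speed"))

# Sliding: one-minute windows starting every 30 seconds,
# so each event can fall into two overlapping windows
sliding = events.groupBy(window("event_time", "1 minute", "30 seconds")) \
    .agg(avg("speed").alias("average_speed"))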


Persisting

Writing Results to Iceberg Tables: The computed average-speed DataFrame is written to Iceberg tables. The code specifies the checkpoint location and the path of the result table. Because the results are written in a streaming fashion, the tables are continuously updated with the latest average-speed information.

# PySpark equivalent of the write, with an explicit one-minute trigger;
# tableIdentifier and checkpointPath are placeholders for your environment
average_speed_df.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .trigger(processingTime="1 minute") \
    .option("path", tableIdentifier) \
    .option("checkpointLocation", checkpointPath) \
    .start()

The table identifier can be:

  • The fully-qualified path to an HDFS table, like hdfs://nn:8020/path/to/table
  • A table name, if the table is tracked by a catalog, like database.table_name

Note that Iceberg supports the append and complete output modes.
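
As a sketch of the catalog-based variant: the catalog and table names (local.db.traffic_speed) are assumptions about your Spark configuration, and the table is created up front because streaming writes expect the target table to already exist.

# Create the target Iceberg table first; the "local" catalog name is an assumption
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.traffic_speed (
        window STRUCT<start: TIMESTAMP, end: TIMESTAMP>,
        direction STRING,
        average_speed DOUBLE
    ) USING iceberg
""")

# Stream into the catalog-tracked table by name instead of an HDFS path
average_speed_df.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("path", "local.db.traffic_speed") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()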

Conclusion

The real-time traffic speed analysis system using Apache Spark and Kafka offers a powerful and scalable solution for processing and analyzing large volumes of streaming data. By leveraging the capabilities of Spark's distributed processing and Kafka's streaming platform, it becomes possible to perform real-time data analysis and gain actionable insights for traffic management. The integration with Iceberg tables provides reliable storage for the analyzed results, ensuring data consistency and scalability.

