Traffic-Net: 3D Traffic Monitoring Using A Single Camera
Mahdi Rezaei 1,★, Mohsen Azarmi 2,★, Farzam Mohammad Pour Mir 3,★
1 Institute for Transport Studies, The University of Leeds, Leeds, LS2 9JT, UK
2 Department of Computer Engineering, Qazvin University, Qazvin, IR
3 Faculty of Computer Engineering, Tehran Azad University, Science & Research Branch, IR
1 [email protected] 2 [email protected] 3 [email protected]
ABSTRACT
Computer Vision has played a major role in Intelligent Transportation Systems (ITS) and traffic surveillance. Along with the rapid growth of
automated vehicles and increasingly crowded cities, automated and advanced traffic management systems (ATMS) based on video surveillance
infrastructures have evolved through the implementation of Deep Neural Networks. In this research, we provide a practical platform for
real-time traffic monitoring, including 3D vehicle/pedestrian detection, speed detection, trajectory estimation, congestion detection, as well as
monitoring the interaction of vehicles and pedestrians, all using a single CCTV traffic camera. We adapt a custom YOLOv5 deep neural
network model for vehicle/pedestrian detection and an enhanced SORT tracking algorithm. For the first time, a hybrid satellite-ground based
inverse perspective mapping (SG-IPM) method for camera auto-calibration is also developed which leads to an accurate 3D object detection
and visualisation. We also develop a hierarchical traffic modelling solution based on short- and long-term temporal video data streams
to understand the traffic flow, bottlenecks, and risky spots for vulnerable road users. Several experiments on real-world scenarios and
comparisons with state-of-the-art are conducted using various traffic monitoring datasets, including MIO-TCD, UA-DETRAC and GRAM-RTM
collected from highways, intersections, and urban areas under different lighting and weather conditions.
Keywords – 3D Object Detection; Traffic Flow Monitoring; Intelligent Transportation Systems; Deep Neural Networks; Vehicle Detection;
Pedestrian Detection; Inverse Perspective Mapping Calibration; Digital Twins; Video Surveillance.
1 Introduction
Smart video surveillance systems are becoming a com-
In another approach, Cheon et al. [15] have presented a vehicle detection system using histogram of oriented gradients (HOG), considering the shadows of the vehicles to localise them. They have also used an auxiliary feature vector to enhance the vehicle classification and to find the areas with a high risk of accidents. However, the method leads to erroneous vehicle localisation during the night or daytime, when the vehicles' shadows are too long and do not represent the exact location and size of the vehicle.

Although most of the discussed methods perform well in simple and controlled environments, they fail to achieve accurate performance in complex and crowded scenarios. Furthermore, they are unable to perform multi-class classification and cannot distinguish between various categories of moving objects such as pedestrians, cars, buses, trucks, etc.

With the emergence of deep neural networks (DNNs), machine learning received more attention in the object detection domain. In modern object detection algorithms, Convolutional Neural Networks (CNNs) learn complex features during the training phase, aiming to elaborate and understand the contents of the image. This normally leads to improved detection accuracy compared to classical image processing methods [16]. Such object detectors are mostly divided into two categories: single-stage (dense prediction) and two-stage (sparse prediction) detectors. Two-stage object detectors, such as the RCNN family, consist of a region proposal stage and a classification stage [17]; while single-stage object detectors, such as the Single-Shot Multi-Box Detector (SSD) [18] and You Only Look Once (YOLO), treat detection as a regression problem and thus provide a single-unit localisation and classification architecture [17].

Arinaldi et al. [19] reached a better vehicle detection performance using Faster-RCNN compared to a combination of MOG and SVM models.

Peppa et al. [20] developed a statistical model, a random forest method, and an LSTM to predict the traffic volume for the upcoming 30 minutes, to compensate for the lack of accurate information in extreme weather conditions.

Some researchers, such as Bui et al. [21], utilised single-stage object detection algorithms, including the YOLOv3 model, for automated vehicle detection. They designed a multi-class distinguished-region tracking method to overcome the occlusion problem and lighting effects for traffic flow analysis.

In [22], Mandal et al. have proposed an anomaly detection system and compared the performance of different object detectors, including Faster-RCNN, Mask-RCNN and YOLO. Among the evaluated models, YOLOv4 gained the highest detection accuracy. However, they have presented a pixel-based (pixels per second) vehicle velocity estimation that is not very accurate.

On the other hand, the advancement of stereo vision sensors and 3D imaging has led to more accurate solutions for traffic monitoring as well as depth and speed estimation for road users. Consequently, this enables researchers to distinguish the scene background from the foreground objects, and to measure the objects' size, volume, and spatial dimensions [23].

LiDAR sensors and 3D point cloud data offer a new means for traffic monitoring. In [24], Zhang et al. have presented a real-time vehicle detector and tracking algorithm without bounding box estimation, by clustering the point cloud space. Moreover, they used an adjacent frame fusion technique to improve the detection of vehicles occluded by other vehicles on the road infrastructure.

Authors in [25] proposed a centroid-based tracking method and a refining module to track vehicles and improve the speed estimations. Song et al. [26] proposed a framework which uses binocular cameras to detect the road, pedestrians, and vehicles in traffic scenes.

In another multi-modal research, thermal sensor data is fused with the RGB camera sensor, leading to a noise-resistant technique for traffic monitoring [27].

Although many studies employ different sensors to perform 3D object detection, such as in [28], [29], the cost of applying such methods in large and crowded cities could be significant. Since many surveillance infrastructures are already installed in urban areas and more data are available for this purpose, 2D object detection on images has gained a lot of attention as the more practical route.

Many studies, including deep learning-based methods [30], [31], have tried to utilise multiple cameras and sensors to compensate for the missing depth information in monocular CCTV cameras, to estimate the position and speed of the objects, as well as to obtain a 3D bounding box representation from 2D perspective images [32].

Regardless of the object detection methodology, CCTV camera calibration is a key requirement of 2D or 3D traffic condition analysis prior to starting any object detection operation. A camera transforms the 3D world scene into a 2D perspective image based on the camera intrinsic and extrinsic parameters. Knowing these parameters is crucial for an accurate inverse perspective mapping, distance estimation, and vehicle speed estimation [33].

In many cases, especially when dealing with a large network of CCTV cameras in urban areas, these parameters can be unknown or differ from camera to camera due to different mounting setups and different types of cameras. Individual calibration of all CCTVs in metropolitan cities and urban areas with thousands of cameras is a very cumbersome and costly task. Some of the existing studies have proposed camera calibration techniques in order to estimate these parameters and hence an inverse perspective mapping.

Dubská et al. [34] extract vanishing points that are parallel and orthogonal to the road in a road-side surveillance camera image, using the moving trajectories of the detected cars and the Hough line transform algorithm. This can help to automatically calibrate the camera for traffic monitoring purposes, despite the low accuracy of the Hough transform algorithm in challenging lighting and noisy conditions.

Authors in [35] proposed a Faster-RCNN model to detect vehicles and consider car edgelets to extract vanishing points perpendicular to the road to improve the automatic calibration of the camera.
Song et al. [36], have utilised an SSD object detector
to detect cars and extract spatial features from the content
of bounding boxes using optical flow to track them. They
calculate two vanishing points using the moving trajectory of
vehicles in order to automatically calibrate the camera. Then,
they consider a fixed average length, width and height of cars
to draw 3D bounding boxes.
However, all of the aforementioned calibration methods assume that 1) the road has zero curvature, which is not the case in real-world scenarios, and 2) the vanishing points are derived from straight-line roads, i.e., the models do not work at intersections.
In a different approach, Kim et al. [37] consider 6 and 7 corresponding coordinates in a road-side camera image and an image with a perpendicular view of the same scene (such as a near-vertical satellite image) to automatically calibrate the camera. They introduced a revised version of the RANSAC model, called Noisy-RANSAC, to work efficiently with as few as 6 or 7 corresponding points produced by feature matching methods. However, the method is not evaluated on real-world and complex scenarios in which the road is congested and occluded by various types of road users.

Among the reviewed literature, most of the studies have not investigated the various categories of road users, such as pedestrians and different types of vehicles, that may exist in the scene. Moreover, there is limited research addressing full/partial occlusion challenges in the congested and noisy environment of urban areas. It is also notable that the performance of the latest object detection algorithms to date has not been evaluated by traffic monitoring related research.

Furthermore, very limited research has been conducted on short- and long-term spatio-temporal video analysis to automatically understand the interaction of vehicles and pedestrians and their effects on the traffic flow, congestion, hazards, or accidents.

In this article, we aim at providing an efficient and state-of-the-art traffic monitoring solution to tackle some of the above-mentioned research gaps and weaknesses in congested urban areas.

3 Methodology
We present our methodology in four hierarchical subsections. In Section 3.1, as the first contribution, a customised and highly accurate vehicle and pedestrian detection model is introduced. In Section 3.2, as the second contribution, we elaborate our multi-object and multi-class tracker (MOMCT). Next, in Section 3.3, a novel auto-calibration technique (named SG-IPM) is developed. Last but not least, in Section 3.4 we develop a hybrid methodology for road and traffic environment modelling which leads to 3D detection and representation of all vehicles and pedestrians in a road scene using a single CCTV camera.

Figure 2. The overall structure of the proposed methodology

Figure 2 summarises the overall flowchart of the methodology, starting with a 2D camera image and a satellite image as the main inputs, which ultimately lead to 3D road-user detection and tracking.

3.1 Object Detection and Localisation
According to the reviewed literature, the YOLO family has been proven to be faster and also very accurate compared to most of the state-of-the-art object detectors [38]. We hypothesise that recent versions of the YOLO family can provide a balanced trade-off between the speed and accuracy required by our traffic surveillance application. In this section we conduct a domain adaptation and transfer learning of YOLOv5.

The Microsoft COCO dataset [39] consists of 80 annotated categories of the most common indoor and outdoor objects in daily life. We use the pre-trained feature extraction weights of the YOLOv5 model on the COCO dataset as the initial weights to train our customised model.

Our adapted model is designed to detect 11 categories of traffic-related objects, which also match the MIO-TCD traffic monitoring dataset [40]. These categories consist of a pedestrian class and 10 types of vehicles, including articulated truck, bicycle, bus, car, motorcycle, motorised vehicle, non-motorised vehicle, pickup truck, single-unit truck and work van. Because of the different number of classes in the two datasets, the last layers of the model (the output layers) do not have the same shape to copy. Therefore, these layers are initialised with random weights (using 100100 seeds). After the initialisation process, the entire model is trained on the MIO-TCD dataset. We expect this would ultimately lead to a more accurate and customised model for our application.

As shown in Figure 3, the architecture of modern YOLO frameworks consists of a backbone, a neck, and a head. The backbone includes stacked convolutional layers turning the input into the feature space. In the backbone, the Cross-Stage Partial network (CSP) [41] introduces shortcut connections between layers to improve the speed and accuracy of the feature extractors.
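A minimal sketch of this transfer-learning setup is shown below, assuming the public Ultralytics YOLOv5 hub interface; the class names follow MIO-TCD, while the weight file, data configuration, and training call are illustrative placeholders rather than the exact configuration used in this work.

```python
# Hypothetical fine-tuning sketch: load COCO-pretrained YOLOv5 weights and
# rebuild the detection head for the 11 MIO-TCD classes.
import torch

# 'classes=11' rebuilds the output layers for 11 categories; matching pretrained
# weights are kept, while the mismatching head layers are randomly re-initialised.
model = torch.hub.load('ultralytics/yolov5', 'yolov5l', pretrained=True, classes=11)

MIO_TCD_CLASSES = [
    'articulated_truck', 'bicycle', 'bus', 'car', 'motorcycle',
    'motorized_vehicle', 'non-motorized_vehicle', 'pedestrian',
    'pickup_truck', 'single_unit_truck', 'work_van',
]
model.names = MIO_TCD_CLASSES  # attach the traffic-related class names

# The full fine-tuning (data loading, augmentation, losses) is normally run via
# the repository's training script, e.g.:
#   python train.py --weights yolov5l.pt --data mio_tcd.yaml --epochs 50
```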
Figure 3. Summarised structure of the dense (single-stage) and sparse (two-stage) object detection architecture, applied to a
road-side surveillance video.
Figure 4. The hierarchical structure of the proposed feature matching model.
Kalman filter of the SORT tracker. The category vector is the one-hot encoded representation of the detected class vector c, in which the highest class probability is shown by 1 and the rest of the probabilities by 0.

Exploiting the smoothing effect of the Kalman filter filters out the bouncing of detected categories through the sequence of frames. It also enables SORT to calculate the IoU between the bounding boxes of different categories. This yields a multi-object and multi-category ID assignment.

The state matrix of the new Kalman filter is defined as follows:

$\grave{\mathbf{x}} = [\,\hat{x}\;\;\hat{y}\;\;s_b\;\;r_b\;\;\dot{x}\;\;\dot{y}\;\;\dot{s}\mid\acute{\mathbf{c}}\,]^{T}$   (3)

where $s_b = w_b \times h_b$ denotes the scale (area), $r_b$ is the aspect ratio, and $\dot{x}$, $\dot{y}$ and $\dot{s}$ are the velocities of $\hat{x}$, $\hat{y}$ and $s_b$, respectively. Similarly, we represent the observation matrix of the revised Kalman filter as follows:

$\grave{\mathbf{z}} = [\,\hat{x}\;\;\hat{y}\;\;s_b\;\;r_b\mid\acute{\mathbf{c}}\,]^{T}$   (4)

In order to determine the trajectories of objects, we introduce two sets $V$ and $P$ as the tracker-IDs of detected vehicles and pedestrians, respectively. The trajectory set of each vehicle ($v_i$) and pedestrian ($p_i$) can be calculated based on temporal image frames as follows:

$M_{v_i} = \{(\hat{x}^{t}_{v_i}, \hat{y}^{t}_{v_i}) : \forall t \in T_{v_i}\}, \qquad M_{p_i} = \{(\hat{x}^{t}_{p_i}, \hat{y}^{t}_{p_i}) : \forall t \in T_{p_i}\}$   (5)

where $T_{v_i}$ and $T_{p_i}$ are the sets of frame-IDs of the vehicle $v_i$ and pedestrian $p_i$, and $(\hat{x}^{t}, \hat{y}^{t})$ is the location of the object $v_i$ or $p_i$ at frame $t$. Finally, the moving trajectories of all tracked objects are defined as the following sets:

$M_V = \{M_{v_i} : \forall v_i \in V\}, \qquad M_P = \{M_{p_i} : \forall p_i \in P\}$   (6)

knowing the camera intrinsic parameters, height and angle of the camera.

We exploit a top-view satellite image from the same location as the CCTV camera and develop a hybrid satellite-ground based inverse perspective mapping (SG-IPM) to automatically calibrate the surveillance cameras. This is an end-to-end technique to estimate the planar transformation matrix G as per Equation 35 in Appendix A. The matrix G is used to transform the camera perspective image to a bird's eye view image.

Let us assume $(x, y)$ is a pixel in a digital image container $I : U \to [0, 255]^3$, where $U = [0;\, w-1] \times [0;\, h-1]$ represents the range of pixel locations in a 3-channel image, and $w$, $h$ are the width and height of the image.

Using ( ˆ ) to denote the perspective space (i.e. camera view) and ( ˇ ) for the inverse perspective space, we represent the surveillance camera image as Î, the satellite image as Ì, and the bird's eye view image as Ǐ, which is calculated by the linear transformation G : Î → Ǐ.

Since the coordinates of the bird's eye view image approximately match the satellite image coordinates (i.e. Ì ≈ Ǐ), the transformation function $(\check{x}, \check{y}) = \Lambda((\hat{x}, \hat{y}), G)$ (as defined in Appendix A) transforms the pixel locations of Î to Ì. Similarly, $G^{-1}$ inverts the mapping process; in other words, $(\hat{x}, \hat{y}) = \Lambda((\check{x}, \check{y}), G^{-1})$ transforms the pixel locations from Ǐ to Î.

In order to solve the linear Equation 35, at least four pairs of corresponding points in Î and Ì are required. Therefore, we need to extract and match similar feature pairs from both images. These feature points should be robust and invariant to rotation, translation, scale, tilt, and also partial occlusion in the case of high affine variations.

Figure 4 represents the general flowchart of our SG-IPM technique, which is fully explained in the following subsections, including feature enhancement and feature matching.
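For orientation, a minimal OpenCV sketch of estimating and applying such a planar transformation G from matched point pairs is shown below; the point coordinates are hypothetical placeholders, and the full SG-IPM pipeline for obtaining reliable correspondences follows in the next subsections.

```python
# Minimal sketch of the planar mapping G (assumed to be a 3x3 homography).
# 'pts_cam' and 'pts_sat' are >= 4 matched point pairs between the camera image Î
# and the satellite image Ì; the values below are hypothetical.
import cv2
import numpy as np

pts_cam = np.float32([[512, 610], [1210, 598], [880, 402], [240, 415]])
pts_sat = np.float32([[130, 540], [420, 545], [400, 180], [110, 175]])

# G : Î -> Ǐ (bird's eye view, approximately aligned with the satellite image)
G, inliers = cv2.findHomography(pts_cam, pts_sat, cv2.RANSAC, ransacReprojThreshold=3.0)

def Lambda(points_xy, H):
    """Point transform (x̌, y̌) = Λ((x̂, ŷ), G); pass H = G for camera -> BEV,
    or H = np.linalg.inv(G) for BEV -> camera."""
    pts = np.float32(points_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

# Warping a whole camera frame to the bird's eye view (placeholder I/O):
# frame = cv2.imread('camera_frame.png')
# bev = cv2.warpPerspective(frame, G, (sat_w, sat_h))  # (sat_w, sat_h): satellite image size
```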
3.3.1 Feature Enhancement
Radial Distortion: Some of the road-side cameras have non-linear radial distortion due to their wide-angle lenses, which affects the accuracy of the calibration process and of the monitoring system in estimating the location of the objects. Such noise also reduces the resemblance between the Î and Ì images, especially when we want to find similar feature points.

Examples of the barrel-shaped radial noise are shown in Figure 5, left column. Similar to the study by Dubská et al. [34], we assume the vehicles traverse between the lanes in a straight line. We use the vehicles' trajectory sets to remove the radial distortion noise. For each vehicle $v_i$, a polynomial radial distortion model is applied to the location coordinates $(\hat{x}_{v_i}, \hat{y}_{v_i})$ of the vehicle's trajectory set $M_{v_i}$ as follows:

$(\bar{x}, \bar{y}) = \big((\hat{x}_{v_i} - x_s)(1 + k_1 r^2 + k_2 r^4 + \dots),\;\; (\hat{y}_{v_i} - y_s)(1 + k_1 r^2 + k_2 r^4 + \dots)\big)$   (7)

$r = \sqrt{(\hat{x}_{v_i} - x_s)^2 + (\hat{y}_{v_i} - y_s)^2}$   (8)

where $(\bar{x}, \bar{y})$ is the corrected location of the vehicle, $(x_s, y_s)$ denotes the centre of the radial noise, $K = \{k_1, k_2, \dots\}$ are the unknown scalar parameters of the model which need to be estimated, and $r$ is the radius of the distortion with respect to the centre of the image.

A rough estimation of $k_1$ and $k_2$ is sufficient to remove the major effects of such noise. To this end, each point of the moving trajectories is passed through Equation 7, yielding the transformed trajectory set $\bar{M}_{v_i}$. Then, the optimal values of $k_1$ and $k_2$ are obtained by minimising the sum of squared errors between the best-fitting line $\ell$ and the points of $\bar{M}_{v_i}$ as follows:

$K = \arg\min_{k} \sum_{v_i \in V}\; \sum_{\bar{l}_j \in \bar{M}_{v_i}} (\ell \cdot \bar{l}_j)^2$   (9)

where $\bar{l}_j$ is the corrected pixel location of the vehicle $v_i$ belonging to the transformed moving trajectory set $\bar{M}_{v_i}$. Finally, the optimal parameters are estimated using a $(1+\lambda)$-ES evolutionary algorithm with $\lambda = 8$, as discussed in [34].

Figure 5. Eliminating the radial distortion in MIO-TCD dataset samples [40]. Left column: original images. Right column: rectified images after barrel distortion removal.

Background Extraction: Since the Î and Ì images are captured using two different cameras (a ground camera vs. an aerial satellite camera) and on different dates, times, or weather conditions, the images may appear inconsistent and different. This is mostly due to the presence of different foreground objects and road users in each image, which makes it hard to find analogous features to match.

To cope with that challenge, we extract the background of image Î by eliminating the moving objects. We apply an accumulative weighted sum over the intensity values for a period of $n_t$ frames to remove the effect of temporal pixel value changes as follows:

$\hat{B}^{t} = (1 - \alpha)\,\hat{B}^{t-1} + \alpha\,\hat{I}^{t}, \quad 1 \le t \le n_t$   (10)

where $\hat{B}$ is the accumulative variable, $\hat{B}^{0}$ is initially equal to the first input frame $I^{0}$, and $\alpha$ is the weighting coefficient that determines the importance of the next incoming frame. Our experiments show that $\alpha = 0.01$ and $n_t \approx 70$ frames are usually sufficient to remove the foreground objects in most urban and city roads with a moderate traffic flow. Figure 6 shows samples of the background extraction method applied to various roads and traffic scenarios.

Histogram Matching: Lighting condition variation is another barrier that makes it hard to find and match similar feature points between Î and Ì. We utilise a colour correlation-based histogram matching [50] which adjusts the hue and luminance of Î and Ì into the same range. The algorithm can be extended to find a monotonic mapping between two sets of histograms. The optimal monotonic colour mapping $E$ is calculated to minimise the distance between the two sets simultaneously:

$\bar{d} = \arg\min_{E} \sum_{i=1}^{n_p} d\big(E(\hat{I}^{g}_{i}),\, \grave{I}^{g}_{i}\big)$   (11)

where $\hat{I}^{g}$ and $\grave{I}^{g}$ are grey-level images, $n_p$ is the number of pixels in each image, $E$ is the histogram matching function, and $d(\cdot,\cdot)$ is a Euclidean distance metric between two histograms. Figures 7a and 7b show the results of the background extraction and histogram matching process, respectively.
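The background extraction step of Equation 10 maps directly onto OpenCV's running-average accumulation; a minimal sketch is given below, assuming an OpenCV-readable CCTV video (the file path is a placeholder, while alpha and n_t follow the values reported above).

```python
# Minimal sketch of the accumulative background extraction in Equation (10).
import cv2
import numpy as np

ALPHA, N_T = 0.01, 70
cap = cv2.VideoCapture('cctv_footage.mp4')   # placeholder path

ok, frame = cap.read()
background = frame.astype(np.float32)        # B̂^0 = I^0

for t in range(1, N_T):
    ok, frame = cap.read()
    if not ok:
        break
    # B̂^t = (1 - alpha) * B̂^{t-1} + alpha * Î^t
    cv2.accumulateWeighted(frame.astype(np.float32), background, ALPHA)

cv2.imwrite('background.png', cv2.convertScaleAbs(background))
cap.release()
```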
(a) Laidlaw Library CCD original video footage. (b) The generated Laidlaw Library background video.
(a) ASIFT feature matching results
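As a rough illustration of the feature matching between the extracted camera background image and the satellite image indicated by the caption above: the paper applies ASIFT [51] followed by RANSAC [53], while the sketch below uses a plain SIFT detector [52] with Lowe's ratio test as a simplified stand-in (all file paths are placeholders).

```python
# Illustrative SIFT + ratio-test matching between the camera background image
# and the satellite image; a simplified stand-in for the ASIFT-based matching.
import cv2

img_cam = cv2.imread('background.png', cv2.IMREAD_GRAYSCALE)   # placeholder paths
img_sat = cv2.imread('satellite.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img_cam, None)
kp2, des2 = sift.detectAndCompute(img_sat, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

# Corresponding point pairs, usable as pts_cam / pts_sat for cv2.findHomography
pts_cam = [kp1[m.queryIdx].pt for m in good]
pts_sat = [kp2[m.trainIdx].pt for m in good]
```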
We represent the Kalman transition matrix $\breve{A}$ as follows:

$\breve{A} = \begin{bmatrix} 1 & 0 & t_w & 0 & \frac{t_w^2}{2} & 0 \\ 0 & 1 & 0 & t_w & 0 & \frac{t_w^2}{2} \\ 0 & 0 & 1 & 0 & t_w & 0 \\ 0 & 0 & 0 & 1 & 0 & t_w \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$   (18)

where $t_w = \frac{1}{\mathrm{fps}}$ is the real-world time between the former and the current frame, depending on the camera frame rate (frames per second, fps). The observation matrix $\breve{\mathbf{z}}$ can be defined as follows:

$\breve{\mathbf{z}} = [\,\check{x}\;\;\check{y}\,]^{T}$   (19)

Using the Kalman-based smoothed location $(\check{x}, \check{y})$ and the frame-by-frame velocity of the objects $\dot{x}$, the speed of a vehicle is calculated (in mph) as follows:

$\vartheta_{v_i} = \dot{x}_{v_i} \times \iota$   (20)

where $\dot{x}_{v_i}$ is the "pixels per second" velocity of the vehicle $v_i$ and $\iota$ is the pixel-to-mile ratio. Samples of the estimated speeds (in mph) are shown in the top-left corner of the vehicle bounding boxes in Figure 1, bottom row.

In case of missing observations due to, e.g., partial occlusion, we predict the current location of vehicles using the process step of the Kalman filter ($\breve{A}\,\breve{\mathbf{x}}$) and buffer the predicted locations for up to an arbitrary number of frames.

3.4.2 Angle Estimation
The heading angle of a vehicle can be calculated as follows:

$\theta_{v_i} = \theta(\check{l}^{\,t}_{v_i}, \check{l}^{\,t-1}_{v_i}) = \tan^{-1}\!\left(\dfrac{\check{y}^{\,t}_{v_i} - \check{y}^{\,t-1}_{v_i}}{\check{x}^{\,t}_{v_i} - \check{x}^{\,t-1}_{v_i}}\right)$   (21)

The angle estimation is very sensitive to the displacement of vehicle locations, and even a small noise in localisation can lead to a significant change in the heading angle. However, in the real world the heading angle of a vehicle does not change significantly in a very short period of time (e.g. between two consecutive frames).

We introduce a simple yet efficient Angle Bounce Filtering (ABF) method to restrict sudden erroneous angle changes between the current and previous angle of the vehicle:

$\Delta\theta_{v_i} = \theta^{\,t}_{v_i} - \theta^{\,t-1}_{v_i}$   (22)

where $\Delta\theta_{v_i}$ is in the range $[-180^{\circ}, 180^{\circ}]$. In order to suppress high rates of change, we consider a cosine weight coefficient $w$ as follows:

$w = \dfrac{\cos(4\pi \tilde{\Delta}) + 1}{2}$   (23)

where $\tilde{\Delta}$ is the value of $\Delta\theta_{v_i}$ normalised to the range $[0, 1]$. The coefficient yields "0" when $\Delta\theta_{v_i}$ approaches $\pm 90^{\circ}$, to neutralise sudden angle changes of the vehicle. Similarly, the coefficient yields "1" when $\Delta\theta_{v_i}$ approaches $0^{\circ}$ or $\pm 180^{\circ}$, to maintain the natural forward and backward movement of the vehicle. Figure 10 illustrates the smoothed values of $w$ by a green colour spectrum; the darker the green, the lower the coefficient.

Figure 10. $\Delta\theta$ cosine suppression operation. The darker zones receive lower coefficients, which in turn suppress any large and sudden angular changes between two consecutive frames.

Finally, we rectify the vehicle angle as follows:

$\tilde{\theta}^{\,t}_{v_i} = \theta^{\,t-1}_{v_i} + (w \times \Delta\theta_{v_i})$   (24)

In some cases the moving trajectory may not be available; for instance, when a vehicle appears in the road scene for the first time, or when a vehicle is stationary (or parked) during its entire presence in the scene. In such cases the heading direction of the vehicle cannot be directly estimated, as no prior data is available about the vehicle's movement history. However, we can still calculate the angle of the vehicle by computing a perpendicular line from the vehicle position to the closest boundary of the road. Identifying the border of the road requires a further road segmentation operation.

Some of the existing deep-learning based studies, such as [54], mainly focus on segmenting satellite imagery that is captured from a very high altitude compared to the height of CCTV surveillance cameras. Moreover, there are no annotated data available for such heights to train a deep-learning based road segmentation model. In order to cope with that limitation, we adapt a Seeded Region Growing (SRG) method [55] on the intensity values of the image Ǐ.

We consider the moving trajectories of vehicles traversing the road in the Ǐ domain ($\Lambda(M_{v_i}, G)\;\; \forall v_i \in V$) as initial seeds for the SRG algorithm. In the first step, the algorithm calculates the intensity difference between each seed and its adjacent pixels. Next, the pixels with an intensity distance less than a threshold $\tau_{\alpha}$ are considered as regions connected to the seeds. Utilising these pixels as new seeds, the algorithm repeats the above steps until no more connected regions are found. At the end of the iterations, the connected regions represent the segment of the road.
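A minimal sketch of this seeded region growing step is given below, assuming a grey-level bird's-eye-view image and the trajectory points mapped through G as seeds; the threshold value is a placeholder, not the tuned value used in this work.

```python
# Minimal SRG sketch: 8-connected pixels within an intensity distance tau_alpha
# of the current region pixel are merged into the road mask.
from collections import deque
import numpy as np

def seeded_region_growing(gray_bev: np.ndarray, seeds, tau_alpha: float = 12.0) -> np.ndarray:
    h, w = gray_bev.shape
    road = np.zeros((h, w), dtype=bool)
    queue = deque()
    for x, y in seeds:                                  # seeds: Λ(M_vi, G) trajectory points
        x, y = int(round(x)), int(round(y))
        if 0 <= x < w and 0 <= y < h and not road[y, x]:
            road[y, x] = True
            queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h and not road[ny, nx]:
                    if abs(float(gray_bev[ny, nx]) - float(gray_bev[y, x])) < tau_alpha:
                        road[ny, nx] = True
                        queue.append((nx, ny))
    return road
```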
Figure 11. Road segmentation on the satellite image. The red lines represent the initial segmentation result extracted from the SRG method, and the green region is the final segmentation output after applying the morphological operations.

Figure 12. Reference angle estimation process with respect to the nearest road boundary. The boundaries are denoted with green lines, which are extracted by application of the Canny edge detector on the road segment borders.
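Putting the angle estimation pieces together, the sketch below implements Equations (21)-(24) in Python; the mapping of Δθ from [-180°, 180°] onto [0, 1] is an assumption about how the normalisation is performed.

```python
# Minimal sketch of heading angle estimation and Angle Bounce Filtering (ABF).
import math

def heading_angle(prev_xy, curr_xy) -> float:
    """Equation (21): heading angle (degrees) from two consecutive BEV locations."""
    dx, dy = curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1]
    return math.degrees(math.atan2(dy, dx))

def rectify_angle(theta_prev: float, theta_curr: float) -> float:
    """Equations (22)-(24): suppress sudden angle bounces between consecutive frames."""
    delta = theta_curr - theta_prev
    delta = (delta + 180.0) % 360.0 - 180.0              # Δθ wrapped to [-180, 180]
    delta_norm = (delta + 180.0) / 360.0                 # assumed normalisation Δ̃ in [0, 1]
    w = (math.cos(4.0 * math.pi * delta_norm) + 1.0) / 2.0   # Equation (23)
    return theta_prev + w * delta                        # Equation (24)
```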
(a) Detected objects in 2D bounding boxes. (b) 2D to 3D bounding box conversion. (c) Final 3D representation.
Figure 13. 2D to 3D bounding box conversion process for four categories of vehicle/truck, pedestrian, bus, and cyclist. Laidlaw Library surveillance camera footage.

4 Experiments
In this section, we evaluate the performance and accuracy of the proposed 3D road-user detection model, followed by assessing the efficiency of the proposed environment modelling.

4.1 Performance Evaluation
The majority of modern object detectors are trained and evaluated on large and common datasets such as Microsoft Common Objects in Context (MS-COCO) [39]. The COCO dataset consists of 886,284 samples of general annotated objects for 80 categories (including person, animal, appliance, vehicle, accessory, etc.) in 123,287 images. However, none of them is dedicated to traffic monitoring purposes.

We considered the MIO-TCD dataset [40], which consists of 648,959 images and 11 traffic-related annotated categories (including cars, pedestrians, bicycles, buses, three types of trucks, two types of vans, motorised vehicles, and non-motorised vehicles), to train and evaluate our models. The dataset has been collected at different times of the day and different seasons of the year by thousands of traffic cameras deployed all over Canada and the United States.

As per Table 1, we also considered two more traffic monitoring video footages, UA-DETRAC [49] and GRAM Road-Traffic Monitoring (GRAM-RTM) [48], to test our models under different weather and day/night lighting conditions. Moreover, we set up our own surveillance camera at one of the highly interactive intersections of Leeds City, near the Parkinson Building at the University of Leeds, to further evaluate the performance of our model on a real-world scenario consisting of 940,000 video frames of live traffic.

As mentioned in the Methodology section (3.1), we adopted transfer learning to train different architectures of the YOLOv5 model on the MIO-TCD dataset. We exploited the pre-trained weights of the 80-class COCO dataset as the initial weights of our fine-tuning process.

There are four versions of YOLOv5, which are distinguished by the number of learnable parameters: the "small" version with 7.5 million parameters is a lightweight version, followed by the "medium" version (21.8 million), the "large" version (47.8 million), and the "xlarge" version, which has 89 million learnable parameters. We performed experiments with different numbers of head modules, consisting of three or four head outputs to classify different sizes of objects (as described in Section 3.1).

In the training phase (Figure 14a), we minimised the loss function of the adapted YOLOv5 based on a sum of three loss terms: the "C-IoU loss" as the bounding box regression loss, the "objectness confidence loss", and the "binary cross entropy" as the classification loss.

In order to choose an optimal learning rate and avoid a long training time, we used a one-cycle learning rate [56]. This gradually increases the learning rate to a certain value (the warm-up phase), followed by a decreasing trend to find the minimum loss while avoiding local minima. In our experiments, we found minimum and maximum learning rates of 0.01 and 0.2 to be the optimum values.

Figure 14 illustrates the analytic graphs of the training and validation processes. As per the classification graphs (Fig. 14a), the training loss continues to decrease around epoch 35, while the validation loss starts increasing (Fig. 14b). This is a sign of over-fitting, in which the model starts memorising the dataset instead of learning generalised features. To avoid the effects of over-fitting, we choose the optimal weights which yield the minimum validation loss.

Table 2 compares the performance of the proposed YOLOv5-based model with 10 other state-of-the-art object detection methods on the challenging MIO-TCD dataset. Two metrics, mAP and speed (fps), are investigated.
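As a side note on the training setup, the one-cycle learning-rate policy mentioned above maps naturally onto PyTorch's built-in scheduler; a minimal sketch is shown below, where the 0.01/0.2 bounds follow the reported values and all other numbers are illustrative.

```python
# Minimal sketch of the one-cycle learning-rate schedule.
import torch

model = torch.nn.Conv2d(3, 16, 3)                       # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.2,             # peak learning rate reached after the warm-up phase
    div_factor=20,          # initial lr = max_lr / 20 = 0.01
    total_steps=50 * 1000,  # epochs * iterations per epoch (illustrative)
)

# per training iteration:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```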
Table 1. Specifications of the test datasets used in this research, including various weather conditions, resolutions, frame rates,
and video lengths.
Figure 14. Error minimisation graphs of the model in training and validation phases, after 50 epochs. (a) Train.
As can be seen, the adapted YOLOv5-based model achieves a considerable increase in mean average precision compared to the former standard YOLOv4 algorithm (84.6% versus 80.4%). The experiments also show that the 3-head versions of YOLOv5 are more efficient for traffic monitoring than the 4-head versions. The lightweight version of YOLOv5 reaches the highest speed (123 fps), while sacrificing 1.7% of accuracy compared to the highest rate (84.6%).

The YOLOv5 xLarge and Large models, with 3 heads, reach the highest accuracy of 84.6% on the MIO-TCD benchmark dataset. Although the xLarge model has more parameters to learn features, its network complexity is greater than what is required to learn the features in this dataset, which prevents the accuracy from going beyond 84.6%; it also lacks adequate speed for real-time operation. The 3-head YOLOv5 Large, in contrast, has the same mAP score and provides a real-time performance of 36.5 fps, which makes it more suitable for cases in which heavy post-processing procedures are involved.

Table 3 shows the test results of our best-performing detection model (YOLOv5-Large, 3-head) on the UA-DETRAC and GRAM-RTM datasets, with very high precision rates of 99.8% and 99.7%, respectively. The GRAM-RTM dataset only provides ground-truth annotations for one lane of the road, so we applied a mask to ignore the non-annotated lanes; otherwise, our proposed model is capable of detecting vehicles in both lanes.

Table 2. A comparison of mean average precision (mAP) between the developed models and 10 other models on the MIO-TCD dataset. The accuracy scores of the 3 truck categories (Articulated Truck, Pickup Truck and Single Unit Truck) are averaged and presented in one column, "Trucks × 3".

Table 3. Detection performance of our YOLOv5 Large (3 head) model on two auxiliary traffic-related datasets.

Datasets          Precision   Recall
UA-DETRAC [49]    99.8%       99.7%
GRAM-RTM [48]     99.7%       99.5%

Figure 15, top row, shows the results of the detection algorithm and the 3D localisation of the road users. Figure 15, bottom row, shows the environment modelling of the scene as a digital twin with live traffic information. Such live information (which can be stored on cloud servers) would be very useful for city councils, police, governmental authorities, and traffic policy makers, and even as an extra source of processed data for automated vehicles (AVs) traversing the same zone. Such rich digital twins of the road condition, along with the ego-vehicles' sensory data, can significantly enhance the AVs' capability in dealing with corner cases and complicated traffic scenarios.

In Figure 15 we also demonstrate the efficiency of the heading angle estimation and the tracking system in the case of full occlusions. As can be seen, one of the cars in the scene is taking a U-turn, and we have properly identified the heading angle of the car at frame 82100 (indicated with a blue arrow). This can be compared with its previous angle and position in frame 82000. Considering the position and the heading angle of the vehicle at frames 82000 and 82100, the 3D bounding box of the vehicle is also determined.

As another complicated example in the same scene, one of the cars is fully occluded by a passing bus at frame 82100 (indicated with a red arrow). However, the car has been fully traced by utilising the spatio-temporal information and tracking data from frame 82000 and beyond.

(a) Detected cars and bus on frame 82000. (b) Detected cars and bus on frame 82100. (c) Environment mapping result of frame 82000. (d) Environment mapping result of frame 82100.
Figure 15. The outputs of the adapted 3-head YOLOv5-large algorithm for road-user detection and environment modelling.

4.2 Environment Modelling and Traffic Analysis
In order to make the most of the detection and tracking algorithms and to provide smart traffic monitoring analysis, we defined three possible states for vehicles and pedestrians as follows:

• Parking: a set $\mathcal{P}$ contains all of the vehicles which are less than one metre away (in the Ǐ domain) from the road border ($l^{r}_{v_i}$) and whose temporal speeds ($\vartheta_{v_i}$) have been close to zero for more than 1 minute.

• Speeding Violation: a set $\mathcal{S}$ consists of vehicles whose speed ($\vartheta_{v_i}$) is more than the speed limit of the road (i.e. 30 mph for Leeds city centre, UK).

• Collision Risk: a set $\mathcal{D}$ consists of pedestrians whose distances from vehicles are less than a metre, where the vehicles are not in the parking status $\mathcal{P}$.

To analyse the traffic condition, we buffer the counts and locations of tracked vehicles and pedestrians during a period of time (e.g. 6,000 frames), as shown by the line graph in Figure 16a.

In order to visualise a long-term spatio-temporal statistical analysis of the traffic flow and the interactions between road users, a heat map representation is created, similar to our previous work in another context for social distancing monitoring [60]. The heat map is defined by the matrix $\check{H}^{t} \in \mathbb{R}^{\check{w} \times \check{h}}$ in the satellite domain, where $t$ is the frame-ID number. The matrix is initially filled with zeros and accumulates the latest locations of objects over the input image sequence. The heat map is updated in each frame by the function $G_{(\mathrm{object})}(\check{H})$, which applies a $3 \times 3$ Gaussian matrix centred at the object's location $(\check{x}, \check{y})$ on the $\check{H}$ matrix. Finally, we normalise the heat map intensity values between 0 and 255 in order to visualise it as a colour-coded heat image; a colour spectrum is then mapped to the stored values in the $\check{H}$ matrix, in which the red spectrum represents the higher values and the blue spectrum represents the lower values.

The heat map of the detected pedestrians is denoted by $\check{H}_{(p)}$, which is updated over time as follows:

$\check{H}^{t}_{(p)} = G_{(p_i)}(\check{H}^{t-1}_{(p)}) \quad \forall p_i \in P$   (25)
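A minimal sketch of this heat-map accumulation is given below: a small Gaussian kernel is stamped at each tracked object's BEV location and the accumulated map is later normalised to 0-255 for colour-coded visualisation; the colour map choice is illustrative.

```python
# Minimal sketch of the heat-map update used in Equations (25)-(27).
import cv2
import numpy as np

KERNEL = cv2.getGaussianKernel(3, 0)          # 3x1 Gaussian
KERNEL = KERNEL @ KERNEL.T                    # 3x3 Gaussian matrix

def update_heatmap(H: np.ndarray, locations_bev) -> np.ndarray:
    """Accumulate one frame of BEV object locations (x̌, y̌) into float heat map H."""
    h, w = H.shape
    for x, y in locations_bev:
        x, y = int(round(x)), int(round(y))
        if 1 <= x < w - 1 and 1 <= y < h - 1:
            H[y - 1:y + 2, x - 1:x + 2] += KERNEL
    return H

def to_colour(H: np.ndarray) -> np.ndarray:
    """Normalise to 0-255 and map to a colour spectrum for visualisation."""
    norm = cv2.normalize(H, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(norm, cv2.COLORMAP_JET)
```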
Figure 16b illustrates the developed heat map $\check{H}_{(p)}$ on the satellite image. The lower range values have been removed for better visualisation. This figure provides valuable information about the pedestrians' activity. For instance, we can see that a significant number of pedestrians have crossed at the dedicated zebra crossing, shown by the green rectangle. However, in another region of the road (marked by a red rectangle), many other pedestrians cross a part of the road where there is no zebra crossing. Also, there are a few pedestrians who have crossed the street directly in front of the bus station.

Similarly, the heat map for the detected vehicles is defined as follows:

$\check{H}^{t}_{(v)} = G_{(v_i)}(\check{H}^{t-1}_{(v)}) \quad \forall v_i \in V,\; v_i \notin \mathcal{P}$   (26)

where $\check{H}_{(v)}$ stores the locations of moving vehicles only (not stationary or parked vehicles). This matrix is illustrated in Figure 16c. The heat map shows that more vehicles traverse the left lane of the road compared to the opposite direction on the right lane.

The heat map images can also be mapped to the perspective space by $\hat{H} = \Lambda(\check{H}, G^{-1})$. Figures 16d and 16e are the corresponding maps of Figures 16b and 16c, respectively.

We also investigated the speed violation heat map $\check{H}_{(\vartheta)}$ and the areas in which vehicles violated the speed limit of the road:

$\check{H}^{t}_{(\vartheta)} = G_{(v_i)}(\check{H}^{t-1}_{(\vartheta)}) \quad \forall v_i \in \mathcal{S}$   (27)

(a) Vehicle and pedestrian counts over 6,000 video frames. Source: Parkinson Building CCTV surveillance camera. (b) BEV pedestrian movements heat map. (c) BEV vehicle movements heat map. (d) Pedestrian movements heat map (perspective view). (e) Vehicle movements heat map (perspective view).
Figure 16. Spatio-temporal long-term analysis of vehicles' and pedestrians' activity using the Parkinson Building surveillance camera, Leeds, UK.

Figures 17a and 17b illustrate an instance of the speed violation heat map calculated over the 10,000 selected frames. As can be seen, the speeding violations decrease significantly near the pedestrian crossing zone, which makes sense. As a very useful application of the developed model, similar investigations can be conducted in various parts of city and urban areas in order to identify less known or hidden hazardous zones where vehicles may breach the traffic rules.

The graph shown in Figure 17c represents the average speed of all vehicles in the scene during the selected period of monitoring. In each frame, the average speed is calculated by:

$\bar{\vartheta} = \dfrac{\sum \vartheta_{v_i}}{n_v} \quad \forall v_i \in V,\; v_i \notin \mathcal{P}$   (28)

where $n_v$ is the number of vehicles that are not in the Parking state.

In order to identify the congested and crowded spots in the scene, we can monitor the vehicles that are, for example, less than 2 m apart from each other with an average speed lower than, e.g., 5 mph. The shorter the vehicles' proximity over a longer period of time, the larger the values stored in the congestion buffer; consequently, a hotter heat map is generated. Defining the optimum values of the distance and speed thresholds requires an intensive analytical and statistical data collection and assessment based on the road type (e.g. a highway or a city road), which is out of the scope of this research. However, as a general-purpose solution and similar to the previous heat maps, we defined the congestion heat map $\check{H}_{(C)}$ as follows:

$\check{H}^{t}_{(C)} = G_{(v_i)}(\check{H}^{t-1}_{(C)}) \quad \forall v_i \in \mathcal{A}$   (29)

where $\mathcal{A}$ is an ID set of vehicles that are in the congested areas.

(a) Speed violation heat map (perspective view). (b) BEV speed violation heat map. (c) Average speed of moving vehicles in the scene.
Figure 17. Automated speed monitoring and heat map analysis based on 10,000 video frames from the Laidlaw Library surveillance camera, Leeds, UK.

Figure 18. Heat map representation of congested areas based on 10,000 live video frames from Woodhouse Lane, Leeds LS2 9JT, UK.

As we can see in Figure 18, there are two regions of congestion: one before the pedestrian crossing, which is probably due to the red traffic light that stops the vehicles, and a second congestion spot at the T-junction (top left side of the scene), where the vehicles stop and line up before joining the main road.

Figure 19 shows the pedestrian behaviour heat map, obtained by monitoring the pedestrians who do not maintain a minimum safety distance of 2 m to the passing vehicles. Similarly, the heat map of the high-risk pedestrians can be updated according to the following equation:

$\check{H}^{t}_{(W)} = G_{(p_i)}(\check{H}^{t-1}_{(W)}) \quad \forall p_i \in \mathcal{D}$   (30)

The hot area in front of the bus station is most likely caused by the buses which stop just beside the bus station. The heat map also shows another very unsafe and risky spot in the same scene, where some of the pedestrians have crossed through the middle of a complex 3-way intersection. This may have been caused by careless pedestrians who try to reach the bus stop or leave the bus stop via a high-risk shortcut.
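A minimal sketch of flagging such high-risk pedestrians (the set D described in Section 4.2) is shown below: pairwise BEV distances between pedestrians and non-parked vehicles are compared against a distance threshold; the pixel-to-metre conversion factor is a placeholder calibration constant.

```python
# Minimal sketch of collision-risk flagging: pedestrians closer than thr_m metres
# to any moving (non-parked) vehicle in the BEV domain.
import numpy as np

def collision_risk_ids(ped_xy, veh_xy, veh_parked, px_per_metre: float, thr_m: float = 1.0):
    ped = np.asarray(ped_xy, dtype=float)          # shape (P, 2), BEV coordinates
    veh = np.asarray(veh_xy, dtype=float)          # shape (V, 2), BEV coordinates
    moving = veh[~np.asarray(veh_parked)]          # exclude vehicles in the Parking set
    if len(ped) == 0 or len(moving) == 0:
        return []
    # pairwise Euclidean distances (pixels), converted to metres
    d = np.linalg.norm(ped[:, None, :] - moving[None, :, :], axis=2) / px_per_metre
    return np.flatnonzero((d < thr_m).any(axis=1)).tolist()
```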
All experiments and performance evaluations in this research were conducted on a PC workstation with an Intel® Core™ i5-9400F processor and an NVIDIA RTX 2080 GPU with CUDA version 11. All services were performed within a unified software framework using parallel processing for simultaneous utilisation of all processor cores to enhance the execution performance. Similarly, all image-processing-related calculations were performed on GPU tensor units to increase speed and efficiency. The running time of the whole set of services is 0.05 ms, except for the object detector, whose speed can vary slightly depending on the lighting and complexity of the environment.

5 Conclusion
In this article, we proposed a real-time traffic monitoring system called Traffic-Net which applies a customised 3-head YOLOv5 model to detect various categories of vehicles and pedestrians. A multi-class and multi-object tracker named MOMCT was also developed for accurate and continuous classification, identification, and localisation of the same objects over consecutive video frames, as well as prediction of the next position of vehicles in case of missing information. In order to develop a general-purpose solution applicable to the majority of traffic surveillance cameras, we introduced an automatic camera calibration technique (called SG-IPM) to estimate real-world positions and distances using a combination of near-perpendicular satellite images and ground information.

Having the real-world positions of the vehicles, a constant acceleration Kalman filter was applied for smooth speed estimation. Using spatio-temporal moving trajectory information, the heading angles of vehicles were also calculated. We also introduced the ABF method to remove the angle variation noise caused by occlusion, sensor limitations, or detection imperfections. These led to 3D bounding box estimation and traffic heat map modelling and analysis, which can help researchers and authorities to automatically analyse road congestion, high-risk areas, and pedestrian-vehicle interactions. Experimental results on the MIO-TCD dataset and a real-world road-side camera confirmed that the proposed approach dominates 10 state-of-the-art research works in ten categories of vehicle and pedestrian detection. Tracking, auto-calibration, and automated congestion detection with a high level of accuracy (up to 84.6%) and stability over various lighting conditions were other outcomes of this research.

As a future study, and in order to improve the feature matching process between the camera and satellite images, a neural network-based feature matching algorithm can be applied to increase the accuracy. Also, many other strategies (such as evolutionary algorithms, feature engineering, and generative models) can be used to provide more robust features and to tackle the matching failures. The availability of larger datasets can further help to improve the accuracy of the heat maps, to identify high-risk road spots, and to support further statistical analyses.

Acknowledgement
The research has received funding from the European Commission Horizon 2020 program under the L3Pilot project, grant No. 723051, as well as the interACT project from the European Union's Horizon 2020 research and innovation program, grant agreement No. 723395. Responsibility for the information and views set out in this publication lies entirely with the authors.

References
1. Nambiar, R., Shroff, R. & Handy, S. Smart cities: Challenges and opportunities. In 2018 10th International Conference on Communication Systems & Networks (COMSNETS), 243–250, DOI: 10.1109/COMSNETS.2018.8328204 (2018).
2. Sheng, H., Yao, K. & Goel, S. Surveilling surveillance: Estimating the prevalence of surveillance cameras with street view data. arXiv preprint arXiv:2105.01764 (2021).
3. Olatunji, I. E. & Cheng, C.-H. Video analytics for visual surveillance and applications: An overview and survey. Mach. Learn. Paradigms, 475–515, DOI: 10.1007/978-3-030-15628-2_15 (2019).
4. Mondal, A., Dutta, A., Dey, N. & Sen, S. Visual traffic surveillance: A concise survey. In Frontiers in Artificial Intelligence and Applications, vol. 323, 32–41, DOI: 10.3233/FAIA200043 (IOS Press, 2020).
5. Poddar, M., Giridhar, M., Prabhu, A. S., Umadevi, V. et al. Automated traffic monitoring system using computer vision. In 2016 International Conference on ICT in Business Industry & Government (ICTBIG), 1–5, DOI: 10.1109/ICTBIG.2016.7892717 (IEEE, 2016).
6. Hu, W., Tan, T., Wang, L. & Maybank, S. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Syst., Man, Cybern., Part C: Appl. Rev. 34, 334–352, DOI: 10.1109/TSMCC.2004.829274 (2004).
7. Yang, W., Fang, B. & Tang, Y. Y. Fast and accurate vanishing point detection and its application in inverse perspective mapping of structured road. IEEE Transactions on Syst., Man, Cybern.: Syst. 48, 755–766, DOI: 10.1109/TSMC.2016.2616490 (2018).
8. Oliveira, M., Santos, V. & Sappa, A. D. Multimodal inverse perspective mapping. Inf. Fusion 24, 108–121, DOI: 10.1016/j.inffus.2014.09.003 (2015).
9. Brunetti, A., Buongiorno, D., Trotta, G. F. & Bevilacqua, V. Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing 300, 17–33, DOI: 10.1016/j.neucom.2018.01.092 (2018).
10. Rezaei, M., Terauchi, M. & Klette, R. Robust vehicle detection and distance estimation under challenging lighting conditions. IEEE Transactions on Intell. Transp. Syst. 16, 2723–2743, DOI: 10.1109/TITS.2015.2421482 (2015).
11. Gawande, U., Hajari, K. & Golhar, Y. Pedestrian detection and tracking in video surveillance system: Issues, comprehensive review, and challenges. Recent Trends Comput. Intell., DOI: 10.5772/intechopen.90810 (2020).
12. Cheung, S.-c. S. & Kamath, C. Robust techniques for background subtraction in urban traffic video. In Panchanathan, S. & Vasudev, B. (eds.) Visual Communications and Image Processing 2004, vol. 5308, 881–892, DOI: 10.1117/12.526886 (SPIE, 2004).
13. Zhou, J., Gao, D. & Zhang, D. Moving vehicle detection for automatic traffic monitoring. IEEE Transactions on Veh. Technol. 56, 51–59, DOI: 10.1109/TVT.2006.883735 (2007).
14. Chintalacheruvu, N., Muthukumar, V. et al. Video based vehicle detection and its application in intelligent transportation systems. J. Transportation Technologies 2, 305, DOI: 10.4236/jtts.2012.24033 (2012).
15. Cheon, M., Lee, W., Yoon, C. & Park, M. Vision-based vehicle detection system with consideration of the detecting location. IEEE Transactions on Intell. Transp. Syst. 13, 1243–1252, DOI: 10.1109/TITS.2012.2188630 (2012).
16. Zou, Z., Shi, Z., Guo, Y. & Ye, J. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055 (2019).
17. Jiao, L. et al. A survey of deep learning-based object detection. IEEE Access 7, 128837–128868, DOI: 10.1109/ACCESS.2019.2939201 (2019).
18. Liu, W. et al. SSD: Single shot multibox detector. In European Conference on Computer Vision, 21–37, DOI: 10.1007/978-3-319-46448-0_2 (Springer, 2016).
19. Arinaldi, A., Pradana, J. A. & Gurusinga, A. A. Detection and classification of vehicles for traffic video analytics. Procedia Comput. Sci. 144, 259–268, DOI: 10.1016/j.procs.2018.10.527 (2018). INNS Conference on Big Data and Deep Learning.
20. Peppa, M. V. et al. Towards an end-to-end framework of CCTV-based urban traffic volume detection and prediction. Sensors 21, DOI: 10.3390/s21020629 (2021).
21. Bui, K.-H. N., Yi, H. & Cho, J. A multi-class multi-movement vehicle counting framework for traffic analysis in complex areas using CCTV systems. Energies 13, DOI: 10.3390/en13082036 (2020).
22. Mandal, V., Mussah, A. R., Jin, P. & Adu-Gyamfi, Y. Artificial intelligence-enabled traffic monitoring system. Sustainability 12, 9177, DOI: 10.3390/su12219177 (2020).
23. Arnold, E. et al. A survey on 3D object detection methods for autonomous driving applications. IEEE Transactions on Intell. Transp. Syst. 20, 3782–3795, DOI: 10.1109/TITS.2019.2892405 (2019).
24. Zhang, Z., Zheng, J., Xu, H. & Wang, X. Vehicle detection and tracking in complex traffic circumstances with roadside LiDAR. Transp. Res. Rec. 2673, 62–71, DOI: 10.1177/0361198119844457 (2019).
25. Zhang, J., Xiao, W., Coifman, B. & Mills, J. P. Vehicle tracking and speed estimation from roadside LiDAR. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13, 5597–5608, DOI: 10.1109/JSTARS.2020.3024921 (2020).
26. Song, Y., Yao, J., Ju, Y., Jiang, Y. & Du, K. Automatic detection and classification of road, car, and pedestrian using binocular cameras in traffic scenes with a common framework. Complexity 2020, DOI: 10.1155/2020/2435793 (2020).
27. Alldieck, T., Bahnsen, C. H. & Moeslund, T. B. Context-aware fusion of RGB and thermal imagery for traffic monitoring. Sensors 16, DOI: 10.3390/s16111947 (2016).
28. Fernandes, D. et al. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy. Inf. Fusion 68, 161–191, DOI: 10.1016/j.inffus.2020.11.002 (2021).
29. Zhou, T., Fan, D.-P., Cheng, M.-M., Shen, J. & Shao, L. RGB-D salient object detection: A survey. Comput. Vis. Media, 1–33, DOI: 10.1007/s41095-020-0199-z (2021).
30. Laga, H. A survey on deep learning architectures for image-based depth reconstruction. arXiv preprint arXiv:1906.06113 (2019).
31. Xie, J., Girshick, R. & Farhadi, A. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In European Conference on Computer Vision, 842–857, DOI: 10.1007/978-3-319-46493-0_51 (Springer, 2016).
32. Bhoi, A. Monocular depth estimation: A survey. arXiv preprint arXiv:1901.09402 (2019).
33. Rezaei, M. & Klette, R. Computer Vision for Driver Assistance. Cham: Springer Int. Publ. 45, DOI: 10.1007/978-3-319-50551-0 (2017).
34. Dubská, M., Herout, A., Juránek, R. & Sochor, J. Fully automatic roadside camera calibration for traffic surveillance. IEEE Transactions on Intell. Transp. Syst. 16, 1162–1171, DOI: 10.1109/TITS.2014.2352854 (2015).
35. Sochor, J., Juránek, R. & Herout, A. Traffic surveillance camera calibration by 3D model bounding box alignment for accurate vehicle speed measurement. Comput. Vis. Image Underst. 161, 87–98, DOI: 10.1016/j.cviu.2017.05.015 (2017).
36. Song, H. et al. 3D vehicle model-based PTZ camera auto-calibration for smart global village. Sustain. Cities Soc. 46, 101401, DOI: 10.1016/j.scs.2018.12.029 (2019).
37. Kim, Z. Camera calibration from orthogonally projected coordinates with noisy-RANSAC. In 2009 Workshop on Applications of Computer Vision (WACV), 1–7, DOI: 10.1109/WACV.2009.5403107 (2009).
38. Jocher, G. et al. ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations, DOI: 10.5281/zenodo.4679653 (2021).
39. Lin, T.-Y. et al. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312 (2015).
40. Luo, Z. et al. MIO-TCD: A new benchmark dataset for vehicle classification and localization. IEEE Transactions on Image Process. 27, 5129–5141, DOI: 10.1109/TIP.2018.2848705 (2018).
41. Wang, C.-Y. et al. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 390–391 (2020).
42. Huang, Z. et al. DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection. Inf. Sci. 522, 241–258, DOI: 10.1016/j.ins.2020.02.067 (2020).
43. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8759–8768, DOI: 10.1109/CVPR.2018.00913 (2018).
44. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), 2999–3007, DOI: 10.1109/ICCV.2017.324 (2017).
45. Zheng, Z. et al. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 12993–13000, DOI: 10.1609/aaai.v34i07.6999 (2020).
46. Wojke, N., Bewley, A. & Paulus, D. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), 3645–3649, DOI: 10.1109/ICIP.2017.8296962 (2017).
47. Bewley, A., Ge, Z., Ott, L., Ramos, F. & Upcroft, B. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), DOI: 10.1109/ICIP.2016.7533003 (2016).
48. Guerrero-Gomez-Olmedo, R., Lopez-Sastre, R. J., Maldonado-Bascon, S. & Fernandez-Caballero, A. Vehicle tracking by simultaneous detection and viewpoint estimation. In IWINAC 2013, Part II, LNCS 7931, 306–316, DOI: 10.1007/978-3-642-38622-0_32 (2013).
49. Wen, L. et al. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst., DOI: 10.1016/j.cviu.2020.102907 (2020).
50. Niu, H., Lu, Q. & Wang, C. Color correction based on histogram matching and polynomial regression for image stitching. In 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), 257–261, DOI: 10.1109/ICIVC.2018.8492895 (2018).
51. Yu, G. & Morel, J.-M. ASIFT: An algorithm for fully affine invariant comparison. Image Process. On Line 1, 11–38, DOI: 10.5201/ipol.2011.my-asift (2011).
52. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, 1150–1157, DOI: 10.1109/ICCV.1999.790410 (1999).
53. Fischler, M. A. & Bolles, R. C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395, DOI: 10.1145/358669.358692 (1981).
54. Wu, M., Zhang, C., Liu, J., Zhou, L. & Li, X. Towards accurate high resolution satellite image semantic segmentation. IEEE Access 7, 55609–55619, DOI: 10.1109/ACCESS.2019.2913442 (2019).
55. Adams, R. & Bischof, L. Seeded region growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 641–647, DOI: 10.1109/34.295913 (1994).
56. Smith, L. N. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820 (2018).
57. Jung, H. et al. ResNet-based vehicle classification and localization in traffic surveillance systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, DOI: 10.1109/CVPRW.2017.129 (2017).
58. Wang, T., He, X., Su, S. & Guan, Y. Efficient scene layout aware object detection for traffic surveillance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, DOI: 10.1109/CVPRW.2017.128 (2017).
59. Hedeya, M. A., Eid, A. H. & Abdel-Kader, R. F. A super-learner ensemble of deep networks for vehicle-type classification. IEEE Access 8, 98266–98280, DOI: 10.1109/ACCESS.2020.2997286 (2020).
60. Rezaei, M. & Azarmi, M. DeepSocial: Social distancing monitoring and infection risk assessment in COVID-19 pandemic. Appl. Sci. 10, 7514, DOI: 10.3390/app10217514 (2020).
Appendix 1: Camera Calibration and Inverse Perspective Mapping
Knowing the camera intrinsic and extrinsic parameters, the actual position of 3D objects can be estimated from the 2D perspective image using Inverse Perspective Mapping (IPM) as follows:
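For points on the road plane, the planar transformation G used throughout Section 3.3 reduces to a standard 3×3 homography; the generic relation below is a sketch of this mapping and of the point transform Λ, not necessarily the exact form of Equation 35.

```latex
\begin{aligned}
\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix}
&= \mathbf{G}\,
\begin{bmatrix} \hat{x} \\ \hat{y} \\ 1 \end{bmatrix},
\qquad
\mathbf{G} =
\begin{bmatrix}
g_{11} & g_{12} & g_{13}\\
g_{21} & g_{22} & g_{23}\\
g_{31} & g_{32} & g_{33}
\end{bmatrix},\\[4pt]
(\check{x}, \check{y}) &= \Lambda\big((\hat{x},\hat{y}),\mathbf{G}\big)
= \left(\frac{x'}{w'},\; \frac{y'}{w'}\right).
\end{aligned}
```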