Train Rolling Stock Video Segmentation and Classification
Journal of Engineering and Applied Science (2022) 69:69
https://doi.org/10.1186/s44147-022-00128-x
*Correspondence: [email protected]; [email protected]
1 Department of Electronics and Communications Engineering, K.L. University, Green Fields, Vaddeswaram, Guntur DT, Andhra Pradesh 522 502, India
2 Image Speech and Signal Processing Research Group, Department of Electronics and Communications Engineering, Biomechanics and Vision Computing Research Center, K.L. University, Green Fields, Vaddeswaram, Guntur DT, Andhra Pradesh 522 502, India

Abstract
Train rolling stock examination (TRSE) is a physical procedure for inspecting the bogie parts of a train during transit at a little over 30 kmph. Currently, this process is performed manually across many railway networks worldwide. This work proposes to automate TRSE using artificial intelligence techniques. Previous works have proposed active contour-based models for the segmentation of bogie parts. Though accurate, these models require manual intervention and are iterative, making them unsuitable for real-time operation. In this work, we propose a segmentation model followed by a deep learning classifier that can markedly increase the deployability of such systems in real time. We apply the UNet model for the segmentation of bogie parts, which are further classified using an attention-based convolutional neural network (CNN) classifier. We also propose a shape deformable attention model to identify shape variations occurring in the video sequence due to viewpoint changes during train movement. The TRSNet is trained and tested on high-speed train bogie videos captured across four different trains. The results of the experimentation show that the proposed system improves recognition accuracy by 6% over the state-of-the-art classifiers previously developed for TRSE.
Keywords: Deep learning, Train rolling stock, Automation, Convolutional neural
networks, UNet
Introduction
Mass public transport in the Indian subcontinent is largely carried by railways. Indian Railways is one of the largest networks in the world, covering around 68,000 km of route with 13,000 passenger trains running across the rail network. Indian Railways serves around 23 million passengers every day, and more than 50% of these trains run for more than 500 km at a time. Operational research shows that the health of a train is directly related to its safety, which in turn affects the ride quality of the passengers. To ensure enhanced safety of a train worth around ₹100 crore and its passengers, the most widely practiced inspection mechanism during train movement is called train rolling stock examination (TRSE).
Deep learning models apply annotations on all the bogie video frames to supervise the recognition process with good accuracy. Moreover, these methods do not detect structures or shapes similar to the bogie parts for assessing structural durability during transit. Consequently, in this work the segmentation results at multiple stages are applied as attention to the classifier to achieve higher recognition rates. The attention mechanism follows the multi-head design of transformers for speech recognition applications, which has been adapted into our proposed model.
Specifically, this part of the section presents past research on technological developments toward automating TRSE. It covers three main threads: first, methods developed in general for rolling stock examination; second, computer vision-based models applied to video data for the bogie shape segmentation process; and third, the advancement of deep learning approaches and their applications in object detection.
5. Flat wheels
6. Missing bogie parts
The above parameters are all visually observable and are compared by trained human examiners against reference conditions for evaluation. This results in documentary evidence that provides an insight into the behavior of the bogie parts during transit. The objective of this work is to transform the above visually identifiable problems into computer vision-based models for automated TRSE. Currently, most rail network companies perform this examination manually due to the unavailability of technology or research resources for finding a commercially viable solution. However, video-based bogie part retrieval models have been developed in the past with considerable research impact in the field of computer vision.
contrast video frames of rolling stock [28]. This work has been the basis for automating TRSE. However, it does not describe any algorithms for bogie part identification. Another work that has drawn parallels with the above demonstrated the use of focus lights on the undercarriage to video capture the bogies [29]. Additionally, this work applies basic image processing models to extract the edges of bogie parts in order to identify them. However, the techniques described were not able to represent the overlapping boundaries of bogie parts in the video frames. Moreover, the blurring induced by the moving train made the edge detection process difficult for part identification. Recently, 3D modelling of contact bogie parts and wheel surfaces has been shown to achieve good results for the detection of defects [30, 31]. The biggest problem with 3D-modelled image data is the powerful graphics processing it requires, which makes these techniques incompatible with real-time processing.
The two biggest drawbacks of the above models were their inability to segment bogie parts effectively and the noisy video data that resulted from recording the train movement at only 30 frames per second. These two bottlenecks were efficiently handled by our previous models for TRSE [1]. To fight blurring, the recording is done using a high-speed wide-angle sports action visual sensor at 240 fps, which effectively yields exceptionally high-quality bogie frames. Secondly, the segmentation problems were addressed using active contour (AC) models with shape prior knowledge of the bogie parts [2, 3]. These shape-based active contours with local information [5] have presented a 99% accuracy in preserving the extracted bogie part shapes in the output of the models. Moreover, the work in [4] shows an upgraded touching-boundary segmentation algorithm for collectively extracting bogie parts from the video frames. This model has generated interest due to the fact that the bogie parts are indeed overlapping, as they support each other to tightly hold the entire structure as a single unit. The above AC-based models have performed well in segmenting the bogie parts effectively. Despite their success in bogie part segmentation, the AC models are iterative and are not suitable for real-time implementation of TRSE. Apart from the above, the existing TRSE automation algorithms lack the adaptability, scalability, and reliability needed to transform the results into real-time production models. Consequently, these gaps in current research methodologies have motivated us to pursue real-time implementable models for automating TRSE. Hence, deep learning approaches were leveraged to build and deploy automated TRSE systems for generating actionable intelligence for assisting rail companies.
of Yolo is modified for the extraction of bogie parts from the video sequences. The model was able to detect most of the bogie objects except where the part deformation exceeded 50% of the actual trained part. The biggest challenge in implementation is attributed to the annotation of bogie parts from video sequences along with the bounding box information on which the Yolo model is trained. Though the model recorded an 84.98% accuracy in correctly identifying bogie parts in a moving train video sequence, it failed to identify bogie parts with high confidence scores under even slight deformations of the objects caused by viewpoint variations. Moreover, to compensate for the object deformations, the model had to be trained on a large set of frames from the video sequence. Hence, it becomes extremely important to learn the object deformations for the segmentation process. In deep learning, the segmentation process has been applied through an architecture broadly called the hourglass model [39]. Subsequent upgrades with minor modifications have reported improved segmentation results, though their basic structure matches the hourglass model. The most popular and powerful variants of the hourglass are UNet [40], VNet [41], SegNet [42], and auto encoders [43]. The backbone network architectures in these segmentation modules can be any of the state-of-the-art network architectures such as VGG-16, ResNet-34, and Inception Net. Once the segmentation process is learned by the network using a very small dataset of bogie parts, the next step is classification. Generally, the segmented output is fed along with the original video frame into the classifier for recognition. The RGB input is multiplied with the segmented bogie parts and passed to the classification module designed using standard networks similar to the backbone segmentation network [44, 45]. Unfortunately, such multiplicative attention forces the user to segment all the bogie parts in all the video frames for maximally correct classification. Instead of performing the traditional multiplicative fusion between the RGB video frames and the segmented objects, this work offers a solution adapted from the natural language processing model called multi-head attention [46]. Similar methods were proposed in the automation of construction durability testing, such as identifying pavement cracks using capsule net segmentation [47] and PCGANs [48].
Finally, the proposed model brings a novel methodology for real-time implementable automated TRSE powered by computer vision, artificial intelligence, and video analytics. The next section elaborates the methods applied for automating TRSE with deep learning.
Methods
The most accurate and widely accepted attention model in speech processing is the multi-head attention model, which is capable of attending to a particular set of words during training. Similarly, the moving train induces motion artifacts such as bogie part shape deformation due to viewpoint variations with respect to the fixed camera position. Moreover, the camera outputs 240 fps video sequences with a lens angle of 54°. The method proposed in this work has a segmentation network followed by a multi-head attention-based bogie parts classifier. The entire model is called the deep bogie part inspector (DBPI), which has a segmentation module at the front end and an attention-based classifier model at the back end. The primary network in the DBPI model is fed with
Fig. 2 Deep bogie part inspector architecture with two models: the first model is UNet used for
segmentation of bogie parts, and the second is the multi-head attention-based classification network
the key frames extracted from the video. Key frames are important because the video is captured at 240 fps, so there are 240 frames within each second and the change across consecutive frames is barely noticeable; hence, key frames are extracted. Since all the bogie video frames have similar pixel densities, feature-based key frame extraction using histogram of oriented gradients (HOG) features with K-means clustering had little impact on the outcome. However, the entropy-based method [49] captures the variation in pixels across the video frames well. The frame entropy is computed as follows:
E(f) = -\sum_{j} p_f(j) \log p_f(j)        (1)
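For clarity, the sketch below shows one way the entropy-based key frame scoring of Eq. (1) could be implemented; it assumes grayscale frames, a 256-bin intensity histogram, and a base-2 logarithm, and the function names and top-k selection rule are illustrative rather than the authors' implementation.

```python
# Hedged sketch of entropy-based key frame scoring (Eq. 1).
# Assumptions: grayscale uint8 frames, 256-bin histogram, base-2 logarithm,
# and a simple "keep the top_k highest-entropy frames" selection rule.
import numpy as np

def frame_entropy(gray_frame: np.ndarray) -> float:
    """E(f) = -sum_j p_f(j) * log p_f(j) over the intensity histogram of one frame."""
    hist, _ = np.histogram(gray_frame, bins=256, range=(0, 255))
    p = hist.astype(np.float64) / max(hist.sum(), 1)
    p = p[p > 0]                                  # skip empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def select_key_frames(gray_frames, top_k=30):
    """Rank frames by entropy and keep the top_k most informative, in order."""
    scores = [frame_entropy(f) for f in gray_frames]
    best = np.argsort(scores)[::-1][:top_k]
    return sorted(int(i) for i in best)           # chronological key frame indices
```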
connections that take the cropped feature maps from the corresponding compressor blocks and concatenate them after the up-convolutions in the expander blocks at the same level. Cropping the encoder (compressor) block features is necessary to ensure uniform dimensionality for concatenation with the expander feature maps. These concatenated feature maps are further learned using two 3 × 3 convolutional layers followed by a non-linear ReLU. Finally, a 1 × 1 convolutional layer maps each of the 16-component feature vectors into the required classes. Our B-UNet has only 16 components against the 64 in the original UNet architecture. This is because the segmentation stroke of the bogie part in a video frame is small compared to the spatial resolution of the frame itself. After multiple experiments, the 16-channel filter was found to be adequate for bogie part segmentation while being computationally faster than the traditional UNet model.
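As an illustration of the reduced-width design described above, the following is a minimal sketch of a B-UNet-style network with a 16-channel first stage; it is not the authors' code. The depth (three levels), the use of same-padding in place of explicit cropping, and the two-class output head are assumptions made to keep the example short.

```python
# Minimal B-UNet-style sketch: two 3x3 conv + BN + ReLU per block, 16 base
# channels instead of UNet's 64, skip connections by concatenation, and a
# final 1x1 conv to the class maps. Depth and padding choices are assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class BUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)          # compressor level 1
        self.enc2 = conv_block(base, base * 2)       # compressor level 2
        self.enc3 = conv_block(base * 2, base * 4)   # bottleneck
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)   # expander level 2
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)       # expander level 1
        self.head = nn.Conv2d(base, n_classes, 1)    # 1x1 conv to class maps

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)
```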
The bogie segmentation has a large background region when compared to the spatial occupancy of the part in the video frame. This results in a loss dominated by the background during the training process, which frequently falls into a local minimum. Hence, we propose to apply the solution in [62], which addresses the foreground-background pixel imbalance in the rolling stock video frames. We applied two loss functions during the training process: a variant of binary cross-entropy (BCE) called focal loss (FL), and dice loss (DL). The FL is given by
FL(GT, p) =
    -\sum_{i=1}^{np_1} \alpha (1 - p_i)^{\gamma} \log(p_i),              if GT = 1
    -\sum_{i=1}^{np_0} (1 - \alpha) p_i^{\gamma} \log(1 - p_i),          otherwise        (2)
where GT is the ground truth pixel value in {0, 1} and p ∈ [0, 1] is the probability of foreground or background predicted by the model. np_0 and np_1 denote the numbers of background (0-valued) and foreground (1-valued) pixels, respectively. The values of α ∈ (0, 1] and γ ∈ [0, 5] are adjustable hyperparameters. For B-UNet, we selected α = 0.5 and γ = 1 across all datasets.
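A small sketch of how the focal loss of Eq. (2) could be computed for a binary segmentation map, using the stated α = 0.5 and γ = 1, is given below; the tensor layout and the numerical clamping are assumptions.

```python
# Hedged sketch of Eq. (2): alpha-balanced focal loss for binary masks.
# p holds predicted foreground probabilities; gt holds {0, 1} ground truth.
import torch

def focal_loss(p: torch.Tensor, gt: torch.Tensor,
               alpha: float = 0.5, gamma: float = 1.0) -> torch.Tensor:
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)                               # numerical safety
    fg = -alpha * (1.0 - p) ** gamma * torch.log(p)           # GT = 1 pixels
    bg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)     # GT = 0 pixels
    return torch.where(gt > 0.5, fg, bg).sum()
```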
The second loss used was dice loss (DL), which is regularly used in segmentation problems with deep learning models. Dice loss addresses the imbalance between foreground and background pixels by using the segmentation evaluation index between the
predicted segmentation mask and ground truth annotated masks. The DL is formulated
as follows:
DL(p, GT) = 1 - \frac{2 \sum_{i=1}^{np} p_i GT_i + \delta}{\sum_{i=1}^{np} p_i^2 + \sum_{i=1}^{np} GT_i^2 + \delta}        (3)
The overall segmentation loss SL combines the two terms:

SL = \frac{FL(p, GT)}{np} + DL(p, GT)        (4)
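The dice loss of Eq. (3) and the combined loss of Eq. (4) could be sketched as follows, reusing the focal_loss function from the previous sketch; the smoothing constant δ = 1 and the per-pixel normalization by np are assumptions consistent with the formulas above.

```python
# Hedged sketch of Eq. (3) (dice loss) and Eq. (4) (combined segmentation loss).
import torch

def dice_loss(p: torch.Tensor, gt: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    inter = (p * gt).sum()                                    # sum_i p_i * GT_i
    denom = (p ** 2).sum() + (gt ** 2).sum() + delta
    return 1.0 - (2.0 * inter + delta) / denom

def segmentation_loss(p: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    n_pixels = gt.numel()                                     # np in Eq. (4)
    return focal_loss(p, gt) / n_pixels + dice_loss(p, gt)    # SL
```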
The B-UNet segmentation network is trained on K key frames to extract bp bogie parts
from B ∈ [1, b] bogies and P ∈ [1, p] parts. Testing is initiated on the sequences of bogie
parts that were not previously seen by the B-UNet of different trains. The obtained parts
are now applied as inputs to the classifier to identify the bogie part correctly and provide
the necessary analysis.
B‑UNet implementation
The original frame size from the high-speed camera sensor was 1280 × 1918 at 240 fps. The sensor records 240 frames per second, so a 1-min video has around 240 × 60 = 14,400 frames. Our dataset consists of passenger trains from the Indian subcontinent, which have an average of 20 coaches per train. The average recording time per train was around 1.05 to 1.42 min. All the above values are computed from the video contents in our dataset. The average number of frames per train was found to be around 15,456. Using the entropy-based formulation, the key frame extractor retains only frames with maximally occupied bogie parts. The number of key frames per bogie is around 0.2% of the total frames, which is about 30 frames/bogie. There are two bogies per coach per side, so a 20-coach train has 40 bogies. Consequently, the full set of key frames for bogie part segmentation consists of 30 × 40 = 1200 video frames. From these 1200 bogie video frames, we train on only 8 bogies with 18 parts, because the bogie parts are fairly constant over the entire train and it is unnecessary to use all bogie frames for training. Training on 8 bogies also leaves room for data augmentation such as rotation, scaling, zooming, and horizontal and vertical flipping in our model. Finally, the training set has 320 frames with 100 different augmentations per frame, so the total dataset for B-UNet has 32 K video frames and 32 K ground truth labels with around 1778 parts per label. The filter kernels are initialized using a zero-mean, unit-variance Gaussian. A batch normalization layer is added after each convolution layer to stabilize training. The hyperparameters in the loss function are selected as discussed in the previous section for all the bogie videos through experimentation. The optimizer is Adam with a learning rate of 0.00001 and a momentum factor of 0.02. There is no decay in the learning rate as the error reaches a minimum value. All these
settings are unchanged across all datasets and for the other models used for comparison. All the models were implemented on an NVIDIA GTX 1070 Ti GPU with 16 GB memory. The number of epochs is set to 100 for all models.
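The stated training settings could be wired together roughly as below; the BUNet and segmentation_loss names refer to the earlier sketches, the data loader, batch size, and softmax output head are assumptions, and the momentum setting is left at Adam's defaults since the stated momentum factor has no single direct counterpart there.

```python
# Illustrative training loop with the stated hyperparameters: Adam at lr = 1e-5,
# no learning-rate decay, and 100 epochs. train_loader is an assumed DataLoader
# yielding augmented key-frame batches and their binary part masks.
import torch

model = BUNet(in_ch=3, n_classes=2, base=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

for epoch in range(100):                                   # 100 epochs for all models
    for frames, masks in train_loader:
        probs = torch.softmax(model(frames), dim=1)[:, 1]  # foreground probability map
        loss = segmentation_loss(probs, masks.float())     # Eq. (4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```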
The testing is performed on the full bogie video sequence, without key frame extraction, for segmenting the bogie parts. The segmented bogie parts are then arranged in chronological triplets of the current frame f_c, the previous frame f_{c-1}, and the next frame f_{c+1} for each bogie part. These three groups of segmented video frames form the input to the classifier, which is built on the multi-head attention model.
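As a simple illustration of this triplet arrangement, the sketch below groups the chronologically ordered segmented frames of one bogie part into (previous, current, next) tuples; the function name is hypothetical.

```python
# Hedged sketch: build (f_{c-1}, f_c, f_{c+1}) triplets for one bogie part.
def make_triplets(part_frames):
    """part_frames: chronologically ordered segmented frames of one bogie part."""
    return [
        (part_frames[c - 1], part_frames[c], part_frames[c + 1])
        for c in range(1, len(part_frames) - 1)
    ]
```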
Meanwhile, the output h(f_M, f_B) is multiplied with the features in the support space Φ(f_B) to form an intermediate space g(f_M, f_B), formulated as follows:

g(f_M, f_B) = h(f_M, f_B) \times w(f_B)        (6)

where w_j(f_B) are the features of the bogie parts at the jth position in the network. This ensures that the features relevant to the query image are retained and the irrelevant ones are discarded.
Fig. 4 The proposed cascaded feature matching module with multi head self-attentions for accurate
tracking of bogie part position and identification
Finally, the output of the matching network g(f_M, f_B) is reshaped to the shape of the original query features and is concatenated with them by applying a δ weighting rule. The formulation is computed as follows:

F_I = \delta \times g(f_M, f_B) + (1 - \delta)\, M(i)        (7)
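A compact sketch of the fusion in Eqs. (6) and (7) is given below. The matching output h(f_M, f_B), the support features w(f_B), and the query features M(i) are taken as precomputed tensors, since the attention blocks that produce them are not fully reproduced in this excerpt; their shapes and the default δ are assumptions.

```python
# Hedged sketch of Eqs. (6)-(7): multiply matching scores with support features,
# reshape to the query feature shape, and blend with the query features by delta.
import torch

def fuse_matched_features(h: torch.Tensor, w: torch.Tensor, m: torch.Tensor,
                          delta: float = 0.5) -> torch.Tensor:
    g = h * w                                 # Eq. (6): keep query-relevant support features
    g = g.reshape(m.shape)                    # match the original query feature shape
    return delta * g + (1.0 - delta) * m      # Eq. (7): delta-weighted combination F_I
```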
Experimentation
TRSE datasets
The datasets used in this work are shown in Fig. 5. A more detailed view of the capturing mechanism and sensor used is given in our earlier works [33]. The train rolling stock examination (TRSE) bogie videos are captured at different time stamps during the day as shown in Fig. 5. Each of these videos was captured at 240 frames per second and 1080p resolution while the train was moving at a little over 30 kmph. Each of the video datasets has more than 21,000 frames.
Since it was difficult to find defective bogies within a short period in real time, we simulated the defects found regularly on bogie parts using Photoshop and reinserted those frames into the original video sequence. Figure 6 shows two such defects on the spring suspension and binding rods.
The objective of the experimentation is to identify the following bogie parts in the
video sequence as shown in Fig. 7. Altogether, there are 16 bogie parts that should be
monitored during TRSE as per the Indian railway rolling stock examination manual.
The numbering will be part of the class names as there are multiple parts with the same
Fig. 6 Defective bogie parts inserted into the original video frames through Photoshop
name. A total of four parameters were used to judge the performance of the algorithms quantitatively, along with visual validation on the test video frames. They are intersection-over-union (IoU), mean average precision (mAP), mean false identification (mFI), and mean non-identification (mNI). The IoU is generally used for understanding the performance of the UNet segmentation module and its role in the judgment of the classifier. The range of IoU is between 0 and 1, with the latter being the desired value for a good segmentation algorithm. Similarly, the mAP gives the precision with which the classifier identifies the given bogie object. The mFI is a parameter that indicates the false identification of a bogie object, and mNI gives the inability of the classifier to identify the bogie object.
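For reference, the IoU used to assess the segmentation module could be computed as in the sketch below; binary masks are assumed, and mAP, mFI, and mNI are omitted since they depend on detection bookkeeping not detailed here.

```python
# Hedged sketch of the IoU between a predicted mask and its ground truth mask.
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                            # both masks empty: count as perfect overlap
    return float(np.logical_and(pred, gt).sum() / union)
```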
Fig. 8 UNet segmentation output for the bogie part axle across frames
Fig. 9 Ground truth (GT) masks of the axle in the dataset B-1
is accomplished by mapping the bounding box locations from the annotation data. The biggest advantage of the B-MHAC lies in the small training set that is sufficient for achieving robust performance over the entire set of test samples.
Fig. 10 Inferencing on test video frames from the dataset using the trained B-MHAC. Figure showing
randomly selected frames during the inferencing process. The order of the frames matches the order of the
datasets in Fig.5 (Zoom for better visibility)
The segmentation outputs for the other 17 bogie parts were found to be similar to Fig. 8. The first few frames in the segmentation output show weak edges, as the object is small and its deformation is rapid between frames. However, with increased object pixel density and reduced inter-frame object deformation, the segmentation process is relatively stable and provides bogie parts of exceptional quality for classification.
The trained B-MHAC is tested on the segmentation outputs of UNet, and the results are projected onto the actual video frames through bounding boxes. First, we show the results obtained on the datasets in Fig. 6. The inferencing results are shown in Fig. 10 on six different train videos captured under various circumstances. The overall bogie part retrieval is found to be around 90% in video sequences where the camera lens was perpendicular to the train movement. The relative position of the bogie parts in the frames does not affect the recognition accuracies due to the
presence of the segmentation module and the multi-head attention network. The multi-head attention network takes inputs from three sets of bogie parts at different time steps and generalizes on the location of the objects in the continuous video sequence. This has guaranteed more accurate mapping of the bounding box information onto the video sequence.
Consequently, the effectiveness of the B-MHAC bogie part identification model is ascertained by comparing its results against popular image object detection models such as SSD, R-CNN, Fast R-CNN, Faster R-CNN, and our previous method with different Yolo versions. The visual results are presented in Fig. 11 on the B-4 dataset. The proposed method outperformed the other models due to the presence of the multi-head attention network, which learns bogie object deformations across time steps.
This type of learning incorporates both spatial and temporal information for classification, making it robust to object deformations in the video sequences of moving trains. Finally, the B-MHAC is tested for defective part identification on modified video sequences. The video frames with defective parts are fabricated with two defects, on the spring suspension and the binding screw.
These defective part frames are inserted into the video sequence, and the model was trained from scratch to identify defects using the existing hyperparameters from the previous training. The results are projected onto the video sequence with a red bounding box for defective parts, as shown in Fig. 12. The ability of the proposed B-MHAC to identify defective parts is found to be impressive. This is due to the fact that the bogie part is first segmented and then passes through an attention span of multiple time steps, which allows the network to learn distinct features across classes. The next subsection presents the quantitative results on all the datasets with the parameters indicated above.
Fig. 11 Comparison of approaches for bogie part identification with the proposed B-MHAC (Zoom for better visibility). a SSD. b R-CNN. c Fast R-CNN. d Faster R-CNN. e Yolo v2 with skip. h Yolo v2 bifold skip. i UNet + VGG-16. j B-MHAC
Fig. 12 Defective parts identification through inferencing on trained B-MHAC (Zoom for better visibility).
a B-7: spring defects identification in frame 5596. b B-7: spring defects identification in frame 6152. c B-7:
binding screw defects identification in frame 7531. d B-7: binding screw defects identification in frame 7697.
e B-7: spring defects identification in frame 12589. f B-7: spring defects identification in frame 12785
have better identification potential than those that are away from it. In practice, it is extremely difficult to adjust the camera sensor position with respect to the moving train. Despite the above constraints, the B-MHAC has shown robust performance in instances where the camera sensor is arbitrarily positioned. Overall, the B-MHAC has shown the capability to sense bogie parts with exceptionally high accuracy when compared
(Columns, left to right: SSD, R-CNN, Fast R-CNN, Faster R-CNN, Yolo v1, Yolo v2, Yolo v2 with skip, Yolo v2 bifold skip, B-MHAC)

mAP
B-1     0.6843  0.6952  0.6856  0.7125  0.7752  0.7856  0.8152  0.9214  0.9587
B-2     0.6239  0.6531  0.6598  0.6859  0.7431  0.7658  0.7895  0.8152  0.9025
B-3     0.6151  0.6194  0.6252  0.6657  0.6894  0.7252  0.7594  0.8047  0.8956
B-4     0.6547  0.6773  0.6654  0.6913  0.6973  0.7754  0.7973  0.8478  0.9385
B-5     0.5955  0.6115  0.6175  0.6323  0.6615  0.7025  0.7415  0.8523  0.9122
B-6     0.5759  0.5936  0.5948  0.6189  0.6436  0.6868  0.7236  0.7321  0.8473
Average 0.6249  0.6416  0.6413  0.6677  0.7016  0.7402  0.7710  0.8289  0.9091

mFI
B-1     0.4215  0.4125  0.3785  0.3329  0.2882  0.2663  0.2156  0.1752  0.1124
B-2     0.4862  0.4598  0.4296  0.3889  0.3389  0.3025  0.2856  0.2531  0.1853
B-3     0.4621  0.4479  0.4129  0.3609  0.3268  0.2939  0.2556  0.2365  0.1722
B-4     0.4468  0.4352  0.4017  0.3569  0.3075  0.2701  0.2356  0.2036  0.1486
B-5     0.5374  0.5206  0.5251  0.5249  0.5161  0.5177  0.5056  0.4852  0.1672
B-6     0.5827  0.5933  0.5873  0.5789  0.5654  0.5215  0.5206  0.4952  0.2379
Average 0.48945 0.4782  0.4558  0.4239  0.3904  0.362   0.3364  0.3081  0.1706

mNI
D-1     0.5563  0.5125  0.4569  0.4236  0.3896  0.3456  0.3179  0.1856  0.1243
D-2     0.5936  0.5469  0.4856  0.4598  0.4189  0.3823  0.3495  0.2658  0.1975
D-3     0.6044  0.6093  0.5908  0.5815  0.5517  0.5355  0.5231  0.3856  0.2235
D-4     0.5459  0.4781  0.4282  0.3874  0.3603  0.3089  0.2863  0.1925  0.1385
D-5     0.5017  0.5037  0.4895  0.4312  0.3931  0.3622  0.3147  0.2489  0.1596
D-6     0.6271  0.6149  0.6121  0.6088  0.5924  0.5894  0.5515  0.4023  0.2578
Average 0.5715  0.5442  0.5105  0.482   0.451   0.42    0.39    0.2801  0.1835
to other models. This is due to the combined use of dedicated networks for segmentation and recognition operating together.
Table 5 Experimental results showing defect identification abilities of TRSE automation models
with mAP as the performance indicator
Baseline methods/parameters: SSD | R-CNN | Fast R-CNN | Faster R-CNN | Yolo v1 | Yolo v2 | Yolo v2 with skip | Yolo v2 bifold skip | B-MHAC
mAP  0.4852  0.5325  0.5289  0.5475  0.6125  0.6589  0.6895  0.8745  0.9135
mFI  0.5987  0.5847  0.5245  0.5125  0.4528  0.4753  0.4236  0.2698  0.2258
mNI  0.5463  0.5126  0.5247  0.5169  0.4863  0.4236  0.4198  0.2891  0.2122
of the algorithms to determine defective parts. Accordingly, one set of training samples was selected as defective parts. In this work, only two defects were induced manually, on the spring suspension and the binding screw. A total of 200 frames were created with the two defects and were inserted into the video sequence of B-1. These are called broken part defects, where the width and location of the cut are varied every 20 frames. The testing is performed with a 4000-frame video where 40 continuous frames were inserted into the original B-1 dataset at 5 randomly selected locations. The results of the experiment are shown in Table 5. Markedly, the proposed method shows robust defect identification capabilities over the other methods by taking advantage of the multi-head attention network. However, defects of inconsistent dimensions have occasionally gone undetected in the video sequence.
training run of these models. Evidently, the learning-based models have performed considerably better than the instance-based methods. Previously, active contours were used exclusively for the segmentation of bogie parts, with prior knowledge about the bogie part characteristics. Although the active contours have been shown to possess superior segmentation quality, they have poor generalization capabilities on the test inputs. Hence, on video sequences with different camera angles, these models have performed weakly.
Conclusions
An attempt has been made to apply deep learning approaches to automate TRSE. Initially, high-speed video sequences were recorded, and a dataset was created with high sparsity and resolution. A hybrid segmentation-classification method has been proposed to simultaneously segment and classify train bogie parts from video sequences. In contrast to regular CNN models, we propose a multi-stream, multi-head bogie part classifier (B-MHAC) operating on the segmented parts. Through extensive experimentation, it has been found that the proposed method resulted in an average recognition accuracy of 90.11%. The success of B-MHAC is credited to the attention mechanism at multiple time steps in the video sequence, which helped the classifier generalize better to the bogie part deformations of running trains during recording. Furthermore, the approach allows for an automated interface environment where TRSE can be performed remotely with high accuracy.
Abbreviations
TRSE Train rolling stock examination
UNET U-shaped convolution network
CNN Convolutional neural network
DBPI Deep bogie part inspector
BPAS Bogie part assert score
IR Indian Railways
GPS Global Positioning System
KRATES Konkan Railways Automated Train Examination System
RGB Red-green-blue
TGV Train a Grande Vitesse
AC Active contour
DL Deep learning
GPU Graphics processing unit
VGG Visual Geometry Group
B-UNet Bogie U-shaped convolutional neural network
HOG Histogram of oriented gradients
BCE Binary cross-entropy
FL Focal loss
DL Dice loss
B-MHAC Bogie multi-head attention classifier
IoU Intersection over union
mAP Mean average precision
mFI Mean false identification
mNI Mean non-identification
Acknowledgements
We thank the Indian Railways staff at Guntur for their expertise and assistance throughout all aspects of our study and for their help in data collection. We also thank the management of KLEF (deemed to be university) for helping us in all possible ways to accomplish this work.
Authors’ contributions
The author PVVK has conceptualized, validated, drafted, and edited the manuscript. KK has developed the methodology
and the underlying code for the project. The manuscript was written by KK. Finally, video data collection, visualizations,
and supervision were conducted by ChRP. The authors read and approved the final manuscript.
Funding
No funding was received to assist with the preparation of this manuscript.
Declarations
Ethics approval and consent to participate
Not applicable
Competing interests
The authors declare that they have no competing interests.
References
1. Kishore PVV, Prasad CR (2017) Computer vision based train rolling stock examination. Optik 132:427–444
2. Kishore PVV, Prasad CR (2015) Train rolling stock segmentation with morphological differential gradient active
contours. In: 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI).
IEEE, pp 1174–1178
3. Sasikala N, Kishore PVV, Anil Kumar D, Prasad C (2019) Localized region based active contours with a weakly super-
vised shape image for inhomogeneous video segmentation of train bogie parts in building an automated train
rolling examination. Multimed Tools Appl 78(11):14917–14946
4. Sasikala N, Kishore PVV, Prasad CR, Kiran Kumar E, Anil Kumar D, Kumar MTK, Prasad MVD (2018) Unifying boundary,
region, shape into level sets for touching object segmentation in train rolling stock high speed video. IEEE Access
6:70368–70377
5. Mohan KK, Prasad CR, Kishore PVV (2021) Yolo v2 with bifold skip: a deep learning model for video based real time
train bogie part identification and defect detection. J Eng Sci Technol 16(3):2166–2190
6. Sasikala N, Kishore PVV (2020) Train bogie part recognition with multi-object multi-template matching adaptive
algorithm. J King Saud Univ Comput Inform Sci 32(5):608–617
7. Krishnamohan K, Prasad CR, Kishore PVV (2020) Successive texture and shape based active contours for train bogie
part segmentation in rolling stock videos. Int J Adv Comput Sci Appl 11(6):589-598.
8. Chan TF, Vese LA (2001) Active contours without edges. IEEE Transact Image Process 10(2):266–277
9. Lankton S, Tannenbaum A (2008) Localizing region-based active contours. IEEE Transact Image Process
17(11):2029–2039
10. Tian B, Li L, Yansheng Q, Yan L (2017) Video object detection for tractability with deep learning method. In: 2017
Fifth International Conference on Advanced Cloud and Big Data (CBD). IEEE, pp 397–401
11. Mandal M, Kumar LK, Saran MS (2020) MotionRec: a unified deep framework for moving object recognition. In:
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 2734–2743
12. Ding X, Luo Y, Li Q, Cheng Y, Cai G, Munnoch R, Xue D, Qingying Y, Zheng X, Wang B (2018) Prior knowledge-based
deep learning method for indoor object recognition and application. Syst Sci Control Eng 6(1):249–257
13. Bi F, Ma X, Chen W, Fang W, Chen H, Li J, Assefa B (2019) Review on video object tracking based on deep learning. J
New Media 1(2):63
14. Ran X, Chen H, Zhu X, Liu Z, Chen J (2018) Deepdecision: a mobile deep learning framework for edge video analyt-
ics. In: IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, pp 1421–1429
15. Liu P, Qi B, Banerjee S (2018) Edgeeye: an edge service framework for real-time intelligent video analytics. In: Pro-
ceedings of the 1st international workshop on edge systems, analytics and networking, pp 1–6
16. Olatunji IE, Cheng C-H (2019) Video analytics for visual surveillance and applications: an overview and survey. Mach
Learn Paradigms 1:475–515
17. Lee Y-H, Kim Y (2020) Comparison of CNN and YOLO for object detection. J Semiconduct Display Technol
19(1):85–92
18. Schabort EJ, Hawley JA, Hopkins WG, Blum H (1999) High reliability of performance of well-trained rowers on a row-
ing ergometer. J Sports Sci 17(8):627–632
19. Das NK, Das CK, Mozumder R, Bhowmik JC (2009) Satellite based train monitoring system. J Electr Eng 36(2):35–38
20. Cacchiani V, Caprara A, Galli L, Kroon L, Maróti G, Toth P (2012) Railway rolling stock planning: robustness against
large disruptions. Transp Sci 46(2):217–232
21. Liu H, Li J, Song X, Seneviratne LD, Althoefer K (2011) Rolling indentation probe for tissue abnormality identification
during minimally invasive surgery. IEEE Transact Robot 27(3):450–460
22. Ashwin T, Ashok S (2014) Automation of rolling stock examination. In: 2014 IEEE International Conference on
Advanced Communications, Control and Computing Technologies. IEEE, pp 260–263
23. Hart JM, Resendiz E, Freid B, Sawadisavi S, Barkan CPL, Ahuja N (2008) Machine vision using multi-spectral imaging
for undercarriage inspection of railroad equipment. In: Proceedings of the 8th world congress on railway research,
Seoul, Korea, vol. 18
24. Jarzebowicz L, Judek S (2014) 3D machine vision system for inspection of contact strips in railway vehicle current
collectors. In: 2014 International Conference on Applied Electronics. IEEE, pp 139–144
25. Kazanskiy NL, Popov SB (2015) Integrated design technology for computer vision systems in railway transportation.
Pattern Recognit Image Anal 25(2):215–219
26. Hwang J, Park H-Y, Kim W-Y (2010) Thickness measuring method by image processing for lining-type brake of rolling stock. In: 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content. IEEE, pp 284–286
27. Villar CM, Orrell SC II, Nagle JA (2009) System and method for inspecting railroad track. US Patent 7,616,329, issued November 10, 2009
28. Do NT, Gül M, Nafari SF (2020) Continuous evaluation of track modulus from a moving railcar using ANN-based
techniques. Vibration 3(2):149–161
29. Lu H, Wang J, Shi H, Zhang D (2018) On-track experiments on the ride comforts of an articulated railway vehicle. In:
Proceedings of the Asia-Pacific Conference on Intelligent Medical 2018 & International Conference on Transporta-
tion and Traffic Engineering 2018, pp 50–53
30. Meymand SZ, Keylin A, Ahmadian M (2016) A survey of wheel–rail contact models for rail vehicles. Veh Syst Dyn
54(3):386–428
31. Marques F, Magalhães H, Pombo J, Ambrósio J, Flores P (2020) A three-dimensional approach for contact detection
between realistic wheel and rail surfaces for improved railway dynamic analysis. Mech Mach Theory 149:103825
32. Shams S, Platania R, Lee K, Park S-J (2017) Evaluation of deep learning frameworks over different HPC architectures.
In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, pp 1389–1396
33. Sam SM, Kamardin K, Sjarif NNA, Mohamed N (2019) Offline signature verification using deep learning convolutional
neural network (CNN) architectures GoogLeNet inception-v1 and inception-v3. Proc Comput Sci 161:475–483
34. Qassim H, Verma A, Feinzimer D (2018) Compressed residual-VGG16 CNN model for big data places image rec-
ognition. In: 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, pp
169–175
35. Deshpande A, Estrela VV, Patavardhan P (2021) The DCT-CNN-ResNet50 architecture to classify brain tumors with
super-resolution, convolutional neural network, and the ResNet50. Neurosci Inform 1(4):100013
36. Cheng D, Meng G, Cheng G, Pan C (2016) SeNet: Structured edge network for sea–land segmentation. IEEE Geosci
Remote Sensing Lett 14(2):247–251
37. Liu C, Zoph B, Neumann M, Shlens J, Hua W, Li L-J, Fei-Fei L, Yuille A, Huang J, Murphy K (2018) Progressive neural
architecture search. In: Proceedings of the European conference on computer vision (ECCV), pp 19–34
38. Tian Y, Yang G, Wang Z, Wang H, Li E, Liang Z (2019) Apple detection during different growth stages in orchards
using the improved YOLO-V3 model. Comput Electron Agric 157:417–426
39. Susanto Y, Livingstone AG, Ng BC, Cambria E (2020) The hourglass model revisited. IEEE Intell Syst 35(5):96–102
40. Li X, Chen H, Qi X, Dou Q, Chi-Wing F, Heng P-A (2018) H-DenseUNet: hybrid densely connected UNet for liver and
tumor segmentation from CT volumes. IEEE Transact Med Imaging 37(12):2663–2674
41. Abdollahi A, Pradhan B, Alamri A (2020) VNet: an end-to-end fully convolutional neural network for road extraction
from high-resolution remote sensing data. IEEE Access 8:179424–179436
42. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image
segmentation. IEEE Transact Pattern Anal Mach Intell 39(12):2481–2495
43. Kan M, Shan S, Chang H, Chen X (2014) Stacked progressive auto-encoders (spae) for face recognition across poses.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1883–1890
44. Thomas E, Pawan SJ, Kumar S, Anmol Horo S, Niyas SV, Kesavadas C, Rajan J (2020) Multi-res-attention UNet: a CNN
model for the segmentation of focal cortical dysplasia lesions from magnetic resonance images. IEEE J Biomed
Health Inform 25(5):1724–1734
45. Das A (2022) Adaptive UNet-based lung segmentation and ensemble learning with CNN-based deep features for
automated COVID-19 diagnosis. Multimed Tools Appl 81(4):5407–5441
46. Sun Z, Huang S, Wei H-R, Dai X-y, Chen J (2020) Generating diverse translation by manipulating multi-head atten-
tion. Proc AAAI Conf Artif Intell 34(05):8976–8983
47. Dong J, Wang N, Fang H, Qunfang H, Zhang C, Ma B, Ma D, Haobang H (2022) Innovative method for pavement
multiple damages segmentation and measurement by the Road-Seg-CapsNet of feature fusion. Constr Build Mater
324:126719
48. Ma D, Fang H, Wang N, Zhang C, Dong J, Hu H. Automatic detection and counting system for pavement cracks based on PCGAN and YOLO-MF. In: IEEE Transactions on Intelligent Transportation Systems. https://doi.org/10.1109/TITS.2022.3161960
49. Xu Q, Liu Y, Li X, Yang Z, Wang J, Sbert M, Scopigno R (2014) Browsing and exploration of video sequences: a new
scheme for key frame extraction and 3D visualization using entropy based Jensen divergence. Inf Sci 278:736–756