MODULE 4
Descriptive: Descriptive data analysis tells you what is happening, either now or in
the past. For example, a thermometer in a truck engine reports temperature values
every second. From a descriptive analysis perspective, you can pull this data at any
moment to gain insight into the current operating condition of the truck engine. If the
temperature value is too high, then there may be a cooling problem or the engine may
be experiencing too much load.
Diagnostic: When you are interested in the "why," diagnostic data analysis can
provide the answer. Continuing with the example of the temperature sensor in the
truck engine, you might wonder why the truck engine failed. Diagnostic analysis
might show that the temperature of the engine was too high, and the engine
overheated. Applying diagnostic analysis across the data generated by a wide range of
smart objects can provide a clear picture of why a problem or an event occurred.
Predictive: Predictive analysis aims to foretell problems or issues before they occur.
For example, with historical values of temperatures for the truck engine, predictive
analysis could provide an estimate on the remaining life of certain components in the
engine. These components could then be proactively replaced before failure occurs.
Or perhaps if temperature values of the truck engine start to rise slowly over time, this
could indicate the need for an oil change or some other sort of engine cooling
maintenance.
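As a minimal illustration of the predictive idea (not from the source; all readings and the 95 C threshold are assumed), the Python sketch below fits a linear trend to hypothetical hourly engine-temperature values and estimates when the trend would reach the alarm threshold:

# Hypothetical illustration of trend-based prediction (values are made up).
temps = [88.0, 88.4, 88.9, 89.5, 90.2, 90.8]   # hourly engine temperature, deg C
hours = list(range(len(temps)))

n = len(temps)
mean_x = sum(hours) / n
mean_y = sum(temps) / n
# Least-squares slope and intercept of temperature vs. time.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, temps)) / \
        sum((x - mean_x) ** 2 for x in hours)
intercept = mean_y - slope * mean_x

ALARM = 95.0                      # assumed maintenance threshold, deg C
hours_to_alarm = (ALARM - intercept) / slope
print(f"Temperature rising {slope:.2f} C/hour; "
      f"threshold of {ALARM} C reached in about {hours_to_alarm - hours[-1]:.1f} hours")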
Prescriptive: Prescriptive analysis goes a step beyond predictive and recommends
solutions for upcoming problems. A prescriptive analysis of the temperature data from
a truck engine might calculate various alternatives to cost-effectively maintain our
truck. These calculations could range from the cost necessary for more frequent oil
changes and cooling maintenance to installing new cooling equipment on the engine
or upgrading to a lease on a model with a more powerful engine. Prescriptive analysis
looks at a variety of factors and makes the appropriate recommendation.
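Continuing the illustration, a prescriptive step might weigh each maintenance alternative by its cost and an assumed residual failure probability, then recommend the cheapest expected outcome. The options, costs, and probabilities below are invented for the sketch:

# Hypothetical prescriptive comparison: each option has an upfront annual cost and
# an assumed residual probability of engine failure (all numbers are illustrative).
FAILURE_COST = 20000.0
options = {
    "more frequent oil changes":      (1200.0, 0.12),
    "additional cooling maintenance": (1800.0, 0.07),
    "install new cooling equipment":  (4500.0, 0.02),
    "lease truck with larger engine": (9000.0, 0.01),
}

def expected_cost(item):
    cost, p_fail = options[item]
    return cost + p_fail * FAILURE_COST

recommendation = min(options, key=expected_cost)
print("Recommended:", recommendation, "expected cost:", expected_cost(recommendation))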
Scaling problems: Due to the large number of smart objects in most IoT networks
that continually send data, relational databases can grow incredibly large very
quickly. This can result in performance issues that can be costly to resolve, often
requiring more hardware and architecture changes.
Volatility of data: With relational databases, it is critical that the schema be designed
correctly from the beginning. Changing it later can slow or stop the database from
operating. Due to the lack of flexibility, revisions to the schema must be kept at a
minimum. IoT data, however, is volatile in the sense that the data model is likely to
change and evolve over time. A dynamic schema is often required so that data model
changes can be made daily or even hourly.
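As a small illustration of why a dynamic schema helps (this sketch and its field names are not from the source), readings can be stored as schema-less JSON documents: a field added later by a firmware update is simply carried along, whereas a fixed relational schema would first require a migration.

import json

# Schema-less "documents": the second reading adds a field (vibration_g) that the
# first firmware version did not send; no schema migration is required.
readings = [
    {"device": "truck-042", "ts": "2024-01-01T10:00:00Z", "temp_c": 88.4},
    {"device": "truck-042", "ts": "2024-02-01T10:00:00Z", "temp_c": 90.1, "vibration_g": 0.7},
]
for doc in readings:
    print(json.dumps(doc))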
Live streaming nature of IoT data: Further challenges come from the live streaming
nature of IoT data and from managing that data at the network level. Streaming data is
valuable only if it is possible to analyze and respond to it in real time. Real-time
analysis of streaming data allows you to detect patterns or anomalies that could
indicate a problem or a situation that needs some kind of immediate response.
Network Data or Network Analytics: With the large numbers of smart objects in
IoT networks that are communicating and streaming data, it can be challenging to
ensure that these data flows are effectively managed, monitored, and secure. Network
analytics tools provide the capability to detect irregular patterns or other problems in
the flow of IoT data through a network.
Variety: Variety refers to different types of data. Often you see data categorized as
structured, semi-structured, or unstructured. Different database technologies may only
be capable of accepting one of these types. Hadoop is able to collect and store all
three types. This can be beneficial when combining machine data from IoT devices
that is very structured in nature with data from other sources, such as social media or
multimedia that is unstructured.
Volume: Volume refers to the scale of the data. Typically, this is measured from
gigabytes on the very low end to petabytes or even exabytes of data on the other
extreme. Generally, big data implementations scale beyond what is available on
locally attached storage disks on a single node. It is common to see clusters of servers
that consist of dozens, hundreds, or even thousands of nodes for some large
deployments.
Answer:
Apache Kafka is a distributed publisher-subscriber messaging system that is built to
be scalable and fast.
It is composed of topics, or message feeds, to which producers write data and from
which consumers read data.
Due to the distributed nature of Kafka, it can run in a clustered configuration that can
handle many producers and consumers simultaneously and exchanges information
between nodes, allowing topics to be distributed over multiple nodes.
The goal of Kafka is to provide a simple way to connect to data sources and allow
consumers to connect to that data in the way they would like.
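A minimal publish/subscribe sketch using the kafka-python client is shown below; the broker address localhost:9092 and the topic name engine-temp are assumptions for illustration only.

# Minimal Kafka publish/subscribe sketch (assumes a broker at localhost:9092 and
# the kafka-python package; the topic name "engine-temp" is hypothetical).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("engine-temp", {"device": "truck-042", "temp_c": 91.3})
producer.flush()

consumer = KafkaConsumer(
    "engine-temp",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)          # e.g. {'device': 'truck-042', 'temp_c': 91.3}
    break                         # stop after one message for this sketch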
The "in-memory" characteristic of Spark is what enables it to run jobs very quickly.
At each stage of a MapReduce operation, the data is read from and written back to
disk, which means latency is introduced through each disk operation.
Real-time processing is done by a component of the Apache Spark project called
Spark Streaming. Spark Streaming is an extension of Spark Core that is responsible
for taking live streamed data from a messaging system, like Kafka, and dividing it
into smaller micro batches. These micro batches are called discretized streams (DStreams).
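A minimal Spark Streaming sketch of 5-second micro batches follows. To stay self-contained it reads from a plain TCP socket source rather than Kafka (which would require the separate Spark-Kafka integration); the host, port, and one-temperature-per-line format are assumptions.

# Minimal Spark Streaming sketch: 5-second micro batches (DStreams) from a TCP
# socket source, printing the mean temperature of each batch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="EngineTempStream")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro batches

lines = ssc.socketTextStream("localhost", 9999)    # assumed host/port
temps = lines.map(float)                           # one temperature value per line
temps.foreachRDD(lambda rdd: print("batch mean:", rdd.mean()) if not rdd.isEmpty() else None)

ssc.start()
ssc.awaitTermination()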
Apache Storm and Apache Flink are other Hadoop ecosystem projects designed for
distributed stream processing and are commonly deployed for IoT use cases. Storm
can pull data from Kafka and process it in a near-real-time fashion, and so can Apache
Flink.
Lambda Architecture
1. Lambda is a data management system that consists of two layers for ingesting data
(Batch and Stream) and one layer for providing the combined data (Serving).
2. These layers allow packages such as Spark and MapReduce to operate on the
data independently, focusing on the key attributes for which they are designed and
optimized.
3. Data is taken from a message broker, commonly Kafka, and processed by each layer
in parallel, and the resulting data is delivered to a data store where additional
processing or queries can be run.
4. Layers of Lambda Architecture are:
Batch layer: The Batch layer consists of a batch-processing engine and data
store. If an organization is using other parts of the Hadoop ecosystem for the
other layers, MapReduce and HDFS can easily fit.
Stream (speed) layer: The Stream layer processes the incoming data in real
time, typically with an engine such as Spark Streaming or Storm, so that
low-latency views of the most recent data are available while the Batch layer
catches up.
Serving layer: The Serving layer is a data store and mediator that decides
which of the ingest layers to query based on the expected result or view into the
data. The Serving layer is often used by the data consumers to access both
stream and batch layers simultaneously.
5. The Lambda Architecture can provide a robust system for collecting and processing
massive amounts of data and the flexibility of being able to analyze that data at
different rates.
6. One limitation of this type of architecture is its place in the network. Due to the
processing and storage requirements of many of these pieces, the vast majority of
these deployments are either in data centres or in the cloud. This could limit the
effectiveness of the analytics to respond rapidly enough if the processing systems are
milliseconds or seconds away from the device generating the data.
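The division of labour between the layers can be sketched as follows; the dictionaries stand in for the batch and real-time views that MapReduce and a streaming engine would normally maintain, and the device ID is hypothetical:

# Conceptual sketch of the Serving layer combining a (slow, complete) batch view
# with a (fast, recent) real-time view. Both views are plain dicts here; in a real
# deployment they would live in data stores fed by MapReduce and Spark Streaming.
batch_view    = {"truck-042": 88.7}   # average temp computed by the Batch layer
realtime_view = {"truck-042": 91.3}   # latest value from the Stream/speed layer

def serve(device_id, prefer_recent=True):
    """Answer a query from whichever layer is appropriate for the requested view."""
    if prefer_recent and device_id in realtime_view:
        return realtime_view[device_id]
    return batch_view.get(device_id)

print(serve("truck-042"))                       # 91.3  (low-latency view)
print(serve("truck-042", prefer_recent=False))  # 88.7  (batch view)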
Raw input data: This is the raw data coming from the sensors into the analytics
processing unit.
Analytics processing unit (APU): The APU filters and combines data streams,
organizes them by time windows, and performs various analytical functions. It is at
this point that the results may be acted on by micro services running in the APU.
Output streams: The data that is output is organized into insightful streams and is
used to influence the behaviour of smart objects, and passed on for storage and further
processing in the cloud. Communication with the cloud often happens through a
standard publisher/subscriber messaging protocol, such as MQTT.
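A minimal sketch of this last step, an APU publishing one result record to a broker over MQTT with the paho-mqtt client, is shown below; the broker address, topic name, and payload fields are assumptions.

# Minimal sketch of an APU publishing a result stream to the cloud over MQTT.
# Broker address, topic, and payload fields are hypothetical.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()                         # paho-mqtt 1.x style constructor;
                                               # 2.x also takes a CallbackAPIVersion
client.connect("broker.example.com", 1883)     # hypothetical cloud/fog broker
result = {"device": "truck-042", "window": "60s", "avg_temp_c": 90.4}
client.publish("apu/results/engine-temp", json.dumps(result), qos=1)
client.disconnect()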
a. Filter: The streaming data generated by IoT endpoints is likely to be very large, and
most of it is irrelevant. The filtering function identifies the information that is
considered important.
b. Transform: In the data warehousing world, Extract, Transform, and Load (ETL)
operations are used to manipulate the data structure into a form that can be used for
other purposes. Analogous to data warehouse ETL operations, in streaming analytics,
once the data is filtered, it needs to be formatted for processing.
c. Time: As the real-time streaming data flows, a timing context needs to be established.
This could be to correlate average temperature readings from sensors on a minute-
by-minute basis. The APU is programmed to report the average temperature every
minute from the sensors, based on an average of the past two minutes.
d. Correlate: Streaming data analytics becomes most useful when multiple data streams
are combined from different types of sensors. Different types of data come from
different instruments, but when this data is combined and analyzed, it provides an
invaluable picture of the situation. Another key aspect is combining and correlating
real-time measurements with pre-existing, or historical, data.
e. Match patterns: Once the data streams are properly cleaned, transformed, and
correlated with other live streams as well as historical data sets, pattern matching
operations are used to gain deeper insights into the data. The patterns can be simple
relationships, or they may be complex, based on the criteria defined by the
application. Machine learning may be leveraged to identify these patterns.
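The chain of functions above can be sketched compactly on an in-memory stream of hypothetical readings (a real APU would consume these from a live feed such as Kafka or MQTT; all values and the 90 C rule are invented):

from collections import deque
from statistics import mean

# Hypothetical raw stream: (seconds, sensor, value); only "temp" readings matter here.
raw = [(0, "temp", 88.0), (10, "rpm", 2100), (20, "temp", 89.5),
       (30, "temp", 91.0), (40, "rpm", 2200), (50, "temp", 93.5), (60, "temp", 96.0)]

window = deque()                 # (t, value) pairs within the last 120 seconds
WINDOW_S, ALARM = 120, 90.0

for t, sensor, value in raw:
    if sensor != "temp":         # a. Filter: drop irrelevant readings
        continue
    value_c = float(value)       # b. Transform: normalise to float degrees C
    window.append((t, value_c))  # c. Time: keep a 2-minute sliding window
    while window and t - window[0][0] > WINDOW_S:
        window.popleft()
    avg = mean(v for _, v in window)   # d. Correlate readings within the window
    if avg > ALARM:              # e. Match a simple pattern: sustained high average
        print(f"t={t}s: average {avg:.1f} C exceeds {ALARM} C")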
Streaming analytics may be performed directly at the edge, in the fog, or in the cloud
data centre. There are no hard-and-fast rules dictating where analytics should be done,
but there are a few guiding principles.
Fog analytics allows you to see beyond a single device, giving you visibility into an
aggregation of edge nodes and allowing you to correlate data from a wider set.
While there may be some value in performing analytics directly on the edge device,
in this design the sensors communicate via MQTT through a message broker to the
fog analytics node, which therefore works with a broader data set.
Consider the example of an oil drilling company that is measuring both pressure and
temperature on an oil rig. The fog node is located on the same oil rig and performs
streaming analytics from several edge devices, giving it better insights due to the
expanded data set. It may not be able to respond to an event as quickly as analytics
performed directly on the edge device, but it is still close to responding in real-time as
events occur. Once the fog node is finished with the data, it communicates the results
to the cloud through a message broker via MQTT for deeper historical analysis
through big data analytics tools.
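A sketch of this fog pattern is shown below: the fog node subscribes to two edge sensor topics, correlates the latest readings, and forwards a summary to a cloud broker. The broker addresses, topic names, and the paho-mqtt 1.x constructor are assumptions.

# Fog node sketch: subscribe to edge sensor topics, aggregate, forward to the cloud.
import json
import paho.mqtt.client as mqtt

latest = {}                                    # last value seen per sensor topic

def on_message(client, userdata, msg):
    latest[msg.topic] = json.loads(msg.payload)
    if "rig/pressure" in latest and "rig/temperature" in latest:
        summary = {"pressure": latest["rig/pressure"],
                   "temperature": latest["rig/temperature"]}
        cloud.publish("cloud/rig-summary", json.dumps(summary))

edge = mqtt.Client()                           # broker on the rig (hypothetical)
edge.on_message = on_message
edge.connect("fog-broker.local", 1883)
edge.subscribe([("rig/pressure", 0), ("rig/temperature", 0)])

cloud = mqtt.Client()                          # upstream cloud broker (hypothetical)
cloud.connect("cloud-broker.example.com", 1883)

edge.loop_forever()                            # process edge messages indefinitely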
The figure shows field area network (FAN) traffic analytics performed on the
aggregation router in a smart grid.
Network analytics has the power to analyze details of communications patterns made
by protocols and correlate this across the network. It allows you to understand what
should be considered normal behavior in a network and to quickly identify anomalies
that suggest network problems due to suboptimal paths, intrusive malware, or
excessive congestion.
Network analytics offers capabilities for capacity planning in scalable IoT
deployments, as well as security monitoring to detect abnormal traffic volumes and
patterns (such as an unusual traffic spike for a normally quiet protocol), in both
centralized and distributed architectures, such as fog computing.
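As a toy illustration (not from the source) of flagging an unusual traffic spike for a normally quiet protocol, a simple baseline rule might look like this, with all counts invented:

from statistics import mean, pstdev

# Hypothetical per-hour byte counts for a normally quiet protocol (e.g. DNP3).
history = [1200, 1100, 1250, 1180, 1220, 1210, 1150, 1190]
current = 9800                              # latest observed hourly volume

baseline, spread = mean(history), pstdev(history)
if current > baseline + 3 * spread:         # crude "3-sigma" spike rule
    print(f"Anomaly: {current} bytes/hour vs. baseline {baseline:.0f} +/- {spread:.0f}")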
The benefits of flow analytics, in addition to other network management services, are as
follows:
Network traffic monitoring and profiling: Flow collection from the network layer provides
global and distributed near-real-time monitoring capabilities. IPv4 and IPv6 network wide
traffic volume and pattern analysis helps administrators proactively detect problems and
quickly troubleshoot and resolve problems when they occur.
Application traffic monitoring and profiling: Monitoring and profiling can be used to gain
a detailed time-based view of IoT access services, such as the application layer protocols,
including MQTT, CoAP, and DNP3, as well as the associated applications that are being used
over the network.
Capacity planning: Flow analytics can be used to track and anticipate IoT traffic growth and
help in the planning of upgrades when deploying new locations or services by analyzing
captured data over a long period of time. This analysis affords the opportunity to track and
anticipate IoT network growth on a continual basis.
Security analysis: Because most IoT devices typically generate a low volume of traffic and
always send their data to the same server(s), any change in network traffic behaviour may
indicate a cyber security event, such as a denial of service (DoS) attack. Security can be
enforced by ensuring that no traffic is sent outside the scope of the IoT domain.
Accounting: In field area networks, routers or gateways are often physically isolated and
leverage public cellular services and VPNs for backhaul. Deployments may have thousands
of gateways connecting the last-mile IoT infrastructure over a cellular network. Flow
monitoring can thus be leveraged to analyze and optimize the billing, in complement with
other dedicated applications, such as Cisco Jasper, with a broader scope than just monitoring
data flow.
Data warehousing and data mining: Flow data can be warehoused for later retrieval and
analysis in support of proactive analysis of multiservice IoT infrastructures and applications.
Answer:
FNF (Flexible NetFlow) is a flow technology developed by Cisco Systems that is widely
deployed all over the world. Key advantages of FNF are its flexibility and its support for
customized, user-defined flow records, with multiple exporters per Flow Monitor.
FNF Components:
FNF Flow Monitor (NetFlow cache): The FNF Flow Monitor describes the NetFlow
cache, or the information stored in the cache. The Flow Monitor contains the flow
record definitions with key fields (used to create a flow, unique per flow record:
the match statement) and non-key fields (collected with the flow as attributes or
characteristics of a flow) within the cache.
FNF flow record: A flow record is a set of key and non-key NetFlow field values
used to characterize flows in the NetFlow cache. Flow records may be predefined for
ease of use or customized and user defined. A typical predefined record aggregates
flow data and allows users to target common applications for NetFlow. User-defined
records allow selection of specific key or non-key fields in the flow record.
FNF Exporter: There are two primary methods for accessing NetFlow data: using
the show commands at the command-line interface (CLI), and using an application
reporting tool that receives the NetFlow export sent to a NetFlow reporting collector.
The Flexible NetFlow Exporter allows the user to define where the export is sent, the
type of transport for the export, and properties for the export. Multiple exporters can
be configured per Flow Monitor.
Flow export timers: Timers indicate how often flows should be exported to the
collection and reporting server.
NetFlow export format: This simply indicates the type of flow reporting format.
NetFlow server for collection and reporting: This is the destination of the flow
export. It is often an analytics tool that looks for anomalies in the traffic
patterns.
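The key-field/non-key-field distinction can be illustrated with a tiny in-memory flow cache; this is a conceptual Python sketch, not Cisco's implementation, and the addresses and ports are invented:

# Conceptual model of a NetFlow cache. Key fields (the 5-tuple) identify a flow;
# non-key fields (packet and byte counters) are collected per flow.
flow_cache = {}

def record_packet(src, dst, sport, dport, proto, nbytes):
    key = (src, dst, sport, dport, proto)          # key fields: one cache entry per flow
    entry = flow_cache.setdefault(key, {"packets": 0, "bytes": 0})
    entry["packets"] += 1                          # non-key fields, updated per packet
    entry["bytes"] += nbytes

record_packet("10.0.0.5", "10.0.0.9", 40001, 1883, "tcp", 120)
record_packet("10.0.0.5", "10.0.0.9", 40001, 1883, "tcp", 300)
for key, counters in flow_cache.items():
    print(key, counters)                           # one flow, 2 packets, 420 bytes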
Challenges with deploying flow analytics tools in an IoT network include the following:
a. The distributed nature of fog and edge computing may mean that traffic flows are
processed in places that might not support flow analytics, and visibility is thus lost.
b. Flow analytics on native IPv4 and IPv6 interfaces sometimes needs to inspect traffic
inside VPN tunnels, which may impact the router's performance.
Pervasive Legacy Systems: Due to the static nature and long lifecycles of equipment
in industrial environments, many operational systems may be deemed legacy systems.
Legacy components are not restricted to isolated network segments but have now
been consolidated into the IT operational environment. From a security perspective,
this is potentially dangerous as
many devices may have historical vulnerabilities or weaknesses that have not been
patched and updated.
Insecure Operational Protocols: Many industrial control protocols were designed without security
requirements. Their operation was often within an assumed secure network. Common
industrial protocols and their respective security concerns are as follows:
The Logical Framework Based on the Purdue Model for Control Hierarchy
Enterprise zone:
o Level 5: Enterprise network: Corporate-level applications such as Enterprise
Resource Planning (ERP), Customer Relationship Management (CRM),
document management, and services such as Internet access and VPN entry
from the outside world exist at this level.
o Level 4: Business planning and logistics network: The IT services exist at
this level and may include scheduling systems, material flow applications,
optimization and planning systems, and local IT services such as phone, email,
printing, and security monitoring.
Operational zone:
o Level 3: Operations and control: This level includes the functions involved
in managing the workflows to produce the desired end products and for
monitoring and controlling the entire operational system.
Safety zone:
o Safety-critical: This level includes devices, sensors, and other equipment
used to manage the safety functions of the control system.
OCTAVE
The first step of the OCTAVE Allegro methodology is to establish risk
measurement criteria. The point of having risk measurement criteria is that at any
point in the later stages, prioritization can take place against the reference model.
The second step is to develop an information asset profile. This profile is populated
with assets, a prioritization of assets, attributes associated with each asset, including
owners, custodians, people, explicit security requirements, and technology assets.
The third step is to identify information asset containers. Roughly speaking, this is the
range of transports and possible locations where the information might reside. The
emphasis is on the container level rather than the asset level. The value is to reduce
potential inhibitors within the container for information operation.
The fourth step is to identify areas of concern. Judgments are made through a
mapping of security-related attributes to more business-focused use cases. The analyst
looks to risk profiles and delves into the previously mentioned risk analysis.
The fifth step is where threat scenarios are identified. Threats are broadly (and
properly) identified as potential undesirable events. This definition means that results
from both malevolent and accidental causes are viable threats.
At the sixth step, risks are identified. Within OCTAVE, risk is the possibility of an
undesired outcome. This is extended to focus on how the organization is impacted.
The seventh step is risk analysis, with the effort placed on qualitative evaluation of
the impacts of the risk. Here the risk measurement criteria defined in the first step are
explicitly brought into the process.
Mitigation is applied at the eighth step. There are three outputs or decisions to be
taken at this stage. One may be to accept a risk and do nothing, other than document
the situation, potential outcomes, and reasons for accepting the risk. The second is to
mitigate the risk with whatever control effort is required. The final possible action is
to defer a decision, meaning risk is neither accepted nor mitigated.
FAIR:
FAIR (Factor Analysis of Information Risk) is a technical standard for risk definition
from The Open Group. FAIR has clear applications within operational technology.
FAIR places emphasis on both unambiguous definitions and the idea that risk and
associated attributes are measurable. Measurable, quantifiable metrics are a key area
of emphasis, which should lend itself well to an operational world with a richness of
operational data.
FAIR has a definition of risk as the probable frequency and probable magnitude of
loss. A clear hierarchy of sub-elements emerges, with one side of the taxonomy
focused on frequency and the other on magnitude.
Loss event frequency is the result of a threat agent acting on an asset with a resulting
loss to the organization. This happens with a given frequency, called the threat event
frequency (TEF), which over a specified time window becomes a probability.
Vulnerability here is not necessarily some compute asset weakness, but is more
broadly defined as the probability that the targeted asset will fail as a result of the
actions applied.
The other side of the risk taxonomy is the probable loss magnitude (PLM), which
begins to quantify the impacts, with the emphasis again being on measurable metrics.
FAIR defines six forms of loss, four of them externally focused and two internally
focused. Of particular value for operational teams are productivity and replacement
loss. Response loss is also reasonably measured, with fines and judgments easy to
measure but difficult to predict.
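A simplified numeric illustration of the FAIR decomposition is shown below; the values are hypothetical point estimates, whereas real FAIR analyses work with calibrated ranges:

# Risk ~ loss event frequency x probable loss magnitude (simplified, point values).
tef = 4.0            # threat event frequency: threat agent acts ~4 times/year
vulnerability = 0.25 # probability the asset fails when acted upon
lef = tef * vulnerability              # expected loss events per year

plm = 50_000.0       # probable loss magnitude per event (productivity + replacement, $)
annualized_risk = lef * plm
print(f"LEF = {lef} events/year, annualized risk = ${annualized_risk:,.0f}")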
Exercise Questions
1. Explain i) NoSQL databases, ii) Hadoop, iii) YARN
2. Explain i) Supervised Learning, ii) Unsupervised Learning, iii) Neural Network
3. The Phased Application of Security in an Operational Environment.