LAMBDA Book Chapter 1
Valentina Janev
1 Introduction
In 2001, in an attempt to characterize and visualize the changes that are likely
to emerge in the future, Douglas Laney [269] of META Group (now Gartner)
proposed three dimensions that characterize the challenges and opportunities of
increasingly large data: Volume, Velocity, and Variety, known as the 3 Vs of big
data. Thus, according to Gartner:
“Big data” is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.
According to Manyika et al. [295] this definition is intentionally subjective
and incorporates a moving definition of how big a dataset needs to be in order to
be considered big data. Along these lines, big data to Amazon or Google (see Table
1) is quite different from big data to a medium-sized insurance or telecommuni-
cations organization. Hence, many different definitions have emerged over time
(see Chapter 3), but in general, it refers to “datasets whose size is beyond the
ability of typical database software tools to capture, store, manage, and analyze”
[295] and technologies that address “data management challenges” and process
and analyze data to uncover valuable information that can benefit businesses
and organizations. Additional “Vs” of data have been added over the years, but
Volume, Velocity and Variety are the three main dimensions that characterize the
data.
The volume dimension refers to the largeness of the data. The data size in
a big data ecosystem can range from dozens of terabytes to a few zettabytes
and is still growing [480]. In 2010, the McKinsey Global Institute estimated that
enterprises globally stored more than 7 exabytes of new data on disk drives,
while consumers stored more than 6 exabytes of new data on devices such as
PCs and notebooks. While more than 800,000 petabytes (1 PB = 10^15 bytes) of
data were stored in the year 2000, according to International Data Corporation
expectations [344] this volume will exceed 175 zettabytes (1 ZB = 10^21 bytes) by
2025 [84].
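As a quick, purely illustrative sanity check of these units, the figures above can be expressed in a common base (a minimal sketch; the numbers are the estimates quoted in the text):

    # Storage units (SI): 1 PB = 10^15 bytes, 1 ZB = 10^21 bytes.
    PB = 10**15
    ZB = 10**21

    stored_in_2000 = 800_000 * PB      # roughly 800,000 PB reported for the year 2000
    print(stored_in_2000 / ZB)         # 0.8  -> about 0.8 ZB
    print(175 * ZB / PB)               # 1.75e+08 PB expected globally by 2025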
The velocity dimension refers to the increasing speed at which big data is created
and at which it needs to be stored and analysed, while the variety dimension
refers to the increased diversity of data types.
Variety introduces additional complexity to data processing as more kinds
of data need to be processed, combined and stored. While the 3 Vs have been
continuously used to describe big data, the additional dimensions of veracity
and value have been added to describe data integrity and quality, in what is
called the 5 Vs of big data. More Vs have been introduced, including validity,
vulnerability, volatility, and visualization, summing up to the 10 Vs of big
data [137] (see Table 1). Regardless of how many descriptors are used to
describe big data, it is abundantly clear that big data is highly complex and,
as such, requires special technical solutions for every step in the data workflow.
In order to depict the information processing flow in just a few phases, we have
divided the processing workflow, from left to right in Figure 1, into three layers:
– Data sources;
– Data management (integration, storage and processing);
– Data analytics, Business intelligence (BI) and knowledge discovery (KD).
Such a partition allows the authors of this book to discuss big data topics
from different perspectives. For computer scientists and engineers, big data poses
problems of data storage and management, communication, and computation.
For data scientists and statisticians responsible for developing machine learning
models, the issue is how to extract usable information from datasets that are
too large and complex for many traditional or classical methods to handle. From
an organizational viewpoint, business analysts are expected to select and deploy
analytics services and solutions that contribute most to the organization's
strategic goals, for instance by taking into consideration a framework for measuring
organizational performance.
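To make the three layers more concrete, the following minimal Python sketch wires them together for a toy dataset; all class and function names are illustrative and do not refer to any particular framework:

    from dataclasses import dataclass
    from typing import Callable, Iterable, List

    @dataclass
    class Record:
        source: str
        payload: dict

    def acquire(sources: Iterable[Callable[[], List[dict]]]) -> List[Record]:
        # Data sources layer: pull raw records from each configured source.
        return [Record(source=s.__name__, payload=p) for s in sources for p in s()]

    def integrate(records: List[Record]) -> List[Record]:
        # Data management layer: harmonise field names before storage.
        for r in records:
            r.payload = {k.lower(): v for k, v in r.payload.items()}
        return records

    def analyse(records: List[Record]) -> dict:
        # Analytics / BI layer: derive a simple aggregate per source.
        counts: dict = {}
        for r in records:
            counts[r.source] = counts.get(r.source, 0) + 1
        return counts

    def sensor_feed() -> List[dict]:   # stand-in for a real data source
        return [{"Temperature": 21.5}, {"Temperature": 22.1}]

    print(analyse(integrate(acquire([sensor_feed]))))   # {'sensor_feed': 2}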
Data Sources. In a modern data ecosystem, the data sources layer is com-
posed of both private and public data sources – see the left side of Figure 2. The
corporate data originates from internal and cloud-based systems, and is complemented
by external data provided by partners and third parties. Within a modern data
architecture, any type of data can be acquired and stored; however, the most
challenging task is to capture the heterogeneous datasets from various service
providers. In order to allow developers to create new applications on top of open
datasets (see examples below), machine-readable formats are needed. As such,
XML and JSON have quickly become the de facto formats for web and mobile
applications due to their ease of integration into browser and server technologies
that support JavaScript. Once the data has been acquired, the interlinking of
diverse data sources is quite a complex and challenging process,
especially for the acquired unstructured data. That is the reason why semantic
technologies and Linked Data principles [50] have become popular over the last
decade [221]. Using Linked Data principles and a set of agreed vocabularies for
a domain, the input data is modeled in the form of resources, while the existing
relationships are modeled as a set of (named) relationships between resources.
In order to represent the knowledge of a specific domain, conceptual schemas
(also called ontologies) are applied. Automatic procedures are used to map the data
to the target ontology, while standard languages are used to represent the map-
pings (see Chapter 4). Furthermore, in order to unify the knowledge represen-
tation and data processing, standardized hierarchical and multilingual schemas,
called taxonomies, are used. Over the last decade, thousands of data repositories
have emerged on the web [47] that companies can use to improve their products
and/or processes. The public data sources (statistics, trends, conversations, images,
videos, audio, and podcasts, for instance from Google Trends, Twitter,
Instagram, and others [297]) provide real-time information and on-demand insights
that enable businesses to analyse user interactions, identify patterns and draw
conclusions. IoT devices have also created significant challenges in many indus-
tries and enabled the development of new business models. However, one of the
main challenges associated with these repositories is automatically understand-
ing the underlying structures and patterns of the data. Such an understanding
is a prerequisite to the application of advanced analytics to the retrieved data
[142]. Examples of Open Data Sources from different domains are:
– Facebook Graph API, curated by Facebook, is the primary way for apps to
read and write to the Facebook social graph. It is essentially a representation
of all information on Facebook now and in the past. For more information, see
https://developers.facebook.com/docs/graph-api.
– OpenStreetMap is a map of the world, created by people and free to use
under an open license. It powers map data on thousands of websites, mobile
apps, and hardware devices. For more information, see https://www.openstreetmap.org.
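Machine-readable, open formats such as JSON can be consumed with nothing more than the standard library. The following sketch is illustrative only: the endpoint URL and the "records" key are placeholders that would be replaced by the documented API of the concrete open data source in use (for example, one of the services listed above):

    import json
    import urllib.request

    # Placeholder endpoint; substitute the documented URL of the open API in use,
    # including any required API key or access token.
    ENDPOINT = "https://example.org/open-data/api/records.json"

    def fetch_records(url: str) -> list:
        # Download a JSON document and return the list of records it contains.
        with urllib.request.urlopen(url, timeout=10) as response:
            document = json.load(response)
        # Many open data APIs wrap the payload; adjust the key to the real schema.
        return document.get("records", [])

    if __name__ == "__main__":
        for record in fetch_records(ENDPOINT)[:5]:
            print(record)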
The wide availability of big data also means that there are many quality
issues that need to be dealt with before using such data. For instance, data
inherently contains a lot of noise and uncertainty, or is compromised by sensor
malfunctions or interference, which may result in missing or conflicting
data. Therefore, quality assessment approaches and methods applicable in open
big data ecosystems have been developed [477].
Furthermore, in order to ensure interoperability between different processes
and interconnected systems, the semantic representation of data sources and
processes was introduced, where a knowledge graph, on the one hand, meaningfully
describes the data pipeline and, on the other, is used to generate new knowledge
(see Chapter 4).
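As a minimal sketch of these ideas, the snippet below uses the rdflib library (one possible choice; any RDF toolkit would do) to model a few resources and named relationships against a tiny, purely illustrative vocabulary, mixing a small conceptual schema with instance data mapped onto it:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    # Purely illustrative vocabulary; in practice an agreed domain ontology is used.
    EX = Namespace("http://example.org/vocab#")

    g = Graph()
    g.bind("ex", EX)

    # Conceptual schema (ontology): two classes and one property.
    g.add((EX.Sensor, RDF.type, RDFS.Class))
    g.add((EX.Observation, RDF.type, RDFS.Class))
    g.add((EX.observedBy, RDFS.domain, EX.Observation))

    # Input data modelled as resources linked by named relationships.
    g.add((EX.sensor42, RDF.type, EX.Sensor))
    g.add((EX.obs1, RDF.type, EX.Observation))
    g.add((EX.obs1, EX.observedBy, EX.sensor42))
    g.add((EX.obs1, EX.hasValue, Literal(21.5)))

    print(g.serialize(format="turtle"))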
Interoperability remains a major burden for the developers of the big data ecosys-
tem. In its EU 2030 vision, the European Union has set out the creation of an
internal single market through a standardised system of laws that apply in all
member states and a single European data space [84] – a genuine single market
for data where businesses have easy access to an almost infinite amount of high-
quality industrial data. The vision is also supported by the EU Rolling Plan
for ICT Standardisation [85] that identifies 170 actions organised around five
priority domains — 5G, cloud, cybersecurity, big data and Internet of Things.
In order to enable broad data integration, data exchange and interoperability
with the overall goal of fostering innovation based on data, standardisation at
different levels (such as metadata schemata, data representation formats and
licensing conditions of open data) is needed. This refers to all types of (multi-
lingual) data, including both structured and unstructured data, and data from
different domains as diverse as geospatial data, statistical data, weather data,
public sector information (PSI) and research data, to name just a few.
In the domain of big data, five different actions have been requested that also
involve the following standardization organizations:
– OGC, the Open Geospatial Consortium defines and maintains standards for
location-based, spatio-temporal data and services. The work includes, for
instance, schemas allowing descriptions of spatio-temporal sensors, images,
simulations, and statistics data (such as “datacubes”), as well as a modular suite
of standards for Web services allowing ingestion, extraction, fusion, and (with
the Web Coverage Processing Service (WCPS) component standard) analytics
of massive spatio-temporal data such as satellite and climate archives. OGC also
contributes to the INSPIRE project;
– W3C, the W3C Semantic Web Activity Group has accepted numerous Web
technologies as standards or recommendations for building semantic applica-
tions, including RDF (Resource Description Framework) as a general-purpose
language; RDF Schema as a meta-language or vocabulary to define properties
and classes of RDF resources; SPARQL as a standard language for querying
RDF data; and OWL, the Web Ontology Language, for effective reasoning.
More about semantic standards can be found in [222].
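To illustrate how these W3C building blocks fit together, the short sketch below loads a few illustrative RDF triples (in Turtle) and runs a SPARQL query over them, again using rdflib as one possible toolkit; the data and vocabulary are made up for the example:

    from rdflib import Graph

    # Illustrative data only; real graphs follow an agreed domain vocabulary.
    TURTLE = """
    @prefix ex: <http://example.org/vocab#> .
    ex:obs1 ex:observedBy ex:sensor42 ; ex:hasValue 21.5 .
    ex:obs2 ex:observedBy ex:sensor42 ; ex:hasValue 22.1 .
    """

    g = Graph()
    g.parse(data=TURTLE, format="turtle")

    QUERY = """
    PREFIX ex: <http://example.org/vocab#>
    SELECT ?obs ?value
    WHERE { ?obs ex:observedBy ex:sensor42 ; ex:hasValue ?value . }
    """

    for row in g.query(QUERY):
        print(row.obs, row.value)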
Over the last 50 years, Data Analytics has emerged as an important area
of study for both practitioners and researchers. The Analytics 1.0 era began
in the 1950s and lasted roughly 50 years. As a software approach, this field
evolved significantly with the invention of Relational Databases in the 1970s by
Edgar F. Codd, the development of artificial intelligence as a separate scien-
tific discipline, and the invention of the Web by Sir Tim Berners-Lee in 1989.
With the development of Web 2.0-based social and crowd-sourcing systems in
the 2000s, the Analytics 2.0 era started. While the business solutions were tied
to relational and multi-dimensional database models in the Analytics 1.0 era,
the Analytics 2.0 era brought NoSQL and big data database models that
opened up new priorities and technical possibilities for analyzing large amounts
of semi-structured and unstructured data. Companies and data scientists refer
to these two periods in time as before big data (BBD) and after big data (ABD)
[99]. The main limitations observed during the first era were that the potential
capabilities of data were only utilised within organisations, i.e. the business in-
telligence activities addressed only what had happened in the past and offered
no predictions about future trends. The new generation of tools with fast-
processing engines and NoSQL stores made possible the integration of internal
data with externally sourced data coming from the internet, sensors of various
types, public data initiatives (such as the human genome project), and captures
of audio and video recordings. Also significantly developed in this period was the
Data Science field (a multidisciplinary field at the intersection of Mathematics &
Statistics, Computer Science, and Domain-Specific Knowledge), which delivered
scientific methods, exploratory processes, algorithms and tools that can be easily
leveraged to extract knowledge and insights from data in various forms.
The Analytics 3.0 era started [23] with the development of the “Internet
of Things” and cloud computing, which created possibilities for establishing hy-
brid technology environments for data storage, real-time analysis and intelligent
customer-oriented services. Analytics 3.0 is also named the Era of Impact or the
Era of Data-enriched offerings, owing to the endless opportunities for capitalizing
on analytics services. For creating value in the data economy, Davenport [99]
suggests that the following factors need to be properly addressed:
7.1 Challenges
The 3 Vs of big data call for the integration of complex data sources (includ-
ing complex types, complex structures, and complex patterns), as previously
tems, relational database system and dedicated encryption of data trans-
mission. Toll line controllers are based on industrial PC-technology and dedi-
cated electronic interface boards. The toll plaza subsystem is the supervisory
system for all line controllers. It collects all the data from lane controllers
including financial transactions, digital images of vehicles, technical malfunc-
tions, line operators’ actions and failures. All data concerning toll collection
processes and equipment status are permanently collected from the plaza
computers and stored in a central system database. The toll collection sys-
tem also comprises features concerning vehicle detection and classification,
license plate recognition and microwave-based dedicated short-range com-
munications.
– The Main Control Centre is connected through an optical communication
link with the Plaza Control Centres. Also, the Control Centre is constantly
exchanging data with various institutions, such as banks, insurance companies,
institutions that handle credit and debit cards, RF tag vendors, etc.,
through a computer network. Data analytics is based on a data warehouse
architecture enabling optimal performance in near real time for statistical
and historical analysis of large data volumes. Reporting is based on optimized
data structures, allowing both predefined (standardized) and ad hoc (dynamic)
reports, which are generated efficiently using the Oracle BI platform. Data
analytics includes scenarios such as statistical and historical analyses of the
collected toll data.
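As a toy illustration of such a predefined (standardized) report, the sketch below aggregates a handful of made-up toll transactions with pandas; the column names and figures are invented for the example and do not describe the real system, which keeps these data in the central warehouse and reports through the Oracle BI platform:

    import pandas as pd

    # Made-up sample of toll transactions (illustrative columns and values).
    transactions = pd.DataFrame({
        "plaza":         ["P1", "P1", "P2", "P2", "P2"],
        "vehicle_class": ["car", "truck", "car", "car", "truck"],
        "amount":        [3.5, 12.0, 3.5, 3.5, 12.0],
    })

    # Predefined report: transaction count and revenue per plaza and vehicle class.
    report = (transactions
              .groupby(["plaza", "vehicle_class"])["amount"]
              .agg(transaction_count="count", revenue="sum")
              .reset_index())
    print(report)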
Here, we have pointed to just one mode of transport and traffic management,
i.e. the control of highways and motorways. However, nowadays, an increasing
number of cities around the world struggle with traffic congestion and with
optimizing public transport, planning parking spaces, and planning cycling routes.
These issues call for new approaches to studying human mobility, by exploiting
machine learning techniques [404] and forecasting models, or through the application of
complex event processing tools [134].
8 Conclusions
This chapter presents the author's vision of a big data ecosystem. It serves as an
introductory chapter to point to a number of aspects that are relevant for this
book. Over the last two decades, advances in hardware and software technolo-
gies, such as the Internet of Things, mobile technologies, data storage and cloud
computing, and parallel machine learning algorithms have resulted in the ability
to easily acquire, analyze and store large amounts of data from different kinds
of quantitative and qualitative domain-specific data sources. The monitored and
collected data presents opportunities and challenges that, beyond the three main
characteristics of volume, variety, and velocity, require research into other
characteristics such as validity, value and vulnerability. In order to automate
and speed up the processing, an interoperable data infrastructure and the
standardization of data-related technologies are needed, including the development
of metadata standards for big data management. One approach to achieve interoperability
among datasets and services is to adopt data vocabularies and standards as de-
fined in the W3C Data on the Web Best Practices, which are also applied in the
tools presented in this book (see Chapters 4 to 9).
In order to elaborate on the challenges and point to the potential of big data,
a case study from the traffic sector is presented and discussed in this chapter,
while more big data case studies are set out in Chapter 9 and Chapter 10.