Big Data Computing
Day 3
Big data modelling and management
In this session…
• Data management
• Data modelling
• Components of a data model:
– Data structure
– Data operations
– Constraints
• Kinds of data models:
– Relational data model
– Semi‐structured data model
– Vector space model
– Graph data model
– Array as a data model
• Data Model vs. Data Format
Data Management
What is Data Management?
“Data management is the development and execution of architectures,
policies, practices and procedures in order to manage the information
lifecycle needs of an enterprise in an effective manner.”[*]
• Big data management refers to the organisation, administration and governance of large
volumes of both structured and unstructured data.
• Big data management strategies are used to contend with fast‐growing pools of data, which
typically involve a variety of data types.
[*] http://searchdatamanagement.techtarget.com/definition/data‐management
What is Data Management?
• Let's think of some questions that must be asked and answered well if we are to manage a
reasonable amount of data:
How do we ingest the data?
Where and how do we store it?
How can we ensure data quality?
What operations do we perform on the data?
How can these operations be efficient?
How to scale up data volume, variety, velocity and access?
How to keep the data secure?
Data ingestion
• Ingestion means the process of getting the data into the data system that we are building or
using.
• Simple ways of doing this include:
  1. Reading the data from a file
  2. A web form collecting data
• These ways of data ingestion are valid for some big data systems.
Data ingestion
• However, when you think of a large-scale system, you would like to have more automation in the data ingestion processes.
• Data ingestion then becomes part of the big data management infrastructure.
• Things to consider when automating data ingestion include:
– How many data sources?
– How large are data items?
– Will the number of data sources grow?
– What is the rate of the data ingestion?
– What to do with bad data?
– What to do when data is too little or too much?
Data ingestion – example: hospital information system
• Let's think of a hypothetical hospital information system and the answers to the questions:
  – How many data sources? ~20 (e.g. medical records, treatment, business office, logs, pharmacy, specialty databases, pathology, laboratory, admissions, outpatient clinics, oncology services, diagnostic imaging, radiation oncology)
  – How large are data items? Avg. record size: 5KB, avg. image size: 2GB, #records: 50 million
  – Will the number of data sources grow? Not much
  – What is the rate of the data ingestion? Not huge (often proportional to the number of patient activities that take place at the hospital), ~3K per hour
  – What to do with bad data? Ingestion policy: warn, flag and ingest (these data can never be discarded, even if they contain errors)
  – What to do when data is too little or too much? Not likely (not included in this example)
Data ingestion – example: cloud database of personal info
• Hypothetical cloud database of personal information:
  – How many data sources? 2 million
  – How large are data items? Avg. record size: 3KB, avg. image size: 2MB, #records: 200 billion
  – Will the number of data sources grow? Now 25M, growing at 15% per year
  – What is the rate of the data ingestion? Very fast, ~200K/h at peak time
  – What to do with bad data? Retry once, then discard (the primary challenge is to keep up with the data rate, therefore erroneous data will be discarded after the first retry)
  – What to do when data is too little or too much? There is an actual policy for handling data overflow: keep the excess data in a site store, and ingest it when the data rate becomes slower. But if the site store starts getting full, start dropping some incoming data at a rate of 0.1% at a time.
This is why data ingestion, together with its policies, should be an integral part of a big data system, especially when it involves storing fast data. A minimal sketch of such a policy follows.
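The bad-data and overflow rules above can be made concrete in code. The following is a minimal, illustrative Python sketch (the function names, the queue-like site store and the capacity value are assumptions, not part of the slides):

    import random

    SITE_STORE_CAPACITY = 1_000_000   # hypothetical capacity of the temporary site store
    DROP_RATE = 0.001                 # drop ~0.1% of incoming records once the store is full

    site_store = []                   # stand-in for the temporary "site store"

    def ingest(record, write_to_db, max_retries=1):
        """Bad-data policy of the cloud example: retry a failed write once, then discard."""
        for _ in range(max_retries + 1):
            try:
                write_to_db(record)   # hypothetical write function supplied by the caller
                return True
            except ValueError:        # stand-in for a "bad data" error
                continue
        return False                  # discarded after the first retry

    def handle_overflow(record):
        """Overflow policy: buffer excess data; drop ~0.1% of it once the buffer is full."""
        if len(site_store) < SITE_STORE_CAPACITY or random.random() > DROP_RATE:
            site_store.append(record)  # kept for later ingestion when the data rate slows
        # else: the record is dropped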
Data storage
• The goal of a storage infrastructure, obviously, is to store data.
• There are two storage-related issues we consider here:
  1. Capacity — how much storage should we allocate?
     • What should be the size of the memory?
     • How large and how many disk units should we have?
  2. Scalability — how fast do we need to read/write?
     • Should the storage devices be attached directly to the computers?
       – making direct input/output fast but less scalable.
     • Or should the storage be attached to the network that connects the computers in the cluster?
       – making disk access slower but allowing more storage to be added to the system easily.
These questions do not have a simple answer.
Data storage
• Memory storage hierarchy:
  – Main memory: ~65 nanoseconds per access.
  – Spinning hard disks: ~10 milliseconds per access.
  – SSDs (solid-state drives): much faster than spinning hard disks.
  – NVMe (non-volatile memory express): makes data transfer between SSDs and memory much faster.
• This gap between fast memory and the slower disk has prompted the design of data structures and algorithms that use a hard disk but try to minimise the cost of the I/O operations between the two.
Data storage
• What all this means in a big data system is that now we have the choice of architecting a
storage infrastructure by choosing how much of each type of storage we need to have.
The components become increasingly
more expensive as we go from the lower
layers of the pyramid to the upper layers.
So ultimately, it becomes an issue of cost‐benefit trade‐off.
Data quality
• Now we have stored the data efficiently. But is it any good?
• Are there ways of knowing if the data is potentially error‐free and useful for the intended
purpose?
• This is the issue of data quality.
• There are many reasons why any data application, especially larger ones, needs to be mindful of data quality.
Data quality
• Better quality means better analytics and decision making.
• Quality assurance is needed for regulatory compliance.
• Quality leads to better engagement and interaction with external entities.
Data quality
Better quality means better analytics and decision making
This emphasizes that big data can give us actionable insight.
Data quality
Quality assurance is the framework that ensures that the development and manufacture of products (such as pharmaceuticals, agrochemicals and medical devices) are performed in compliance with regulatory requirements.
Errors in data in regulated industries (e.g. pharma companies or banks) can violate regulations, leading to legal complications.
Data quality
Quality leads to better engagement and interaction with external entities
• This 3rd factor is different.
• It says: if your big data is to be used by other people or by third-party software, it is very important that the data is of good quality, so that you gain trust as a leading provider.
E.g. data collected for research to understand
scientific questions, which may be used later on by
other researchers, etc.
Data operations
• A very significant aspect of data management is to document, define, implement, and test
the set of operations that are required for a specific application.
• Some operations are independent of the type of data and some others would require us to
know the nature of the data because the operations make use of a particular data model.
• There are two broad divisions of operations:
  – Those that work on a single object, e.g. cropping an image (extracting a subarray)
  – Those that work on collections of data objects, e.g.:
    • Operations that select a part of a collection (subset), e.g. selecting (4, 5) from (1, 2, 3, 4, 5)
    • Operations that combine two collections (merge)
    • Operations that compute a function on a collection (count), e.g. counting (A, B, C, D, E) gives 5
A minimal code sketch of these collection-level operations follows.
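As a minimal illustration (not from the slides), the collection-level operations above can be sketched in Python on plain lists:

    # A minimal sketch of collection-level operations on plain Python lists.
    collection = [1, 2, 3, 4, 5]
    other = [2, 4]

    # Select a part of a collection (subset): keep elements satisfying a condition.
    subset = [x for x in collection if x >= 4]          # -> [4, 5]

    # Combine two collections (merge/union), eliminating duplicates.
    merged = sorted(set(collection) | set(other))       # -> [1, 2, 3, 4, 5]

    # Compute a function on a collection (count).
    count = len(["A", "B", "C", "D", "E"])              # -> 5

    print(subset, merged, count)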
Data operations
• Every operator must be efficient
  – i.e. every operator must perform its task as fast as possible while taking up as little memory, or disk, as possible.
• The time to perform an operation will depend on the sizes of the input and the output.
  – So, if there is an opportunity to parallelise, we should definitely do so.
• Example, select even numbers from (1, 2, 3, 4, 5):
  – Sequentially: a single subset operation scans the whole collection and returns (2, 4).
  – In parallel: the collection is split across workers, each worker selects the even numbers in its part, and the partial results are combined into (2, 4). (A minimal parallel sketch follows.)
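A minimal sketch of the "select even numbers" example run in parallel, assuming Python's standard concurrent.futures module (illustrative only — real big data systems distribute this across machines):

    from concurrent.futures import ProcessPoolExecutor

    def select_even(chunk):
        """Subset operation applied to one chunk of the data."""
        return [x for x in chunk if x % 2 == 0]

    if __name__ == "__main__":
        data = [1, 2, 3, 4, 5]
        chunks = [data[:3], data[3:]]                 # split the collection across workers
        with ProcessPoolExecutor(max_workers=2) as pool:
            partial_results = pool.map(select_even, chunks)
        result = [x for part in partial_results for x in part]
        print(result)                                 # -> [2, 4]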
Data scalability and security
• One way of achieving scalability: scaling up and scaling out.
  – Vertical scaling (scale-up): adding more processors and RAM, buying a more expensive and robust server.
    • Many operations perform better with more memory and more cores.
    • Maintenance can be difficult and expensive.
  – Horizontal scaling (scale-out): adding more, possibly less powerful, machines that interconnect over a network.
    • Parallel operations will possibly be slower.
    • Easier in practice to add more machines.
The server industry has many solutions for scale-up/scale-out decisions.
Data scalability and security
• Data security – a must for sensitive data
• Increasing the number of machines leads to
more security risks
• Data in transit must be secure
• Encryption and decryption increase security
but make data operations expensive
Data modelling
What is a data model?
• Data models describe data characteristics.
• The data modelling process may involve the definition of three data models defined at
different abstraction levels, namely:
  – Conceptual data models: for communication and definition of business terms and rules.
  – Logical data models: for clarification and detail of business rules and data structures.
  – Physical data models: for the technical implementation on a physical database.
Conceptual Data Model
A conceptual data model is used to define, at a very high and platform‐independent level of abstraction, the
entities or concepts, which represent the data of the problem domain, and their relationships.
• It leaves further details about the entities
(such as their attributes, types or primary
keys) for the next steps.
• This model is typically used to explore domain
concepts with the stakeholders
• It can be omitted or used instead of the logical
data model.
E.g. Three entities (Person, Student and Lecturer) and their
main relationships (teach and supervise associations)
Logical Data Model
A logical data model is a refinement of the previous conceptual model. It details the domain entities and their relationships, while still remaining at a platform-independent level.
• It depicts all the attributes that characterise
each entity
– possibly also including its unique identifier, the
primary key.
• It also shows all the relationships between the
entities
– possibly including the keys identifying those
relationships, the foreign keys.
• Despite being independent of any DBMS, this
model can easily be mapped on to a physical
data model thanks to the details it provides.
E.g. Three entities (Person, Student and Lecturer) and their
main relationships (teach and supervise associations)
Physical Data Model
A physical data model visually represents the structure of the
data as implemented by a given class of DBMS.
• Therefore, entities are represented as tables,
attributes are represented as table columns
and have a given data type that can vary
according to the chosen DBMS, and the
relationships between each table are
identified through foreign keys.
• Unlike the previous models, this model tends
to be platform‐specific, because it reflects the
database schema and, consequently, some
platform‐specific aspects (e.g. database‐
specific data types or query language
extensions). E.g. Three entities (Person, Student and Lecturer) and their
main relationships (teach and supervise associations)
Data models
• The complexity and detail increase from a conceptual to a physical data model.
• First, it is important to perceive at a higher level of abstraction, the data entities and their
relationships using a Conceptual Data Model.
• Then, the focus is on detailing those entities without worrying about implementation details
using a Logical Data Model.
• Finally, a Physical Data Model allows us to represent how data is supported by a given DBMS.
Data models in the context of big data
Data Modelling plays a crucial role in big data analytics as a great proportion of the Big Data is unstructured data.
Data models in the context of big data
• One way to characterise data variety is to identify the different models of data that are used
in any application.
• I will introduce you to three components of a data model and what they tell us about the
data:
Regardless of whether the data is big or small, we need to know or determine the
characteristics of data to be able to manipulate or analyse it meaningfully.
Components of a Data model – Data structure
Data structure:
• Structured data: e.g. sales transaction data in a relational table.
• Unstructured data: e.g. arbitrary text or compressed binary files (see the examples below).
Example of structured data

File 1:
(John, Smith, 10-12-1989)
(Liz, Spencer, 09-29-1980)
(Marie, Bishop, 11-07-1992)
(Steve, Richards, 04-16-1958)

File 2:
(John, Smith, 10-12-1989, Plumber, 30000)
(Liz, Spencer, 09-29-1980, Joiner, 35000)
(Marie, Bishop, 11-07-1992, Driver, )
(Steve, Richards, 04-16-1958, Salesman, 60000)

File 1:
• The first two of these fields are strings and the third one is a date.
• Even if the data grows, the pattern of the data organisation remains identical.
• This repeatable pattern of data organisation makes the file structured.

File 2:
• Is this file structured? Yes it is.
  – The missing value makes the third record incomplete, but it does not break the structure or the pattern of the data organisation.
• Does it have the same structure as File 1?
  – Apparently not; however, they seem to have been generated by a similar organisational structure, and hence they have the same data model.
Example of structured data
• File 1 and File 2 (as above) share the same generic structure: each file is a sequence of records
  (A1, A2, …, An)
  (B1, B2, …, Bn)
  …
  (Z1, Z2, …, Zn)
Example of unstructured data
• Now in contrast, consider this file:
– Just looking at it, it is impossible to figure out how the data is organised and how to identify subparts of the data.
কার কোথায়থাকা
উচিতবোঝাযাচ্ছে
না ইদানীং! ঘরে
থাকবে কে,
আরবাইরেই বা কে,
বর্ধমানে কার
থাকা দরকার, কার
চলে যাওয়া দরকার
মালদহ থেকে— সব
কেমন গুলিয়ে
যাচ্ছে। সরকার
যাঁকে ঘরের আসনে
বসিয়ে রাখে,
নির্বাচন কমিশন
We would call this data unstructured
Example of unstructured data
• Compressed data, such as JPEG images, MP3 audio files, MPEG video files, and encrypted data, is usually unstructured.
Components of a Data model – Data operations
Data operations ‐ Subsetting
• Example: Given a collection of data, and a condition, find a subset of data from the collection such that each element in the subset satisfies the condition.
(John, Smith, 10‐12‐1989, Plumber, 30000)
(Liz, Spencer, 09‐29‐1980, Joiner, 35000)
(Marie, Bishop, 11‐07‐1992, Driver, )
(Steve, Richards, 04‐16‐1958, Salesman, 60000)
Field 5 > 50000
(Steve, Richards, 04‐16‐1958, Salesman, 60000)
Data operations ‐ Substructure extraction
• Given a data collection with some structure, extract from each data item a part of the
structure as specified by a condition
(John, Smith, 10‐12‐1989, Plumber, 30000) (John, Smith)
(Liz, Spencer, 09‐29‐1980, Joiner, 35000) (Liz, Spencer)
(Marie, Bishop, 11‐07‐1992, Driver, ) (Marie, Bishop)
(Steve, Richards, 04‐16‐1958, Salesman, 60000) Field 1, Field 2 (Steve, Richards)
Data operations ‐ Union
• Given two data collections, create a new one with the elements of both input collections, eliminating duplicates.

Collection 1:
(John, Smith, 10-12-1989)
(Liz, Spencer, 09-29-1980)
(Marie, Bishop, 11-07-1992)

Collection 2:
(Lance, Holt, 04-02-1976)
(Liz, Spencer, 09-29-1980)

union →
(John, Smith, 10-12-1989)
(Liz, Spencer, 09-29-1980)
(Marie, Bishop, 11-07-1992)
(Lance, Holt, 04-02-1976)
Data operations ‐ Join
• Given two data collections that share a common field (a key), create a new collection by combining the matching elements of the two input collections.

Collection 1:
(12, John, Smith, 10-12-1989)
(14, Liz, Spencer, 09-29-1980)
(18, Marie, Bishop, 11-07-1992)
(20, Sue, Daveson, 03-16-1986)

Collection 2:
(12, Plumber, 30000)
(14, Joiner, 35000)
(18, Driver, 45000)
(23, Student, 30000)

join →
(12, John, Smith, 10-12-1989, Plumber, 30000)
(14, Liz, Spencer, 09-29-1980, Joiner, 35000)
(18, Marie, Bishop, 11-07-1992, Driver, 45000)

This operation is more complex and can be very expensive when the size of the two collections is large. A minimal sketch of one way to compute it follows.
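As an illustration (not from the slides), the join above can be computed with a hash table on the shared key; the need to scan and match both collections is one reason joins get expensive as collections grow:

    people = [
        (12, "John", "Smith", "10-12-1989"),
        (14, "Liz", "Spencer", "09-29-1980"),
        (18, "Marie", "Bishop", "11-07-1992"),
        (20, "Sue", "Daveson", "03-16-1986"),
    ]
    jobs = [(12, "Plumber", 30000), (14, "Joiner", 35000),
            (18, "Driver", 45000), (23, "Student", 30000)]

    # Build a hash table keyed on the join field, then probe it with the other collection.
    jobs_by_id = {row[0]: row[1:] for row in jobs}
    joined = [person + jobs_by_id[person[0]] for person in people if person[0] in jobs_by_id]

    for row in joined:
        print(row)   # (12, 'John', 'Smith', '10-12-1989', 'Plumber', 30000), ...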
Data operations
• This is a set of operations (subsetting, substructure extraction, union, join) that can be
performed on the data, without considering the bigness aspect.
• Operations specify the methods to manipulate the data.
• Since different data models are typically associated with different structures, the operations
on them will be different.
• But some types of operations are usually performed across all data models.
Components of a Data model – Constraints
Data constraints
• A constraint is a logical statement.
– That means, we can compute and test whether the statement is true or false.
• Constraints are part of the data model because they can specify something about the
semantics, i.e. the meaning of the data.
• Examples:
– A week has seven and only seven days is something that a data system would not know unless
this knowledge is passed on to it in the form of a constraint.
– The number of titles for a movie is restricted to one.
• Different data models have different ways to express constraints.
Types of constraints
• Value constraint
– E.g. the age is never negative
• Uniqueness constraint
– E.g. a movie can have only one title
• In the words of logic, there should exist no data object that is a movie and has more than one title.
• Enforcing this type of constraint requires counting the number of titles and verifying that it is one.
Generalising: we can count the number of values associated with each object and check
whether it lies between an upper and lower bound. This is called:
• Cardinality constraint
– E.g. A person can take between 0 and 3 blood pressure medications at a time
Types of constraints
• Type constraint
– For restricting the type of data allowed in a field.
– E.g. the name of a person cannot be ‐99.
• To ensure that this does not happen, we can enforce the type of the name to be a non‐numeric
alphabetic string.
• E.g. of a logical expression for this constraint: Name:string, not(isNumeric(Name))
A type constraint is a special kind of domain constraint.
• Domain constraint
– The domain of a data property or attribute is the possible set of values that are allowed for that
attribute.
– E.g. the possible values for the day part of the date field can be between 1 and 31
Day in (1 … 31) Month in (1 … 12) or Month in (‘Jan’, ‘Feb’, … ‘Dec’)
Types of constraints
Month in (1 … 12) or Month in (‘Jan’, ‘Feb’, … ‘Dec’)
A more complex constraint would be:
Month in (1 … 12) or Month in (‘Jan’, ‘Feb’, … ‘Dec’), with
• 1 or ‘Jan’ having 1 to 31 days
• 2 or ‘Feb’ having 1 to 28 days (or 29 depending on the year)
• ...
• However, all these constraints (value, uniqueness, cardinality, type, domain) are value constraints:
  – they only state how to restrict the values of some data property.
A minimal sketch of these constraint types as simple checks follows.
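As a minimal, illustrative sketch (the helper names and record fields are assumptions, not from the slides), the constraint types above can be expressed as simple predicates over a record:

    # Minimal sketches of the constraint types as Python predicates (illustrative only).

    def value_ok(record):
        return record["age"] >= 0                       # value constraint: age is never negative

    def uniqueness_ok(movie):
        return len(movie["titles"]) == 1                # uniqueness constraint: exactly one title

    def cardinality_ok(person):
        return 0 <= len(person["bp_medications"]) <= 3  # cardinality constraint: 0..3 medications

    def type_ok(record):
        name = record["name"]                           # type constraint: name is a non-numeric string
        return isinstance(name, str) and not name.lstrip("-").isnumeric()

    def domain_ok(record):
        return record["day"] in range(1, 32) and record["month"] in range(1, 13)  # domain constraint

    record = {"age": 35, "name": "Liz", "day": 29, "month": 9}
    print(value_ok(record), type_ok(record), domain_ok(record))   # True True True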
Types of constraints
A totally different type of constraints are:
• Structural Constraints
– A structural constraint puts restrictions on the structure of the data rather than the data values
themselves.
  – E.g.: we have a square data matrix (the same number of rows and columns). If we transform it into a (row, column, value) table and impose the same squareness constraint, the constraint translates as: "the number of data rows in the new table will be the square of the number of rows of the original matrix".
Kinds of data models
Data models
• So far, we have seen that a Data model is characterized by:
– the structure of the data that it admits
– the operations on that structure
– and a way to specify constraints.
• Now we are going to see a more detailed description of a number of common data models.
Relational data models
Semi‐structured data models
Vector Space Model
Graph Data Model
Kinds of data models – Relational data models
Relational data models
• A relational data model is one of the simplest and most frequently used data models
today.
• It forms the basis of many traditional database management systems.
Relational data models – tables
The primary data structure for a relational model is a table.
Relational data models – tables
The primary data structure for a relational model is a table.
ID   FName   LName     Depart.    Title                Salary
202  John    Gonzales  IT         DB Specialist        104750
203  Mary    Roberts   Research   Director             175400
204  Janaki  Rao       HR         Financial Analyst    63850
205  Alex    Knight    IT         Security Specialist  123500
206  Pamela  Ziegler   IT         Programmer           85600
207  Harry   Dawson    HR         Director             115450

• These tables represent a set of tuples; a relational tuple is represented as a row. Thus, this is a relation of six tuples.
• A relational tuple implies that, unless otherwise stated, its elements (e.g. 203 or 204) are atomic.
  – That is, they represent one unit of information and cannot be decomposed further.
• There are also relational tables, which represent the different relationships between tables.
Relational data models – tables
No duplicates are allowed
ID FName LName Depart. Title Salary
202 John Gonzales IT DB Specialist 104750
203 Mary Roberts Research Director 175400
204 Janaki Rao HR Financial Analyst 63850
205 Alex Knight IT Security Specialist 123500
206 Pamela Ziegler IT Programmer 85600
207 Harry Dawson HR Director 115450
207 Harry Dawson HR Director 115450
Relational data models – tables
Dissimilar tuples are not allowed either:
ID FName LName Depart. Title Salary
202 John Gonzales IT DB Specialist 104750
203 Mary Roberts Research Director 175400
204 Janaki Rao HR Financial Analyst 63850
205 Alex Knight IT Security Specialist 123500
206 Pamela Ziegler IT Programmer 85600
207 Harry Dawson HR Director 115450
Jane  Dow  208  Research Associate  65800  Research
It has all the right pieces of information, but they are all in the wrong order.
Relational data models – table schema
Relational schema

Employee
Schema (this row is part of the schema of the table):
  ID: Int (Primary key) | FName: string (Not null) | LName: string (Not null) | Department: Enum (HR, IT, Research, Business) | Title: string | Salary: int (> 25000)

202  John    Gonzales  IT        DB Specialist        104750
203  Mary    Roberts   Research  Director             175400
204  Janaki  Rao       HR        Financial Analyst    63850
205  Alex    Knight    IT        Security Specialist  123500
206  Pamela  Ziegler   IT        Programmer           85600
207  Harry   Dawson    HR        Director             115450
Jane  Dow  208  Research Associate  65800  Research      ← invalid row

The schema includes:
  – the name of the table
  – the attributes of the relation (columns)
  – for each column, the allowed data type
  – the constraints (e.g. Primary key, Not null, Salary > 25000)

Given this schema, it should now be clear why the last (invalid) row does not belong to this table.
Relational data models – primary key
Employee — ID is the primary key (unique)
  ID: Int (Primary key) | FName: string (Not null) | LName: string (Not null) | Department: Enum (HR, IT, Research, Business) | Title: string | Salary: int (> 25000)

202  John    Gonzales  IT        DB Specialist        104750
203  Mary    Roberts   Research  Director             175400
204  Janaki  Rao       HR        Financial Analyst    63850
205  Alex    Knight    IT        Security Specialist  123500
206  Pamela  Ziegler   IT        Programmer           85600
207  Harry   Dawson    HR        Director             115450

• This means it is unique for each employee.
• For every employee, knowing his/her primary key also uniquely determines the other attributes.
• A table with a primary key logically implies that the table cannot have a duplicate record; otherwise it would violate the uniqueness constraint associated with the primary key.
Relational data models – relational tables and FK
Relational tables represent the different relationships between tables
Example modified to reflect the salary of the employee at different times
EmpSalaries
EmpID   Date        Salary
202     1/1/2016    104750
203     2/15/2016   175400
204     6/1/2015    63850
205     9/15/2015   123500
206     10/1/2015   85600
207     4/15/2015   115450
202     9/15/2014   101250
204     3/1/2015    48000
207     9/15/2013   106900
205     10/1/2014   113400

Employees.ID is the primary key (PK); EmpSalaries.EmpID is a foreign key (FK) that references Employees.ID.
Relational data models – relational join
Employee                     EmpSalaries
ID   FName   LName           EmpID  Date        Salary
202  John    Gonzales        202    1/1/2016    104750
203  Mary    Roberts         203    2/15/2016   175400
204  Janaki  Rao             204    6/1/2015    63850
205  Alex    Knight          205    9/15/2015   123500
206  Pamela  Ziegler         206    10/1/2015   85600
207  Harry   Dawson          207    4/15/2015   115450

Output table of the relational join:
ID   FName   LName     Date        Salary
202  John    Gonzales  1/1/2016    104750
202  John    Gonzales  9/15/2014   101250
203  Mary    Roberts   2/15/2016   175400
204  Janaki  Rao       6/1/2015    63850
204  Janaki  Rao       3/1/2015    48000
205  Alex    Knight    9/15/2015   123500
205  Alex    Knight    10/1/2014   113400
206  Pamela  Ziegler   10/1/2015   85600
Relational data models – relational join
• Join is one of the most expensive operations, meaning it is
  – time consuming and
  – space consuming.
• As data becomes larger, and tables contain hundreds of millions of tuples, the join operation can easily become a bottleneck in a larger analytic application.
• So for analytical big data applications that need joins, it is very important to choose a suitable data management platform that makes this operation efficient. A minimal sketch using a data-frame library follows.
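As an illustration, the Employee–EmpSalaries PK–FK join can be expressed with the pandas library (an assumption — the slides do not prescribe a tool; big data platforms provide equivalent, distributed join operators):

    import pandas as pd

    employees = pd.DataFrame(
        {"ID": [202, 203, 204], "FName": ["John", "Mary", "Janaki"],
         "LName": ["Gonzales", "Roberts", "Rao"]})

    emp_salaries = pd.DataFrame(
        {"EmpID": [202, 203, 204, 202],
         "Date": ["1/1/2016", "2/15/2016", "6/1/2015", "9/15/2014"],
         "Salary": [104750, 175400, 63850, 101250]})

    # Join on the PK-FK pair Employee.ID = EmpSalaries.EmpID.
    joined = employees.merge(emp_salaries, left_on="ID", right_on="EmpID")
    print(joined[["ID", "FName", "LName", "Date", "Salary"]])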
Practicalities
• In many business solutions, people start with CSV files and manipulate them with a spreadsheet.
• Then, they migrate to a relational system only as an afterthought, when the data becomes too large for the spreadsheet to handle.
• While the spreadsheet offers many useful features, it does not conform to or enforce many principles of relational data models.
  – Consequently, a large amount of time may be spent cleaning up and correcting data errors after the migration actually happens.
Example: terrorism data
• Example: a spreadsheet that has 125,000 rows and over 100 columns.
  – It contains a list of terrorist attacks gathered from news media, so each row represents one attack.
• Let us look at it from a relational data modelling viewpoint:
  – One column records two weapons used in the attack, separated by a semicolon. This makes the column non-atomic.
  – A query like "find all attacks for which the property damage is 'minor'" cannot be answered directly; a substring search for 'minor' makes it more expensive.
Example: terrorism data
• These are the columns of the spreadsheet:
– So this is part of the schema of the data.
If you observe carefully you will see a recurring pattern.
Example: terrorism data
The design of the data table determined that there can be at most three types of attacks within a single incident, represented with three separate columns.
• In proper relational modelling, one would say that there is a 1‐to‐many relationship between
the attack and the number of attack types.
• In such a case, it would be better to place these attack‐type columns in a separate table and
connect with the parent using a PK‐FK relationship.
Example: terrorism data
Similar pattern here, this time this is about the types and subtypes of weapons used.
Kinds of data models – Semi‐structured data models
Semi-structured data models – HTML
Let's take a very simple web page:
• Everywhere here a block is nested within a larger block
– li tags within p tags, which are within body tags, etc.
• Unlike a relational structure, there are multiple list items and
multiple paragraphs
– and any single document would have a different number of them.
This means that while the data object has some structure, it is more flexible.
This is the hallmark of a semi-structured data model.
Semi-structured data models – XML
• XML stands for eXtensible Markup Language.
• XML is another well‐known standard to
represent data.
• It can be seen as a generalisation of HTML
– where the elements can be any string, and not
only the ones allowed by HTML.
Semi-structured data models – XML
1. There are two 'sampleattribute' elements.
   – They are structurally different, as they have a different number of sub-elements.
2. XML allows the querying of both schema and data.
   – E.g. you can query the name of the element that contains a sub-element whose content is 'CellType'.
We cannot perform an operation like (2) in a relational data model; e.g. we cannot query which relation has a column with a given value, say, 'John'. A minimal sketch of query (2) follows.
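To make query (2) concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree on a made-up XML snippet (the actual XML in the slide figure is not reproduced here); it finds the name of the element whose sub-element content is 'CellType':

    import xml.etree.ElementTree as ET

    # Hypothetical XML snippet standing in for the slide's example document.
    xml_text = """
    <DATASET>
      <Case>
        <Tissue><sampleattribute>CellType</sampleattribute></Tissue>
        <Spectrum><Points>-1.1 0.79 -0.45</Points></Spectrum>
      </Case>
    </DATASET>
    """

    root = ET.fromstring(xml_text)

    # Query both schema and data: which element has a sub-element whose text is 'CellType'?
    for parent in root.iter():
        for child in parent:
            if (child.text or "").strip() == "CellType":
                print(parent.tag)       # -> Tissue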
Semi-structured data models – JSON
• The same idea can also be seen in JSON.
• JSON stands for JavaScript Object Notation.
• It is a very popular format used for many different kinds of data, e.g. Twitter and Facebook data.
• It is an open-standard file format.
Semi-structured data models – JSON
• JSON uses human-readable text to transmit data objects consisting of attribute–value (key–value) pairs and array data types (or any other serialisable value).
• Although the format looks different from XML, it has a similar nested structure.
• Square brackets indicate arrays.
Useful link: http://jsonviewer.stack.hu/
Semi-structured data models – tree data structure
• One way to generalise these different forms of semi-structured data (XML, JSON) is to model them as trees.
Semi-structured data models – tree data structure
(Figure: the same data represented as XML, as a tree data structure, and as JSON.)
Useful link: http://countwordsfree.com/xmlviewer
Semi-structured data models – tree operations
• getParent
  – parent of Case: Dataset
• getChildren
  – children of Case: Tissue and Spectrum
• getSiblings
  – siblings of Case: the other Case nodes
• root-to-node-path
  – for "-1.1 0.79 -0.45" this would be: XML/DATASET/Case/Spectrum/Points/-1.1 0.79 -0.45
Queries need tree navigation; a minimal sketch of the root-to-node-path operation follows.
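A minimal sketch (on an assumed nested structure, not the actual hands-on data) of the root-to-node-path operation over a dict/list tree such as parsed JSON or XML:

    def root_to_node_path(tree, target, path=""):
        """Return the path from the root to the first node whose value equals `target`."""
        if isinstance(tree, dict):
            for key, value in tree.items():
                found = root_to_node_path(value, target, f"{path}/{key}")
                if found:
                    return found
        elif isinstance(tree, list):
            for item in tree:
                found = root_to_node_path(item, target, path)
                if found:
                    return found
        elif tree == target:
            return f"{path}/{tree}"
        return None

    # Hypothetical tree mirroring the XML/DATASET/Case/Spectrum/Points example.
    doc = {"XML": {"DATASET": {"Case": {"Tissue": "normal",
                                        "Spectrum": {"Points": ["-1.1 0.79 -0.45"]}}}}}
    print(root_to_node_path(doc, "-1.1 0.79 -0.45"))
    # -> /XML/DATASET/Case/Spectrum/Points/-1.1 0.79 -0.45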
Hands‐on activity:
1. Exploring the Semi‐structured Data Model of JSON
Details of the hands‐on activity: Data Model of JSON
• View contents of Twitter JSON
• Exploring JSON schema
• Extract fields from JSON
• Instructions for this activity can be found here:
– “HandsOn 1. JSON Data.docx”
Kinds of data models – Vector Space Model
Vector model
• The Vector Space Model has been successfully used to retrieve data from large collections
of text and images.
Starting with text
• Text is often considered unstructured data.
– It does not have attributes and relationships.
– So we need a different way to find and analyse text data.
• Finding text in a huge collection of text data requires a different approach from the data models we have seen so far.
Vector model
• To find text, we not only need the text data itself, but we need a different structure that is
computed from the text data.
• That structure will be provided by the document vector model.
• With this, finding a document won’t be an exact search problem.
– Instead, we will provide a query document and ask the system to find all documents that are
similar to it.
• Search engines use some form of vector models and similarity search to locate text data
  – the same principle can be used for finding similar images.
Vector model – Term Frequency
• Example for describing the concept of a document:
d1: “new york times”
3 documents d2: “new york post”
d3: “los angeles times”
We can create the frequency matrix of the 3 documents, containing the term frequencies (TF):

      angeles  los  new  post  times  york
d1       0      0    1    0     1      1
d2       0      0    1    1     0      1
d3       1      1    0    0     1      0
Vector model – Inverse Document Frequency
• Example for describing the concept of a document:
d1: “new york times”
3 documents d2: “new york post”
d3: “los angeles times”
Also, we can create a new vector called the Inverse Document Frequency (IDF) for each term, where IDF(term) = log2(#docs / doc frequency) and the doc frequency is the number of documents containing the term:

Term      Doc freq.   IDF
angeles   1           log2(3/1) = 1.584
los       1           log2(3/1) = 1.584
new       2           log2(3/2) = 0.584
post      1           log2(3/1) = 1.584
times     2           log2(3/2) = 0.584
york      2           log2(3/2) = 0.584

The use of log to the base 2 (instead of log10) is just a convention.
What is the intuition behind the IDF vector?
Vector model – understanding why we need IDF
Example:
• Take 100 random newspaper articles
– 10 of them cover election news
– Would you expect to see the term ‘election’ more often than ‘is’?
• Example:
  – Doc. frequency of 'election' = 50
  – Doc. frequency of 'is' = 300 (i.e. frequency of 'is' = 6 × frequency of 'election')
  – Should 'is' have more importance than 'election'?
• 'is' is such a common word that its prevalence has a negative impact on its informativeness.
  – If you compute the IDF of 'is' and 'election', the IDF of 'is' will be far lower.
Using IDF penalises words/terms that are too common in the collection.
Vector model – the TF-IDF matrix
• Obtained by multiplying the TF numbers by the IDF of each term.

TF matrix:
      angeles  los  new  post  times  york
d1       0      0    1    0     1      1
d2       0      0    1    1     0      1
d3       1      1    0    0     1      0

IDF: angeles 1.584, los 1.584, new 0.584, post 1.584, times 0.584, york 0.584

TF-IDF matrix (the last column is the length of each document vector, i.e. the square root of the sum of squares of the individual term scores):
      angeles  los     new    post   times  york   length
d1     0        0      0.584  0      0.584  0.584  1.011
d2     0        0      0.584  1.584  0      0.584  1.786
d3     1.584    1.584  0      0      0.584  0      2.316
Vector model – the TF-IDF matrix
(TF-IDF matrix as above, with one column per term — angeles, los, new, post, times, york — plus the document length.)
• Therefore, for each document, we have a vector represented here as a row.
• So that row represents the relative importance of each term in the vocabulary.
– Vocabulary means the collection of all words that appear in this collection.
– If the vocabulary has 3 million entries, then this vector can get quite long.
– Also, if the number of documents grows, e.g. to 1 billion, it becomes a big data problem.
• The last column represents the length of the document vector.
Vector model – searching in the vector space
• To perform a search in the vector space, we write a query just like we type terms in Google; in this example the query contains the term 'new' twice and the term 'times' once.
• Maximum frequency of all terms in the query = 2 ('new').
• Create the query vector (TF normalised by the maximum frequency, multiplied by the IDF):
      angeles  los  new    post  times  york   length
q        0      0   0.584   0    0.292   0     0.652
Note: the maximum frequency of all terms in the query for each of the documents in the previous example was 1 (there were no repeated terms).
Vector model – searching in the vector space
• Next, we will compute the similarity between the query vector and each document
– measuring how far the query vector is from each document.
• There are many similarity functions defined and used for different things.
• A popular similarity measure is the cosine function:
– It measures the cosine function of the angle between these two vectors:
Vector model – searching in the vector space
• The intuition behind using the cosine function as a similarity measure is that:
– If the vectors are identical, then the angle between them is zero. And therefore, the
cosine function evaluates to one.
– As the angle increases, the value of the cosine function decreases to make them more
dissimilar.
• The way to compute the function is:
  – For two vectors a = (a1, a2, …, an) and b = (b1, b2, …, bn), with document lengths l_a and l_b, respectively:
    cosSim(a, b) = (a1*b1 + a2*b2 + … + an*bn) / (l_a * l_b)
• The results of the distance function are:
    cosSim(d1, q) = (0.584*0.584 + 0.584*0.292) / (1.011*0.652) = 0.776
    cosSim(d2, q) = (0.584*0.584) / (1.786*0.652) = 0.292
    cosSim(d3, q) = (0.584*0.292) / (2.316*0.652) = 0.112
  d1 is much more similar to the query than the other two documents. A minimal script reproducing these numbers follows.
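The whole worked example can be reproduced in a short script. This is a minimal sketch (not part of the slides); the query string is an assumption chosen to contain 'new' twice and 'times' once, which matches the numbers above:

    import math

    docs = {"d1": "new york times", "d2": "new york post", "d3": "los angeles times"}
    query = "new times new"                       # assumed query: 'new' twice, 'times' once
    vocab = sorted({w for text in docs.values() for w in text.split()})

    # IDF(term) = log2(N / number of documents containing the term)
    n_docs = len(docs)
    idf = {t: math.log2(n_docs / sum(t in d.split() for d in docs.values())) for t in vocab}

    def tfidf(text):
        words = text.split()
        max_freq = max(words.count(w) for w in words)
        return [words.count(t) / max_freq * idf[t] for t in vocab]

    def cos_sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    q = tfidf(query)
    for name, text in docs.items():
        print(name, round(cos_sim(tfidf(text), q), 3))
    # -> d1 ≈ 0.775, d2 ≈ 0.293, d3 ≈ 0.113 (matching the slide values up to rounding)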
Vector model – query term weighting
• More often than not, users would like a little more control over the ranking of terms.
• One way of accomplishing this is to put different weights on each query term.
• Example:
Q = york^1 times^2 post^5
wt(york)  = 1/(1+2+5) = 1/8 = 0.125
wt(times) = 2/(1+2+5) = 2/8 = 0.25
wt(post)  = 5/(1+2+5) = 5/8 = 0.625
Multiply the query vector by these weights.
As a result, "new york post" will rank first.
Vector model – image search
• Similarity search is often used for images using a vector space model.
  – We can compute features from images.
  – One common feature is a colour histogram.
• Example:
  • One can create the histogram of the red, green and blue channels, where the histogram is the count of pixels having a certain intensity value.
  • This picture is mostly bright, so the count of dark pixels is relatively small.
  • Now we can think of histograms as vectors.
Vector model – image search
• The table shown is a feature vector where the numbers for each row have been normalised with the
size of the image to make the row sum equal to one.
• Similar vectors can be computed from the image texture, shapes of objects and any other properties.
The vector space model is significant for unstructured data.
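A minimal sketch (assuming the numpy library and a small synthetic image, rather than the image in the slide) of computing a normalised colour-histogram feature vector of the kind described above:

    import numpy as np

    # Synthetic 4x4 RGB image (values 0..255); in practice this would be loaded from a file.
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(4, 4, 3))

    def colour_histogram(img, bins=8):
        """Concatenate per-channel histograms and normalise so the vector sums to one."""
        features = []
        for channel in range(img.shape[2]):
            hist, _ = np.histogram(img[:, :, channel], bins=bins, range=(0, 256))
            features.extend(hist)
        features = np.array(features, dtype=float)
        return features / features.sum()        # normalise by the image size

    vec = colour_histogram(image)
    print(vec.shape, round(vec.sum(), 2))        # -> (24,) 1.0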
Kinds of data models – Graph Data Model
Graph Data Model
• What distinguishes a graph from other data models is that it bears two kinds of information:
1. properties and attributes of entities and relationships, and
2. the connectivity structure that constitutes the network itself.
Typical example:
Social networks
Graph Data Model
• One way to look at this data model is using the following representation (from Apache Spark)
The graph on the left is represented by the two tables on the right
The vertex (or node) table,
gives IDs to nodes and lists
their properties.
The edge table has two parts:
• The properties of the edge (in orange)
• the direction of the arrows in the network
(source ‐> destination).
Thus, since there is a directed edge going from node 3 to node 7, there is a tuple (3, 7) in that part of the edge table.
This form of the graph model is called the property graph model.
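A minimal sketch (plain Python, not the actual Apache Spark GraphX API) of the vertex-table / edge-table representation described above; the node and edge properties are made up for illustration:

    # Vertex (node) table: node ID -> properties (hypothetical values).
    vertices = {
        3: {"name": "Alice", "role": "analyst"},
        7: {"name": "Bob", "role": "engineer"},
    }

    # Edge table: (source, destination) -> edge properties.
    edges = {
        (3, 7): {"relationship": "collaborator"},   # directed edge from node 3 to node 7
    }

    # Connectivity and properties can be queried together, e.g. all edges leaving node 3:
    out_edges_of_3 = [(src, dst, props) for (src, dst), props in edges.items() if src == 3]
    print(out_edges_of_3)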
Graph Data Model
• Representing connectivity information gives graph data a new kind of computing ability that
is different from other data models we have seen so far.
• Even without looking at the properties of the nodes and edges, one can get very interesting
information just by analysing or querying this connectivity structure.
• E.g.: small social network with:
– three types of nodes:
• user (blue nodes)
• city (red node)
• restaurant (green nodes)
– three types of edges:
• friend (blue arrow)
• likes (green arrow)
• lives in (red arrow)
Graph Data Model – small social network example
(Figure: users Tom, Bob, Jill, Liz, Jen, Tim, Max and Pam; a city node Liverpool; restaurant nodes Italian1, Italian2 and Italian3; edge types 'friend', 'likes' and 'lives in'.)
• Tom is interested in finding a good Italian restaurant in Liverpool that his friends, or their friends, who also live in Liverpool, like.
• Tom shall possibly choose Italian3 because:
  – It has the highest number of 'likes' edges coming into it from people who have a 'lives in' edge to Liverpool.
  – It can be reached by following the 'friend' edges going out from Tom.
This shows an important class of operations on graph data, namely traversal, which involves following edges based on some sort of conditions. A minimal sketch of this traversal follows.
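The traversal described above can be sketched in a few lines of Python over an edge list; the graph below is a reduced, assumed version of the figure, not the complete network:

    # Reduced, assumed version of the social-network figure: (source, edge_type, destination).
    edges = [
        ("Tom", "friend", "Bob"), ("Tom", "friend", "Jill"), ("Jill", "friend", "Liz"),
        ("Bob", "lives in", "Liverpool"), ("Jill", "lives in", "Liverpool"),
        ("Liz", "lives in", "Liverpool"),
        ("Bob", "likes", "Italian3"), ("Jill", "likes", "Italian3"), ("Liz", "likes", "Italian3"),
        ("Bob", "likes", "Italian1"),
    ]

    def neighbours(node, edge_type):
        return {dst for src, etype, dst in edges if src == node and etype == edge_type}

    # Friends of Tom and friends-of-friends, restricted to people living in Liverpool.
    candidates = set()
    for friend in neighbours("Tom", "friend"):
        candidates |= {friend} | neighbours(friend, "friend")
    locals_ = {p for p in candidates if "Liverpool" in neighbours(p, "lives in")}

    # Count 'likes' edges from those people into each restaurant.
    votes = {}
    for person in locals_:
        for restaurant in neighbours(person, "likes"):
            votes[restaurant] = votes.get(restaurant, 0) + 1
    print(max(votes, key=votes.get), votes)     # -> Italian3 {'Italian3': 3, 'Italian1': 1}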
Graph Data Model – optimal path operations
A number of path operations require some sort of optimisation.
1. The simplest is the well-known shortest path query
   – to find the best route from a source location to a target location (a minimal sketch follows this list).
2. To find an optimal path that must include some user‐specified nodes
– the operation has to determine the order in which the nodes will be visited.
   – E.g. a trip planner, where the users specify the cities they wish to visit, and the operation optimises a criterion such as the total distance covered.
3. This is a case where the system must find the best possible path, given
two or more optimisation criteria, which cannot be satisfied
simultaneously.
– E.g. if I want to travel from my house to the airport using the shortest
distance, but also minimising the amount of highway travel, the algorithm
must find a best compromise.
– This is called a Pareto‐optimality problem on graphs.
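For the simplest of these, the shortest-path query, a minimal breadth-first-search sketch on an unweighted graph looks like this (illustrative only; real route planners use weighted algorithms such as Dijkstra's):

    from collections import deque

    def shortest_path(graph, source, target):
        """Breadth-first search: shortest path by number of edges in an unweighted graph."""
        queue = deque([[source]])
        visited = {source}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == target:
                return path
            for nxt in graph.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])
        return None

    # Hypothetical road network (adjacency lists).
    roads = {"home": ["A", "B"], "A": ["C"], "B": ["C", "airport"], "C": ["airport"]}
    print(shortest_path(roads, "home", "airport"))   # -> ['home', 'B', 'airport']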
Graph Data Model – neighbourhood
• The neighbourhood of a node N in a graph is a set of edges directly connected to it.
– A K neighbourhood of N is a collection of edges between nodes that are, at most, K steps away
from N.
• E.g.: in the small social network above, Bob, Jill and Liz are the 1st neighbours of Tom.
Graph Data Model – communities
• An important class of analysis to perform with neighbourhoods is
community finding.
– The graph shown in this figure has four communities.
• We can see that each community has a higher density of edges within the
community and a lower density across two different communities.
• Operations:
– Finding densely connected parts of a graph
• It helps identify neighbourhoods that can be recognised as communities.
– Finding the best possible clusters (communities) in a graph
• So that any other grouping of nodes into communities will be less effective.
• This is a more complex operation
• As graphs become bigger and denser, these methods become harder to compute.
• Thus, neighbourhood‐based optimisation operations present significant scalability challenges.
Graph Data Model – anomalous neighbourhoods
Odd because it is almost perfectly star shaped.
• That is, the nodes that the red node is connected to are almost
unconnected amongst themselves.
• That is odd because it does not happen much in reality.
• So it is an anomalous node.
Near clique.
• This shows a neighbourhood in which a significantly large number of neighbours are connected amongst themselves.
• This makes the graph very cliquish.
  – A clique refers to a neighbourhood where each node is connected to all the other nodes in the neighbourhood.
Graph Data Model – anomalous neighbourhoods
Heavy vicinity.
• A neighbourhood where some edges have an unusually heavy
weight compared to the others
Predominant edge.
• A special case of the third (heavy vicinity).
• A neighbourhood where one edge is predominantly high‐rate
compared to all the other edges.
Graph Data Model – Connectedness
• Connectedness is a fundamental property of a graph.
– In a connected graph, each node is reachable from every other node through some path.
– If a graph is not connected, but there are subgraphs of it, which are connected, then these
subgraphs are called connected components of the original graph.
• In this figure there are four connected components.
  – A search operation like finding optimal paths should be performed only within each component and not across them.
For large graphs, there are several new parallelised techniques
for the detection of connected components.
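A minimal (single-machine, not parallelised) sketch of finding connected components with repeated breadth-first searches; as the slide notes, large graphs need parallel variants of this idea:

    from collections import deque

    def connected_components(adjacency):
        """Return the connected components of an undirected graph given as adjacency lists."""
        seen, components = set(), []
        for start in adjacency:
            if start in seen:
                continue
            component, queue = set(), deque([start])
            while queue:
                node = queue.popleft()
                if node in component:
                    continue
                component.add(node)
                queue.extend(adjacency.get(node, []))
            seen |= component
            components.append(component)
        return components

    # Hypothetical graph with two connected components.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
    print(connected_components(graph))   # -> [{'a', 'b', 'c'}, {'x', 'y'}]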
Kinds of data models – Array as a data model
Other Data Models – Array as a Data model
• Arrays can serve as a data model.
• What is an array?
  – 1D array: a sequence of cells indexed by a single index, e.g. A(1) … A(6).
  – 2D array: a grid of cells indexed by (row, column), e.g. rows 1–4 and columns 1–6.
  – 3D array: a cube of cells indexed by three indices (e.g. row, column, depth).
Other Data Models – Arrays of vectors
• A 2D array of scalar values is addressed as A(row, column) = value, e.g. A(2, 4) = 19.
• A 2D array can also hold a vector in each cell, e.g. images with (R, G, B) channel values per pixel:
  – A(2, 4) = (19, 253, 49)
  – A(2, 4)[1] = 19
Other Data Models – Operations on arrays of vectors
• dim(A) – number of dimensions of A
• A(i,j) – value of the element at the (i,j)‐th cell
• A(i,j)[k] – value of the k‐th element of the cell at A(i,j)
• length(A(i,j)) – vector‐length of the vector at the (i,j)‐th cell
• etc.
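These operations map directly onto a numerical array library. A minimal sketch assuming numpy (which uses zero-based indexing, unlike the 1-based A(2, 4) notation above):

    import numpy as np

    # A 4x6 array of (R, G, B) vectors, i.e. a tiny image: shape (rows, columns, channels).
    A = np.zeros((4, 6, 3), dtype=int)
    A[1, 3] = (19, 253, 49)          # the cell written A(2, 4) in the 1-based slide notation

    print(A.ndim)                    # dim(A): number of dimensions -> 3
    print(A[1, 3])                   # value of the cell -> [ 19 253  49]
    print(A[1, 3][0])                # k-th element of the cell (k=1 in slide notation) -> 19
    print(len(A[1, 3]))              # vector length of the vector at that cell -> 3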
Data Model vs. Data Format
What is the difference between them?
Data Model vs. Data Format
Data Model vs. Data Format
• Example: the same data expressed in a graph data model and stored in a CSV file format.
  – Data model (graph): nodes Jack (profession: plumber, age: 35), Jill (profession: baker, age: 32) and Peter (profession: teacher, age: 36), with husband/wife edges between Jack and Jill, and a friend edge from Peter to Jack.
  – Data format (CSV file):
      Jack, profession, plumber, age, 35, wife, Jill
      Jill, profession, baker, age, 32, husband, Jack
      Peter, profession, teacher, age, 36, friend, Jack
A minimal sketch of reading this format back into the model follows.
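A minimal sketch (not from the slides) that reads the CSV format above back into the graph-style data model of nodes with properties and relationship edges, illustrating that the format is just one serialisation of the model:

    import csv, io

    csv_text = """Jack, profession, plumber, age, 35, wife, Jill
    Jill, profession, baker, age, 32, husband, Jack
    Peter, profession, teacher, age, 36, friend, Jack"""

    nodes, edges = {}, []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        name, _, profession, _, age, relation, other = row
        nodes[name] = {"profession": profession, "age": int(age)}   # node properties
        edges.append((name, relation, other))                       # relationship edge

    print(nodes["Jack"])    # -> {'profession': 'plumber', 'age': 35}
    print(edges[0])         # -> ('Jack', 'wife', 'Jill')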
Hands‐on activity:
2. Exploring the Array Data Model of an Image
Details of the hands‐on activity: Array Data Model of an Image
• Display an image file
• Examine the structure of the image
• Extract pixel values from various locations in the image
• Instructions for this activity can be found here:
– “HandsOn 2. Image Data.docx”.
Hands‐on activity:
3. Exploring Sensor Data
Details of the hands‐on activity: Exploring sensor data
• Look at weather station measurements in text file
• Identify the major components in semi‐structured data from a weather station
• Create plots of weather station data
• Instructions for this activity can be found here:
– “HandsOn 3. Sensor Data.docx”.
Conclusions
Summary
We have covered today:
1. The meaning of ‘data management’, and also in the context of Big Data.
2. Structured and unstructured data, basic data operations, types of data constraints, and why
they are useful to specify the semantics of the data.
3. The structural components of a relational data model, the ‘schema’, primary and foreign keys;
and some of the operations.
4. Semi‐structured data models, that most semi‐structured data are tree‐structured, and why
tree navigation operations are important for XML and JSON.
5. What a vector model is, the concepts of similarity function and similarity search; and that
many document and image search engines use these concepts.
6. The graph data model; and described path, neighbourhood, and connectivity operations.
7. How arrays can serve as a data model, and why images can be modelled as vector arrays.
8. The difference between format and data model.