Normalization and Functional Dependency


Ques What is normalization? Justify the need of normalization with examples.
• Normalization is the process of organizing the data in a database to avoid data redundancy and the insertion, update and deletion anomalies that redundancy causes.
• These three types of anomalies can occur when the database is not normalized. Let’s take an example to understand them.

Example: Suppose a manufacturing company stores employee details in a table named
employee that has four attributes: emp_id for storing the employee’s id, emp_name for storing
the employee’s name, emp_address for storing the employee’s address and emp_dept for storing
the department in which the employee works. At some point in time the table looks like this:

emp_id  emp_name  emp_address  emp_dept
101     Rick      Delhi        D001
101     Rick      Delhi        D002
123     Maggie    Agra         D890
166     Glenn     Chennai      D900
166     Glenn     Chennai      D004

The above table is not normalized. We will see the problems that we face when a table is not
normalized.

Update anomaly: In the above table we have two rows for employee Rick, as he belongs to
two departments of the company. If we want to update Rick’s address, we have to update it
in both rows or the data will become inconsistent. If the correct address gets updated in
one row but not in the other, then as per the database Rick would have two different
addresses, which is incorrect.

Insert anomaly: Suppose a new employee joins the company who is under training and
not currently assigned to any department. We would not be able to insert the employee’s
data into the table if the emp_dept field doesn’t allow nulls.

Delete anomaly: Suppose that at some point the company closes department D890. Deleting
the rows that have emp_dept as D890 would also delete the information of employee Maggie,
since she is assigned only to this department.

To overcome these anomalies we need to normalize the data.
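
The update and delete anomalies described above can be reproduced in a few lines. This is only an illustrative sketch: the rows come from the employee table above, while the new address "Mumbai" and the helper name addresses_of are made up for the example.

```python
# Rows of the unnormalized employee table from the example above.
employee = [
    {"emp_id": 101, "emp_name": "Rick", "emp_address": "Delhi", "emp_dept": "D001"},
    {"emp_id": 101, "emp_name": "Rick", "emp_address": "Delhi", "emp_dept": "D002"},
    {"emp_id": 123, "emp_name": "Maggie", "emp_address": "Agra", "emp_dept": "D890"},
    {"emp_id": 166, "emp_name": "Glenn", "emp_address": "Chennai", "emp_dept": "D900"},
    {"emp_id": 166, "emp_name": "Glenn", "emp_address": "Chennai", "emp_dept": "D004"},
]


def addresses_of(rows, emp_id):
    """Return the set of distinct addresses recorded for one employee."""
    return {r["emp_address"] for r in rows if r["emp_id"] == emp_id}


# Update anomaly: only the first of Rick's two rows is changed.
# "Mumbai" is a hypothetical new address used only for illustration.
employee[0]["emp_address"] = "Mumbai"
rick_addresses = addresses_of(employee, 101)  # now holds two conflicting values

# Delete anomaly: closing department D890 removes every trace of Maggie.
remaining = [r for r in employee if r["emp_dept"] != "D890"]
maggie_still_known = any(r["emp_id"] == 123 for r in remaining)
```

After the partial update Rick has two addresses on record, and after the department is removed no row mentions Maggie any more.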

Ques Explain the different normal forms with examples. Also explain how they are
achieved.

Here are the most commonly used normal forms:

• First normal form (1NF)
• Second normal form (2NF)
• Third normal form (3NF)
• Boyce & Codd normal form (BCNF)

First normal form (1NF)

As per the rule of first normal form, an attribute (column) of a table cannot hold multiple values.

It should hold only atomic values; that is, the domain of each attribute must contain only
atomic (indivisible) values.

Example: Suppose a company wants to store the names and contact details of its employees.
It creates a table that looks like this:

emp_id  emp_name  emp_address  emp_mobile
101     Herschel  New Delhi    8912312390
102     Jon       Kanpur       8812121212
                               9900012222
103     Ron       Chennai      7778881212
104     Lester    Bangalore    9990000123
                               8123450987

Two employees (Jon & Lester) have two mobile numbers each, so the company stored both
numbers in the same field, as you can see in the table above.

This table is not in 1NF: the rule says “each attribute of a table must have atomic (single)
values”, and the emp_mobile values for employees Jon & Lester violate that rule.

To make the table comply with 1NF we should store the data like this:

emp_id  emp_name  emp_address  emp_mobile
101     Herschel  New Delhi    8912312390
102     Jon       Kanpur       8812121212
102     Jon       Kanpur       9900012222
103     Ron       Chennai      7778881212
104     Lester    Bangalore    9990000123
104     Lester    Bangalore    8123450987
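
The conversion to 1NF shown above amounts to emitting one row per mobile number. A minimal sketch, assuming the multi-valued field is held as a Python list (the function name to_1nf is made up for this example):

```python
# Non-1NF rows: emp_mobile holds a list of numbers (data from the table above).
raw = [
    (101, "Herschel", "New Delhi", ["8912312390"]),
    (102, "Jon", "Kanpur", ["8812121212", "9900012222"]),
    (103, "Ron", "Chennai", ["7778881212"]),
    (104, "Lester", "Bangalore", ["9990000123", "8123450987"]),
]


def to_1nf(rows):
    """Emit one row per (employee, mobile number) so every field is atomic."""
    return [
        (emp_id, name, address, mobile)
        for emp_id, name, address, mobiles in rows
        for mobile in mobiles
    ]


flat = to_1nf(raw)
for row in flat:
    print(row)
```

The four input rows become six output rows, matching the 1NF table above.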

Second normal form (2NF)

A table is said to be in 2NF if both the following conditions hold:

• The table is in 1NF (First normal form).
• There are no partial dependencies; that is, no non-prime attribute is
dependent on a proper subset of any candidate key of the table.

An attribute that is not part of any candidate key is known as a non-prime attribute.

Example: Suppose a school wants to store the data of teachers and the subjects they teach.
Since a teacher can teach more than one subject, the table can have multiple rows for the
same teacher. They create a table that looks like this:

teacher_id  subject    teacher_age
111         Maths      38
111         Physics    38
222         Biology    38
333         Physics    40
333         Chemistry  40

Candidate keys: {teacher_id, subject}

Non-prime attribute: teacher_age

• The table is in 1NF because each attribute has atomic values.

• However, it is not in 2NF because the non-prime attribute teacher_age is dependent on teacher_id
alone, which is a proper subset of the candidate key. This violates the rule for 2NF, which says
“no non-prime attribute is dependent on a proper subset of any candidate key of the table”.

To make the table comply with 2NF we can break it into two tables like this:

teacher_details table:

teacher_id  teacher_age
111         38
222         38
333         40

teacher_subject table:

teacher_id  subject
111         Maths
111         Physics
222         Biology
333         Physics
333         Chemistry

Now the tables comply with Second normal form (2NF).
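
The split into teacher_details and teacher_subject can be sketched as two projections, followed by a join that confirms nothing was lost. The variable names below are illustrative; the data is taken from the teacher table above.

```python
# Original table: (teacher_id, subject, teacher_age).
teacher = [
    (111, "Maths", 38),
    (111, "Physics", 38),
    (222, "Biology", 38),
    (333, "Physics", 40),
    (333, "Chemistry", 40),
]

# teacher_age depends on teacher_id alone, so it moves to its own table.
# The set comprehension removes the repeated (teacher_id, teacher_age) pairs.
teacher_details = sorted({(tid, age) for tid, _subject, age in teacher})
teacher_subject = [(tid, subject) for tid, subject, _age in teacher]

# Joining the two tables on teacher_id reproduces the original rows.
age_of = dict(teacher_details)
rejoined = [(tid, subject, age_of[tid]) for tid, subject in teacher_subject]
```

Each teacher's age is now stored once, and rejoined contains the same rows as the original table.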

Third Normal form (3NF)


A table design is said to be in 3NF if both the following conditions hold:

• The table must be in 2NF.
• No non-prime attribute is transitively dependent on any super key; any such
transitive functional dependency should be removed.

(An attribute that is not part of any candidate key is known as a non-prime attribute.)

In other words, 3NF can be explained like this: a table is in 3NF if it is in 2NF and, for each
non-trivial functional dependency X -> Y, at least one of the following conditions holds:

• X is a super key of the table
• Y is a prime attribute of the table

An attribute that is a part of one of the candidate keys is known as a prime attribute.

Example: Suppose a company wants to store the complete address of each employee. They
create a table named employee_details that looks like this:

emp_id  emp_name  emp_zip  emp_state  emp_city  emp_district
1001    John      282005   UP         Agra      Dayal Bagh
1002    Ajeet     222008   TN         Chennai   M-City
1006    Lora      282007   TN         Chennai   Urrapakkam
1101    Lilly     292008   UK         Pauri     Bhagwan
1201    Steve     222999   MP         Gwalior   Ratan

Functional dependencies:

• emp_id -> emp_name, emp_zip

• emp_zip -> emp_state, emp_city, emp_district

Super keys: {emp_id}, {emp_id, emp_name}, {emp_id, emp_name, emp_zip} and so on

Candidate keys: {emp_id}

Non-prime attributes: all attributes except emp_id are non-prime, as they are not part of any
candidate key.

Here, emp_state, emp_city & emp_district are dependent on emp_zip, and emp_zip is dependent
on emp_id. That makes the non-prime attributes (emp_state, emp_city & emp_district) transitively
dependent on the super key (emp_id).

This violates the rule of 3NF.

To make this table comply with 3NF we have to break it into two tables to remove the
transitive dependency:

employee table:

emp_id  emp_name  emp_zip
1001    John      282005
1002    Ajeet     222008
1006    Lora      282007
1101    Lilly     292008
1201    Steve     222999

employee_zip table:

emp_zip  emp_state  emp_city  emp_district
282005   UP         Agra      Dayal Bagh
222008   TN         Chennai   M-City
282007   TN         Chennai   Urrapakkam
292008   UK         Pauri     Bhagwan
222999   MP         Gwalior   Ratan
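
The two 3NF tables can be created and joined back together with SQLite from Python's standard library. This is a sketch under assumptions: the column types and the foreign-key declaration are choices made for the example, not something the notes specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_zip (
    emp_zip      TEXT PRIMARY KEY,
    emp_state    TEXT,
    emp_city     TEXT,
    emp_district TEXT
);
CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT,
    emp_zip  TEXT REFERENCES employee_zip(emp_zip)
);
""")
conn.executemany("INSERT INTO employee_zip VALUES (?, ?, ?, ?)", [
    ("282005", "UP", "Agra", "Dayal Bagh"),
    ("222008", "TN", "Chennai", "M-City"),
    ("282007", "TN", "Chennai", "Urrapakkam"),
    ("292008", "UK", "Pauri", "Bhagwan"),
    ("222999", "MP", "Gwalior", "Ratan"),
])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1001, "John", "282005"),
    (1002, "Ajeet", "222008"),
    (1006, "Lora", "282007"),
    (1101, "Lilly", "292008"),
    (1201, "Steve", "222999"),
])

# Joining on emp_zip rebuilds the full address for every employee.
full = conn.execute("""
    SELECT e.emp_id, e.emp_name, e.emp_zip,
           z.emp_state, z.emp_city, z.emp_district
    FROM employee AS e
    JOIN employee_zip AS z ON e.emp_zip = z.emp_zip
    ORDER BY e.emp_id
""").fetchall()
```

Each zip code's state, city and district are stored once in employee_zip, so a correction there is seen by every employee with that zip code.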

Boyce Codd normal form (BCNF)

• It is an advanced version of 3NF, which is why it is also referred to as 3.5NF.

• BCNF is stricter than 3NF.

• A table is said to be in BCNF if it is in 3NF and, for every non-trivial functional dependency
X -> Y, X is a super key of the table.

Example: Suppose there is a company wherein employees work in more than one
department. They store the data like this:

emp_id  emp_nationality  emp_dept                      dept_type  dept_no_of_emp
1001    Austrian         Production and planning       D001       200
1001    Austrian         stores                        D001       250
1002    American         design and technical support  D134       100
1002    American         Purchasing department         D134       600

Functional dependencies in the table above:

emp_id -> emp_nationality

emp_dept -> {dept_type, dept_no_of_emp}

Candidate key: {emp_id, emp_dept}

The table is not in BCNF because the left side of each functional dependency is not a super
key: neither emp_id nor emp_dept alone is a key.
To make the table comply with BCNF we can break it into three tables like this:

emp_nationality table:

emp_id  emp_nationality
1001    Austrian
1002    American

emp_dept table:

emp_dept                      dept_type  dept_no_of_emp
Production and planning       D001       200
stores                        D001       250
design and technical support  D134       100
Purchasing department         D134       600

emp_dept_mapping table:

emp_id  emp_dept
1001    Production and planning
1001    stores
1002    design and technical support
1002    Purchasing department

Functional dependencies:

emp_id -> emp_nationality

emp_dept -> {dept_type, dept_no_of_emp}

Candidate keys:

For the first table: emp_id

For the second table: emp_dept

For the third table: {emp_id, emp_dept}

The design is now in BCNF because in both functional dependencies the left side is a key of its table.
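
The BCNF test above can be sketched with an attribute-closure routine. This is a simplified illustration: it only examines the functional dependencies that are listed explicitly, not every dependency they imply, and the function names closure and bcnf_violations are made up for the example.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List the given FDs that apply to relation but whose left side
    is not a super key of it."""
    bad = []
    for lhs, rhs in fds:
        # The FD applies if its left side and some new right-side
        # attribute both live in this relation.
        applies = lhs <= relation and bool((rhs & relation) - lhs)
        if applies and not relation <= closure(lhs, fds):
            bad.append((lhs, rhs))
    return bad


fds = [
    ({"emp_id"}, {"emp_nationality"}),
    ({"emp_dept"}, {"dept_type", "dept_no_of_emp"}),
]
original = {"emp_id", "emp_nationality", "emp_dept", "dept_type", "dept_no_of_emp"}
decomposed = [
    {"emp_id", "emp_nationality"},
    {"emp_dept", "dept_type", "dept_no_of_emp"},
    {"emp_id", "emp_dept"},
]

before = bcnf_violations(original, fds)
after = [bcnf_violations(r, fds) for r in decomposed]
```

Both dependencies violate BCNF in the original table, and none of the three decomposed tables reports a violation.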

Ques Explain functional dependency.

An attribute of a table is said to be functionally dependent on another attribute of the
same table when the second attribute uniquely identifies the first.

For example:

Suppose we have a student table with attributes Stu_Id, Stu_Name and Stu_Age. Here the
Stu_Id attribute uniquely identifies the Stu_Name attribute of the student table, because
if we know the student id we can tell the student name associated with it.

This is known as a functional dependency and can be written as Stu_Id -> Stu_Name; in
words, Stu_Name is functionally dependent on Stu_Id.

Formally

If column A of a table uniquely identifies column B of the same table, this can be
represented as A -> B (attribute B is functionally dependent on attribute A). Here A is called
the determinant and B is called the dependent attribute.
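
Whether A -> B holds in a given set of rows can be checked by making sure no value of A is paired with two different values of B. A minimal sketch; the student rows below are invented for illustration, since the notes give no sample data for this table. Note that such a check can only refute a dependency for the rows at hand; it does not prove that the dependency is a rule of the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs holds in the given rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it;
        # any later, different value breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows for the student table.
students = [
    {"Stu_Id": 1, "Stu_Name": "Asha", "Stu_Age": 20},
    {"Stu_Id": 2, "Stu_Name": "Ravi", "Stu_Age": 21},
    {"Stu_Id": 3, "Stu_Name": "Asha", "Stu_Age": 20},
]

id_determines_name = fd_holds(students, ["Stu_Id"], ["Stu_Name"])
name_determines_id = fd_holds(students, ["Stu_Name"], ["Stu_Id"])
```

Here Stu_Id -> Stu_Name holds, while Stu_Name -> Stu_Id fails because two students share the same name.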

Types of Functional Dependencies


Trivial functional dependency

The dependency of an attribute on a set of attributes is known as a trivial functional
dependency if the set of attributes includes that attribute.

Symbolically: A -> B is a trivial functional dependency if B is a subset of A.

The following dependencies are also trivial: A -> A and B -> B.

For example: Consider a table with two columns, Student_Id and Student_Name.

{Student_Id, Student_Name} -> Student_Id is a trivial functional dependency, as
Student_Id is a subset of {Student_Id, Student_Name}.

That makes sense because if we know the values of Student_Id and Student_Name,
then the value of Student_Id is uniquely determined.

Also, Student_Id -> Student_Id and Student_Name -> Student_Name are trivial
dependencies.

Non trivial functional dependency

If a functional dependency X -> Y holds true where Y is not a subset of X, then this
dependency is called a non-trivial functional dependency.

For example:

Consider an employee table with three attributes: emp_id, emp_name, emp_address.

The following functional dependencies are non-trivial:

emp_id -> emp_name (emp_name is not a subset of emp_id)

emp_id -> emp_address (emp_address is not a subset of emp_id)

On the other hand, the following dependency is trivial:

{emp_id, emp_name} -> emp_name [emp_name is a subset of {emp_id, emp_name}].

Completely non trivial FD: If an FD X -> Y holds true where the intersection of X and Y is
empty, then this dependency is said to be a completely non-trivial functional dependency.
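
The three cases above reduce to two set tests. A small sketch (the function name classify_fd is made up here); it reports the most specific label that applies, so a dependency with disjoint sides is labelled completely non-trivial rather than just non-trivial.

```python
def classify_fd(lhs, rhs):
    """Classify the dependency lhs -> rhs by comparing the two attribute sets."""
    lhs, rhs = set(lhs), set(rhs)
    if rhs <= lhs:
        return "trivial"                  # Y is a subset of X
    if lhs & rhs:
        return "non-trivial"              # Y not a subset of X, but they overlap
    return "completely non-trivial"       # X and Y share no attribute


print(classify_fd({"emp_id", "emp_name"}, {"emp_name"}))
print(classify_fd({"emp_id"}, {"emp_name"}))
print(classify_fd({"emp_id", "emp_name"}, {"emp_name", "emp_address"}))
```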

Multivalued functional dependency

When a value in one column matches multiple values in another column of the same
table, independently of the remaining columns, the relationship is called a multivalued
dependency.

For example: Consider a bike manufacturing company that produces two colors
(black and red) of each model every year.

bike_model  manuf_year  color
M1001       2007        Black
M1001       2007        Red
M2012       2008        Black
M2012       2008        Red
M2222       2009        Black
M2222       2009        Red

Here the columns manuf_year and color are independent of each other and dependent on
bike_model.

In this case these two columns are said to be multivalued dependent on bike_model.
These dependencies can be represented like this:

bike_model ->-> manuf_year

bike_model ->-> color
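
One way to test a multivalued dependency on actual rows: within each bike_model group, every manuf_year must appear with every color. The sketch below implements that check; the function name mvd_holds and the extra 2008 row used as a counter-example are assumptions made for illustration.

```python
from itertools import product


def mvd_holds(rows, x, y, z):
    """Check X ->-> Y on dict rows, where z lists the remaining attributes."""
    groups = {}
    for row in rows:
        key = tuple(row[a] for a in x)
        pair = (tuple(row[a] for a in y), tuple(row[a] for a in z))
        groups.setdefault(key, set()).add(pair)
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # Every Y value must be combined with every Z value in the group.
        if pairs != set(product(ys, zs)):
            return False
    return True


bikes = [
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Black"},
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Red"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Black"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Red"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Black"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Red"},
]

year_mvd = mvd_holds(bikes, ["bike_model"], ["manuf_year"], ["color"])
color_mvd = mvd_holds(bikes, ["bike_model"], ["color"], ["manuf_year"])

# Hypothetical extra row: M1001 in 2008 exists in Black but not in Red,
# so year and color are no longer independent for that model.
broken = bikes + [{"bike_model": "M1001", "manuf_year": 2008, "color": "Black"}]
broken_mvd = mvd_holds(broken, ["bike_model"], ["manuf_year"], ["color"])
```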

Transitive dependency

A functional dependency is said to be transitive if it is indirectly formed by two functional
dependencies. For example, X -> Z is a transitive dependency if the following three
conditions hold true:
• X -> Y
• Y does not -> X
• Y -> Z

Note: A transitive dependency can only occur in a relation of three or more attributes.
Identifying this dependency helps us normalize the database into 3NF (3rd Normal Form).

Example: Consider the following table:

Book                Author               Author_age
Game of Thrones     George R. R. Martin  66
Harry Potter        J. K. Rowling        49
Dying of the Light  George R. R. Martin  66

{Book} -> {Author} (if we know the book, we know the author’s name)

{Author} does not -> {Book}

{Author} -> {Author_age}

Therefore, as per the rule of transitive dependency, {Book} -> {Author_age} should
hold. That makes sense because if we know the book name we can find the author’s
age.
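
A functional dependency behaves like a lookup table, so the transitive dependency above is simply two lookups chained together. An illustrative sketch using the three books from the example (the dictionary names are made up):

```python
# Book -> Author and Author -> Author_age as lookup tables.
book_author = {
    "Game of Thrones": "George R. R. Martin",
    "Harry Potter": "J. K. Rowling",
    "Dying of the Light": "George R. R. Martin",
}
author_age = {
    "George R. R. Martin": 66,
    "J. K. Rowling": 49,
}

# Chaining the two lookups gives the transitive dependency Book -> Author_age.
book_age = {book: author_age[author] for book, author in book_author.items()}

# Author does not determine Book: one author can map to several books.
books_by_author = {}
for book, author in book_author.items():
    books_by_author.setdefault(author, []).append(book)
```

Every book resolves to exactly one age, while George R. R. Martin maps to two books, which is why {Author} -> {Book} does not hold.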

Ques What is decomposition?

• Decomposition is the process of breaking something down into parts or elements.
• It replaces a relation with a collection of smaller relations.
• It breaks a table into multiple tables in the database.
• It should always be lossless, so that the information in the original relation can
be accurately reconstructed from the decomposed relations.
• An improper decomposition of the relation may lead to problems such as loss of
information.

Properties of Decomposition

The following are the properties of decomposition:

1. Lossless Decomposition
2. Dependency Preservation
3. Lack of Data Redundancy

1. Lossless Decomposition

• Decomposition must be lossless. It means that no information should be lost
from the relation that is decomposed.
• It guarantees that the natural join of the decomposed relations results in the
same relation that was decomposed.
Example:
Let 'E' be a relational schema with instance 'e', decomposed into E1,
E2, E3, . . . . En with instances e1, e2, e3, . . . . en. If e1 ⋈ e2 ⋈ e3 . . . . ⋈ en = e,
then the decomposition is called a 'Lossless Join Decomposition'.

• In other words, if the natural join of all the decomposed relations gives back
the original relation, the decomposition is said to be a lossless join decomposition.
Example: <Employee_Department> Table

Eid   Ename  Age  City       Salary  Deptid  DeptName
E001  ABC    29   Pune       20000   D001    Finance
E002  PQR    30   Pune       30000   D002    Production
E003  LMN    25   Mumbai     5000    D003    Sales
E004  XYZ    24   Mumbai     4000    D004    Marketing
E005  STU    32   Bangalore  25000   D005    Human Resource

• Decompose the above relation into two relations to check whether the decomposition
is lossless or lossy.
• Here we decompose it into two relations, Employee and Department.
Relation 1 : <Employee> Table

Eid   Ename  Age  City       Salary
E001  ABC    29   Pune       20000
E002  PQR    30   Pune       30000
E003  LMN    25   Mumbai     5000
E004  XYZ    24   Mumbai     4000
E005  STU    32   Bangalore  25000

• Employee Schema contains (Eid, Ename, Age, City, Salary).

Relation 2 : <Department> Table

Deptid  Eid   DeptName
D001    E001  Finance
D002    E002  Production
D003    E003  Sales
D004    E004  Marketing
D005    E005  Human Resource

• Department Schema contains (Deptid, Eid, DeptName).
• The above decomposition is a Lossless Join Decomposition: the two relations share
the common field 'Eid', which is a key of the Employee relation, so joining on it
reconstructs every original row without adding spurious ones.
• Now apply the natural join on the decomposed relations.
Employee ⋈ Department

Eid   Ename  Age  City       Salary  Deptid  DeptName
E001  ABC    29   Pune       20000   D001    Finance
E002  PQR    30   Pune       30000   D002    Production
E003  LMN    25   Mumbai     5000    D003    Sales
E004  XYZ    24   Mumbai     4000    D004    Marketing
E005  STU    32   Bangalore  25000   D005    Human Resource

Hence, the decomposition is a Lossless Join Decomposition.

• If the <Employee> table contains (Eid, Ename, Age, City, Salary) and the
<Department> table contains only (Deptid, DeptName), there is no common column
on which to join the two relations, so the original rows cannot be reconstructed.
This is a Lossy Join Decomposition.
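
The lossless and lossy cases can be compared by projecting the Employee_Department rows and joining the pieces again. A sketch under assumptions: project, natural_join and same_rows are small helpers written for this example, and rows are held as Python dictionaries.

```python
COLUMNS = ["Eid", "Ename", "Age", "City", "Salary", "Deptid", "DeptName"]
DATA = [
    ("E001", "ABC", 29, "Pune", 20000, "D001", "Finance"),
    ("E002", "PQR", 30, "Pune", 30000, "D002", "Production"),
    ("E003", "LMN", 25, "Mumbai", 5000, "D003", "Sales"),
    ("E004", "XYZ", 24, "Mumbai", 4000, "D004", "Marketing"),
    ("E005", "STU", 32, "Bangalore", 25000, "D005", "Human Resource"),
]
original = [dict(zip(COLUMNS, row)) for row in DATA]


def project(rows, attrs):
    """Project dict rows onto attrs, dropping duplicate rows."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


def natural_join(left, right):
    """Join rows that agree on every shared attribute name.
    With no shared attributes this degenerates to a cross product."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


def same_rows(a, b):
    """Compare two relations as sets of rows, ignoring row order."""
    as_set = lambda rows: {tuple(sorted(r.items())) for r in rows}
    return as_set(a) == as_set(b)


employee = project(original, ["Eid", "Ename", "Age", "City", "Salary"])
dept_with_eid = project(original, ["Deptid", "Eid", "DeptName"])
dept_without_eid = project(original, ["Deptid", "DeptName"])

lossless = same_rows(natural_join(employee, dept_with_eid), original)
lossy_join = natural_join(employee, dept_without_eid)
lossy = not same_rows(lossy_join, original)
```

With Eid kept in both relations the join returns the original five rows. Without it, every employee is paired with every department and the join produces 25 rows, most of them spurious.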

2. Dependency Preservation

• Dependencies are an important constraint on the database.
• Every dependency must be satisfied by at least one decomposed table.
• If {A → B} holds, then B is functionally dependent on A. The dependency is
easiest to check when both A and B are in the same relation.
• A decomposition has this property only if the functional dependencies of the
original relation are maintained.
• This property makes it possible to check updates without computing the natural join
of the decomposed relations.
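
A simple way to look for preserved dependencies is to ask whether the attributes of each one fit inside a single decomposed table. The sketch below uses the functional dependencies from the employee_details example in the 3NF section; the second decomposition is a made-up alternative for contrast. It is a sufficient check only: a dependency that fits in no table may still be implied by the others, so a non-empty result calls for a fuller closure-based test.

```python
def unchecked_dependencies(fds, schemas):
    """Return the single-attribute FDs (lhs, attr) that fit in no schema."""
    missing = []
    for lhs, rhs in fds:
        for attr in sorted(rhs):
            # The FD lhs -> attr can be checked locally only if all of its
            # attributes appear together in one decomposed table.
            if not any(lhs | {attr} <= schema for schema in schemas):
                missing.append((sorted(lhs), attr))
    return missing


fds = [
    ({"emp_id"}, {"emp_name", "emp_zip"}),
    ({"emp_zip"}, {"emp_state", "emp_city", "emp_district"}),
]

# Decomposition used in the 3NF example.
good = [
    {"emp_id", "emp_name", "emp_zip"},
    {"emp_zip", "emp_state", "emp_city", "emp_district"},
]
# Hypothetical alternative that separates emp_zip from the address columns.
bad = [
    {"emp_id", "emp_name", "emp_state", "emp_city", "emp_district"},
    {"emp_id", "emp_zip"},
]

good_missing = unchecked_dependencies(fds, good)
bad_missing = unchecked_dependencies(fds, bad)
```

For the first decomposition nothing is missing. For the alternative, the three dependencies from emp_zip to the address columns cannot be checked in any single table.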

3. Lack of Data Redundancy

• Data redundancy is also known as repetition of information.
• A proper decomposition should not suffer from any data redundancy.
• A careless decomposition may cause problems with the data.
• The lack of data redundancy property may be achieved through the normalization process.

Ques What is query optimization?

• Query optimization is a difficult part of query processing.
• It determines an efficient way to execute a query from among the different possible query plans.
• Users cannot access the optimizer directly. Once a query has been submitted to the database
server and parsed by the parser, it is passed to the query optimizer, where optimization occurs.
• The main aim of query optimization is to minimize the cost function:
I/O Cost + CPU Cost + Communication Cost
• It defines how an RDBMS can improve the performance of a query by re-ordering its
operations.
• It is the process of selecting the most efficient query evaluation plan from among the various
available strategies, which matters most when the query is complex.
• The chosen plan computes the same result as the given expression, but in the least costly
way.
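
The cost function above can be illustrated by scoring a few candidate plans and keeping the cheapest. All plan names and cost figures below are invented for illustration; a real optimizer estimates them from statistics about the data.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    io_cost: float
    cpu_cost: float
    comm_cost: float

    @property
    def total(self):
        """Total cost = I/O cost + CPU cost + communication cost."""
        return self.io_cost + self.cpu_cost + self.comm_cost


# Hypothetical candidate plans with made-up cost estimates.
plans = [
    Plan("full table scan", io_cost=100.0, cpu_cost=10.0, comm_cost=0.0),
    Plan("index lookup", io_cost=12.0, cpu_cost=3.0, comm_cost=0.0),
]

best = min(plans, key=lambda p: p.total)
print(best.name, best.total)
```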

Importance of Query Optimization

• Query optimization provides faster query processing.
• It lowers the cost per query.
• It puts less stress on the database.
• It provides high performance of the system.
• It consumes less memory.

Ques State the purpose of query optimization.

The goal of query optimization is to reduce the system resources required to fulfill a query, and
ultimately provide the user with the correct result set faster.
• First, it provides the user with faster results, which makes the application seem faster to
the user.
• Secondly, it allows the system to service more queries in the same amount of time,
because each request takes less time than unoptimized queries.
• Thirdly, query optimization ultimately reduces the amount of wear on the hardware (e.g.
disk drives), and allows the server to run more efficiently (e.g. lower power consumption,
less memory usage).

Ques Explain the various steps of query optimization or query processing.

Introduction to Query Processing

• Query processing is the translation of high-level queries into low-level expressions.
• It is a stepwise process that involves the physical level of the file system, query
optimization and the actual execution of the query to get the result.
• It requires the basic concepts of relational algebra and file structure.
• It refers to the range of activities that are involved in extracting data from the database.
• It includes the translation of queries in high-level database languages into expressions that can be
implemented at the physical level of the file system.
• Studying query processing shows how queries are processed and how they are optimized.

Query processing involves the following steps:

1. The first step is to transform the query into a standard form. A query written in SQL is
translated into a relational algebra expression. During this process, the parser checks the
syntax and verifies the relations and the attributes that are used in the query. The system
constructs a parse tree representation of the query, which it then translates into a
relational algebra expression.

2. The second step is the query optimizer. It transforms the query into equivalent expressions
that are more efficient to execute and produces a query execution plan.

3. The third step is query evaluation. It executes the query execution plan and returns the
result.

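Translating SQL Queries into Relational Algebra

Example

SELECT Ename FROM Employee
WHERE Salary > 5000;

Translated into Relational Algebra Expression

π Ename (σ Salary > 5000 (Employee))

OR

π Ename (σ Salary > 5000 (π Ename, Salary (Employee)))

The selection on Salary must be applied before Salary is projected away:
σ Salary > 5000 (π Ename (Employee)) is not a valid translation, because the result of
π Ename no longer contains the Salary attribute.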
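
The translation can be checked by running the SQL query and the relational algebra expression π Ename (σ Salary > 5000 (Employee)) side by side. This is a sketch: select and project are small helper functions written for the example, the rows are the Employee rows from the decomposition section, and SQLite stands in for the database. The selection runs while Salary is still available, and the projection to Ename comes last.

```python
import sqlite3

COLUMNS = ["Eid", "Ename", "Age", "City", "Salary"]
DATA = [
    ("E001", "ABC", 29, "Pune", 20000),
    ("E002", "PQR", 30, "Pune", 30000),
    ("E003", "LMN", 25, "Mumbai", 5000),
    ("E004", "XYZ", 24, "Mumbai", 4000),
    ("E005", "STU", 32, "Bangalore", 25000),
]
employee = [dict(zip(COLUMNS, row)) for row in DATA]


def select(rows, predicate):
    """Relational selection (sigma): keep the rows satisfying the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Relational projection (pi): keep only attrs and drop duplicate rows."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


# pi Ename (sigma Salary > 5000 (Employee))
ra_result = project(select(employee, lambda r: r["Salary"] > 5000), ["Ename"])

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Eid TEXT, Ename TEXT, Age INTEGER, City TEXT, Salary INTEGER)"
)
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", DATA)
sql_result = conn.execute("SELECT Ename FROM Employee WHERE Salary > 5000").fetchall()

ra_names = sorted(row["Ename"] for row in ra_result)
sql_names = sorted(name for (name,) in sql_result)
```

Both approaches return the same three names. LMN is excluded because a salary of exactly 5000 does not satisfy Salary > 5000.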

• A sequence of primitive operations that can be used to evaluate a query is called a Query
Execution Plan or Query Evaluation Plan.
• The query execution engine takes a query execution plan, executes it and
returns the answers to the query.
• The optimizer chooses the query execution plan that minimizes the cost of query evaluation.
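
Many database systems can display the plan they have chosen. As one concrete illustration, SQLite offers the EXPLAIN QUERY PLAN statement, used below through Python's sqlite3 module. The table layout and the index idx_salary are assumptions made for the example, and the wording of the plan text differs between SQLite versions, so it is printed rather than relied upon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Eid TEXT PRIMARY KEY, Ename TEXT, Salary INTEGER)"
)
# Hypothetical index that gives the optimizer an alternative to a full scan.
conn.execute("CREATE INDEX idx_salary ON Employee (Salary)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Ename FROM Employee WHERE Salary > 5000"
).fetchall()

# The last column of each returned row is a human-readable plan step.
for row in plan:
    print(row[-1])
```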
