Normalization
Higher degrees of normalization typically involve more tables and create the need for a
larger number of joins, which can reduce performance. Accordingly, more highly
normalized tables are typically used in database applications involving many isolated
transactions (e.g. an Automated teller machine), while less normalized tables tend to be
used in database applications that do not need to map complex relationships between data
entities and data attributes (e.g. a reporting application, or a full-text search application).
Although the normal forms are often defined informally in terms of the characteristics of
tables, rigorous definitions of the normal forms are concerned with the characteristics of
mathematical constructs known as relations. Whenever information is represented
relationally, it is meaningful to consider the extent to which the representation is
normalized.
A table that is not sufficiently normalized can suffer from logical inconsistencies of
various types, and from anomalies involving data operations. In such a table, update,
insertion, and deletion anomalies can occur. For example:
An insertion anomaly: until a new faculty member is assigned to teach at least one
course, his details cannot be recorded.
A deletion anomaly: all information about Dr. Giddens is lost when he temporarily
ceases to be assigned to any courses.
Ideally, a relational database table should be designed in such a way as to exclude the
possibility of update, insertion, and deletion anomalies. The normal forms of relational
database theory provide guidelines for deciding whether a particular design will be
vulnerable to such anomalies. It is possible to correct an unnormalized design so as to
make it adhere to the demands of the normal forms: this is called normalization.
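As an illustrative sketch of how such anomalies arise in practice, consider a single under-normalized table holding both faculty details and course assignments, following the Dr. Giddens example above (the table, columns, and data here are invented for demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One under-normalized table mixing faculty details with course assignments.
conn.execute("""CREATE TABLE faculty_courses (
    faculty_name TEXT NOT NULL,
    hire_date    TEXT NOT NULL,
    course_code  TEXT NOT NULL)""")
conn.execute("INSERT INTO faculty_courses "
             "VALUES ('Dr. Giddens', '1989-08-10', 'ENG-206')")

# Insertion anomaly: a new faculty member who teaches no course yet cannot
# be recorded without inventing a placeholder course_code.

# Deletion anomaly: removing Dr. Giddens's only assignment also destroys
# the record of his hire date.
conn.execute("DELETE FROM faculty_courses WHERE course_code = 'ENG-206'")
rows = conn.execute("SELECT * FROM faculty_courses").fetchall()
print(rows)  # [] -- every fact about Dr. Giddens is gone
```

Splitting faculty details and course assignments into separate tables, as the normal forms prescribe, removes both anomalies.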
History
Edgar F. Codd first proposed the process of normalization and what came to be known as
the 1st normal form:
There is, in fact, a very simple elimination[1] procedure which we shall call normalization.
Through decomposition non-simple domains are replaced by "domains whose elements
are atomic (non-decomposable) values."
—Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks[2]
In his paper, Edgar F. Codd used the term "non-simple" domains to describe a
heterogeneous data structure, but later researchers would refer to such a structure as an
abstract data type.
Normal forms
The normal forms (abbrev. NF) of relational database theory provide criteria for
determining a table's degree of vulnerability to logical inconsistencies and anomalies. The
higher the normal form applicable to a table, the less vulnerable it is to such
inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by
definition, a table always meets the requirements of its HNF and of all normal forms
lower than its HNF; also by definition, a table fails to meet the requirements of any
normal form higher than its HNF.
The normal forms are applicable to individual tables; to say that an entire database is in
normal form n is to say that all of its tables are in normal form n.
Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF).
These normal forms have been summarized as requiring that all non-key attributes be
dependent on "the key, the whole key and nothing but the key". The fourth and fifth
normal forms (4NF and 5NF) deal specifically with the representation of many-to-many
and one-to-many relationships among attributes. Sixth normal form (6NF) incorporates
considerations relevant to temporal databases.
A table is in first normal form (1NF) if and only if it faithfully represents a relation.[3]
Given that database tables embody a relation-like form, the defining characteristic of one
in first normal form is that it does not allow nulls or duplicate rows. Simply put, a table
with a unique key and without any nullable columns is in 1NF. One requirement of a
relation is that every tuple contain exactly one value for each attribute. This is sometimes
expressed as "no repeating groups"[4]. While that statement itself is axiomatic, experts
disagree about what qualifies as a "repeating group", in particular whether a value may be
a relation value; thus the precise definition of 1NF is the subject of some controversy.
Notwithstanding, this theoretical uncertainty applies to relations, not tables. Table
manifestations are intrinsically free of variable repeating groups because they are
structurally constrained to the same number of columns in all rows.
See the first normal form article for a fuller discussion of the nuances of 1NF.
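As a small illustration of the "unique key, no nullable columns" reading of 1NF (the table and values below are invented), both conditions can be enforced directly by a database engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A unique key plus NOT NULL on every column keeps the table relation-like:
# no duplicate rows, no missing values.
conn.execute("""CREATE TABLE colour (
    name TEXT NOT NULL PRIMARY KEY,
    hex  TEXT NOT NULL)""")
conn.execute("INSERT INTO colour VALUES ('red', '#ff0000')")

for bad in ("INSERT INTO colour VALUES ('red', '#ee0000')",   # duplicate key
            "INSERT INTO colour VALUES ('blue', NULL)"):      # null value
    try:
        conn.execute(bad)
    except sqlite3.IntegrityError:
        print("rejected:", bad)

count = conn.execute("SELECT count(*) FROM colour").fetchone()[0]
print(count)  # 1 -- only the well-formed row was accepted
```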
A table is in Boyce-Codd normal form (BCNF) if and only if, for every one of its non-
trivial functional dependencies X → Y, X is a superkey—that is, X is either a candidate
key or a superset thereof.[6]
A table is in fourth normal form (4NF) if and only if, for every one of its non-trivial
multivalued dependencies X →→ Y, X is a superkey—that is, X is either a candidate key
or a superset thereof.[7]
A table is in fifth normal form (5NF), also known as project-join normal form (PJ/NF), if
and only if every non-trivial join dependency in it is implied by its candidate keys.
Domain/key normal form (or DKNF) requires that a table not be subject to any
constraints other than domain constraints and key constraints.
A table is in sixth normal form (6NF) if and only if it satisfies no non-trivial join
dependencies at all.[8] This obviously means that the fifth normal form is also satisfied.
The sixth normal form was only defined when extending the relational model to take into
account the temporal dimension. Unfortunately, most current SQL technologies as of
2005 do not take into account this work, and most temporal extensions to SQL are not
relational. See work by Date, Darwen and Lorentzos[9] for a relational temporal extension,
Zimyani[10] for further discussion on Temporal Aggregation in SQL, or TSQL2 for a non-
relational approach.
Denormalization
Main article: Denormalization
Databases intended for Online Transaction Processing (OLTP) are typically more
normalized than databases intended for Online Analytical Processing (OLAP). OLTP
applications are characterized by a high volume of small transactions, such as updating a
sales record at a supermarket checkout counter. The expectation is that each transaction
will leave the database in a consistent state. By contrast, databases intended for OLAP
operations are primarily "read mostly" databases. OLAP applications tend to extract
historical data that has accumulated over a long period of time. For such databases,
redundant or "denormalized" data may facilitate Business Intelligence applications.
Specifically, dimensional tables in a star schema often contain denormalized data. The
denormalized or redundant data must be carefully controlled during ETL processing, and
users should not be permitted to see the data until it is in a consistent state. The
normalized alternative to the star schema is the snowflake schema. It has not been
established whether the denormalization itself improves performance, or whether the
concurrent removal of data constraints is what accounts for the gain. In any case, the need
for denormalization has waned as computers and RDBMS software have become more
powerful.
In recognition that denormalization can be deliberate and useful, the non-first normal
form is a definition of database designs which do not conform to the first normal form, by
allowing "sets and sets of sets to be attribute domains" (Schek 1982). This extension is a
(non-optimal) way of implementing hierarchies in relations. Some theoreticians have
dubbed this practitioner-developed method "First Ab-normal Form"; Codd defined a
relational database as using relations, so any table not in 1NF cannot, strictly speaking,
be considered relational.
To transform this NF² table into a 1NF an "unnest" operator is required which extends the
relational algebra of the higher normal forms. The reverse operator is called "nest" which
is not always the mathematical inverse of "unnest", although "unnest" is the mathematical
inverse to "nest". Another constraint required is for the operators to be bijective, which is
covered by the Partitioned Normal Form (PNF).
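A minimal sketch of these two operators on plain Python data may make the asymmetry concrete (purely illustrative: real NF² systems define "nest" and "unnest" over relations, not lists):

```python
from itertools import groupby
from operator import itemgetter

def unnest(nested):
    """Flatten (key, [values]) pairs into flat (key, value) tuples."""
    return [(k, v) for k, vs in nested for v in vs]

def nest(flat):
    """Group flat (key, value) tuples back into (key, [values]) pairs."""
    ordered = sorted(flat)
    return [(k, [v for _, v in grp])
            for k, grp in groupby(ordered, key=itemgetter(0))]

nf2 = [("503", ["00001", "00003"]), ("540", ["00002"])]
flat = unnest(nf2)
print(flat)  # [('503', '00001'), ('503', '00003'), ('540', '00002')]

# "unnest" inverts "nest" ...
assert unnest(nest(flat)) == flat
# ... but "nest" does not always invert "unnest": an empty group is lost.
assert nest(unnest([("600", [])])) == []
```

The final assertion shows why the two operators are only bijective under an extra constraint such as Partitioned Normal Form: a class with an empty student set cannot be recovered after flattening.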
Normalization Example
First, we have to consider functional dependency. It addresses the concept that certain data
fields are dependent upon other data fields in order to uniquely define and access them.
Consider, for example, the following situation:
Data on students' last names are stored in a data file. There may be (very probably will be)
several students with the last name "Smith." If we want a computer program to retrieve
information about student "Smith," we must have some way to specify which specific "Smith" we
desire. This is done through the specification of a uniquely assigned student number. John Smith
may have student number 11223, while Steve Smith may have 14322, and Tom Smith may have
33215. By specifying the correct student number, we are able to retrieve information about the
desired student named "Smith." Retrieving information about the specific "Smith" depends upon
the specification of the correct student number. For this reason we say that the last name is
functionally dependent upon the student number.
Why not, you may suggest, specify both the last name and the first name. Won't this eliminate
any ambiguity when searching for students named "Smith?" The problem is not eliminated
because it is possible (actually quite probable) that two or more students have the same last and
first names, causing problems when searching for specific information. If we have two students
whose names are John Smith, we run into the same problem when searching for specific
information. We can see that both the student last name and first name are functionally
dependent upon the student number.
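The dependency can be pictured as a simple one-way mapping, using the student numbers and names from the scenario above:

```python
# Name is functionally dependent on student number: each student number
# maps to exactly one name. The reverse lookup is not a function.
students = {11223: ("John", "Smith"),
            14322: ("Steve", "Smith"),
            33215: ("Tom", "Smith")}

print(students[14322])  # ('Steve', 'Smith') -- unambiguous

# Looking up by last name alone yields several candidates:
matches = [num for num, (_, last) in students.items() if last == "Smith"]
print(matches)  # three student numbers, so "Smith" determines nothing
```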
Now let's turn our attention to a "representative" situation for which database normalization is
required. Think, for example, of the process of getting you enrolled in classes. There is quite a bit
of information that must be entered into the RDBMS. The normalization process brings logical
order and structure to the information gathering process.
When thinking about the logical normalization process we first look at all of the data required to
accomplish a task. Consider the following (part of a) report:
Class Enrolment

Class Code  Class Description  Student Number  Name
503         Mgt Info Systems   00001           Masters, Rick
                               00003           Smith, Steve
                               00005           Jones, Terry
540         Quant Methods      00002           Wallace, Fred
                               00003           Smith, Steve
                               00004           Nurk, Sterling
What is called "repeating groups" exists within this data. Each class code can have any number
of students in it, so the students' information constitutes what is called a repeating group. Data
cannot be stored or processed in a database when it is in this form. What we must have is one
record containing all the data for each student who is enrolled in a class. There can be no "gaps"
in the data when stored in a file. The following table (data file) illustrates the data in First Normal
Form (1NF):
Class Enrolment

Class Code  Class Description  Student Number  Name
503         Mgt Info Systems   00001           Masters, Rick
503         Mgt Info Systems   00003           Smith, Steve
503         Mgt Info Systems   00005           Jones, Terry
540         Quant Methods      00002           Wallace, Fred
540         Quant Methods      00003           Smith, Steve
540         Quant Methods      00004           Nurk, Sterling
Converting to 1NF basically requires that we "flatten" the report above so that each row (record)
contains no repeating groups. No more than one entry per field can be entered and no "gaps"
exist in the data.
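The flattening step can be sketched directly, using the enrolment data from the report above:

```python
# Each report entry: (class code, description, repeating group of students).
report = [("503", "Mgt Info Systems", [("00001", "Masters, Rick"),
                                       ("00003", "Smith, Steve"),
                                       ("00005", "Jones, Terry")]),
          ("540", "Quant Methods",    [("00002", "Wallace, Fred"),
                                       ("00003", "Smith, Steve"),
                                       ("00004", "Nurk, Sterling")])]

# 1NF: one complete row per enrolment -- no gaps, no repeating groups.
rows_1nf = [(code, descr, num, name)
            for code, descr, students in report
            for num, name in students]

print(len(rows_1nf))  # 6
print(rows_1nf[0])    # ('503', 'Mgt Info Systems', '00001', 'Masters, Rick')
```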
Now consider the following object (data file or table), named ENROL, that contains the data fields
(attributes) required to enrol you in a class. (note: this object contains all data fields whereas the
above examples illustrated only the first four fields)
This object is said to be in First Normal Form (1NF) if it is in the format illustrated above with no
"gaps" or repeating groups. It is simply a collection of data fields necessary to complete the job of
enrolling you in class, with each record in the file containing all data necessary for the enrolment.
The problem with 1NF is that there is redundancy with respect to entering all of the data into a
computer for each and every class in which you enrol. For example, your name, address, etc., will
have to be entered for each class that you take. If you take four classes, your name will have to
be entered four times - not a comforting thought for the data entry person, not to mention the
opportunities to incorrectly enter it. Developing a logical method of eliminating the entry of your
name four times leads us to the definition of what is called Second Normal Form (2NF).
We must next introduce the concept of a "KEY" field. A key field is one (or more logically joined)
field(s) that is used to uniquely identify each record in a data file. For example, the
Student_Number field can be used to uniquely identify each student's record in a student data
file. However, since one student may be enrolled in more than one class each quarter, the
Student_Number field alone is not sufficient to uniquely identify each record in the ENROL file
illustrated above. The combination of the Student_Number field and the Class_Code field forms a
unique combination and can therefore be considered as the key field for the ENROL file. We
usually indicate the key fields in the object descriptions by underlining the field name(s) as
illustrated below.
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute (field) is fully functionally
dependent upon the key field. This means that all data attributes (fields) that are not used to
uniquely identify records (tuples or rows) in a file (table) should not appear more than once in the
entire database and should never have to be entered into the database more than once. Any non-
identifying data fields should be placed into separate objects (files). For example, we could
remove the name, address, etc. fields into an object named STUDENT and remove them from
the ENROL object. The result will yield two objects (files):
Here we see that the student name, address, etc., are functionally dependent upon the student
number in the STUDENT object (file), and that the class description, start date, building name,
etc., are functionally dependent upon the Student Number and the Class Code in the ENROL
object (file).
The relation between these objects (files) is said to be in 2NF. The relation is the logical linkage
between the files so that all data necessary to enrol students in classes is available and may be
uniquely retrieved when necessary.
While getting the data files into 2NF is better than 1NF, there are still some problems with the
form. For example, if the location of the class changes buildings, all records in the ENROL file for
that class will have to be updated. The building name and address are "transitively dependent"
upon the building number. Resolving the "transitive dependency" leads us to Third Normal Form
(3NF).
A relation is in 3NF if, and only if, it is in 2NF and no non-key fields are transitively dependent
upon the key field(s). That is, no non-key field can be functionally dependent upon another non-
key field. Our example is clearly not in 3NF since the building name (non-key field) depends upon
the building number (non-key field). The relation can be resolved into 3NF by dividing it into
component relations, each meeting 3NF form. Astute students will also have recognized that the
class description, start time, and start date are transitively dependent upon the class code, which
is not considered a key field here because it forms only part of the key field for the ENROL object.
They will also recognize that Lecturer name is functionally dependent upon the Lecturer code,
which is not a key field. The building code and Lecturer code fields are not key fields because
they are not used to uniquely identify each record in the ENROL object (file).
Note also that the LECTURER object is not in 3NF since the Department Name is transitively
dependent upon the Department Code. We resolve this into:
This exercise illustrates that you must consider ALL relationships within the organization's
database and resolve ALL relations into 3NF. This can take some time and effort, but the rewards
are great.
A very important point here is that no data may be lost during the normalization process. We
must always be able to reconstruct the original data after the normalization. To lose data will
cause problems and will be the result of an invalid normalization process.
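This "no data lost" requirement is the lossless-join property: joining the decomposed tables must reproduce exactly the original rows. A small self-check with invented lecturer and department data illustrates it:

```python
# Original (pre-split) rows: lecturer_code, lecturer_name, dept_code, dept_name
original = {("L1", "Adams", "D1", "Business"),
            ("L2", "Baker", "D1", "Business"),
            ("L3", "Chen",  "D2", "Maths")}

# 3NF decomposition: the transitive dependency dept_code -> dept_name
# moves into its own relation.
lecturer = {(lc, ln, dc) for lc, ln, dc, _ in original}
department = {(dc, dn) for _, _, dc, dn in original}

# A natural join on dept_code reconstructs the original exactly.
rejoined = {(lc, ln, dc, dn)
            for lc, ln, dc in lecturer
            for dc2, dn in department if dc == dc2}
assert rejoined == original  # lossless: nothing gained, nothing lost
```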
Generic Example
Assumption: A customer can have multiple orders and an order can include multiple
products.
Assumption: A customer can have multiple orders but an order can be for only one
product. CustName and OrderNo are preassigned as keys.
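Under the first assumption above (orders may include multiple products), one plausible 3NF/BCNF decomposition looks like the following; all table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    cust_name TEXT PRIMARY KEY);
CREATE TABLE "order" (
    order_no  INTEGER PRIMARY KEY,
    cust_name TEXT NOT NULL REFERENCES customer);
CREATE TABLE product (
    product_id TEXT PRIMARY KEY,
    descr      TEXT);
-- The many-to-many link between orders and products gets its own table.
CREATE TABLE order_line (
    order_no   INTEGER REFERENCES "order",
    product_id TEXT REFERENCES product,
    qty        INTEGER,
    PRIMARY KEY (order_no, product_id));
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['customer', 'order', 'order_line', 'product']
```

Under the second assumption (one product per order), the order_line table disappears and product_id becomes a plain column of "order".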
Why Normalize?
• Flexibility
  – Structure supports many ways to look at the data
• Data Integrity
  – “Modification Anomalies”
    • Deletion
    • Insertion
    • Update
• Efficiency
  – Eliminate redundant data and save space
Normalization Defined
• “In relational database design, the process of organizing data to minimize
duplication.
• Normalization usually involves dividing a database into two or more tables and
defining relationships between the tables.
• The objective is to isolate data so that additions, deletions, and modifications of a
field can be made in just one table and then propagated through the rest of the
database via the defined relationships.” - Webopedia,
http://webopedia.internet.com/TERM/n/normalization.html
Another Definition
•"Normalization" refers to the process of creating an efficient, reliable, flexible, and
appropriate "relational" structure for storing information. Normalized data must be in
a "relational" data structure. - Reid Software Development,
http://www.accessdatabase.com/normalize.html
Primary Key
• Unique identifier for every row in the table
  – Integers instead of text, to save memory and increase speed
  – Can be “composite”
  – Surrogate is best bet!
    • Meaningless numeric column acting as primary key in lieu of
something like SSN or phone number (both can be reissued!)
Referential Integrity
• Every piece of “foreign” key data has a primary key on the one side of
the relationship
  – No “orphan” records: every child has a parent
  – Can’t delete records from the primary table if they appear in a related table
• Benefits: data integrity and propagation
  – If you update fields in the main table, the change is reflected in all queries
  – Can’t add a record to a related table without adding it to the main table
  – Cascade Delete: deleting a record from the primary table deletes all its
children. Use with care! It is often a better idea to “archive”
  – Cascade Update: changing the primary key field changes the matching
foreign keys
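These rules can be seen in action with SQLite, where foreign-key enforcement must be switched on explicitly; the table names below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FKs off by default
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY);
CREATE TABLE child (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL
              REFERENCES parent(id) ON DELETE CASCADE);
INSERT INTO parent VALUES (1);
INSERT INTO child  VALUES (10, 1);
""")

# No orphans: inserting a child without a parent is rejected.
try:
    conn.execute("INSERT INTO child VALUES (11, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Cascade delete: removing the parent removes its children too.
conn.execute("DELETE FROM parent WHERE id = 1")
orphans = conn.execute("SELECT * FROM child").fetchall()
print(orphans)  # []
```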