DBMS Notes
A database is a collection of related data. It represents some aspect of the real (or an imagined) world, called the mini world or universe of discourse. Changes to the mini world are reflected in the database. Imagine, for example, a UNIVERSITY mini world concerned with students, courses, course sections, grades, and course prerequisites. It is a logically coherent collection of data, to which some meaning can be attached. (Logical coherency requires, in part, that the database not be self-contradictory.) It has a purpose: there is an intended group of users and some preconceived applications that the users are interested in employing.
To summarize: a database has some source (i.e., the mini world) from which data are derived, some degree of interaction with events in the represented mini world, and an audience that is interested in using it.

Aside: data vs. information vs. knowledge: Data is the representation of "facts" or "observations," whereas information refers to the meaning thereof (according to some interpretation), or processed data. Knowledge, on the other hand, refers to the ability to use information to achieve intended ends.

Computerized vs. manual: Not surprisingly (this being a CS course), our concern will be with computerized database systems, as opposed to manual ones, such as the card-catalog-based systems that were used in libraries in ancient (i.e., before the year 2000) times.
Size/Complexity: Databases range from small and simple to huge and complex.

Definition 1: A database management system (DBMS) is general-purpose system software facilitating each of the following (with respect to a database):
defining: specifying data types, data organization, and constraints to which the data must conform
constructing: the process of storing the data on some medium (e.g., magnetic disk) that is controlled by the DBMS
manipulating: querying, updating, report generation
sharing: allowing multiple users and programs to access the database concurrently
protection & security: protection from hardware and software crashes, and security from unauthorized access
Definition 2: A database management system is software, a collection of programs, that performs certain operations on data and manages that data. Two basic operations performed by the DBMS are: management of the data in the database, and management of the users associated with the database.
Management of the data means specifying how the data will be stored, structured, and accessed in the database. Management of database users means managing the users in such a way that they can perform the operations they need, while the DBMS ensures that a user cannot perform any operation for which he or she is not authorized. In general, a DBMS is a collection of programs performing all necessary actions associated with a database.
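In a relational DBMS, this user management is typically exposed through SQL's GRANT and REVOKE statements. A minimal sketch, assuming a hypothetical STUDENT table and user name (the account-creation syntax varies by DBMS):

    -- Create an account and allow it to read, but not modify, STUDENT data
    CREATE USER clerk1 IDENTIFIED BY 'secret';  -- exact syntax varies by DBMS
    GRANT SELECT ON STUDENT TO clerk1;
    -- Later, withdraw the privilege
    REVOKE SELECT ON STUDENT FROM clerk1;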
File Processing Systems: Storing data in individual files is a very old, but still often used, approach to system development.
Each program (system) often had its own unique set of files.
Diagrammatic representation of conventional file systems

Users of file processing systems are almost always at the mercy of the Information Systems department to write programs that manipulate stored data and produce needed information such as printed reports and screen displays.

What is a file, then? A file is a collection of data about a single entity. Files are typically designed to meet the needs of a particular department or user group, and to be part of a particular computer application.

Advantages:
1. Conventional files are relatively easy to design and implement, since they are normally based on a single application.
2. The processing speed is faster than other ways of storing data.

Disadvantages:
1. Program-data dependence.
2. Duplication of data.
3. Limited data sharing.
4. Lengthy program and system development time.
5. Excessive program maintenance when the system changes.
6. Duplication of data items in multiple files; duplication can affect input, maintenance, and storage, and can cause data integrity problems.
7. Inflexibility and non-scalability: since conventional files are designed to support a single application, the original file structure cannot support new requirements.
Today, the trend is in favor of replacing file-based systems and applications with database systems and applications.
Main Characteristics of the database approach:

1. Self-Description: A database system not only includes the data stored that is of relevance to the organization but also a complete definition/description of the database's structure and constraints. This meta-data (i.e., data about data) is stored in the so-called system catalog, which contains a description of the structure of each file, the type and storage format of each field, and the various constraints on the data (i.e., conditions that the data must satisfy). The system catalog is used not only by users (e.g., who need to know the names of tables and attributes, and sometimes data type information and other things), but also by the DBMS software, which certainly needs to "know" how the data is structured/organized in order to interpret it in a manner consistent with that structure. Recall that a DBMS is general purpose, as opposed to being a specific database application. Hence, the structure of the data cannot be "hard-coded" in its programs, but rather must be treated as a "parameter" in some sense.

2. Insulation between Programs and Data (Data Abstraction): Program-Data Independence: In traditional file processing, the structure of the data files accessed by an application is "hard-coded" in its source code. (E.g., consider a student file in a C program that uses an array of structures: it gives a detailed description of the records in the file.) If, for some reason, we decide to change the structure of the data (e.g., by adding another field, Blood Group), every application in which a description of that file's structure is hard-coded must be changed! In contrast, DBMS access programs, in most cases, do not require such changes, because the structure of the data is described (in the system catalog) separately from the programs that access it, and those programs consult the catalog in order to ascertain the structure of the data (i.e., providing a means by which to determine boundaries between records and between fields within records) so that they interpret that data properly. In other words, the DBMS provides a conceptual or logical view of the data to application programs, so that the underlying implementation may be changed without the programs being modified. (This is referred to as program-data independence.) Also, the access paths (e.g., indexes) that exist are listed in the catalog, helping the DBMS to determine the most efficient way to search for items in response to a query.

3. Multiple Views of Data: Different users (e.g., in different departments of an organization) have different "views" or perspectives on the database. For example, from the point of view of the A.O's Office, student data does not include anything about which courses were taken or which grades were earned. (This is an example of a subset view.) As another example, a Registrar's Office employee might think that PERCENTAGE is a field of data in each student's record. In reality, the underlying database might calculate that value each time it is called for. This is called virtual (or derived) data. A view designed for an academic advisor might give the appearance that the data is structured to point out the prerequisites of each course. A good DBMS has facilities for defining multiple views. This is not only convenient for users, but also addresses security issues of data access.
4. Data Sharing and Multi-user Transaction Processing: As you learned in the OS course, the simultaneous access of computer resources by multiple users/processes is a major source of complexity. The same is true for multi-user DBMSs. Arising from this is the need for concurrency control, which is supposed to ensure that several users trying to update the same data do so in a "controlled" manner so that the results of the
updates are as though they were done in some sequential order (rather than interleaved, which could result in data being incorrect). This gives rise to the concept of a transaction, which is a process that makes one or more accesses to a database and which must have the appearance of executing in isolation from all other transactions (even ones that access the same data at the "same time") and of being atomic (in the sense that, if the system crashes in the middle of its execution, the database contents must be as though it did not execute at all). Applications such as airline reservation systems are known as online transaction processing applications.
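A minimal sketch of a transaction in SQL, assuming a hypothetical ACCOUNT table; the two updates must succeed or fail together:

    START TRANSACTION;  -- BEGIN in some dialects
    UPDATE ACCOUNT SET Balance = Balance - 100 WHERE AccNo = 1;
    UPDATE ACCOUNT SET Balance = Balance + 100 WHERE AccNo = 2;
    COMMIT;  -- if the system crashes before COMMIT, the DBMS undoes both updates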
DBMS system designers/implementers: provide the DBMS software that is at the foundation of all this!
Tool developers: design and implement software tools facilitating database system design, performance monitoring, creation of graphical user interfaces, prototyping, etc.
Operators and maintenance personnel: responsible for the day-to-day operation of the system.
Capabilities of DBMS's
These are additional characteristics of a DBMS.

1. Controlling Redundancy: Data redundancy (such as tends to occur in the "file processing" approach) leads to wasted storage space, duplication of effort (when multiple copies of a datum need to be updated), and a higher likelihood of the introduction of inconsistency.
a. On the other hand, redundancy can be used to improve the performance of queries. Indexes, for example, are entirely redundant, but help the DBMS in processing queries more quickly.
b. A DBMS should provide the capability to automatically enforce the rule that no inconsistencies are introduced when data is updated.

2. Restricting Unauthorized Access: A DBMS should provide a security and authorization subsystem, which is used for specifying restrictions on user accounts. Common kinds of restrictions are to allow read-only access (no updating), or access only to a subset of the data.

3. Providing Persistent Storage for Program Objects: Object-oriented database systems make it easier for complex runtime objects (e.g., lists, trees) to be saved in secondary storage so as to survive beyond program termination and to be retrievable at a later time.

4. Providing Storage Structures for Efficient Query Processing: The DBMS maintains indexes (typically in the form of trees and/or hash tables) that are utilized to improve the execution time of queries and updates. (The choice of which indexes to create and maintain is part of physical database design and tuning and is the responsibility of the DBA.)
a. The query processing and optimization module is responsible for choosing an efficient query execution plan for each query submitted to the system.

5. Providing Backup and Recovery: The subsystem having this responsibility ensures that recovery is possible in the case of a system crash during execution of one or more transactions.

6. Providing Multiple User Interfaces: For example, query languages for casual users, programming language interfaces for application programmers, forms and/or command codes for parametric users, and menu-driven interfaces for stand-alone users.

7. Representing Complex Relationships Among Data: A DBMS should have the capability to represent such relationships and to retrieve related data quickly.

8. Enforcing Integrity Constraints: Most database applications are such that the semantics (i.e., meaning) of the data require that it satisfy certain restrictions in order to make sense. Perhaps the most fundamental constraint on a data item is its data type, which specifies the universe of values from which its value may be drawn. (E.g., a Grade field could be defined to be of type GradeType, which, say, we have defined as including precisely the values in the set { "A", "A-", "B+", ..., "F" }.) A sketch of declaring such constraints follows this list.
a. Another kind of constraint is referential integrity, which says that if the database includes an entity that refers to another one, the latter entity must exist in the database.

9. Permitting Inference and Actions Via Rules: In a deductive database system, one may specify declarative rules that allow the database to infer new data! E.g., figure out which students are on academic probation. Such capabilities would take the place of application programs that would otherwise be used to ascertain such information.
a. Active database systems go one step further by allowing "active rules" that can be used to initiate actions automatically.
10. Potential for enforcing standards: this is very crucial for the success of database applications in large organizations. Standards refer to data item names, display formats, screens, report structures, meta-data (descriptions of data), etc.

11. Reduced application development time: the incremental time required to add each new application is reduced.
12. Flexibility to change data structures: the database structure may evolve as new requirements are defined.

13. Availability of up-to-date information: very important for on-line transaction processing systems such as airline, hotel, and car reservation systems.

14. Economies of scale: by consolidating data and applications across departments, wasteful overlap of resources and personnel can be avoided.
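As promised under item 8, here is a minimal sketch of declaring such constraints in SQL; the GRADE_REPORT table and its columns are hypothetical:

    CREATE TABLE GRADE_REPORT (
        StudentSSN CHAR(9) NOT NULL,
        SectionID  INT     NOT NULL,
        Grade      CHAR(2) CHECK (Grade IN ('A', 'A-', 'B+', 'B', 'C', 'F')),  -- data type constraint
        PRIMARY KEY (StudentSSN, SectionID),
        FOREIGN KEY (StudentSSN) REFERENCES STUDENT (SSN)  -- referential integrity
    );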
The database system: the figure above represents a database system as the collection of a database, a DBMS, application programs, and users.
When a DBMS may be unnecessary:
If the database and applications are simple, well defined, and not expected to change.
If there are stringent real-time requirements that may not be met because of DBMS overhead.
If access to data by multiple users is not required.

When no DBMS may suffice:
If the database system is not able to handle the complexity of data because of modeling limitations.
If the database users need special operations not supported by the DBMS.
One fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding details of data storage that are irrelevant to database users. A data model is a set of concepts to describe the structure of a database, the operations for manipulating these structures, and certain constraints that the database should obey. By structure is meant the data types, relationships, and constraints that should hold for the data. Most data models also include a set of basic operations for specifying retrievals/updates.
There are other well-known data models that have been the basis for database systems. The best-known models pre-dating the relational model are the hierarchical (in which the entity types form a tree) and the network (in which the entity types and the relationships between them form a graph).
High-level/conceptual: (e.g., the ER model) provides a view close to the way users would perceive the data; uses concepts such as:
o entity: a real-world object or concept (e.g., student, employee, course, department, event)
o attribute: some property of interest describing an entity (e.g., height, age, color)
o relationship: an interaction among entities (e.g., a works-on relationship between an employee and a project)
Representational/implementational: an intermediate level of abstractness; an example is the relational data model (or the network model). Also called a record-based model.
Low-level/physical: gives details as to how data is stored in the computer system, such as record formats, orderings of records, and access paths (indexes).
Schemas, Instances, and Database State One must distinguish between the description of a database and the database itself. The former is called the database schema, which is specified during design and is not expected to change often. The actual data stored in the database probably changes often. The data in the database at a particular time is called the state of the database, or a snapshot. Schema is also called intension. State is also called extension.
Three-Schema Architecture: This idea was first described by the ANSI/SPARC committee in the late 1970s. The goal is to separate (i.e., insert layers of "insulation" between) user applications and the physical database. It is an ideal that few real-life DBMSs achieve fully.
internal/physical schema: describes the physical storage structure (using a low-level data model) conceptual schema: describes the (logical) structure of the whole database for a community of users. Hides physical storage details, concentrating upon describing entities, data types,
relationships, user operations, and constraints. Can be described using either high-level or implementational data model. external schema (or user views): Each such schema describes part of the database that a particular category of users is interested in, hiding rest of database. Can be described using either high-level or implementational data model. (In practice, usually described using same model as is the conceptual schema.)
Users (including application programs) submit queries that are expressed with respect to the external level. It is the responsibility of the DBMS to transform such a query into one that is expressed with respect to the internal level (and to transform the result, which is at the internal level, into its equivalent at the external level). Example: Select students with GPA > 3.5. Q: How is this accomplished? A: By virtue of mappings between the levels:
external/conceptual mapping (providing logical data independence) conceptual/internal mapping (providing physical data independence)
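A sketch of how such a mapping looks in SQL, using the GPA example above (the STUDENT table and view name are assumptions): the view plays the role of the external schema, and the DBMS rewrites queries against it into queries against the conceptual-level table:

    -- External level: a view exposing only what this user group needs
    CREATE VIEW HONORS_STUDENT AS
        SELECT Name, GPA
        FROM STUDENT        -- conceptual level
        WHERE GPA > 3.5;

    -- Users query the view; the DBMS maps this to the conceptual
    -- (and, via the catalog, the internal) level automatically
    SELECT Name FROM HONORS_STUDENT;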
Data independence is the capacity to change the schema at one level of the architecture without having to change the schema at the next higher level. We distinguish between logical and physical data independence according to which two adjacent levels are involved. The former refers to the ability to change the conceptual schema without changing the external schemas; the latter refers to the ability to change the internal schema without having to change the conceptual schema.

Logical Data Independence: The capacity to change the conceptual schema without having to change the external schemas and their associated application programs.

Physical Data Independence: The capacity to change the internal schema without having to change the conceptual schema. For an example of physical data independence, suppose that the internal schema is modified (because we decide to add a new index, or change the encoding scheme used in representing some field's value, or stipulate that some previously unordered file must be ordered by a particular field). Then we can change the mapping between the conceptual and internal schemas in order to avoid changing the conceptual schema itself.

Not surprisingly, the process of transforming data via mappings can be costly (performance-wise), which is probably one reason that real-life DBMSs don't fully implement this three-schema architecture.

DBMS Languages

DDL: Data definition (conceptual schema, possibly internal/external). Used by the DBA and database designers to specify the conceptual schema of a database. In many DBMSs, the DDL is also used to define internal and external schemas (views). In some DBMSs, separate storage definition language (SDL) and view definition language (VDL) are used to define internal and external schemas.
SDL: Storage definition (internal schema) SDL is typically realized via DBMS commands provided to the DBA and database designers
VDL: View definition (external schema)

DML: Data manipulation (retrieval, update)
o High-Level or Non-procedural Languages: These include the relational language SQL. May be used in a standalone way or may be embedded in a programming language.
o Low-Level or Procedural Languages: These must be embedded in a programming language.

DBMS Interfaces
Menu-based: popular for browsing on the web
Forms-based: designed for naive users
GUI-based: (point and click, drag and drop, etc.)
Natural language: requests in written English
Special purpose for parametric users: bank clerks use interfaces with a minimum of function keys
For the DBA: creating user accounts, granting authorizations, setting system parameters, changing schemas or access paths
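For instance, the following non-procedural query retrieves a whole set of records in one declarative statement, where a procedural DML would fetch and test one record at a time; when embedded in a host language, the same statement is bracketed by EXEC SQL and its results are fetched through a cursor. (The STUDENT table and its columns are assumptions carried over from the earlier sketch.)

    SELECT Name, GPA
    FROM STUDENT
    WHERE GPA > 3.5;  -- says what is wanted, not how to find it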
Stored data manager: this module of the DBMS controls access to DBMS information that is stored on disk, whether it is part of the database or the catalog. The dotted lines illustrate the accesses.

DDL compiler: processes schema definitions, specified in the DDL, and stores descriptions of the schemas in the DBMS catalog.

DBMS catalog: includes information such as the names and data types of data items, storage details of each file, mapping information among schemas, and constraints. It is accessed by various modules of the DBMS, as shown by the dotted lines.

Runtime database processor: handles database accesses at runtime; it receives retrieval or update operations and carries them out on the database.

Query compiler: handles high-level queries that are entered interactively. It parses, analyzes, and compiles or interprets a query by creating database access code, and then generates calls to the runtime database processor.

Precompiler: extracts DML commands from an application program written in a host language. These are sent to the DML compiler for further processing.

DBMS Utilities
Utilities perform certain functions such as:
o Loading data stored in files into a database. Includes data conversion tools.
o Backing up the database periodically on tape.
o Reorganizing database file structures.
o Report generation utilities.
o Performance monitoring utilities.
o Other functions, such as sorting, user monitoring, data compression, etc.

Data dictionary / repository: used to store schema descriptions and other information such as design decisions, application program descriptions, user information, and usage standards; it contains all the information stored in the catalog, but is accessed by users rather than by the DBMS.

Classification of DBMSs
Based upon:
underlying data model (e.g., relational, object, object-relational, network) multi-user vs. single-user centralized vs. distributed cost general-purpose vs. special-purpose types of access path options
A simplified diagram to illustrate the main phases of database design

The main phases of database design are depicted in the figure:
Requirements Collection and Analysis: the purpose is to produce a description of the users' requirements.
Conceptual Design: the purpose is to produce a conceptual schema for the database, including detailed descriptions of entity types, relationship types, and constraints. All these are expressed in terms provided by the data model being used.
Implementation: the purpose is to transform the conceptual schema (which is at a high/abstract level) into a (lower-level) representational/implementational model supported by whatever DBMS is to be used.
Physical Design: the purpose is to decide upon the internal storage structures, access paths (indexes), etc., that will be used in realizing the representational model produced in the previous phase.
Example: A STUDENT entity might be described by RNO, Name, BirthDate, etc., attributes, each having a particular value. What distinguishes an entity from an attribute is that the latter is strictly for the purpose of describing the former and is not, in and of itself, of interest to us. It is sometimes said that an entity has an independent existence, whereas an attribute does not. In performing data modeling, however, it is not always clear whether a particular concept deserves to be classified as an entity or "only" as an attribute. We can classify attributes along these dimensions:
simple/atomic vs. composite single-valued vs. multi-valued (or set-valued) stored vs. derived
A composite attribute is one that is composed of smaller parts. An atomic attribute is indivisible or indecomposable.
Example 1: A BirthDate attribute can be viewed as being composed of (sub-)attributes for month, day, and year. Example 2: An Address attribute can be viewed as being composed of (sub-)attributes for street address, city, state, and zip code. A street address can itself be viewed as being composed of a number, street name, and apartment number. As this suggests, composition can extend to a depth of two (as here) or more.
To describe the structure of a composite attribute, one can draw a tree. In text, it is customary to write its name followed by a parenthesized list of its sub-attributes. For the examples mentioned above, we would write BirthDate(Month, Day, Year) and Address(StreetAddr(StrNum, StrName, AptNum), City, State, Zip).
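In a relational implementation, a composite attribute is commonly flattened into one column per leaf sub-attribute. A minimal sketch, with a hypothetical table name and column types:

    CREATE TABLE PERSON (
        SSN        CHAR(9) PRIMARY KEY,
        -- BirthDate(Month, Day, Year) flattened into its leaves
        BirthMonth INT,
        BirthDay   INT,
        BirthYear  INT,
        -- Address(StreetAddr(StrNum, StrName, AptNum), City, State, Zip) likewise
        StrNum     INT,
        StrName    VARCHAR(40),
        AptNum     VARCHAR(10),
        City       VARCHAR(30),
        State      CHAR(2),
        Zip        CHAR(5)
    );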
Single- vs. multi-valued attribute: Consider a PERSON entity. The person it represents has (one) SSN, (one) date of birth, (one, although composite) name, etc. But that person may have zero or more academic degrees, dependents, or (if the person is a male living in Tenali) spouses! How can we model this via attributes AcademicDegrees, Dependents, and Spouses? One way is to allow such attributes to be multi-valued, which is to say that we assign to them a (possibly empty) set of values rather than a single value.
To distinguish a multi-valued attribute from a single-valued one, it is customary to enclose the former within curly braces (which makes sense, as such an attribute has a value that is a set, and curly braces are traditionally used to denote sets). Using the PERSON example from above, we would depict its structure in text as PERSON(SSN, Name, BirthDate(Month, Day, Year), { AcademicDegrees(School, Level, Year) }, { Dependents }, ...). Here we have taken the liberty to assume that each academic degree is described by a school, level (e.g., B.S., Ph.D.), and year. Thus, AcademicDegrees is not only multi-valued but also composite. We refer to an attribute that involves some combination of multi-valuedness and compositeness as a complex attribute.

A more complicated example of a complex attribute is AddressPhone. This attribute is for recording data regarding addresses and phone numbers of a business. The structure of this attribute allows for the business to have several offices, each described by an address and a set of phone numbers that ring into that office. Its structure is given by { AddressPhone( { Phone(AreaCode, Number) }, Address(StrAddr(StrNum, StrName, AptNum), City, State, Zip)) }

Stored vs. derived attribute: Perhaps independent and derivable would be better terms for these (or nonredundant and redundant). In any case, a derived attribute is one whose value can be calculated from the values of other attributes, and hence need not be stored. Example: Age can be calculated from BirthDate, assuming that the current date is accessible.

The Null value: In some cases a particular entity might not have an applicable value for a particular attribute. Or that value may be unknown. Or, in the case of a multi-valued attribute, the appropriate value might be the empty set. Example: The attribute DateOfDeath is not applicable to a living person and its correct value may be unknown for some persons who have died. In such cases, we use a special attribute value, called null. There has been some argument in the database literature about whether a different approach (such as having distinct values for not applicable and unknown) would be superior.
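Two sketches of how these ideas surface relationally, reusing the hypothetical PERSON table from the earlier example: a multi-valued (and composite) attribute such as AcademicDegrees becomes a table of its own, and a derived attribute such as Age is computed on demand rather than stored:

    -- Multi-valued, composite AcademicDegrees: one row per degree
    CREATE TABLE ACADEMIC_DEGREE (
        PersonSSN CHAR(9)     NOT NULL REFERENCES PERSON (SSN),
        School    VARCHAR(40) NOT NULL,
        Level     VARCHAR(10) NOT NULL,  -- e.g., 'B.S.', 'Ph.D.'
        Year      INT,
        PRIMARY KEY (PersonSSN, School, Level)
    );

    -- Derived attribute Age: calculated from the birth date, not stored
    SELECT SSN,
           EXTRACT(YEAR FROM CURRENT_DATE) - BirthYear AS ApproxAge
    FROM PERSON;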
Key Attributes of an Entity Type: A minimal collection of attributes (often only one) that, by design, distinguishes any two (simultaneously-existing) entities of that type. In other words, if attributes A1 through Am together form a key of entity type E, and e and f are two entities of type E existing at the same time, then, in at least one of the attributes Ai (0 < i <= m), e and f must have distinct values. An entity type could have more than one key. (See the CAR example below, in which the CAR entity type is postulated to have both Registration(RegistrationNum, State) and VehicleID as keys.)
The CAR entity type with two key attributes, Registration and Vehicle_id: (a) ER diagram notation; (b) an entity set with three entities.

Domains (Value Sets) of Attributes: The domain of an attribute is the "universe of values" from which its value can be drawn. In other words, an attribute's domain specifies its set of allowable values. The concept is similar to a data type.
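A sketch of the CAR example in SQL (column types are assumptions): each candidate key gets a uniqueness constraint, with one of them chosen as the primary key:

    CREATE TABLE CAR (
        VehicleID       CHAR(17)    PRIMARY KEY,  -- one candidate key, chosen as primary
        RegistrationNum VARCHAR(10) NOT NULL,
        State           CHAR(2)     NOT NULL,
        Make            VARCHAR(20),
        UNIQUE (RegistrationNum, State)           -- the other candidate key
    );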
Each employee:
has an SSN that uniquely identifies her/him
has an address
has a salary
has a sex
has a birthdate
has a direct supervisor
has a set of dependents
is assigned to one department
works some number of hours per week on each of a set of projects (which need not all be controlled by the same department)
Each dependent:
o has a first name
o has a sex
o has a birthdate
o is related to a particular employee in a particular way (e.g., child, spouse, pet)
o is uniquely identified by the combination of her/his first name and the employee of which (s)he is a dependent
DEPARTMENT(Name, Number, { Locations }, Manager, ManagerStartDate, { Employees }, { Projects }) PROJECT(Name, Number, Location, { Workers }, ControllingDept)
EMPLOYEE(Name(FName, MInit, LName), SSN, Sex, Address, Salary, BirthDate, Dept, Supervisor, { Dependents }, { WorksOn(Project, Hours) }) DEPENDENT(Employee, FirstName, Sex, BirthDate, Relationship)
Remarks: Note that the attribute WorksOn of EMPLOYEE (which records on which projects the employee works) is not only multi-valued (because there may be several such projects) but also composite, because we want to record, for each such project, the number of hours per week that the employee works on it. Also, each candidate key has been indicated by underlining.

For similar reasons, the attributes Manager and ManagerStartDate of DEPARTMENT really ought to be combined into a single composite attribute. Not doing so causes little or no harm, however, because these are single-valued attributes. Multi-valued attributes would pose some difficulties, on the other hand. Suppose, for example, that a department could have two or more managers, and that some department had managers Mary and Harry, whose start dates were 10-4-1999 and 1-13-2001, respectively. Then the values of the Manager and ManagerStartDate attributes would be { Mary, Harry } and { 10-4-1999, 1-13-2001 }. But from these two attribute values, there is no way to determine which manager started on which date. On the other hand, by recording this data as a set of ordered pairs, in which each pair identifies a manager and her/his starting date, this deficiency is eliminated. End of Remarks.
EMPLOYEE MANAGES DEPARTMENT (arising from the Manager attribute in DEPARTMENT)
DEPARTMENT CONTROLS PROJECT (arising from the ControllingDept attribute in PROJECT and the Projects attribute in DEPARTMENT)
EMPLOYEE WORKS_FOR DEPARTMENT (arising from the Dept attribute in EMPLOYEE and the Employees attribute in DEPARTMENT)
EMPLOYEE SUPERVISES EMPLOYEE (arising from the Supervisor attribute in EMPLOYEE)
EMPLOYEE WORKS_ON PROJECT (arising from the WorksOn attribute in EMPLOYEE and the Workers attribute in PROJECT)
DEPENDENT DEPENDS_ON EMPLOYEE (arising from the Employee attribute in DEPENDENT and the Dependents attribute in EMPLOYEE)
In ER diagrams, relationship types are drawn as diamond-shaped boxes connected by lines to the entity types involved. Note that attributes are depicted by ovals connected by lines to the entity types they describe (with multi-valued attributes in double ovals and composite attributes depicted by trees). The original attributes that gave rise to the relationship types are absent, having been replaced by the relationship types.

A relationship set is a set of instances of a relationship type. If, say, R is a relationship type that relates entity types A and B, then, at any moment in time, the relationship set of R will be a set of ordered pairs (x, y), where x is an instance of A and y is an instance of B. What this means is that, for example, if our COMPANY miniworld is, at some moment, such that employees e1, e3, and e6 work for department d1, employees e2 and e4 work for department d2, and employees e5 and e7 work for department d3, then the WORKS_FOR relationship set will include as instances the ordered pairs (e1, d1), (e2, d2), (e3, d1), (e4, d2), (e5, d3), (e6, d1), and (e7, d3).

Ordering of entity types in relationship types: Note that the order in which we list the entity types in describing a relationship is of little consequence, except that the relationship name (for purposes of clarity) ought to be consistent with it. For example, if we swap the two entity types in each of the first two relationships listed above, we should rename them IS_MANAGED_BY and IS_CONTROLLED_BY, respectively.

Degree of a relationship type: Also note that, in our COMPANY example, all relationship instances will be ordered pairs, as each relationship associates an instance of one entity type with an instance of another (or the same, in the case of SUPERVISES) entity type. Such relationships are said to be binary, or to have degree two. Relationships of degree three (called ternary) or more are also possible, although not as common. Consider an example where a relationship SUPPLY (perhaps not the best choice for a name) has as instances ordered triples of suppliers, parts, and projects, with the intent being that inclusion of the ordered triple (s2, p4, j1), for example, indicates that supplier s2 supplied part p4 to project j1.

Roles in relationships: Each entity that participates in a relationship plays a particular role in that relationship, and it is often convenient to refer to that role using an appropriate name. For example, in each instance of a WORKS_FOR relationship set, the employee entity plays the role of worker or (surprise!) employee and each department plays the role of employer or (surprise!) department. Indeed, as this example suggests, often it is best to use the same name for the role as for the corresponding entity type.
An exception to this rule occurs when the same entity type plays two (or more) roles in the same relationship. (Such relationships are said to be recursive, which I find to be a misleading use of that term. A better term might be self-referential.) For example, in each instance of a SUPERVISES relationship set, one employee plays the role of supervisor and the other plays the role of supervisee. Constraints on Relationship Types Often, in order to make a relationship type be an accurate model of the miniworld concepts that it is intended to represent, we impose certain constraints that limit the possible corresponding relationship sets. (That is, a constraint may make "invalid" a particular set of instances for a relationship type.) There are two main kinds of relationship constraints (on binary relationships). For illustration, let R be a relationship set consisting of ordered pairs of instances of entity types A and B, respectively.
cardinality ratio:
o 1:1 (one-to-one): Under this constraint, no instance of A may participate in more than one instance of R; similarly for instances of B. In other words, if (a1, b1) and (a2, b2) are (distinct) instances of R, then neither a1 = a2 nor b1 = b2. Example: Our informal description of COMPANY says that every department has one employee who manages it. If we also stipulate that an employee may not (simultaneously) play the role of manager for more than one department, it follows that MANAGES is 1:1.
o 1:N (one-to-many): Under this constraint, no instance of B may participate in more than one instance of R, but instances of A are under no such restriction. In other words, if (a1, b1) and (a2, b2) are (distinct) instances of R, then it cannot be the case that b1 = b2. Example: CONTROLS is 1:N because no project may be controlled by more than one department. On the other hand, a department may control any number of projects, so there is no restriction on the number of relationship instances in which a particular department instance may participate. For similar reasons, SUPERVISES is also 1:N.
o N:1 (many-to-one): This is just the same as 1:N but with the roles of the two entity types reversed. Example: WORKS_FOR and DEPENDS_ON are N:1.
o M:N (many-to-many): Under this constraint, there are no restrictions. (Hence, the term applies to the absence of a constraint!) Example: WORKS_ON is M:N, because an employee may work on any number of projects and a project may have any number of employees who work on it.
Notice the notation in the figure for indicating each relationship type's cardinality ratio. Suppose that, in designing a database, we decide to include a binary relationship R as described above (which relates entity types A and B, respectively). To determine how R should be constrained, with respect to cardinality ratio, the questions you should ask are these: May a given entity of type B be related to multiple entities of type A? May a given entity of type A be related to multiple entities of type B?
The pair of answers you get maps into the four possible cardinality ratios as follows:
(yes, yes) --> M:N
(yes, no) --> N:1
(no, yes) --> 1:N
(no, no) --> 1:1

participation: specifies whether or not the existence of an entity depends upon its being related to another entity via the relationship.
total participation (or existence dependency): To say that entity type A is constrained to participate totally in relationship R is to say that if (at some moment in time) R's instance set is { (a1, b1), (a2, b2), ... (am, bm) }, then (at that same moment) A's instance set must be { a1, a2, ..., am }. In other words, there can be no member of A's instance set that does not participate in at least one instance of R. According to our informal description of COMPANY, every employee must be assigned to some department. That is, every employee instance must participate in at least one instance of WORKS_FOR, which is to say that EMPLOYEE satisfies the total participation constraint with respect to the WORKS_FOR relationship. In an ER diagram, if entity type A must participate totally in relationship type R, the two are connected by a double line.
partial participation: the absence of the total participation constraint! (E.g., not every employee has to participate in MANAGES; hence we say that, with respect to MANAGES, EMPLOYEE participates partially.) This is not to say that for all employees to be managers is not allowed; it only says that it need not be the case that all employees are managers.
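A sketch of how cardinality ratios and participation constraints are typically realized in a relational schema (names and types are assumptions, and DEPARTMENT and PROJECT tables are presumed to already exist): an N:1 relationship becomes a foreign key on the N side, total participation makes that foreign key NOT NULL, and an M:N relationship gets a table of its own:

    -- WORKS_FOR: N:1 from EMPLOYEE to DEPARTMENT, total on EMPLOYEE
    CREATE TABLE EMPLOYEE (
        SSN     CHAR(9) PRIMARY KEY,
        Name    VARCHAR(40),
        DeptNum INT NOT NULL REFERENCES DEPARTMENT (Number)  -- NOT NULL enforces total participation
    );

    -- WORKS_ON: M:N, so it becomes a separate table keyed by both sides
    CREATE TABLE WORKS_ON (
        EmpSSN  CHAR(9) REFERENCES EMPLOYEE (SSN),
        ProjNum INT     REFERENCES PROJECT (Number),
        Hours   DECIMAL(4,1),
        PRIMARY KEY (EmpSSN, ProjNum)
    );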
Attributes of Relationship Types: Relationship types, like entity types, can have attributes. A good example is WORKS_ON, each instance of which identifies an employee and a project on which (s)he works. In order to record (as the specifications indicate) how many hours are worked by each employee on each project, we include Hours as an attribute of WORKS_ON. In the case of an M:N relationship type (such as WORKS_ON), allowing attributes is vital. In the case of an N:1, 1:N, or 1:1 relationship type, any attributes can be assigned to the entity type opposite from the 1 side. For example, the StartDate attribute of the MANAGES relationship type can be given to either the EMPLOYEE or the DEPARTMENT entity type.

Weak Entity Types: An entity type that has no set of attributes that qualify as a key is called weak. (Ones that do are strong.) An entity of a weak entity type is uniquely identified by the specific entity to which it is related (by a so-called identifying relationship that relates the weak entity type with its so-called identifying or owner entity type) in combination with some set of its own attributes (called a partial key).

Example: A DEPENDENT entity is identified by its first name together with the EMPLOYEE entity to which it is related via DEPENDS_ON. (Note that this wouldn't work for former heavyweight boxing champion George Foreman's sons, as they all have the name "George"!)

Because an entity of a weak entity type cannot be identified otherwise, that type has a total participation constraint (i.e., existence dependency) with respect to the identifying relationship. This should not be taken to mean that any entity type on which a total participation constraint exists is weak. For example, DEPARTMENT has a total participation constraint with respect to MANAGES, but it is not weak.

In an ER diagram, a weak entity type is depicted with a double rectangle and an identifying relationship type is depicted with a double diamond.
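A sketch of the weak entity type DEPENDENT in SQL (column types are assumptions): its primary key combines the owner's key, contributed by the identifying relationship, with the partial key FirstName:

    CREATE TABLE DEPENDENT (
        EmpSSN       CHAR(9)     NOT NULL REFERENCES EMPLOYEE (SSN),  -- identifying relationship
        FirstName    VARCHAR(20) NOT NULL,                            -- partial key
        Sex          CHAR(1),
        BirthDate    DATE,
        Relationship VARCHAR(10),  -- e.g., 'child', 'spouse'
        PRIMARY KEY (EmpSSN, FirstName)  -- owner key + partial key
    );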
Design Choices for ER Conceptual Design: Sometimes it is not clear whether a particular miniworld concept ought to be modeled as an entity type, an attribute, or a relationship type. Here are some guidelines (given with the understanding that schema design is an iterative process in which an initial design is refined repeatedly until a satisfactory result is achieved):
As happened in our development of the ER model for COMPANY, if an attribute of entity type A serves as a reference to an entity of type B, it may be wise to refine that attribute into a binary relationship involving entity types A and B. It may well be that B has a corresponding attribute referring back to A, in which case it, too, is refined into the aforementioned relationship. In our COMPANY example, this was exemplified by the Projects and ControllingDept attributes of DEPARTMENT and PROJECT, respectively. An attribute that exists in several entity types may be refined into its own entity type. For example, suppose that in a UNIVERSITY database we have entity types STUDENT, INSTRUCTOR, and COURSE, all of which have a Department attribute. Then it may be wise to introduce a new entity type, DEPARTMENT, and then to follow the preceding guideline by introducing a binary relationship between DEPARTMENT and each of the three aforementioned entity types. An entity type that is involved in very few relationships (say, zero, one, or possibly two) could be refined into an attribute (of each entity type to which it is related).
Enhanced ER Model
The ER model is generally sufficient for "traditional" database applications. But more recent applications of DB technology (e.g., CAD/CAM, telecommunication, images/graphics, multimedia, data mining/warehousing, geographic info systems) cry out for a richer model. The EER model extends the ER model by, in part, adding the concept of specialization (and its inverse, generalization), which is analogous to the same-named concept (also called extension or subclassing) from object-oriented design/programming. An entity type may be recognized as having one or more subclasses, with respect to some criterion. This represents a specialization of the entity type. A subclass inherits the features of its parent (the entity type), and can be given its own "local" features. By features
we mean not only attributes but also relationships. (This is entirely analogous to what we find in OO programming languages, where a subclass inherits its parent's features (instance variables and methods) and can be defined to have new ones specific to it.)
Using E&N's example in Figure 4.1 to illustrate, suppose that we have an entity type EMPLOYEE with attributes Name, SSN, BirthDate, and Address. Specializing this entity type with respect to job type, we might identify SECRETARY, TECHNICIAN, and ENGINEER as subclasses, with attributes TypingSpeed, TGrade, and EngType, respectively. The idea is that (at any moment in time) every member of SECRETARY's entity set is also a member of EMPLOYEE's entity set. And similarly for members of the entity sets of TECHNICIAN and ENGINEER. Hence, a SECRETARY entity, also being an EMPLOYEE entity, can participate in relationship types involving EMPLOYEE (such as WORKS_FOR, although no such relationship type is shown in the figure). Specialization is the process of defining a set of subclasses of an entity type, usually on the basis of some distinguishing characteristics (or "dimensions") of the entities in the entity type. Interestingly, you may introduce multiple specializations of a single class, each based upon a different distinguishing characteristic of (the entities of) that class. For example, in Figure 4.1, EMPLOYEE is specialized according to job type (resulting in subclasses SECRETARY, TECHNICIAN, and ENGINEER) and also on the basis of method of pay (resulting in subclasses SALARIED_EMPLOYEE and HOURLY_EMPLOYEE). This results in a situation (not typically seen in OO programming) in which a single entity can be an instance of two sibling subclasses (or, more generally, of two subclasses neither of which is an ancestor of the other). A reasonable question to ask is: What is the purpose of extending the ER model to include specialization? A few answers:
So as not to carry around null-valued attributes that don't apply
To define relationships in which only subclass entities may participate (e.g., trade union membership is applicable only to hourly employees)
The term generalization refers to the inverse of specialization. That is, generalization refers to recognizing the need/benefit of introducing a superclass to one or more classes that have already been postulated. Hence, generalization builds a class hierarchy in a bottom-up manner, whereas specialization builds it in a top-down manner.

Constraints and Characteristics of Specialization/Generalization
Constraints on Specialization and Generalization: A subclass D of a class C is said to be predicate-defined (or condition-defined) if, for any member of the entity set of C, we can determine whether or not it also is a member of D by examining the values in its attributes (and seeing whether those values satisfy the so-called defining predicate). If all subclasses in a given specialization are predicate-defined, then the specialization is said to be attribute-defined.
For example, Figure 4.4 shows an EER diagram indicating that the value of an EMPLOYEE entity's Job_Type attribute determines in which subclass (if any) that entity has membership. When there is no algorithm for determining membership in a subclass, we say that the subclass is user-defined. For such subclasses, membership of entities cannot be decided automatically and hence must be specified by a user.

Disjointness vs. overlapping constraint: Let C be a class and S1, ..., Sk be the (immediate) subclasses of C arising from some specialization thereof. For this specialization to satisfy the disjointness constraint requires that no instance of C be an instance of more than one of the Si's. In other words, for every i and j, with 0 < i < j <= k, the intersection of the entity sets of subclasses Si and Sj must be empty. If a specialization is not defined to satisfy the disjointness constraint, we say that it is overlapping, meaning that there is no restriction prohibiting an entity from being a member of two or more subclasses.
Completeness Constraint: partial vs. total The total specialization constraint requires that every entity of a superclass be a member of at least one of its (immediate) subclasses. A partial constraint is simply the absence of the total constraint (and hence is no constraint at all).
Note that the disjointness and completeness constraints are independent of one another, giving rise to four possible combinations: disjoint and total, disjoint and partial, overlapping and total, and overlapping and partial.
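One common way to realize a specialization relationally, sketched with hypothetical names and types (other mappings exist, and this sketch is independent of the earlier examples): one table for the superclass plus one table per subclass sharing the superclass key:

    CREATE TABLE EMPLOYEE (
        SSN      CHAR(9) PRIMARY KEY,
        Name     VARCHAR(40),
        Job_Type VARCHAR(12)  -- defining attribute of an attribute-defined specialization
    );

    -- Each subclass table shares the superclass key and adds its local attributes
    CREATE TABLE SECRETARY (
        SSN         CHAR(9) PRIMARY KEY REFERENCES EMPLOYEE (SSN),
        TypingSpeed INT
    );

    CREATE TABLE ENGINEER (
        SSN     CHAR(9) PRIMARY KEY REFERENCES EMPLOYEE (SSN),
        EngType VARCHAR(20)
    );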
Hierarchies vs. Lattices: In a specialization hierarchy (i.e., tree), each subclass has exactly one (immediate) superclass. In a specialization lattice, a subclass may have more than one (immediate) superclass. (See Figures 4.6 and 4.7.) Having more than one superclass gives rise to the concept of multiple inheritance.
Specialization vs. Generalization: Analogous to top-down refinement vs. bottom-up synthesis. In reallife, most projects are developed using a combination of these approaches.
Hierarchical Model: In a hierarchical model you create links between record types; the hierarchical model uses parent-child relationships, which are a 1:N mapping between record types. This is done by using trees, which, like the set theory used in the relational model, are "borrowed" from mathematics. For example, an organization might store information about an employee, such as name, employee number, department, and salary. The organization might also store information about an employee's children, such as name and date of birth. The employee and children data forms a hierarchy, where the employee data represents the parent segment and the children data represents the child segment. If an employee has three children, then there would be three child segments associated with one employee segment. In a hierarchical database the parent-child relationship is one-to-many. This restricts a child segment to having only one parent segment. Hierarchical DBMSs were popular from the late 1960s, with the introduction of IBM's Information Management System (IMS) DBMS, through the 1970s.
Advantages:
1. Simplicity
2. Data security and data integrity
3. Efficiency
Disadvantages:
1. Implementation complexity
2. Lack of structural independence
3. Programming complexity
The figure above is a customer-order-line-item database. There are three data types (record types) in the database: customers, orders, and line items. For each customer, there may be several orders, but for each order, there is just one customer. Likewise, for each order, there may be many line items, but each line item occurs in just one order. (This is the schema for the database.) So, each customer record is the root of a tree, with the orders as children. The children of the orders are the line items.

Note: Instead of keeping separate files of customers, orders, and line items, the DBMS can store orders immediately after customers. If this is done, it can result in very efficient processing.
Problem: What if we also want to maintain product information in the database, and keep track of the orders for each product? Now there is a relationship between orders and line items (each of which refers to a single product), and between products and line items. We no longer have a tree structure, but a directed graph, in which a node can have more than one parent. In a hierarchical DBMS, this problem is solved by introducing pointers. All line items for a given product can be linked on a linked list. Line items become "logical children" of products. In an IMS database, there may be logical child pointers, parent pointers, and physical child pointers.

NETWORK DATA MODEL

A member record type in the network model can have that role in more than one set; hence the multiparent concept is supported. An owner record type can also be a member or owner in another set. The data model is a simple network, and link and intersection record types (called junction records by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pairwise sets; in each set, some (one) record type is the owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1:M relationship, although 1:1 is permitted. The CODASYL network model is based on mathematical set theory.
Advantages:
1. Conceptual simplicity
2. Ease of data access
3. Data integrity and the capability to handle more relationship types
4. Data independence
5. Database standards
Disadvantages:
1. System complexity
2. Absence of structural independence
Instead of trees, schemas may be acyclic directed graphs. In the network model, there are two main abstractions: records (record types) and sets. A set represents a one-to-many relationship between record types. The database diagrammed above would be implemented using four records (customer, order, part, and line item) and three sets (customer-order, order-line item, and part-line item). This would be written in a schema for the database in the network DDL.

Network database systems use linked lists to represent one-to-many relationships. For example, if a customer has several orders, then the customer record will contain a pointer to the head of a linked list containing all of those orders. The network model allows any number of one-to-many relationships to be represented, but there is still a problem with many-to-many relationships. Consider, for example, a database of students and courses. Each student may be taking several courses. Each course enrolls many students. So the linked-list method of implementation breaks down. (Why?)

The way this is handled in the network model is to decompose the many-to-many relationship into two one-to-many relationships by introducing an additional record type called an "intersection record". In this case, we would have one intersection record for each instance of a student enrolled in a course. This gives a somewhat better tool for designing databases. The database can be designed by creating a diagram showing all the record types and the relationships between them. If necessary, intersection record types may be added. (In the hierarchical model, the designer must explicitly indicate the extra pointer fields needed to represent "out of tree" relationships.)

In general, these products were very successful, and were considered the state of the art throughout the 1970s and 1980s. They are still in use today. But there are still some problems:
There is an insufficient level of data abstraction. Database designers and programmers must still be cognizant of the underlying physical structure. Pointers are embedded in the records themselves, which makes it more difficult to change the logical structure of the database.
Processing is "one record at a time". Application programs must "navigate" their way through the database. This leads to complexity in the applications. The result is inflexibility and difficulty of use.
Performing a query to extract information from the database requires writing a new application program; there is no user-oriented query language.
Because of the embedded pointers, modifying a schema requires modification of the physical structure of the database, which means rebuilding the database, which is costly.
Domain: A (usually named) set/universe of atomic values, where by "atomic" we mean simply that, from the point of view of the database, each value in the domain is indivisible (i.e., cannot be broken down into component parts).
Examples of domains:
o SSN: string of digits of length nine
o Name: string of characters beginning with an upper case letter
o GPA: a real number between 0.0 and 4.0
o Sex: a member of the set { female, male }
o Dept_Code: a member of the set { CMPS, MATH, ENGL, PHYS, PSYC, ... }
These are all logical descriptions of domains. For implementation purposes, it is necessary to provide descriptions of domains in terms of concrete data types (or formats) that are provided by the DBMS (such as String, int, boolean), in a manner analogous to how programming languages have intrinsic data types.
Attribute: the name of the role played by some value (coming from some domain) in the context of a relational schema. The domain of attribute A is denoted dom(A). Tuple: A tuple is a mapping from attributes to values drawn from the respective domains of those attributes. A tuple is intended to describe some entity (or relationship between entities) in the miniworld. As an example, a tuple for a PERSON entity might be
{ Name --> "Keerthy", Sex --> Male, IQ --> 786 }
Relation: A (named) set of tuples all of the same form (i.e., having the same set of attributes). The term table is a loose synonym. Relational Schema: used for describing (the structure of) a relation. E.g., R(A1, A2, ..., An) says that R is a relation with attributes A1, ... An. The degree of a relation is the number of attributes it has, here n. Example: STUDENT(Name, SSN, Address) One would think that a "complete" relational schema would also specify the domain of each attribute.
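A sketch of the STUDENT schema as SQL DDL, with each attribute's domain rendered as a concrete data type (the types are assumptions):

    CREATE TABLE STUDENT (
        Name    VARCHAR(40),
        SSN     CHAR(9),     -- domain: string of digits of length nine
        Address VARCHAR(80)
    );  -- degree 3: one column per attribute of STUDENT(Name, SSN, Address)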
Relational Database: A collection of relations, each one consistent with its specified relational schema.
Characteristics of Relations
Ordering of Tuples: A relation is a set of tuples; hence, there is no order associated with them. That is, it makes no sense to refer to, for example, the 5th tuple in a relation. When a relation is depicted as a table, the tuples are necessarily listed in some order, of course, but you should attach no significance to that order. Similarly, when tuples are represented on a storage device, they must be organized in some fashion, and it may be advantageous, from a performance standpoint, to organize them in a way that depends upon their content.

Ordering of Attributes: A tuple is best viewed as a mapping from its attributes (i.e., the names we give to the roles played by the values comprising the tuple) to the corresponding values. Hence, the order in which the attributes are listed in a table is irrelevant. (Note that, unfortunately, the set-theoretic operations in relational algebra (at least as Elmasri & Navathe define them) make implicit use of the order of the attributes. Hence, E&N view attributes as being arranged as a sequence rather than a set.)

The Null value: used for don't know or not applicable.
Interpretation of a Relation: Each relation can be viewed as a predicate and each tuple an assertion that that predicate is satisfied (i.e., has value true) for the combination of values in it. In other words, each tuple represents a fact. Keep in mind that some relations represent facts about entities whereas others represent facts about relationships (between entities).
Constraints on a relational database fall into three categories:
o inherent model-based: e.g., no two tuples in a relation can be duplicates (because a relation is a set of tuples).
o schema-based: can be expressed using the DDL; this kind is the focus of this section.
o application-based: specific to the "business rules" of the miniworld, and typically difficult or impossible to express and enforce within the data model; hence they are left to application programs to enforce.
Elaborating upon schema-based constraints:
Domain Constraints: All the values that appear in a column of a relation must be taken from the same domain. A domain usually consists of the following components:
1. Domain name
2. Meaning
3. Data type
4. Size or length
5. Allowable values or allowable range (if applicable)
6. Data format
Key Constraints: A relation is a set of tuples, and each tuple's "identity" is given by the values of its attributes. Hence, it makes no sense for two tuples in a relation to be identical (because then the two tuples are actually one and the same tuple). That is, no two tuples may have the same combination of values in their attributes. Usually the miniworld dictates that there be (proper) subsets of attributes for which no two tuples may have the same combination of values. Such a set of attributes is called a superkey of its relation. From the fact that no two tuples can be identical, it follows that the set of all attributes of a relation constitutes a superkey of that relation. A key is a minimal superkey, i.e., a superkey such that, if we were to remove any of its attributes, the resulting set of attributes fails to be a superkey.
Example: Suppose that we stipulate that a faculty member is uniquely identified by Name and Address and also by Name and Department, but by no single one of the three attributes mentioned. Then { Name, Address, Department } is a (non-minimal) superkey, and each of { Name, Address } and { Name, Department } is a key (i.e., a minimal superkey).
Candidate key: any key! Primary key: a key chosen to act as the means by which to identify tuples in a relation. Typically, one prefers a primary key to be one having as few attributes as possible.
Relational Databases and Relational Database Schemas
A relational database schema is a set of schemas for its relations together with a set of integrity constraints. A relational database state/instance/snapshot is a set of states of its relations such that no integrity constraint is violated.
Entity Integrity, Referential Integrity, and Foreign Keys
Entity Integrity Constraint: In a tuple, none of the values of the attributes forming the relation's primary key may have the (non-)value null.
Referential Integrity Constraint: A foreign key of relation R is a set of its attributes intended to be used (by each tuple in R) for identifying/referring to a tuple in some relation S. (R is called the referencing relation and S the referenced relation.) For this to make sense, the set of attributes of R forming the foreign key should "correspond to" some superkey of S; indeed, by definition we require this superkey to be the primary key of S. The constraint says that, for every tuple in R, the tuple in S to which it refers must actually be in S. Note that a foreign key may refer to a tuple in the same relation, and that a foreign key may be part of a primary key. A foreign key may have the value null, in which case it does not refer to any tuple in the referenced relation.
A set of attributes FK in relation schema R1 is a foreign key of R1 that references relation R2 if it satisfies the following two rules:
1. The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the attributes FK are said to reference or refer to the relation R2.
2. A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value of PK for some tuple t2 in the current state r2(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2.
Semantic Integrity Constraints: application-specific restrictions that are unlikely to be expressible in DDL. Examples:
o the salary of a supervisee cannot be greater than that of her/his supervisor
o the salary of an employee cannot be lowered
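The schema-based constraints above map directly onto SQL DDL, while semantic constraints generally do not. A minimal sketch, assuming hypothetical table and column names (not taken from any particular figure in these notes):

CREATE TABLE DEPARTMENT (
    DNUMBER INT PRIMARY KEY,            -- entity integrity: a primary key can never be null
    DNAME   VARCHAR(30) NOT NULL
);

CREATE TABLE EMPLOYEE (
    SSN      CHAR(9) PRIMARY KEY,
    LNAME    VARCHAR(30) NOT NULL,
    SALARY   DECIMAL(10,2) CHECK (SALARY >= 0),   -- a domain constraint
    SUPERSSN CHAR(9) REFERENCES EMPLOYEE(SSN),    -- foreign key into the same relation
    DNO      INT REFERENCES DEPARTMENT(DNUMBER)   -- referential integrity; may be null
);

-- The semantic rule "a supervisee's salary cannot exceed the supervisor's"
-- cannot be expressed by these declarations alone; it would be enforced by a
-- trigger or by application code.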
Insert: an insertion can violate any of the constraints:
o domain constraint violation: some attribute value is not of the correct domain
o entity integrity violation: the key of the new tuple is null
o key constraint violation: the key of the new tuple is the same as an existing one
o referential integrity violation: the foreign key of the new tuple refers to a non-existent tuple
Ways of dealing with it: reject the attempt to insert, or give the user the opportunity to try again with different attribute values.
Delete: a deletion can violate referential integrity if other tuples reference the deleted tuple.
Three options for dealing with it:
o Reject the deletion.
o Attempt to cascade (or propagate) the deletion by deleting any referencing tuples (plus those that reference them, etc.).
o Modify the foreign key attribute values in referencing tuples to null or to some valid value referencing a different tuple.
Update:
Key constraint violation: the primary key is changed so as to become the same as another tuple's.
Referential integrity violation:
o a foreign key is changed and the new one refers to a nonexistent tuple
o a primary key is changed, and other tuples that had referred to this one now violate the constraint
Step 1: Mapping of Regular Entity Types. For each regular (strong) entity type E in the ER schema, create a relation R that includes all the simple attributes of E (and the simple components of its composite attributes), and choose one of the key attributes of E as the primary key of R.
Example: We create the relations EMPLOYEE, DEPARTMENT, and PROJECT in the relational schema corresponding to the regular entities in the ER diagram. SSN, DNUMBER, and PNUMBER are the primary keys for the relations EMPLOYEE, DEPARTMENT, and PROJECT as shown.
Step 2: Mapping of Weak Entity Types. For each weak entity type W in the ER schema with owner entity type E, create a relation R and include all simple attributes (or simple components of composite attributes) of W as attributes of R. In addition, include as foreign key attributes of R the primary key attribute(s) of the relation(s) that correspond to the owner entity type(s). The primary key of R is the combination of the primary key(s) of the owner(s) and the partial key of the weak entity type W, if any.
Example: Create the relation DEPENDENT in this step to correspond to the weak entity type DEPENDENT. Include the primary key SSN of the EMPLOYEE relation as a foreign key attribute of DEPENDENT (renamed to ESSN). The primary key of the DEPENDENT relation is the combination {ESSN, DEPENDENT_NAME} because DEPENDENT_NAME is the partial key of DEPENDENT.
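This weak-entity mapping can be written down as SQL DDL. A minimal sketch, with attribute types assumed:

CREATE TABLE DEPENDENT (
    ESSN            CHAR(9)     NOT NULL,
    DEPENDENT_NAME  VARCHAR(30) NOT NULL,
    PRIMARY KEY (ESSN, DEPENDENT_NAME),            -- owner's key + partial key
    FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN)    -- owner entity type EMPLOYEE
);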
Step 3: Mapping of Binary 1:1 Relationship Types. For each binary 1:1 relationship type R in the ER schema, identify the relations S and T that correspond to the entity types participating in R. There are three possible approaches:
(1) Foreign key approach: Choose one of the relations, say S, and include in it as a foreign key the primary key of T. It is better to choose, in the role of S, an entity type with total participation in R. Example: the 1:1 relationship type MANAGES is mapped by choosing the participating entity type DEPARTMENT to serve in the role of S, because its participation in the MANAGES relationship type is total.
(2) Merged relation option: An alternative mapping of a 1:1 relationship type is possible by merging the two entity types and the relationship into a single relation. This may be appropriate when both participations are total.
(3) Cross-reference or relationship relation option: The third alternative is to set up a third relation R for the purpose of cross-referencing the primary keys of the two relations S and T representing the entity types.
Step 4: Mapping of Binary 1:N Relationship Types. For each regular binary 1:N relationship type R, identify the relation S that represents the participating entity type at the N-side of the relationship type. Include as a foreign key in S the primary key of the relation T that represents the other entity type participating in R. Include any simple attributes of the 1:N relationship type as attributes of S.
Example: consider the 1:N relationship types WORKS_FOR, CONTROLS, and SUPERVISION in the figure. For WORKS_FOR we include the primary key DNUMBER of the DEPARTMENT relation as a foreign key in the EMPLOYEE relation and call it DNO.
Step 5: Mapping of Binary M:N Relationship Types. For each regular binary M:N relationship type R, create a new relation S to represent R. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types; their combination will form the primary key of S. Also include any simple attributes of the M:N relationship type (or simple components of composite attributes) as attributes of S.
Example: The M:N relationship type WORKS_ON from the ER diagram is mapped by creating a relation WORKS_ON in the relational database schema. The primary keys of the PROJECT and EMPLOYEE relations are included as foreign keys in WORKS_ON and renamed PNO and ESSN, respectively. Attribute HOURS in WORKS_ON represents the HOURS attribute of the relationship type. The primary key of the WORKS_ON relation is the combination of the foreign key attributes {ESSN, PNO}.
Step 6: Mapping of Multivalued Attributes. For each multivalued attribute A, create a new relation R. This relation R will include an attribute corresponding to A, plus the primary key attribute K (as a foreign key in R) of the relation that represents the entity type or relationship type that has A as an attribute. The primary key of R is the combination of A and K. If the multivalued attribute is composite, we include its simple components.
Example: The relation DEPT_LOCATIONS is created. The attribute DLOCATION represents the multivalued attribute LOCATIONS of DEPARTMENT, while DNUMBER (as a foreign key) represents the primary key of the DEPARTMENT relation. The primary key of DEPT_LOCATIONS is the combination {DNUMBER, DLOCATION}.
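As with the weak-entity case, steps 5 and 6 translate directly into DDL. A sketch with assumed types:

CREATE TABLE WORKS_ON (
    ESSN  CHAR(9) REFERENCES EMPLOYEE(SSN),
    PNO   INT     REFERENCES PROJECT(PNUMBER),
    HOURS DECIMAL(4,1),                           -- simple attribute of the M:N type
    PRIMARY KEY (ESSN, PNO)
);

CREATE TABLE DEPT_LOCATIONS (
    DNUMBER   INT REFERENCES DEPARTMENT(DNUMBER),
    DLOCATION VARCHAR(30),
    PRIMARY KEY (DNUMBER, DLOCATION)              -- K combined with A
);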
Step 7: Mapping of N-ary Relationship Types. For each n-ary relationship type R, where n > 2, create a new relation S to represent R. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types. Also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S.
Example: consider the relationship type SUPPLY in the ER diagram below. It can be mapped to the relation SUPPLY shown in the relational schema, whose primary key is the combination of the three foreign keys {SNAME, PARTNO, PROJNAME}.
SUMMARY: Correspondence between ER and Relational Models

ER Model                        Relational Model
Entity type                     Entity relation
1:1 or 1:N relationship type    Foreign key (or relationship relation)
M:N relationship type           Relationship relation and two foreign keys
n-ary relationship type         Relationship relation and n foreign keys
Simple attribute                Attribute
Composite attribute             Set of simple component attributes
Multivalued attribute           Relation and foreign key
Value set                       Domain
Key attribute                   Primary (or secondary) key
Relational Algebra: A Brief Introduction
1. Relational algebra and relational calculus are formal languages associated with the relational model.
2. Informally, relational algebra is a (high-level) procedural language and relational calculus a nonprocedural language.
3. Formally, however, both are equivalent to one another in expressive power.
4. A language that can produce any relation that can be derived using relational calculus is said to be relationally complete.
5. Relational algebra operations work on one or more relations to define another relation without changing the original relations.
6. Both operands and results are relations, so the output from one operation can become the input to another operation.
7. This allows expressions to be nested, just as in arithmetic. This property is called closure.
What? Why?
1. Relational algebra is similar to normal algebra (as in 2 + 3*x - y), except that we use relations as values instead of numbers.
2. It is not used as a query language in actual DBMSs (SQL is used instead).
3. The inner, lower-level operations of a relational DBMS are, or are similar to, relational algebra operations. We need to know about relational algebra to understand query execution and optimization in a relational DBMS.
4. Some advanced SQL queries require explicit relational algebra operations, most commonly outer join.
5. SQL is declarative, which means that you tell the DBMS what you want, but not how it is to be calculated. A C++ or Java program is procedural, which means that you have to state, step by step, exactly how the result should be calculated. Relational algebra is (more) procedural than SQL. (Actually, relational algebra consists of mathematical expressions.)
6. It provides a formal foundation for operations on relations.
7. It is used as a basis for implementing and optimizing queries in DBMS software.
8. DBMS programs add more operations which cannot be expressed in the relational algebra.
9. Relational calculus (tuple and domain calculus systems) also provides a formal foundation, but is more difficult to use. We'll skip these for now.
Relational Algebra:
o Relational algebra is the basic set of operations for the relational model.
o These operations enable a user to specify basic retrieval requests (or queries).
o The result of an operation is a new relation, which may have been formed from one or more input relations. This property makes the algebra closed (all objects in relational algebra are relations).
o The algebra operations thus produce new relations, which can be further manipulated using operations of the same algebra.
o A sequence of relational algebra operations forms a relational algebra expression. The result of a relational algebra expression is also a relation that represents the result of a database query (or retrieval request).
Relational Algebra consists of several groups of operations:
o Unary Relational Operations: SELECT (symbol: σ, sigma), PROJECT (symbol: π, pi), RENAME (symbol: ρ, rho)
o Relational Algebra Operations from Set Theory: UNION (∪), INTERSECTION (∩), DIFFERENCE (or MINUS, −), CARTESIAN PRODUCT (×)
o Binary Relational Operations: JOIN (several variations of JOIN exist), DIVISION
o Additional Relational Operations: OUTER JOINS, OUTER UNION, AGGREGATE FUNCTIONS (these compute summary information: for example, SUM, COUNT, AVG, MIN, MAX)
All examples discussed below refer to the COMPANY database shown here.
Relational Algebra Operations: Unary Relational Operations: SELECT
o The SELECT operation (denoted by σ, sigma) is used to select a subset of the tuples from a relation based on a selection condition.
o The selection condition acts as a filter: it keeps only those tuples that satisfy the qualifying condition. Tuples satisfying the condition are selected, whereas the other tuples are discarded (filtered out).
o Examples:
  Select the EMPLOYEE tuples whose department number is 4: σ DNO=4 (EMPLOYEE)
  Select the employee tuples whose salary is greater than $30,000: σ SALARY>30000 (EMPLOYEE)
o In general, the select operation is denoted by σ <selection condition> (R), where σ (sigma) denotes the select operator and the selection condition is a Boolean (conditional) expression specified on the attributes of relation R. Tuples that make the condition true are selected (they appear in the result of the operation); tuples that make the condition false are filtered out (discarded from the result of the operation).
o SELECT operation properties:
  The SELECT operation σ <selection condition> (R) produces a relation S that has the same schema (same attributes) as R.
  SELECT is commutative: σ <cond1> (σ <cond2> (R)) = σ <cond2> (σ <cond1> (R)).
  Because of the commutativity property, a cascade (sequence) of SELECT operations may be applied in any order: σ <cond1> (σ <cond2> (σ <cond3> (R))) = σ <cond2> (σ <cond3> (σ <cond1> (R))).
  A cascade of SELECT operations may be replaced by a single selection with a conjunction of all the conditions: σ <cond1> (σ <cond2> (σ <cond3> (R))) = σ <cond1> AND <cond2> AND <cond3> (R).
o The number of tuples in the result of a SELECT is less than (or equal to) the number of tuples in the input relation R.
Unary Relational Operations: PROJECT
o The PROJECT operation is denoted by π (pi).
o This operation keeps certain columns (attributes) from a relation and discards the other columns.
o PROJECT creates a vertical partitioning: the list of specified columns (attributes) is kept in each tuple, and the other attributes in each tuple are discarded.
o Example: to list each employee's first and last name and salary, the following is used: π LNAME,FNAME,SALARY (EMPLOYEE)
o The general form of the project operation is π <attribute list> (R), where π (pi) is the symbol used to represent the project operation and <attribute list> is the desired list of attributes from relation R.
o The project operation removes any duplicate tuples. This is because the result of the project operation must be a set of tuples, and mathematical sets do not allow duplicate elements.
o PROJECT operation properties:
  The number of tuples in the result of a projection π <list> (R) is always less than or equal to the number of tuples in R. If the list of attributes includes a key of R, then the number of tuples in the result of PROJECT is equal to the number of tuples in R.
  PROJECT is not commutative in general, but π <list1> (π <list2> (R)) = π <list1> (R) as long as <list2> contains the attributes in <list1>.
Unary Relational Operations: RENAME
o The RENAME operator is denoted by ρ (rho).
o In some cases, we may want to rename the attributes of a relation, or the relation name, or both. This is useful when a query requires multiple operations, and necessary in some cases (see the JOIN operation later).
o The general RENAME operation can be expressed by any of the following forms:
  ρ S(B1, B2, ..., Bn) (R) changes both the relation name (to S) and the column (attribute) names (to B1, B2, ..., Bn)
  ρ S (R) changes the relation name only (to S)
  ρ (B1, B2, ..., Bn) (R) changes the column (attribute) names only (to B1, B2, ..., Bn)
o For convenience, we also use a shorthand for renaming attributes in an intermediate relation:
  If we write RESULT ← π FNAME,LNAME,SALARY (DEP5_EMPS), RESULT will have the same attribute names as DEP5_EMPS (the same attributes as EMPLOYEE).
  If we write RESULT(F, M, L, S, B, A, SX, SAL, SU, DNO) ← DEP5_EMPS, the 10 attributes of DEP5_EMPS are renamed to F, M, L, S, B, A, SX, SAL, SU, DNO, respectively.
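Each of the three unary operations has a close SQL counterpart. A hedged sketch against the COMPANY schema (note that plain SQL keeps duplicates unless DISTINCT is given, unlike π):

-- SELECT (sigma): row filter
SELECT * FROM EMPLOYEE WHERE DNO = 4;

-- PROJECT (pi): column filter; DISTINCT mirrors pi's duplicate elimination
SELECT DISTINCT LNAME, FNAME, SALARY FROM EMPLOYEE;

-- RENAME (rho): AS renames relations and attributes
SELECT E.SSN AS EMP_SSN FROM EMPLOYEE AS E;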
Relational Algebra Operations from Set Theory: UNION
o UNION is a binary operation, denoted by ∪.
o The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S. Duplicate tuples are eliminated.
o The two operand relations R and S must be type compatible (or UNION compatible): R and S must have the same number of attributes, and each pair of corresponding attributes must be type compatible (have the same or compatible domains).
Example:
o To retrieve the social security numbers of all employees who either work in department 5 (RESULT1 below) or directly supervise an employee who works in department 5 (RESULT2 below), we can use the UNION operation as follows:
  DEP5_EMPS ← σ DNO=5 (EMPLOYEE)
  RESULT1 ← π SSN (DEP5_EMPS)
  RESULT2(SSN) ← π SUPERSSN (DEP5_EMPS)
  RESULT ← RESULT1 ∪ RESULT2
o The union operation produces the tuples that are in either RESULT1 or RESULT2 or both.
o Type compatibility of operands is required for the binary set operation UNION (and also for INTERSECTION and SET DIFFERENCE, covered below). R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) are type compatible if they have the same number of attributes and the domains of corresponding attributes are type compatible (i.e., dom(Ai) = dom(Bi) for i = 1, 2, ..., n).
o The resulting relation for R1 ∪ R2 (and likewise for R1 ∩ R2 and R1 − R2) has the same attribute names as the first operand relation R1 (by convention).
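The same example phrases naturally in SQL, whose UNION likewise eliminates duplicates and demands union-compatible operands:

SELECT SSN      FROM EMPLOYEE WHERE DNO = 5
UNION
SELECT SUPERSSN FROM EMPLOYEE WHERE DNO = 5;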
Relational Algebra Operations from Set Theory: INTERSECTION
o INTERSECTION is denoted by ∩.
o The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S.
o The attribute names in the result will be the same as the attribute names in R.
o The two operand relations R and S must be type compatible.
Relational Algebra Operations from Set Theory: MINUS
o SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by −.
o The result of R − S is a relation that includes all tuples that are in R but not in S.
o The attribute names in the result will be the same as the attribute names in R.
o The two operand relations R and S must be type compatible.
Some properties of UNION, INTERSECTION, and DIFFERENCE:
o Both union and intersection are commutative operations; that is, R ∪ S = S ∪ R and R ∩ S = S ∩ R.
o Both union and intersection can be treated as n-ary operations applicable to any number of relations, as both are associative operations; that is,
  R ∪ (S ∪ T) = (R ∪ S) ∪ T and (R ∩ S) ∩ T = R ∩ (S ∩ T)
o The minus operation is not commutative; that is, in general, R − S ≠ S − R.
Relational Algebra Operations from Set Theory: CARTESIAN PRODUCT
o The CARTESIAN (or CROSS) PRODUCT operation is used to combine tuples from two relations in a combinatorial fashion.
o It is denoted by R(A1, A2, ..., An) × S(B1, B2, ..., Bm).
o The result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm), in that order.
o The resulting relation state has one tuple for each combination of tuples: one from R and one from S. Hence, if R has nR tuples (denoted as |R| = nR) and S has nS tuples, then R × S will have nR * nS tuples.
o The two operands do NOT have to be type compatible.
o Generally, CROSS PRODUCT is not a meaningful operation on its own; it can become meaningful when followed by other operations.
o Example (not meaningful by itself):
  FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
  EMPNAMES ← π FNAME,LNAME,SSN (FEMALE_EMPS)
  EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
  EMP_DEPENDENTS will contain every combination of EMPNAMES and DEPENDENT, whether or not they are actually related.
o To keep only combinations where the DEPENDENT is related to the EMPLOYEE, we add a SELECT operation as follows.
o Example (meaningful):
  FEMALE_EMPS ← σ SEX='F' (EMPLOYEE)
  EMPNAMES ← π FNAME,LNAME,SSN (FEMALE_EMPS)
  EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
  ACTUAL_DEPS ← σ SSN=ESSN (EMP_DEPENDENTS)
  RESULT ← π FNAME,LNAME,DEPENDENT_NAME (ACTUAL_DEPS)
  RESULT will now contain the names of female employees and their dependents.
Binary Relational Operations: JOIN
o The sequence of CARTESIAN PRODUCT followed by SELECT is used quite commonly to identify and select related tuples from two relations.
o A special operation, called JOIN (denoted by ⋈), combines this sequence into a single operation.
o This operation is very important for any relational database with more than a single relation, because it allows us to combine related tuples from various relations.
o The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is: R ⋈ <join condition> S
o where R and S can be any relations that result from general relational algebra expressions.
Example: Suppose that we want to retrieve the name of the manager of each department.
o To get the manager's name, we need to combine each DEPARTMENT tuple with the EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department tuple.
o We do this by using the join operation: DEPT_MGR ← DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
o MGRSSN=SSN is the join condition. It combines each department record with the employee who manages the department. The join condition can also be specified as DEPARTMENT.MGRSSN = EMPLOYEE.SSN.
o Consider the following JOIN operation: R(A1, A2, ..., An) ⋈ R.Ai=S.Bj S(B1, B2, ..., Bm)
o The result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm), in that order.
o The resulting relation state has one tuple for each combination of tuples, r from R and s from S, but only if they satisfy the join condition r[Ai] = s[Bj].
o Hence, if R has nR tuples and S has nS tuples, the join result will generally have fewer than nR * nS tuples; only related tuples (based on the join condition) will appear in the result.
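In SQL the DEPT_MGR example is an explicit join with the same condition. A sketch (the projected column list is an assumption):

SELECT D.DNAME, E.FNAME, E.LNAME
FROM DEPARTMENT D
JOIN EMPLOYEE E ON D.MGRSSN = E.SSN;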
Some properties of JOIN:
o The general case of the JOIN operation is called a theta-join: R ⋈ θ S.
o The join condition is called theta. Theta can be any general Boolean expression on the attributes of R and S; for example: R.Ai < S.Bj AND (R.Ak = S.Bl OR R.Ap < S.Bq).
o Most join conditions involve one or more equality conditions ANDed together; for example: R.Ai = S.Bj AND R.Ak = S.Bl AND R.Ap = S.Bq.
Binary Relational Operations: EQUIJOIN
o The most common use of join involves join conditions with equality comparisons only.
o Such a join, where the only comparison operator used is =, is called an EQUIJOIN.
o In the result of an EQUIJOIN we always have one or more pairs of attributes (whose names need not be identical) that have identical values in every tuple.
o The JOIN seen in the previous example was an EQUIJOIN.
NATURAL JOIN Operation
o Another variation of JOIN, called NATURAL JOIN and denoted by *, was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition, because one of each pair of attributes with identical values is superfluous.
o The standard definition of natural join requires that the two join attributes, or each pair of corresponding join attributes, have the same name in both relations. If this is not the case, a renaming operation is applied first.
o Example: to apply a natural join on the DNUMBER attributes of DEPARTMENT and DEPT_LOCATIONS, it is sufficient to write: DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS. The only attribute with the same name is DNUMBER.
o An implicit join condition is created based on this attribute: DEPARTMENT.DNUMBER = DEPT_LOCATIONS.DNUMBER.
o Another example: Q ← R(A,B,C,D) * S(C,D,E). The implicit join condition includes each pair of attributes with the same name, ANDed together: R.C = S.C AND R.D = S.D. The result keeps only one attribute of each such pair: Q(A,B,C,D,E).
DIVISION Operation
o The division operation is applied to two relations: R(Z) ÷ S(X), where X is a subset of Z. Let Y = Z − X (and hence Z = X ∪ Y); that is, let Y be the set of attributes of R that are not attributes of S.
o The result of DIVISION is a relation T(Y) that includes a tuple t if tuples tR appear in R with tR[Y] = t and with tR[X] = tS for every tuple tS in S.
o In other words, for a tuple t to appear in the result T of the DIVISION, the values in t must appear in R in combination with every tuple in S.
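SQL has no direct DIVISION operator; the usual idiom is a doubly negated NOT EXISTS. A hedged sketch against the COMPANY schema, retrieving employees who work on every project:

SELECT E.SSN
FROM EMPLOYEE E
WHERE NOT EXISTS (             -- there is no project ...
    SELECT * FROM PROJECT P
    WHERE NOT EXISTS (         -- ... that E fails to work on
        SELECT * FROM WORKS_ON W
        WHERE W.ESSN = E.SSN AND W.PNO = P.PNUMBER));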
Additional Relational Operations: Aggregate Functions and Grouping
o A type of request that cannot be expressed in the basic relational algebra is to specify mathematical aggregate functions on collections of values from the database.
o Examples of such functions include retrieving the average or total salary of all employees or the total number of employee tuples.
o These functions are used in simple statistical queries that summarize information from the database tuples.
o Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values.
o The aggregate functional operation is written with the symbol ℱ (script F), as in the examples below.
o ℱ MAX Salary (EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation.
o ℱ MIN Salary (EMPLOYEE) retrieves the minimum salary value from the EMPLOYEE relation.
o ℱ SUM Salary (EMPLOYEE) retrieves the sum of the salaries from the EMPLOYEE relation.
o ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE) computes the count (number) of employees and their average salary. Note: COUNT just counts the number of rows, without removing duplicates.
o The previous examples all summarized one or more attributes for a set of tuples (e.g., maximum salary, or count of SSNs).
o Grouping can be combined with aggregate functions. Example: for each department, retrieve the DNO, COUNT SSN, and AVERAGE SALARY.
o A variation of the aggregate operation allows this: the grouping attribute is placed to the left of the ℱ symbol and the aggregate functions to its right: DNO ℱ COUNT SSN, AVERAGE Salary (EMPLOYEE)
o The above operation groups employees by DNO (department number) and computes the count of employees and the average salary per department.
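The grouping variant corresponds directly to SQL's GROUP BY. A minimal sketch of the same per-department query:

SELECT DNO, COUNT(SSN) AS NUM_EMPS, AVG(SALARY) AS AVG_SAL
FROM EMPLOYEE
GROUP BY DNO;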
The OUTER JOIN Operation
o In NATURAL JOIN and EQUIJOIN, tuples without a matching (or related) tuple are eliminated from the join result. Tuples with null in the join attributes are also eliminated. This amounts to a loss of information.
o A set of operations, called OUTER joins, can be used when we want to keep all the tuples in R, or all those in S, or all those in both relations in the result of the join, regardless of whether or not they have matching tuples in the other relation.
o The LEFT OUTER JOIN operation keeps every tuple in the first, or left, relation R in R ⟕ S; if no matching tuple is found in S, then the attributes of S in the join result are filled or padded with null values (see the SQL sketch below).
o A similar operation, RIGHT OUTER JOIN, keeps every tuple in the second, or right, relation S in the result of R ⟖ S.
o A third operation, FULL OUTER JOIN, denoted by ⟗, keeps all tuples in both the left and the right relations when no matching tuples are found, padding them with null values as needed.
OUTER UNION Operations
o The outer union operation was developed to take the union of tuples from two relations even if the relations are not type compatible.
o This operation will take the union of tuples in two relations R(X, Y) and S(X, Z) that are partially compatible, meaning that only some of their attributes, say X, are type compatible.
o The attributes that are type compatible are represented only once in the result, and those attributes that are not type compatible from either relation are also kept in the result relation T(X, Y, Z).
o Example: an outer union can be applied to two relations whose schemas are STUDENT(Name, SSN, Department, Advisor) and INSTRUCTOR(Name, SSN, Department, Rank).
o Tuples from the two relations are matched based on having the same combination of values of the shared attributes Name, SSN, Department.
o If a student is also an instructor, both Advisor and Rank will have a value; otherwise, one of these two attributes will be null.
o The result relation STUDENT_OR_INSTRUCTOR will have the following attributes: STUDENT_OR_INSTRUCTOR(Name, SSN, Department, Advisor, Rank)
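Returning to the OUTER JOIN operation above, SQL expresses it directly. A sketch on the COMPANY schema:

SELECT E.FNAME, E.LNAME, D.DNAME
FROM EMPLOYEE E
LEFT OUTER JOIN DEPARTMENT D ON E.DNO = D.DNUMBER;
-- employees with no department still appear, with DNAME padded with NULL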
Example (algebra vs. calculus): Get supplier numbers for suppliers who supply part P2. An algebraic version of this query might follow these steps:
1. Form the natural join of relations S and SP on S#.
2. Next, restrict the result of that join to tuples for part P2.
3. Finally, project the result of that restriction on S#.
A calculus formulation might look like: "Get S# for suppliers such that there exists a shipment SP with the same S# value and with P# value P2." The calculus formulation is descriptive, while the algebraic one is prescriptive.
Codd proposed the concept of a relational calculus (an applied predicate calculus tailored to relational databases).
o A relational calculus expression creates a new relation, which is specified in terms of variables that range over rows of the stored database relations (in tuple calculus) or over columns of the stored relations (in domain calculus).
o In a calculus expression, there is no order of operations to specify how to retrieve the query result; a calculus expression specifies only what information the result should contain.
o This is the main distinguishing feature between relational algebra and relational calculus.
o Relational calculus is considered to be a nonprocedural or declarative language.
o This differs from relational algebra, where we must write a sequence of operations to specify a retrieval request; hence relational algebra can be considered a procedural way of stating a query.
A tuple relational calculus query has the general form { t | COND(t) }, i.e., the set of tuples t for which the predicate COND is true. We also use the notation t.A to indicate the value of tuple t on attribute A, and R(t) to show that tuple t is in relation R.
Example: to find the first and last names of all employees whose salary is above $50,000, we can write the following tuple calculus expression:
o { t.FNAME, t.LNAME | EMPLOYEE(t) AND t.SALARY > 50000 }
o The condition EMPLOYEE(t) specifies that the range relation of tuple variable t is EMPLOYEE.
o The first and last name (PROJECTION on FNAME, LNAME) of each EMPLOYEE tuple t that satisfies the condition t.SALARY > 50000 (SELECTION SALARY > 50000) will be retrieved.
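SQL is itself based on the tuple relational calculus, and this expression translates almost word for word:

SELECT FNAME, LNAME
FROM EMPLOYEE          -- range relation of the tuple variable
WHERE SALARY > 50000;  -- the predicate COND(t)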
Example with universal quantification: find employees who work on all the projects controlled by department 5. We first exclude the tuples we are not interested in from R itself: the expression NOT(x.DNUM=5) evaluates to true for all tuples x that are in the PROJECT relation but are not controlled by department 5. Finally, we specify a condition that must hold on all the remaining tuples in R:
o (∃w)(WORKS_ON(w) AND w.ESSN = e.SSN AND x.PNUMBER = w.PNO)
Example (domain calculus):
o Retrieve the birthdate and address of the employee whose name is 'John B. Smith'.
o Query: { uv | (∃q)(∃r)(∃s)(∃t)(∃w)(∃x)(∃y)(∃z) (EMPLOYEE(qrstuvwxyz) AND q='John' AND r='B' AND s='Smith') }
o The abbreviated notation EMPLOYEE(qrstuvwxyz) uses the variables without the separating commas: EMPLOYEE(q,r,s,t,u,v,w,x,y,z).
o Ten variables for the EMPLOYEE relation are needed, one to range over the domain of each attribute in order. Of the ten variables q, r, s, ..., z, only u and v are free.
o We specify the requested attributes, BDATE and ADDRESS, by the free domain variables u for BDATE and v for ADDRESS.
o We specify the condition for selecting a tuple following the bar ( | ): namely, that the sequence of values assigned to the variables qrstuvwxyz be a tuple of the EMPLOYEE relation, and that the values for q (FNAME), r (MINIT), and s (LNAME) be 'John', 'B', and 'Smith', respectively.
Database Design
Requirements Analysis: user needs; what must the database do?
Conceptual Design: high-level description (often done with the ER model)
Logical Design: consistency, normalization
Physical Design: indexes, disk layout
Security Design: who accesses what
Good Database Design:
o no redundancy of FACT (!)
o no inconsistency
o no insertion, deletion or update anomalies
o no information loss
o no dependency loss
Informal Design Guidelines for Relational Databases
1. Semantics of the relation attributes
2. Redundant information in tuples and update anomalies
3. Null values in tuples
4. Spurious tuples
1: Semantics of the Relation Attributes
GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship instance (this applies to individual relations and their attributes).
o Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be mixed in the same relation.
o Only foreign keys should be used to refer to other entities.
o Entity and relationship attributes should be kept apart as much as possible.
Design a schema that can be explained easily, relation by relation. The semantics of the attributes should be easy to interpret.
2: Redundant Information in Tuples and Update Anomalies
Information that is stored redundantly wastes storage and causes problems with update anomalies:
o insertion anomalies
o deletion anomalies
o modification anomalies
Consider the relation EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours).
Insertion anomaly: a project cannot be inserted unless an employee is assigned to it.
Deletion anomalies:
a. When a project is deleted, it will result in deleting all the employees who work on that project. b. Alternately, if an employee is the sole employee on a project, deleting that employee would result in deleting the corresponding project. Modification anomalies Changing the name of project number P1 from Billing to Customer-Accounting may cause this update to be made for all 100 employees working on project P1. GUIDELINE 2: Design a schema that does not suffer from the insertion, deletion and update anomalies. If there are any anomalies present, then note them so that applications can be made to take them into account. 3:Null Values in Tuples GUIDELINE 3: Relations should be designed such that their tuples will have as few NULL values as possible Attributes that are NULL frequently could be placed in separate relations (with the primary key) Reasons for nulls: Attribute not applicable or invalid Attribute value unknown (may exist) Value known to exist, but unavailable 4:Spurious Tuples Bad designs for a relational database may result in erroneous results for certain JOIN operations The "lossless join" property is used to guarantee meaningful results for join operations GUIDELINE 4: The relations should be designed to satisfy the lossless join condition. No spurious tuples should be generated by doing a natural-join of any relations. Normalization: The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into smaller relations Normalization is used to design a set of relation schemas that is optimal from the point of view of database updating Normalization starts from a universal relation schema
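These anomalies disappear if EMP_PROJ is decomposed so that each fact is stored once. A hedged sketch (identifiers adapted to legal SQL names; types are assumptions):

CREATE TABLE EMP   (EmpNo  INT PRIMARY KEY, Ename VARCHAR(30));
CREATE TABLE PROJ  (ProjNo INT PRIMARY KEY, Pname VARCHAR(30));
CREATE TABLE WORKS (EmpNo  INT REFERENCES EMP,
                    ProjNo INT REFERENCES PROJ,
                    No_hours DECIMAL(5,1),
                    PRIMARY KEY (EmpNo, ProjNo));
-- A project can now be inserted with no employees, deleting the last worker
-- does not delete the project, and a project name is stored exactly once.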
1NF
Attributes must be atomic: they can be chars, ints, strings; they can't be:
1. tuples
2. sets
3. relations
4. composite
5. multivalued
This is considered to be part of the definition of a relation.
Unnormalised Relations

Name        PaperList
SWETHA      EENADU, HINDU, DC
PRASANNA    EENADU, VAARTHA, HINDU

This is not ideal. Each person is associated with an unspecified number of papers, and the items in the PaperList column do not have a consistent form. Generally, RDBMSs can't cope with relations like this: each entry in a table needs to hold a single data item. This is an unnormalised relation. All RDBMSs require relations not to be like this, i.e., not to have multiple values in any column (no repeating groups).

Name        PaperList
SWETHA      EENADU
SWETHA      HINDU
SWETHA      DC
PRASANNA    HINDU
PRASANNA    EENADU
PRASANNA    VAARTHA
This clearly contains the same information, and it has the property that we sought: it is in First Normal Form (1NF). A relation is in 1NF if no entry consists of more than one value (i.e., it does not have repeating groups). So this will be the first requirement in designing our databases.
Obtaining 1NF
1NF is obtained by:
o splitting composite attributes
o splitting the relation and propagating the primary key, to remove multivalued attributes
There are three approaches to removing repeating groups from unnormalised tables:
1. Remove the repeating groups by entering appropriate data in the empty columns of rows containing the repeating data.
2. Remove the repeating group by placing the repeating data, along with a copy of the original key attribute(s), in a separate relation. A primary key is identified for the new relation.
3. Find the maximum possible number of values for the multivalued attribute and add that many attributes to the relation.
Example:-
The DEPARTMENT schema is not in 1NF because DLOCATION is not a single-valued attribute. The relation should be split into two relations: a new relation DEPT_LOCATIONS is created, and the primary key of DEPARTMENT, DNUMBER, becomes an attribute of the new relation. The primary key of this relation is {DNUMBER, DLOCATION}.
Alternative solution: leave the DLOCATION attribute as it is, and instead have one tuple for each location of a DEPARTMENT. Then the relation is in 1NF, but redundancy exists.
A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes S ⊆ R with the property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]. A key K is a superkey with the additional property that removal of any attribute from K will cause K not to be a superkey any more. If a relation schema has more than one key, each is called a candidate key. One of the candidate keys is arbitrarily designated to be the primary key, and the others are called secondary keys. A prime attribute must be a member of some candidate key. A nonprime attribute is not a prime attribute; that is, it is not a member of any candidate key.
Functional dependency describes a relationship between attributes in a relation: if A and B are attributes of relation R, then B is functionally dependent on A (denoted A → B) if each value of A is associated with exactly one value of B. (A and B may each consist of one or more attributes.)
A trivial functional dependency is one in which the right-hand side is a subset (not necessarily a proper subset) of the left-hand side.
Main characteristics of the functional dependencies used in normalization: they have a one-to-one relationship between the attribute(s) on the left- and right-hand sides of the dependency, they hold for all time, and they are nontrivial.
The set of all functional dependencies that are implied by a given set of functional dependencies X is called the closure of X, written X+. A set of inference rules is needed to compute X+ from X.
Inference Rules (RATPUP):
1. Reflexivity: if B is a subset of A, then A → B
2. Augmentation: if A → B, then A,C → B,C
3. Transitivity: if A → B and B → C, then A → C
4. Projection: if A → B,C then A → B and A → C
5. Union: if A → B and A → C, then A → B,C
6. Pseudotransitivity: if A → B and B,C → D, then A,C → D
Example:
F = { SSN → {ENAME, BDATE, ADDRESS, DNUMBER},  DNUMBER → {DNAME, DMGRSSN} }
From this F we can infer: SSN → {DNAME, DMGRSSN}, SSN → SSN, DNUMBER → DNAME.
Full functional dependency: if A and B are attributes of a relation, B is fully functionally dependent on A if B is functionally dependent on A, but not on any proper subset of A. A functional dependency A → B is a partial dependency if some attribute can be removed from A and the dependency still holds.
2NF
Second normal form (2NF): a relation that is in first normal form and in which every non-key attribute is fully functionally dependent on the key. The normalization of 1NF relations to 2NF involves the removal of partial dependencies: if a partial dependency exists, we remove the functionally dependent attributes from the relation by placing them in a new relation along with a copy of their determinant.
Obtaining 2NF
o If a nonprime attribute is dependent only on a proper part of a key, then we take the given attribute, as well as the key attributes that determine it, and move them all to a new relation.
o We can bundle all attributes determined by the same subset of the key as a unit. A sketch of such a split appears below.
Transitive dependency: a condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C).
Third normal form (3NF): a relation that is in first and second normal form, and in which no non-primary-key attribute is transitively dependent on the primary key. The normalization of 2NF relations to 3NF involves the removal of transitive dependencies by placing the attribute(s) in a new relation along with a copy of the determinant.
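A minimal sketch of removing a partial dependency, using an abstract schema (names hypothetical): suppose R(A, B, C, D) has key {A, B} and the partial dependency B → C.

-- C depends only on B, a proper part of the key, so B and C move out:
CREATE TABLE R1 (B INT PRIMARY KEY,
                 C INT);                     -- B -> C is now a whole-key dependency
CREATE TABLE R2 (A INT,
                 B INT REFERENCES R1,
                 D INT,                      -- D still depends on the full key {A, B}
                 PRIMARY KEY (A, B));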
3NF
R is in 3NF if and only if, whenever a nontrivial functional dependency X → A holds in R, either:
o X is a superkey of R, or
o A is a key (prime) attribute of R
3NF: Alternative Definition
R is in 3NF if every nonprime attribute of R is fully functionally dependent on every key of R, and nontransitively dependent on every key of R.
Obtaining 3NF
o Split off the attributes in the FD that causes trouble and move them, so that there are two relations for each such FD.
o The determinant of the FD remains in the original relation.
Boyce-Codd normal form (BCNF): a relation is in BCNF if and only if every determinant is a key. The difference between 3NF and BCNF is that, for a functional dependency A → B, 3NF allows this dependency in a relation if B is a key attribute and A is not a superkey, whereas BCNF insists that, for this dependency to remain in a relation, A must be a superkey.
57 | P a g e
BCNF
R is in Boyce-Codd Normal Form iff, whenever X → A holds in R, X is a superkey of R. BCNF is more restrictive than 3NF, and is preferable: a BCNF schema has fewer anomalies.
Obtaining BCNF
As usual, split the schema to move the attributes of the troublesome FD to another relation, leaving its determinant in the original so they remain connected
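A standard textbook illustration (not taken from these notes): TEACH(Student, Course, Instructor) with FDs {Student, Course} → Instructor and Instructor → Course. Instructor is a determinant but not a key, so TEACH is in 3NF but not in BCNF. Splitting on the troublesome FD:

CREATE TABLE TEACHES (Instructor VARCHAR(30) PRIMARY KEY,  -- Instructor -> Course
                      Course     VARCHAR(30));
CREATE TABLE TAKES   (Student    VARCHAR(30),
                      Instructor VARCHAR(30) REFERENCES TEACHES,
                      PRIMARY KEY (Student, Instructor));

Note the trade-off: the decomposition is lossless, but the dependency {Student, Course} → Instructor is no longer enforceable within a single relation.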
Decomposition:
The process of decomposing the universal relation schema R into a set of relation schemas D = {R1, R2, ..., Rm} that will become the relational database schema, by using the functional dependencies.
Attribute preservation condition: each attribute in R will appear in at least one relation schema Ri in the decomposition, so that no attributes are lost.
Dependency Preservation Property of a Decomposition:
Definition: given a set of dependencies F on R, the projection of F on Ri, denoted by π_Ri(F) where Ri is a subset of R, is the set of dependencies X → Y in F+ such that the attributes in X ∪ Y are all contained in Ri. Hence, the projection of F on each relation schema Ri in the decomposition D is the set of functional dependencies in F+, the closure of F, such that all their left- and right-hand-side attributes are in Ri.
Dependency Preservation Property: a decomposition D = {R1, R2, ..., Rm} of R is dependency-preserving with respect to F if the union of the projections of F on each Ri in D is equivalent to F; that is, (π_R1(F) ∪ ... ∪ π_Rm(F))+ = F+.
Lossless (Non-additive) Join Property of a Decomposition:
Definition: a decomposition D = {R1, R2, ..., Rm} of R has the lossless (non-additive) join property with respect to the set of dependencies F on R if, for every relation state r of R that satisfies F, the following holds, where * is the natural join of all the relations in D: *(π_R1(r), ..., π_Rm(r)) = r.
Multi-valued dependency (MVD): represents a dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C; however, the sets of values for B and C are independent of each other. A multi-valued dependency can be further defined as being trivial or nontrivial:
o An MVD A →→ B in relation R is trivial if B is a subset of A, or if A ∪ B = R.
o An MVD is nontrivial if neither of the above two conditions is satisfied.
Fourth normal form (4NF): a relation that is in Boyce-Codd normal form and contains no nontrivial multi-valued dependencies. A relation schema R is in 4NF with respect to a set of dependencies F (that includes functional dependencies and multivalued dependencies) if, for every nontrivial multivalued dependency X →→ Y in F+, X is a superkey for R.
Definition: a join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema R, specifies a constraint on the states r of R.
The constraint states that every legal state r of R should have a non-additive join decomposition into R1, R2, ..., Rn; that is, for every such r we have *(π_R1(r), π_R2(r), ..., π_Rn(r)) = r. Note: an MVD is a special case of a JD where n = 2. A join dependency JD(R1, R2, ..., Rn), specified on relation schema R, is a trivial JD if one of the relation schemas Ri in JD(R1, R2, ..., Rn) is equal to R.
Fifth normal form (5NF)
Definition: a relation schema R is in fifth normal form (5NF) (or Project-Join Normal Form, PJNF) with respect to a set F of functional, multivalued, and join dependencies if, for every nontrivial join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by F), every Ri is a superkey of R.
Each normal form is strictly stronger than the previous one:
o Every 2NF relation is in 1NF
o Every 3NF relation is in 2NF
o Every BCNF relation is in 3NF
o Every 4NF relation is in BCNF
o Every 5NF relation is in 4NF
Normalization
UNF: a table that contains one or more repeating groups.
1NF: a relation in which the intersection of each row and column contains one and only one value.
2NF: a relation in 1NF in which every non-key attribute is fully functionally dependent on the key.
3NF: a relation in 1NF and 2NF in which no non-primary-key attribute is transitively dependent on the primary key.
BCNF: a relation in which every determinant is a candidate key.
4NF: a relation that is in BCNF and contains no nontrivial multi-valued dependency.
Basic 2-tier Client-Server Architectures
Specialized servers with specialized functions:
o Print server
o File server
o DBMS server
o Web server
o Email server
Clients can access the specialized servers as needed.
Clients
o Provide appropriate interfaces, through a client software module, to access and utilize the various server resources.
o Clients may be diskless machines, or PCs or workstations with disks, with only the client software installed.
o Connected to the servers via some form of network (LAN: local area network, wireless network, etc.).
DBMS Server
o Provides database query and transaction services to the clients.
o Relational DBMS servers are often called SQL servers, query servers, or transaction servers.
o Applications running on clients utilize an Application Program Interface (API) to access server databases via standard interfaces such as:
  ODBC: Open Database Connectivity standard
  JDBC: for Java programming access
o Client and server must install the appropriate client module and server module software for ODBC or JDBC.
1. A client program may connect to several DBMSs, sometimes called the data sources.
2. In general, data sources can be files or other non-DBMS software that manages data.
3. Other variations of clients are possible: e.g., in some object DBMSs, more functionality is transferred to clients, including data dictionary functions, and optimization and recovery across multiple servers.
Three-Tier Client-Server Architecture
o Common for Web applications.
o The intermediate layer, called the Application Server or Web Server, stores the web connectivity software and the business-logic part of the application used to access the corresponding data from the database server. It acts as a conduit for sending partially processed data between the database server and the client.
o The three-tier architecture can enhance security: the database server is only accessible via the middle tier, and clients cannot directly access the database server.
TRANSACTION PROCESSING
Transaction:
Def 1: a logical unit of database processing that includes one or more access operations (read: retrieval; write: insert or update; delete).
Def 2: a transaction is an executing program forming a logical unit of database access operations that involves one or more database operations (read: retrieval; write: insert or update; delete).
A transaction (set of operations) may be stand-alone, specified in a high-level language like SQL and submitted interactively, or may be embedded within a program.
Transaction boundaries:
o Begin and End transaction. An application program may contain several transactions separated by Begin and End transaction boundaries.
Basic operations are read and write:
o read_item(X): reads a database item named X into a program variable. To simplify our notation, we assume that the program variable is also named X.
o write_item(X): writes the value of program variable X into the database item named X.
Why Do We Need Transactions?
It's all about fast query response time and correctness. A DBMS is a multi-user system: many different requests arrive concurrently, some against the same data items. The DBMS must figure out how to interleave requests so as to shorten response time while still guaranteeing a correct result.
How does the DBMS know which actions belong together? Solution: group the database operations that must be performed together into transactions, and either execute all of a transaction's operations or none of them.
READ AND WRITE OPERATIONS: Basic unit of data transfer from the disk to the computer main memory is one block. In general, a data item (what is read or written) will be the field of some record in the database, although it may be a larger unit such as a record or even a whole block.
read_item(X) command includes the following steps: 1. Find the address of the disk block that contains item X. 2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer). 3. Copy item X from the buffer to the program variable named X. write_item(X) command includes the following steps: 1. Find the address of the disk block that contains item X. 2. Copy that disk block into a buffer in main memory (if that disk block is not already in some main memory buffer). 3. Copy item X from the program variable named X into its correct location in the buffer. 4. Store the updated block from the buffer back to disk (either immediately or at some later point in time). Example Transactions:
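As an example of such a transaction, a funds transfer might be written in SQL as follows (the ACCOUNT table and its columns are hypothetical). Each UPDATE corresponds to a read_item/write_item pair on one data item:

START TRANSACTION;
UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACCNO = 'X';  -- read_item(X); X := X - 100; write_item(X)
UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ACCNO = 'Y';  -- read_item(Y); Y := Y + 100; write_item(Y)
COMMIT;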
By implementing interleaved concurrency, the following problems may occur:
1. The Lost Update Problem: this occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
2. The Temporary Update (or Dirty Read) Problem: this occurs when one transaction updates a database item and then the transaction fails for some reason, and the updated item is accessed by another transaction before it is changed back to its original value.
3. The Incorrect Summary Problem: if one transaction is calculating an aggregate summary function on a number of records while other transactions are updating some of these records, the aggregate function may calculate some values before they are updated and others after they are updated.
4. The Unrepeatable Read Problem: transaction A reads the value of a data item multiple times during the transaction's life, but transaction B is allowed to update the data item in between A's reads. (Some sources also call this the repeatable read problem; the definition is the same.)
What causes a transaction to fail:
1. A computer failure (system crash): a hardware or software error occurs in the computer system during transaction execution. If the hardware crashes, the contents of the computer's internal memory may be lost.
2. A transaction or system error: some operation in the transaction may cause it to fail, such as integer overflow or division by zero. Transaction failure may also occur because of erroneous parameter values or because of a logical programming error. In addition, the user may interrupt the transaction during its execution.
3. Local errors or exception conditions detected by the transaction:
Certain conditions necessitate cancellation of the transaction. For example, data for the transaction may not be found. A condition, such as insufficient account balance in a banking database, may cause a transaction, such as a fund withdrawal from that account, to be canceled. A programmed abort in the transaction causes it to fail. 4. Concurrency control enforcement: The concurrency control method may decide to abort the transaction, to be restarted later, because it violates serializability or because several transactions are in a state of deadlock 5. Disk failure: Some disk blocks may lose their data because of a read or write malfunction or because of a disk read/write head crash. This may happen during a read or a write operation of the transaction. 6. Physical problems and catastrophes: This refers to an endless list of problems that includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes by mistake, and mounting of a wrong tape by the operator.
A transaction is an atomic unit of work that is either completed in its entirety or not done at all. For recovery purposes, the system needs to keep track of when the transaction starts, terminates, and commits or aborts.
Transaction states: active, partially committed, committed, failed, terminated.
The recovery manager keeps track of the following operations:
o begin_transaction: marks the beginning of transaction execution.
o read or write: specify read or write operations on the database items that are executed as part of a transaction.
o end_transaction: specifies that the read and write transaction operations have ended and marks the end limit of transaction execution. At this point it may be necessary to check whether the changes introduced by the transaction can be permanently applied to the database, or whether the transaction has to be aborted because it violates concurrency control or for some other reason.
o commit_transaction: signals a successful end of the transaction, so that any changes (updates) executed by the transaction can be safely committed to the database and will not be undone.
o rollback (or abort): signals that the transaction has ended unsuccessfully, so that any changes or effects that the transaction may have applied to the database must be undone.
Recovery techniques use the following operators:
o undo: similar to rollback, except that it applies to a single operation rather than to a whole transaction.
o redo: specifies that certain transaction operations must be redone to ensure that all the operations of a committed transaction have been applied successfully to the database.
Transaction States. A transaction can be in one of several states:
o Active: reading and writing data items; if something goes wrong during reading or writing, the transaction moves to Failed.
o Partially Committed: all reading and writing operations are done; the transaction moves to Failed if a rollback occurs, or to Committed if a commit occurs.
o Committed: the transaction completed successfully and all its write operations are made permanent in the database.
o Failed: the transaction is halted and all its operations are rolled back.
o Terminated: the transaction has left the system, having either committed or failed.
The System Log
o Log or journal: The log keeps track of all transaction operations that affect the values of database items. This information may be needed to permit recovery from transaction failures. The log is kept on disk, so it is not affected by any type of failure except for disk or catastrophic failure. In addition, the log is periodically backed up to archival storage (tape) to guard against such catastrophic failures. In the following discussion, T refers to a unique transaction-id that is generated automatically by the system and is used to identify each transaction.
o Types of log record:
[start_transaction,T]: Records that transaction T has started execution.
[write_item,T,X,old_value,new_value]: Records that transaction T has changed the value of database item X from old_value to new_value.
[read_item,T,X]: Records that transaction T has read the value of database item X.
[commit,T]: Records that transaction T has completed successfully, and affirms that its effect can be committed (recorded permanently) to the database.
[abort,T]: Records that transaction T has been aborted.
If the system crashes, we can recover to a consistent database state by examining the log. Because the log contains a record of every write operation that changes the value of some database item, it is possible to undo the effect of the write operations of a transaction T by tracing backward through the log and resetting all items changed by a write operation of T to their old_values. We can also redo the effect of the write operations of a transaction T by tracing forward through the log and setting all items changed by a write operation of T (that did not get done permanently) to their new_values.
Commit Point of a Transaction:
Definition of a commit point:
o A transaction T reaches its commit point when all its operations that access the database have been executed successfully and the effect of all the transaction operations on the database has been recorded in the log.
o Beyond the commit point, the transaction is said to be committed, and its effect is assumed to be permanently recorded in the database.
o The transaction then writes an entry [commit,T] into the log.
Rolling back transactions:
o Needed for transactions that have a [start_transaction,T] entry in the log but no [commit,T] entry.
Redoing transactions:
o Transactions that have written their commit entry in the log must also have recorded all their write operations in the log; otherwise they would not be committed, so their effect on the database can be redone from the log entries. (Notice that the log file must be kept on disk: at the time of a system crash, only the log entries that have been written back to disk are considered in the recovery process, because the contents of main memory may be lost.)
Force-writing the log:
o Before a transaction reaches its commit point, any portion of the log that has not yet been written to disk must be written to disk.
o This process is called force-writing the log file before committing a transaction.
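To make the undo/redo idea concrete, here is a minimal Python sketch. It assumes log records shaped like the record types above and a plain dict standing in for the database on disk; the function and variable names are invented for illustration.

# Minimal sketch of log-based recovery, assuming records shaped like
# ("start", T), ("write", T, X, old, new), ("commit", T), ("abort", T).
def recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # UNDO: scan backward, restoring old values written by uncommitted txns.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, t, x, old, new = rec
            db[x] = old

    # REDO: scan forward, reapplying new values written by committed txns.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, t, x, old, new = rec
            db[x] = new
    return db

log = [("start", 1), ("write", 1, "A", 20, 70), ("commit", 1),
       ("start", 2), ("write", 2, "B", 30, 80)]   # T2 never committed
print(recover(log, {"A": 20, "B": 30}))           # {'A': 70, 'B': 30}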
ACID properties:
1. Atomicity: A transaction is an atomic unit of processing; it is either performed in its entirety or not performed at all.
2. Consistency preservation: A correct execution of the transaction must take the database from one consistent state to another.
3. Isolation: A transaction should not make its updates visible to other transactions until it is committed; this property, when enforced strictly, solves the temporary update problem and makes cascading rollbacks of transactions unnecessary.
4. Durability or permanency: Once a transaction changes the database and the changes are committed, these changes must never be lost because of subsequent failure.
Recall that the database state is the set of data items and their values at a given point in time. A consistent state is one in which the present set of values does not violate any constraints on the schema.
Level of isolation - the level to which other transactions are isolated from the effects of a currently executing transaction. A transaction T has isolation level:
1. Level 0 - if T does not overwrite data items that have been read by another transaction (no dirty reads).
2. Level 1 - if T does not overwrite data items that have been written by another transaction (no lost updates).
3. Level 2 - no dirty reads or lost updates.
4. Level 3 (true isolation) - same as Level 2, with the ability to repeatedly read a data item throughout the life of the transaction.
EXAMPLE ILLUSTRATING ACID PROPERTIES:
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Atomicity requirement
o If the transaction fails after step 3 and before step 6, money will be lost, leading to an inconsistent database state. Failure could be due to software or hardware.
o The system should ensure that updates of a partially executed transaction are not reflected in the database.
Durability requirement - once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the database by the transaction must persist even if there are software or hardware failures.
Consistency requirement - in the above example, the sum of A and B is unchanged by the execution of the transaction. In general, consistency requirements include:
Explicitly specified integrity constraints, such as primary keys and foreign keys.
Implicit integrity constraints, e.g., the sum of balances of all accounts minus the sum of loan amounts must equal the value of cash-in-hand.
o A transaction must see a consistent database.
o During transaction execution the database may be temporarily inconsistent.
o When the transaction completes successfully the database must be consistent. Erroneous transaction logic can lead to inconsistency.
Isolation requirement - if between steps 3 and 6 another transaction T2 is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be):
T1: read(A); A := A - 50; write(A)
T2: read(A), read(B), print(A+B)
T1: read(B); B := B + 50; write(B)
Isolation can be ensured trivially by running transactions serially, that is, one after the other. However, executing multiple transactions concurrently has significant benefits, as we will see later.
Transaction Schedule:
Transaction schedule or history:
o When transactions are executing concurrently in an interleaved fashion, the order of execution of operations from the various transactions forms what is known as a transaction schedule (or history).
A schedule (or history) S of n transactions T1, T2, ..., Tn:
o It is an ordering of the operations of the transactions subject to the constraint that, for each transaction Ti that participates in S, the operations of Ti in S must appear in the same order in which they occur in Ti.
o Note, however, that operations from other transactions Tj can be interleaved with the operations of Ti in S.
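A minimal Python sketch of this transfer (not from the notes; the dict-based database, function name, and error handling are invented for illustration) shows atomicity enforced by restoring before-images on failure:

# Sketch: atomic transfer of 50 from A to B over a dict "database".
# If any step fails, the old values are restored (all-or-nothing).
def transfer(db, src, dst, amount):
    old = {src: db[src], dst: db[dst]}   # before-images, as in the log
    try:
        if db[src] - amount < 0:
            raise ValueError("insufficient balance")  # local error (cause 3)
        db[src] -= amount
        db[dst] += amount
    except Exception:
        db.update(old)                   # undo: restore before-images
        raise

db = {"A": 100, "B": 30}
transfer(db, "A", "B", 50)
print(db)   # {'A': 50, 'B': 80}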
T1: r(X); w(X); r(Y); w(Y); c
T2: r(X); w(X); c
Sample schedule:
S: r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y); c1; c2
Two operations conflict if they satisfy ALL three conditions:
a. they belong to different transactions, AND
b. they access the same item, AND
c. at least one of them is a write_item() operation.
Example: S: r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)
A schedule S of n transactions T1, T2, ..., Tn is said to be a COMPLETE SCHEDULE if the following conditions hold:
1. The operations in S are exactly those operations in T1, T2, ..., Tn, including a commit or abort operation as the last operation for each transaction in the schedule.
2. For any pair of operations from the same transaction Ti, their order of appearance in S is the same as their order in Ti.
3. For any two conflicting operations, one of the two must occur before the other in the schedule.
Schedules classified on recoverability:
Recoverable schedule: One where no committed transaction ever needs to be rolled back. A schedule S is recoverable if no transaction T in S commits until all transactions T' that have written an item that T reads have committed.
Cascadeless schedule: One where every transaction reads only items that were written by committed transactions.
Schedules requiring cascaded rollback: A schedule in which uncommitted transactions that read an item from a failed transaction must be rolled back.
Strict schedules: A schedule in which a transaction can neither read nor write an item X until the last transaction that wrote X has committed. All strict schedules are cascadeless, and all cascadeless schedules are recoverable.
Schedules classified on serializability:
Serial schedule: A schedule S is serial if, for every transaction T participating in the schedule, all the operations of T are executed consecutively in the schedule. Otherwise, the schedule is called a nonserial schedule.
Serializable schedule: A schedule S is serializable if it is equivalent to some serial schedule of the same n transactions.
Assumption: Every serial schedule is correct.
Goal: We would like to find nonserial schedules that are also correct, because in a serial schedule one transaction has to wait for another to complete; hence serial schedules are unacceptable in practice.
Result equivalent: Two schedules are called result equivalent if they produce the same final state of the database. Problem: they may produce the same result by accident!
S1: read_item(X); X := X + 10; write_item(X)
S2: read_item(X); X := X * 1.1; write_item(X)
(S1 and S2 are result equivalent only for the initial value X = 100.)
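The three conflict conditions above translate directly into code. A minimal Python sketch (the tuple representation and names are invented for illustration):

# Sketch: find conflicting pairs of operations in a schedule.
# Operations are (txn, action, item) tuples.
def conflicts(op1, op2):
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (t1 != t2              # (a) different transactions
            and x1 == x2          # (b) same item
            and "w" in (a1, a2))  # (c) at least one is a write

S = [(1, "r", "X"), (2, "r", "X"), (1, "w", "X"),
     (1, "r", "Y"), (2, "w", "X"), (1, "w", "Y")]

pairs = [(i, j) for i in range(len(S)) for j in range(i + 1, len(S))
         if conflicts(S[i], S[j])]
print(pairs)  # [(0, 4), (1, 2), (2, 4)]: r1(X)/w2(X), r2(X)/w1(X), w1(X)/w2(X)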
Conflict equivalent:
Two schedules are said to be conflict equivalent if the order of any two conflicting operations is the same in both schedules.
Conflict serializable:
A schedule S is said to be conflict serializable if it is conflict equivalent to some serial schedule S'. We can reorder the non-conflicting operations to improve efficiency.
Non-conflicting operations: reads and writes from the same transaction; reads from different transactions; reads and writes from different transactions on different data items.
Conflicting operations: reads and writes from different transactions on the same data item.
Test for serializability:
Construct a directed graph, the precedence graph, G = (V, E):
V: the set of all transactions participating in the schedule.
E: the set of edges Ti -> Tj for which one of the following holds:
- Ti executes a write_item(X) before Tj executes read_item(X)
- Ti executes a read_item(X) before Tj executes write_item(X)
- Ti executes a write_item(X) before Tj executes write_item(X)
An edge Ti -> Tj means that in any serial schedule equivalent to S, Ti must come before Tj.
If G has a cycle, then S is not conflict serializable. If not, use a topological sort to obtain a serializable schedule (a linear order consistent with the precedence order of the graph).
[Figure 17.7: Constructing the precedence graphs for schedules A and D from Figure 17.5 to test for conflict serializability: precedence graphs for serial schedules A and B; precedence graph for schedule C (not serializable); precedence graph for schedule D (serializable, equivalent to schedule A).]
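A minimal Python sketch of this test, reusing the (txn, action, item) representation from the earlier conflict sketch (all names are invented for illustration):

# Sketch: build a precedence graph from a schedule and test for cycles.
def precedence_graph(S):
    edges = set()
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            ti, a1, x1 = S[i]
            tj, a2, x2 = S[j]
            if ti != tj and x1 == x2 and "w" in (a1, a2):
                edges.add((ti, tj))        # Ti must precede Tj
    return edges

def has_cycle(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    def dfs(node, path):                   # depth-first search for a cycle
        if node in path:
            return True
        return any(dfs(nxt, path | {node}) for nxt in adj.get(node, []))
    return any(dfs(n, set()) for n in adj)

S = [(1, "r", "X"), (2, "r", "X"), (1, "w", "X"),
     (1, "r", "Y"), (2, "w", "X"), (1, "w", "Y")]
E = precedence_graph(S)
print(E)             # {(1, 2), (2, 1)} (order may vary)
print(has_cycle(E))  # True -> this schedule is not conflict serializable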
View equivalence: A less restrictive definition of equivalence of schedules.
View serializability: A definition of serializability based on view equivalence. A schedule is view serializable if it is view equivalent to a serial schedule.
Two schedules S and S' are said to be view equivalent if the following three conditions hold:
1. The same set of transactions participates in S and S', and S and S' include the same operations of those transactions.
2. For any operation Ri(X) of Ti in S, if the value of X read by the operation has been written by an operation Wj(X) of Tj (or if it is the original value of X before the schedule started), the same condition must hold for the value of X read by operation Ri(X) of Ti in S'.
3. If the operation Wk(Y) of Tk is the last operation to write item Y in S, then Wk(Y) of Tk must also be the last operation to write item Y in S'.
The premise behind view equivalence: as long as each read operation of a transaction reads the result of the same write operation in both schedules, the write operations of each transaction must produce the same results. The view: the read operations are said to see the same view in both schedules.
Relationship between view and conflict equivalence:
The two are the same under the constrained write assumption, which assumes that if T writes X, it is constrained by the value of X it read; i.e., new X = f(old X). Conflict serializability is stricter than view serializability: with unconstrained writes (blind writes), a schedule that is view serializable is not necessarily conflict serializable. Any conflict serializable schedule is also view serializable, but not vice versa.
Consider the following schedule of three transactions T1: r1(X), w1(X); T2: w2(X); and T3: w3(X):
Schedule Sa: r1(X); w2(X); w1(X); w3(X); c1; c2; c3;
In Sa, the operations w2(X) and w3(X) are blind writes, since T2 and T3 do not read the value of X. Sa is view serializable, since it is view equivalent to the serial schedule T1, T2, T3. However, Sa is not conflict serializable, since it is not conflict equivalent to any serial schedule.
Transaction Support in SQL2
A single SQL statement is always considered to be atomic:
o Either the statement completes execution without error, or it fails and leaves the database unchanged.
With SQL, there is no explicit Begin Transaction statement:
o Transaction initiation is done implicitly when particular SQL statements are encountered.
Every transaction must have an explicit end statement, which is either a COMMIT or a ROLLBACK.
Characteristics specified by a SET TRANSACTION statement in SQL2:
Access mode:
o READ ONLY or READ WRITE. The default is READ WRITE, unless the isolation level READ UNCOMMITTED is specified, in which case READ ONLY is assumed.
Diagnostic size n: specifies an integer value n, indicating the number of conditions that can be held simultaneously in the diagnostic area.
Isolation level <isolation>, where <isolation> can be READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, or SERIALIZABLE. The default is SERIALIZABLE.
o With SERIALIZABLE, the interleaved execution of transactions will adhere to our notion of serializability.
o However, if any transaction executes at a lower level, then serializability may be violated.
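As a hedged sketch, these characteristics might be set through a generic Python DB-API connection as follows. The driver import and connection string are placeholders, and the exact SET TRANSACTION syntax and defaults vary by product; only the general shape follows the description above.

import some_dbapi_driver as db   # placeholder for a real driver

conn = db.connect("...")         # connection details elided
cur = conn.cursor()

# Transaction characteristics per the SQL2 description above.
cur.execute("SET TRANSACTION READ WRITE, "
            "ISOLATION LEVEL SERIALIZABLE, "
            "DIAGNOSTICS SIZE 5")
try:
    cur.execute("UPDATE account SET balance = balance - 50 WHERE id = 'A'")
    cur.execute("UPDATE account SET balance = balance + 50 WHERE id = 'B'")
    conn.commit()                # explicit end: COMMIT
except Exception:
    conn.rollback()              # explicit end: ROLLBACK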
For example, if a transaction T1 reads the set of rows satisfying some condition and another transaction then inserts a new row satisfying that condition, a repeated read by T1 will see a row that previously did not exist, called a phantom.
Possible violations of serializability:

Isolation level     Dirty read   Nonrepeatable read   Phantom
READ UNCOMMITTED    yes          yes                  yes
READ COMMITTED      no           yes                  yes
REPEATABLE READ     no           no                   yes
SERIALIZABLE        no           no                   no

SERIALIZABLE is the default. REPEATABLE READ allows only committed records to be read, and repeated reads of the same record must return the same value; however, a transaction may not be serializable: it may find some records inserted by another transaction but not others. READ COMMITTED allows only committed records to be read, but successive reads of a record may return different (but committed) values. READ UNCOMMITTED allows even uncommitted records to be read.
Introduction to Concurrency
What is concurrency? Concurrency in terms of databases means allowing multiple users to access the data contained within a database at the same time. If concurrent access is not managed by the Database Management System (DBMS) so that simultaneous operations don't interfere with one another, problems can occur when various transactions interleave, resulting in an inconsistent database.
Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions. Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins. Concurrent execution of user programs is essential for good DBMS performance: because disk accesses are frequent and relatively slow, it is important to keep the CPU busy by working on several user programs concurrently. Interleaving actions of different user programs can lead to inconsistency: e.g., a check is cleared while the account balance is being computed. The DBMS ensures such problems don't arise: users can pretend they are using a single-user system.
Purpose of Concurrency Control
o To enforce isolation (through mutual exclusion) among conflicting transactions.
o To preserve database consistency through consistency-preserving execution of transactions.
o To resolve read-write and write-write conflicts.
o Example: in a concurrent execution environment, if T1 conflicts with T2 over a data item A, the concurrency control mechanism decides whether T1 or T2 should get A and whether the other transaction is rolled back or waits.
LOCK
Definition: A lock is a variable associated with a data item that describes the status of the item with respect to the operations that can be applied to it.
Two-Phase Locking Techniques: Binary locks: Locked/unlocked
The simplest kind of lock is a binary on/off lock. It can be created by storing a lock bit with each database item. If the lock bit is on (e.g., = 1), then the item cannot be accessed by any transaction, either for reading or writing; if it is off (e.g., = 0), then the item is available. A binary lock enforces mutual exclusion on the data item.
Binary locks are simple but restrictive: transactions must lock every data item that is read or written, and no data item can be accessed concurrently.
Locking is an operation that secures (a) permission to read or (b) permission to write a data item for a transaction. Example: lock(X) - data item X is locked on behalf of the requesting transaction. Unlocking is an operation that removes these permissions from the data item. Example: unlock(X) - data item X is made available to all other transactions. Lock and unlock are atomic operations.
Lock manager: manages locks on data items.
Lock table: the lock manager uses it to store the identity of the transaction locking a data item, the data item, and the lock mode. One simple way to implement a lock table is through a linked list:
<locking_transaction, data_item, LOCK>
The following code performs the lock operation:
B: if LOCK(X) = 0 (*item is unlocked*)
     then LOCK(X) <- 1 (*lock the item*)
     else begin
       wait (until LOCK(X) = 0 and the lock manager wakes up the transaction);
       go to B
     end;
The following code performs the unlock operation:
LOCK(X) <- 0 (*unlock the item*)
if any transactions are waiting
  then wake up one of the waiting transactions;
Multiple-mode locks: read/write, a.k.a. shared/exclusive
Three operations: read_lock(X), write_lock(X), unlock(X). Each data item can be in one of three lock states.
Three lock modes: (a) shared (read), (b) exclusive (write), (c) unlock (release).
Shared mode: read_lock(X). More than one transaction can apply a shared lock on X for reading its value, but no write lock can be applied on X by any other transaction.
Exclusive mode: write_lock(X). Only one write lock on X can exist at any time, and no shared lock can be applied on X by any other transaction.
Unlock mode: unlock(X). Whatever lock (read or write) the transaction holds on X is released by the transaction issuing this operation.
Lock table: the lock manager uses it to store the identity of the transaction locking a data item, the data item, the lock mode, and the number of transactions currently reading the data item. It looks like:
<data_item, read_LOCK, no_of_transactions, transaction_id>
This protocol isn't enough to guarantee serializability: if locks are released too early, problems can arise. This usually happens when a lock is released before another lock is acquired.
The following code performs the read_lock operation:
B: if LOCK(X) = "unlocked"
     then begin LOCK(X) <- "read-locked"; no_of_reads(X) <- 1; end
   else if LOCK(X) = "read-locked"
     then no_of_reads(X) <- no_of_reads(X) + 1
   else begin
     wait (until LOCK(X) = "unlocked" and the lock manager wakes up the transaction);
     go to B
   end;
The following code performs the write_lock operation:
B: if LOCK(X) = "unlocked"
     then LOCK(X) <- "write-locked";
   else begin
     wait (until LOCK(X) = "unlocked" and the lock manager wakes up the transaction);
     go to B
   end;
The following code performs the unlock operation:
if LOCK(X) = "write-locked"
  then begin
    LOCK(X) <- "unlocked";
    wake up one of the waiting transactions, if any
  end
else if LOCK(X) = "read-locked"
  then begin
    no_of_reads(X) <- no_of_reads(X) - 1;
    if no_of_reads(X) = 0
      then begin
        LOCK(X) <- "unlocked";
        wake up one of the waiting transactions, if any
      end
  end;
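A minimal Python sketch of such a shared/exclusive lock table (class and method names are invented for illustration; for simplicity a refused request returns False rather than actually blocking):

# Sketch of a shared/exclusive lock table, mirroring the pseudocode above.
class LockTable:
    def __init__(self):
        self.locks = {}   # item -> ("read", {txns}) or ("write", txn)

    def read_lock(self, t, x):
        mode = self.locks.get(x)
        if mode is None:
            self.locks[x] = ("read", {t})
            return True
        if mode[0] == "read":
            mode[1].add(t)               # share the read lock
            return True
        return False                     # write-locked: t must wait

    def write_lock(self, t, x):
        if x not in self.locks:
            self.locks[x] = ("write", t)
            return True
        return False                     # locked in any mode: t must wait

    def unlock(self, t, x):
        kind, holder = self.locks[x]
        if kind == "write":
            del self.locks[x]            # item becomes available
        else:
            holder.discard(t)
            if not holder:               # last reader leaves
                del self.locks[x]

lt = LockTable()
print(lt.read_lock(1, "X"), lt.read_lock(2, "X"))  # True True (shared)
print(lt.write_lock(3, "X"))                       # False (must wait)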
Lock conversion:
Lock upgrade: convert an existing read lock to a write lock.
  if Ti has a read-lock(X) and no other transaction Tj (i != j) has a read-lock(X)
    then convert read-lock(X) to write-lock(X)
    else force Ti to wait until the other transactions unlock X
Lock downgrade: convert an existing write lock to a read lock.
  if Ti has a write-lock(X) (*no other transaction can have any lock on X*)
    then convert write-lock(X) to read-lock(X)
Two-Phase Locking Techniques: The algorithm
The timing of locks is also important in avoiding concurrency problems. A simple requirement to ensure transactions are serializable is that all read and write locks in a transaction are issued before the first unlock operation; this is known as the two-phase locking protocol. A transaction is divided into two phases:
growing phase - new locks are acquired but none are released
shrinking phase - existing locks are released but no new ones are acquired
During the shrinking phase no new locks can be acquired! Downgrading is OK; upgrading is not.
The rules of 2PL are as follows (see the checker sketch after this list):
If T wants to read an object, it needs a read_lock.
If T wants to write an object, it needs a write_lock.
Once a lock is released, no new ones can be acquired.
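A minimal Python sketch of checking the two-phase property, i.e., that no lock is acquired after the first unlock (the list representation and names are invented for illustration):

# Sketch: does a transaction's operation sequence obey 2PL?
def is_two_phase(ops):
    shrinking = False
    for op in ops:                       # op is "lock" or "unlock"
        if op == "unlock":
            shrinking = True             # shrinking phase has begun
        elif shrinking:                  # a lock after an unlock: violation
            return False
    return True

print(is_two_phase(["lock", "lock", "unlock", "unlock"]))  # True
print(is_two_phase(["lock", "unlock", "lock"]))            # False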
The 2PL protocol guarantees serializability: any schedule of transactions that follows 2PL will be serializable, so we do not need to test the schedule for serializability. But it may limit the amount of concurrency, since transactions may have to hold onto locks longer than needed, creating a new problem: deadlocks.
Two-Phase Locking Techniques: The algorithm
Here is an example without 2PL:
T1: read_lock(Y); read_item(Y); unlock(Y); write_lock(X); read_item(X); X := X + Y; write_item(X); unlock(X);
T2: read_lock(X); read_item(X); unlock(X); write_lock(Y); read_item(Y); Y := X + Y; write_item(Y); unlock(Y);
Initial values: X = 20, Y = 30.
Result of serial execution T1 followed by T2: X = 50, Y = 80.
Result of serial execution T2 followed by T1: X = 70, Y = 50.
A nonserializable interleaving (T1 releases its lock on Y too early):
T1: read_lock(Y); read_item(Y); unlock(Y);
T2: read_lock(X); read_item(X); unlock(X); write_lock(Y); read_item(Y); Y := X + Y; write_item(Y); unlock(Y);
T1: write_lock(X); read_item(X); X := X + Y; write_item(X); unlock(X);
This interleaving yields X = 50, Y = 50, which matches neither serial result.
Here is an example with 2PL:
T1: read_lock(Y); read_item(Y); write_lock(X); unlock(Y); read_item(X); X := X + Y; write_item(X); unlock(X);
T2: read_lock(X); read_item(X); write_lock(Y); unlock(X); read_item(Y); Y := X + Y; write_item(Y); unlock(Y);
Result: T1 and T2 follow the two-phase policy, but they are subject to deadlock, which must be dealt with.
The two-phase policy generates four locking algorithms:
1. BASIC
2. CONSERVATIVE
3. STRICT
4. RIGOROUS
The previous technique is known as basic 2PL.
Conservative (static) 2PL: Lock all items needed BEFORE execution begins, by predeclaring the transaction's read and write sets. If any of the items in the read or write set is already locked (by other transactions), the transaction waits (does not acquire any locks). Deadlock free, but not very practical.
Strict 2PL: A transaction does not release its write locks until AFTER it aborts or commits. Not deadlock free, but guarantees recoverable schedules (strict schedule: a transaction can neither read nor write X until the last transaction that wrote X has committed or aborted). The most popular variation of 2PL.
Rigorous 2PL: No lock is released until after abort/commit; the transaction is in its expanding phase until it ends.
DEADLOCKS
Two problems with locks: deadlock and starvation.
Deadlock occurs when each transaction Ti in a set of two or more transactions is waiting on an item locked by some other transaction Tj in the set.
Deadlock example:
T1: read_lock(Y); read_item(Y);
T2: read_lock(X); read_item(X);
T1: write_lock(X); (waits for X, held by T2)
T2: write_lock(Y); (waits for Y, held by T1)
T1 and T2 both followed the two-phase policy, but they are deadlocked.
Deadlock Prevention
Locking as deadlock prevention leads to very inefficient schedules (e.g., conservative 2PL). A better approach uses transaction timestamps TS(T): a timestamp is a unique identifier assigned to each transaction; if T1 starts before T2, then TS(T1) < TS(T2) (the older transaction has the smaller timestamp value). Two schemes use timestamps: wait-die and wound-wait.
Wait-Die Scheme
Assume Ti tries to lock X, which is locked by Tj.
If TS(Ti) < TS(Tj) (Ti is older than Tj), then Ti is allowed to wait.
Otherwise (Ti is younger than Tj), abort Ti (Ti dies) and restart it later with the SAME timestamp.
An older transaction is allowed to wait on a younger transaction; a younger transaction requesting an item held by an older transaction is aborted and restarted.
Wound-Wait Scheme
Assume Ti tries to lock X, which is locked by Tj.
If TS(Ti) < TS(Tj) (Ti is older than Tj), abort Tj (Ti wounds Tj) and restart it later with the SAME timestamp.
Otherwise (Ti is younger than Tj), Ti is allowed to wait.
A younger transaction is allowed to wait on an older transaction; an older transaction requesting an item held by a younger transaction preempts the younger one by aborting it.
Both schemes abort the younger of the two transactions that may be involved in a deadlock. Both are deadlock free, but both may cause needless aborts.
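The two decisions are easy to state as code. A minimal Python sketch (the dict of timestamps and function names are invented for illustration):

# Sketch: wait-die and wound-wait decisions when Ti requests an item held by Tj.
# ts maps transaction id -> timestamp (smaller = older).
def wait_die(ts, ti, tj):
    return "Ti waits" if ts[ti] < ts[tj] else "abort Ti (dies)"

def wound_wait(ts, ti, tj):
    return "abort Tj (wounded)" if ts[ti] < ts[tj] else "Ti waits"

ts = {1: 10, 2: 20}          # T1 is older than T2
print(wait_die(ts, 1, 2))    # Ti waits            (older waits)
print(wait_die(ts, 2, 1))    # abort Ti (dies)     (younger dies)
print(wound_wait(ts, 1, 2))  # abort Tj (wounded)  (older preempts younger)
print(wound_wait(ts, 2, 1))  # Ti waits            (younger waits)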
More Deadlock Prevention
Waiting schemes (require no timestamps):
No waiting: if a transaction cannot obtain a lock, it is aborted immediately and restarted after time t. This can cause needless restarts.
Cautious waiting:
Suppose Ti tries to lock item X, which is locked by Tj.
If Tj is not blocked, Ti is blocked and allowed to wait; otherwise, abort Ti.
Cautious waiting is deadlock-free.
Deadlock Detection
The DBMS checks whether a deadlock has occurred. This works well if there are few, short transactions with little interference; otherwise, deadlock prevention should be used. Two approaches to deadlock detection:
1. Wait-for graph: if there is a cycle, abort one of the transactions (victim selection).
2. Timeouts.
Starvation
A transaction cannot continue for an indefinite amount of time while others proceed normally. When does this happen? With an unfair waiting scheme that gives priority to certain transactions; e.g., in deadlock detection, if we always choose the victim based on cost factors, the same transaction may be picked as victim every time. Including the number of rollbacks in the cost factor avoids this.
Timestamp-Based Concurrency Control
Timestamp: a monotonically increasing variable (integer) indicating the age of an operation or a transaction. A larger timestamp value indicates a more recent event or operation. Timestamp-based algorithms use timestamps to serialize the execution of concurrent transactions.
Basic Timestamp Ordering
1. Transaction T issues a write_item(X) operation:
(a) If read_TS(X) > TS(T) or write_TS(X) > TS(T), then a younger transaction has already read or written the data item, so abort and roll back T and reject the operation.
(b) If the condition in part (a) does not hold, then execute write_item(X) of T and set write_TS(X) to TS(T).
2. Transaction T issues a read_item(X) operation:
(a) If write_TS(X) > TS(T), then a younger transaction has already written to the data item, so abort and roll back T and reject the operation.
(b) If write_TS(X) <= TS(T), then execute read_item(X) of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
Strict Timestamp Ordering
1. Transaction T issues a write_item(X) operation: if TS(T) > write_TS(X), then delay T until the transaction T' that wrote X has terminated (committed or aborted).
2. Transaction T issues a read_item(X) operation: if TS(T) > write_TS(X), then delay T until the transaction T' that wrote X has terminated (committed or aborted).
Multiversion Concurrency Control Techniques
o This approach maintains a number of versions of a data item and allocates the right version to a read operation of a transaction. Thus, unlike in other mechanisms, a read operation in this mechanism is never rejected.
o Side effect:
Significantly more storage (RAM and disk) is required to maintain multiple versions. To check unlimited growth of versions, a garbage collection is run when some criterion is satisfied.
Multiversion Technique Based on Timestamp Ordering
Assume X1, X2, ..., Xn are the versions of a data item X created by the write operations of transactions. With each version Xi, a read_TS (read timestamp) and a write_TS (write timestamp) are associated:
read_TS(Xi): the read timestamp of Xi is the largest of all the timestamps of transactions that have successfully read version Xi.
write_TS(Xi): the write timestamp of Xi is the timestamp of the transaction that wrote the value of version Xi.
A new version of X is created only by a write operation. To ensure serializability, the following two rules are used:
1. If transaction T issues write_item(X), and version Xi of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back T; otherwise, create a new version Xj of X and set read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues read_item(X), find the version Xi of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value of Xi to T, and set the value of read_TS(Xi) to the largest of TS(T) and the current read_TS(Xi).
Rule 2 guarantees that a read will never be rejected.
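A minimal Python sketch of these two rules for a single item X (the dict-of-versions representation and names are invented for illustration):

# Sketch of multiversion timestamp ordering for one item X.
# versions: list of dicts {value, write_ts, read_ts}.
def latest_version(versions, ts):
    """Version with the highest write_TS <= TS(T)."""
    return max((v for v in versions if v["write_ts"] <= ts),
               key=lambda v: v["write_ts"])

def read_item(versions, ts):
    v = latest_version(versions, ts)
    v["read_ts"] = max(v["read_ts"], ts)   # rule 2: a read is never rejected
    return v["value"]

def write_item(versions, ts, value):
    v = latest_version(versions, ts)
    if v["read_ts"] > ts:                  # rule 1: a younger txn already read it
        return "abort"
    versions.append({"value": value, "write_ts": ts, "read_ts": ts})
    return "ok"

X = [{"value": 20, "write_ts": 0, "read_ts": 0}]
print(read_item(X, 5))        # 20; sets that version's read_TS to 5
print(write_item(X, 3, 99))   # "abort": writer TS=3 is older than reader TS=5
print(write_item(X, 7, 25))   # "ok": creates a new version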
Multiversion Two-Phase Locking Using Certify Locks
Concept:
o Allow a transaction T' to read a data item X while it is write-locked by a conflicting transaction T.
o This is accomplished by maintaining two versions of each data item X, where one version must always have been written by some committed transaction. This means a write operation always creates a new version of X.
Steps:
1. X is the committed version of a data item.
2. T creates a second version X' after obtaining a write lock on X.
3. Other transactions continue to read X.
4. T is ready to commit, so it obtains a certify lock on X'.
5. The committed version X is replaced by X'.
6. T releases its certify lock on X', which is now X.
Compatibility tables for basic 2PL and 2PL with certify locks:

Basic 2PL:
          Read   Write
Read      yes    no
Write     no     no

2PL with certify locks:
          Read   Write   Certify
Read      yes    yes     no
Write     yes    no      no
Certify   no     no      no
Note: In multiversion 2PL, read and write operations from conflicting transactions can be processed concurrently. This improves concurrency, but it may delay transaction commit because of the need to obtain certify locks on all the items the transaction has written. It avoids cascading aborts, but, as in the strict two-phase locking scheme, conflicting transactions may become deadlocked.
Validation (Optimistic) Concurrency Control Schemes
In this technique, serializability is checked only at commit time, and transactions are aborted in case of non-serializable schedules. There are three phases:
1. Read phase: A transaction can read values of committed data items. However, updates are applied only to local copies (versions) of the data items (in the database cache).
2. Validation phase: Serializability is checked before transactions write their updates to the database. This phase for Ti checks that, for each transaction Tj that is either committed or is in its validation phase, one of the following conditions holds:
(1) Tj completes its write phase before Ti starts its read phase.
(2) Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items in common with the write_set of Tj.
(3) Both the read_set and write_set of Ti have no items in common with the write_set of Tj, and Tj completes its read phase before Ti completes its read phase.
When validating Ti, the first condition is checked first for each transaction Tj, since (1) is the simplest condition to check. If (1) is false, then (2) is checked, and if (2) is false, then (3) is checked. If none of these conditions holds, the validation fails and Ti is aborted.
3. Write phase: On successful validation, the transaction's updates are applied to the database; otherwise, the transaction is restarted.
Granularity of Data Items and Multiple Granularity Locking
A lockable unit of data defines its granularity. Granularity can be coarse (entire database) or fine (a tuple or an attribute of a relation). Data item granularity significantly affects concurrency control performance: the degree of concurrency is low for coarse granularity and high for fine granularity. Examples of data item granularity:
1. A field of a database record (an attribute of a tuple)
2. A database record (a tuple of a relation)
3. A disk block
4. An entire file
5. The entire database
[The accompanying diagram shows a granularity hierarchy from coarse to fine: the database db contains files f1 and f2; each file contains pages (p11, p12, ...); each page contains records (r111, ...).]
To manage such a hierarchy, in addition to read (shared) and write (exclusive) locks, three additional locking modes, called intention lock modes, are defined:
o Intention-shared (IS): indicates that one or more shared locks will be requested on some descendant node(s).
o Intention-exclusive (IX): indicates that one or more exclusive locks will be requested on some descendant node(s).
o Shared-intention-exclusive (SIX): indicates that the current node is locked in shared mode but one or more exclusive locks will be requested on some descendant node(s).
Lock compatibility matrix for multiple granularity locking:

        IS    IX    S     SIX   X
IS      yes   yes   yes   yes   no
IX      yes   yes   no    no    no
S       yes   no    yes   no    no
SIX     yes   no    no    no    no
X       no    no    no    no    no
The set of rules that must be followed to produce serializable schedules is:
1. The lock compatibility matrix must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode only if the parent of N is already locked by T in either IS or IX mode.
4. A node N can be locked by T in X, IX, or SIX mode only if the parent of N is already locked by T in either IX or SIX mode.
5. T can lock a node only if it has not unlocked any node (to enforce the 2PL policy).
6. T can unlock a node N only if none of the children of N are currently locked by T.
Granularity of data items and multiple granularity locking: an example of a serializable execution (each line is the next lock request, in schedule order):
T1: IX(db)
T1: IX(f1)
T2: IX(db)
T3: IS(db)
T3: IS(f1)
T3: IS(p11)
T1: IX(p11)
T1: X(r111)
T2: IX(f1)
T2: X(p12)
T3: S(r11j)
T1: IX(f2)
T1: IX(p21)
T1: X(r211)
T1: unlock(r211)
T1: unlock(p21)
T1: unlock(f2)
T3: S(f2)
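Rules 3 and 4 above can be checked mechanically for each lock request. A minimal Python sketch (the hierarchy, tables, and function names are invented for illustration, following the example above):

# Sketch: check rules 3 and 4 for a single lock request, given the locks
# a transaction already holds on the hierarchy.
PARENT = {"f1": "db", "f2": "db", "p11": "f1", "p12": "f1",
          "p21": "f2", "r111": "p11", "r211": "p21"}

REQUIRED_ON_PARENT = {
    "S":  {"IS", "IX"}, "IS": {"IS", "IX"},                          # rule 3
    "X":  {"IX", "SIX"}, "IX": {"IX", "SIX"}, "SIX": {"IX", "SIX"},  # rule 4
}

def can_lock(held, node, mode):
    """held maps node -> mode already held by this transaction."""
    parent = PARENT.get(node)
    if parent is None:                  # the root: any mode is allowed (rule 2)
        return True
    return held.get(parent) in REQUIRED_ON_PARENT[mode]

held = {"db": "IX", "f1": "IX"}
print(can_lock(held, "p11", "IX"))   # True: parent f1 is held in IX
print(can_lock(held, "p21", "IX"))   # False: parent f2 is not locked yet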