Database normalization

2011-02-03

http://en.wikipedia.org/wiki/Database_normalization

Goals
- Free the database from modification anomalies
  - Update anomaly: If a table has (employee id, employee address, employee skill) columns, updating an employee’s address involves updating every row that references them (one row for each skill). If the update changes some rows but not all, then the database is left inconsistent.
  - Insertion anomaly: If a table has (faculty id, faculty name, faculty hire date, course code), you can’t insert a newly hired faculty member until they’ve been scheduled to teach a course except by setting the course code to null.
  - Deletion anomaly: Cancelling the only/last course for a faculty member on a (faculty id, faculty name, faculty hire date, course code) table involves in the worst case deleting the entire row, which loses important data on the faculty member unnecessarily. In the best case, special logic must be written to check first that the row is not that faculty member’s only row, before nulling out the course code.
- Minimize redesign when extending the DB structure
- Make the data model more informative to users
- Avoid bias toward particular querying patterns
  - Unanticipated query goals should be supported without extra work or tools.
1NF
- Requirements
  - There are no intrinsic, meaningful ordering to rows or columns.
  - There are no duplicate rows
  - Every row & column intersection contains exactly one value from the applicable domain (and nothing else)
    - controversially, this indicates no nullable columns
    - e.g. no telephone number column should have NUM_1\0NUM_2 as a value
    - e.g. should not have multiple telephone number columns to accommodate multiple numbers (since some of these would have to be nullable)
  - All columns are regular. (unsure what this means)
2NF
- Requirement: For any candidate key K and any attribute A that is not a constituent of a candidate key, A depends on the whole of K rather than just part of it.
- A table is not in 2NF if there is redundant data which is directly dependent on the table’s key.
- 2NF does, however, permit redundant data which is transitively dependent on the table’s key.
- If in table T some attribute A depends on one candidate key K but not on another, then that (K, A) linkage must be listed in some other table to count as 2NF, even if K is not the primary key for T.
3NF
- Requirement, as delineated by Codd: Every non-prime attribute of table T is non-transitively dependent on every candidate key of T.
- Requirements, as delineated by Zaniolo: table T is in 3NF iff, for each of its functional dependencies X -> A, at least one of the following holds:
  - X -> A is a trivial functional dependency
  - X is a superkey (a set of columns such that the values uniquely determine a row)
  - A-X is a prime attribute (A-X is contained within a candidate key)
- Requirement, as delineated by Kent: every non-key attribute “must provide a fact about the key, the whole key, and nothing but the key.”
  - non-key attributes depending on the whole key ensures the table is in 2NF.
  - non-key attributes depending on nothing but the key ensures the table is in 3NF.
  - If changed to “each attribute”, it summarizes Boyce-Codd Normal Form (BCNF)
BCNF
- Requirement: for every nontrivial dependency X -> Y, X is a superkey (i.e., X is either a candidate key or a superset thereof)
- BCNF is not always acheivable.
- A 3NF table which does not have overlapping candidate keys is guaranteed to be in BCNF.
4NF
- Requirement: for every non-trivial multivalued dependency X ->-> Y, X is a superkey.
  - A multivalued dependency X ->-> Y in a table with columns X, Y, Z signifies that for each value x in X, the set of Y values is the same for any z in Z.
- Example of non-4NF table is (pizza restaurant, pizza crust type, delivery area), where restaurant determines delivery area and restaurant determines crust type, so 2 tables are better than 1. With 2 tables, the (crust type, delivery area) permutations can be generated from (pizza restaurant, crust type) and (pizza restaurant, delivery area) tables and adding/removing a crust type or delivery area requires only one operation.
5NF
- Requirement: every join dependency in it is implied by the candidate keys.
- Canonical example is (agent, company, product). Suppose the data obeys this semantic rule: if agent a sells any product of type t, and a represents company c, and c produces products of type t, then a sells all products of type t created by c. Then for 5NF this relationship must be represented as (agent, company), (agent, product), (company, product) tables, rather than a single table.
- 5NF reduces redundancy by increasing additively in size as new facts emerge.
- Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form. (http://www.bkent.net/Doc/simple5.htm)
Database Denormalization
- Denormalization involves adding redundant data, or grouping data, to optimize read performance.
- Normalization typically involves many separate tables, and queries that join many tables may become prohibitively time-consuming.
- Sometimes this problem is solved via views rather than denormalization.
- Denormalization techniques:
  - remove normalization from some tables.
    - Inconsistencies can be staved off with constraints. Constraints are complex to craft and maintain, and slow down writes.
  - Materialized views
  - Star schemas
  - Prebuilt summarization