Normalization in Databases
Normalization is essentially breaking up a table into new related tables and assigning attributes to these new tables. This is done in order to MINIMIZE data redundancies/repetitions of attribute values and to ultimately avoid anomalies (note the term minimize. This is because we sometimes deliberately repeat data in order to make relational databases work — for example, foreign keys (FKs) are repeated in order to make relationships in databases work. These are then called controlled redundancies).
Anomalies
An anomaly is essentially when something is not quite the way it should be — think of anomalies as a ‘problems’ that arise in a database without normalization.
Three types of anomalies (we will use the table below to better explain these anomalies):
- Update anomaly. This is when duplicated attribute values are updated but not all. Imagine the spreadsheet above had thousands of records. If you want to update the TillType_id and you miss just one row when updating, this would lead to inconsistent data.
- Insertion anomaly. This is when attribute values cannot be inserted into the database without inserting other attributes. Suppose you want to insert some data in the SaleTotal column. This would be impossible if you do not have the customer and the teller details. Suppose Teller with Teller_id T01 decides to resign from working at the store and you want to delete all her details. Deleting her details would lead to deletion of the customer details and well as the sale details. This is deletion anomaly.
- Deletion anomaly. This is when certain attribute values are lost due to deleting other attribute values.
To make normalization explanations better, we will use the same table as the one above as an example:
First, lets look at some rules of normalization:
- Each table must represent a single thing
- Attributes should not be unnecessarily stored in more than one table (note the term unnecessarily).
- All attributes should depend on a primary key
Before we start normalizing databases, lets first look at dependencies:
The best way to understand what dependencies are, is keeping in mind that ‘if I know X, I can determine Y’, for which X becomes the determinant. This is often determined by an arrow: X → Y.
Three types of dependencies:
- Proper functional dependency. This is when all dependent attributes are dependent on the entire primary key. We should always aim for proper functional dependencies when normalizing databases.
- Partial dependency. This is when a dependent attribute is only dependent on only a part of a composite primary key.
- Transitive dependency. This when a dependent attribute is not dependent on any part of a primary key.
Normalization hierarchy:
- First Normal Form (1NF)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce-Codd Normal Form (BCNF)
- Fourth Normal Form (4NF)
This article will only explain how to convert up to the 3NF, and is broken up into 3 more parts: