Technical level: Intermediate
|| Date: 19th January 2004|| Author:
John Blank, Wankyu Choi, Allan Kent, Ganesh
Prasad, Chris Ullman
Chapter Index
Introduction
Normalization
Foreign Keys
Denormalization
Referential Integrity
Summary
This is a free sample chapter from Beginning PHP 4
Databases published by Wrox Press.
Introduction
As we have briefly seen in the previous chapter, there
are many possible ways to structure data, such as
trees, lists, and objects. In relational databases,
data is structured as relations. Relations are expressed
as sets of tuples, which are implemented in the form
of tables. Conceptually, tables are easy to grasp,
since it is a form that is familiar to most people.
Anyone who has read a spreadsheet, a train timetable,
or even a television guide is already familiar with
the organization of data into columns and rows. In
this chapter, we will lay out the basic concepts of
relational databases and describe the process of organizing
data in a relational manner. The topics covered in
this chapter are:
Schema normalization
The process by which redundancy and inconsistency
are eliminated from the database
Keys
Fields or combinations of fields that are used to
identify records
Referential integrity
A state of consistency for database schemata
Entity relationship diagrams
Models used to design databases
Tabular Data
As mentioned earlier, a table is a data structure
that uses columns and rows to organize the data. As
an example, consider the following ledger sheet sample
of a charity containing details of the donations:
Donor Donation 1 Donation 2 Donation 3
Marco Pinelli $200 Solar
Scholars
Victor Gomez $100 Pear Creek $100 Danube
Land Trust $50 Forest Asia
Seung Yong Lee $150 Forest Asia
Tables represent entities, which are unique items
like people, objects, and relationships about which
we wish to store data. Each row represents an instance
of the entity. In the above example, each row represents
an instance of one donor. In relational database terminology,
an instance is known as a record, but the terms row
or tuple are also used.
Each column represents an attribute of the entity,
or something about the entity. In this case, each
column represents a donation made by the donor, listing
the amount of the donation and the project to which
the money is donated. In relational database terminology,
an attribute is known as a field, but the term column
is also very common. Adding or removing columns would
change both the data stored in the table and the actual
structure of the table, whereas adding or removing
rows would only change data stored in the table. In
other words, removing a column removes information
about entities whereas removing a row only removes
one instance of an entity but no information about
them in general.
As we shall see in the next chapter, each field in
a table is assigned a data type. The type indicates
what sort of data will be stored in that field: text
data, integer data, boolean (true or false) data,
and so on. The assigned type then applies to that
field's value for every record in the table.
Keys
We create databases because we need to store information.
For the information in the database to be useful,
we need to be able to perform certain operations on
it. These operations fall broadly into two categories:
reading the data, and changing the data. Whether one
wishes to read a record, update it, or delete it,
one first needs to identify the record in a way that
distinguishes it from the other records in the table.
This is where keys come in. A key is a field or a
combination of fields whose value identifies a record
for a given purpose. One type of key is a unique key,
which can be used to identify a single record. For
example, every book has a unique ISBN (International
Standard Book Number) that marks the book unmistakably.
If a table of information about books includes an
ISBN field, then that field can serve as a unique
key.
A table might have more than one unique key. Suppose
that each book in our table also has a unique product
ID. While there is no problem with the existence of
more than one unique key, it is considered desirable
to have one that stands out as the primary key - the
key that is considered the foremost means of identifying
a record. In this case each of the unique keys is
known as a candidate key, since each has the possibility
of serving as the primary key. It is then up to the
database designer to designate the primary key from
among the candidate keys.
In our example from the previous section, the Donor
field is a candidate key, if we accept (for now) that
each donor is unique within the table. We shall revisit
the topic of keys later in this chapter.
A Few Inadequacies
At first glance, a simple table such as the one shown
in the chapter seems to meet all of our needs for
storing data. When designing and filling the table,
we may add as many fields and records as we like to
accommodate large amounts of data. But after some
examination, we are likely to encounter quite a few
failings with our table. What if a donor makes more
than three donations? We can add more fields to the
table, of course, but to change the structure of a
database once it is in use is extremely inconvenient.
Also it is difficult to know in advance how many donations
would be enough. What if one donor makes dozens of
donations?
What if we wish to store more information about a
project, such as a description? Or even more information
about a donation, such as the date? Again, while it
is conceivable that additional columns could address
this issue, such a solution would be awkward and wasteful.
If columns named DonationDate1 and DonationDate2 are
added, the same uncertainty over the appropriate number
of columns exists. Adding a description after every
project name produces a lot of redundant data, since
each project appears in the table multiple times.
Every time a new donation is made, the description
of the project would have to be repeated. Such redundancy
is very inefficient as seen in the following table:
Donor Amount1 Project1 Description1 Amount2 Project2
Description2
Marco Pinelli $200 Solar Scholars Powering schools
with solar panels
Victor Gomez $100 Pear Creek cleanup Cleaning up litter
and pollutants from Pear Creek $100 Danube Land Trust
Purchasing and preserving land in the Danube watershed
Seung Yong Lee $150 Forest Asia Planting trees in
Asia
The underlying problem is that a table is two-dimensional.
It consists of columns and rows. Real-world data is
usually multi-dimensional. We wish to store not only
the data relevant to the donors and donations, but
also data that relates to details in the table, such
as additional information about the projects. There
is a solution to our problem. Relational databases
allow us to create multiple tables of related data.
The database designer uses the relationships between
these tables to represent multi-dimensional data.
This is also why they are called relational databases.
Let's now look at the process of normalization in
relational databases.