CSCE 235
Handout 13: Sets: Searching the World
February 11, 2004
Assume that you are an expert in database management and are given the following database. Your responsibilities include writing a program that allows users to query the database by country name, capital name, area, population, and continent. For example, users should be able to ask: “Give me the countries on the continent of North America that each have more than 200,000,000 people.” “Give me the countries whose capital has the same name as the country.” “Give me the countries on the continent of Oceania.” “Give me the countries with fewer than 10,000,000 people.” “Give me the countries with an area larger than 100,000 sq mi yet with a population of at most 200,000,000 people.”
| Country | Capital | Area (sq mi) | Population | Continent |
| Algeria | Algiers | 919,595 | 30,774,000 | Africa |
| Angola | Luanda | 481,354 | 12,479,000 | Africa |
| Indonesia | Jakarta | 741,101 | 211,806,000 | Asia |
| China | Beijing | 3,705,820 | 1,254,062,000 | Asia |
| U.S.A. | Washington, D.C. | 3,717,796 | 270,933,000 | North America |
| Canada | Ottawa | 3,849,670 | 30,589,000 | North America |
| Brazil | Brasilia | 3,286,488 | 167,988,000 | South America |
| Argentina | Buenos Aires | 1,068,302 | 36,568,000 | South America |
| United Kingdom | London | 94,248 | 59,364,000 | Europe |
| France | Paris | 210,026 | 59,067,000 | Europe |
| Australia | Canberra | 2,966,153 | 18,981,000 | Oceania |
| New Zealand | Wellington | 103,883 | 3,817,000 | Oceania |
| India | New Delhi | 1,269,346 | 986,611,000 | Asia |
| Turkey | Ankara | 300,948 | 65,869,000 | Asia |
| Belgium | Brussels | 11,783 | 10,225,000 | Europe |
| Belarus | Minsk | 80,154 | 10,167,000 | Europe |
| Russia | Moscow | 6,592,692 | 146,519,000 | Europe |
| Denmark | Copenhagen | 16,638 | 5,325,000 | Europe |
| Italy | Rome | 116,324 | 57,717,000 | Europe |
| Netherlands | Amsterdam | 16,023 | 15,799,000 | Europe |
| Malaysia | Kuala Lumpur | 127,317 | 22,710,000 | Asia |
| Singapore | Singapore | 239 | 3,999,000 | Asia |
| Cameroon | Yaounde | 183,569 | 15,456,000 | Africa |
| Egypt | Cairo | 386,662 | 66,924,000 | Africa |
| Senegal | Dakar | 75,955 | 9,240,000 | Africa |
| Cuba | Havana | 42,804 | 11,178,000 | North America |
| Peru | Lima | 496,225 | 26,624,000 | South America |
| Chile | Santiago | 292,135 | 15,018,000 | South America |
| Mexico | Mexico City | 756,066 | 99,734,000 | North America |
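One way to make the table queryable is to store each row as a record and build a set of country names from any condition on a record. The following is a minimal Python sketch, not a prescribed solution; the names Country, WORLD, and select are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Country(NamedTuple):
    name: str
    capital: str
    area_sq_mi: int
    population: int
    continent: str


# Rows copied from the table above.
ROWS = [
    ("Algeria", "Algiers", 919_595, 30_774_000, "Africa"),
    ("Angola", "Luanda", 481_354, 12_479_000, "Africa"),
    ("Indonesia", "Jakarta", 741_101, 211_806_000, "Asia"),
    ("China", "Beijing", 3_705_820, 1_254_062_000, "Asia"),
    ("U.S.A.", "Washington, D.C.", 3_717_796, 270_933_000, "North America"),
    ("Canada", "Ottawa", 3_849_670, 30_589_000, "North America"),
    ("Brazil", "Brasilia", 3_286_488, 167_988_000, "South America"),
    ("Argentina", "Buenos Aires", 1_068_302, 36_568_000, "South America"),
    ("United Kingdom", "London", 94_248, 59_364_000, "Europe"),
    ("France", "Paris", 210_026, 59_067_000, "Europe"),
    ("Australia", "Canberra", 2_966_153, 18_981_000, "Oceania"),
    ("New Zealand", "Wellington", 103_883, 3_817_000, "Oceania"),
    ("India", "New Delhi", 1_269_346, 986_611_000, "Asia"),
    ("Turkey", "Ankara", 300_948, 65_869_000, "Asia"),
    ("Belgium", "Brussels", 11_783, 10_225_000, "Europe"),
    ("Belarus", "Minsk", 80_154, 10_167_000, "Europe"),
    ("Russia", "Moscow", 6_592_692, 146_519_000, "Europe"),
    ("Denmark", "Copenhagen", 16_638, 5_325_000, "Europe"),
    ("Italy", "Rome", 116_324, 57_717_000, "Europe"),
    ("Netherlands", "Amsterdam", 16_023, 15_799_000, "Europe"),
    ("Malaysia", "Kuala Lumpur", 127_317, 22_710_000, "Asia"),
    ("Singapore", "Singapore", 239, 3_999_000, "Asia"),
    ("Cameroon", "Yaounde", 183_569, 15_456_000, "Africa"),
    ("Egypt", "Cairo", 386_662, 66_924_000, "Africa"),
    ("Senegal", "Dakar", 75_955, 9_240_000, "Africa"),
    ("Cuba", "Havana", 42_804, 11_178_000, "North America"),
    ("Peru", "Lima", 496_225, 26_624_000, "South America"),
    ("Chile", "Santiago", 292_135, 15_018_000, "South America"),
    ("Mexico", "Mexico City", 756_066, 99_734_000, "North America"),
]
WORLD = [Country(*row) for row in ROWS]


def select(predicate):
    """Return the set of country names whose record satisfies the predicate."""
    return {c.name for c in WORLD if predicate(c)}


# Three of the example queries, each expressed as a set.
oceania = select(lambda c: c.continent == "Oceania")
same_name = select(lambda c: c.capital == c.name)
under_10m = select(lambda c: c.population < 10_000_000)

print(sorted(oceania))
print(sorted(same_name))
print(sorted(under_10m))
```

Because every query returns a set of names, the results can be combined afterwards with the usual set operations.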
Let us use what we have learned about sets to answer the above queries.
Take “Give me the countries on the continent of North America that each have more than 200,000,000 people.” First, we define the sets NA = { x : x is a country on the continent of North America } and P200M = { x : x is a country with more than 200,000,000 people }. To obtain the results for the query, we simply intersect the two sets: NA ∩ P200M.
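The intersection can be tried out directly with Python's built-in set type. This is only an illustrative sketch: ROWS holds a few rows copied from the table (every North American country and every country with more than 200,000,000 people is included), and the variable names are hypothetical.

```python
# (name, population, continent) for a few rows copied from the table.
ROWS = [
    ("U.S.A.", 270_933_000, "North America"),
    ("Canada", 30_589_000, "North America"),
    ("Cuba", 11_178_000, "North America"),
    ("Mexico", 99_734_000, "North America"),
    ("Indonesia", 211_806_000, "Asia"),
    ("China", 1_254_062_000, "Asia"),
    ("India", 986_611_000, "Asia"),
    ("Brazil", 167_988_000, "South America"),
]

# NA = { x : x is a country on the continent of North America }
NA = {name for name, _, continent in ROWS if continent == "North America"}

# P200M = { x : x is a country with more than 200,000,000 people }
P200M = {name for name, population, _ in ROWS if population > 200_000_000}

# The query is the intersection of the two sets.
result = NA & P200M
print(result)
```

On these rows the only country in both sets is U.S.A.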
Take “Give me the countries with an area larger than 100,000 sq mi yet with a population of at most 200,000,000 people.” How do we define this? First, suppose that A100K = { x : x is a country with an area larger than 100,000 sq mi }, and we can use the same definition of P200M as above. To obtain the results for the query, we simply take the difference of the two sets: A100K − P200M. What about P200M − A100K? What does this difference say? Be very careful here. The latter set gives you “all countries that each have a population greater than 200,000,000 people but an area of at most 100,000 sq mi.” The order of the operands matters, so be very careful about how you form this set.
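The asymmetry of set difference is easy to check in code. The sketch below uses a handful of rows from the table; the universe U and the result names are hypothetical helpers introduced only for this illustration.

```python
# (name, area in sq mi, population) for a few rows copied from the table.
ROWS = [
    ("U.S.A.", 3_717_796, 270_933_000),
    ("China", 3_705_820, 1_254_062_000),
    ("Mexico", 756_066, 99_734_000),
    ("France", 210_026, 59_067_000),
    ("Singapore", 239, 3_999_000),
    ("Cuba", 42_804, 11_178_000),
]

U = {name for name, _, _ in ROWS}  # universe: every country in this sample
A100K = {name for name, area, _ in ROWS if area > 100_000}
P200M = {name for name, _, population in ROWS if population > 200_000_000}

# Large countries that are NOT in P200M (population at most 200,000,000).
large_not_crowded = A100K - P200M

# Swapping the operands asks a different question: countries in P200M
# that are NOT in A100K (area at most 100,000 sq mi).
crowded_not_large = P200M - A100K

# A difference is the same as intersecting with the complement.
assert large_not_crowded == A100K & (U - P200M)

print(sorted(large_not_crowded))
print(sorted(crowded_not_large))
```

In this sample the first difference contains France and Mexico, while the second is empty, since every country with more than 200,000,000 people here also has a large area.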
Now, given the above examples, is it possible for you to find a unique combination of sets for each country listed in the above table? For example, can you find a combination of sets for Mexico such that, if you type in the query using that combination, Mexico is the only result returned, that is, the only member of the resulting set? Of course we could do something like this: T = { x : x is a country with 99,734,000 people }. However, this approach would be too specific to be useful for truly differentiating Mexico from other countries. In data mining, we look for the simplest and most general way of defining uniqueness. So, what about this: suppose that A3M = { x : x is a country with an area larger than 3,000,000 sq mi } and the definition of NA holds. Then, we could retrieve Mexico by
.
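A candidate combination can be checked against the table before relying on it. The sketch below assumes the combination is a set difference built from NA and A3M, which is an assumption since the expression is not reproduced here. On the table's rows, NA − A3M still contains Cuba as well as Mexico, so the sketch adds a hypothetical third set, A500K (area larger than 500,000 sq mi), to narrow the result to Mexico alone.

```python
# (name, area in sq mi, continent) for a few rows copied from the table.
ROWS = [
    ("U.S.A.", 3_717_796, "North America"),
    ("Canada", 3_849_670, "North America"),
    ("Cuba", 42_804, "North America"),
    ("Mexico", 756_066, "North America"),
    ("Brazil", 3_286_488, "South America"),
    ("Russia", 6_592_692, "Europe"),
]

NA = {name for name, _, continent in ROWS if continent == "North America"}
A3M = {name for name, area, _ in ROWS if area > 3_000_000}

# Assumed candidate: North American countries that are not in A3M.
candidate = NA - A3M
print(sorted(candidate))

# Hypothetical extra rule (not defined in the handout): area > 500,000 sq mi.
A500K = {name for name, area, _ in ROWS if area > 500_000}

# North American, larger than 500,000 sq mi, but not larger than 3,000,000.
only_mexico = (NA & A500K) - A3M
print(sorted(only_mexico))
```

Checking a rule combination this way shows whether it really isolates one country or whether another set is still needed.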
Ultimately, in the area of data mining, we see each definition of a set as a rule. The key is to have as few rules as possible to describe the entire database, so that each country, in our example, can be queried and retrieved by combining these rules.
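To see whether a given collection of rules is enough, one can record for each country which rules it satisfies and then look for countries that end up with the same combination. The following is a rough sketch under that idea, using a few rows of the table and the four sets defined in this handout; the RULES dictionary and the notion of a "signature" are illustrative names, not standard terminology from the course.

```python
from collections import defaultdict

# (name, area in sq mi, population, continent) for a few rows of the table.
ROWS = [
    ("U.S.A.", 3_717_796, 270_933_000, "North America"),
    ("Canada", 3_849_670, 30_589_000, "North America"),
    ("Cuba", 42_804, 11_178_000, "North America"),
    ("Mexico", 756_066, 99_734_000, "North America"),
    ("China", 3_705_820, 1_254_062_000, "Asia"),
    ("France", 210_026, 59_067_000, "Europe"),
    ("Italy", 116_324, 57_717_000, "Europe"),
]

# Each rule is the membership test of one set.
RULES = {
    "NA": lambda area, pop, cont: cont == "North America",
    "P200M": lambda area, pop, cont: pop > 200_000_000,
    "A100K": lambda area, pop, cont: area > 100_000,
    "A3M": lambda area, pop, cont: area > 3_000_000,
}

# Group countries by the exact set of rules they satisfy (their signature).
groups = defaultdict(list)
for name, area, pop, cont in ROWS:
    signature = frozenset(
        rule for rule, holds in RULES.items() if holds(area, pop, cont)
    )
    groups[signature].append(name)

for signature, names in groups.items():
    status = "unique" if len(names) == 1 else "ambiguous"
    print(sorted(signature), names, status)
```

With these four rules, France and Italy share the same signature, so at least one more rule would be needed to retrieve each of them on its own.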