CSCE 235

Handout 13: Sets:  Searching the World

February 11, 2004 

 

Assume that you are an expert in database management.  You are given the following database.  Your responsibilities include writing a program that allows users to query the database by specifying the country name, the capital name, the country’s area, the country’s population, and the continent of the country.  For example, the users should be able to ask for “Give me the countries on the continent of North America that each has more than 200,000,000 people.”  Or, “Give me the countries whose capital shares the same name as the name of the country.”  “Give me the countries on the Oceania continent.”  “Give me the countries with a population of fewer than 10,000,000 people.”  “Give me the countries with an area larger than 100,000 sq mi yet with a population fewer than or equal to 200,000,000 people.”

 

Country         Capital           Area (sq mi)  Population     Continent
--------------  ----------------  ------------  -------------  -------------
Algeria         Algiers           919,595       30,774,000     Africa
Angola          Luanda            481,354       12,479,000     Africa
Indonesia       Jakarta           741,101       211,806,000    Asia
China           Beijing           3,705,820     1,254,062,000  Asia
U.S.A.          Washington, D.C.  3,717,796     270,933,000    North America
Canada          Ottawa            3,849,670     30,589,000     North America
Brazil          Brasilia          3,286,488     167,988,000    South America
Argentina       Buenos Aires      1,068,302     36,568,000     South America
United Kingdom  London            94,248        59,364,000     Europe
France          Paris             210,026       59,067,000     Europe
Australia       Canberra          2,966,153     18,981,000     Oceania
New Zealand     Wellington        103,883       3,817,000      Oceania
India           New Delhi         1,269,346     986,611,000    Asia
Turkey          Ankara            300,948       65,869,000     Asia
Belgium         Brussels          11,783        10,225,000     Europe
Belarus         Minsk             80,154        10,167,000     Europe
Russia          Moscow            6,592,692     146,519,000    Europe
Denmark         Copenhagen        16,638        5,325,000      Europe
Italy           Rome              116,324       57,717,000     Europe
Netherlands     Amsterdam         16,023        15,799,000     Europe
Malaysia        Kuala Lumpur      127,317       22,710,000     Asia
Singapore       Singapore         239           3,999,000      Asia
Cameroon        Yaounde           183,569       15,456,000     Africa
Egypt           Cairo             386,662       66,924,000     Africa
Senegal         Dakar             75,955        9,240,000      Africa
Cuba            Havana            42,804        11,178,000     North America
Peru            Lima              496,225       26,624,000     South America
Chile           Santiago          292,135       15,018,000     South America
Mexico          Mexico City       756,066       99,734,000     North America

 

Let us use what we learn about sets to accomplish the above queries. 

 

Take “Give me the countries on the continent of North America that each has more than 200,000,000 people.”  First, we define the set NA = { x : x is a country on the continent of North America }, and P200M = { x : x is a country with more than 200,000,000 people }.  To obtain the results for the query, we simply intersect the two sets: NA ∩ P200M.
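The intersection query can be tried out with a minimal Python sketch. The list `rows` below is only a hand-copied excerpt of the table (an assumption made to keep the example short), and the variable names simply mirror the set names used in this handout.

```python
# (name, continent, population): a few rows excerpted from the table.
rows = [
    ("U.S.A.", "North America", 270_933_000),
    ("Canada", "North America", 30_589_000),
    ("Cuba", "North America", 11_178_000),
    ("Mexico", "North America", 99_734_000),
    ("China", "Asia", 1_254_062_000),
    ("Indonesia", "Asia", 211_806_000),
    ("India", "Asia", 986_611_000),
    ("Brazil", "South America", 167_988_000),
]

# NA = { x : x is a country on the continent of North America }
NA = {name for name, continent, _ in rows if continent == "North America"}

# P200M = { x : x is a country with more than 200,000,000 people }
P200M = {name for name, _, pop in rows if pop > 200_000_000}

# The query is the intersection of the two sets.
result = NA & P200M
print(result)  # {'U.S.A.'}
```

In Python, `&` on two sets is set intersection, so the program reads almost exactly like the set expression.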

 

Take “Give me the countries with an area larger than 100,000 sq mi yet with a population fewer than or equal to 200,000,000 people.”  How do we define this?  First, suppose that A100K = { x : x is a country with an area larger than 100,000 sq. mi }, and we can use the same definition of P200M above.  To obtain the results for the query, we simply take the difference of the two sets: A100K - P200M.  What about P200M - A100K?  What does this difference say?  Be very careful about this, because the order of the two sets matters.  The latter set gives you “all countries that each has a population greater than 200,000,000 people but each has an area no larger than 100,000 sq mi.”  So, be very careful about how you form this set.
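To see that the order of a set difference matters, here is a small Python sketch that computes both differences. The list `rows` is again only an excerpt of the table, chosen for illustration, and the names `large_not_populous` and `populous_not_large` are made up for this example.

```python
# (name, area in sq mi, population): a few rows excerpted from the table.
rows = [
    ("U.S.A.", 3_717_796, 270_933_000),
    ("China", 3_705_820, 1_254_062_000),
    ("Indonesia", 741_101, 211_806_000),
    ("Canada", 3_849_670, 30_589_000),
    ("France", 210_026, 59_067_000),
    ("United Kingdom", 94_248, 59_364_000),
    ("Singapore", 239, 3_999_000),
]

# A100K = { x : x is a country with an area larger than 100,000 sq mi }
A100K = {name for name, area, _ in rows if area > 100_000}

# P200M = { x : x is a country with more than 200,000,000 people }
P200M = {name for name, _, pop in rows if pop > 200_000_000}

# Large area, but NOT more than 200,000,000 people.
large_not_populous = A100K - P200M

# More than 200,000,000 people, but NOT a large area.
populous_not_large = P200M - A100K

print(sorted(large_not_populous))  # ['Canada', 'France']
print(sorted(populous_not_large))  # []
```

The second difference is empty for this excerpt because every country here with more than 200,000,000 people also has an area larger than 100,000 sq mi.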

 

Now, given the above examples, is it possible for you to find a unique combination of sets for each country listed in the above table?  For example, can you find a combination of sets for Mexico such that, if you type in the query using the combination, Mexico is the only result returned, that is, the only member of the resulting set?  Of course, we could do something like this:  T = { x : x is a country with 99,734,000 people }.  However, this approach is too specific: it merely restates one row of the table and tells us nothing general about what differentiates Mexico from other countries.  In data mining, we look for the simplest and the most general way of defining uniqueness.  So, what about this: suppose that A3M = { x : x is a country with an area larger than 3,000,000 sq. mi } and the definitions of NA and A100K hold.  Then NA - A3M leaves only Cuba and Mexico, and intersecting with A100K removes Cuba, so we could retrieve Mexico by (NA - A3M) ∩ A100K.
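A candidate combination for Mexico can be checked mechanically. The sketch below is one possible check, not the only valid combination: it uses the four North American rows of the table plus two other rows for contrast, and it shows that NA - A3M on its own still contains Cuba, so a third set (A100K, as defined earlier) is brought in.

```python
# (name, area in sq mi, continent): rows excerpted from the table.
rows = [
    ("U.S.A.", 3_717_796, "North America"),
    ("Canada", 3_849_670, "North America"),
    ("Cuba", 42_804, "North America"),
    ("Mexico", 756_066, "North America"),
    ("Brazil", 3_286_488, "South America"),
    ("France", 210_026, "Europe"),
]

NA = {name for name, _, continent in rows if continent == "North America"}
A3M = {name for name, area, _ in rows if area > 3_000_000}
A100K = {name for name, area, _ in rows if area > 100_000}

# Removing the very large countries from North America is not enough.
first_try = NA - A3M
print(sorted(first_try))  # ['Cuba', 'Mexico']

# Also requiring an area above 100,000 sq mi removes Cuba.
mexico_only = (NA - A3M) & A100K
print(mexico_only)  # {'Mexico'}
```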

 

Ultimately, in the area of data mining, we see each definition of a set as a rule.  The key is to have as few rules as possible to describe the entire database, so that each country, in our example, can be queried and retrieved by combining these rules.
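One way to test whether a given collection of rules is enough is to record, for each country, which sets it belongs to, and then look for countries that share the same record. The Python sketch below does this for an excerpt of the table. The dictionary `rules`, the helper `signature`, and the choice of four rules are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict

# (name, area in sq mi, population, continent): rows excerpted from the table.
rows = [
    ("U.S.A.", 3_717_796, 270_933_000, "North America"),
    ("Canada", 3_849_670, 30_589_000, "North America"),
    ("Cuba", 42_804, 11_178_000, "North America"),
    ("Mexico", 756_066, 99_734_000, "North America"),
    ("France", 210_026, 59_067_000, "Europe"),
    ("United Kingdom", 94_248, 59_364_000, "Europe"),
    ("Belgium", 11_783, 10_225_000, "Europe"),
]

# Each rule is a named membership test, i.e. one set definition.
rules = {
    "NA": lambda area, pop, cont: cont == "North America",
    "A100K": lambda area, pop, cont: area > 100_000,
    "A3M": lambda area, pop, cont: area > 3_000_000,
    "P200M": lambda area, pop, cont: pop > 200_000_000,
}

def signature(area, pop, cont):
    """Return the names of all rule sets that contain this country."""
    return frozenset(n for n, test in rules.items() if test(area, pop, cont))

# Group countries by signature; a group of size one is uniquely retrievable.
groups = defaultdict(list)
for name, area, pop, cont in rows:
    groups[signature(area, pop, cont)].append(name)

unique = sorted(g[0] for g in groups.values() if len(g) == 1)
ambiguous = sorted(sorted(g) for g in groups.values() if len(g) > 1)

print(unique)     # ['Canada', 'Cuba', 'France', 'Mexico', 'U.S.A.']
print(ambiguous)  # [['Belgium', 'United Kingdom']]
```

A country with a unique signature can be retrieved by intersecting the sets it belongs to and subtracting the sets it does not. Here Belgium and the United Kingdom fall in none of the four sets, so these four rules cannot tell them apart and at least one more rule would be needed.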