CSCE 410/810
Homework Assignment 4
October 23, 2003
Problem
Incorporate a stemming algorithm and a
stopword-filtering algorithm with an actual textual database to obtain keywords
for each document, based on the Vector Space model.
Perform experiments in some basic document analyses:
document similarity within the same collection and diversity of different
collections.
The goals of this exercise are (a) to learn how to
extract keywords from a textual database and build a system that holds the
information (keywords, term frequencies, inverse document frequencies, etc.)
for later analysis, (b) to evaluate the similarity between two documents, and
(c) to evaluate the diversity of different document collections.
This assignment counts 10% towards your grade.
Requirements
(1)
Use the Tipster topic document collection stored in the files “topics.1-50” and “topics.51-100”.
Download the files from www.cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03. These two files contain actual topic document
categorizations used for the Text Retrieval Conference (TREC) experiments. Each file contains 50 document topics. Each document topic has a set of
“tags”:
| Tags | Meaning |
| <top> | Beginning of a document topic |
| <head> | Header of the document topic |
| <num> | Index number for the document topic |
| <dom> | The domain of the topic |
| <title> | The topic title |
| <desc> | The description of the topic |
| <narr> | The long version of the description |
| <con> | Various concepts that the topic covers |
| <fac> | Possible important indicators |
| <def> | Definitions of some of the concepts |
| </top> | End of a document topic |
(2)
Incorporate the programs
stemmer and stopper into your program. Note
that these two programs are written in ‘C’.
As a result, you are strongly encouraged to implement your solution for
this homework in ‘C’ or ‘C++’. Download
the two programs from the class website.
These two programs will parse the above Tipster topic files for
you. Use the “stop.wrd” file in the stopper
program as your list of stopwords. The
following words are not to be used as keywords:
top, topic, head, num, number, dom, domain, desc, title, description, narr,
narrative, con, concept, fac, factor, def, definition
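One possible way to apply this exclusion list is sketched below. It is only an illustration: `is_excluded_keyword` is a hypothetical helper name and is not part of the stemmer or stopper programs, and the sketch assumes the word has already been lowercased and has not yet been stemmed (a stemmer may alter words such as "description", so the order of the two steps matters).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tag-related words from the list above that must never become keywords. */
static const char *const EXCLUDED_WORDS[] = {
    "top", "topic", "head", "num", "number", "dom", "domain", "desc",
    "title", "description", "narr", "narrative", "con", "concept",
    "fac", "factor", "def", "definition"
};

/* Hypothetical helper: returns 1 if `word` (assumed lowercase and
 * unstemmed) is on the exclusion list, and 0 otherwise. */
int is_excluded_keyword(const char *word)
{
    assert(word != NULL);
    size_t count = sizeof EXCLUDED_WORDS / sizeof EXCLUDED_WORDS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(word, EXCLUDED_WORDS[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

A caller would skip any token for which `is_excluded_keyword(token)` returns 1 before counting term frequencies.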
(3)
Compute document similarity between two documents. First, for each document, obtain
its set of unique keywords and the frequency of occurrence of each keyword. Second, obtain a list of all
unique keywords and calculate the inverse document frequency,
$idf_i = 1/n_i$, where $idf_i$ is the inverse of $n_i$, the number of hits of keyword i.
Third, compute the document similarity between each pair of documents
using:
$$sim(d_j, d_k) = \frac{\sum_i w_{ij}\, w_{ik}}{\sqrt{\sum_i w_{ij}^2}\,\sqrt{\sum_i w_{ik}^2}}$$
where the term weight is:
$$w_{ij} = tf_{ij} \times idf_i,$$
where $w_{ij}$ is the weight of keyword i in representing document j,
$tf_{ij}$ is simply the frequency of occurrence of keyword i in document j, and
$idf_i$ is what we have discussed above.
(4)
Write the following modules to analyze the documents.
(a) A module that computes the similarity values for all pairs of documents.
(b) A module that computes the average document similarity value between a document and all other documents.
(c) A module that finds the document with the smallest average document similarity value.
(d) A module that finds similar documents for a particular document. A document is considered to be similar to
another document if the document similarity between the two is greater than a
threshold. In this case, set the threshold at 0.10.
The following is an actual example of the output
for “topics.1-50”:
[0]-[1]: similarity [0.143786]
[3]-[5]: similarity [0.605412]
[4]-[18]: similarity [0.113802]
[9]-[23]: similarity [0.137449]
[15]-[16]: similarity [0.219045]
[18]-[4]: similarity [0.113802]
[23]-[9]: similarity [0.137449]
[27]-[28]: similarity [0.159268]
[28]-[47]: similarity [0.186928]
[29]-[30]: similarity [0.345532]
[32]-[35]: similarity [0.140970]
Note: Some of the above modules may be implemented
as one module.
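As a rough sketch of how modules (b), (c), and (d) might look once module (a) has filled in a similarity matrix, consider the code below. The function names and the row-major n x n matrix `sim` are assumptions made for illustration; the print format simply mirrors the example output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Module (b): mean similarity between `doc` and every other document.
 * `sim` is a row-major n x n matrix, sim[j * n + k] = sim(d_j, d_k). */
double average_similarity(const double *sim, size_t n, size_t doc)
{
    assert(sim != NULL && n > 1 && doc < n);
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        if (k != doc) {
            sum += sim[doc * n + k];
        }
    }
    return sum / (double)(n - 1);
}

/* Module (c): index of the document with the smallest average similarity. */
size_t least_average_document(const double *sim, size_t n)
{
    size_t best = 0;
    double best_avg = average_similarity(sim, n, 0);
    for (size_t j = 1; j < n; j++) {
        double avg = average_similarity(sim, n, j);
        if (avg < best_avg) {
            best_avg = avg;
            best = j;
        }
    }
    return best;
}

/* Module (d): print every document whose similarity to `doc` is greater
 * than `threshold`, and return how many were found. */
size_t print_similar_documents(const double *sim, size_t n, size_t doc,
                               double threshold)
{
    size_t found = 0;
    for (size_t k = 0; k < n; k++) {
        if (k != doc && sim[doc * n + k] > threshold) {
            printf("[%zu]-[%zu]: similarity [%f]\n", doc, k, sim[doc * n + k]);
            found++;
        }
    }
    return found;
}
```

Calling `print_similar_documents(sim, n, j, 0.10)` for every `j` yields a list that contains both orderings of each similar pair, as in the example output.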
Hand In
(1) Discuss the strategy you used to
incorporate the stemmer and stopper programs into your
program. Did you change either
program to make it work the way you wanted?
Did you use any tricks to avoid changing the two programs?
(2) Separation
Analysis. For each document in “topics.1-50”,
find its least similar document. Hand
in the list. Compute the separation value
of each document as the similarity value between the document and its least
similar document (other than itself).
Which document has the highest separation from its peers? Repeat for “topics.51-100”.
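The per-document step of the separation analysis can be sketched as below, assuming the pairwise similarities are already stored in a row-major n x n matrix `sim`. The function names and the output format are hypothetical, ties are broken in favor of the lowest document index, and deciding which document has the "highest separation" from the resulting list is left to you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns the index of the document least similar to `doc` (excluding `doc`
 * itself) and stores that similarity in *separation. Ties go to the lowest
 * index. `sim` is a row-major n x n similarity matrix. */
size_t least_similar_document(const double *sim, size_t n, size_t doc,
                              double *separation)
{
    assert(sim != NULL && separation != NULL && n > 1 && doc < n);
    size_t least = (doc == 0) ? 1 : 0; /* first candidate other than doc */
    for (size_t k = 0; k < n; k++) {
        if (k != doc && sim[doc * n + k] < sim[doc * n + least]) {
            least = k;
        }
    }
    *separation = sim[doc * n + least];
    return least;
}

/* Prints one line per document: its least similar peer and the
 * corresponding separation value (illustrative format only). */
void print_separation_list(const double *sim, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        double separation;
        size_t least = least_similar_document(sim, n, j, &separation);
        printf("[%zu]-[%zu]: separation [%f]\n", j, least, separation);
    }
}
```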
(3) Diversity
Analysis. Compute and hand in the list of all
document similarity values that are greater than 0.10 (with associated document
pair) for “topics.1-50”. Do not include
document similarity values for comparing a document with itself, and do not
include the reversed pair. For example,
if you get
[29]-[30]: similarity [0.345532]
[30]-[29]: similarity [0.345532]
[32]-[35]: similarity [0.140970]
[35]-[35]: similarity [1.000000]
Then, the revised list is:
[29]-[30]: similarity [0.345532]
[32]-[35]: similarity [0.140970]
Compute the diversity of
“topics.1-50” as 1/(sum of similarity values of the above list). Repeat for “topics.51-100”. Which document set is more diverse?
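A minimal sketch of the diversity computation follows, again assuming a precomputed row-major n x n similarity matrix `sim`; the function name is hypothetical. Looping only over pairs with k > j skips both self-comparisons and reversed pairs. For the two-pair revised list shown above, the sum is 0.345532 + 0.140970 = 0.486502, so the diversity would be about 2.055. The handout does not say what to report when no pair exceeds the threshold; returning infinity in that case is an assumption of this sketch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Diversity = 1 / (sum of similarity values greater than `threshold`),
 * counting each unordered pair of distinct documents once. */
double collection_diversity(const double *sim, size_t n, double threshold)
{
    assert(sim != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++) {
        for (size_t k = j + 1; k < n; k++) { /* k > j: no self or reversed pairs */
            if (sim[j * n + k] > threshold) {
                sum += sim[j * n + k];
            }
        }
    }
    /* Assumption: with no qualifying pair the formula divides by zero,
     * so this sketch reports infinite diversity. */
    return (sum > 0.0) ? 1.0 / sum : INFINITY;
}
```

The same function can be run on the matrices for “topics.1-50” and “topics.51-100” and the two results compared directly.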
(4) Topic
Analysis. For the list that you have obtained for item
(3) above, find out the tag values for <dom> and <title> of each
pair of similar documents. What can you
say about the domain topics and the topic titles between the similar
documents? Do they make sense? If yes, why? If not, why? And if not,
then why are the two documents computed to be similar? Do this for all document pairs for the list
of “topics.1-50”. Based on your
analysis, discuss whether the above document similarity measure is appropriate.
(5) For
Graduates Only: Topic Analysis 2. Repeat item (3) for “topics.51-100”. In addition, correlate your findings to
justify your answer to item (4). Do the
diversity measures correctly reflect your intuition? Do they correlate with your evaluation? Is the diversity equation valid? Are domain topics
important? Are topic titles
important? Are they indicative of what
the documents describe? Which document
set would give better retrieval results when precision is
important? Why?
(6) A brief report that includes: (a) the
description of your implementation approach, (b) the instructions on how to run
your programs, (c) answers/responses to items (1)-(5) above, and (d) the
printout of your programs as the appendix.
(7) Turn in your homework electronically
using the handin account and a hardcopy of your brief report and test
results in class.
(8) You
must document your programs clearly.
The assignment is due at 9:30 a.m. on November 4, 2003, at
the beginning of class. The following
table specifies the penalties for late homework.
| Time Turned In | Penalty |
| 9:30 a.m. – 9:35 a.m. (11/4/2003) | None |
| 9:35 a.m. – 10:45 a.m. (11/4/2003) | Lose 10% |
| 10:45 a.m. – 5:00 p.m. (11/4/2003) | Lose 20% |
| Later than 5:00 p.m. (11/4/2003) | Not accepted |
Grading
(1) 40% Program Correctness (including the accessibility of the programs)
(2) 10% Software Design
(3) 10% Programming Style
(4) 10% Testing
(5) 30% Documentation (in-program documentation and brief report)