CSCE 410/810

Homework Assignment 4

 

October 23, 2003

 

Problem

 

Incorporate a stemming algorithm and a stopword-filtering algorithm with an actual textual database to obtain keywords for each document, based on the Vector Space model.

 

Perform experiments in some basic document analyses: document similarity within the same collection and diversity of different collections.

 

The goals of this exercise are (a) to learn how to extract keywords from a textual database and build a system that holds the information (keywords, term frequencies, inverse document frequencies, etc.) for later analysis, (b) to evaluate the similarity between two documents, and (c) to evaluate the diversity of different document collections. 

 

This assignment counts 10% towards your grade.

 

Requirements

 

(1)        Use the Tipster topic document collection stored in the files “topics.1-50” and “topics.51-100”.  Download the files from www.cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03.  These are two files of actual topic document categorizations used for the Text Retrieval Conference (TREC) experiments.  Each file contains 50 document topics.  Each document topic has a set of “tags”:

 

Tag        Meaning
<top>      Beginning of a document topic
<head>     Header of the document topic
<num>      Index number for the document topic
<dom>      The domain of the topic
<title>    The topic title
<desc>     The description of the topic
<narr>     The long version of the description
<con>      Various concepts that the topic covers
<fac>      Possible important indicators
<def>      Definitions of some of the concepts
</top>     End of a document topic

 

(2)               Incorporate the programs stemmer and stopper into your program.  Note that these two programs are written in ‘C’.  As a result, you are strongly encouraged to implement your solution for this homework in ‘C’ or ‘C++’.  Download the two programs from the class website.  These two programs will parse the above Tipster topic files for you.  Use the “stop.wrd” file in the stopper program as your list of stopwords.   The following words are not to be used as keywords:

 

top, topic, head, num, number, dom, domain, desc, description, title, narr, narrative, con, concept, fac, factor, def, definition
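As a rough illustration of the extra filtering step, the sketch below checks a candidate word against the list above. The function name `is_tag_word` and the array `TAG_WORDS` are hypothetical and are not part of the stemmer or stopper programs. It assumes words are already lowercase and that the check runs before stemming; if you check after stemming, the list itself would need to be stemmed first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tag-derived words from the assignment that must never become keywords. */
static const char *const TAG_WORDS[] = {
    "top", "topic", "head", "num", "number", "dom", "domain", "desc",
    "description", "title", "narr", "narrative", "con", "concept",
    "fac", "factor", "def", "definition"
};

/* Hypothetical helper: returns 1 if `word` is on the list, 0 otherwise.
 * Assumes `word` is lowercase and NUL-terminated. */
int is_tag_word(const char *word)
{
    size_t n = sizeof TAG_WORDS / sizeof TAG_WORDS[0];
    size_t i;

    assert(word != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(word, TAG_WORDS[i]) == 0)
            return 1;
    }
    return 0;
}
```

A word would then be kept as a keyword only if it passes both the “stop.wrd” check and this check.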

 

(3)        Compute document similarity between two documents.  First, for each document, obtain its set of unique keywords and the frequency of occurrence of each keyword.  Second, obtain a list of all unique keywords and calculate the inverse document frequency, $\mathit{idf}_i = 1/n_i$, where $n_i$ is the number of hits of keyword i.  Third, compute the document similarity between each pair of documents j and k using:

$$\mathit{sim}(d_j, d_k) = \frac{\sum_i w_{ij}\, w_{ik}}{\sqrt{\sum_i w_{ij}^2}\,\sqrt{\sum_i w_{ik}^2}}$$

where the term weight is:

$$w_{ij} = \mathit{tf}_{ij} \times \mathit{idf}_i,$$

where $w_{ij}$ is the weight of keyword i in representing document j, $\mathit{tf}_{ij}$ is simply the frequency of occurrence of keyword i in document j, and $\mathit{idf}_i$ is the inverse document frequency defined above.

 

(4)               Write the following modules to analyze the documents. 

 

(a)        A module that computes the similarity values for all pairs of documents

(b)        A module that computes the average document similarity value between a document and all other documents.

(c)                A module that finds the document with the smallest average document similarity value.

(d)               A module that finds similar documents for a particular document.  A document is considered to be similar to another document if the document similarity between the two is greater than a threshold.  In this case, set the threshold at 0.10.

 

The following is actual sample output for “topics.1-50”.

 

[0]-[1]: similarity [0.143786]

[3]-[5]: similarity [0.605412]

[4]-[18]: similarity [0.113802]

[9]-[23]: similarity [0.137449]

[15]-[16]: similarity [0.219045]

[18]-[4]: similarity [0.113802]

[23]-[9]: similarity [0.137449]

[27]-[28]: similarity [0.159268]

[28]-[47]: similarity [0.186928]

[29]-[30]: similarity [0.345532]

[32]-[35]: similarity [0.140970]

 

Note:    Some of the above modules may be implemented as one module.
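Modules (b) through (d) could be sketched as below, assuming module (a) has already filled an n-by-n similarity matrix stored row by row in a flat array (entry `sim[j * n + k]` holds the similarity of documents j and k). The function names and the matrix layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* (b) Average similarity between document d and every other document. */
double average_similarity(const double *sim, size_t n, size_t d)
{
    double sum = 0.0;
    size_t k;

    assert(sim != NULL && n > 1 && d < n);
    for (k = 0; k < n; k++) {
        if (k != d)
            sum += sim[d * n + k];
    }
    return sum / (double)(n - 1);
}

/* (c) Index of the document with the smallest average similarity.
 * Ties go to the lowest index. */
size_t least_average_document(const double *sim, size_t n)
{
    size_t best = 0, d;
    double best_avg = average_similarity(sim, n, 0);

    for (d = 1; d < n; d++) {
        double avg = average_similarity(sim, n, d);
        if (avg < best_avg) {
            best_avg = avg;
            best = d;
        }
    }
    return best;
}

/* (d) Stores in out[] the indices of documents whose similarity to d
 * is greater than threshold; returns how many were stored.
 * out[] must have room for n - 1 entries. */
size_t similar_documents(const double *sim, size_t n, size_t d,
                         double threshold, size_t *out)
{
    size_t count = 0, k;

    assert(sim != NULL && out != NULL && d < n);
    for (k = 0; k < n; k++) {
        if (k != d && sim[d * n + k] > threshold)
            out[count++] = k;
    }
    return count;
}
```

For this assignment, `similar_documents` would be called with a threshold of 0.10.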

 

Hand In

 

(1)        Discuss the strategy you used to incorporate the stemmer and stopper programs into your program.  Did you change either program to make it work the way you wanted?  Did you use any tricks to avoid changing the two programs?

 

(2)        Separation Analysis.  For each document in “topics.1-50”, find its least similar document.  Hand in the list.  Compute the separation value of each document as the similarity value between the document and its least similar document (other than itself).  Which document has the highest separation from its peers?  Repeat for “topics.51-100”.
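One possible way to find the least similar document and the separation value is sketched below. It assumes the pairwise similarities are already stored row by row in a flat n-by-n array, and it breaks ties by taking the lowest document index. That tie rule is an assumption, since many pairs may share a similarity of zero. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the document least similar to d, excluding d itself.
 * Ties go to the lowest index (an assumption, not a requirement). */
size_t least_similar(const double *sim, size_t n, size_t d)
{
    size_t best, k;

    assert(sim != NULL && n > 1 && d < n);
    best = (d == 0) ? 1 : 0;            /* first candidate other than d */
    for (k = best + 1; k < n; k++) {
        if (k != d && sim[d * n + k] < sim[d * n + best])
            best = k;
    }
    return best;
}

/* Separation value of d: its similarity to its least similar document. */
double separation(const double *sim, size_t n, size_t d)
{
    return sim[d * n + least_similar(sim, n, d)];
}
```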

 

(3)        Diversity Analysis.  Compute and hand in the list of all document similarity values that are greater than 0.10 (with associated document pair) for “topics.1-50”.  Do not include document similarity values for comparing a document with itself, and do not include the reversed pair.  For example, if you get

 

[29]-[30]: similarity [0.345532]

[30]-[29]: similarity [0.345532]

[32]-[35]: similarity [0.140970]

[35]-[35]: similarity [1.000000]

           

Then, the revised list is:

 

[29]-[30]: similarity [0.345532]

[32]-[35]: similarity [0.140970]  

 

Compute the diversity of “topics.1-50” as 1/(sum of similarity values of the above list).   Repeat for “topics.51-100”.  Which document set is more diverse?
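The diversity computation might look like the sketch below, again assuming a flat row-major n-by-n similarity matrix. Visiting only pairs with j < k skips both self-comparisons and reversed pairs. The assignment does not say what to report when no pair exceeds the threshold, so the return value of -1.0 in that case is a hypothetical sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Diversity = 1 / (sum of similarities greater than threshold),
 * counting each unordered pair once and skipping self-pairs.
 * Returns -1.0 (hypothetical sentinel) if no pair exceeds threshold. */
double diversity(const double *sim, size_t n, double threshold)
{
    double sum = 0.0;
    size_t j, k;

    assert(sim != NULL);
    for (j = 0; j < n; j++) {
        for (k = j + 1; k < n; k++) {   /* j < k: no self or reversed pairs */
            if (sim[j * n + k] > threshold)
                sum += sim[j * n + k];
        }
    }
    return (sum > 0.0) ? 1.0 / sum : -1.0;
}
```

For the two-entry revised list shown above, the sum is 0.345532 + 0.140970 = 0.486502, so the diversity of that short list alone would be roughly 2.06.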

 

(4)        Topic Analysis.  For the list that you have obtained for item (3) above, find out the tag values for <dom> and <title> of each pair of similar documents.  What can you say about the domain topics and the topic titles between the similar documents?  Do they make sense?  If yes, why?  If not, why?  And if not, then why are the two documents computed to be similar?  Do this for all document pairs for the list of “topics.1-50”.  Based on your analysis, discuss whether the above document similarity measure is appropriate. 

 

(5)        For Graduates Only:  Topic Analysis 2.  Repeat item (3) for “topics.51-100”.  In addition, correlate your findings to justify your answer to item (4).  Do the diversity measures correctly reflect your intuition?  Do they correlate with your evaluation?  Is the diversity equation valid?  Are domain topics important?  Are topic titles important?  Are they indicative of what the documents describe?  Which document set would give better retrieval results when precision is important?  Why?

 

(6)        A brief report that includes: (a) the description of your implementation approach, (b) the instructions on how to run your programs, (c) answers/responses to items (1)-(5) above, and (d) the printout of your programs as the appendix.

 

(7)        Turn in your homework electronically using the handin account, and turn in a hardcopy of your brief report and test results in class.

 

(8)        You must document your programs clearly.

 

The assignment is due at 9:30 a.m. on November 4, 2003, at the beginning of class.  The following table specifies the penalties for late homework.

 

Time Turned In                              Penalty
9:30 a.m. – 9:35 a.m. (11/4/2003)           None
9:35 a.m. – 10:45 a.m. (11/4/2003)          Lose 10%
10:45 a.m. – 5:00 p.m. (11/4/2003)          Lose 20%
Later than 5:00 p.m. (11/4/2003)            Not accepted

 

Grading

 

(1)        40% Program Correctness (including the accessibility of the programs)

(2)               10% Software Design

(3)               10% Programming Style

(4)               10% Testing

(5)               30% Documentation (in-program documentation and brief report)