CSCE 410/810

Homework Assignment 3

 

September 30, 2003

 

Problem

 

Create a document reference retrieval system based on a simple Vector Space Model using an inverted file indexing scheme.

 

Perform experiments in (a) parametric tuning of your vector-space-modeled IR system, and (b) computing recall and precision values and plotting precision vs. recall curves.

 

The goal of this exercise is to learn how to implement a retrieval system based on a simple Vector Space Model and evaluate the retrieval results in terms of recall and precision.

 

This assignment counts 10% towards your grade.

 

Requirements

 

(1)               Use the Keen document collection stored in the file “Keen.dat”.  Download this file from the class website at www.cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03.  You used this collection in Homework Assignment 2.  You are also required to reuse the programs that you created for the inverted index file structure in Homework Assignment 2.

 

(2)               Create for each index (or keyword) a weight: the importance of a keyword in representing a particular document.

 

 

            $w_{ij} = tf_{ij} \times idf_i$

            where $w_{ij}$ is the weight of keyword i in representing document j, $tf_{ij}$ is the term frequency of keyword i in document j, and $idf_i$ is the inverse document frequency of keyword i.  In our case, the term frequency is either 0 or 1, since the file “Keen.dat” does not contain term frequency information.  The inverse document frequency is the reciprocal of the number of documents in which the keyword occurs: if a keyword i occurs in 4 documents, then $idf_i$ is 1/4 = 0.25.  You are required to use these weights for the various features of your retrieval system in this assignment.  Note:  If a keyword does not appear in any of the documents, then its weight is 0.
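The weighting rule in item (2) can be sketched in a few lines of Python.  This is only an illustration: the function name `keyword_weights`, the dictionary layout of the inverted index, and the toy postings are assumptions and are not taken from Keen.dat or from your Homework Assignment 2 programs.

```python
def keyword_weights(inverted_index):
    """Return {keyword: {doc_id: weight}} with weight = tf * idf.

    Assumes tf is 1 for every posting (binary term frequency) and
    idf is 1 / (number of documents that contain the keyword).
    """
    weights = {}
    for keyword, doc_ids in inverted_index.items():
        postings = set(doc_ids)
        if not postings:
            # A keyword that appears in no document has weight 0 everywhere.
            weights[keyword] = {}
            continue
        idf = 1.0 / len(postings)
        weights[keyword] = {doc_id: 1 * idf for doc_id in postings}
    return weights


# Hypothetical toy index: keyword -> documents that contain it.
toy_index = {"13": [588, 600, 572, 487], "78": [588, 600], "99": []}
toy_weights = keyword_weights(toy_index)
print(toy_weights["13"][588])  # keyword "13" occurs in 4 documents
```

Storing only the non-zero weights per keyword keeps the structure close to an inverted file; a missing entry is read as weight 0.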

 

(3)               Create a similarity measurement for each document j given a query (with a set of search words).  Each query q has a weight vector of length N:

            $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{Nq})$

            where N is the total number of keywords in the Keen.dat file.  A query q consists of several search words, so only those search words have non-zero weight values in the weight vector $\vec{q}$.  For example, suppose the total number of keywords in the entire document collection is 5, and the keywords are {college football Nebraska ncaa cornhuskers}; then N = 5.  Now, suppose the query B consists of {college ncaa}; then the weight vector for query B is:

            $\vec{B} = (1, 0, 0, 1, 0)$.

 

The weight values are either 0 or 1 since we do not weight our search words.  Similarly, each document j has a weight vector of the same length:

            $\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{Nj})$

where N is once again the total number of keywords in the Keen.dat file.  A document j consists of several keywords, so only those keywords have non-zero weight values in the weight vector $\vec{d_j}$.  Using the same example, suppose that the document C has keywords {college football cornhuskers}; then the weight vector for document C is:

            $\vec{C} = (w_{1C}, w_{2C}, 0, 0, w_{5C})$

where each $w_{iC}$ is given by the equation described in item (2) above.
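The two weight vectors for the five-keyword example can be built as follows.  This is a minimal sketch: the helper names and the idf values in `ASSUMED_IDF` are made up for illustration, since the example does not say in how many documents each keyword occurs.

```python
VOCABULARY = ["college", "football", "Nebraska", "ncaa", "cornhuskers"]

# Hypothetical idf values (1 / number of documents containing the keyword).
ASSUMED_IDF = {
    "college": 0.5,
    "football": 0.25,
    "Nebraska": 1.0,
    "ncaa": 0.5,
    "cornhuskers": 0.2,
}


def query_vector(search_words, vocabulary=VOCABULARY):
    """Query weights are 1 for search words and 0 for every other keyword."""
    wanted = set(search_words)
    return [1.0 if keyword in wanted else 0.0 for keyword in vocabulary]


def document_vector(doc_keywords, idf, vocabulary=VOCABULARY):
    """Document weights are tf * idf with tf = 1 for keywords in the document."""
    present = set(doc_keywords)
    return [idf.get(keyword, 0.0) if keyword in present else 0.0
            for keyword in vocabulary]


vec_b = query_vector(["college", "ncaa"])
vec_c = document_vector(["college", "football", "cornhuskers"], ASSUMED_IDF)
print(vec_b)
print(vec_c)
```

For a collection the size of Keen.dat, a sparse representation (a dictionary of non-zero weights) is likely to be more practical than a full-length list, but the dense form mirrors the notation used above.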

 

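Now, the similarity measure of a document j to a query q is:

            $sim(d_j, q) = \frac{\sum_{i=1}^{N} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{N} w_{ij}^2} \times \sqrt{\sum_{i=1}^{N} w_{iq}^2}}$                                                (1)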

 

            After computing the similarity measure, one can compare it against a threshold: if the measure is greater than or equal to the threshold, then the document is relevant; otherwise, it is not.  In this manner, one is able to retrieve relevant documents and output them in a ranked order based on their respective similarity values.
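            The thresholding and ranking step can be sketched as below.  The sketch assumes that Eq.(1) is the usual cosine measure between the document and query weight vectors; the function names, the document identifiers, and the weight values are hypothetical.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = (math.sqrt(sum(d * d for d in doc_vec))
            * math.sqrt(sum(q * q for q in query_vec)))
    return dot / norm if norm > 0 else 0.0


def retrieve(doc_vectors, query_vec, threshold, top_n=None):
    """Keep documents with similarity >= threshold, best first, at most top_n."""
    scored = [(doc_id, cosine_similarity(vec, query_vec))
              for doc_id, vec in doc_vectors.items()]
    kept = [(doc_id, sim) for doc_id, sim in scored if sim >= threshold]
    # Sort by decreasing similarity; break ties by document id for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept if top_n is None else kept[:top_n]


# Made-up document vectors over a five-keyword vocabulary.
docs = {
    "C": [0.5, 0.25, 0.0, 0.0, 0.2],
    "D": [0.0, 0.0, 1.0, 0.0, 0.0],
    "E": [0.5, 0.0, 0.0, 0.5, 0.0],
}
query_b = [1.0, 0.0, 0.0, 1.0, 0.0]
ranked = retrieve(docs, query_b, threshold=0.2)
for rank, (doc_id, sim) in enumerate(ranked, start=1):
    print(rank, doc_id, round(sim, 2))
```

Applying the threshold before the top-N cut means that fewer than N documents are returned when too few pass the threshold.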

 

            Using the same example, we can compute the similarity of document C to query B, and assuming that

 

 

(4)               Create a menu-based query interface.  The menu should consist of the following options: (a) enter a query, (b) change the similarity threshold, (c) retrieve top N documents, and (d) quit. 

 

(a)        You no longer need to input “or” or “and” operators between keywords.  You are required to output a ranked list of the retrieved documents.  Each retrieved document must be accompanied by its similarity value.

 

(b)        The user can specify the similarity threshold described above.  You are required to use that similarity threshold as the cutoff point in your retrieval.

 

(c)        The user can specify the number, N, of top-ranked documents to retrieve.  You are required to retrieve at most that number of top-ranked documents.  Note:  If fewer than N documents remain after thresholding, then you return only those documents.

 

            The following is an example:

 

          $ run-my-IR-system

            Welcome to my IR system. 

            ------------------------------------------------

Please choose from one of the following options:

(1)            Enter a query

(2)            Change the similarity threshold

(3)            Retrieve top N documents

(4)            Quit

Please enter an option number:

1

            Please enter a query:

13 78

            Ranked Retrieved documents are:

1.                588 2.39

2.                600 2.15

3.                572 1.57

4.                487 1.35

------------------------------------------------

Please choose from one of the following options:

(1)            Enter a query

(2)            Change the similarity threshold

(3)            Retrieve top N documents

(4)            Quit

Please enter an option number:

2

            Please enter a similarity threshold:

2.5

            Similarity threshold has been set to 2.5

------------------------------------------------

Please choose from one of the following options:

(1)            Enter a query

(2)            Change the similarity threshold

(3)            Retrieve top N documents

(4)            Quit

Please enter an option number:

3

            Please enter the number of documents you want to retrieve:

3

            The number of documents to be retrieved has been set to 3

------------------------------------------------

Please choose from one of the following options:

(1)            Enter a query

(2)            Change the similarity threshold

(3)            Retrieve top N documents

(4)            Quit

Please enter an option number:

4

Thank you for using my IR system.  Goodbye.
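One possible way to organize the menu loop shown in the session above is sketched here.  Everything in it is an assumption made for illustration: the hooks `read_line`, `write_line`, and `search`, the starting threshold and top-N values, and the scripted input.  Passing the input and output functions in as parameters lets the loop be exercised without a keyboard.

```python
MENU = ("(1) Enter a query  (2) Change the similarity threshold  "
        "(3) Retrieve top N documents  (4) Quit")


def run_menu(read_line, write_line, search):
    """Drive the four-option menu.

    read_line() returns the next input string, write_line(text) prints it,
    and search(words, threshold, top_n) returns ranked (doc_id, similarity)
    pairs.  All three are hypothetical hooks supplied by the caller.
    """
    threshold, top_n = 0.0, 10  # assumed starting values
    write_line("Welcome to my IR system.")
    while True:
        write_line(MENU)
        option = read_line().strip()
        if option == "1":
            write_line("Please enter a query:")
            words = read_line().split()
            hits = search(words, threshold, top_n)
            for rank, (doc_id, sim) in enumerate(hits, start=1):
                write_line(f"{rank}. {doc_id} {sim:.2f}")
        elif option == "2":
            write_line("Please enter a similarity threshold:")
            threshold = float(read_line())
            write_line(f"Similarity threshold has been set to {threshold}")
        elif option == "3":
            write_line("Please enter the number of documents you want to retrieve:")
            top_n = int(read_line())
            write_line(f"The number of documents to be retrieved has been set to {top_n}")
        elif option == "4":
            write_line("Thank you for using my IR system.  Goodbye.")
            return
        else:
            write_line("Unrecognized option.")


def fake_search(words, threshold, top_n):
    """Stand-in for the real retrieval routine; replays the sample results."""
    hits = [(588, 2.39), (600, 2.15), (572, 1.57), (487, 1.35)]
    return [hit for hit in hits if hit[1] >= threshold][:top_n]


scripted = iter(["1", "13 78", "2", "2.5", "3", "3", "4"])
output = []
run_menu(lambda: next(scripted), output.append, fake_search)
print("\n".join(output))
```

In an interactive program, `input` and `print` can be passed in place of the scripted hooks.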

           

(5)               Critical Analysis.  Originally, the similarity measure given in the assignment was as follows:

 

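            $sim(d_j, q) = \sum_{i=1}^{N} w_{ij} \times w_{iq}$,                                                             (2)

where $w_{iq}$ is the weight of a search word i in query q, such that $w_{iq}$ is always 1.  Furthermore, the above equation is not normalized.  Answer the following questions: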

(a)    Is Eq.(2) better than Eq.(1) in some way?  If so, how?  If not, why not?

(b)    With Eq.(2), is it more difficult or easier to select a suitable similarity threshold value?  (Hint: What are the bounds of the similarity values computed based on the above equation?)

(c)    Is it possible to determine a universal similarity threshold value for all queries for our Keen.dat using Eq.(2)? 

(d)    Suppose the following: (i) there are three queries, , , and , and (ii) a document C that has only one keyword: .  Now the question is:  Which query among the three yields the highest similarity when matched against document C (using Eq.2)?  Is there something wrong with Eq.(2)?

(e)  Suppose the following: (i) there are five keywords in the entire document collection: {college football Nebraska ncaa cornhuskers}, (ii) a query E has three search words: {college football ncaa}, (iii) a document P has two keywords:  {college football} with , and (iv) a document R has one keyword: {ncaa} with .  Now the question is: Using Eq.(1), given the query E, which document is more relevant? Document P or document R?  Does it make sense? Then, repeat using Eq.(2).  What does this tell you about Eq.(1)?  Is there something wrong with Eq.(1)?  Does it make sense?

 

Hand In

 

(1)               A retrieval session log for the following queries (with similarity threshold 0.1 and number of documents to be retrieved 10)

 

a.       38 73 281 290

b.      73 281

c.       281 290

d.      340 468 645

e.       340 468

 

(2)        Repeat the above for similarity thresholds 0.2 and 0.3.  Plot a graph that shows the actual numbers of documents retrieved versus the similarity values.  Note:  You may use Matlab, Excel, etc. to plot the graph, and the graph must be included in your brief report.  Your retrieval session log should show that the user is able to change the similarity threshold.
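The data points for such a graph can be tabulated before plotting.  The sketch below is one possible reading of the task (one count per threshold, capped at the top-N limit); the function name and the similarity values are invented for illustration and are not results from Keen.dat.

```python
def counts_by_threshold(similarities, thresholds=(0.1, 0.2, 0.3), top_n=10):
    """Number of documents retrieved at each threshold, capped at top_n."""
    return {
        threshold: min(top_n, sum(1 for sim in similarities if sim >= threshold))
        for threshold in thresholds
    }


# Made-up similarity values for the documents that match one query.
sample = [0.82, 0.41, 0.25, 0.19, 0.12, 0.05]
counts = counts_by_threshold(sample)
for threshold, count in counts.items():
    print(threshold, count)
```

The resulting pairs can be pasted into Excel or Matlab, one series per query.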

 

(3)        Experiments:  Recall and Precision.  Use the Boolean-modeled IR system that you built in Homework Assignment 2 for the following queries:

 

a.       38 or 73 or 281 or 290

b.      73 or 281

c.       281 or 290

d.      340 or 468 or 645

e.       340 and 468

 

Analyze the retrieved results in the following manner. 

 

(a)                First, treat the documents retrieved by your Boolean-modeled IR system as the ground-truth data.  That is, if query a returns 5 documents (12, 22, 34, 90, 144), then those documents become the set of known relevant documents.

 

(b)        Then given the retrieved results (from your vector-space-modeled IR system that uses 0.2 as the similarity threshold and 10 as the number of documents to be retrieved), compute the recall and precision values.  This means that you will have to plot the recall vs. precision curves for the 11 standard recall levels.  You are required to generate five graphs, one for each query. 

 

            For example, using the same example in (3.a), but now suppose query a returns the following ranked documents: 22, 349, and 12 for the vector-space-modeled IR system.  Then, you have 100% precision at 20% recall, 66.7% precision at 40% recall, and so on.
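            The numbers in this example can be checked with a short script.  It is a sketch only: the function names are hypothetical, and the interpolation rule used for the 11 standard recall levels (the precision at a level is the highest precision observed at any recall greater than or equal to that level) is the common convention and is assumed here rather than taken from the assignment.

```python
def precision_recall_points(ranked, relevant):
    """(recall, precision) after each relevant document in the ranked list."""
    relevant = set(relevant)
    points, hits = [], 0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    return points


def interpolated_precision(points, levels=11):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    curve = []
    for k in range(levels):
        level = k / (levels - 1)
        # Small tolerance guards against floating-point round-off at the level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


points = precision_recall_points([22, 349, 12], [12, 22, 34, 90, 144])
curve = interpolated_precision(points)
print(points)
print([round(p, 3) for p in curve])
```

Recall levels that the ranked list never reaches (50% and above in this example) get precision 0 under this rule.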

 

(c)        For graduate students only: In the above experiment, which query (a, b, c, d, or e) do you think returns the weakest result in terms of your confidence in the relevance of the retrieved documents, the number of documents retrieved, the precision at the standard recall levels, and so on?  Please support your response with arguments and findings.  (Hint:  Does the Boolean operator have anything to do with it?)  Similarly, do so for the query that you think returns the strongest result.  You may use other measures such as “Average Precision at Seen Relevant Documents,” “R-Precision,” and “Harmonic Mean,” as discussed in class, to help justify your response.
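The three single-value measures named above can be computed as in the following sketch.  The definitions used are the common textbook ones and may differ in detail from those given in class; the function names are hypothetical, and the data reuses the ranked list and relevant set from the example in (b).

```python
def average_precision_at_seen_relevant(ranked, relevant):
    """Mean of the precision values observed at each relevant document seen."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranked, relevant):
    """Precision over the first R ranked documents, R = number of relevant docs."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked[:r] if doc_id in relevant) / r


def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision; 0 if either value is 0."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)


ranked_docs = [22, 349, 12]
relevant_docs = [12, 22, 34, 90, 144]
avg_seen = average_precision_at_seen_relevant(ranked_docs, relevant_docs)
r_prec = r_precision(ranked_docs, relevant_docs)
f_value = harmonic_mean(0.4, 2 / 3)  # recall and precision at the third document
print(round(avg_seen, 3), round(r_prec, 3), round(f_value, 3))
```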

 

(4)        A brief report that includes: (a) the description of your implementation approach, (b) the instructions on how to run your programs, (c) the number of documents retrieved versus the similarity value graph, (d) the graphs of recall vs. precision curves and a discussion of the experiments, (e) the response to the Critical Analysis requirement (#5), and (f) the printout of your programs as the appendix. 

 

(5)        Turn in your homework electronically using the handin account, and turn in a hardcopy of your brief report and test results in class.

 

(6)        You must document your programs clearly.

 

The assignment is due at 9:30 a.m. on October 14, 2003, at the beginning of class.  The following table specifies the penalties for late homework.

 

Time Turned In                             Penalty
9:30 a.m. – 9:35 a.m. (10/14/2003)         None
9:35 a.m. – 10:45 a.m. (10/14/2003)        Lose 10%
10:45 a.m. – 5:00 p.m. (10/14/2003)        Lose 20%
Later than 5:00 p.m. (10/14/2003)          Not accepted

 

Grading

 

(1)        40% Program Correctness (including the accessibility of the programs)

(2)               10% Software Design

(3)               10% Programming Style

(4)               10% Testing

(5)               30% Documentation (in-program documentation and brief report)