CSCE 410/810
Homework
Assignment 3
September 30, 2003
Problem
Create a document reference retrieval system based
on a simple Vector Space Model using an inverted file indexing scheme.
Perform experiments in (a) parametric tuning of your
vector-space-modeled IR system, and (b) computing recall and precision
values and plotting precision vs. recall curves.
The goal of this exercise is to learn how to
implement a retrieval system based on a simple Vector Space Model and evaluate
the retrieval results in terms of recall and precision.
This assignment counts 10% towards your grade.
Requirements
(1)
Use the Keen document collection stored in the file “Keen.dat”. Download this file from the class
website at www.cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03. You used this collection in Homework
Assignment 2. You are also required to use the programs that you created for the inverted index
file structure in Homework Assignment 2.
(2)
Create for each index (or keyword) a weight: the importance of a keyword in representing a particular
document.
$$w_{ij} = tf_{ij} \times idf_i$$
where $w_{ij}$ is the weight of keyword i in representing document j, $tf_{ij}$ is the term frequency
of keyword i in document j, and $idf_i$ is the inverse document frequency of keyword i. In our case,
the term frequency is either 0 or 1, since the file “Keen.dat” does not contain term frequency
information. Also, suppose a keyword i occurs in 4 documents. Then $idf_i$ is 1/4 = 0.25. You are
required to use these weights for the various features of your retrieval system in this
assignment. Note: If a keyword does not appear in any of the
documents, then its weight is 0.
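The weighting rule in item (2) can be sketched as follows. This is a minimal illustration, not a required design: it assumes the inverted index from Homework Assignment 2 can be viewed as a mapping from each keyword to the set of documents containing it, and the name `build_weights` and the sample index are hypothetical.

```python
from collections import defaultdict


def build_weights(inverted_index):
    """Return {doc_id: {keyword: weight}} using binary tf and idf = 1 / df.

    inverted_index is assumed to map keyword -> set of document ids.
    """
    weights = defaultdict(dict)
    for keyword, docs in inverted_index.items():
        if not docs:
            # A keyword that appears in no document has weight 0,
            # so nothing is stored for it.
            continue
        idf = 1.0 / len(docs)          # e.g. 4 documents -> 0.25
        for doc in docs:
            tf = 1                     # Keen.dat carries no term frequencies
            weights[doc][keyword] = tf * idf
    return dict(weights)


# Made-up sample index, for illustration only.
sample_index = {"13": {1, 2, 3, 4}, "78": {2}, "99": set()}
sample_weights = build_weights(sample_index)
```

Storing only the non-zero weights per document keeps the structure sparse; a missing keyword is read as weight 0.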
(3)
Create a similarity measurement for each document j given a query (with a set of search
words). Each query q has a same-length weight vector:
$$\vec{w}_q = (w_{1q}, w_{2q}, \ldots, w_{Nq})$$
where N is the total number of keywords in the Keen.dat file. Now,
a query q consists of several search words. As a result, only those search words of the query have
non-zero weight values in the weight vector $\vec{w}_q$. For example,
suppose the total number of keywords in the entire document collection is 5,
and the keywords are {college football Nebraska ncaa cornhuskers}; then N =
5. Now, suppose the query B
consists of {college ncaa}; then the weight vector for query B is:
$$\vec{w}_B = (1, 0, 0, 1, 0)$$
The weight values are either 0s or
1s since we do not weight our search words.
Similarly, each document j has a same-length weight vector:
$$\vec{w}_j = (w_{1j}, w_{2j}, \ldots, w_{Nj})$$
where N is once again the
total number of keywords in the Keen.dat file.
Now, a document j consists of several keywords. As a result, only those keywords of the
document have non-zero weight values in the weight vector $\vec{w}_j$. Using the same
example, suppose that the document C has keywords {college football
cornhuskers}; then the weight vector for document C is:
$$\vec{w}_C = (w_{1C}, w_{2C}, 0, 0, w_{5C})$$
where $w_{iC}$ is given by the
equation described in item (2) above.
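The two weight vectors in the five-keyword example can be built as in the sketch below. The helper names are hypothetical, and the weights given to document C are made-up values, since the handout does not state them.

```python
# Keyword order fixes the position of each component in every vector.
KEYWORDS = ["college", "football", "Nebraska", "ncaa", "cornhuskers"]


def query_vector(search_words, keywords=KEYWORDS):
    """Binary vector: 1 where the keyword is a search word, else 0."""
    wanted = set(search_words)
    return [1 if keyword in wanted else 0 for keyword in keywords]


def document_vector(doc_weights, keywords=KEYWORDS):
    """Weight vector: the document's weight for each keyword, 0 if absent."""
    return [doc_weights.get(keyword, 0.0) for keyword in keywords]


query_b = query_vector(["college", "ncaa"])

# Assumed weights for document C, for illustration only.
doc_c = document_vector({"college": 0.25, "football": 0.5, "cornhuskers": 1.0})
```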
Now, the similarity measure of a
document j to a query q is:
. (1)
After computing the similarity
measure, one can compare it against a threshold: if the measure is greater than
or equal to the threshold, then the document is relevant; otherwise, it is
not. In this manner, one is able to
retrieve relevant documents and output them in a ranked order based on their
respective similarity values.
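To make the threshold test concrete, the sketch below scores a document against a query with two common vector-space measures: the unnormalized inner product and the cosine. Both are stand-ins chosen for illustration and are not claimed to be the similarity measure this assignment specifies; the function names and the weights of document C are assumptions.

```python
import math


def inner_product(doc_vec, query_vec):
    """Unnormalized similarity: sum of component-wise products."""
    return sum(d * q for d, q in zip(doc_vec, query_vec))


def cosine(doc_vec, query_vec):
    """Inner product divided by the product of the two vector lengths."""
    norm = (math.sqrt(sum(d * d for d in doc_vec))
            * math.sqrt(sum(q * q for q in query_vec)))
    return inner_product(doc_vec, query_vec) / norm if norm else 0.0


def is_relevant(similarity, threshold):
    """A document is relevant when its similarity meets the threshold."""
    return similarity >= threshold


query_b = [1, 0, 0, 1, 0]             # {college ncaa}
doc_c = [0.25, 0.5, 0.0, 0.0, 1.0]    # assumed weights for document C
```

Only the shared keyword (college) contributes to either score, because every other product has a zero factor.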
Using the same example, we can compute
the similarity of document C to query B, and assuming that ![]()

(4)
Create a menu-based query interface. The menu should consist of the following
options: (a) enter a query, (b) change the similarity threshold, (c) retrieve
top N documents, and (d) quit.
(a) You no longer need to input “or” or “and” operators between keywords. You are
required to output a ranked list of the retrieved documents. Each retrieved
document must be accompanied by its similarity value.
(b) The user can specify the similarity threshold described above. You are
required to use that similarity threshold as the cutoff point in your
retrieval.
(c) The user can specify the number, N, of top-ranked documents to retrieve.
You are required to retrieve at most that number of top-ranked
documents. Note: If, after thresholding, fewer than N documents remain,
then you return only those documents.
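Options (b) and (c) combine into one retrieval step: apply the threshold first, rank what remains, then keep at most N documents. The sketch below assumes the similarity values have already been computed into a dictionary; the name `retrieve` and the tie-breaking rule (smaller document number first) are assumptions. The scores reuse the numbers shown in the example session.

```python
def retrieve(scores, threshold, top_n):
    """Return up to top_n (doc, similarity) pairs with similarity >= threshold,
    ordered from most to least similar."""
    kept = [(doc, sim) for doc, sim in scores.items() if sim >= threshold]
    # Sort by descending similarity; break ties by ascending document id.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:top_n]


scores = {487: 1.35, 572: 1.57, 588: 2.39, 600: 2.15}

top_three = retrieve(scores, threshold=1.5, top_n=3)
# Fewer than N documents survive a stricter threshold, so only those are returned.
strict = retrieve(scores, threshold=2.0, top_n=3)
```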
The following is an example:
$ run-my-IR-system
Welcome to my IR system.
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the similarity threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number: 1
Please enter a query: 13 78
Ranked retrieved documents are:
1. 588 2.39
2. 600 2.15
3. 572 1.57
4. 487 1.35
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the similarity threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number: 2
Please enter a similarity threshold: 2.5
Similarity threshold has been set to 2.5
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the similarity threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number: 3
Please enter the number of documents you want to retrieve: 3
The number of documents to be retrieved has been set to 3
------------------------------------------------
…
…
…
Please choose from one of the following options:
(1) Enter a query
(2) Change the similarity threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number: 4
Thank you for using my IR system. Goodbye.
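One way to structure such a menu loop is sketched below. It is only an illustration: `run_menu`, the injected `read`, `write`, and `search` callables, and the starting threshold and N are all assumptions, and `fake_search` stands in for a real retrieval function. Passing the input and output functions as arguments lets the loop run from a scripted session as well as from the keyboard.

```python
def run_menu(read, write, search):
    """Menu loop. read() returns one input line, write(text) shows one line,
    and search(words, threshold, top_n) returns ranked (doc, similarity) pairs."""
    threshold, top_n = 0.0, 10   # assumed starting values
    menu = ("(1) Enter a query\n(2) Change the similarity threshold\n"
            "(3) Retrieve top N documents\n(4) Quit")
    while True:
        write(menu)
        write("Please enter an option number:")
        option = read().strip()
        if option == "1":
            write("Please enter a query:")
            results = search(read().split(), threshold, top_n)
            write("Ranked retrieved documents are:")
            for rank, (doc, sim) in enumerate(results, start=1):
                write(f"{rank}. {doc} {sim:.2f}")
        elif option == "2":
            write("Please enter a similarity threshold:")
            threshold = float(read())
            write(f"Similarity threshold has been set to {threshold}")
        elif option == "3":
            write("Please enter the number of documents you want to retrieve:")
            top_n = int(read())
            write(f"The number of documents to be retrieved has been set to {top_n}")
        elif option == "4":
            write("Thank you for using my IR system. Goodbye.")
            return
        else:
            write("Unrecognized option; please enter 1, 2, 3, or 4.")


def fake_search(words, threshold, top_n):
    """Stand-in retrieval function returning fixed, already-ranked scores."""
    ranked = [("588", 2.39), ("600", 2.15), ("572", 1.57), ("487", 1.35)]
    return [(doc, sim) for doc, sim in ranked if sim >= threshold][:top_n]


# Scripted session: query, raise the threshold, query again, quit.
script = iter(["1", "13 78", "2", "2.0", "1", "13 78", "4"])
log = []
run_menu(lambda: next(script), log.append, fake_search)
```

For interactive use, the same loop can be started with `run_menu(input, print, your_search_function)`.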
(5)
Critical Analysis. Originally,
the similarity measure given in the assignment was as follows:
, (2)
where
such that
is always 1. Furthermore, the above equation is not
normalized. Answer the following
questions:
(a) Is Eq. (2) better than Eq. (1) in some way? If so, how? If not, why not?
(b) With Eq. (2), is it more difficult or easier to select a suitable similarity
threshold value? (Hint: What are the bounds of the similarity values computed
with that equation?)
(c) Is it possible to determine a universal similarity threshold value for all queries
on our Keen.dat using Eq. (2)?
(d)
Suppose
the following: (i) there are three queries,
,
, and
,
and (ii) a document C that has only one keyword:
. Now the question is: Which query among the three yields
the highest similarity when matched against document C (using
Eq.2)? Is there something wrong with
Eq.(2)?
(e) Suppose the following: (i) there are five
keywords in the entire document collection: {college football Nebraska ncaa
cornhuskers}, (ii) a query E has three search words: {college football
ncaa}, (iii) a document P has two keywords: {college football} with
, and (iv) a document R has one keyword: {ncaa} with
. Now the question
is: Using Eq.(1), given the query E, which document is more relevant?
Document P or document R?
Does it make sense? Then, repeat using Eq.(2). What does this tell you about Eq.(1)? Is there something wrong with Eq.(1)? Does it make sense?
Hand In
(1) A retrieval session log for the following queries (with similarity threshold 0.1
and number of documents to be retrieved 10):
a. 38 73 281 290
b. 73 281
c. 281 290
d. 340 468 645
e. 340 468
(2) Repeat the above for similarity
thresholds 0.2 and 0.3. Plot a graph
that shows the actual numbers of documents retrieved versus the similarity
values. Note: You may use Matlab, Excel, etc. to plot the graph, and the graph
must be included in your brief report.
Your retrieval session log should demonstrate that the user can
change the similarity threshold.
(3) Experiments: Recall and Precision. Use
the Boolean-modeled IR system that you built in Homework Assignment 2 for the
following queries:
a. 38 or 73 or 281 or 290
b. 73 or 281
c. 281 or 290
d. 340 or 468 or 645
e. 340 and 468
Analyze the retrieved results in
the following manner.
(a) First, treat the documents retrieved by your Boolean-modeled IR system as the
ground-truth data. That is, if query a returns 5 documents (12, 22, 34, 90, 144),
then those documents become the set of known relevant documents.
(b) Then, given the retrieved results (from
your vector-space-modeled IR system that uses 0.2 as the similarity threshold
and 10 as the number of documents to be retrieved), compute the recall and
precision values. This means that you
will have to plot the recall vs. precision curves for the 11 standard recall levels.
You are required to generate five graphs, one for each query.
For example, using the same example
as in (3.a), suppose query a now returns the following ranked
documents from the vector-space-modeled IR system: 22, 349, and 12. Then you have
100% precision at 20% recall,
66.7% precision at 40% recall, and so on.
(c) For graduate students only:
In the above experiment, which query (a, b, c, d,
or e) do you think returns the weakest result in terms of your
confidence in the relevance of the retrieved documents, the number of documents
retrieved, the precision at the standard recall levels, and so on? Please support
your response with arguments
and findings. (Hint: Does the Boolean operator have anything to
do with it?) Do the same for the
query that you think returns the strongest result. You may use other measures such as
“Average Precision at Seen Relevant Documents,” “R-Precision,” and “Harmonic
Mean,” as discussed in class, to help justify your response.
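The worked example in (b) and the measures named in (c) can be checked with the sketch below. The function names are hypothetical. The interpolation rule (precision at a standard recall level is the highest precision observed at any recall greater than or equal to that level) and the definitions of the three extra measures follow common textbook forms and are assumptions here; use the definitions given in class where they differ.

```python
def precision_recall_points(ranked, relevant):
    """(recall, precision) after each relevant document seen in the ranking."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate_11_points(points):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 (0 if never reached)."""
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


def average_precision_at_seen(points):
    """Mean of the precision values observed at each relevant document seen."""
    return sum(p for _, p in points) / len(points) if points else 0.0


def r_precision(ranked, relevant):
    """Precision over the first R ranked documents, R = number of relevant ones."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:len(relevant)] if doc in relevant) / len(relevant)


def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision; 0 when both are 0."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


relevant = [12, 22, 34, 90, 144]   # Boolean result taken as ground truth
ranked = [22, 349, 12]             # vector-space ranking from the example
points = precision_recall_points(ranked, relevant)
curve = interpolate_11_points(points)
```

Here `points` reproduces the figures in the example (100% precision at 20% recall, 66.7% at 40%), and `curve` holds the 11 values to plot for one query.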
(4) A brief report that includes: (a) the
description of your implementation approach, (b) the instructions on how to run
your programs, (c) the number of documents retrieved versus the similarity
value graph, (d) the graphs of recall vs. precision curves and a discussion of
the experiments, (e) the response to the Critical Analysis requirement (#5),
and (f) the printout of your programs as the appendix.
(5) Turn in your homework electronically
using the handin account and a hardcopy of your brief report and test
results in class.
(6) You
must document your programs clearly.
The assignment is due at 9:30 a.m. on October 14, 2003, at
the beginning of class. The
following table specifies the penalties for late homework.
| Time Turned In | Penalty |
| --- | --- |
| 9:30 a.m. – 9:35 a.m. (10/14/2003) | None |
| 9:35 a.m. – 10:45 a.m. (10/14/2003) | Lose 10% |
| 10:45 a.m. – 5:00 p.m. (10/14/2003) | Lose 20% |
| Later than 5:00 p.m. (10/14/2003) | Not accepted |
Grading
(1) 40% Program Correctness (including the accessibility of the programs)
(2) 10% Software Design
(3) 10% Programming Style
(4) 10% Testing
(5) 30% Documentation (in-program documentation and brief report)