CSCE 410/810
Homework
Assignment 2
February 14, 2006
Problem
Create a document reference retrieval system based
on a simple Vector Space Model using an inverted file indexing scheme.
Perform experiments in (a) parametric tuning of your
vector-space-modeled IR system, and (b) computing recall and precision
values and plotting precision vs. recall curves.
The goal of this exercise is to learn how to
implement a retrieval system based on a simple Vector Space Model and evaluate
the retrieval results in terms of recall and precision.
This assignment counts 10% towards your grade.
Requirements
(1) Use the Keen document collection stored in the file “Keen.dat”. Download
this file from the class website at
www.cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03. You used this collection in
Homework Assignment 1. You are also required to use the programs that you
created for the inverted index file structure in Homework Assignment 1.
(2) Create for each index (or keyword) a weight: the importance of a keyword in
representing a particular document.
$w_{ij} = tf_{ij} \times idf_i$
where $w_{ij}$ is the weight of keyword i in representing document j,
$tf_{ij}$ is the term frequency of keyword i in document j, and $idf_i$ is the
inverse document frequency of keyword i. So, in our case, the term frequency is
either 0 or 1 since the file “Keen.dat” does not contain the term frequency
information. Also, suppose a keyword i occurs in 4 documents. The $idf_i$
is then ¼ = 0.25. You are required to use these weights for the
various features of your retrieval system in this assignment. Note:
If a keyword does not appear in any of the documents, then its weight is
0.
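The weighting can be sketched in Python as follows. This is only an illustration: it assumes the inverted index from Homework Assignment 1 is available as a dictionary mapping each keyword to the set of document IDs that contain it. The names `keyword_weights` and `toy_index`, and the toy data, are hypothetical and not taken from Keen.dat.

```python
def keyword_weights(inverted_index):
    """Return {keyword: {doc_id: weight}} with weight = tf * idf.

    Assumptions: tf is 1 when the keyword is listed for a document and
    0 otherwise, and idf = 1 / (number of documents containing the keyword).
    """
    weights = {}
    for keyword, doc_ids in inverted_index.items():
        if not doc_ids:
            # A keyword that appears in no document gets weight 0 everywhere,
            # which is represented here by an empty mapping.
            weights[keyword] = {}
            continue
        idf = 1.0 / len(doc_ids)
        # Only documents that contain the keyword (tf = 1) get a non-zero weight.
        weights[keyword] = {doc_id: idf for doc_id in doc_ids}
    return weights


# Hypothetical toy index: keyword -> set of document IDs.
toy_index = {13: {588, 600, 572, 487}, 78: {588, 600}, 99: set()}
toy_weights = keyword_weights(toy_index)
```

Storing only the non-zero weights keeps the structure as sparse as the inverted index itself, so the full N-length vectors never need to be materialized.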
(3) Create a relevance measurement for each document j given a query (with a
set of search words). Each query q has a same-length weight vector:
$\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{Nq})$
where N is the total number of keywords in the Keen.dat file. Now, a
query q consists of several search words. As a result, only those search words
of the query have non-zero weight values in the weight vector $\vec{q}$.
For example, suppose
the total number of keywords in the entire document collection is 5, and the
keywords are <college football
.
The weight values are either 0s or
1s since we do not weight our search words.
Similarly, each document j has a same-length weight vector:
$\vec{d_j} = (w_{1j}, w_{2j}, \ldots, w_{Nj})$
where N is once again the total number of keywords in the Keen.dat file.
Now, a document j consists of several keywords. As a result, only those
keywords of the document have non-zero weight values in the weight vector
$\vec{d_j}$.
Now, the relevance measure of a document j to a query q is:
[equation not recoverable from this copy] (1)
After computing the relevance
measure, one can compare it against a threshold: if the measure is greater than
or equal to the threshold, then the document is relevant; otherwise, it is
not. In this manner, one is able to
retrieve relevant documents and output them in a ranked order based on their
respective relevance values.
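The scoring and thresholding steps can be sketched as follows. This sketch assumes the relevance measure is the plain inner product of the query and document weight vectors. That is an assumption: if equation (1) specifies a normalized measure such as the cosine, a normalization step would have to be added. The function names and toy weights are hypothetical.

```python
def relevance_scores(query_keywords, weights):
    """Score documents by the inner product of query and document vectors.

    `weights` maps keyword -> {doc_id: weight}. Query weights are 1 for each
    search word and 0 otherwise, so the inner product reduces to summing the
    document weights of the query's keywords. (Assumed relevance measure.)
    """
    scores = {}
    for keyword in set(query_keywords):
        for doc_id, w in weights.get(keyword, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w
    return scores


def rank_relevant(scores, threshold):
    """Keep documents scoring at or above the threshold, highest score first."""
    relevant = [(doc_id, s) for doc_id, s in scores.items() if s >= threshold]
    # Ties are broken by document ID so the ranking is deterministic.
    return sorted(relevant, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy weights: keyword -> {doc_id: weight}.
toy_weights = {
    13: {588: 0.25, 600: 0.25, 572: 0.25, 487: 0.25},
    78: {588: 0.5, 600: 0.5},
}
ranked = rank_relevant(relevance_scores([13, 78], toy_weights), threshold=0.3)
```

Iterating over the posting lists of the query keywords, rather than over all documents, means only documents sharing at least one keyword with the query are ever scored.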
(4) Create a menu-based query interface. The menu should consist of the
following options: (a) enter a query, (b) change the relevance threshold, (c)
retrieve top N documents, and (d) quit.
(a) You no longer need to input “or” or “and” operators between keywords. You
are required to output a ranked list of the retrieved documents. Each retrieved
document must be accompanied by its relevance value.
(b) The user can specify the relevance threshold described above. You are
required to use that relevance threshold as the cutoff point in your retrieval.
(c) The user can specify the number, N, of top-ranked documents to retrieve.
You are required to retrieve at most that number of top-ranked documents. Note:
If, after the thresholding, fewer than N documents remain,
then you only return those documents.
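The interaction between the threshold in (b) and the top-N limit in (c) can be sketched as below: the threshold is applied first, then the list is cut to at most N entries. The function name `retrieve` is hypothetical, and the scored pairs are simply the values shown in the sample session, reused here as illustrative input.

```python
def retrieve(scored_docs, threshold, top_n):
    """Apply the relevance threshold first, then keep at most top_n documents.

    `scored_docs` is a list of (doc_id, relevance) pairs in any order.
    """
    kept = [(doc_id, score) for doc_id, score in scored_docs if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # highest relevance first
    return kept[:top_n]  # slicing returns fewer than top_n if fewer remain


# Illustrative (doc_id, relevance) pairs.
sample = [(572, 1.57), (588, 2.39), (487, 1.35), (600, 2.15)]

top_three = retrieve(sample, threshold=1.0, top_n=3)
above_two = retrieve(sample, threshold=2.0, top_n=3)
```

Here `top_three` holds three documents, while `above_two` holds only the two documents that pass the threshold even though N is 3.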
The following is an example:
$ run-my-IR-system
Welcome to my IR system.
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the relevance threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number:
1
Please enter a query:
13 78
Ranked retrieved documents are:
1. 588 2.39
2. 600 2.15
3. 572 1.57
4. 487 1.35
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the relevance threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number:
2
Please enter a relevance threshold:
2.5
Relevance threshold has been set to 2.5
------------------------------------------------
Please choose from one of the following options:
(1) Enter a query
(2) Change the relevance threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number:
3
Please enter the number of documents you want to retrieve:
3
The number of documents to be retrieved has been set to 3
------------------------------------------------
…
Please choose from one of the following options:
(1) Enter a query
(2) Change the relevance threshold
(3) Retrieve top N documents
(4) Quit
Please enter an option number:
4
Thank you for using my IR system. Goodbye.
Hand In
(1) A retrieval session log for the following queries (with relevance threshold
0.1 and number of documents to be retrieved 10):
a. 38 73 281 290
b. 73 281
c. 281 290
d. 340 468 645
e. 340 468
(2) Repeat the above for relevance thresholds 0.2 and 0.3. Plot a graph
that shows the actual numbers of documents retrieved versus the relevance
values. Note: You may use Matlab, Excel, etc. to plot the graph, and the graph
must be included in your brief report. Your retrieval session log should show
that the user is able to change the relevance threshold.
(3) Experiments: Recall and Precision. Use the Boolean-modeled IR system
that you built in Homework Assignment 1 for the following queries:
a. 38 or 73 or 281 or 290
b. 73 or 281
c. 281 or 290
d. 340 or 468 or 645
e. 340 and 468
Analyze the retrieved results in the following manner.
(a) First, treat the documents retrieved by your Boolean-modeled IR system as
the ground-truth data. That is, if query a returns 5 documents (12, 22, 34, 90,
144), then those documents become the set of known relevant documents.
(b) Then, given the retrieved results (from your vector-space-modeled IR system
that uses 0.2 as the relevance threshold and 10 as the number of documents to
be retrieved), compute the recall and precision values. This means that you
will have to plot the recall vs. precision curves for the 11 standard recall
levels. You are required to generate five graphs, one for each query.
For example, using the same example in (3.a), suppose query a now returns the
following ranked documents from the vector-space-modeled IR system: 22, 349,
and 12. Then you have 100% precision at 20% recall,
66.7% precision at 40% recall, and so on.
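The computation in this example can be sketched as follows. The first function records a (recall, precision) point each time a relevant document is seen in the ranking. The second interpolates to the 11 standard recall levels using the common textbook rule (the precision at level r is the highest precision observed at any recall at or above r, and 0 if there is none). That rule is an assumption, since the handout does not spell it out. The function names are hypothetical.

```python
def recall_precision_points(ranked_docs, relevant_docs):
    """Return (recall, precision) after each relevant document in the ranking."""
    relevant = set(relevant_docs)
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    Assumed rule: precision at level r is the maximum precision at any
    observed recall >= r, or 0.0 when no such point exists.
    """
    levels = [i / 10 for i in range(11)]
    return [
        (r, max((p for rec, p in points if rec >= r), default=0.0))
        for r in levels
    ]


# The example from the text: Boolean ground truth vs. vector-space ranking.
ground_truth = [12, 22, 34, 90, 144]
ranking = [22, 349, 12]
points = recall_precision_points(ranking, ground_truth)
curve = interpolated_precision(points)
```

For this ranking, `points` contains the two pairs given in the text (recall 0.2 with precision 1.0, and recall 0.4 with precision 2/3), and every standard recall level above 0.4 gets precision 0 because no further relevant documents were retrieved.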
(c) Critical Analysis. Which query
gives the best retrieval results? At low
recall? At high recall? Also, identify reasons for query results that
have low precision. Make suggestions as
to how to improve precision, given the same ground-truth data. (Hint: Should any keywords be removed? If yes, which ones and how? Should keyword weights be modified? If yes, which ones and how?) Justify your suggestions.
(d) For Graduate Students Only.
In the above experiment, which query (a, b, c, d, or e) do you think returns
the weakest result in terms of your confidence in the relevance of the
retrieved documents, the number of documents retrieved, the precision at the
standard recall levels, and so on? Please support your response with arguments
and findings. (Hint: Does the Boolean operator have anything to do
with it?) Similarly, do so for the query that you think returns the strongest
result. You may use other measures such as the “Average Precision at Seen
Relevant Documents,” “R-Precision,” and “Harmonic
Mean,” as discussed in class to help justify your response.
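The three single-value measures named above can be sketched as follows. The definitions used here are the usual textbook ones and are an assumption; the definitions given in class take precedence if they differ. The function names are hypothetical, and the data is the illustrative ground truth and ranking from (3.a) and (3.b).

```python
def average_precision_at_seen_relevant(ranked_docs, relevant_docs):
    """Mean of the precision values observed at each relevant document seen."""
    relevant = set(relevant_docs)
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranked_docs, relevant_docs):
    """Precision at rank R, where R is the total number of relevant documents."""
    relevant = set(relevant_docs)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_docs[:r] if doc_id in relevant) / r


def harmonic_mean(recall, precision):
    """Harmonic mean of recall and precision; 0.0 when both are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


ground_truth = [12, 22, 34, 90, 144]
ranking = [22, 349, 12]
avg_p = average_precision_at_seen_relevant(ranking, ground_truth)
r_p = r_precision(ranking, ground_truth)
f_end = harmonic_mean(2 / 5, 2 / 3)  # recall and precision at the end of the ranking
```

Because each measure collapses a whole ranking into one number, the five queries can be compared side by side without reading five separate curves.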
(4) A report that includes: (a) the
description of your implementation approach, (b) the instructions on how to run
your programs, (c) the number of documents retrieved versus the relevance value
graph, (d) the graphs of recall vs. precision curves and a discussion of the
experiments, (e) the response to the Critical Analysis requirement (#3c), and
(f) the printout of your programs as the appendix. (Note: Since the report will
include your critical analysis and more discussion of the experiments, it is
now weighted more heavily than program correctness.)
(5) Turn in your homework electronically using the handin account, and submit a
hardcopy of your brief report and test results in class.
(6) You must document your programs clearly.
The assignment is due at 2:00 p.m. on February 28, 2006, at
the beginning of class.
Grading
(1) 30% Program Correctness (including the accessibility of the programs)
(2) 10% Software Design
(3) 10% Programming Style
(4) 10% Testing
(5) 40% Documentation (in-program documentation and report)