CSCE 410/810
Homework Assignment 4: Review
November 18, 2003
By Leen-Kiat Soh and Xin Li
Problem
The goals of this exercise are (a) to learn how to
extract keywords from a textual database and build a system that holds the
information (keywords, term frequencies, inverse document frequencies, etc.)
for later analysis, (b) to evaluate the similarity between two documents, and
(c) to evaluate the diversity of different document collections.
Review
Intra-similarity and Precision:
In general, if the documents in a document set are similar, then you could get high recall, since a query that matches one document is likely to match many others, and you would be able to retrieve a high number of documents.
In general, if the documents in a document set are dissimilar, then you could get high precision, since a query would retrieve only a low number of documents, increasing the chance that most of the retrieved documents are relevant.
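As a reminder of what the two measures count, here is a minimal sketch in Python. The function name, the document IDs, and the two result lists are made up for illustration; they are not taken from the assignment data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}

# A query that pulls in many documents (typical when documents are similar).
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
# A query that pulls in only a few documents (typical when they are dissimilar).
narrow = ["d1", "d2"]

print(precision_recall(broad, relevant))   # (0.5, 1.0)
print(precision_recall(narrow, relevant))  # (1.0, 0.5)
```

The broad result list reaches full recall at the cost of precision, and the narrow one does the opposite, which is the trade-off described above.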
The key is this. Suppose you are an IR system designer. Your client wants you to design an IR system with high precision. What are the questions that you should ask about the documents? Are the documents in the document set very similar or very dissimilar? If the documents are very similar, will that make your task more difficult? If the documents are very dissimilar, will that make your task easier? And if similarity is related to diversity/separation, what can you say about them?
Importance of Domains and Titles (Tipster Topics):
I have observed several arguments on the importance of domains and titles.
The key is this: Suppose pair 1 has the same title (very similar) but different domains, and pair 2 has different titles but the same domain (very similar). Which pair would have a higher similarity?
If we do this for all pairs (considered to be similar), and if pair 1 consistently has a higher similarity value than pair 2, then what can we say about the titles? Can we say that the experts have done a better job at labeling (titles) the documents than at categorizing (domains) the documents?
Of course, to answer the above question with higher confidence, we should not use the titles and domains when we extract keywords from the documents; that is, we should extract keywords only from the Description and Narrative sections.
Think about it.
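One way to restrict keyword extraction to the Description and Narrative sections is sketched below. It assumes that each field of a topic starts with a tag such as <desc> or <narr> followed by a label, and runs until the next tag. The sample topic text and the function name extract_fields are hypothetical, so check the tag names against the actual topic files before relying on this.

```python
import re


def extract_fields(topic_text, wanted=("desc", "narr")):
    """Return {tag: text} for the wanted fields of one topic.

    Assumption: every field begins with <tag> and runs until the next tag.
    """
    # Splitting on a captured group yields [prefix, tag1, body1, tag2, body2, ...].
    parts = re.split(r"<(/?\w+)>", topic_text)
    fields = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        if tag in wanted:
            # Drop the leading label, e.g. "Description:".
            body = re.sub(r"^\s*\w+:", "", body)
            fields[tag] = " ".join(body.split())
    return fields


# Hypothetical topic, written only to illustrate the assumed layout.
sample = """<top>
<dom> Domain: Science
<title> Topic: Example Title
<desc> Description:
Document discusses an example subject.
<narr> Narrative:
A relevant document must mention the example subject.
</top>"""

fields = extract_fields(sample)
print(fields["desc"])  # Document discusses an example subject.
print(fields["narr"])  # A relevant document must mention the example subject.
```

Keywords extracted from these two fields alone carry no direct contribution from the title or domain, so any similarity that remains between a pair comes from the body text.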
Influence of Stoplist on Similarity Analysis
In general, the similarity between two documents should be fixed. But even when the same tools are used for stemming and stoplist processing, different stoplists can produce different similarity results. In topics.1-50 and topics.51-100, every document has the same set of tags. If these tags are not treated as stop words, every pair of documents will show some similarity, even pairs that otherwise have very little in common, just because of these tags. That similarity is spurious. We can therefore conclude that, to increase the precision of information retrieval, the stoplist should include as many general terms as possible, so that more of the information that is irrelevant to the document's own content is filtered out.
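The effect of the tags can be seen in a small sketch. It uses cosine similarity over raw term counts, which is only one possible similarity measure and not necessarily the one used in the assignment; the two documents and the tag words are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text, stoplist=frozenset()):
    """Lowercase, split on whitespace, and drop words in the stoplist."""
    return [w for w in text.lower().split() if w not in stoplist]


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: the first four words stand in for the shared tags.
doc1 = "domain topic description narrative satellite launch contracts"
doc2 = "domain topic description narrative insider trading cases"
tags = {"domain", "topic", "description", "narrative"}

without_stoplist = cosine(tokenize(doc1), tokenize(doc2))
with_stoplist = cosine(tokenize(doc1, tags), tokenize(doc2, tags))

print(round(without_stoplist, 3))  # 0.571
print(with_stoplist)               # 0.0
```

The two documents share no content words, yet they score about 0.57 when the tags are left in. Adding the tags to the stoplist brings the score down to 0.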