CSCE 410/810

Homework Assignment 4:  Review

 

November 18, 2003

 

By Leen-Kiat Soh and Xin Li

 

Problem

 

The goals of this exercise are (a) to learn how to extract keywords from a textual database and build a system that holds the information (keywords, term frequencies, inverse document frequencies, etc.) for later analysis, (b) to evaluate the similarity between two documents, and (c) to evaluate the diversity of different document collections. 

 

Review

 

Intra-similarity and Precision:

 

In general, if the documents in a document set are similar to one another, then you could get high recall, since a given query would retrieve a large number of documents.

 

In general, if the documents in a document set are dissimilar, then you could get high precision, since a given query would retrieve only a small number of documents, increasing the chance that most of the retrieved documents are relevant.
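To make the two statements above concrete, here is a minimal sketch of how precision and recall are computed for a single query. The function name `precision_recall` and the document ids are made up for illustration; they are not part of the assignment code.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids; four documents are truly relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A broad retrieval (many similar documents match the query).
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

# A narrow retrieval (only a few documents match the query).
narrow = ["d1", "d2"]

print(precision_recall(broad, relevant))   # (0.5, 1.0)
print(precision_recall(narrow, relevant))  # (1.0, 0.5)
```

In this toy case the broad retrieval finds every relevant document but half of what it returns is irrelevant, while the narrow retrieval returns only relevant documents but misses half of them.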

 

The key is this.  Suppose you are an IR system designer.  Your client wants you to design an IR system with high precision.  What are the questions that you should ask about the documents?  Are the documents in the document set very similar or very dissimilar?  If the documents are very similar, will that make your task more difficult?  If the documents are very dissimilar, will that make your task easier?  And if similarity is related to diversity/separation, what can you say about them?

 

Importance of Domains and Titles (Tipster Topics):

 

I have observed several arguments on the importance of domains and titles.

 

The key is this:  Suppose pair 1 has the same (or very similar) titles but different domains, and pair 2 has different titles but the same (or very similar) domain.  Which pair would have the higher similarity?

 

If we do this for all pairs (considered to be similar), and if pair 1 consistently has a higher similarity value than pair 2, then what can we say about the titles?  Can we say that the experts have done a better job at labeling the documents (titles) than at categorizing the documents (domains)?

 

Of course, to answer the above question with higher confidence, we should not use the titles and domains when we extract keywords from the documents; that is, we should extract keywords only from the Description and Narrative sections.
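The comparison described above can be sketched as follows, assuming you have already computed a similarity value for every pair in each group. The function name `summarize` and the scores are made up for illustration only; they are not results from the Tipster topics.

```python
from statistics import mean


def summarize(same_title_sims, same_domain_sims):
    """Compare two groups of pairwise similarity values by their means."""
    mean_title = mean(same_title_sims)
    mean_domain = mean(same_domain_sims)
    return {
        "mean_same_title": mean_title,
        "mean_same_domain": mean_domain,
        "title_higher": mean_title > mean_domain,
    }


# Made-up similarity values, NOT measured data.
# Group 1: pairs with the same title but different domains.
same_title_diff_domain = [0.5, 0.25, 0.75]
# Group 2: pairs with different titles but the same domain.
same_domain_diff_title = [0.25, 0.25, 0.25]

result = summarize(same_title_diff_domain, same_domain_diff_title)
print(result)
```

A difference in means is only a first look; with real data you would also want to check how consistently one group exceeds the other before drawing a conclusion about the experts.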

 

Think about it.

 

Influence of the Stoplist on Similarity Analysis:

 

In general, the similarity between two documents should be fixed. But even when the same tools for stemming and stoplist processing are used, different stoplists can produce different similarity results. In topics.1-50 and topics.51-100, every document has the same set of tags. If these tags are not treated as stop words, every pair of documents will show some similarity, even if very low, just because of these tags. That similarity is not real, since it says nothing about the content of the documents. Thus, we can conclude that, in order to increase the precision of information retrieval, the stoplist should include as many general terms as possible, so that it filters out information that is irrelevant to the document itself.
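The effect of the shared tags can be seen in a small sketch. It uses cosine similarity over raw term counts, which is a simplification and not necessarily the weighting used in the assignment. The function name `cosine`, the two toy documents, and the tag tokens are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a_tokens, b_tokens, stoplist=frozenset()):
    """Cosine similarity over raw term counts, ignoring stoplist terms."""
    a = Counter(t for t in a_tokens if t not in stoplist)
    b = Counter(t for t in b_tokens if t not in stoplist)
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Two toy documents on unrelated subjects. The first four tokens stand in
# for the field tags that every topic document carries (hypothetical tokens).
doc1 = "domain title description narrative satellite launch contracts".split()
doc2 = "domain title description narrative crop disease outbreak".split()
tags = {"domain", "title", "description", "narrative"}

without_tag_filter = cosine(doc1, doc2)
with_tag_filter = cosine(doc1, doc2, stoplist=tags)

print(without_tag_filter)  # nonzero, caused only by the shared tags
print(with_tag_filter)     # 0.0
```

Once the tags are added to the stoplist, the two unrelated documents no longer share any terms, and their similarity drops to zero.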