CSCE 410/810

Homework Assignment 1:  Review

 

September 10, 2003

 

Problem

 

The goal of this exercise is to familiarize yourself with searching the Web for downloadable scientific papers and with the related issues, preferences, and problems.  There are two exercises: a background search and a target search.

 

Review on Searches

 

The following graph shows the average number of documents retrieved for the background search exercise.

 

Graph 1:  Results of Background Search

 

The following graph shows the total time spent by each of the 11 students on the target search exercise.

 

Graph 2:  Results of Target Search

 

 

Useful sites for searching for articles in the area of information retrieval:

 

www.google.com

www.profusion.com

dblp.uni-trier.de

citeseer.nj.nec.com/cs

iris.unl.edu

ciir.cs.umass.edu

www.acm.org/dl

www.acm.org/sigir

www.teoma.com

 

Useful keywords:

 

“pdf”

“papers”

 

Some conclusions from Students:

 

1.                  If it is difficult to find a paper by its title, try its author.  The author’s homepage normally offers the entire article or useful links to it.

2.                  When more than one keyword is given to the search engine, one may use the symbols “+” and “-” to require or exclude a keyword; both are supported by Google and ProFusion.

3.                  When looking for research papers, refer to the citations of the current paper to find more relevant ones easily.

4.                  The keywords you enter in your search will make or break your results.  I like to start out broad and then pull my search in tighter depending on the kinds of sites that appear unrelated to my query.

5.                  There are many search engines out there.  Some of them are good at searching for general information (e.g., google.com).  Some of them are good at searching for specific information (e.g., citeseer.nj.nec.com for CS-related papers).  As users, we need to know which engine is good at what and choose the right engine to use.

6.                  Try combining different search engines.  This way, a user could have more information and more precise queries.  Using the information we obtain from one search engine on another search engine could yield better results.

7.                  Results can be very different from what you have expected.  It is a challenging task (almost impossible) to create a perfect engine that will give the best result every time a user queries. 

8.                  Looking for very precise and well-defined information can be time consuming.  The more precision is required, the more difficult it is to locate the correct information.

9.                  Access is not always free.  Viewing articles at www.elsevier.nl requires a membership.  The journal “Information Retrieval” charges on a “pay as you see” basis for its archives, so articles published in 2001 are not free while those published in 2003 are.

10.              In both exercises I found that the typical Yahoo or Google type search engines were useful.  Neither provided a good list of actual PDF articles, but both could point me to somewhere that was a good source.  The most common places to find the articles were the database/bibliography search engines such as the Computer Science Bibliography at the University of Trier and CiteSeer.  These sites have already narrowed the articles down to those involved in computer science, so finding articles there was easy.

11.              I found that using “Information Retrieval” alone as a keyword on Yahoo or Google was insufficient.  “Information Retrieval” had to be accompanied by more specific terms; otherwise a large number of matches would not be relevant.  However, when using search engines such as the CS Bibliography at the University of Trier and CiteSeer, searching for “Information Retrieval” alone was sufficient to produce many good hits.  (Note:  Do you know why?)

12.       By using broad terms, I was able to retrieve many articles, but they were not always specific to the topic I was looking for.  By giving a long query with as much information as I had available, the information retrieval engine was able to give some good results in the first 10 hits out of millions.  (Note:  There is a very important lesson here!  The more keywords I use, the more documents will be retrieved.  However, though I will be given too many documents, I will also be given a better set of top-ranked documents!  Do you know why?)

13.       I was surprised, however, at the number of dead links on the site where I found the IR papers.  I believe that this can be explained by the “out-datedness” of the page (the page mentioned that it was last updated in 1996).  If this is the case (i.e., if there is a correlation between information “corruption” and the “oldness” of the document), it is interesting that google.com returned such an old document to me as its highest-ranked result.  (Note:  So, what does this imply?  Should we also use “oldness” as a retrieval criterion?)

14.       Although the specialized sites have a lot of scientific articles, they would be more useful if they had more advanced search tools.  Sometimes these sites are confusing to users because they do not offer easy links to the tools or simply do not have them.

15.       The index terms used in a query must be chosen carefully, because they dramatically influence the results.  Scientific and academic terms are more useful in specialized search engines.  (Note:  Do you know why?)
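
Conclusions 2 and 12 can be illustrated with a toy example.  The sketch below is not how Google or ProFusion work internally.  It assumes a tiny, invented document collection, treats a plain keyword as optional, a “+” keyword as required, and a “-” keyword as excluded, and ranks documents simply by how many of the query keywords they contain.  The names DOCS and search are hypothetical and exist only for this illustration.

```python
# Toy collection (hypothetical documents, invented for illustration only).
DOCS = {
    "d1": "information retrieval evaluation with precision and recall",
    "d2": "information theory and coding",
    "d3": "retrieval of archived tax records",
    "d4": "probabilistic models for information retrieval pdf",
    "d5": "gardening tips for spring",
}


def search(query):
    """Return (doc_id, score) pairs, best match first.

    Assumed query syntax: "+word" must appear, "-word" must not appear,
    and a plain word is optional. The score is the number of distinct
    required or optional query words found in the document.
    """
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)

    results = []
    for doc_id, text in DOCS.items():
        words = set(text.split())
        if excluded & words:
            continue  # document contains an excluded word
        if not required <= words:
            continue  # document misses a required word
        score = len((required | optional) & words)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


broad = search("information")
longer = search("information retrieval precision recall")
filtered = search("+retrieval -tax information")
```

With this toy scoring, the broad query matches three documents that all tie with a score of 1, while the longer query matches four documents and places the one containing all four keywords at the top.  The third query shows “+” and “-” removing documents that lack “retrieval” or mention “tax”.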

 

Some observations:

 

1.                  Few students used homepage search sites to locate articles by particular authors.

2.                  Usually, students went through 3-4 levels of links to obtain articles.

3.                  Often, all 10 articles were downloaded from the same site, the so-called “jackpot”.

4.                  Some students were careful and actually reviewed the articles to confirm their relevance.  Most students trusted the sites and the titles of the articles when retrieving relevant documents.

5.                  Most students adopted the same search strategy: a general search followed by several iterations of more-specific searches.

6.                  Many students failed to search for the important, detailed information: page numbers of a paper, the volume and issue of a journal, etc.  Finding these details would make the search time consuming!

7.                  Few broken links and dead-ends were encountered.

 

Some questions not addressed:

 

1.                  Was recall or precision more important in this exercise?  Why?

2.                  How important was the ability to browse the retrieved documents in helping you search?  Did it allow you to pay less attention to the keywords you used? 

3.                  How important was the speed of the search engines in choosing your search strategy? 

4.                  How much did you rely on the interactions with the search engine to advance your search?  What if there weren’t any hyperlinks?  What if there weren’t any additional information (caption) describing each retrieved document?

5.                  Was it easier to search for technical, scientific papers than to search for other non-technical, non-scientific items?  If so, why?

6.                  Is the target search closer to database retrieval or information retrieval?
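
Question 1 refers to recall and precision.  As a reminder of the standard definitions, here is a minimal sketch that computes both for a single query.  The paper identifiers and the assumed count of 20 relevant papers are hypothetical numbers chosen for illustration, not data from the exercise.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 10 papers downloaded, 6 of them relevant,
# out of 20 relevant papers assumed to exist on the Web.
downloaded = [f"paper{i}" for i in range(10)]
truly_relevant = [f"paper{i}" for i in range(4, 24)]
p, r = precision_recall(downloaded, truly_relevant)
```

In this made-up case precision is 6/10 and recall is 6/20.  Note that on the Web the total number of relevant documents is rarely known, so recall can usually only be estimated.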

 

Karen Sparck Jones (1999).  Information Retrieval and Artificial Intelligence, Artificial Intelligence, 114(1-2):257-281.