CSCE 410/810
Homework
Assignment 1: Review
September 10, 2003
Problem
The goal of this exercise is to familiarize yourself
with searching the Web for downloadable scientific papers and with the related
issues, preferences, and problems. There are
two exercises: a background search and a target search.
Review on Searches
The following graph shows the average number of
documents retrieved for the background search exercise.

Graph 1: Results of Background Search
The following graph shows the total time spent by each of the
11 students on the target search exercise.

Graph 2: Results of Target Search
Useful sites for searching for articles in the area of information
retrieval:
www.profusion.com
dblp.uni-trier.de
citeseer.nj.nec.com/cs
iris.unl.edu
ciir.cs.umass.edu
www.acm.org/dl
www.acm.org/sigir
www.teoma.com
Useful keywords:
“pdf”
“papers”
Some conclusions from students:
1. If it is difficult to find a paper by its title, try searching by its
author. Normally, the entire article or useful links can be found on the
author’s homepage.
2. When more than one keyword is given to the search engine, one may use
the symbols “+” and “-”, which are supported by Google and ProFusion.
3. When looking for research papers, refer to the citations of the current
paper to find more relevant ones easily.
4. The keywords you enter in your search will make or break your results.
I like to start out broad and then tighten my search based on the kinds of
unrelated sites that show up in the results.
5. There are many search engines out there. Some of them are good at
searching for general information (e.g., google.com). Some of them are good at
searching for specific information (e.g., CiteSeer for CS-related papers). As
users, we need to know which engine is good at what and choose the right
engine to use.
6. Try combining different search engines. This way, a user can gather more
information and form more precise queries. Using the information obtained
from one search engine in a query to another could yield better results.
7. Results can be very different from what you expected. It is a
challenging (almost impossible) task to create a perfect engine that gives the
best result for every user query.
8. Looking for very precise and well-defined information can be
time-consuming. The more precision is required, the more difficult it is to
locate the correct information.
9. Access is not always free. Viewing articles at www.elsevier.nl requires
a membership. The journal “Information Retrieval” charges on a “pay as you see”
basis for its archives, so articles published in 2001 are not free while those
published in 2003 are.
10. In both exercises I found that typical search engines such as Yahoo or
Google were useful. Neither actually provided a good list of actual PDF
articles, but both could point me to a good source. The most common places to
find the articles were the database/bibliography search engines such as the
Computer Science Bibliography at the University of Trier and CiteSeer. These
sites have already narrowed the articles down to those related to computer
science, so finding articles there was easy.
11. I found that using “Information Retrieval” alone as a keyword on Yahoo
or Google was insufficient. “Information Retrieval” had to be accompanied by
more specific terms; otherwise, a large number of matches would not be
relevant. However, on search engines such as the CS Bibliography at the
University of Trier and CiteSeer, searching for “Information Retrieval” was
sufficient to produce many good hits. (Note: Do you know why?)
12. By using broad terms, I was able to retrieve many articles, but they
were not always specific to the topic I was looking for. By giving a long query
with as much information as I had available, the information retrieval engine
was able to give some good results in the first 10 hits out of millions.
(Note: There is a very important lesson here! The more keywords I use, the
more documents will be retrieved. However, though I will be given too many
documents, I will also be given a better set of top-ranked documents! Do you
know why?)
13. I was surprised, however, at the number of dead links on the site where
I found the IR papers. I believe that this can be explained by the
“out-datedness” of the page (the page mentioned that it was last updated in
1996). If this is the case (i.e., if there is a correlation between
information “corruption” and the “oldness” of the document), it is interesting
that google.com returned such an old document as its top-ranked result.
(Note: So, what does this imply? Should we also use “oldness” as a retrieval
criterion?)
14. Although the specialized sites have a lot of scientific articles, they
would be more useful if they had more advanced search tools. Sometimes these
sites are confusing to users because they do not offer easy links to the
tools or simply do not have them.
15. Index terms must be chosen carefully, because they dramatically
influence the results. Scientific and academic terms are more useful in
specialized search engines. (Note: Do you know why?)
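Conclusions 2 and 12 can be illustrated with a small sketch. The Python code below does not describe how Google, ProFusion, or any real engine works. It assumes a toy engine that retrieves every document containing at least one query term, ranks documents by a summed TF-IDF weight, and treats “+term” as required and “-term” as excluded. The collection DOCS and all function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection; the texts are invented for illustration only.
DOCS = {
    "d1": "information retrieval evaluation with precision and recall",
    "d2": "information theory and coding",
    "d3": "probabilistic models for information retrieval ranking",
    "d4": "image retrieval with neural features",
    "d5": "cooking recipes for busy students",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and inverse document frequencies."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}
    return counts, idf


def search(query, docs):
    """Rank documents for a query (assumed semantics, not a real engine's).

    Plain terms are optional, '+term' must appear, '-term' must not appear.
    Documents whose score is zero are not returned.
    """
    counts, idf = build_index(docs)
    required, excluded, optional = [], [], []
    for token in tokenize(query):
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)

    results = []
    for doc_id, term_counts in counts.items():
        # A Counter returns 0 for terms that do not occur in the document.
        if any(term_counts[t] == 0 for t in required):
            continue
        if any(term_counts[t] > 0 for t in excluded):
            continue
        score = sum(term_counts[t] * idf.get(t, 0.0)
                    for t in required + optional)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties are broken by document id.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


if __name__ == "__main__":
    for q in ("information", "information retrieval ranking",
              "information +retrieval", "information -theory"):
        print(q, "->", search(q, DOCS))
```

With these made-up documents, the one-word query "information" retrieves three documents that all receive the same score, while "information retrieval ranking" retrieves four but places d3, the only document containing all three words, at the top. Under the stated assumptions, extra keywords widen the retrieved set and at the same time sharpen the top of the ranking.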
Some observations:
1. Few students used homepage search sites to locate articles by particular
authors.
2. Usually, students went through 3-4 levels of links to obtain articles.
3. Often, all 10 articles were downloaded from the same site, the so-called
“jackpot”.
4. Some students were careful and actually reviewed the articles to make
sure they were relevant. Most students trusted the sites and the titles of
the articles when retrieving relevant documents.
5. Most students adopted the same search strategy: a general search
followed by several iterations of more-specific searches.
6. Many students failed to search for the important, detailed information:
page numbers of a paper, the volume and issue of a journal, etc. These details
would make the search time-consuming!
7. Few broken links and dead-ends were encountered.
Some questions not addressed:
1. Was recall or precision more important in this exercise? Why?
2. How important was the ability to browse the retrieved documents in
helping you search? Did it allow you to pay less attention to the keywords
you used?
3. How important was the speed of the search engines in choosing your
search strategy?
4. How much did you rely on the interactions with the search engine to
advance your search? What if there weren’t any hyperlinks? What if there were
any additional information (caption) describing each retrieved document?
5. Was it easier to search for technical, scientific papers than to search
for other non-technical, non-scientific items? If so, why?
6. Is the target search closer to database retrieval or information
retrieval?
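Question 1 above refers to recall and precision. As a reminder of the standard definitions, the following Python sketch computes both for a single search. The document identifiers are hypothetical and are not taken from the exercise results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: 4 documents retrieved, 3 documents are relevant,
    # and 2 of the retrieved documents are among the relevant ones.
    retrieved = ["a", "b", "c", "d"]
    relevant = ["a", "b", "x"]
    print(precision_recall(retrieved, relevant))
```

In this made-up case, half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). A search that only needs a fixed number of good papers depends mostly on precision, while a search for one specific paper succeeds only if that paper is retrieved at all.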
Karen Sparck Jones (1999). Information Retrieval and Artificial Intelligence.
Artificial Intelligence, 114(1-2):257-281.