CSCE 410/810 Information Retrieval
Forum Questions
October 23, 2003
The following questions have been categorized into 10 groups:
· Indexing
· Relevance Computation and Ranking
· Strategies and Models
· Speed and Operational Issues
· Multimedia
· Updates
· Storage
· User Feedback
· Future Directions
· Miscellaneous
Indexing
CW1. When I search for my name in Google, I always find many other people's websites ranked higher than mine. If I want to be ranked No. 1, what should I do? //The feedback evaluation mechanism of Google
CC4. What kind of indexing structure does Google use?
SW4. How can I make my homepage be found and ranked high in Google?
DL2. Are the categories and links in the Google Directory section automatically created or does someone oversee this and manually choose these?
Relevance Computation and Ranking
CC17. What ranking algorithm does Google use?
JK5. How do you discover techniques that exploit Google's ranking algorithms?
JK1. Please explain the PageRank algorithm.
YL2. Google uses the number of page links as the criterion for the importance of a web page. We know this technique is novel and effective, especially for HTML web pages. There are three questions related to this technique:
(1) More and more web pages are now written in XML. Are page links still effective for XML?
(2) Is the number of page links a fair criterion? Sometimes a unique idea is more valuable.
(3) For dynamic web pages, as opposed to static HTML web pages, how can you count their number of page links? By dynamic web pages, I mean web pages that are created dynamically by database queries.
SW2. How many factors are used in Google's search algorithm to determine the ranking of search results? Among them, how many depend on the link structure of web pages?
SW3. Does Google rank a page based only on the visible text? That is, are the meta tags and comments in the HTML file ignored? Isn't that a big loss?
MN4. How does Google's PageRank work? What types of algorithms are used?
MN2. How does Google try to combat keyword overloading (i.e., hiding irrelevant words on a webpage to generate hits)?
JM4. What ranking algorithm is used by google.com? Is the ranking done by considering the number of hits to a page?
JM8. What exactly is the PageRank technique? How does it take hyperlinks into account? Other than hyperlinks, does it take into account anything else, such as pictures, sounds, etc.?
JM9. In today's web-based information retrieval, how far do traditional measures like recall and precision help in evaluating the efficiency of an information retrieval system?
JW1. According to your website, at the core of Google is the PageRank algorithm, which is combined with text-matching techniques. How are these two combined?
DL6. Can you explain the details of PageRank and how this data is stored within the Google data structure?
Strategies and Models
JM1. As far as I know, the retrieval strategy for google.com is to retrieve as many documents as possible, many of which (especially the lower ranked ones) turn out to be rather unrelated to the query. Why is such a strategy chosen?
CC9. Does Google use clustering? If yes, can you explain the clustering mechanism?
CC10. How does Google handle hierarchically structured data? How does Google judge at which level to retrieve?
CC12. Do you think current IR is heading in the right direction (more guessing what the user wants)? Can ontology help in an IR system (more understanding of what the user wants)?
CC5. Did Google use other search engines to help search?
CC8. Does Google use any distributed technology, such as a distributed database or a distributed search engine?
CC16. Does Google use any web mining technology?
YL1. There are several models of information retrieval, such as the Boolean model, the vector model, and the probabilistic model. Which model is currently used by Google? Or are you using a hybrid model?
YL4. We know that the tradeoff between storage space and search efficiency is a perennial topic in the information retrieval area. As I understand it, storage is very cheap now, so I guess search efficiency is the more important concern in current information retrieval. Am I right?
LI3. Which retrieval models are used in the Google search engine?
LI4. Which is more important for Google, recall or precision?
Speed and Operational Issues
CC2. What techniques does Google use to achieve such a fast response time? Could you give a general overview of the structure of the Google search engine?
MN6. How is Google able to organize web pages to return the results of my search over billions of pages within fractions of a second?
MN7. How many server locations does Google use?
CC14. Do you think it is really necessary to return over 10 pages of links in order to guarantee high recall? Doesn't this waste the user's time?
Multimedia
CW3. Will Google provide a picture search service in the future (the user inputs a picture and Google returns related pictures)?
CC11. Does Google do research in the image retrieval area? If yes, what is the latest news?
YL3. Google is very good at text-based search. How about image and video search? Do you use the same technique for image and video search, or do you use other novel techniques? What is the most difficult part of image and video search?
SW1. Google does well in text retrieval, but its image retrieval is a little weak. It seems Google retrieves images based only on the image file names and the link strings, much as it does for text. Does/Will Google use computer vision techniques to enhance the performance of its image retrieval?
JM5. For retrieval purposes, does google.com depend only on the textual part of the document, or also on pictures, sounds, etc.? Is a different strategy used when searching for sound/picture files?
DL7. How is storing image data and ranking images different from storing text web pages? How are you able to search this?
LI1. Will the Google search engine deal with multimedia files?
Updates
CW4. New websites appear every day. How does Google know about them? In other words, if I build a good new website, how does Google first learn about it?
CC1. The content of some web pages changes every day. How does a search engine keep up with the constant updates and know what a page is about? If you rely on the metadata, how do you filter out fake metadata?
CC6. How frequently does Google update its database? How big is the database in terms of bytes?
OK1. How does Google discover new documents and new servers?
OK3. Does Google ever check for broken links? If yes, how and how often, and what updates at Google follow?
MN3. How do spiders work?
JW4. When spidering for new or updated pages, is the live database updated in real time, or is a second offline database updated and then switched in? Also, how do you deal with the switchover or race conditions on database records while searches are ongoing?
DL4. How does Google “scan” other sites and keep its database up to date?
Storage
CW5. Where is the data stored? Do different countries have their own databases, or are all the databases in the US? Further, if different places have their own databases, are the databases the same or different? //since people in different countries have different preferences.
CW6. Which kind of database is Google using: Oracle, DB2, or SQL Server? Is it stable?
CW2. Some web pages are cached in Google, but others are not. Which kinds of pages are cached?
JK6. How much of the IR system data structure at Google is stored in memory and how much on disk?
CC3. What database is Google using? Oracle?
OK2. How much information about a document is stored at Google and what is this information?
OK4. Can you give an overview of the data structure of the Google IR system? I am also interested in implementation approaches.
JM3. Are the documents to be searched by google.com stored in distributed databases? How does that complicate speedy retrieval and how are these complications resolved?
JW2. The text-matching techniques used in the Google engine appear to be an inverted indexing method. How is such a large database managed?
JW3. The backend for Google must be huge. How do you deal with possible data corruption, backups, etc. while keeping near 100% uptime?
DL5. How is the Google database organized? What type of structure does the system use to hold all of this data? How much computing power and storage does it take to run Google?
User Feedback
JK4. Does Google use query modification? If yes, is it based on user feedback and what attributes are used?
CC13. What techniques does Google use for collecting user feedback?
MN8. Does Google plan to let an individual create a profile, so that further searches can be compared to the profile to give results personalized for the user?
Future Directions
JK2. Are you limited in what you can tell us about Google's IR techniques? If yes, why?
JK3. What do you believe is the next revolutionary development in IR? What in the near future will advance it the most?
CC7. What is the next generation of search engines? In what direction is IR developing?
YL5. This question is not related to information retrieval. Instead, it is about the research direction of Google. This year, researchers at Google published a paper about a file system at ACM SOSP, which is the flagship conference in the operating systems area. Does this mean Google is trying to explore other research directions beyond information retrieval?
MN5. Where do you see the future of search engines going?
JM10. What new techniques for more efficient retrieval are being explored now?
JW5. Google is expanding into other areas, such as near-real-time automated news rankings, images, business product searches, and special topic searching (for example, Linux and Apple). What possible new areas are on the horizon for information retrieval outside of Google on the Internet, peer-to-peer, etc.?
JW6. An API has been described on Google. What exactly does this include, and what projects are using this?
DL8. I understand that the Google programming API was released previously for a programming contest. Is this still available to the public or to classes such as our own?
LI5. What's the next goal of the Google search engine?
Miscellaneous
CC15. How does Google make money?
MN1. What is the percentage of “paid inclusion” compared to free? (Percentage returned in searches and percentage of income for Google)
JM2. What preprocessing operations, if any, are done on the documents to be searched by google.com? What data structures are used to store the index terms or keywords (if they are used)?
JM6. Which languages (such as English, Italian, etc.) does google.com support? Does it follow the same strategies for all languages? Does it support languages with different scripts (e.g., Japanese)? Can a query in English retrieve documents written in other languages (e.g., Spanish)?
JM7. In what respects do you think google.com is a better search engine than the other search engines? For what purpose do you think it is best suited (e.g., searching for general information, a specific topic, journal papers, etc.)?
JM11. The number of documents on the web is increasing almost exponentially every day. Is the framework for google.com scalable enough to accommodate this expansion? What techniques are used to incorporate such scalability?
JM12. How can web-based information retrieval be made more intelligent?
DL1. Google seems to already have a pretty robust search engine. What do programmers and other people working on the information retrieval aspects of Google do?
DL3. Google seems to be available in many different languages. How is this done and more specifically how are searches in a different language handled? If I search in English for a keyword and then search in Spanish for the same keyword will I get the same results?
LI2. Will the Google search engine cooperate with P2P searching?