CSCE 410/810 Information Retrieval

 

Forum Questions

 

October 23, 2003

 

The following questions have been categorized into 10 groups:

 

·        Indexing

·        Relevance Computation and Ranking

·        Strategies and Models

·        Speed and Operational Issues

·        Multimedia

·        Updates

·        Storage

·        User Feedback

·        Future Directions

·        Miscellaneous


Indexing

 

CW1.   When I search for my name in Google, I always find a lot of other people's websites ranked higher than mine. If I want to be ranked No. 1, what should I do?  //The feedback evaluation mechanism of Google

 

CC4.    What kind of indexing structure does Google use?

 

SW4.    How can I make my homepage be found and ranked highly in Google?

 

DL2.    Are the categories and links in the Google Directory section automatically created or does someone oversee this and manually choose these?


 

Relevance Computation and Ranking

 

CC17.  What ranking algorithm does Google use?

 

JK5.     How do you discover techniques that exploit Google's ranking algorithms?

 

JK1.     Please explain the PageRank algorithm.

 

YL2.    Google uses the number of page links as the criterion for the importance of a web page. We know this technique is novel and effective, especially for HTML web pages. There are three questions related to this technique:

(1)               More and more web pages are now written in XML. Are page links still an effective criterion for XML?

(2)               Is the number of page-links a fair criterion? You know, sometimes a unique idea is more valuable.

(3)               For dynamic web pages, as opposed to static HTML web pages, how can you count their page links? By dynamic web pages, I mean pages that are created dynamically by database queries.

 

SW2.    How many factors are used in Google's search algorithm to determine the ranking of search results? Among them, how many depend on the link structure of web pages?

 

SW3.    Does Google rank a page based only on the visible text? That is, are the meta tags and comments in the HTML file ignored? Isn't that a big loss?

 

MN4.   How does Google’s PageRank work?  What types of algorithms are used?

 

MN2.   How does Google try to combat keyword overloading (i.e., hiding irrelevant words on a web page to generate hits)?

 

JM4.    What ranking algorithm is used by google.com? Is the ranking done by considering the number of hits to a page?

 

JM8.    What exactly is the PageRank technique? How does it take hyperlinks into account? Other than hyperlinks, does it take into account anything else, like pictures, sounds, etc.?

 

JM9.    In today’s web-based information retrieval how far do the traditional measures like recall and precision help in evaluating the efficiency of an information retrieval system?

 

JW1.    According to your website, at the core of Google is the PageRank algorithm that is combined with text-matching techniques.  How are these two combined?

 

DL6.    Can you explain the details of PageRank and how this data is stored within the Google data structure?


 

Strategies and Models

 

JM1.    As far as I know, the retrieval strategy for google.com is to retrieve as many documents as possible, many of which (especially the lower ranked ones) turn out to be rather unrelated to the query. Why is such a strategy chosen?

 

CC9.    Does Google use clustering? If yes, could you explain the clustering mechanism?

 

CC10.  How does Google handle hierarchically structured data? How does Google judge at which level to retrieve?

 

CC12.  Do you think current IR is heading in the right direction (more guessing at what the user wants)? Can ontologies help in IR systems (more understanding of what the user wants)?

 

CC5.    Does Google use other search engines to help it search?

 

CC8.    Does Google use any distributed technology, such as a distributed database or a distributed search engine?

 

CC16.  Does Google use any web mining technology?

 

YL1.    There are several models of Information Retrieval, such as the Boolean Model, the Vector Model and the Probabilistic Model. Which model is currently used by Google? Or are you using a hybrid model?

 

YL4.    We know the tradeoff between storage space and search efficiency is a recurring topic in the Information Retrieval area. Based on my understanding, storage is very cheap now, so I guess the more important concern in current information retrieval is the efficiency of search. Am I right?

 

LI3.      Which retrieval models are used in the Google search engine?

 

LI4.      Which is more important for Google, recall or precision?


 

Speed and Operational Issues

 

CC2.    What techniques does Google use to achieve such a fast response time? Could you give a general overview of the structure of the Google search engine?

 

MN6.   How is Google able to organize web pages to return the results of my search over billions of pages within fractions of a second?

 

MN7.   How many server locations does Google use? 

 

CC14.  Do you think it is really necessary to return over 10 pages of links in order to guarantee high recall? Doesn't this waste the user's time?


 

Multimedia

 

CW3.    Will Google provide a picture search service in the future (the user inputs a picture and Google returns related pictures)?

 

CC11.  Does Google do research in the image retrieval area? If yes, what is the latest news?

 

YL3.    Google is very good at text-based search. How about image and video search? Do you use the same technique for image and video search, or do you use other novel techniques? What is the most difficult part in image and video search area?

 

SW1.   Google does well in text retrieval, but its image retrieval is a little weak. It seems Google retrieves images based only on the image file names and the link strings, much as it does for text. Does/Will Google use computer vision techniques to enhance the performance of its image retrieval?

 

JM5.    For retrieval purposes, does google.com depend only on the textual part of the document, or also on pictures, sounds, etc.? Is a different strategy used when searching for sound/picture files?

 

DL7.    How is storing image data and ranking images different from storing text web pages?  How are you able to search this?

 

LI1.      Will the Google search engine deal with multimedia files?


 

Updates

 

CW4.   New websites appear every day. How does Google find out about them? In other words, if I build a good new website, how does Google first learn of it?

 

CC1.    The content of some web pages changes every day. How does a search engine keep up with the constant updates and know what a page is about? If you rely on the metadata, how do you filter out fake metadata?

 

CC6.    How frequently does Google update its database? How big is the database in terms of bytes?

 

OK1.    How does Google discover new documents and new servers?

 

OK3.   Does Google ever check for broken links? If yes, how and how often, and what updates at Google follow?

 

MN3.   How do spiders work?

 

JW4.    When spidering for new or updated pages, is the live database updated in realtime or is a second offline database updated then switched?  Also, how do you deal with the switchover or race conditions to database records while searches are ongoing?

 

DL4.    How does Google “scan” other sites and keep its database up to date?


 

Storage

 

CW5.   Where is the data stored? Do different countries have their own databases, or are all the databases in the US? Further, if different places have their own databases, are the databases the same or different? //since people in different countries have different preferences.

 

CW6.   Which kind of database is Google using: Oracle, DB2 or SQL Server? Is it stable?

 

CW2.   Some web pages are cached in Google, but others are not. Which kinds of pages are cached?

 

JK6.     How much of the IR system data structure at Google is stored in memory and how much on disk?

 

CC3.    What database is Google using? Oracle?

 

OK2.    How much information about a document is stored at Google and what is this information?

 

OK4.    Could you give an overview of the data structure of the Google IR system? I am also interested in implementation approaches.

 

JM3.    Are the documents to be searched by google.com stored in distributed databases? How does that complicate speedy retrieval and how are these complications resolved?

 

JW2.    The text-matching techniques used in the Google engine appear to be a reverse indexing method.  How is such a large database managed?

 

JW3.    The backend for Google must be huge.  How do you deal with possible data corruption, backups, etc. while keeping near 100% uptime?

 

DL5.    How is the Google database organized?  What type of structure does the system use to hold all of this data?  How much computing power and storage does it take to run Google?


 

User Feedback

 

JK4.     Does Google use query modification? If yes, is it based on user feedback and what attributes are used?

 

CC13.  What techniques does google use for collecting user feedback?

 

MN8.   Does Google plan to implement the ability for a profile to be created for an individual, and for further searches to be compared to the profile to allow more individual results personalized for the user?


 

Future Directions

 

JK2.     Are you limited in what you can tell us about Google's IR techniques? If yes, why?

 

JK3.     What do you believe is the next revolutionary development in IR? What in the near future will advance it the most?

 

CC7.    What is the next generation of search engines? In what direction is IR developing?

 

YL5.    This question is not related to information retrieval. Instead, it is about the research direction of Google. This year, researchers at Google published a paper about a file system at ACM SOSP, which is the flagship conference in the operating systems area. Does this mean Google is trying to explore research directions beyond information retrieval?

 

MN5.   Where do you see the future of search engines going?

 

JM10.  What new techniques for more efficient retrieval are being explored now?

 

JW5.    Google is expanding into other areas, such as near-realtime automated news rankings, images, business product searches, and special topic searching (for example, Linux and Apple).  What possible new areas are on the horizon for information retrieval outside of Google on the Internet, peer-to-peer, etc.?

 

JW6.    An API has been described on Google. What exactly does this include, and what projects are using this?

 

DL8.    I understand that the Google programming API was released previously for a programming contest.  Is this still available to the public or to classes such as our own?

 

LI5.      What's the next goal of the Google search engine?


 

Miscellaneous

 

CC15.  How does Google make money?

 

MN1.   What is the percentage of “paid inclusion” compared to free?  (Percentage returned in searches and percentage of income for Google)

 

JM2.    What preprocessing operations, if any, are done on the documents to be searched by google.com? What data-structures are used to store the index terms or keywords (if they are used)?

 

JM6.    Which languages (like English, Italian, etc.) does google.com support? Does it follow the same strategies for all languages? Does it support languages with different scripts (e.g. Japanese)? Can a query in English retrieve documents written in other languages (e.g. Spanish)?

 

JM7.    In what respects do you think google.com is a better search engine than the other search engines? For what purpose do you think it is best suited (e.g. searching for general information, a specific topic, journal papers, etc.)?

 

JM11.  The number of documents on the web is increasing almost exponentially every day. Is the framework for google.com scalable enough to accommodate this expansion? What techniques are used to achieve such scalability?

 

JM12.  How can web-based information retrieval be made more intelligent?

 

DL1.    Google seems to already have a pretty robust search engine.  What do programmers and other people working on the information retrieval aspects of Google do?

 

DL3.    Google seems to be available in many different languages.  How is this done and more specifically how are searches in a different language handled?  If I search in English for a keyword and then search in Spanish for the same keyword will I get the same results?

 

LI2.      Will the Google search engine cooperate with P2P searching?