Friday, February 10, 2006

Search Engine Class

It is hard to imagine how people lived in the dark times before search engines were developed. Going back to the early 1990s, we find the first search tools, which were helpful in finding information stored on computer systems. The very first tool used for searching on the Internet was created in 1990 by Alan Emtage, a student at McGill University in Montreal, and was called Archie. This program was able to create a database of file names, or an index of files, but could not read or search the content of the files.

A few years later the first Internet search engine, Wandex, was built from the index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993. Another very early search engine, Aliweb, also appeared in 1993, and still runs today. The first full-text, crawler-based search engine was WebCrawler, which came into use in 1994. Unlike its predecessors, it let users search for any word in any web page, which has been the standard for all major search engines ever since. In 1994 Lycos was started at Carnegie Mellon University, and it became a very popular and commercially successful enterprise.

WebCrawler innovative metasearch technology

Soon after, many search engines appeared and started to compete for popularity. These included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo was started in January of 1994 and was first known as "Jerry's Guide to the World Wide Web". At first it was a directory of other sites, organized in a hierarchy, rather than a searchable index of pages. It was renamed "Yahoo!" shortly thereafter. Today Yahoo is the most visited website on the Internet, with 412 million unique users, $5 billion in revenues, and 11,000 employees.

Search Engine Relationship Chart

Google started in early 1996 as a research project by Larry Page and Sergey Brin, who were postgraduate students at Stanford University in California. The existing search engines at that time ranked results according to how many times the search term appeared on a page. That created a situation where someone could manipulate the search results by repeating specific words on a page in order to appear at the top of the list. Google was the first successful attempt to analyze the relationships and links between websites.
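To make the weakness of counting search terms concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular engine of that era was implemented: the function name `term_frequency_rank` and the two sample pages are made up for this example.

```python
from collections import Counter
import re

def term_frequency_rank(pages, query):
    """Rank page names by how often the query term appears in their text."""
    term = query.lower()
    scores = {}
    for name, text in pages.items():
        # Split the page into lowercase words and count the query term.
        words = re.findall(r"[a-z0-9]+", text.lower())
        scores[name] = Counter(words)[term]
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))

# Hypothetical pages: one honest, one stuffed with the keyword.
pages = {
    "honest.html": "A short guide to planting tulips in autumn.",
    "stuffed.html": "tulips tulips tulips buy tulips cheap tulips now",
}
ranking = term_frequency_rank(pages, "tulips")
print(ranking)
```

The stuffed page comes out on top simply because it repeats the word more often, which is exactly the kind of manipulation described above.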

Convinced that the pages with the most links to them from other highly relevant web pages must be the most relevant pages associated with the search, Page and Brin tested their thesis as part of their studies, and laid the foundation for their search engine. Originally the search engine used the Stanford University website with the domain google.stanford.edu. The domain google.com was registered on September 14, 1997, and the company was incorporated as Google Inc. on September 7, 1998 at a friend's garage in Menlo Park, California.

The original Google website as it looked in 1996.

All search engines today work by storing information about tens of billions of web pages. These pages are retrieved by a web crawler, also called a web spider, which is an automated software agent that follows every link it sees. The contents of each page are then analyzed to determine how it should be indexed. In Google's case the crawling of web pages is performed by a program named Googlebot, which periodically requests new copies of web pages it already knows about. Data about web pages are stored in an index database for use in later queries. Storing such a large amount of information is very costly. Simply storing 10 billion pages of 10 KB each requires 100TB, plus another 100TB or so for indexes, giving a total hardware cost of around $200k for 400 disk drives of 500GB each spread across 100 computers. By the end of 2005 Google claimed that its index had over 25 billion web pages, 1.3 billion images, 1 billion Usenet messages, 6,600 print catalogs, and 4,500 news sources.
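The crawl-then-index cycle described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP requests, and the names `crawl` and `build_index` are invented for the example. It says nothing about how Googlebot itself is built.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("search engines index the web", ["b.example", "c.example"]),
    "b.example": ("a crawler follows every link", ["a.example"]),
    "c.example": ("an index maps words to pages", ["b.example", "d.example"]),
    "d.example": ("pages link to other pages", []),
}

def crawl(start, web):
    """Breadth-first crawl: fetch a page, then queue every link not yet seen."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        fetched[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(fetched):
    """Inverted index: each word maps to the set of URLs that contain it."""
    index = {}
    for url, text in fetched.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index

index = build_index(crawl("a.example", TOY_WEB))
print(sorted(index["index"]))
```

A query is then answered by looking the word up in `index` instead of rereading every page, which is why the index is stored separately from the pages themselves.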

Find what people are searching for with Zeitgeist

Google's popularity grew as people were attracted to its simple and clear design. Most people prefer not to have visual distractions while entering searches on web pages. This appearance was not an original idea and imitated AltaVista's, but it was combined with Google's unique search capabilities. In 2000, Google began selling advertisements associated with search keywords. This strategy was important for creating a financially strong company. The ads were text-based in order to maintain an uncluttered page design and to maximize page loading speed. Keywords were sold with bidding starting at $0.05 per click. This model of selling keyword advertising was pioneered by Goto.com. However, while many companies failed in the new Internet advertising domain, Google generated increasing profits.

The key concept behind Google is PageRank, a method of assigning relative importance to pages on the Internet on a scale from 0 to 10, where a page with a value of 0 is least important and a page with a value of 10 is most important. PageRank results from a "ballot" among all the other pages on the Internet about how important the page is. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank of all the pages that link to it, which are called incoming links. A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page. U.S. Patent 6,285,999, describing part of Google's ranking mechanism (PageRank), was granted on September 4, 2001. The patent was officially assigned to Stanford University and lists Lawrence Page as the inventor.
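The recursive definition is easier to follow with a small worked example. The Python sketch below uses simple repeated averaging (power iteration) on a made-up four-page link graph. Several things here are assumptions not stated in this post: the damping factor of 0.85 is the commonly cited value, the function name `pagerank` and the sample graph are hypothetical, and the result is a raw score that sums to 1 rather than the 0 to 10 scale mentioned above. It is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict of page -> outgoing links.

    Every link target is assumed to also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "home"; nothing links to "orphan".
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph "home" ends up with the highest score because every other page votes for it, while "orphan" keeps only the baseline share because it has no incoming links.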

Many academic papers concerning PageRank have been published since Larry Page and Sergey Brin's original paper. In practice, the PageRank system has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.

Who manipulates Google and why?

Advanced commands on Google the easy way.

Please give your feedback on the search engine class by using the comment system below.

Did the class meet your expectations?
Will you incorporate what you have learned?
Any other comments are welcome.

3 Comments:

Anonymous said...

Interesting insights of Google history.

2:32 PM  
Anonymous said...

Quite comprehensive search engine history.

2:47 PM  
Anonymous said...

It is very relevant to libraries.

6:14 PM  
