Search Engines: A search engine is a program or information retrieval system designed to retrieve, from its own databases, a list of references or information that meets specific criteria. The databases are stored on a computer, which may be a public server on the World Wide Web, a computer inside a corporate or proprietary network, or a personal computer.
The earliest Internet search engine was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal, to search anonymous FTP sites. It is regarded as the grandfather of all search engines. In 1993, the University of Nevada System Computing Services group developed Veronica, a search tool similar to Archie but for Gopher files. It is regarded as the grandmother of search engines.
In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called “Wandex”. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine, Archie-Like Indexing for the Web (Aliweb), appeared in November 1993 through the efforts of Martijn Koster. Aliweb did not use a web robot; instead, it depended on website administrators notifying it that an index file in a particular format existed at their site. Aliweb is no longer maintained.
In December 2003, Orase (http://forums.searchenginewatch.com/showthread.php?t=1716) published the first version of its new real-time search technology, which added many new functions and considerably improved performance.
a) Components of a Search Engine: A general search engine typically consists of three components.
i) Crawler / Spider / Robot: Web crawling is the process of locating, fetching, and storing Web pages. The Web crawler (also called a spider or robot) is a computer program that starts from a set of seed pages and locates new pages by parsing the downloaded pages and extracting the hyperlinks within them. Extracted hyperlinks are stored in a FIFO fetch queue for later retrieval. Crawling continues until the fetch queue is empty or a satisfactory number of pages has been downloaded. Each time a spider visits a web page, it scans all the text and follows every link it sees.
Some search engines, such as Google, store the full scanned pages, while others, such as AltaVista, store only the words of the scanned pages in an ever-growing database. The stored pages are known as cached pages. The contents of each page are then analyzed, and the URL and a list of words are catalogued in an index database for use in later queries.
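The crawling loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FAKE_WEB` dictionary, the `example.org` URLs, and the `crawl` and `LinkExtractor` names are hypothetical, and pages are read from memory rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML. A real crawler would
# fetch pages over HTTP instead of reading from this dictionary.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://example.org/c.html": "no links here",
}


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns {url: html}."""
    queue = deque(seeds)   # FIFO fetch queue
    seen = set(seeds)      # URLs already queued, so nothing is fetched twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:   # page could not be fetched
            continue
        pages[url] = html  # store the page (the "cached" copy)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl(["http://example.org/"], FAKE_WEB.get)
print(list(pages))
```

The loop stops under the two conditions named in the text: the queue runs empty, or `max_pages` pages have been stored.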
ii) Indexer: The downloaded content is concurrently parsed by an indexer and transformed into an inverted index, which represents the downloaded collection in a compact and efficiently queryable form. The indexes are updated regularly so that the engine keeps operating quickly and efficiently. The database of a search engine is most often created automatically by spiders or robots.
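As a rough illustration of what an inverted index looks like, the sketch below maps each word to the pages that contain it, along with the word positions. The `DOCS` sample data and the `tokenize` and `build_inverted_index` names are assumptions made for this example; production indexers also compress postings and handle stemming, stop words, and incremental updates.

```python
import re
from collections import defaultdict

# Hypothetical downloaded pages: URL -> page text.
DOCS = {
    "http://example.org/a": "Search engines crawl the web",
    "http://example.org/b": "An index makes search fast",
    "http://example.org/c": "Spiders crawl and follow links",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to {url: [positions of the word in that page]}."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index


index = build_inverted_index(DOCS)
print(sorted(index["crawl"]))  # pages containing "crawl"
print(index["search"])         # postings with word positions
```

Looking up a word is then a single dictionary access, rather than a scan over every stored page.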
iii) Query Processor: The query processor is responsible for evaluating user queries and returning the pages relevant to each query. The search engine lets the user enter, in a search “box”, a request for content meeting specific criteria (typically pages containing a given word or phrase). When a user makes a query, typically by giving keywords, the engine looks up the index and provides a listing of the best-matching web pages according to its criteria, usually with a short summary containing the document’s title and sometimes part of the text. The list is often sorted by some measure of relevance. Because these databases are very large, search engines often return thousands of results.
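A minimal sketch of query processing, assuming a tiny in-memory collection, is given below. The relevance measure used here (the total number of query-term occurrences) is a deliberately simple stand-in and not the method of any particular engine; `PAGES`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pages: URL -> (title, body text).
PAGES = {
    "http://example.org/a": ("Web crawling", "A crawler fetches pages and follows links."),
    "http://example.org/b": ("Indexing", "An indexer builds an inverted index of web pages."),
    "http://example.org/c": ("Cooking", "A recipe for lentil soup."),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to a Counter of {url: number of occurrences}."""
    index = defaultdict(Counter)
    for url, (title, body) in pages.items():
        for word in tokenize(title + " " + body):
            index[word][url] += 1
    return index


def search(query, index, pages, snippet_length=40):
    """Return (url, title, snippet) tuples, best match first."""
    scores = Counter()
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Sort by descending score, then by URL to keep ties in a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(url, pages[url][0], pages[url][1][:snippet_length])
            for url, _ in ranked]


index = build_index(PAGES)
for url, title, snippet in search("web index", index, PAGES):
    print(url, "|", title, "|", snippet)
```

Each result carries the title and the start of the text, mirroring the short summary that a results page typically shows.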
b) Ranking of Sites in Search Engines: How the best matches are chosen, and in what order the results are shown, varies widely from one search engine to another. The methods also change over time as Internet usage changes and new techniques evolve, though Google is widely accepted as the most useful in this regard. Researchers at the NEC Research Institute, however, claim to have improved upon Google’s patented PageRank technology by using a web crawler to find “communities” of websites. Instead of ranking pages, this technology uses an algorithm that follows the links on a web page to find other pages that link back to the first one, and so on from page to page. The algorithm “remembers” where it has been, indexes the number of cross links, and relates these into groupings. In this way, virtual communities of web pages are found.
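The idea of grouping pages by cross links can be illustrated with a much simplified sketch. It is not the NEC algorithm itself: it merely groups pages that are connected through reciprocal (two-way) links, using a hypothetical link graph `LINKS` and a hypothetical function `find_communities`.

```python
# Hypothetical link graph: page -> set of pages it links to.
LINKS = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"e"},
    "e": {"d"},
    "f": {"a"},
}


def find_communities(links):
    """Group pages that are connected through reciprocal links."""
    # Keep only cross links: the page links to the target and back again.
    mutual = {page: set() for page in links}
    for page, targets in links.items():
        for target in targets:
            if page in links.get(target, set()):
                mutual[page].add(target)

    visited = set()  # the traversal "remembers" where it has been
    communities = []
    for start in sorted(mutual):
        if start in visited or not mutual[start]:
            continue
        group, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in visited:
                continue
            visited.add(page)
            group.add(page)
            stack.extend(mutual[page] - visited)
        communities.append(sorted(group))
    return communities


print(find_communities(LINKS))
```

In this toy graph, page `f` links to `a` but receives no link back, so it does not join any group.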
c) Types of Search Engine: Configurable Unified Search Index (CUSI) search engines, such as All-in-One Search Page and W3 Search Engines, are pages that list search engines.
Search engines can be categorized based on coverage as-
i) Web Search Engine: It searches for information on the public Web.
ii) Enterprise Search Engine: It searches intranets.
iii) Personal Search Engine: It searches individual personal computers.
iv) Custom Search Engine: It searches within the contents defined by the user(s).
v) Meta Search Engine: The contents of search engine indexes and databases vary, so the same query typed into several search engines is likely to produce different results. Because of this, a user searching for a topic often wants to see results from various sources. One way to compare the results of several search engines is to type and retype a query into individual search engines one at a time, but this can be very time consuming. A meta search engine makes this task more efficient by providing a central location where the query is typed in once and the results are obtained from multiple search engines. MetaCrawler, Search.com (http://www.search.com), etc. are examples of meta search engines.
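A meta search can be sketched as sending one query to several engines and merging the ranked lists. In the sketch below, `engine_one` and `engine_two` are hypothetical stand-ins that return fixed lists, and the merging rule (summing 1/rank across engines) is just one simple choice assumed for illustration, not the method of any named service.

```python
from collections import defaultdict


# Hypothetical engines: each takes a query and returns a ranked URL list.
def engine_one(query):
    return ["http://example.org/a", "http://example.org/b", "http://example.org/c"]


def engine_two(query):
    return ["http://example.org/b", "http://example.org/d"]


def meta_search(query, engines):
    """Send the query to every engine once and merge the ranked results."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank  # higher positions contribute more
    # Best combined score first; ties are broken by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


print(meta_search("search engines", [engine_one, engine_two]))
```

A page returned by both engines accumulates score from each list, so it tends to rise above pages found by only one engine.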
Based on the content considered for search, a search engine can be-
i) Web Search Engine: It searches all types of content on the web. For example, Google (http://www.google.com).
ii) Discussion Group Search Engine: It searches only discussion groups. For example, Google Groups (http://groups.google.co.in), Yahoo Groups (http://in.groups.yahoo.com/).
iii) Blog Search Engine: It searches only blogs. For example, Google Blogs (http://blogsearch.google.co.in), etc.
iv) Image Search Engine: For example, Google Images (http://images.google.co.in), etc.
v) Maps Search Engine: For example, Google Maps (http://maps.google.co.in), etc.
vi) Video Search Engine: For example, blinkx (http://www.blinkx.com/), fooooo (http://en.fooooo.com/), Truveo (http://in.truveo.com/), Google Videos (http://video.google.com/), etc.
vii) Hypermail Search Engine: It searches for mailing lists.
viii) Hypernews Search Engine: It searches for USENET newsgroups.
ix) News Search Engine: For example, Google News (http://news.google.co.in).
x) Books Search Engine: For example, Google Books (http://books.google.co.in).
xi) Subject Directory Search Engine: It searches Web directories that are maintained by human editors. Such directories include a keyword search option, which usually eliminates the need to work through numerous levels of topics and subtopics. For example, DMOZ.org, Yahoo! (http://www.yahoo.com/), LookSmart (http://www.looksmart.com/), etc.
Some other types of search engines are-
i) Crawler-based Search Engine: WebCrawler, launched in April 1994, was the first “robot” keyword search engine. Its robot program indexes the entire content of the pages retrieved, but not the URLs embedded in those pages. WebCrawler was acquired by America Online in June 1995. The World Wide Web Worm was also a robot-based search engine; it indexes only HTML document titles, the text explaining page links, and URLs.
ii) Human-Powered Search Engines: Human-powered search engines search pages or websites that have been collected for indexing by humans. Examples of this type of search engine include Anoox <http://www.anoox.com/>, ChaCha <http://www.chacha.com/>, Collarity <http://www.collarity.com/>, Earthfrisk <http://earthfrisk.org/>, iRazoo <http://www.irazoo.com/>, Mahalo <http://www.mahalo.com/>, Sproose <http://www.sproose.com/>, Wikia Search <http://alpha.search.wikia.com/>, etc.
iii) Mobile Search Engines: For example, Google mobile (http://www.google.com/mobile).
iv) Simultaneous Unified Search Engine (SUSI): WebCompass acts as a personal SUSI search engine: the user defines a set of search engines in a local database, defines a concept map of terms with associated search words, and then configures WebCompass to run keyword searches. A personal edition of WebCompass and other shareware packages with similar capabilities are freely available. Other SUSI-based services, such as SavvySearch or MetaCrawler, search a range of search engines at a time. The drawback of SUSI services is that their response time is slower.
v) Personalized Web Search: Google developed a personalized web search whereby the user can set up a profile and retrieve the results based on their interests.
vi) Grid Search Engine: A grid search engine can be defined as “a type of a parallel and distributed system that enables sharing, selection, and aggregation of geographically distributed autonomous resources dynamically at runtime depending on their availability, capability, performance, cost, and users’ quality-of-service requirements”. In a grid search engine, an individual crawl is started for each user query over fresh copies of the Web documents, i.e., the originals rather than the cached copies, and the relevant pages are selected. In this way, up-to-date versions of the pages are evaluated and the accuracy of the resulting answer set is enforced. Grid search engines are sometimes known as real-time search engines. For example, in December 2003, Orase published the first version of its new real-time search technology.
vii) Natural Language Queries (Index Crawling): For example, AltaVista, Ask Jeeves.
viii) Freeware Search Software: Freeware search software, such as freeWAIS, Glimpse and SWISH (Simple Web Indexing System for Humans), is used via a WWW server’s CGI.
In the near future, subject-specific search engines will no doubt emerge to overcome the problems of general search engines.
d) Importance of Search Engine: Search engines are among the most popular destinations on the Internet. Moreover, the cached pages maintained by some search engines are very useful when the content of a web page has been updated and no longer contains the search terms, when the web page is no longer available, or when the site’s server is down. In such cases, for example when a particular website has been withdrawn, one can search the cached pages for data that may no longer be available elsewhere.
Without a search engine, trying to find what you need can be like looking for a needle in a haystack. To use search engines effectively, it is essential to apply techniques that narrow the results and push the most relevant pages to the top of the results list.
e) Examples of Search Engine: Nowadays, there are thousands of search engines for searching the Internet. Each search engine appears on the web and continues for some time; then a new one emerges and the old one falls into decay and disuse. Some popular search engines that marked milestones in the origin and development of search engines are discussed below.
i) Lycos: Lycos (http://www.lycos.com/) was started as a research project at Carnegie Mellon University in 1994 and was one of the first search engines. It ceased crawling the web for its own listings in April 1999 and instead uses crawler-based results provided by FAST, i.e., AlltheWeb.com. It is now owned by Terra Lycos, a company formed when Lycos and Terra Networks merged in October 2000.
ii) AltaVista: AltaVista (http://www.altavista.com/) originated in 1995. It was the first search engine to use natural language queries (index crawling), meaning that a user could type in a sentence like “Who is the Prime Minister of India” and not get a million pages containing the word “Who”. AltaVista also offers a number of powerful search features not found elsewhere. One very effective tool available on the Advanced Search page is the NEAR search, which limits the results to pages where the keywords appear within 10 words of each other. This can be extremely helpful in situations where an AND search produces too many results and a phrase search (" ") produces too few. AltaVista, which was owned by Digital Equipment Corporation, also provides news and multimedia.
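The effect of a NEAR search can be sketched as a proximity check on word positions. The function below is an assumption-laden illustration, not AltaVista's implementation: the `near` name, its `max_distance` parameter, and the sample texts are hypothetical, and distance is measured simply as the difference between word positions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(text, first, second, max_distance=10):
    """True if the two words occur within max_distance words of each other."""
    words = tokenize(text)
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions
               for j in second_positions)


close_text = "The Prime Minister of India addressed the nation today."
far_text = "India " + "filler " * 12 + "minister"

print(near(close_text, "minister", "India"))  # words are 2 positions apart
print(near(far_text, "minister", "India"))    # words are 13 positions apart
```

A plain AND search would match both texts, and a phrase search for "minister India" would match neither; the proximity test sits between the two.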
iii) Ask Jeeves: Ask Jeeves (http://www.ask.com/) initially gained fame in 1998 and 1999 as the natural language search engine that let one search by asking questions and responded with what seemed to be the right answer to everything. In other words, it delivers search results based on one’s question.
iv) Google: Google (http://www.google.com) was originally a Stanford University project called BackRub, by students Larry Page and Sergey Brin. In 1998 the name was changed to Google, and the project moved off campus and became a private company. Around 2001 the Google search engine rose to prominence. Its success was based in part on the concepts of link popularity and PageRank, which are very adept at returning relevant results. PageRank is based on citation analysis, which was developed in the 1950s by Dr. Eugene Garfield at the University of Pennsylvania. PageRank takes into consideration how many other websites and web pages link to a page; both the number of linking pages and the number of links on these pages contribute to the PageRank of the linked page. This makes it possible for Google to order its results by how many websites link to each page found. Finally, unlike other search engines, Google offers a cached copy of each result.
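The link-based ranking idea can be illustrated with a small iterative calculation in the style of the published PageRank formulation. This is a sketch only and not Google's production algorithm: the link graph `LINKS`, the function name `pagerank`, the iteration count, and the damping factor of 0.85 (a commonly cited value) are assumptions made for this example.

```python
# Hypothetical link graph: page -> list of pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for every page in the link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to,
                # so a link from a page with many links counts for less.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page `c` receives links from three pages and ends up with the highest rank, while `d`, which nothing links to, keeps only the baseline share.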
v) Yahoo: Yahoo (http://www.yahoo.com), the huge subject tree, was started by two Stanford graduate students, David Filo and Jerry Yang, who created a list of their favorite sites. The list grew bigger and bigger and in time became Yahoo. In 2002, Yahoo acquired Inktomi, and in 2003 Overture, which owned AlltheWeb and AltaVista. In 2004, Yahoo launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the web search engine over the directory.