Developing Search Engine in Java

In this section we will discuss search engines and then show you how you can develop your own search engine for your website using Java technologies. We will be using Hibernate Search to develop the search engine.

What is a Search Engine?

A search engine is typically a program that retrieves data or documents from databases over the World Wide Web based on specific keywords. A search engine uses a special automated program, called a spider or web crawler, which fetches web pages in a methodical manner and builds a database of those pages. The documents are then added to the search index, and when a user enters a keyword, the search engine looks it up against the index and retrieves the matching results. Google and AltaVista are examples of web search engines.

The most frequently used kind of search is the keyword search. In a full-text search, all the words in the input are taken into account except the most common ones, such as “a”, “an”, “the”, “www” etc.
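
To illustrate, here is a minimal sketch of such stop-word filtering in plain Java. The stop-word list and the class and method names are our own simplified assumptions for illustration; a real engine (or a library such as Lucene) ships much larger stop-word lists and proper analyzers.

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {

    // A tiny illustrative stop-word list; real engines use much larger ones.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "www", "and", "or", "of", "in");

    // Splits the input on non-alphanumeric characters and drops stop words.
    public static List<String> significantTerms(String input) {
        return Arrays.stream(input.toLowerCase().split("[^a-z0-9]+"))
                .filter(term -> !term.isEmpty())
                .filter(term -> !STOP_WORDS.contains(term))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [developing, search, engine, java]
        System.out.println(significantTerms("Developing a Search Engine in the Java"));
    }
}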

A web crawler provides up-to-date data and manages web pages by keeping track of them for further processing, which makes the search faster. An indexer then builds an index of the documents on the basis of the words they contain. This makes searching for a web page easier. Each search engine applies a proprietary algorithm for building its index, so that only meaningful information is retrieved.
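
As a rough sketch of what such an index looks like internally, the following plain-Java inverted index maps each term to the set of documents containing it. The class and its methods are illustrative assumptions, not taken from any particular library.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {

    // term -> set of document ids that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds every term of a document to the index.
    public void addDocument(int docId, Iterable<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term.toLowerCase(), t -> new HashSet<>())
                    .add(docId);
        }
    }

    // Returns the ids of all documents containing the term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Set.of());
    }
}

A production indexer would additionally record term positions and frequencies, so that relevance ranking (discussed below) can be computed.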

The results of a search query differ between search engines because of differences in the efficiency of the algorithms used by their respective spiders. The index must always be kept up to date for good performance.

Basically, there are three types of search engines:

  • Crawler-based search engine- Here the listing of documents is done automatically by a special program called a “spider” or “crawler”. Any changes made to the documents by web authors affect the ranking of those documents. The changes may be in the title, in the body, or in other components of the page.
  • Human-powered search engine- This kind of search engine is based on listings compiled manually by human beings. A good example is the Open Directory, where the descriptions of the documents are written by editors. Changes made to the web pages do not affect their ranking, although a well-crafted web page is more likely to receive a favourable review than a poorly constructed one.
  • Hybrid search engine- This type of search engine provides both crawler-based and human-powered listings, considering one superior to the other at different times. For example, MSN search results are basically human-powered, but it also provides crawler-based results in the case of an obscure query.

A typical crawler-based search engine has three components:

  • A crawler or spider- This is a special program which searches for web pages based on some search criteria, reads them, and also follows the linked web pages. It periodically revisits the pages to look for any changes made to the documents (a minimal sketch follows this list).
  • An index- This is the second component, where the documents found by the crawler are listed. It is updated automatically as the underlying web pages change.
  • A search engine API- This is the software which looks for matches to a specific query among the documents listed in the index and returns them to the user. It ranks the pages in the order it finds most relevant.
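
To make the crawler component concrete, here is a minimal fetch-and-extract-links sketch using the standard java.net.http client available since Java 11. The link-extraction regex is a deliberate oversimplification (a real crawler would use an HTML parser, respect robots.txt, and de-duplicate URLs), and the class name is our own.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {

    // Naive pattern that matches absolute links in the HTML source.
    private static final Pattern LINK =
            Pattern.compile("href=\"(http[^\"]+)\"");

    private final HttpClient client = HttpClient.newHttpClient();

    // Downloads one page and returns its HTML body.
    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Extracts the linked pages so that they can be visited next.
    public List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}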

The ranking of web pages by search engines (relevancy):

Generally, a search engine retrieves the data most relevant to the subject of the user's search and automatically ranks the results in order of relevance. How different search engines retrieve and rank information depends on the algorithms they implement. The details of these algorithms are kept secret from the public as a matter of trade policy, but the general rules can be listed as below:

  • The location/frequency method: The search engine takes into account the location and the frequency of occurrence of a search keyword in a web page for ranking purposes. It looks for keyword matches in the HTML title tags of the documents and considers them more relevant than matches occurring at other locations. Keyword terms appearing near the top of a page, such as in the headings or in the first few paragraphs, are considered more relevant (a toy scoring sketch follows this list).
  • Adding extra effort to the location/frequency method: Some search engines index more web pages for certain keywords than others, and some index more frequently than others. Such differences in the indexing approach also yield differences in search results.
  • Off-the-page factors: To get a boost in a search engine's ranking, some web masters reverse engineer the location/frequency method implemented by the search engine. To counter this problem, search engines use “off the page” factors, generally based on link analysis of the pages: the ranking is determined by evaluating how the pages are linked to each other. Pages with more attractive link profiles are ranked higher than poorly managed ones, and user feedback also affects the ranking of a web page.
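
As a very rough illustration of the location/frequency idea, the following sketch scores a page for one keyword by weighting title matches above body matches. The weights are arbitrary assumptions chosen only for illustration; real ranking algorithms are, as noted above, proprietary and far more involved.

public class LocationFrequencyScorer {

    // Illustrative weights: a title hit counts far more than a body hit.
    private static final double TITLE_WEIGHT = 5.0;
    private static final double BODY_WEIGHT = 1.0;

    // Scores a page for one keyword by counting weighted occurrences.
    public static double score(String keyword, String title, String body) {
        String k = keyword.toLowerCase();
        return TITLE_WEIGHT * countOccurrences(title.toLowerCase(), k)
             + BODY_WEIGHT * countOccurrences(body.toLowerCase(), k);
    }

    // Counts non-overlapping occurrences of the keyword in the text.
    private static int countOccurrences(String text, String keyword) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length();
        }
        return count;
    }
}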

About Our Search Engine

The search engine we are developing is a powerful engine that you can use on your website. There is no crawler included in the search engine; you have to add the pages of your website to the index manually, or you can develop an automatic program that adds the pages of your website to the full-text index.
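
Since we will be building on Hibernate Search, adding a page to the full-text index can be as simple as saving a mapped entity inside a transaction: Hibernate Search then indexes it automatically. The sketch below uses Hibernate Search 5-style annotations and query API; the Page entity and its fields are our own illustrative assumptions, not a fixed part of the engine.

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

// Illustrative entity: one row (and one index document) per website page.
@Entity
@Indexed
public class Page {

    @Id
    private Long id;

    @Field
    private String title;

    @Field
    private String content;

    // getters and setters omitted for brevity
}

Querying the index can then be done through a FullTextSession, for example:

import java.util.List;
import org.apache.lucene.search.Query;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

public class PageSearcher {

    // Runs a keyword query against the title and content fields.
    @SuppressWarnings("unchecked")
    public static List<Page> search(Session session, String keywords) {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        QueryBuilder qb = fullTextSession.getSearchFactory()
                .buildQueryBuilder().forEntity(Page.class).get();
        Query luceneQuery = qb.keyword()
                .onFields("title", "content")
                .matching(keywords)
                .createQuery();
        return fullTextSession
                .createFullTextQuery(luceneQuery, Page.class)
                .list();
    }
}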

In the next section we will see how our search engine works.