
  Tutorial: Unweaving a Tangled Web With HTMLParser and Lucene

In this article I'll show you the basic technique for building a search engine using two powerful open source products: HTMLParser and Lucene.

Tutorial Details:

Ever wanted to write a Java program that crawls the web? You know, a program that reads HTML pages, retrieves the links, fetches the new pages with more links, and so on. Maybe you have also thought about storing the text from the HTML pages for later use, for example to search the pages for specific information. These are the characteristics of a search engine like Google or Yahoo. If you have a web site of your own, you might be interested in having your own search engine. One possibility is to buy one or use an open source search engine, but you might also find it rewarding to write your own!

The first step is to find out how to "crawl the web". That is: request a page over HTTP, receive the page, extract its text, and harvest its links. Then repeat this process for every link found.
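The crawl loop described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the tutorial's own code: the class name `CrawlSketch`, its methods, and the in-memory `pages` map are hypothetical; a regular expression stands in for a real HTML parser such as HTMLParser; and the `fetch` function stands in for an actual HTTP request so the example runs without a network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the request / extract text / harvest links / repeat loop.
public class CrawlSketch {
    // Crude stand-ins for a real HTML parser: good enough for simple, well-formed markup.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Harvest the href value of every anchor tag in the page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Strip the tags and collapse whitespace, leaving only the visible text.
    public static String extractText(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Breadth-first crawl: fetch a page, keep its text, queue its unseen links.
    // 'fetch' returns the HTML for a URL, or null if the page cannot be retrieved.
    public static Map<String, String> crawl(String startUrl,
                                            Function<String, String> fetch,
                                            int maxPages) {
        Map<String, String> textByUrl = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty() && textByUrl.size() < maxPages) {
            String url = queue.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // broken link: skip it
            }
            textByUrl.put(url, extractText(html));
            for (String link : extractLinks(html)) {
                if (seen.add(link)) { // add() is false for already-seen URLs
                    queue.add(link);
                }
            }
        }
        return textByUrl;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" used in place of real HTTP requests.
        Map<String, String> pages = Map.of(
                "/index.html", "<p>Home</p><a href=\"/a.html\">A</a><a href=\"/b.html\">B</a>",
                "/a.html", "<p>Page A</p><a href=\"/index.html\">back</a>",
                "/b.html", "<p>Page B</p><a href=\"/missing.html\">gone</a>");
        System.out.println(crawl("/index.html", pages::get, 10).keySet());
    }
}
```

The `seen` set keeps the crawler from looping when pages link back to each other, and `maxPages` bounds the crawl. In a real crawler the `fetch` function would issue HTTP requests and resolve relative links against the page URL, and the extracted text would be handed to an indexer such as Lucene.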


 


Related Tutorials:

Process XML with JavaBeans, Part 3 - JavaWorld January 2000
 
The Lucene search engine: Powerful, flexible, and free - JavaWorld September 2000
 
Web services hits the Java scene, Part 1
 
Tracing in a multithreaded, multiplatform environment
In "Use a consistent trace system for easier debugging," Scott Clee showed you how to trace and log from a custom class to provide a consistent tracing approach across your applications. This approa
 
Real World HTML Parser
The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). Whil
 
Develop Your Own Plugins for Eclipse, Part 1
This article series is intended to provide you with the basic information necessary to quickly code your first plugin. The resources section will point to all of the necessary introductory materials.
 
ServerEclipse - Web Eclipse Plug-in
 
yawiki (Yet Another Wiki)
A wiki system is a perfect place for working together and sharing information. The syntax for the wiki system is really simple to learn. Getting started with a wiki system is easy.
 
Lucene in Action
Lucene is a gem in the open-source world: a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, includin
 
JavaServer Pages: Dynamically Generated Web Content
JavaServer Pages™ (JSP™) technology allows Web developers and designers to rapidly develop and easily maintain information-rich, dynamic Web pages that leverage existing business systems.
 
JavaServer Pages Technology - Documentation
Sun's tutorial for JavaServer Pages, which provides a good introduction to designing web pages with JSP.
 
Using Lucene With OJB
Brian McCallister looks at the Lucene search engine and shows us how to index and retrieve objects from a sample Student application.
 
Running Lucene Search on WebLogic Portal 8.1
Lucene comes with two main services available: indexing and searching. The indexing tasks are done independently from the search tasks. Both the index and search services are available so that developers can extend them to meet their needs.
 
Adding search to your applications
The Lucene search engine is an open source Jakarta project used to build and search indexes. Lucene can index any text-based information you like and then find it later based on various search criteria.
 
Introduction to Enterprise Java Bean (EJB). Developing web component.
The J2EE specification defines the structure of a J2EE application. According to the specification J2EE application consists of
 
Brief Introduction to Web Application Development
Gone are the days of serving static HTML pages to the world. Nowadays most websites serve dynamic pages based on the user and their
 


Copyright © 2006. All rights reserved.