Automating WWW
Exploration
Tutorial Details:
Automating Web exploration
Automating Web exploration
By: By Laurence Vanhelsuwé
Here's how to create a Web search engine service
Editor's Note: A few readers have correctly pointed out that WebExplorer, in the simple state presented in this article, ignores some established guidelines for Web robot-type programs. At issue is the possible (temporary) overloading of low-performance Web sites due to very rapid consecutive page requests. The normal solution is to insert delays between page requests and additionally see if the site has a robots.txt file that guides the robot to appropriate areas and designates areas that are off-limits. In view of these points, we strongly advise that you initially use WebExplorer exclusively on your own company's intranets, and only use it on the Internet after adapting the program to observe standard robot practice.
ay you want to set up a Web search engine service that competes with Lycos or Yahoo. How would you go about constructing your database of Web-page URLs? You could knock on the aforementioned services' doors and ask the people in charge if they'll sell you their databases. "Fat chance," I bet you're thinking. You could email all your friends and ask them to mail in their favorite URLs. Not very practical either. So why not mine the Web itself and extract from it the data you need? All that's required is the equivalent of a Lunar Rover -- adapted to cyberspace.
This article will look at some of the issues involved in designing and implementing such a Web crawler-type program.
Before we tackle the design of the program, let's review what Java provides in the field of network programming, since the program will necessarily delve into technical networking matters. Java happens to be wonderfully suited for all but the lowest-level TCP/IP programming tasks: The standard java.net package features a small but functional set of classes encapsulating TCP/IP in a way that greatly reduces potential network programming pitfalls. The java.net classes really let you concentrate on your application, rather than on myriad protocol programming details.
The main java.net classes used for doing any TCP/IP programming are:
Socket
InetAddress
URL
Using these three simple classes, you can write a host (no pun intended) of interesting Internet client software. Class Socket is your main gateway to the Internet: Given some host's address in the form of an InetAddress instance, and a port number to connect to, class Socket will hand you two I/O streams to communicate with your chosen server in full duplex (via one input stream and one output stream). Class URL allows both addressing and accessing of Web resources. This class therefore can be considered a higher-level combination of Socket and InetAddress but with a bias toward Web applications.
If you need to write server programs, need finer control over HTTP (WWW) exchanges, or need to use the User Datagram Protocol (UDP), then the following additional classes will help you in your endeavors:
ServerSocket
URLConnection
DatagramPacket and DatagramSocket (for UDP applications)
Before starting any network projects, bear in mind that Java applications and Java applets get different licenses to use these classes. As you probably know by now, applets are severely crippled when it comes to networking capabilities: For security reasons applets are restricted to communicating only with the server that produced (served) the applet itself. The Web exploration program we're going to write needs full access to the Web, so by necessity it will be a Java application and not an applet.
Conceptually, our Web exploration program is very similar to a textbook (recursive) file system lister (phew, what a mouthful), just like the ubiquitous DIRs or ls-es we all know. But instead of a directory specification, our program takes any Web page as sole argument. This page will act as the root for a hypertext graph to be explored. Every link (embedded in HTML anchor tags) in a page leads to more pages that need to be visited. While the recursive algorithm in programs like DIR or ls is purely sequential in nature, there's an opportunity to design our Web explorer to be more efficient by harnessing the inherent parallelism of the Internet. Like all Web browsers, a Web explorer can download different Web pages at the same time, thus exploiting the full bandwidth possible with any given link.
Downloading Web pages at the same time calls for Java's multithreading capability. Every page visited can be processed by a separate thread, which in turn spawns new threads to deal with every page referenced by the page it is handling. This approach sounds very elegant, but actually it hides a flaw: Any Java virtual machine only has so many threads it will support -- which is unspecified by the language specification itself.
Also, the use of multiple threads quickly becomes counter-productive once the available bandwidth is saturated with concurrent page requests and downloads. (This is why browsers like Netscape Navigator have a "Number of Connections" option that users can tune to their particular set-ups and Internet providers.) The design of our Web explorer therefore calls for an additional mechanism to transparently limit the exponential growth potential of the recursive algorithm. By transparent I mean that this mechanism shouldn't require us to modify the elegance and simplicity of the basic recursive algorithm; it should just keep it on a (very) short lead to prevent the runaway creation of threads.
The pseudo-code for the program so far is simply:
load page
extract all URLs
for all found URLs
spawn new thread of self
Click here to download a zip file of the WebCrawler source code .
At the highest level, the recursion we rely on is very similar to that used by a DIR program. Unlike most hierarchical filing systems, though, the Web is not a pure tree. Therefore we need to take steps to avoid going around in endless loops, revisiting pages over and over again. A simple global database of visited pages will allow the program to decide whether to go down a link or not. This database should be efficient in terms of look-up speed, and flexible enough to deal with the infiniteness of the Web. Java's Hashtable class easily fulfills our requirements -- at least for the kind of prototype we're building here.
Writing the program
Let's work in a top-down fashion and write the program frame for WebExplorer first:
Listing 1. Top-level class WebExplorer .
The program takes a single argument -- the URL of an existing Web page (for example, http://www.telework.demon.co.uk, which is this
author's home page). Notice the structure of the program: Its static main() method does not contain any of the program's logic; instead it instantiates an object of its own class and hands all control over to that object. This approach is preferable to the common C "technique" of cramming everything in main().
If you use Java as an implementation language, you will quickly find that this C idiom doesn't work very well in Java. Java's main() has to be declared static, which means that it can only directly call fellow static methods in the same class. This situation becomes unbearable after budding off just one or two subroutines from your main program. In Java, therefore, a much better approach is to simply do object-oriented work from the start, by instantiating an object and handing control over to that object. As an instance of the class, it can then invoke static and instance methods at its leisure.
The business end of the WebExplorer class is the triggering of the Web crawl process. This is achieved by instantiating a PageVisitor thread object (which immediately starts running) while passing it the seed Web page URL the command line got as argument.
The next class to examine therefore, is the PageVisitor class, which is the algorithmic heart of the program:
Listing 2. class PageVisitor .
Before looking at PageVisitor 's constructor, we need to look at what class initializations occur right after class PageVisitor is first loaded into the JVM. Class PageVisitor contains two class (static) objects that are central to the functioning of the program: the global database of encountered pages and an object of type CrowdController that is used to limit the number of threads being spawned by our recursive algorithm. The fact that both objects are declared static means that the many thread instances will all access the same database and CrowdController object. The database is implemented as a java.util.Hashtable and is initially empty (the choice of class Hashtable for the database, instead of Vector, for example, is because Hashtables can find stored items fast). The CrowdController object is initialized to manage a "crowd" of maximum MAX_THREADS participants (that is, PageVisitor threads). Class CrowdController will be discussed later.
The constructor for class PageVisitor converts the argument page URL (in String form) to an instance of a URL object. Being a thread, it labels itself with the page's URL; this gives the thread an identity that it can use later or use for debugging. The constructor then starts the thread running, which means the logic continues in this class' run() method. The body of the thread, as implemented by run() , is also the heart of this program. Therefore, the pseudo-code given earlier should be your guide. You can ignore the start and end statements dealing with limiting the number of running threads, for the moment.
The main methods supporting the recursive page-visiting algorithm are loadPage() and extractHyperTextLinks() . Method loadPage() relies heavily on an underlying class -- class HTTP -- to hide the nitty-gritty details of talking to a Web server and requesting (and getting) a Web page. Once a page is loaded and stored in a single string for convenience, the hypertext links can be extracted from it and collected in a vector. (I use a
Read
Tutorial at: Click here to view the tutorial
Rate Tutorial: Automating WWW
Exploration
View Tutorial: Automating WWW
Exploration
Related
Tutorials:
Automating WWW
Exploration
Automating WWW
Exploration |
JavaWorld - Distributed
applet-based
massively parallel processing
(DAMPP) - January
1997
JavaWorld - Distributed
applet-based
massively parallel processing
(DAMPP) - January
1997 |
Revolutionary RMI: Dynamic
class loading
and behavior
objects - JavaWorld - December 1998
Revolutionary RMI: Dynamic
class loading
and behavior
objects - JavaWorld - December 1998 |
Java Tip 66: Control
browsers from your Java application - JavaWorld
Java Tip 66: Control
browsers from your Java application - JavaWorld |
Make a sweep with
clean beans -
JavaWorld
November 1999
Make a sweep with
clean beans -
JavaWorld
November 1999 |
Take control of the servlet environment, Part 3 - JavaWorld January 2001
Take control of the servlet environment, Part 3 - JavaWorld January 2001 |
Device programming with MIDP, Part
2 - JavaWorld
March 2001
Device programming with MIDP, Part
2 - JavaWorld
March 2001 |
Business process
automation
made easy with
Java, Part 1
Business process
automation
made easy with
Java, Part 1 |
JHttpTunnel
JHttpTunnel is the implementation of GNU httptunnel\'s protocol in pure Java. |
Build scripts with Groovy and Ant
Build scripts with Groovy and Ant
Summary
In nearly all developers' toolboxes, Ant is the standard build tool for Java applications, thanks to its open, standard, and multiplatform structure. Though it represents a great improvement in automating produc |
Automate GUI tests for Swing applications
Summary
Automation is necessary for frequent and consistent testing, which is the foundation of agile development. However, acceptance tests of GUI applications are not always easy to automate. This article explains a simple way of automating Java Swing |
Chat Transcript: Project Looking Glass
Where did the original Looking Glass idea come from? Read the answer to this and other interesting questions about Project Looking Glass, a project that explores the next generation (3-dimensional) desktop. |
N1 Grid Service Provisioning System 5.0 Extends Features of Solaris Containers
A new version of the N1 Grid Provisioning System, scheduled for release in December, supports the Solaris Containers technology of the Solaris 10 Operating System. |
Automating the Installation of an FC-Fabric SAN Booted System
Read about how an enterprise deployed and extended the Sun Data Center Reference Architecture to become its Service Delivery Platform. |
We are providing Knoppix 3.6 Live Linux CD's
We are providing Knoppix 3.6 Live Linux CD's
Knoppix Linux CD's
Now Available Linux Knoppix 3.6 CD's
What is KNOPPIX?
KNOPPIX is a bootable Linux CD with a collection of various GNU/Linux software. It auto-detects hardware and supports many |
Buy Peanut 9.6 Linux in India from us. Peanut 9.6 distribution is available in India.
Buy Peanut 9.6 Linux in India from us. Peanut 9.6 distribution is available in India.
Peanut 9.6 Linux
Now Available Peanut 9.6 Linux CD
It is a Linux OS (operating system), especially made for those new to Linux. This is the most POWERFUL and |
What is WAP? Wireless Application Protocol
What is WAP? Wireless Application Protocol
Tutorial
What is WAP?
W ireless Application Protocol or WAP for short is simply a protocol - a standerized way for delivering Internet data over wireless networks. Thus WAP links Wireless Network with |
Manual Submission to Search Engines. Hand Submit Website URL Submission to Major Search Engines
Manual Submission to Search Engines. Hand Submit Website URL Submission to Major Search Engines
ATTENTION: Website owners!
Hand Submissions to major search engines for as low as $10.00 per month
Let us introduce you to the services provided by |
Configuring JumpStart Servers to Provision on Sun x86-64 Systems (pdf)
Solaris JumpStart technology provides a mechanism for fully automating the installation of the Solaris Operating System. This Sun BluePrints article describes how to modify existing JumpStart servers to support the deployment of the Solaris OS and Linux |
DB Visual Architect for Eclipse
DB Visual Architect for Eclipse (DBVA-EC) is a full featured Object Relational Mapping (ORM) plugin for Eclipse that provides the industry\'s best round-trip code engineering support with Java. |
|
|
|