Netscape DevEdge

CSpider - A Web Site Processor

Introduction

CSpider is an example application which illustrates some possible techniques in developing interactive event-driven web applications. CSpider will crawl a web site while optionally executing user-defined functions to enable custom processing of the contents of the site.

Note: CSpider is supported by Netscape 7.0x, Mozilla and Internet Explorer on sites in the same domain as the one hosting the CSpider application. Netscape 7.0x and Mozilla can also spider sites from other domains if the cross-domain security checks are relaxed. See Bypassing Security Restrictions and Signing Code for more details. To enable extended privileges in Netscape 7.0x and Mozilla so that CSpider can access other domains, install user.xpi, which will automatically install user.js in your profile if you don't already have a copy.

Script

CSpider.js implements a JavaScript object, CSpider, which can be used to recursively visit (spider) a web site. CSpider uses CCallWrapper and WDocumentLoader.

Constructor
CSpider(String aUrl, Boolean aRestrictUrl, Number aDepth, WDocumentLoader aPageLoader, Number aOnLoadTimeoutInterval)

Constructs an instance of a CSpider object which can be used to spider a site beginning at the URL aUrl to a maximum depth of aDepth. aPageLoader is a reference to a window object containing a WDocumentLoader which is responsible for loading pages and notifying CSpider when each page has completely downloaded. If aRestrictUrl is false, CSpider will also follow links which do not have aUrl as a prefix. If any page does not load within aOnLoadTimeoutInterval seconds, CSpider will enter the 'paused' state and the user-specified function mOnPageTimeout will be called.

The following user-specified functions are called by CSpider to allow the customization of an application built using CSpider.

  • mOnStart
  • mOnBeforePage
  • mOnAfterPage
  • mOnPause
  • mOnRestart
  • mOnStop
  • mOnPageTimeout
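
As a rough illustration of how these hooks might be wired up, the sketch below assigns handlers to a stand-in object. StubSpider and its fire and run helpers are hypothetical and exist only to make the example self-contained; they are not the real CSpider.js. The convention that a handler returns true to continue or false to pause follows the change log at the end of this page.

```javascript
// Hypothetical stand-in for a CSpider instance; not the real CSpider.js.
function StubSpider() {
  this.mState = 'initialized';
}

// Hypothetical helper: call a user-defined handler if one is set.
// A handler returning false moves the stub into the 'paused' state.
StubSpider.prototype.fire = function (name) {
  var handler = this[name];
  if (typeof handler !== 'function') {
    return true; // no handler installed, continue normally
  }
  var keepGoing = handler.call(this) !== false;
  if (!keepGoing) {
    this.mState = 'paused';
  }
  return keepGoing;
};

// Hypothetical helper: enter the 'running' state and notify mOnStart.
StubSpider.prototype.run = function () {
  this.mState = 'running';
  this.fire('mOnStart');
};

var log = [];
var pagesSeen = 0;
var spider = new StubSpider();

spider.mOnStart = function () {
  log.push('started');
  return true; // continue normal operation
};

spider.mOnAfterPage = function () {
  pagesSeen += 1;
  log.push('page ' + pagesSeen);
  return pagesSeen < 2; // ask the spider to pause after two pages
};

spider.run();
spider.fire('mOnAfterPage');
spider.fire('mOnAfterPage');
```

Handlers are invoked with the spider as this, which is why the application listing further down can read this.mCurrentUrl and this.mDocument inside its handlers.
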
Class Methods
CSpider.handlePageLoad(CFormData aFormData)

CSpider.handlePageLoad is used as a callback function from WDocumentLoader for notification of when pages have completed loading.

Properties
String mUrl

mUrl is the initial page where CSpider begins crawling a site.

Boolean mRestrictUrl

When mRestrictUrl is true, CSpider will only follow links which begin with mUrl. Set mRestrictUrl to false to allow CSpider to follow links to other sites.
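
The prefix rule can be sketched as a small predicate. shouldFollow is a hypothetical helper written for this illustration, and the real CSpider.js may perform the check differently (for example, after normalizing URLs).

```javascript
// Hypothetical helper, not taken from CSpider.js: decide whether a link
// should be followed given the starting URL and the restrict flag.
function shouldFollow(href, startUrl, restrictUrl) {
  if (!restrictUrl) {
    return true; // unrestricted: follow links to any site
  }
  // restricted: the link must begin with the starting URL
  return href.indexOf(startUrl) === 0;
}

var start = 'http://devedge.netscape.com/';
var sameSite = shouldFollow('http://devedge.netscape.com/library/', start, true);
var otherSite = shouldFollow('http://www.mozilla.org/', start, true);
var otherSiteUnrestricted = shouldFollow('http://www.mozilla.org/', start, false);
```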

Number mDepth

mDepth is the maximum depth (number of links away from the starting page) to which CSpider will crawl.

WDocumentLoader mPageLoader

mPageLoader is a reference to the instance of WDocumentLoader used to load pages.

Number mOnLoadTimeoutInterval

If a page has not completed loading within mOnLoadTimeoutInterval seconds, the user-specified function mOnPageTimeout is called and the spider then enters the 'paused' state.

Array mPagesVisited

mPagesVisited is an array of all pages visited by CSpider while crawling the site.

Object mPageHash

mPageHash is a hash which is used to prevent visiting the same page more than once.
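
To show how a depth limit, a visited list, and a page hash might work together, here is a minimal sketch that crawls a hypothetical in-memory link graph instead of real pages. The site object and the crawl function are assumptions made for this illustration; only the mUrl and mDepth field names are borrowed from the application listing below.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on it.
var site = {
  '/': ['/a', '/b'],
  '/a': ['/', '/c'],
  '/b': ['/c'],
  '/c': ['/d'],
  '/d': []
};

// Breadth-first crawl limited to maxDepth links from the start page.
function crawl(startUrl, maxDepth) {
  var pageHash = {};      // URLs already queued, so none is visited twice
  var pagesVisited = [];  // URLs in the order they were visited
  var pending = [{ mUrl: startUrl, mDepth: 0 }];
  pageHash[startUrl] = true;

  while (pending.length > 0) {
    var page = pending.shift();
    pagesVisited.push(page.mUrl);
    if (page.mDepth >= maxDepth) {
      continue; // at the depth limit: do not queue this page's links
    }
    var links = site[page.mUrl] || [];
    for (var i = 0; i < links.length; i++) {
      if (!Object.prototype.hasOwnProperty.call(pageHash, links[i])) {
        pageHash[links[i]] = true;
        pending.push({ mUrl: links[i], mDepth: page.mDepth + 1 });
      }
    }
  }
  return pagesVisited;
}

var depthOne = crawl('/', 1);
var depthTwo = crawl('/', 2);
```

In this sketch the link from '/a' back to '/' is skipped because '/' is already in the hash, and '/d' is never reached at depth 2 because it is three links from the start.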

String mState

mState records the current 'state' of the CSpider.

  • 'initialized' - initial state.
  • 'running' - is running.
  • 'paused' - in paused state (can be restarted).
  • 'stopped' - finished run.
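
The states above suggest a small transition table. The table below is a hypothetical reading of those descriptions and of the button handling in the application listing; the real CSpider.js may permit other transitions.

```javascript
// Hypothetical transition table; keys are states, inner keys are actions.
var TRANSITIONS = {
  initialized: { run: 'running' },
  running: { pause: 'paused', stop: 'stopped' },
  paused: { restart: 'running', stop: 'stopped' },
  stopped: {}
};

// Return the state reached by applying an action, or the unchanged
// state if the action is not allowed from the current state.
function nextState(state, action) {
  var allowed = TRANSITIONS[state] || {};
  if (Object.prototype.hasOwnProperty.call(allowed, action)) {
    return allowed[action];
  }
  return state;
}

var state = 'initialized';
var actions = ['run', 'pause', 'restart', 'stop'];
var history = [state];
for (var i = 0; i < actions.length; i++) {
  state = nextState(state, actions[i]);
  history.push(state);
}
```
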
HTMLDocument mDocument

mDocument is a reference to the currently loaded document.

Function mOnStart

mOnStart is a user-defined function which will be called when CSpider's run() method is called.

Function mOnBeforePage

mOnBeforePage is a user-defined function which will be called just before a page is loaded. It can be used to initialize page-dependent data structures.

Function mOnAfterPage

mOnAfterPage is a user-defined function which will be called just after a page is loaded. It can be used to process a page's content.
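
As one possible use of this hook, the sketch below collects the href of every link in the loaded page. collectLinks and gLinksFound are hypothetical names, and fakeSpider is a stand-in object so that the example runs without a browser. In a real application the handler would be assigned to a CSpider instance and would read the DOM's document.links collection through this.mDocument.

```javascript
// Hypothetical accumulator for the links found across all pages.
var gLinksFound = [];

// Hypothetical handler: record the href of each link in the loaded page.
function collectLinks() {
  var links = this.mDocument.links;
  for (var i = 0; i < links.length; i++) {
    gLinksFound.push(links[i].href);
  }
  return true; // continue crawling
}

// Stand-in for a spider holding a loaded page, so the sketch runs anywhere.
var fakeSpider = {
  mDocument: {
    links: [
      { href: 'http://devedge.netscape.com/a' },
      { href: 'http://devedge.netscape.com/b' }
    ]
  },
  mOnAfterPage: collectLinks
};

var keepGoing = fakeSpider.mOnAfterPage();
```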

Function mOnPause

mOnPause is a user-defined function which is called when CSpider enters the 'paused' state, either as a result of the method pause() or after a page load timeout has occurred and the user-specified function mOnPageTimeout has been called.

Function mOnRestart

mOnRestart is a user-defined function which is called when the method restart() is called.

Function mOnStop

mOnStop is a user-defined function which will be called after CSpider has completed crawling the site.

Function mOnPageTimeout

mOnPageTimeout is a user-defined function which will be called if a page is not loaded within the specified interval defined by mOnLoadTimeoutInterval.
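
A page load timeout of this kind is commonly built on setTimeout and clearTimeout. The sketch below is an assumption about how such a watchdog could look, not the mechanism CSpider.js actually uses (which relies on CCallWrapper). startLoadWatchdog and makeFakeTimers are hypothetical helpers, and the fake timers exist only so that the example runs without waiting.

```javascript
// Hypothetical helper: arm a timer that calls onTimeout unless cancelled.
// `timers` is optional and defaults to the global timer functions.
function startLoadWatchdog(seconds, onTimeout, timers) {
  var t = timers || {
    setTimeout: function (fn, ms) { return setTimeout(fn, ms); },
    clearTimeout: function (id) { clearTimeout(id); }
  };
  var fired = false;
  var id = t.setTimeout(function () {
    fired = true;
    onTimeout();
  }, seconds * 1000); // the interval is given in seconds
  return {
    cancel: function () { t.clearTimeout(id); }, // page loaded in time
    hasFired: function () { return fired; }
  };
}

// Fake timers for demonstration: callbacks run only when tick() is called.
function makeFakeTimers() {
  var pending = {};
  var nextId = 1;
  return {
    setTimeout: function (fn, ms) { pending[nextId] = fn; return nextId++; },
    clearTimeout: function (id) { delete pending[id]; },
    tick: function () {
      for (var id in pending) {
        var fn = pending[id];
        delete pending[id];
        fn();
      }
    }
  };
}

var events = [];
var fake = makeFakeTimers();

var slow = startLoadWatchdog(120, function () { events.push('timeout'); }, fake);
fake.tick(); // time runs out before the page loads

var fast = startLoadWatchdog(120, function () { events.push('timeout'); }, fake);
fast.cancel(); // the page loaded in time, so the watchdog is disarmed
fake.tick();
```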

Function mOnCallWrapperOnLoadPage

Internal property used to manage asynchronous calls.

Function mOnCallWrapperOnLoadPageTimeout

Internal property used to manage asynchronous calls.

Function mOnCallWrapperLoadPage

Internal property used to manage asynchronous calls.

Function mOnCallWrapperPause

Internal property used to manage asynchronous calls.

Methods
init()

init() is a convenience method which resets the CSpider to its initial conditions.

run()

run() begins crawling the specified site. It also calls the user-defined mOnStart() function.

pause()

pause() pauses the CSpider and calls the user-defined mOnPause() function.

restart()

restart() restarts a paused CSpider and calls the user-defined mOnRestart() function.

stop()

stop() stops crawling the specified site. It also calls the user-defined mOnStop() function.

addPage(String href)

addPage() is an internal method used to queue pages for visiting.

loadPage()

loadPage() is an internal method used to invoke the page loader. loadPage calls the user-defined function mOnBeforePage.

onLoadPage()

onLoadPage() is an internal method used to handle page load events. onLoadPage calls the user-defined function mOnAfterPage.

CSpider Application

Launch the CSpider application.

<html>
  <head>
    <title>CSpider</title>
    <script type="text/javascript" src="CCallWrapper.js"></script>
    <script type="text/javascript" src="CSpider.js"></script>
    <script type="text/javascript">
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Netscape code.
 *
 * The Initial Developer of the Original Code is
 * Netscape Corporation.
 * Portions created by the Initial Developer are Copyright (C) 2003
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s): Bob Clary <bclary@netscape.com>
 *
 * ***** END LICENSE BLOCK ***** */

      var gOutput;
      var gSpider;
      var gPageLoader;
      var gPageCount = 0;

      function main(form)
      {
        gPageCount = 0;
        gPageLoader = window.frames.pageLoader;
        gOutput = document.getElementById('output');

        var url = form.url.value;
        var depth = parseInt(form.depth.value, 10);
        var restrict = form.restrict.checked;
        var timeout = parseFloat(form.timeout.value);

        gSpider = new CSpider(url, restrict, depth, gPageLoader, timeout);

        // CSpider follows the strategy pattern. You customize its
        // behavior by specifying the following functions which
        // will be called by CSpider on your behalf.

        gSpider.mOnStart = function()
        {
          var form = document.forms.spiderForm;
          form.run.disabled = true;
          form.pause.disabled = false;
          form.restart.disabled = true;
          form.stop.disabled = false;
  
          msg('Starting...');
          return true;
        };

        gSpider.mOnBeforePage = function()
        {
          msg('Starting to load ' +  this.mCurrentUrl.mUrl +  '<br>' + 
          'Depth       : ' + this.mCurrentUrl.mDepth + '<br>' +
          'Remaining   : ' + this.mPagesPending.length);
          return true;
        };

        gSpider.mOnAfterPage = function()
        {
          // If you wish to process the DOM of the loaded page,
          // use this.mDocument in this user-defined function.

          ++gPageCount;

          msg('Page loaded: ' + this.mCurrentUrl.mUrl + '<br>' +
            'Depth       : ' + this.mCurrentUrl.mDepth + '<br>' + 
            'Remaining   : ' + this.mPagesPending.length);
          return true;
        };

        gSpider.mOnStop = function()
        {
          var form = document.forms.spiderForm;
          form.run.disabled = false;
          form.pause.disabled = true;
          form.restart.disabled = true;
          form.stop.disabled = true;
  
          msg('Stopped... loaded ' + gPageCount + ' pages');
          return true;
        };

        gSpider.mOnPause = function()
        {
          var form = document.forms.spiderForm;
          form.run.disabled = true;
          form.pause.disabled = true;
          form.restart.disabled = false;
          form.stop.disabled = false;
  
          msg('Paused... click Restart to continue');
          return true;
        };

        gSpider.mOnRestart = function()
        {
          var form = document.forms.spiderForm;
          form.run.disabled = true;
          form.pause.disabled = false;
          form.restart.disabled = true;
          form.stop.disabled = false;
  
          msg('Restarting...');
          return true;
        };

        gSpider.mOnPageTimeout = function()
        {
          msg('Page Load Timed out...');
          return true;
        };

        gSpider.run();
      }

      function msg(s)
      {
        gOutput.innerHTML = '<pre>' + s + '<\/pre>';
      }

    </script>
  </head>
  <body>

    <h1>CSpider</h1>

    <p>
    Enter the URL of a web site you wish to process and the depth you wish
    to process the site. 
    </p>
    
    <p>
    Note that Internet Explorer can only spider DevEdge due to same-domain 
    security restrictions.  However Netscape 7.0x and Mozilla can process 
    other web sites if you have enabled the appropriate security bypasses. 
    See the <a href="./">Example</a> for more details.
    </p>

    <form name="spiderForm">
      <fieldset>
        <label>
          URL <input name="url" type="text" size="80" 
                     value="http://devedge.netscape.com/">
        </label>

        <br />

        <label>
          Depth <input name="depth" type="text" size="4" value="1">
        </label>

        <label>
          Restrict Urls <input name="restrict" type="checkbox" value="on" checked>
        </label>

        <label>
          Page timeout  <input name="timeout" type="text" size="4" value="120">
        </label>
      </fieldset>

      <fieldset>
        <legend>Controls</legend>
        <button name="run" type="button" onclick="main(this.form)">Run</button>
        <button name="pause" type="button" onclick="gSpider.pause()" disabled>Pause</button>
        <button name="restart" type="button" onclick="gSpider.restart()" disabled>Restart</button>
        <button name="stop" type="button" onclick="gSpider.stop()" disabled>Stop</button>
      </fieldset>
    </form>

    <div id="output"></div>

   <iframe id="pageLoader" name="pageLoader" 
           width="100%" height="80%" border="0" src="WDocumentLoader.html"></iframe>

  </body>
</html>

Change Log

2003-07-08
  • Fixed bug in depth calculations.

  • Added ability to either restrict the spider to URLs containing the original URL as a prefix or allow it to follow any link.

  • Added ability to handle page load timeouts.

  • Added ability to pause and restart the spider, replaced Boolean mRunning with String mState.

  • Added ability for user-specified CSpider "event handlers" to return true to continue normal operation or false to cause the CSpider to enter the 'paused' state.
