Library httpspider

A smallish httpspider library providing basic spidering capabilities. It consists of the following classes:

  • Options
** This class is responsible for handling library options.

  • LinkExtractor
** This class contains code responsible for extracting URLs from web pages.

  • URL
** This class contains code to parse and process URLs.

  • UrlQueue
** This class contains a queue of the next links to process.

  • Crawler
** This class is responsible for the actual crawling.

The following sample code shows how the spider could be used:

  local crawler = httpspider.Crawler:new( host, port, '/', { scriptname = SCRIPT_NAME } )
  crawler:set_timeout(10000)

  local result
  while(true) do
    local status, r = crawler:crawl()
    if ( not(status) ) then
      break
    end
    if ( r.response.body:match(str_match) ) then
       crawler:stop()
       result = r.url
       break
    end
  end

  return result
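
In the loop above, a false status simply ends the crawl. Scripts commonly also distinguish a hard error from the normal end of the queue. The following is a minimal sketch of that pattern; it assumes that on failure the second return value is a table carrying err and reason fields, which is not stated in this documentation:

  while(true) do
    local status, r = crawler:crawl()
    if ( not(status) ) then
      -- assumed fields: r.err flags a hard error, r.reason describes it;
      -- when there is no error the queue is simply exhausted and we stop
      if ( r.err ) then
        return ("ERROR: %s"):format(r.reason)
      end
      break
    end
    -- inspect r.url and r.response here, as in the example above
  end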

For advanced use, the library currently supports a number of closures (withinhost, withindomain, doscraping). Please note that the withinhost and withindomain options also support boolean values. You will only want to override them for advanced use. You can define them using the following utilities:

  • iswithinhost
** You can use this utility to check if the resource exists within the host.

  • iswithindomain
** You can use this utility to check if the resource exists within the domain.

  • isresource
** You can use this utility to check the type of the resource (for example "js"). A third option may hold a number of signs that may appear after the extension of the resource. By default, these are [#, ?]. For example, if we want to match only php resources, the function will also match example.php?query=foo or example.php#foo.

The following sample code shows an example usage. We override the default withinhost method to allow spidering only of resources within the host that are not "js" or "css".

  crawler.options.withinhost = function(url)
      if crawler:iswithinhost(url)
      and not crawler:isresource(url, "js")
      and not crawler:isresource(url, "css") then
          return true
      end
  end
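
The doscraping closure can be overridden in the same way. The following minimal sketch assumes that doscraping receives the URL just as withinhost does; it keeps crawling the whole host but only extracts new links from resources that are not plain text files:

  crawler.options.doscraping = function(url)
      -- assumed signature: called with the url that is about to be scraped for links
      if crawler:iswithinhost(url)
      and not crawler:isresource(url, "txt") then
          return true
      end
  end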

Author:

  • Patrik Karlsson <patrik@cqure.net>

Source: https://svn.nmap.org/nmap/nselib/httpspider.lua

Script Arguments
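
These arguments are passed to whichever script uses the library through Nmap's --script-args option, typically as comma-separated key=value pairs, for example --script-args 'httpspider.maxpagecount=50,httpspider.url=/docs/' (the values shown here are only illustrative).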

httpspider.doscraping

Closure that overrides the default doscraping function used to check if the resource should be scraped (in terms of extracting any links within it). See the closure section above to override the default behaviour.

httpspider.url

the URL to start spidering. This is a URL relative to the scanned host, e.g. /default.html (default: /)

httpspider.maxpagecount

the maximum number of pages to visit. A negative value disables the limit (default: 20)

httpspider.useheadfornonwebfiles

if set, the crawler uses HEAD instead of GET for files that do not have extensions indicating that they are webpages (the list of webpage extensions is located in nselib/data/http-web-files-extensions.lst)

httpspider.noblacklist

if set, doesn't load the default blacklist

httpspider.maxdepth

the maximum number of directories beneath the initial URL to spider. A negative value disables the limit (default: 3)

httpspider.withinhost

Closure that overrides the default withinhost function that only spiders URLs within the same host. If this is set to false, the crawler will spider URLs both inside and outside the host. See the closure section above to override the default behaviour. (default: true)

httpspider.withindomain

Closure that overrides the default withindomain function that only spiders URLs within the same domain. This widens the scope from withinhost and cannot be used in combination with it. See the closure section above to override the default behaviour. (default: false)
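
As noted above, withinhost and withindomain also accept plain boolean values. The following minimal sketch turns off the host restriction when the crawler is created; it assumes that the options table passed to Crawler:new accepts the same names as the script arguments, without the httpspider prefix:

  local crawler = httpspider.Crawler:new(host, port, '/', {
      scriptname   = SCRIPT_NAME,
      maxdepth     = 2,     -- spider at most two directory levels deep
      maxpagecount = 50,    -- visit at most 50 pages
      withinhost   = false, -- a boolean instead of a closure: do not restrict spidering to the host
    })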