Library httpspider
A smallish httpspider library providing basic spidering capabilities. It consists of the following classes:
Options
LinkExtractor
URL
UrlQueue
Crawler
The following sample code shows how the spider could be used:
local crawler = httpspider.Crawler:new( host, port, '/', { scriptname = SCRIPT_NAME } )
crawler:set_timeout(10000)

local result
while ( true ) do
  local status, r = crawler:crawl()
  if ( not(status) ) then
    break
  end
  if ( r.response.body:match(str_match) ) then
    crawler:stop()
    result = r.url
    break
  end
end

return result
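The tuning arguments documented under Script Arguments below (maxdepth, maxpagecount, and so on) can also be supplied programmatically through the options table passed to Crawler:new. The following is a minimal sketch assuming the option keys mirror the argument names; the numeric values are purely illustrative:

-- Sketch: tune crawl limits at construction time
local crawler = httpspider.Crawler:new( host, port, '/', {
  scriptname   = SCRIPT_NAME,
  maxdepth     = 2,  -- spider at most two directory levels below '/'
  maxpagecount = 40, -- stop after visiting 40 pages
} )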
For advanced use, the library currently supports a number of closures (withinhost, withindomain, doscraping). Note that the withinhost and withindomain options also accept boolean values; you will only want to override them with closures for advanced use. Custom closures can be defined using the following utilities:
iswithinhost
iswithindomain
isresource
The following sample code shows an example usage. It overrides the default withinhost method so that the spider visits only resources within the host that are not "js" or "css" files.
crawler.options.withinhost = function(url)
  if crawler:iswithinhost(url)
    and not crawler:isresource(url, "js")
    and not crawler:isresource(url, "css") then
    return true
  end
end
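Since withinhost and withindomain also accept booleans, the restriction can be toggled without writing a closure. A minimal sketch, assuming the boolean form is accepted at construction time, that allows the crawler to leave the scanned host:

-- Sketch: pass the boolean form of withinhost instead of a closure
local crawler = httpspider.Crawler:new( host, port, '/', {
  scriptname = SCRIPT_NAME,
  withinhost = false, -- follow links outside the scanned host as well
} )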
Author:
Source: https://svn.nmap.org/nmap/nselib/httpspider.lua
Script Arguments
- httpspider.doscraping
Closure that overrides the default doscraping function used to check whether a resource should be scraped (in terms of extracting any links within it). See the closure section above on overriding the default behaviour, and the sketch after this list.
- httpspider.url
The URL to start spidering from. This is a URL relative to the scanned host, e.g. /default.html (default: /)
- httpspider.maxpagecount
The maximum number of pages to visit. A negative value disables the limit (default: 20)
- httpspider.useheadfornonwebfiles
If set, the crawler uses HEAD instead of GET for files whose extensions do not indicate that they are webpages (the list of webpage extensions is located in nselib/data/http-web-files-extensions.lst)
- httpspider.noblacklist
If set, the default blacklist is not loaded
- httpspider.maxdepth
The maximum number of directory levels beneath the initial URL to spider. A negative value disables the limit. (default: 3)
- httpspider.withinhost
Closure that overrides the default withinhost function that only spiders URLs within the same host. If this is set to false, the crawler will spider URLs both inside and outside the host. See the closure section above to override the default behaviour. (default: true)
- httpspider.withindomain
Closure that overrides the default withindomain function that only spiders URLs within the same domain. This widens the scope from withinhost and cannot be used in combination with it. See the closure section above to override the default behaviour. (default: false)
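As referenced in the doscraping entry above, the doscraping closure is overridden the same way as withinhost. A minimal sketch, mirroring the utilities shown earlier (the "pdf" filter is illustrative, not a library default):

-- Sketch: extract links only from pages on the scanned host,
-- skipping resources with a "pdf" extension
crawler.options.doscraping = function(url)
  if crawler:iswithinhost(url) and not crawler:isresource(url, "pdf") then
    return true
  end
end

From the command line, the same arguments are passed with --script-args, for example --script-args httpspider.maxpagecount=50.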