Library httpspider
A smallish httpspider library providing basic spidering capabilities. It consists of the following classes:
Options
LinkExtractor
URL
UrlQueue
Crawler
The following sample code shows how the spider could be used:
-- create a new crawler that starts spidering at the web root
local crawler = httpspider.Crawler:new( host, port, '/', { scriptname = SCRIPT_NAME } )
crawler:set_timeout(10000)

local result
while(true) do
  local status, r = crawler:crawl()
  -- status is false once there are no more pages to visit,
  -- or when an error occurs
  if ( not(status) ) then
    break
  end
  -- stop the crawler as soon as a page body matches the pattern
  if ( r.response.body:match(str_match) ) then
    crawler:stop()
    result = r.url
    break
  end
end
return result
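The crawler's behaviour can also be tuned when it is constructed. The following is a minimal sketch, assuming the options table passed to Crawler:new accepts the same keys as the httpspider.* script arguments documented below:

-- a minimal sketch: tune the crawler at construction time
-- (assumes the options table accepts the same keys as the
-- httpspider.* script arguments, e.g. maxdepth and maxpagecount)
local crawler = httpspider.Crawler:new( host, port, '/', {
  scriptname   = SCRIPT_NAME,
  maxdepth     = 3,   -- spider at most 3 directory levels deep
  maxpagecount = 40,  -- visit at most 40 pages
} )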
For advanced use, the library currently supports a number of closures (withinhost, withindomain, doscraping). Note that the withinhost and withindomain options also accept boolean values; you will only want to override them with closures for advanced use. Custom closures can be defined using the following utilities:
iswithinhost
iswithindomain
isresource
The following sample code shows an example usage. It overrides the default withinhost function so that spidering is allowed only for resources within the host that are not "js" or "css" files.
crawler.options.withinhost = function(url)
  if crawler:iswithinhost(url)
  and not crawler:isresource(url, "js")
  and not crawler:isresource(url, "css") then
    return true
  end
end
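As noted above, withinhost and withindomain also accept boolean values. The following is a minimal sketch, assuming a boolean can be supplied through the options table, that lifts the same-host restriction entirely:

-- a minimal sketch, assuming withinhost may be passed as a boolean
-- in the options table: also follow links that leave the host
local crawler = httpspider.Crawler:new( host, port, '/', {
  scriptname = SCRIPT_NAME,
  withinhost = false,
} )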
Author: Patrik Karlsson
Source: https://svn.nmap.org/nmap/nselib/httpspider.lua
Script Arguments
- httpspider.doscraping
Closure that overrides the default doscraping function used to check whether the resource should be scraped (in terms of extracting any links within it). See the closure section above to override the default behaviour; a brief sketch follows this list.
- httpspider.url
the URL at which to start spidering. This is a URL relative to the scanned host, e.g. /default.html (default: /)
- httpspider.maxpagecount
the maximum number of pages to visit. A negative value disables the limit (default: 20)
- httpspider.useheadfornonwebfiles
if set, the crawler uses HEAD instead of GET for files that do not have extensions indicating that they are webpages (the list of webpage extensions is located in nselib/data/http-web-files-extensions.lst)
- httpspider.noblacklist
if set, doesn't load the default blacklist
- httpspider.maxdepth
the maximum number of directories beneath the initial URL to spider. A negative value disables the limit. (default: 3)
- httpspider.withinhost
Closure that overrides the default withinhost function that only spiders URLs within the same host. If this is set to false the crawler will spider URLs both inside and outside the host. See the closure section above to override the default behaviour. (default: true)
- httpspider.withindomain
Closure that overrides the default withindomain function that only spiders URLs within the same domain. This widens the scope from withinhost and the two cannot be used in combination. See the closure section above to override the default behaviour. (default: false)
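As an illustration of overriding doscraping (referenced in the list above), here is a minimal sketch assuming the closure is called with the URL under consideration; it skips link extraction for PDF resources:

-- a minimal sketch, assuming the closure receives the URL being
-- considered: extract links from everything except PDF resources
crawler.options.doscraping = function(url)
  if not crawler:isresource(url, "pdf") then
    return true
  end
end

On the command line, these arguments are passed to any script that uses the library via --script-args, for example --script-args httpspider.maxpagecount=40,httpspider.url=/app/.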
