Library httpspider
A smallish httpspider library providing basic spidering capabilities. It consists of the following classes:
Options
LinkExtractor
URL
UrlQueue
Crawler
The following sample code shows how the spider could be used:
local crawler = httpspider.Crawler:new( host, port, '/', { scriptname = SCRIPT_NAME } )
crawler:set_timeout(10000)

local result
while(true) do
  local status, r = crawler:crawl()
  if ( not(status) ) then
    break
  end
  if ( r.response.body:match(str_match) ) then
    crawler:stop()
    result = r.url
    break
  end
end

return result
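
To try the loop above outside of Nmap, the following minimal sketch replaces the real crawler with a stand-in. MockCrawler, its pages table, and find_first_match are hypothetical names introduced only for illustration. They mimic the (status, r) return shape that the sample relies on and are not part of httpspider. The mock also uses plain strings for r.url, which is a simplification.

```lua
-- Hypothetical stand-in for httpspider.Crawler so the loop can run
-- without Nmap. It only mimics what the sample code relies on.
local MockCrawler = {}
MockCrawler.__index = MockCrawler

function MockCrawler:new(pages)
  return setmetatable({ pages = pages, pos = 0, stopped = false }, self)
end

-- Returns true plus a result table while pages remain,
-- and false once the list is exhausted or stop() was called.
function MockCrawler:crawl()
  if self.stopped or self.pos >= #self.pages then
    return false, { err = false, reason = "no more pages" }
  end
  self.pos = self.pos + 1
  local page = self.pages[self.pos]
  return true, { url = page.url, response = { body = page.body } }
end

function MockCrawler:stop()
  self.stopped = true
end

-- The search loop from the sample, wrapped in a function.
-- str_match is a Lua pattern supplied by the caller.
local function find_first_match(crawler, str_match)
  while true do
    local status, r = crawler:crawl()
    if not status then
      return nil
    end
    if r.response.body:match(str_match) then
      crawler:stop()
      return r.url
    end
  end
end

-- Made-up pages used only to exercise the loop.
local pages = {
  { url = "/",      body = "<html>welcome</html>" },
  { url = "/login", body = "<form>password</form>" },
  { url = "/about", body = "<p>about us</p>" },
}

print(find_first_match(MockCrawler:new(pages), "password"))
```

Wrapping the loop in a function makes the two exit paths explicit: the function returns nil when the crawler runs out of pages and returns the URL as soon as a body matches.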
Author:
Patrik Karlsson <patrik@cqure.net>
Source: http://nmap.org/svn/nselib/httpspider.lua
Script Arguments
httpspider.url
The URL at which to start spidering. This is a URL relative to the scanned host, e.g. /default.html (default: /)
httpspider.maxpagecount
The maximum number of pages to visit. A negative value disables the limit. (default: 20)
httpspider.useheadfornonwebfiles
If set, the crawler uses HEAD instead of GET for files whose extensions do not indicate that they are web pages (the list of web page extensions is located in nselib/data/http-web-files-extensions.lst).
httpspider.noblacklist
If set, the default blacklist is not loaded.
httpspider.maxdepth
The maximum number of directories beneath the initial URL to spider. A negative value disables the limit. (default: 3)
httpspider.withinhost
Only spider URLs within the same host. (default: true)
httpspider.withindomain
Only spider URLs within the same domain. This widens the scope from withinhost, and the two options cannot be used in combination. (default: false)
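
The limits described by maxdepth, withinhost, and withindomain can be illustrated with a small self-contained sketch. It is not the library's implementation. The helper names (depth_beneath, within_depth, in_scope) and the start table are hypothetical. The sketch also makes two assumptions: paths outside the initial directory are treated as out of depth, and setting withinhost and withindomain together is reported as an error.

```lua
-- Hypothetical helpers illustrating the option semantics only.

-- Number of directory levels that `path` lies beneath the directory
-- of `base`; returns nil when `path` is not beneath it.
local function depth_beneath(base, path)
  local base_dir = base:match("^(.*/)") or "/"
  local dir = path:match("^(.*/)") or "/"
  if dir:sub(1, #base_dir) ~= base_dir then
    return nil
  end
  local rest = dir:sub(#base_dir + 1)
  local _, count = rest:gsub("/", "")
  return count
end

-- A negative maxdepth disables the limit, as the option text states.
local function within_depth(maxdepth, base, path)
  if maxdepth < 0 then
    return true
  end
  local depth = depth_beneath(base, path)
  return depth ~= nil and depth <= maxdepth
end

-- start = { host = ..., domain = ... } describes the scanned host.
local function in_scope(opts, start, link_host)
  -- Assumption: the two scope options are mutually exclusive.
  if opts.withinhost and opts.withindomain then
    error("withinhost and withindomain cannot be combined")
  end
  if opts.withindomain then
    local suffix = "." .. start.domain
    return link_host == start.domain
        or link_host:sub(-#suffix) == suffix
  end
  if opts.withinhost then
    return link_host == start.host
  end
  return true
end

local start = { host = "www.example.com", domain = "example.com" }
print(within_depth(3, "/", "/a/b/c.html"))
print(in_scope({ withindomain = true }, start, "mail.example.com"))
```

With withinhost, only www.example.com itself is in scope. With withindomain, sibling hosts such as mail.example.com are accepted as well, which is the widening of scope mentioned above.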


