Typedef
Static Public Summary
public | ChromeOptions: Object |
public | CrawlConfig: Object |
public | CrawlControl: Object |
public | NetIdleOptions: Object |
public | OnLoadInject: {scriptSource: string} |
public | OnNewDocumentInject: {source: string} |
public | Seed |
public | UserScript: function(page: Page): Promise<void> |
public | VersionInfo |
public | WARCOptions: Object |
Static Public
public ChromeOptions: Object
Properties:
Name | Type | Attribute | Description |
use | string | | Whether Squidwarc should connect directly to Chrome/Chromium or go through puppeteer |
executable | string | | Path to the browser executable, or the command used to launch the browser |
userDataDir | string | | Path to a user data directory (generated by Chrome/Chromium) to use instead of a temporary one |
host | string | | The host name the browser's CDP endpoint is listening on |
port | number | | The port number the browser's CDP endpoint is listening on |
launch | boolean | | Whether Squidwarc should launch and manage the browser or connect to an already running instance |
headless | boolean | | Whether the browser used by Squidwarc should be launched in headless mode |
local | boolean | | Whether chrome-remote-interface should use its local CDP descriptor or fetch the descriptor from the browser it connects to |
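To make the two ways of using these properties concrete, here is a minimal sketch with one object that asks Squidwarc to launch the browser and one that attaches to a running instance. The property names come from the table above; the values 'puppeteer' and 'chrome' for use, the executable name, localhost:9222, and the describeChrome helper are assumptions for illustration only.

```javascript
// Hypothetical ChromeOptions objects. Property names are from the table;
// every concrete value below is an assumption.
const launchOptions = {
  use: 'puppeteer',               // assumed value for "via puppeteer"
  executable: 'chromium-browser', // assumed command name
  launch: true,                   // Squidwarc launches and manages the browser
  headless: true
};

const connectOptions = {
  use: 'chrome',     // assumed value for "connect directly"
  host: 'localhost', // assumed CDP host
  port: 9222,        // assumed CDP port
  launch: false,     // attach to an already running browser
  local: false       // fetch the CDP descriptor from the browser
};

// Hypothetical helper (not part of Squidwarc): summarize what the options ask for.
function describeChrome(opts) {
  if (opts.launch) {
    const mode = opts.headless ? 'headless' : 'headed';
    return `launch ${opts.executable} (${mode})`;
  }
  return `connect to ${opts.host}:${opts.port}`;
}
```

When launch is false, host and port are the properties that matter; when it is true, executable, userDataDir and headless describe the browser to start.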
public CrawlConfig: Object
Properties:
Name | Type | Attribute | Description |
chrome | ChromeOptions | | Information about how to connect to or launch Chrome/Chromium |
mode | string | | The mode this crawl operates in |
depth | number | | The depth of this crawl |
crawlControl | CrawlControl | | Options for fine-tuning the crawl |
warc | WARCOptions | | Options for how this crawl's WARCs should be created |
versionInfo | VersionInfo | | Information to include in the WARC Info record fields for each page preserved |
seeds | Seed|Seed[] | | The seed(s) to be crawled |
script | UserScript | | A script to run when using puppeteer. If this value is valid, the use option defaults to puppeteer |
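The shape of a full CrawlConfig can be sketched as follows, together with a function matching the UserScript typedef, function(page: Page): Promise<void>. This is only an illustration: the mode name 'page-only', the naming value 'url', the seed given as a plain URL string, all timing values, and passing the script directly as a function are assumptions not confirmed by this page. page.evaluate is the puppeteer Page method.

```javascript
// Hypothetical user script: receives the puppeteer Page and resolves when done.
async function scrollToBottom(page) {
  // page.evaluate runs the supplied function inside the page being crawled.
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight);
  });
}

// Hypothetical CrawlConfig. Property names follow the table above;
// the concrete values are assumptions.
const crawlConfig = {
  chrome: { launch: true, headless: true },
  mode: 'page-only', // assumed mode name
  depth: 1,
  crawlControl: {
    globalWait: 60000,
    numInflight: 2,
    inflightIdle: 1000,
    navWait: 8000
  },
  warc: { naming: 'url', output: './warcs', append: false },
  seeds: ['https://example.com/'], // assumes a seed may be a URL string
  script: scrollToBottom           // per the table, use then defaults to puppeteer
};
```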
public CrawlControl: Object
Properties:
Name | Type | Attribute | Description |
globalWait | number | | Maximum amount of time, in milliseconds, that Squidwarc should wait before generating a WARC and moving to the next URL |
numInflight | number | | The number of in-flight requests (requests with no response yet) that should remain before the inflightIdle timer starts |
inflightIdle | number | | Amount of time, in milliseconds, that must elapse with no more than numInflight requests in flight before the network is considered idle |
navWait | number | | Maximum amount of time, in milliseconds, that Squidwarc should wait for an indication that the browser has navigated to the page being crawled |
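How globalWait, numInflight and inflightIdle interact can be sketched with a small watcher. This is not Squidwarc's implementation; the class NetIdleSketch, its method names, and the choice not to restart the idle timer while the count stays at or below numInflight are all assumptions made for illustration.

```javascript
// Hypothetical sketch of network-idle detection driven by the three options.
class NetIdleSketch {
  constructor({ globalWait, numInflight, inflightIdle }) {
    this.numInflight = numInflight;
    this.inflightIdle = inflightIdle;
    this.inflight = 0;      // requests that have no response yet
    this.idleTimer = null;
    this.finished = false;
    // done resolves with 'idle' or 'globalWait', whichever happens first.
    this.done = new Promise((resolve) => { this._resolve = resolve; });
    // Upper bound on the whole wait.
    this.globalTimer = setTimeout(() => this._finish('globalWait'), globalWait);
    this._check();
  }

  requestStarted() {
    this.inflight += 1;
    this._check();
  }

  requestFinished() {
    this.inflight = Math.max(0, this.inflight - 1);
    this._check();
  }

  _check() {
    if (this.finished) return;
    if (this.inflight <= this.numInflight) {
      // Few enough requests remain: start the idle timer if it is not running.
      if (this.idleTimer === null) {
        this.idleTimer = setTimeout(() => this._finish('idle'), this.inflightIdle);
      }
    } else if (this.idleTimer !== null) {
      // Too many requests in flight again: cancel the idle timer.
      clearTimeout(this.idleTimer);
      this.idleTimer = null;
    }
  }

  _finish(reason) {
    if (this.finished) return;
    this.finished = true;
    clearTimeout(this.globalTimer);
    clearTimeout(this.idleTimer);
    this._resolve(reason);
  }
}
```

A caller would feed requestStarted and requestFinished from the browser's network events and await done; navWait is a separate bound on navigation and is not modeled here.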
public NetIdleOptions: Object
Properties:
Name | Type | Attribute | Description |
globalWait | number | | Maximum amount of time, in milliseconds, to wait for network idle to occur |
numInflight | number | | The number of in-flight requests (requests with no response yet) that should remain before the inflightIdle timer starts |
inflightIdle | number | | Amount of time, in milliseconds, that must elapse with no more than numInflight requests in flight before the network is considered idle |
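NetIdleOptions carries the same three timing properties as CrawlControl, without navWait. Assuming one is derived from the other (this page does not say how Squidwarc does it), a minimal sketch with a hypothetical helper named toNetIdleOptions looks like this:

```javascript
// Hypothetical helper (not part of Squidwarc): keep only the properties
// that NetIdleOptions documents, dropping navWait and anything else.
function toNetIdleOptions(crawlControl) {
  const { globalWait, numInflight, inflightIdle } = crawlControl;
  return { globalWait, numInflight, inflightIdle };
}

// Example values are assumptions, in milliseconds where applicable.
const netIdleOptions = toNetIdleOptions({
  globalWait: 60000,
  numInflight: 2,
  inflightIdle: 1000,
  navWait: 8000
});
```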
public WARCOptions: Object
Properties:
Name | Type | Attribute | Description |
naming | string | | The naming scheme to be used for WARC generation |
output | string | | Path to the directory the WARCs are to be created in |
append | boolean | | Whether Squidwarc should create a single WARC file for the whole crawl |
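The types in the table above can be checked mechanically before a crawl starts. The following is a sketch of such a check, not something Squidwarc is known to ship; the names WARC_OPTION_TYPES and checkWARCOptions are hypothetical, and the example values 'url' and './warcs' are assumptions.

```javascript
// Expected types, taken from the WARCOptions table.
const WARC_OPTION_TYPES = {
  naming: 'string',
  output: 'string',
  append: 'boolean'
};

// Hypothetical validator: returns a list of problems, empty when every
// property that is present has the documented type. Missing properties are
// not flagged because the table does not say which ones are required.
function checkWARCOptions(options) {
  const problems = [];
  for (const [name, type] of Object.entries(WARC_OPTION_TYPES)) {
    if (name in options && typeof options[name] !== type) {
      problems.push(`${name} should be a ${type}`);
    }
  }
  return problems;
}

// Example values are assumptions.
const warcProblems = checkWARCOptions({ naming: 'url', output: './warcs', append: true });
```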