Typedef

Static Public

public ChromeOptions: Object

Properties:

Name | Type | Attribute | Description
use string
  • optional
  • default: chrome

Should Squidwarc connect directly to Chrome/Chromium or via puppeteer

executable string
  • optional

Path to the browser executable or command to be used to launch the browser

userDataDir string
  • optional

Path to a user data directory (generated by Chrome/Chromium) to be used rather than a temporary one

host string
  • optional
  • default: localhost

The host name the browser's CDP endpoint is listening on

port number
  • optional
  • default: 9222

The port number the browser's CDP endpoint is listening on

launch boolean
  • optional
  • default: true

Should Squidwarc launch and manage the browser or connect to an already running instance

headless boolean
  • optional
  • default: true

Should the browser used by Squidwarc be launched in headless mode

local boolean
  • optional
  • default: false

Should chrome-remote-interface use its local CDP descriptor or fetch the descriptor from the browser it is connecting to
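
The defaults listed above can be collected into a single object and overlaid with user-supplied values. The following is a minimal sketch only: `CHROME_DEFAULTS` and `withChromeDefaults` are hypothetical names used for illustration and are not part of Squidwarc. `executable` and `userDataDir` are omitted because no default is documented for them.

```javascript
// Defaults copied from the ChromeOptions property list above.
const CHROME_DEFAULTS = Object.freeze({
  use: 'chrome',
  host: 'localhost',
  port: 9222,
  launch: true,
  headless: true,
  local: false
})

// Hypothetical helper (not Squidwarc API): user values win over the documented defaults.
function withChromeDefaults (userOptions = {}) {
  return { ...CHROME_DEFAULTS, ...userOptions }
}

// Connect via puppeteer with a visible browser window; everything else keeps its default.
const chrome = withChromeDefaults({ use: 'puppeteer', headless: false })
console.log(chrome)
```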

public CrawlConfig: Object

Properties:

Name | Type | Attribute | Description
chrome ChromeOptions

Information about how to connect to or launch Chrome/Chromium

mode string
  • optional
  • default: page-only

The mode this crawl operates in

depth number
  • optional
  • default: 1

The depth of this crawl

crawlControl CrawlControl

Options for fine tuning the crawl

warc WARCOptions

Options for how this crawl's WARCs should be created

versionInfo VersionInfo

Information to be included in the WARC Info record fields per page preserved

seeds Seed | Seed[]

The seed(s) to be crawled

script UserScript
  • optional

A script to be run when using puppeteer. If a valid script is supplied, use defaults to puppeteer
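
To illustrate how the CrawlConfig properties fit together, here is a rough sketch that fills in the two documented defaults (`mode` and `depth`) and normalizes `seeds`, which may be a single Seed or an array. `normalizeCrawlConfig` is a hypothetical helper, not part of Squidwarc, and the Symbol used for the seed's `mode` is a placeholder because this page does not show how seed modes are constructed.

```javascript
// Hypothetical helper (not Squidwarc API): apply documented defaults and normalize seeds.
function normalizeCrawlConfig (config) {
  // These properties carry no "optional" attribute in the list above.
  for (const key of ['chrome', 'crawlControl', 'warc', 'versionInfo', 'seeds']) {
    if (config[key] == null) throw new TypeError(`CrawlConfig.${key} is required`)
  }
  return {
    ...config,
    mode: config.mode ?? 'page-only', // documented default
    depth: config.depth ?? 1, // documented default
    // seeds is typed Seed | Seed[]; always hand back an array.
    seeds: Array.isArray(config.seeds) ? config.seeds : [config.seeds]
  }
}

const config = normalizeCrawlConfig({
  chrome: { use: 'chrome', host: 'localhost', port: 9222 },
  crawlControl: { globalWait: 60000, numInflight: 2, inflightIdle: 1000, navWait: 8000 },
  warc: { naming: 'url', append: false },
  versionInfo: { isPartOfV: 'Squidwarc Crawl' },
  // Placeholder Symbol: an assumption made purely for this example.
  seeds: { url: 'https://example.com', mode: Symbol('page-only'), depth: 1 }
})
console.log(config.mode, config.depth, config.seeds.length)
```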

public CrawlControl: Object

Properties:

Name | Type | Attribute | Description
globalWait number
  • optional
  • default: 60000

Maximum amount of time, in milliseconds, that Squidwarc should wait before generating a WARC and moving to the next URL

numInflight number
  • optional
  • default: 2

The number of inflight requests (requests with no response) that should exist before starting the inflightIdle timer

inflightIdle number
  • optional
  • default: 1000

Amount of time, in milliseconds, that should elapse when there are only numInflight requests for network idle to be determined

navWait number
  • optional
  • default: 8000

Maximum amount of time, in milliseconds, that Squidwarc should wait for indication that the browser has navigated to the page being crawled

public NetIdleOptions: Object

Properties:

Name | Type | Attribute | Description
globalWait number
  • optional
  • default: 40000

Maximum amount of time, in milliseconds, to wait for network idle to occur

numInflight number
  • optional
  • default: 2

The number of inflight requests (requests with no response) that should exist before starting the inflightIdle timer

inflightIdle number
  • optional
  • default: 1500

Amount of time, in milliseconds, that should elapse when there are only numInflight requests for network idle to be determined
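
The interplay of `globalWait`, `numInflight` and `inflightIdle` can be sketched with two timers. This is an illustrative sketch, not Squidwarc's implementation: the `NetIdleTracker` class and its method names are hypothetical, and it assumes "only numInflight requests" means "numInflight or fewer".

```javascript
// Hypothetical network-idle tracker built from the NetIdleOptions described above.
class NetIdleTracker {
  constructor ({ globalWait = 40000, numInflight = 2, inflightIdle = 1500 } = {}, onIdle) {
    this.numInflight = numInflight
    this.inflightIdle = inflightIdle
    this.onIdle = onIdle
    this.inflight = 0 // requests that have not yet received a response
    this.idleTimer = null
    this.done = false
    // Upper bound: stop waiting for idle once globalWait has elapsed.
    this.globalTimer = setTimeout(() => this.finish('globalWait'), globalWait)
  }

  requestStarted () {
    this.inflight += 1
    this.check()
  }

  requestFinished () {
    this.inflight = Math.max(0, this.inflight - 1)
    this.check()
  }

  check () {
    if (this.done) return
    if (this.inflight <= this.numInflight) {
      // Few enough requests in flight: start the inflightIdle timer if it is not running.
      if (this.idleTimer === null) {
        this.idleTimer = setTimeout(() => this.finish('idle'), this.inflightIdle)
      }
    } else if (this.idleTimer !== null) {
      // Too many requests in flight again: cancel the pending idle decision.
      clearTimeout(this.idleTimer)
      this.idleTimer = null
    }
  }

  finish (reason) {
    if (this.done) return
    this.done = true
    clearTimeout(this.globalTimer)
    clearTimeout(this.idleTimer)
    this.onIdle(reason)
  }
}

// Short timings so the example finishes quickly.
const tracker = new NetIdleTracker(
  { globalWait: 500, numInflight: 2, inflightIdle: 50 },
  reason => console.log(`network idle decided by: ${reason}`)
)
tracker.requestStarted()
tracker.requestStarted()
tracker.requestStarted() // 3 in flight: above numInflight, idle timer cancelled
tracker.requestFinished() // 2 in flight: idle timer started again
```

In a real crawler the `requestStarted` and `requestFinished` calls would be driven by the browser's network events; here they are invoked by hand.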

public OnLoadInject: {scriptSource: string}

public OnNewDocumentInject: {source: string}
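
The two shapes differ only in the property name that holds the script text. They resemble the parameter objects of the Chrome DevTools Protocol methods `Page.addScriptToEvaluateOnLoad` (`scriptSource`) and `Page.addScriptToEvaluateOnNewDocument` (`source`); that correspondence is an assumption, since this page does not state it. The helpers below are hypothetical and only show how each object could be built from a function.

```javascript
// Hypothetical helpers (not Squidwarc API): wrap a function as an immediately invoked script string.
function toOnLoadInject (fn) {
  return { scriptSource: `(${fn.toString()})()` }
}

function toOnNewDocumentInject (fn) {
  return { source: `(${fn.toString()})()` }
}

// Example function to inject; it only runs inside the page, never in Node.js.
function markPage () {
  window.__injected = true
}

const onLoad = toOnLoadInject(markPage)
const onNewDocument = toOnNewDocumentInject(markPage)
console.log(Object.keys(onLoad), Object.keys(onNewDocument))
```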

public Seed: Object

Properties:

Name | Type | Attribute | Description
url string

The URL of the seed to be crawled

mode Symbol

The mode the seed and the URLs discovered by crawling the seed should operate in

depth number

The depth of the crawl

public UserScript: function(page: Page): Promise<void>
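
A function matching this signature might look like the sketch below. It assumes `page` is a puppeteer `Page` and uses only `page.evaluate`; the scrolling behavior is an illustrative choice, and how Squidwarc loads and invokes the script is not described on this page.

```javascript
// Example user script: receives the page being crawled and resolves when it is done.
async function userScript (page) {
  // Scroll to the bottom so lazily loaded content is requested while the page is open.
  await page.evaluate(() => {
    window.scrollTo(0, document.body.scrollHeight)
  })
}
```

Because the function is `async`, it always returns a Promise, which satisfies the `Promise<void>` return type.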

public VersionInfo: Object

Properties:

Name | Type | Attribute | Description
isPartOfV string
  • optional
  • default: Squidwarc Crawl

The value for the isPartOf field of the WARC Info Record

warcInfoDescription string
  • optional
  • default: High fidelity preservation using Squidwarc

The value for the description field of the WARC Info Record

public WARCOptions: Object

Properties:

Name | Type | Attribute | Description
naming string
  • optional
  • default: url

The naming scheme to be used for WARC generation

output string
  • optional
  • default:

Path to the directory the WARCs are to be created in

append boolean
  • optional
  • default: false

Should Squidwarc create a single WARC file for the crawl or not