Function

Static Public Summary
public

async chromeRunner(conf: CrawlConfig): Promise<void, Error>

Launches a crawl using the supplied crawl configuration

public

collect(): Promise<{outlinks: string, links: Array<string>, location: string}>

Starts the collection of the outlinks.

public

delay(amount: number): Promise<void>

Promise wrapper around setTimeout

public

initCollectLinks(): void

Function injected into every frame of the page currently being crawled that sets up outlink collection according to whether the frame it is injected into is the top frame or a sub frame.

public

isEmptyPlainObject(object: Object): boolean

Test to see if a plain object is empty

public

async launch(options: ChromeOptions): Promise<!Puppeteer.Browser>

Launches Chrome/Chromium and connects to it, or connects to an already running instance

public

makeRunnable(runnable: function(...args: any): Promise): function(...args: any): void

Composes the supplied function with runPromise.

public

noNaughtyJS()

Function that disables the setting of the window event handlers onbeforeunload and onunload and disables the usage of window.alert, window.confirm, and window.prompt.

public

async outLinks(): Promise<{outlinks: string, links: Array<string>}>

Builds the WARC outlink metadata information and finds potential links to go to next from a page

public

async puppeteerRunner(conf: CrawlConfig): Promise<void, Error>

Launches a crawl using the supplied crawl configuration

public

runPromise(runnable: function(): Promise<any>|Promise<any>, thener: function(...args: any), catcher: function(...args: any)): void

Runs a promise using the supplied thener and catcher functions

public

scrollOnLoad()

Function injected into every frame of the page being crawled that, once the load event has fired, scrolls the page a maximum of 20 times or until no further scrolling is possible

public

async scrollPage(): Promise<void>

Function that scrolls the page/frame it is injected into a maximum of 20 times or until no further scrolling is possible

Static Public

public async chromeRunner(conf: CrawlConfig): Promise<void, Error>

import chromeRunner from 'squidwarc/lib/runners/chromeRunner.js'

Launches a crawl using the supplied crawl configuration

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| conf | CrawlConfig | | The crawl config for this crawl |

Return:

Promise<void, Error>

public collect(): Promise<{outlinks: string, links: Array<string>, location: string}>

Starts the collection of the outlinks. Use only when initCollectLinks is pre-injected into every frame

Return:

Promise<{outlinks: string, links: Array<string>, location: string}>

public delay(amount: number): Promise<void>

import {delay} from 'squidwarc/lib/utils/promises.js'

Promise wrapper around setTimeout

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| amount | number | | The amount of time to delay by, in milliseconds (as with setTimeout) |

Return:

Promise<void>
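
As an illustration of what a promise wrapper around setTimeout can look like, here is a minimal sketch. It is not taken from Squidwarc's source; the helper `demo` is a hypothetical name used only to show the wrapper being awaited.

```javascript
// Sketch only: resolves (with no value) once `amount` milliseconds have elapsed.
function delay (amount) {
  return new Promise(resolve => setTimeout(resolve, amount))
}

// Hypothetical usage: pause for roughly 50 ms and report how long the pause took.
async function demo () {
  const start = Date.now()
  await delay(50)
  return Date.now() - start
}
```

Because the returned promise resolves with no value, callers normally just `await delay(ms)` between two steps.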

public initCollectLinks(): void

Function injected into every frame of the page currently being crawled that sets up outlink collection according to whether the frame it is injected into is the top frame or a sub frame.

If this function is injected into the top frame, instances of both Collector and TopHandler are created; otherwise only an instance of Collector is created.

When injected into the top frame, the $$$$Squidwarc$$Collector$$$$ property is defined on window with the created TopHandler instance as its value, and a message event listener is registered on window to receive messages sent by this script when it is injected into child frames.

Each child frame sends two messages (indicateIsChild, outlinkgot) and listens for one (outlinkcollect). The message types are found in the object m within the body of this function. The indicateIsChild message is sent immediately by a child frame so that TopHandler can hold onto a reference to the frame for communicating with it. The outlinkgot message is sent by each child frame to the top frame once outlinks have been collected for that frame. The outlinkcollect message is sent by TopHandler to each child frame to have it start collecting outlinks.

Return:

void
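
The three-message exchange described above can be simulated outside a browser. The sketch below is an assumption-laden model, not Squidwarc's code: the classes `TopHandler` and `ChildFrame` here are simplified stand-ins, the string values in `m` are guesses, and direct method calls replace the asynchronous `postMessage` / message-event plumbing a real page would use.

```javascript
// Message types named in the description; the string values are assumptions.
const m = {
  indicateIsChild: 'indicateIsChild',
  outlinkcollect: 'outlinkcollect',
  outlinkgot: 'outlinkgot'
}

// Simplified stand-in for the handler created in the top frame.
class TopHandler {
  constructor (ownLinks) {
    this.links = [...ownLinks]
    this.children = []
  }

  // Plays the role of the window "message" event listener.
  onMessage (msg, source) {
    if (msg.type === m.indicateIsChild) this.children.push(source)
    else if (msg.type === m.outlinkgot) this.links.push(...msg.links)
  }

  // Asks every registered child to collect, then returns everything gathered.
  collect () {
    for (const child of this.children) child.onMessage({ type: m.outlinkcollect })
    return this.links
  }
}

// Simplified stand-in for the script running inside a sub frame.
class ChildFrame {
  constructor (top, ownLinks) {
    this.top = top
    this.ownLinks = ownLinks
    // Sent immediately so the top frame can keep a reference to this frame.
    top.onMessage({ type: m.indicateIsChild }, this)
  }

  onMessage (msg) {
    if (msg.type === m.outlinkcollect) {
      this.top.onMessage({ type: m.outlinkgot, links: this.ownLinks }, this)
    }
  }
}

// One top frame with two sub frames (example URLs are placeholders).
const top = new TopHandler(['https://example.com/a'])
new ChildFrame(top, ['https://example.com/b'])
new ChildFrame(top, ['https://example.com/c'])
const collected = top.collect()
```

In this model `collected` ends up holding the top frame's links followed by each child's links, in registration order.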

public isEmptyPlainObject(object: Object): boolean

import isEmptyPlainObject from 'squidwarc/lib/utils/isEmptyPlainObject.js'

Test to see if a plain object is empty

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| object | Object | | |

Return:

boolean
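
One way such a check could be written is sketched below. How the real function treats arrays, null, and class instances is not stated here, so the handling of those cases is an assumption of this sketch.

```javascript
// Sketch only: true for `{}` (or a prototype-less object) with no own enumerable keys.
function isEmptyPlainObject (object) {
  if (object === null || typeof object !== 'object') return false
  const proto = Object.getPrototypeOf(object)
  // Assumption: arrays, Maps, class instances, etc. do not count as plain objects.
  if (proto !== Object.prototype && proto !== null) return false
  return Object.keys(object).length === 0
}

const results = [
  isEmptyPlainObject({}),
  isEmptyPlainObject({ seeds: [] }),
  isEmptyPlainObject([])
]
```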

public async launch(options: ChromeOptions): Promise<!Puppeteer.Browser>

Launches Chrome/Chromium and connects to it, or connects to an already running instance

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| options | ChromeOptions | | |

Return:

Promise<!Puppeteer.Browser> (nullable: false)

public makeRunnable(runnable: function(...args: any): Promise): function(...args: any): void

import {makeRunnable} from 'squidwarc/lib/utils/promises.js'

Composes the supplied function with runPromise.

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| runnable | function(...args: any): Promise | | |

Return:

function(...args: any): void

public noNaughtyJS()

Function that disables the setting of the window event handlers onbeforeunload and onunload and disables the usage of window.alert, window.confirm, and window.prompt.

This is done to ensure that they cannot be used as crawler traps.
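
A rough sketch of how this kind of neutralization might be done follows. It is not the actual injected function: the `win` parameter (standing in for the frame's window) and the return values of the replaced dialogs are assumptions made so the sketch can run outside a browser.

```javascript
// Sketch only. `win` stands in for the frame's window object.
function noNaughtyJS (win) {
  // Swallow assignments to the unload handlers so a page cannot register them.
  for (const name of ['onbeforeunload', 'onunload']) {
    Object.defineProperty(win, name, {
      configurable: false,
      get () { return null },
      set () {} // the assigned handler is silently discarded
    })
  }
  // Replace the blocking dialogs with no-ops (return values are assumptions).
  win.alert = () => {}
  win.confirm = () => true
  win.prompt = () => null
}

// Usage with a fake window whose alert would otherwise block the crawl.
const fakeWindow = { alert () { throw new Error('dialog shown') } }
noNaughtyJS(fakeWindow)
fakeWindow.onbeforeunload = () => 'Are you sure you want to leave?'
```

After the call, the assignment to `onbeforeunload` has no effect and `fakeWindow.alert()` returns without doing anything.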

public async outLinks(): Promise<{outlinks: string, links: Array<string>}>

Builds the WARC outlink metadata information and finds potential links to go to next from a page

Return:

Promise<{outlinks: string, links: Array<string>}>
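
To make the shape of the return value concrete, here is a hypothetical, synchronous helper that produces an `{outlinks, links}` pair from a list of anchor-like objects. The name `buildOutLinks`, the `outlink: <url> L a/@href` line format, and the rule that only http(s) URLs are crawl candidates are all assumptions of this sketch, not confirmed details of Squidwarc.

```javascript
// Sketch only: `anchors` is any array of objects with an `href` string.
function buildOutLinks (anchors) {
  const seen = new Set()
  const links = []
  let outlinks = ''
  for (const { href } of anchors) {
    if (!href || seen.has(href)) continue // skip empty and duplicate hrefs
    seen.add(href)
    // Assumed metadata line format, one outlink per CRLF-terminated line.
    outlinks += `outlink: ${href} L a/@href\r\n`
    // Assumption: only http(s) URLs are candidates for visiting next.
    if (/^https?:/i.test(href)) links.push(href)
  }
  return { outlinks, links }
}

const found = buildOutLinks([
  { href: 'https://example.com/next' },
  { href: 'mailto:someone@example.com' },
  { href: 'https://example.com/next' }
])
```

Here `found.outlinks` is a single string destined for the WARC metadata, while `found.links` is the array of URLs the crawler could go to next.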

public async puppeteerRunner(conf: CrawlConfig): Promise<void, Error>

import puppeteerRunner from 'squidwarc/lib/runners/puppeteerRunner.js'

Launches a crawl using the supplied crawl configuration

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| conf | CrawlConfig | | The crawl config for this crawl |

Return:

Promise<void, Error>

public runPromise(runnable: function(): Promise<any>|Promise<any>, thener: function(...args: any), catcher: function(...args: any)): void

import runPromise from 'squidwarc/lib/runPromise.js'

Runs a promise using the supplied thener and catcher functions

Params:

| Name | Type | Attribute | Description |
| --- | --- | --- | --- |
| runnable | function(): Promise<any>\|Promise<any> | nullable: false | The promise or async / promise returning function to run |
| thener | function(...args: any) | optional | The callback function to be supplied to Promise.then |
| catcher | function(...args: any) | optional | The callback function to be supplied to Promise.catch |

Return:

void
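
The relationship between runPromise and makeRunnable can be pictured with the sketch below. It is an illustration under assumptions rather than the library's implementation: in particular, the default `thener` and `catcher` used when the optional callbacks are omitted are guesses.

```javascript
// Sketch only: accepts either a promise or a function returning one.
function runPromise (runnable, thener = () => {}, catcher = console.error) {
  const promise = typeof runnable === 'function' ? runnable() : runnable
  Promise.resolve(promise).then(thener).catch(catcher)
}

// Sketch only: wraps a promise-returning function so callers get a plain
// fire-and-forget function whose result is handed to runPromise.
function makeRunnable (runnable) {
  return (...args) => { runPromise(runnable(...args)) }
}

// Usage: record what each callback receives.
const seen = []
runPromise(async () => 42, value => seen.push(['then', value]))
runPromise(Promise.reject(new Error('boom')), undefined, err => seen.push(['catch', err.message]))
const run = makeRunnable(async (a, b) => { seen.push(['ran', a + b]) })
const returned = run(1, 2)
```

Both helpers return nothing, so outcomes are only observable through the callbacks; `returned` above is `undefined`.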

public scrollOnLoad()

Function injected into every frame of the page being crawled that, once the load event has fired, scrolls the page a maximum of 20 times or until no further scrolling is possible

public async scrollPage(): Promise<void>

Function that scrolls the page/frame it is injected into a maximum of 20 times or until no further scrolling is possible

Return:

Promise<void>
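
The "at most 20 times or until nothing moves" loop could look roughly like the sketch below. It is a sketch under assumptions: the `win`, `maxScrolls`, and `pauseMs` parameters are hypothetical (the documented function takes no arguments and acts on its own frame), and the scroll count is returned only so the behavior can be observed.

```javascript
// Sketch only. `win` needs scrollY, innerHeight and scrollBy, as a browser window has.
async function scrollPage (win, maxScrolls = 20, pauseMs = 10) {
  let scrolls = 0
  while (scrolls < maxScrolls) {
    const before = win.scrollY
    win.scrollBy(0, win.innerHeight) // move down by one viewport height
    // Give the page a moment to react (for example, to lazy-load content).
    await new Promise(resolve => setTimeout(resolve, pauseMs))
    if (win.scrollY === before) break // nothing moved: no more scrolling possible
    scrolls += 1
  }
  return scrolls
}

// Fake window for trying the loop outside a browser: 100px viewport,
// content that allows scrolling down to `maxY` and no further.
function makeFakeWindow (maxY) {
  return {
    scrollY: 0,
    innerHeight: 100,
    scrollBy (x, y) { this.scrollY = Math.min(this.scrollY + y, maxY) }
  }
}
```

With `makeFakeWindow(250)` the loop stops after the third effective scroll; with a very tall fake page it stops at the `maxScrolls` cap instead.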