import PuppeteerCrawler from 'squidwarc/lib/crawler/puppeteer.js'
public class | source

PuppeteerCrawler

Extends:

EventEmitter → PuppeteerCrawler

Crawler based on puppeteer

Constructor Summary

Public Constructor
public

constructor(options: CrawlConfig)

Create a new PuppeteerCrawler instance.

Member Summary

Public Members
public

defaultWait: {waitUntil: string, timeout: number}

Default wait options (waitUntil and timeout) for page.goto

public

options: CrawlConfig

Crawl configuration options

public

requestCapturer: PuppeteerCDPRequestCapturer

Private Members
private

_browser: Browser

An instance of puppeteer Browser

private

_client: CDPSession

An instance of puppeteer CDPSession used to

private

_currentUrl: string

The current URL the crawler is visiting

private

_page: Page

An instance of puppeteer Page

private

_ua: string

The UserAgent string of the browser

private

_warcGenerator: PuppeteerCDPWARCGenerator

Method Summary

Public Methods
public

[Symbol.iterator](): Iterator<CapturedRequest>

Iterate over the captured network requests for the current web page

public

async genInfoMetaDataRecord(warcInfo: Object): Promise<void>

Generate the WARC Info and Metadata records

public

genWARC(warcInfo: Object): Promise<void, Error>

Alias for genWarc

public

async genWarc(warcInfo: Object): Promise<void, Error>

Generate the WARC file

public

async getOutLinks(): Promise<{outlinks: string, links: Array<{href: string, pathname: string, host: string}>, location: string}, Error>

Retrieve the page's meta information

public

async getUserAgent(): Promise<string>

Retrieve the browser's user-agent string

public

async init()

Setup the crawler

public

initWARC(warcPath: string, appending: boolean): Promise<void>

Initialize the WARC writer for writing a new WARC

public

async navigate(url: string): Promise<boolean>

Navigate the browser to the URL of the page to be crawled

public

async runUserScript(): Promise<void>

If the user supplied a script, that script is executed; if none was supplied, the page is just scrolled

public

async shutdown(): Promise<void>

Stop crawling and exit

public

stop(): Promise<Object>

Stop the page loading and stop capturing requests

public

stopCapturingNetwork()

Stop capturing the current web page's network requests

public

stopPageLoading(): Promise<Object>

Stop the page load; equivalent to clicking the browser's refresh button while it shows an X

Private Methods
private

_onDisconnected()

Callback used to emit the disconnected event

private

_onWARCGenError(err: Error)

Listener for the WARC generator's error event

private

_onWARCGenFinished()

Listener for the WARC generator's finished event

Public Constructors

public constructor(options: CrawlConfig) source

Create a new PuppeteerCrawler instance. For a description of the expected options, see the JSDoc typedef CrawlConfig.

Params:

Name     Type         Attribute  Description
options  CrawlConfig             The crawl config for this crawl

Public Members

public defaultWait: {waitUntil: string, timeout: number} source

Default wait options (waitUntil and timeout) for page.goto

public options: CrawlConfig source

Crawl configuration options

public requestCapturer: PuppeteerCDPRequestCapturer source

Private Members

private _browser: Browser source

An instance of puppeteer Browser

private _client: CDPSession source

An instance of puppeteer CDPSession used to

private _currentUrl: string source

The current URL the crawler is visiting

private _page: Page source

An instance of puppeteer Page

private _ua: string source

The UserAgent string of the browser

private _warcGenerator: PuppeteerCDPWARCGenerator source

Public Methods

public [Symbol.iterator](): Iterator<CapturedRequest> source

Iterate over the captured network requests for the current web page

Return:

Iterator<CapturedRequest>
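
Because the class defines [Symbol.iterator], a crawler instance can be used directly in a for...of loop. The sketch below is only an illustration: the helper name countCapturedRequests is hypothetical, and since this page does not describe the shape of CapturedRequest, the example only counts the yielded items.

```javascript
// Hypothetical helper: counts the requests captured for the current page.
// `crawler` is assumed to be a PuppeteerCrawler that has visited a page;
// any iterable works, because only [Symbol.iterator] is used.
function countCapturedRequests (crawler) {
  let count = 0
  // for...of calls crawler[Symbol.iterator]() under the hood
  for (const _capturedRequest of crawler) {
    count += 1
  }
  return count
}
```

Spread syntax such as [...crawler] relies on the same iterator and could be used when the captured requests are needed as an array.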

public async genInfoMetaDataRecord(warcInfo: Object): Promise<void> source

Generate the WARC Info and Metadata records

Params:

Name      Type    Attribute  Description
warcInfo  Object             WARC record information

Return:

Promise<void> (nullable: false)

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

public genWARC(warcInfo: Object): Promise<void, Error> source

Alias for genWarc

Params:

Name      Type    Attribute  Description
warcInfo  Object             WARC record information

Return:

Promise<void, Error>

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

public async genWarc(warcInfo: Object): Promise<void, Error> source

Generate the WARC file

Params:

Name      Type    Attribute  Description
warcInfo  Object             WARC record information

Return:

Promise<void, Error>

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

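public async getOutLinks(): Promise<{outlinks: string, links: Array<{href: string, pathname: string, host: string}>, location: string}, Error> source

Retrieve the page's meta information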

Return:

Promise<{outlinks: string, links: Array<{href: string, pathname: string, host: string}>, location: string}, Error>
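
The resolved object can be destructured using the property names in the return type. A minimal sketch, assuming that location is the absolute URL of the current page and that each link's host has the same format as URL.host; the helper name sameHostLinks is hypothetical.

```javascript
// Hypothetical helper: returns the hrefs of the links that stay on the
// current page's host.
// `crawler` is assumed to be a PuppeteerCrawler that has already navigated.
async function sameHostLinks (crawler) {
  const { links, location } = await crawler.getOutLinks()
  // Assumption: `location` is an absolute URL, so the URL constructor accepts it
  const pageHost = new URL(location).host
  return links
    .filter(link => link.host === pageHost)
    .map(link => link.href)
}
```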

public async getUserAgent(): Promise<string> source

Retrieve the browser's user-agent string

Return:

Promise<string> (nullable: false)

public async init() source

Setup the crawler

public initWARC(warcPath: string, appending: boolean): Promise<void> source

Initialize the WARC writer for writing a new WARC

Params:

Name       Type     Attribute                 Description
warcPath   string                             The path to the new WARC
appending  boolean  optional, default: false  Append to an already existing WARC file

Return:

Promise<void>

A Promise that resolves once the warc-gen-finished event is emitted
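
Since the returned Promise resolves on warc-gen-finished, one plausible calling pattern is to keep the Promise, start generation, and await it afterwards. This is a sketch under the assumption that the event is emitted only after genWARC has done its work; the helper name writeWarc is hypothetical and is not part of the class.

```javascript
// Hypothetical helper: writes one WARC and waits for the generator to finish.
// `crawler` is assumed to be a PuppeteerCrawler that has visited a page.
async function writeWarc (crawler, warcPath, warcInfo) {
  // Keep the promise instead of awaiting it here: per the description above it
  // settles on warc-gen-finished, which is assumed to fire after generation.
  const finished = crawler.initWARC(warcPath, false)
  await crawler.genWARC(warcInfo)
  await finished
}
```

If the event turns out to be emitted earlier than assumed, awaiting initWARC directly would work as well; the ordering shown here simply avoids waiting on an event that has not been triggered yet.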

public async navigate(url: string): Promise<boolean> source

Navigate the browser to the URL of the page to be crawled

Params:

Name  Type    Attribute  Description
url   string

Return:

Promise<boolean>
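
The boolean result can gate the rest of the per-page work. A minimal sketch, assuming that true means the navigation succeeded (this page does not state what the boolean means); the helper name visitPage is hypothetical.

```javascript
// Hypothetical helper: visits one URL and reports whether the page was processed.
// `crawler` is assumed to be an initialized PuppeteerCrawler.
async function visitPage (crawler, url) {
  // Assumption: `true` signals that the navigation succeeded
  const navigated = await crawler.navigate(url)
  if (!navigated) return false
  // Runs the user-supplied script, or scrolls the page if none was supplied
  await crawler.runUserScript()
  return true
}
```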

public async runUserScript(): Promise<void> source

If the user supplied a script, that script is executed; if none was supplied, the page is just scrolled

Return:

Promise<void>

public async shutdown(): Promise<void> source

Stop crawling and exit

Return:

Promise<void>
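
A try/finally block is one way to make sure shutdown is called even when crawling fails part-way. The sketch below assumes that init must complete before any other call; the helper name withCrawler and the work callback are hypothetical.

```javascript
// Hypothetical helper: runs `work` between init() and shutdown().
// `crawler` is assumed to be a freshly constructed PuppeteerCrawler.
async function withCrawler (crawler, work) {
  await crawler.init()
  try {
    return await work(crawler)
  } finally {
    // Reached on success and on failure, so shutdown() is always called
    await crawler.shutdown()
  }
}
```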

public stop(): Promise<Object> source

Stop the page loading and stop capturing requests

Return:

Promise<Object> (nullable: false)

public stopCapturingNetwork() source

Stop capturing the current web page's network requests

public stopPageLoading(): Promise<Object> source

Stop the page load; equivalent to clicking the browser's refresh button while it shows an X

Return:

Promise<Object> (nullable: false)

Private Methods

private _onDisconnected() source

Callback used to emit the disconnected event
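
Because PuppeteerCrawler extends EventEmitter, callers can subscribe to this event with the standard once or on methods. A minimal sketch, assuming the event is emitted under the literal name 'disconnected'; the helper name waitForDisconnect is hypothetical.

```javascript
// Hypothetical helper: resolves the first time the crawler emits `disconnected`.
// Assumption: the event name is the literal string 'disconnected'.
function waitForDisconnect (crawler) {
  return new Promise(resolve => crawler.once('disconnected', resolve))
}
```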

private _onWARCGenError(err: Error) source

Listener for the WARC generator's error event

Params:

Name  Type   Attribute  Description
err   Error             The error to emit

private _onWARCGenFinished() source

Listener for the WARC generator's finished event