import ChromeCrawler from 'squidwarc/lib/crawler/chrome.js'
public class | source

ChromeCrawler

Extends:

EventEmitter → ChromeCrawler

Crawler based on cyrus-and/chrome-remote-interface

Static Method Summary

Static Public Methods
public static

withAutoClose(options: CrawlConfig): ChromeCrawler

Create a new ChromeCrawler instance with auto close enabled

Constructor Summary

Public Constructor
public

constructor(options: CrawlConfig)

Create a new ChromeCrawler instance.

Member Summary

Public Members
public

options: CrawlConfig

Crawl configuration options

public

requestMonitor: RequestMonitor

Handles the tracking and capturing of the HTTP requests made by the browser

Private Members
private

_autoClose: boolean

Flag indicating whether the crawler should close the browser once the process exits

private

_client: CRI

Devtools protocol client for issuing commands to the browser

private

_currentUrl: string

The current url the crawler is visiting

private

_navMan: NavigationMan

Manager for detecting network-idle, whether we have not navigated, or whether we have reached the global wait time

private

_ua: string

The UserAgent string of the remote instance we are connecting to

private

_warcGenerator: RemoteChromeWARCGenerator

WARC generator for use with cyrus-and/chrome-remote-interface

Method Summary

Public Methods
public

[Symbol.iterator](): Iterator<CapturedRequest>

Iterate over the captured network requests for the current web page

public

async genInfoMetaDataRecord(warcInfo: Object): Promise<void>

Generate the WARC Info and Metadata records

public

genWARC(warcInfo: Object): Promise<void, Error>

Alias for genWarc

public

async genWarc(warcInfo: Object): Promise<void, Error>

Generate the WARC file

public

async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error>

Retrieve the page's meta information

public

async getUserAgent(): Promise<string>

Retrieve the browser's user-agent string

public

async init(): Promise<void>

Connect to the Chrome instance the crawler will be using and setup crawler

public

initWARC(warcPath: string, appending: boolean)

Initialize the WARC writer for writing a new WARC

public

navigate(url: string)

Navigate to a new Web Page

public

shutdown()

Disconnect from the Chrome instance the crawler is currently attached to

public

stop(): Promise<void>

Stop the page loading and stop capturing requests

public

stopCapturingNetwork()

Stop capturing the current web page's network requests

public

stopPageLoading(): Promise<any>

Equivalent to clicking the browser's reload button while it is showing an X (the stop button)

Private Methods
private

_close(): *

Callback for process.on('exit')

private

_didNavigate()

Callback used for Page.navigate

private

async _initInjects(): Promise<void>

Instruct the browser to inject JavaScript into every page

private

_onWARCGenError(err: Error)

Listener for warc generator error

private

_onWARCGenFinished()

Listener for warc generator finished

private

enableAutoClose(): ChromeCrawler

Enable auto closing of the connection to the remote browser

Static Public Methods

public static withAutoClose(options: CrawlConfig): ChromeCrawler source

Create a new ChromeCrawler instance with auto close enabled

Params:

Name     Type         Attribute  Description
options  CrawlConfig             The crawl config for this crawl

Return:

ChromeCrawler

Public Constructors

public constructor(options: CrawlConfig) source

Create a new ChromeCrawler instance. For a description of the expected options, see the CrawlConfig JSDoc typedef.

Params:

Name     Type         Attribute  Description
options  CrawlConfig             The crawl config for this crawl
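
As a rough sketch of how the two construction paths relate, the helper below picks between the constructor and `withAutoClose`. The helper name `makeCrawler` and its `autoClose` flag are hypothetical, and the class is passed in as an argument so the snippet does not assume a particular module setup; in practice it would be the `ChromeCrawler` imported from `squidwarc/lib/crawler/chrome.js`. The `options` value is passed through untouched, since its shape is defined by the CrawlConfig typedef.

```javascript
// Hypothetical helper: build a crawler with or without auto close.
// `CrawlerClass` is expected to be ChromeCrawler (or a compatible class).
function makeCrawler (CrawlerClass, options, autoClose = true) {
  return autoClose
    ? CrawlerClass.withAutoClose(options) // static factory documented above
    : new CrawlerClass(options) // plain constructor, no auto close
}
```

With auto close disabled, the caller is responsible for disconnecting from the browser via `shutdown()`.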

Public Members

public options: CrawlConfig source

Crawl configuration options

public requestMonitor: RequestMonitor source

Handles the tracking and capturing of the HTTP requests made by the browser

Private Members

private _autoClose: boolean source

Flag indicating whether the crawler should close the browser once the process exits

private _client: CRI source

Devtools protocol client for issuing commands to the browser

private _currentUrl: string source

The current url the crawler is visiting

private _navMan: NavigationMan source

Manager for detecting network-idle, whether we have not navigated, or whether we have reached the global wait time

private _ua: string source

The UserAgent string of the remote instance we are connecting to

private _warcGenerator: RemoteChromeWARCGenerator source

WARC generator for use with cyrus-and/chrome-remote-interface

Public Methods

public [Symbol.iterator](): Iterator<CapturedRequest> source

Iterate over the captured network requests for the current web page

Return:

Iterator<CapturedRequest>
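
Because the crawler exposes `[Symbol.iterator]`, it can be used directly in a `for...of` loop. The following is a minimal sketch; the function name `collectCapturedRequests` is hypothetical, and the fields of a `CapturedRequest` are not described here, so the entries are collected without inspecting them.

```javascript
// Collect the captured requests for the current page into an array.
// `crawler` is assumed to be a ChromeCrawler that has already visited a page;
// any object implementing [Symbol.iterator] behaves the same way.
function collectCapturedRequests (crawler) {
  const captured = []
  for (const request of crawler) {
    captured.push(request)
  }
  return captured
}
```

The same result can be obtained with `Array.from(crawler)` or the spread syntax `[...crawler]`.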

public async genInfoMetaDataRecord(warcInfo: Object): Promise<void> source

Generate the WARC Info and Metadata records

Params:

Name      Type    Attribute        Description
warcInfo  Object  nullable: false  WARC record information

Return:

Promise<void> (nullable: false)

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

public genWARC(warcInfo: Object): Promise<void, Error> source

Alias for genWarc

Params:

Name      Type    Attribute        Description
warcInfo  Object  nullable: false  WARC record information

Return:

Promise<void, Error>

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

public async genWarc(warcInfo: Object): Promise<void, Error> source

Generate the WARC file

Params:

Name      Type    Attribute        Description
warcInfo  Object  nullable: false  WARC record information

Return:

Promise<void, Error>

Return Properties:

Name      Type    Attribute        Description
outlinks  string  nullable: false  Pre-formatted string containing the page's outlinks, to be used by the WARC metadata record
info      Object  nullable: true   Information for the WARC info record

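public async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error> source

Retrieve the page's meta information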

Return:

Promise<{outlinks: string, links: string[], location: string}, Error>
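
One way the resolved value might be used is to pick the next URLs for a same-site crawl. The sketch below relies only on the `links` and `location` fields listed in the return type; the function name `sameOriginLinks` is hypothetical, and resolving each link against `location` is an assumption made so that both absolute and relative links are handled.

```javascript
// Return the links that share an origin with the page they were found on.
// `crawler` is assumed to be a connected ChromeCrawler that has loaded a page.
async function sameOriginLinks (crawler) {
  const { links, location } = await crawler.getOutLinks()
  const origin = new URL(location).origin
  return links.filter(link => {
    try {
      // Resolve relative links against the page location before comparing
      return new URL(link, location).origin === origin
    } catch (err) {
      return false // skip anything that is not a parseable URL
    }
  })
}
```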

public async getUserAgent(): Promise<string> source

Retrieve the browser's user-agent string

Return:

Promise<string>

public async init(): Promise<void> source

Connect to the Chrome instance the crawler will be using and setup crawler

Return:

Promise<void>

Emit:

connected

when the required setup is done
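
Since ChromeCrawler extends EventEmitter, the `connected` event can be observed with the usual `once` listener. The following sketch wraps that in a promise; the function name `connectCrawler` is hypothetical, and whether `connected` fires before or after the promise returned by `init()` settles is not stated here, so the listener is registered first to cover both cases.

```javascript
// Resolve with the crawler once it reports that setup is complete.
// `crawler` is assumed to be a ChromeCrawler that has not been initialized yet.
function connectCrawler (crawler) {
  return new Promise((resolve, reject) => {
    // Register the listener before calling init() so the event is not missed
    crawler.once('connected', () => resolve(crawler))
    crawler.init().catch(reject)
  })
}
```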

public initWARC(warcPath: string, appending: boolean) source

Initialize the WARC writer for writing a new WARC

Params:

Name       Type     Attribute                 Description
warcPath   string                             The path to the new WARC
appending  boolean  optional, default: false  Append to an already existing WARC file

public navigate(url: string) source

Navigate to a new Web Page

Params:

Name  Type    Attribute  Description
url   string             The URL to navigate the browser to

public shutdown() source

Disconnect from the Chrome instance the crawler is currently attached to
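
When the crawler is created without auto close, a `try`/`finally` block is one way to make sure `shutdown()` always runs. This is a sketch only: the function name `withCrawler` and the `task` callback are hypothetical, and it assumes that awaiting `init()` is enough for the crawler to be usable.

```javascript
// Run `task` against a connected crawler and always disconnect afterwards.
// `crawler` is assumed to be a ChromeCrawler created with the plain constructor.
async function withCrawler (crawler, task) {
  await crawler.init()
  try {
    return await task(crawler)
  } finally {
    // shutdown() has no documented return value; awaiting it is harmless
    await crawler.shutdown()
  }
}
```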

public stop(): Promise<void> source

Stop the page loading and stop capturing requests

Return:

Promise<void>
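
Taken together, the public methods suggest a per-page flow along the following lines. This is a sketch under several assumptions rather than the library's own runner: the function name `capturePage` and the `waitForPage` callback are hypothetical (the caller decides when the page counts as loaded), the order of the calls is a guess, and passing `outlinks` as a field of `warcInfo` is inferred from the property tables listed for `genWARC`.

```javascript
// Capture a single page into a new WARC file and return its links.
// `crawler` is assumed to be a connected ChromeCrawler.
async function capturePage (crawler, url, warcPath, waitForPage) {
  crawler.initWARC(warcPath, false) // new WARC, not appending
  await crawler.navigate(url) // no documented return value; await is harmless
  await waitForPage(crawler) // hypothetical: caller-supplied readiness check
  await crawler.stop() // stop loading and stop capturing requests
  const { outlinks, links } = await crawler.getOutLinks()
  // Assumption: warcInfo carries the pre-formatted outlinks string
  await crawler.genWARC({ outlinks })
  return links
}
```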

public stopCapturingNetwork() source

Stop capturing the current web page's network requests

public stopPageLoading(): Promise<any> source

Equivalent to clicking the browser's reload button while it is showing an X (the stop button)

Return:

Promise<any>

Private Methods

private _close(): * source

Callback for process.on('exit')

Return:

*

private _didNavigate() source

Callback used for Page.navigate

private async _initInjects(): Promise<void> source

Instruct the browser to inject JavaScript into every page

Return:

Promise<void>

private _onWARCGenError(err: Error) source

Listener for warc generator error

Params:

Name  Type   Attribute  Description
err   Error

private _onWARCGenFinished() source

Listener for warc generator finished

private enableAutoClose(): ChromeCrawler source

Enable auto closing of the connection to the remote browser

Return:

ChromeCrawler