ChromeCrawler
Extends:
Crawler based on cyrus-and/chrome-remote-interface
Static Method Summary
Static Public Methods | ||
public static |
withAutoClose(options: CrawlConfig): ChromeCrawler Create a new ChromeCrawler instance with auto close enabled |
Constructor Summary
Public Constructor | ||
public |
constructor(options: CrawlConfig) Create a new ChromeCrawler instance. |
Member Summary
Public Members | ||
public |
Crawl configuration options |
|
public |
requestMonitor: RequestMonitor Handles the tracking and capturing of the HTTP requests made by the browser |
Private Members | ||
private |
_autoClose: boolean Flag indicating whether the crawler should close the browser when the process exits |
|
private |
Devtools protocol client for issuing commands to the browser |
|
private |
The current url the crawler is visiting |
|
private |
_navMan: NavigationMan Manager for detecting network idle, the absence of a navigation, or the global wait time being reached |
|
private |
The UserAgent string of the remote instance we are connecting to |
|
private |
_warcGenerator: RemoteChromeWARCGenerator WARC generator for use with cyrus-and/chrome-remote-interface |
Method Summary
Public Methods | ||
public |
[Symbol.iterator](): Iterator<CapturedRequest> Iterate over the captured network requests for the current web page |
|
public |
async genInfoMetaDataRecord(warcInfo: Object): Promise<void> Generate the WARC Info and Metadata records |
|
public |
genWARC(warcInfo: Object): Promise<void, Error> Alias for genWarc |
|
public |
async genWarc(warcInfo: Object): Promise<void, Error> Generate the WARC file |
|
public |
async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error> Retrieve the page's meta information |
|
public |
async getUserAgent(): Promise<string> Retrieve the browser's user-agent string |
|
public |
async init(): Promise<void> Connect to the Chrome instance the crawler will be using and set up the crawler |
|
public |
initWARC(warcPath: string, appending: boolean) Initialize the WARC writer for writing a new WARC |
|
public |
navigate(url: string) Navigate to a new web page |
|
public |
shutdown() Disconnect from the Chrome instance currently attached to |
|
public |
Stop the page loading and stop capturing requests |
|
public |
Stop capturing the current web page's network requests |
|
public |
stopPageLoading(): Promise<any> Equivalent to clicking the browser's reload button while it shows an X (stop loading)
Private Methods | ||
private |
_close(): * Callback for process.on('exit') |
|
private |
Callback used for Page.navigate |
|
private |
async _initInjects(): Promise<void> Instruct the browser to inject JavaScript into every page |
|
private |
_onWARCGenError(err: Error) Listener for warc generator error |
|
private |
Listener for warc generator finished |
|
private |
enableAutoClose(): ChromeCrawler Enable auto closing of the connection to the remote browser
Static Public Methods
public static withAutoClose(options: CrawlConfig): ChromeCrawler source
Create a new ChromeCrawler instance with auto close enabled
Params:
Name | Type | Attribute | Description |
options | CrawlConfig | The crawl config for this crawl |
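The factory can be pictured as "construct, then enable auto close". The sketch below is a stand-in and not the library's code: `SketchCrawler` and its `closed` field are hypothetical, and only the names `withAutoClose`, `enableAutoClose`, `_autoClose`, and `_close` come from this page. The empty config object is a placeholder for whatever `CrawlConfig` expects.

```javascript
// Hypothetical stand-in for ChromeCrawler; it models only the documented
// pieces (constructor, enableAutoClose, withAutoClose, _close).
class SketchCrawler {
  constructor (options) {
    this.options = options
    this._autoClose = false
    // Bind once so the same function reference is registered with process.
    this._close = this._close.bind(this)
  }

  // Register the exit callback a single time and return `this` for chaining.
  enableAutoClose () {
    if (!this._autoClose) {
      process.on('exit', this._close)
      this._autoClose = true
    }
    return this
  }

  // Static factory: construct an instance, then switch auto close on.
  static withAutoClose (options) {
    return new SketchCrawler(options).enableAutoClose()
  }

  // Callback for process.on('exit'); a real crawler would disconnect here.
  _close () {
    this.closed = true
  }
}

// Placeholder config: the real CrawlConfig fields are not shown on this page.
const autoCrawler = SketchCrawler.withAutoClose({})
```

Node.js runs `'exit'` listeners synchronously, so cleanup placed in such a callback cannot rely on pending asynchronous work completing.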
Public Constructors
public constructor(options: CrawlConfig) source
Create a new ChromeCrawler instance. For a description of the expected options, see the CrawlConfig typedef.
Params:
Name | Type | Attribute | Description |
options | CrawlConfig | The crawl config for this crawl |
Public Members
public requestMonitor: RequestMonitor source
Handles the tracking and capturing of the HTTP requests made by the browser
Private Members
private _autoClose: boolean source
Flag indicating whether the crawler should close the browser when the process exits
private _navMan: NavigationMan source
Manager for detecting network idle, the absence of a navigation, or the global wait time being reached
private _warcGenerator: RemoteChromeWARCGenerator source
WARC generator for use with cyrus-and/chrome-remote-interface
Public Methods
public [Symbol.iterator](): Iterator<CapturedRequest> source
Iterate over the captured network requests for the current web page
Return:
Iterator<CapturedRequest> |
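Because the class defines `[Symbol.iterator]`, an instance can be used directly in `for...of`. The following is a minimal sketch of that protocol using a hypothetical `SketchRequestStore`; the `add` method and the `url` field are assumptions made for illustration, not the shape of `CapturedRequest`.

```javascript
// Hypothetical stand-in that stores captured requests in a Map.
class SketchRequestStore {
  constructor () {
    this._requests = new Map()
  }

  // Assumed helper for the sketch; not part of the documented API.
  add (requestId, captured) {
    this._requests.set(requestId, captured)
  }

  // Returning the Map's value iterator makes the store usable in for...of.
  [Symbol.iterator] () {
    return this._requests.values()
  }
}

const store = new SketchRequestStore()
store.add('1', { url: 'https://example.com/' })
store.add('2', { url: 'https://example.com/style.css' })

// Iterate in insertion order and collect the URLs.
const capturedUrls = []
for (const request of store) {
  capturedUrls.push(request.url)
}
```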
public async genInfoMetaDataRecord(warcInfo: Object): Promise<void> source
Generate the WARC Info and Metadata records
Params:
Name | Type | Attribute | Description |
warcInfo | Object | WARC record information |
public genWARC(warcInfo: Object): Promise<void, Error> source
Alias for genWarc
Params:
Name | Type | Attribute | Description |
warcInfo | Object | WARC record information |
public async genWarc(warcInfo: Object): Promise<void, Error> source
Generate the WARC file
Params:
Name | Type | Attribute | Description |
warcInfo | Object | WARC record information |
public async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error> source
Retrieve the page's meta information
public async init(): Promise<void> source
Connect to the Chrome instance the crawler will be using and set up the crawler
Emit:
connected | when the required setup is done |
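A caller would typically subscribe to `connected` before calling `init()`. The sketch below assumes the crawler behaves like a Node.js `EventEmitter`, which this page implies but does not state; `SketchEventCrawler` is a hypothetical stand-in whose `init()` does no real connecting.

```javascript
const EventEmitter = require('events')

// Hypothetical stand-in; the real init() connects to Chrome before emitting.
class SketchEventCrawler extends EventEmitter {
  async init () {
    // A real crawler would await the DevTools connection and setup here.
    this.emit('connected')
  }
}

const eventCrawler = new SketchEventCrawler()
const seenEvents = []

// Register the listener before calling init() so the event is not missed.
eventCrawler.once('connected', () => seenEvents.push('connected'))
eventCrawler.init()
```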
public initWARC(warcPath: string, appending: boolean) source
Initialize the WARC writer for writing a new WARC
public navigate(url: string) source
Navigate to a new Web Page
Params:
Name | Type | Attribute | Description |
url | string | The url to navigate the browser to |
public stopPageLoading(): Promise<any> source
Equivalent to clicking the browser's reload button while it shows an X (stop loading)
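In DevTools protocol terms, stopping a page load corresponds to the `Page.stopLoading` command. The sketch below assumes, without confirmation from this page, that the method forwards to that command on a chrome-remote-interface style client; the standalone `stopPageLoading(client)` helper and `fakeStopClient` are hypothetical and exist only so the example runs without a browser.

```javascript
// Hypothetical helper; `client` is assumed to expose the Page domain the way
// a chrome-remote-interface client does.
function stopPageLoading (client) {
  return client.Page.stopLoading()
}

// Minimal fake client that records which protocol command was issued.
const stopCalls = []
const fakeStopClient = {
  Page: {
    stopLoading: () => {
      stopCalls.push('Page.stopLoading')
      return Promise.resolve({})
    }
  }
}

stopPageLoading(fakeStopClient)
```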
Private Methods
private async _initInjects(): Promise<void> source
Instruct the browser to inject JavaScript into every page
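One way to inject a script into every page over the DevTools protocol is `Page.addScriptToEvaluateOnNewDocument`, which takes a `source` string. Whether `_initInjects` uses this command is an assumption; the `initInjects(client, scripts)` helper, the fake client, and the injected snippet below are all hypothetical.

```javascript
// Hypothetical helper: register each script source so the browser evaluates
// it in every new document before the page's own scripts run.
async function initInjects (client, scripts) {
  await Promise.all(
    scripts.map(source => client.Page.addScriptToEvaluateOnNewDocument({ source }))
  )
}

// Minimal fake client that records the sources it was asked to inject.
const injectedSources = []
const fakeInjectClient = {
  Page: {
    addScriptToEvaluateOnNewDocument: ({ source }) => {
      injectedSources.push(source)
      return Promise.resolve({ identifier: String(injectedSources.length) })
    }
  }
}

// Illustrative snippet only; the scripts the crawler injects are not listed here.
initInjects(fakeInjectClient, ['window.__sketchFlag = true'])
```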
private _onWARCGenError(err: Error) source
Listener for warc generator error
Params:
Name | Type | Attribute | Description |
err | Error |
private enableAutoClose(): ChromeCrawler source
Enable auto closing of the connection to the remote browser