PuppeteerCrawler
Extends:
Crawler based on puppeteer
Constructor Summary
Public Constructor | ||
public |
constructor(options: CrawlConfig) Create a new PuppeteerCrawler instance. |
Member Summary
Public Members | ||
public |
defaultWait: {waitUntil: string, timeout: number} Default wait time for page.goto |
|
public |
Crawl configuration options |
|
public |
requestCapturer: PuppeteerCDPRequestCapturer |
Private Members | ||
private |
_browser: Browser An instance of puppeteer Browser |
|
private |
_client: CDPSession An instance of puppeteer CDPSession used to |
|
private |
The current url the crawler is visiting |
|
private |
An instance of puppeteer Page |
|
private |
The UserAgent string of the browser |
|
private |
_warcGenerator: PuppeteerCDPWARCGenerator |
Method Summary
Public Methods | ||
public |
[Symbol.iterator](): Iterator<CapturedRequest> Iterate over the captured network requests for the current web page |
|
public |
async genInfoMetaDataRecord(warcInfo: Object): Promise<void> Generate the WARC Info and Metadata records |
|
public |
genWARC(warcInfo: Object): Promise<void, Error> Alias for genWarc |
|
public |
async genWarc(warcInfo: Object): Promise<void, Error> Generate the WARC file |
|
public |
async getOutLinks(): Promise<{outlinks: string, links: Array<{href: string, pathname: string, host: string}>, location: string}, Error> Retrieve the page's meta information |
|
public |
async getUserAgent(): Promise<string> Retrieve the browser's user-agent string |
|
public |
async init() Set up the crawler |
|
public |
initWARC(warcPath: string, appending: boolean): Promise<void> Initialize the WARC writer for writing a new WARC |
|
public |
async navigate(url: string): Promise<boolean> Navigate the browser to the URL of the page to be crawled |
|
public |
async runUserScript(): Promise<void> If the user supplied a script, that script is executed; if none was supplied, the page is simply scrolled |
|
public |
Stop crawling and exit |
|
public |
Stop the page loading and stop capturing requests |
|
public |
Stop capturing the current web page's network requests |
|
public |
stopPageLoading(): Promise<Object> Equivalent to clicking the browser's refresh button while it shows an X (the stop button) |
Private Methods | ||
private |
Callback used to emit the disconnected event |
|
private |
_onWARCGenError(err: Error) Listener for the WARC generator's error event |
|
private |
Listener for the WARC generator's finished event |
Public Constructors
public constructor(options: CrawlConfig) source
Create a new PuppeteerCrawler instance. For a description of the expected options, see the CrawlConfig typedef.
Params:
Name | Type | Attribute | Description |
options | CrawlConfig | | The crawl config for this crawl |
Public Members
public requestCapturer: PuppeteerCDPRequestCapturer source
Private Members
private _warcGenerator: PuppeteerCDPWARCGenerator source
Public Methods
public [Symbol.iterator](): Iterator<CapturedRequest> source
Iterate over the captured network requests for the current web page
Return:
Iterator<CapturedRequest> |
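Because the class defines [Symbol.iterator], an instance can be used directly in a for...of loop or with the spread operator. The sketch below illustrates only the iteration protocol with a hypothetical stand-in class; FakeCrawler, addRequest, the Map-based storage, and the fields on each request object are assumptions for illustration, not the actual PuppeteerCrawler or CapturedRequest internals.

```javascript
// Hypothetical stand-in for a crawler that records network requests.
// How the real crawler stores requests is not shown in this page; the
// Map keyed by request id below is an assumption.
class FakeCrawler {
  constructor () {
    this._requests = new Map()
  }

  // Illustrative helper: record a captured request under its id.
  addRequest (id, info) {
    this._requests.set(id, info)
  }

  // Defining Symbol.iterator makes instances usable in for...of loops.
  [Symbol.iterator] () {
    return this._requests.values()
  }
}

const crawler = new FakeCrawler()
crawler.addRequest('1', { url: 'https://example.com/', method: 'GET' })
crawler.addRequest('2', { url: 'https://example.com/app.js', method: 'GET' })

// Collect the URL of every captured request.
const seenUrls = []
for (const request of crawler) {
  seenUrls.push(request.url)
}
console.log(seenUrls)
```

Each call to [Symbol.iterator] in this sketch returns a fresh iterator, so the captured requests can be walked more than once.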
public async genInfoMetaDataRecord(warcInfo: Object): Promise<void> source
Generate the WARC Info and Metadata records
Params:
Name | Type | Attribute | Description |
warcInfo | Object | | WARC record information |
public genWARC(warcInfo: Object): Promise<void, Error> source
Alias for genWarc
Params:
Name | Type | Attribute | Description |
warcInfo | Object | | WARC record information |
public async genWarc(warcInfo: Object): Promise<void, Error> source
Generate the WARC file
Params:
Name | Type | Attribute | Description |
warcInfo | Object | | WARC record information |
public async getOutLinks(): Promise<{outlinks: string, links: Array<{href: string, pathname: string, host: string}>, location: string}, Error> source
Retrieve the page's meta information
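The resolved value has three documented fields: outlinks, links, and location. As a rough sketch of how a caller might consume it, the snippet below filters the links array down to unique URLs on the same host as the page. The sample data and the sameHostLinks helper are invented for illustration; in particular, the real format of the outlinks string is not described here, so a placeholder is used.

```javascript
// Sample object shaped like the documented resolved value of getOutLinks().
// All values are made up; the outlinks string is only a placeholder.
const sampleResult = {
  outlinks: 'placeholder: real outlinks string format not shown here',
  links: [
    { href: 'https://example.com/about', pathname: '/about', host: 'example.com' },
    { href: 'https://example.com/about', pathname: '/about', host: 'example.com' },
    { href: 'https://other.example.org/', pathname: '/', host: 'other.example.org' }
  ],
  location: 'https://example.com/'
}

// Hypothetical helper: keep unique hrefs whose host matches the page's host.
function sameHostLinks (result) {
  const pageHost = new URL(result.location).host
  const unique = new Set()
  for (const link of result.links) {
    if (link.host === pageHost) unique.add(link.href)
  }
  return Array.from(unique)
}

const frontier = sameHostLinks(sampleResult)
console.log(frontier)
```

A filter like this is one way to decide which discovered links to visit next in a same-site crawl.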
public initWARC(warcPath: string, appending: boolean): Promise<void> source
Initialize the WARC writer for writing a new WARC
public async navigate(url: string): Promise<boolean> source
Navigate the browser to the URL of the page to be crawled
Params:
Name | Type | Attribute | Description |
url | string |
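The following is a minimal sketch of how a navigate-style method could combine Puppeteer's page.goto(url, options) with a defaultWait object and report the outcome as a boolean. The concrete defaultWait values, the choice to return false when navigation fails, and the fakePage object standing in for a Puppeteer Page are all assumptions; the actual implementation may differ.

```javascript
// Assumed values; this page documents only the shape
// {waitUntil: string, timeout: number}.
const defaultWait = { waitUntil: 'networkidle2', timeout: 60000 }

// Sketch: resolve true when navigation succeeds, false when it fails.
async function navigate (page, url, wait = defaultWait) {
  try {
    await page.goto(url, wait)
    return true
  } catch (error) {
    return false
  }
}

// Stand-in for a Puppeteer Page so the sketch runs without a browser.
const fakePage = {
  async goto (url, options) {
    if (!url.startsWith('http')) {
      throw new Error(`cannot navigate to ${url}`)
    }
    return { url, options }
  }
}

const navigationResults = Promise.all([
  navigate(fakePage, 'https://example.com/'),
  navigate(fakePage, 'not a url')
])
navigationResults.then(results => console.log(results))
```

Returning a boolean rather than throwing lets the caller skip generating a WARC for pages that never loaded.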
public async runUserScript(): Promise<void> source
If the user supplied a script, that script is executed; if none was supplied, the page is simply scrolled
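The fallback behavior described above can be sketched as follows. This assumes the user script is available as a function and that it is run in the page via Puppeteer's page.evaluate; how the real crawler loads and represents the user script is not documented here. scrollPage, clickLoadMore, and makeFakePage are hypothetical names, and the fake page only records which function it was asked to evaluate.

```javascript
// Default action when no user script is supplied (would run in the browser).
function scrollPage () {
  window.scrollTo(0, document.body.scrollHeight)
}

// Sketch: run the user's script if there is one, otherwise scroll.
async function runUserScript (page, userScript) {
  const script = typeof userScript === 'function' ? userScript : scrollPage
  await page.evaluate(script)
}

// Stand-in for a Puppeteer Page: records function names instead of
// executing them, so no browser globals are needed.
function makeFakePage () {
  const evaluated = []
  return {
    evaluated,
    async evaluate (fn) {
      evaluated.push(fn.name)
    }
  }
}

// Example user script (hypothetical).
function clickLoadMore () {
  document.querySelector('#load-more').click()
}

const withScript = makeFakePage()
const withoutScript = makeFakePage()
const userScriptRuns = Promise.all([
  runUserScript(withScript, clickLoadMore),
  runUserScript(withoutScript, undefined)
]).then(() => [withScript.evaluated, withoutScript.evaluated])

userScriptRuns.then(runs => console.log(runs))
```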
public stopPageLoading(): Promise<Object> source
Equivalent to clicking the browser's refresh button while it shows an X (the stop button)