ChromeCrawler
Extends:
Crawler based on cyrus-and/chrome-remote-interface
Static Method Summary
| Static Public Methods | ||
| public static | 
       withAutoClose(options: CrawlConfig): ChromeCrawler Create a new ChromeCrawler instance with auto close enabled  | 
    |
Constructor Summary
| Public Constructor | ||
| public | 
       constructor(options: CrawlConfig) Create a new ChromeCrawler instance.  | 
    |
Member Summary
| Public Members | ||
| public | 
      
       Crawl configuration options  | 
    |
| public | 
       requestMonitor: RequestMonitor Handles the tracking and capturing of the HTTP requests made by the browser  | 
    |
| Private Members | ||
| private | 
      
       _autoClose: boolean Flag indicating whether the crawler should close the browser once the process exits  | 
    |
| private | 
      
       Devtools protocol client for issuing commands to the browser  | 
    |
| private | 
      
       The current url the crawler is visiting  | 
    |
| private | 
      
       _navMan: NavigationMan Manager for detecting network idle, a failure to navigate, or that the global wait time has been reached  | 
    |
| private | 
      
       The UserAgent string of the remote instance we are connecting to  | 
    |
| private | 
       _warcGenerator: RemoteChromeWARCGenerator WARC generator for use with cyrus-and/chrome-remote-interface  | 
    |
Method Summary
| Public Methods | ||
| public | 
       [Symbol.iterator](): Iterator<CapturedRequest> Iterate over the captured network requests for the current web page  | 
    |
| public | 
       async genInfoMetaDataRecord(warcInfo: Object): Promise<void> Generate the WARC Info and Metadata records  | 
    |
| public | 
      
       genWARC(warcInfo: Object): Promise<void, Error> Alias for genWarc  | 
    |
| public | 
      
       async genWarc(warcInfo: Object): Promise<void, Error> Generate the WARC file  | 
    |
| public | 
       async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error> Retrieve the page's meta information  | 
    |
| public | 
       async getUserAgent(): Promise<string> Retrieve the browser's user-agent string  | 
    |
| public | 
      
       async init(): Promise<void> Connect to the Chrome instance the crawler will be using and set up the crawler  | 
    |
| public | 
      
       initWARC(warcPath: string, appending: boolean) Initialize the WARC writer for writing a new WARC  | 
    |
| public | 
      
       navigate(url: string) Navigate to a new Web Page  | 
    |
| public | 
       shutdown() Disconnect from the Chrome instance the crawler is currently attached to  | 
    |
| public | 
      
       Stop the page loading and stop capturing requests  | 
    |
| public | 
      
       Stop capturing the current web page's network requests  | 
    |
| public | 
       stopPageLoading(): Promise<any> Stop the page from loading; equivalent to clicking the browser's refresh button while it is showing an X  | 
    |
| Private Methods | ||
| private | 
       _close(): * Callback for process.on('exit')  | 
    |
| private | 
      
       Callback used for Page.navigate  | 
    |
| private | 
       async _initInjects(): Promise<void> Instruct the browser to inject JavaScript into every page  | 
    |
| private | 
       _onWARCGenError(err: Error) Listener for warc generator error  | 
    |
| private | 
      
       Listener for warc generator finished  | 
    |
| private | 
      
       enableAutoClose(): ChromeCrawler Enable auto closing of the connection to the remote browser  | 
    |
Static Public Methods
public static withAutoClose(options: CrawlConfig): ChromeCrawler source
Create a new ChromeCrawler instance with auto close enabled
Params:
| Name | Type | Attribute | Description | 
| options | CrawlConfig | | The crawl config for this crawl | 
    
Public Constructors
public constructor(options: CrawlConfig) source
Create a new ChromeCrawler instance. For a description of the expected options, see the CrawlConfig typedef.
Params:
| Name | Type | Attribute | Description | 
| options | CrawlConfig | | The crawl config for this crawl | 
    
Public Members
public requestMonitor: RequestMonitor source
Handles the tracking and capturing of the HTTP requests made by the browser
Private Members
private _autoClose: boolean source
Flag indicating whether the crawler should close the browser once the process exits
private _navMan: NavigationMan source
Manager for detecting network idle, a failure to navigate, or that the global wait time has been reached
private _warcGenerator: RemoteChromeWARCGenerator source
WARC generator for use with cyrus-and/chrome-remote-interface
Public Methods
public [Symbol.iterator](): Iterator<CapturedRequest> source
Iterate over the captured network requests for the current web page
Return:
| Iterator<CapturedRequest> | 
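Because the crawler exposes `[Symbol.iterator]`, it can be consumed with `for...of` or the spread operator. The following is a minimal sketch of that pattern only. `IterableCrawlerSketch`, its `capture` helper, and the `{ requestId, url }` record shape are hypothetical stand-ins, not the real ChromeCrawler, RequestMonitor, or CapturedRequest.

```javascript
// Hypothetical stand-in that mimics only the iteration behaviour described above.
class IterableCrawlerSketch {
  constructor () {
    // Assumption: captured requests are kept in insertion order, keyed by request id.
    this.requestMonitor = new Map()
  }

  // Hypothetical helper used to populate the sketch; not part of the documented API.
  capture (requestId, url) {
    this.requestMonitor.set(requestId, { requestId, url })
  }

  // Delegate iteration to the captured requests.
  [Symbol.iterator] () {
    return this.requestMonitor.values()
  }
}

const iterableCrawler = new IterableCrawlerSketch()
iterableCrawler.capture('1', 'https://example.com/')
iterableCrawler.capture('2', 'https://example.com/style.css')

// Walk the captured requests exactly as one would walk an array.
const urls = []
for (const request of iterableCrawler) {
  urls.push(request.url)
}
```

With the real class, the loop body would receive CapturedRequest instances for the page currently loaded in the browser.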
public async genInfoMetaDataRecord(warcInfo: Object): Promise<void> source
Generate the WARC Info and Metadata records
Params:
| Name | Type | Attribute | Description | 
| warcInfo | Object | | WARC record information | 
    
public genWARC(warcInfo: Object): Promise<void, Error> source
Alias for genWarc
Params:
| Name | Type | Attribute | Description | 
| warcInfo | Object | | WARC record information | 
    
public async genWarc(warcInfo: Object): Promise<void, Error> source
Generate the WARC file
Params:
| Name | Type | Attribute | Description | 
| warcInfo | Object | | WARC record information | 
    
public async getOutLinks(): Promise<{outlinks: string, links: string[], location: string}, Error> source
Retrieve the page's meta information
public async init(): Promise<void> source
Connect to the Chrome instance the crawler will be using and set up the crawler
Emit:
connected  | 
        when the required setup is done  | 
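Since `init()` is asynchronous and signals completion through the `connected` event, callers would typically register a listener before calling it. The sketch below illustrates that ordering with a hypothetical stand-in class (`ConnectSketch`) built on Node's `events` module. The `ready` flag and the `startCrawl` helper are assumptions made for illustration, and the real `init()` talks to Chrome rather than simply yielding.

```javascript
const { EventEmitter } = require('events')

// Hypothetical stand-in: emits 'connected' once its (fake) setup has finished.
class ConnectSketch extends EventEmitter {
  constructor () {
    super()
    this.ready = false
  }

  async init () {
    // The real init() would connect to the Chrome instance here.
    await Promise.resolve()
    this.ready = true
    this.emit('connected')
  }
}

// Hypothetical helper: subscribe first, then start the setup,
// so the 'connected' event cannot be missed.
function startCrawl (crawler, onConnected) {
  crawler.once('connected', () => onConnected(crawler))
  return crawler.init()
}
```

A caller could then run `startCrawl(crawler, c => c.navigate(url))`, assuming `navigate` is only safe to call after `connected` has fired.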
      
public initWARC(warcPath: string, appending: boolean) source
Initialize the WARC writer for writing a new WARC
public navigate(url: string) source
Navigate to a new Web Page
Params:
| Name | Type | Attribute | Description | 
| url | string | | The url to navigate the browser to | 
    
public stopPageLoading(): Promise<any> source
Stop the page from loading; equivalent to clicking the browser's refresh button while it is showing an X
Private Methods
private async _initInjects(): Promise<void> source
Instruct the browser to inject JavaScript into every page
private _onWARCGenError(err: Error) source
Listener for warc generator error
Params:
| Name | Type | Attribute | Description | 
| err | Error | 
private enableAutoClose(): ChromeCrawler source
Enable auto closing of the connection to the remote browser
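The relationship between `withAutoClose`, `enableAutoClose`, and the `_close` callback for `process.on('exit')` can be sketched as follows. This is an illustrative stand-in (`AutoCloseSketch`), not the library's implementation: the `closed` flag, the guard against double registration, and the empty options object are assumptions.

```javascript
// Hypothetical stand-in showing one way a static factory can enable auto close.
class AutoCloseSketch {
  constructor (options) {
    this.options = options
    this._autoClose = false
    this.closed = false // assumption: stands in for "browser connection closed"
    // Bind once so the same function reference can be added and removed as a listener.
    this._close = this._close.bind(this)
  }

  // Mirrors the documented static factory: construct, then enable auto close.
  static withAutoClose (options) {
    return new AutoCloseSketch(options).enableAutoClose()
  }

  // Returns the instance so the call can be chained, matching the documented return type.
  enableAutoClose () {
    if (!this._autoClose) {
      process.on('exit', this._close)
      this._autoClose = true
    }
    return this
  }

  // Callback for process.on('exit'); the real crawler would disconnect from Chrome here.
  _close () {
    this.closed = true
  }
}

// Placeholder options; the actual CrawlConfig fields are not shown in this page.
const autoCloseSketch = AutoCloseSketch.withAutoClose({})
```

Only synchronous work is possible inside an `exit` listener in Node.js, so any real cleanup performed there would need to be synchronous.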
  