Function
Static Public Summary
public | async chromeRunner(conf: CrawlConfig): Promise<void, Error> | Launches a crawl using the supplied crawl configuration. |
public | collect(): Promise<{outlinks: string, links: Array<string>, location: string}> | Starts the collection of the outlinks. |
public | delay(amount: number): Promise<void> | Promise wrapper around setTimeout. |
public | initCollectLinks(): void | Function that is injected into every frame of the page currently being crawled and sets up outlink collection depending on whether the frame it is injected into is the top frame or a sub frame. |
public | isEmptyPlainObject(object: Object): boolean | Tests whether a plain object is empty. |
public | async launch(options: ChromeOptions): Promise<!Puppeteer.Browser> | Launches Chrome/Chromium and connects to it, or connects to an already running instance. |
public | makeRunnable(runnable: function(...args: any): Promise): function(...args: any): void | Composes the supplied function with runPromise. |
public | noNaughtyJS() | Function that disables the setting of the window event handlers onbeforeunload and onunload and disables the usage of window.alert, window.confirm, and window.prompt. |
public | async outLinks(): Promise<{outlinks: string, links: Array<string>}> | Builds the WARC outlink metadata information and finds potential links to go to next from a page. |
public | async puppeteerRunner(conf: CrawlConfig): Promise<void, Error> | Launches a crawl using the supplied crawl configuration. |
public | runPromise(runnable: function(): Promise<any>|Promise<any>, thener: function(...args: any), catcher: function(...args: any)): void | Runs a promise using the supplied thener and catcher functions. |
public | scrollOnLoad() | Function that is injected into every frame of the page being crawled and starts scrolling the page once the load event has fired. |
public | async scrollPage(): Promise<void> | Function that scrolls the page/frame it is injected into a maximum of 20 times or until no further scrolling is possible. |
Static Public
public async chromeRunner(conf: CrawlConfig): Promise<void, Error> source
import chromeRunner from 'squidwarc/lib/runners/chromeRunner.js'
Launches a crawl using the supplied crawl configuration.
Params:
Name | Type | Attribute | Description |
conf | CrawlConfig | The crawl config for this crawl |
public collect(): Promise<{outlinks: string, links: Array<string>, location: string}> source
import {collect} from 'squidwarc/lib/injectManager/pageInjects/collectLinks.js'
Starts the collection of the outlinks. Use only when initCollectLinks has been pre-injected into every frame.
public delay(amount: number): Promise<void> source
import {delay} from 'squidwarc/lib/utils/promises.js'
Promise wrapper around setTimeout
Params:
Name | Type | Attribute | Description |
amount | number | The amount of time to delay by |
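As an illustration of the pattern described above, here is a minimal sketch of a promise wrapper around setTimeout. It is not the library's implementation; it assumes `amount` is given in milliseconds (the unit is not stated here), and `demo` is a hypothetical caller.

```javascript
// Sketch only: resolves after roughly `amount` milliseconds (unit assumed).
function delay (amount) {
  return new Promise(resolve => setTimeout(resolve, amount))
}

// Hypothetical usage: pause between two steps of an async routine
// and report how long the pause actually took.
async function demo () {
  const started = Date.now()
  await delay(50)
  return Date.now() - started
}
```

Because the wrapper returns a promise, it can be awaited inside any async function instead of nesting callbacks.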
public initCollectLinks(): void source
import {initCollectLinks} from 'squidwarc/lib/injectManager/pageInjects/collectLinks.js'
Function that is injected into every frame of the page currently being crawled and sets up outlink collection depending on whether the frame it is injected into is the top frame or a sub frame.
If this function is injected into the top frame, instances of both Collector and TopHandler are created; otherwise only an instance of Collector is created.
In the case of injection into the top frame, the $$$$Squidwarc$$Collector$$$$ property is defined on window with the created TopHandler instance as its value, and a message event listener is registered on window to receive messages sent by this script when it is injected into child frames.
Each child frame sends two messages (indicateIsChild, outlinkgot) and listens for one (outlinkcollect).
The message types are found in the object m within the body of this function.
The indicateIsChild message is sent immediately by a child frame so that TopHandler can hold onto a reference to the frame for communicating with it.
The outlinkgot message is sent by each child frame to the top frame once outlinks have been collected for that frame.
The outlinkcollect message is sent by TopHandler to each child frame to have it start collecting outlinks.
Return:
void |
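The three-message exchange described above can be simulated outside a browser. The sketch below is an assumption-laden model, not the injected script: it replaces window.postMessage with direct method calls, the payload shapes are invented for illustration, and `TopHandlerSketch` and `ChildFrameSketch` are hypothetical names. Only the message type names come from the description.

```javascript
// Message type names taken from the description; payload shapes are assumptions.
const m = {
  indicateIsChild: 'indicateIsChild',
  outlinkcollect: 'outlinkcollect',
  outlinkgot: 'outlinkgot'
}

// Hypothetical top-frame handler: remembers child frames and gathers their links.
class TopHandlerSketch {
  constructor () {
    this.children = []
    this.collected = []
  }

  // Called for every message the top frame receives from a child frame.
  onMessage (message, source) {
    if (message.type === m.indicateIsChild) {
      this.children.push(source) // keep a reference for later communication
    } else if (message.type === m.outlinkgot) {
      this.collected.push(...message.links)
    }
  }

  // Ask every known child frame to start collecting outlinks.
  startCollection () {
    for (const child of this.children) {
      child.receive({ type: m.outlinkcollect })
    }
  }
}

// Hypothetical child frame: announces itself at once, then answers outlinkcollect.
class ChildFrameSketch {
  constructor (topHandler, links) {
    this.topHandler = topHandler
    this.links = links
    topHandler.onMessage({ type: m.indicateIsChild }, this)
  }

  receive (message) {
    if (message.type === m.outlinkcollect) {
      this.topHandler.onMessage({ type: m.outlinkgot, links: this.links }, this)
    }
  }
}

const topHandler = new TopHandlerSketch()
new ChildFrameSketch(topHandler, ['https://example.com/a'])
new ChildFrameSketch(topHandler, ['https://example.com/b'])
topHandler.startCollection()
```

After `startCollection()` returns, `topHandler.collected` holds the links reported by both simulated child frames.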
public isEmptyPlainObject(object: Object): boolean source
import isEmptyPlainObject from 'squidwarc/lib/utils/isEmptyPlainObject.js'
Tests whether a plain object is empty.
Params:
Name | Type | Attribute | Description |
object | Object |
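What counts as a "plain object" is not spelled out here. The sketch below assumes it means an object whose prototype is Object.prototype or null, and that "empty" means having no own enumerable keys; `isEmptyPlainObjectSketch` is a hypothetical name and the library's handling of edge cases may differ.

```javascript
// Sketch under the stated assumptions; not the library's implementation.
function isEmptyPlainObjectSketch (object) {
  if (object === null || typeof object !== 'object') return false
  const proto = Object.getPrototypeOf(object)
  // Arrays, dates, class instances and the like are not treated as plain objects.
  if (proto !== Object.prototype && proto !== null) return false
  return Object.keys(object).length === 0
}
```

Under these assumptions `{}` and `Object.create(null)` count as empty plain objects, while `{ a: 1 }`, `[]` and `null` do not.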
public async launch(options: ChromeOptions): Promise<!Puppeteer.Browser> source
import launch from 'squidwarc/lib/launcher/puppeteer.js'
Launches Chrome/Chromium and connects to it, or connects to an already running instance.
Params:
Name | Type | Attribute | Description |
options | ChromeOptions |
public makeRunnable(runnable: function(...args: any): Promise): function(...args: any): void source
import {makeRunnable} from 'squidwarc/lib/utils/promises.js'
Composes the supplied function with runPromise.
public noNaughtyJS() source
import noNaughtyJS from 'squidwarc/lib/injectManager/pageInjects/noNaughtyJS.js'
Function that disables the setting of window event handlers onbeforeunload and onunload and disables the usage of window.alert, window.confirm, and window.prompt.
This is done to ensure that they cannot be used as crawler traps.
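One way to achieve the effect described above is sketched below. It is an illustration under assumptions, not the injected function itself: `disableNaughtyJSSketch` is a hypothetical name, and it takes a window-like object as a parameter so it can be exercised outside a browser, whereas the real function takes no arguments.

```javascript
// Sketch: neutralise dialogs and unload handlers on a window-like object.
function disableNaughtyJSSketch (win) {
  const noop = () => {}
  // Replace the blocking dialog functions with no-ops.
  for (const name of ['alert', 'confirm', 'prompt']) {
    win[name] = noop
  }
  // An accessor with a no-op setter makes later assignments silently ignored.
  for (const handler of ['onbeforeunload', 'onunload']) {
    Object.defineProperty(win, handler, {
      configurable: false,
      enumerable: true,
      get: () => null,
      set: noop
    })
  }
}

// Hypothetical stand-in for window, used only to demonstrate the effect.
const fakeWindow = {
  alert: () => 'shown',
  confirm: () => true,
  prompt: () => 'typed',
  onbeforeunload: null,
  onunload: null
}
disableNaughtyJSSketch(fakeWindow)
fakeWindow.onbeforeunload = () => 'Are you sure?' // ignored by the no-op setter
```

Marking the properties as non-configurable keeps page scripts from redefining them later with Object.defineProperty.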
public async outLinks(): Promise<{outlinks: string, links: Array<string>}> source
import {outLinks} from 'squidwarc/lib/injectManager/pageInjects/collectLinks.js'
Builds the WARC outlink metadata information and finds potential links to go to next from a page.
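To make the shape of the return value concrete, the sketch below builds an `{outlinks, links}` pair from a plain array of href strings. Several things here are assumptions rather than documented behavior: the `outlink: <url> L a/@href` line layout follows the convention commonly seen in Heritrix-style WARC metadata records, the http(s) filter for follow-up links is invented for illustration, and `buildOutlinksSketch` is a hypothetical name.

```javascript
// Sketch: `hrefs` stands in for the href values of the anchors found on a page.
function buildOutlinksSketch (hrefs) {
  const seen = new Set()
  const links = []
  let outlinks = ''
  for (const href of hrefs) {
    if (seen.has(href)) continue // record each link once
    seen.add(href)
    // Assumed metadata line format, one outlink per line.
    outlinks += `outlink: ${href} L a/@href\r\n`
    // Assumed rule: only http(s) URLs are candidates to visit next.
    if (href.startsWith('http://') || href.startsWith('https://')) {
      links.push(href)
    }
  }
  return { outlinks, links }
}
```

Keeping the metadata string and the candidate list separate lets every discovered link be recorded while only crawlable ones are queued.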
public async puppeteerRunner(conf: CrawlConfig): Promise<void, Error> source
import puppeteerRunner from 'squidwarc/lib/runners/puppeteerRunner.js'
Launches a crawl using the supplied crawl configuration.
Params:
Name | Type | Attribute | Description |
conf | CrawlConfig | The crawl config for this crawl |
public runPromise(runnable: function(): Promise<any>|Promise<any>, thener: function(...args: any), catcher: function(...args: any)): void source
import runPromise from 'squidwarc/lib/runPromise.js'
Runs a promise using the supplied thener and catcher functions
Params:
Name | Type | Attribute | Description |
runnable | function(): Promise<any>|Promise<any> | The promise or async / promise returning function to run |
thener | function(...args: any) | The callback function to be supplied to Promise.then |
catcher | function(...args: any) | The callback function to be supplied to Promise.catch |
Return:
void |
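The behavior described by the parameters above can be sketched as follows. This is an illustration rather than the library's code: `runPromiseSketch` and `makeRunnableSketch` are hypothetical names, and the default handlers used in the composed wrapper are assumptions, since makeRunnable's own defaults are not documented here.

```javascript
// Sketch: accepts either a promise or a function returning one; returns nothing.
function runPromiseSketch (runnable, thener, catcher) {
  // Promise.resolve().then(fn) also turns a synchronous throw into a rejection.
  const promise = typeof runnable === 'function'
    ? Promise.resolve().then(() => runnable())
    : runnable
  promise.then(thener).catch(catcher)
}

// Hypothetical composition in the spirit of makeRunnable: the returned
// function forwards its arguments and returns void. Defaults are assumed.
function makeRunnableSketch (runnable, thener = () => {}, catcher = console.error) {
  return (...args) => {
    runPromiseSketch(() => runnable(...args), thener, catcher)
  }
}
```

Chaining `.catch` after `.then` means the catcher also receives errors thrown inside the thener, which suits a top-level entry point where nothing else would handle them.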
public scrollOnLoad() source
import {scrollOnLoad} from 'squidwarc/lib/injectManager/pageInjects/scroll.js'
Function that is injected into every frame of the page being crawled and starts scrolling the page once the load event has fired. The page is scrolled a maximum of 20 times or until no further scrolling is possible.
public async scrollPage(): Promise<void> source
import {scrollPage} from 'squidwarc/lib/injectManager/pageInjects/scroll.js'
Function that scrolls the page/frame it is injected into a maximum of 20 times or until no further scrolling is possible.
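The stopping rule described above (at most 20 scrolls, or stop when the position no longer changes) can be sketched as below. This is not the injected function: `scrollPageSketch` takes a hypothetical `view` object exposing `scrollY`, `innerHeight` and `scrollBy` (the same names a browser window provides) so the loop can run outside a browser, and the pause between scrolls is an assumed value.

```javascript
// Sketch: scroll one viewport at a time until the limit or the bottom is hit.
async function scrollPageSketch (view, maxScrolls = 20, pauseMs = 10) {
  let scrolls = 0
  while (scrolls < maxScrolls) {
    const before = view.scrollY
    view.scrollBy(0, view.innerHeight)
    // Give lazily loaded content a moment to arrive (duration assumed).
    await new Promise(resolve => setTimeout(resolve, pauseMs))
    if (view.scrollY === before) break // no movement: nothing left to scroll
    scrolls += 1
  }
  return scrolls
}

// Hypothetical stand-in for a window showing a page of a fixed height.
function makeFakeView (pageHeight, innerHeight) {
  return {
    scrollY: 0,
    innerHeight,
    scrollBy (x, y) {
      this.scrollY = Math.min(this.scrollY + y, pageHeight - innerHeight)
    }
  }
}
```

Comparing the scroll position before and after each step avoids relying on a page height that may grow as content loads.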