Node Web Crawler

bda-research/node-crawler – GitHub
The most powerful, popular, and production crawling/scraping package for Node. Happy hacking :)
Features:
Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM
Configurable pool size and retries
Control rate limit
Priority queue of requests
forceUTF8 mode to let crawler handle charset detection and conversion for you
Compatible with Node 4.x or newer versions
Here is the CHANGELOG.
Thanks to Authuir, we have Chinese docs. Other languages are welcome!
Get started
Install
Basic usage
Slow down
Custom parameters
Raw body
preRequest
Advanced
Send request directly
Work with bottleneck
Class: Crawler
Event: ‘schedule’
Event: ‘limiterChange’
Event: ‘request’
Event: ‘drain’
crawler.queue(uri|options)
crawler.queueSize
Options reference
Basic request options
Callbacks
Schedule options
Retry options
Server-side DOM options
Charset encoding
Cache
HTTP headers
Work with Cheerio or JSDOM
Working with Cheerio
Work with JSDOM
How to test
Alternative: Docker
Rough todolist
Basic usage

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 10,
    // This will be called for each crawled page
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            const $ = res.$;
            // $ is Cheerio by default
            // a lean implementation of core jQuery designed specifically for the server
            console.log($('title').text());
        }
        done();
    }
});

// Queue just one URL, with default callback
// (the URLs below are placeholders; the originals were lost in extraction)
c.queue('http://example.com/');

// Queue a list of URLs
c.queue(['http://example.com/a', 'http://example.com/b']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://example.com/c',
    jQuery: false,
    // The global callback won't be called
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
Use rateLimit to slow down when you are visiting web sites.

const Crawler = require('crawler');

const c = new Crawler({
    rateLimit: 1000, // `maxConnections` will be forced to 1
    callback: (err, res, done) => {
        console.log(res.$('title').text());
        done();
    }
});

c.queue(tasks); // between any two tasks, the minimum time gap is 1000 ms
Sometimes you need to access variables from a previous request/response session; in that case, just pass them alongside the other options:

c.queue({
    uri: 'http://example.com/', // placeholder URL
    parameter1: 'value1',
    parameter2: 'value2',
    parameter3: 'value3'
});

Then access them in the callback via res.options:

console.log(res.options.parameter1);

Crawler picks out only the options needed by request, so don't worry about the redundancy.
If you are downloading files like images, PDFs, Word documents, etc., you have to save the raw response body, which means Crawler shouldn't convert it to a string. To make that happen, set encoding to null:

const Crawler = require('crawler');
const fs = require('fs');

const c = new Crawler({
    encoding: null,
    jQuery: false, // set false to suppress the warning message
    callback: (err, res, done) => {
        if (err) {
            console.error(err.stack);
        } else {
            fs.createWriteStream(res.options.filename).write(res.body);
        }
        done();
    }
});

// placeholder URL and filename
c.queue({ uri: 'http://example.com/some-file.png', filename: 'some-file.png' });
If you want to do something either synchronously or asynchronously before each request, you can try the code below. Note that direct requests won't trigger preRequest.

const c = new Crawler({
    preRequest: (options, done) => {
        // 'options' here is not the 'options' you pass to 'crawler.queue';
        // it's the options that will be passed to the 'request' module
        console.log(options);
        done(); // when done is called, the request will start
    },
    callback: (err, res, done) => {
        if (err) { console.log(err); } else { console.log(res.statusCode); }
        done();
    }
});

c.queue({
    uri: 'http://example.com/', // placeholder URL
    // this will override the 'preRequest' defined in the crawler
    preRequest: (options, done) => {
        setTimeout(() => { console.log(options); done(); }, 1000);
    }
});
In case you want to send a request directly without going through the scheduler in Crawler, try the code below. direct takes the same options as queue; please refer to the options reference for details. The difference is that when calling direct, callback must be defined explicitly, with two arguments, error and response, which are the same as those of the queue callback.

crawler.direct({
    uri: 'http://example.com/', // placeholder URL
    skipEventRequest: false, // defaults to true; direct requests won't trigger Event: 'request'
    callback: (error, response) => {
        if (error) {
            console.log(error);
        } else {
            console.log(response.statusCode);
        }
    }
});
Work with HTTP/2

Node-crawler now supports HTTP/2 requests. Proxy functionality for HTTP/2 requests is not included yet; it will be added in the future.

// the unit tests run against an HTTP/2 test server; the URL below is a placeholder
crawler.queue({
    uri: 'https://http2-server.example.com/', // placeholder: any HTTP/2-capable endpoint
    method: 'GET',
    http2: true, // setting http2 to true makes an HTTP/2 request
    callback: (error, response, done) => {
        if (error) {
            console.error(error);
            return done();
        }
        console.log('inside callback');
        console.log(response.body);
        return done();
    }
});
Control the rate limit with limiter. All tasks submitted to a limiter will abide by the rateLimit and maxConnections restrictions of that limiter. rateLimit is the minimum time gap between two tasks; maxConnections is the maximum number of tasks that can run at the same time. Limiters are independent of each other. One common use case is setting different limiters for different proxies. Note that when rateLimit is set to a non-zero value, maxConnections will be forced to 1.

const Crawler = require('crawler');

const c = new Crawler({
    rateLimit: 2000,
    maxConnections: 1,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            console.log(res.$('title').text());
        }
        done();
    }
});

// crawl a website with a 2000ms gap between requests (placeholder URLs)
c.queue('http://example.com/page-1');
c.queue('http://example.com/page-2');

// crawl a website through proxies, with a 2000ms gap per proxy
c.queue({ uri: 'http://example.com/page-3', limiter: 'proxy_1', proxy: 'proxy_1' });
c.queue({ uri: 'http://example.com/page-4', limiter: 'proxy_2', proxy: 'proxy_2' });
c.queue({ uri: 'http://example.com/page-5', limiter: 'proxy_3', proxy: 'proxy_3' });
Normally, all limiter instances in the crawler's limiter cluster are instantiated with the options specified in the crawler constructor. You can change a property of any limiter by calling the code below. Currently, only the 'rateLimit' property of a limiter can be changed. Note that the default limiter can be accessed via crawler.setLimiterProperty('default', 'rateLimit', 3000). We strongly recommend that you leave limiters unchanged after their instantiation unless you know clearly what you are doing.

const c = new Crawler({});
c.setLimiterProperty('limiterName', 'propertyName', value);
Class: Crawler

Event: 'schedule'
options: Options
Emitted when a task is being added to the scheduler.

crawler.on('schedule', (options) => {
    options.proxy = 'http://proxy:port'; // placeholder proxy address
});

Event: 'limiterChange'
options: Options
limiter: String
Emitted when the limiter has been changed.

Event: 'request'
options: Options
Emitted when the crawler is ready to send a request.
If you want to modify the options at the last stage before requesting, just listen on this event.

crawler.on('request', (options) => {
    options.qs.timestamp = new Date().getTime();
});

Event: 'drain'
Emitted when the queue is empty.

crawler.on('drain', () => {
    // For example, release a connection to the database.
    db.end(); // close connection to MySQL (assumes a 'db' handle exists elsewhere)
});

crawler.queue(uri|options)
uri: String
options: Options
Enqueue a task and wait for it to be executed.

crawler.queueSize
Number
Size of the queue, read-only (see the small example below).
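As an illustration, here is a minimal sketch (the URL is a placeholder) of queueing a task and reading queueSize:

const Crawler = require('crawler');

const c = new Crawler({
    callback: (error, res, done) => done()
});

c.queue('http://example.com/'); // placeholder URL
console.log(c.queueSize);       // number of tasks currently in the queue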
Options reference

You can pass these options to the Crawler() constructor if you want them to be global, or as items in the queue() calls if you want them to be specific to that item (overriding the global options).

This options list is a strict superset of mikeal's request options and will be passed directly to the request() method.

Basic request options
options.uri: String The URL you want to crawl.
options.timeout: Number In milliseconds (Default 15000).
All of mikeal's request options are accepted.
Callbacks
callback(error, res, done): Function that will be called after a request was completed
error: Error
res: http.IncomingMessage A standard IncomingMessage response, including $ and options
res.statusCode: Number HTTP status code, e.g. 200
res.body: Buffer | String HTTP response content, which could be an HTML page, plain text, or an XML document
res.headers: Object HTTP response headers
res.request: Request An instance of Mikeal's Request instead of http.ClientRequest
res.request.uri: urlObject HTTP request entity of the parsed URL
res.request.method: String HTTP request method, e.g. GET
res.request.headers: Object HTTP request headers
res.options: Options of this task
$: jQuery Selector A selector for the HTML or XML document
done: Function Must be called when you've finished your work in the callback (see the sketch after this list)
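The sketch below (placeholder URL) shows how these callback fields are typically used; it only touches fields documented above.

const Crawler = require('crawler');

const c = new Crawler({
    callback: (error, res, done) => {
        if (error) {
            console.error(error);
        } else {
            console.log(res.statusCode);              // e.g. 200
            console.log(res.headers['content-type']); // response headers
            console.log(res.options.uri);             // the options of this task
            console.log(res.$('title').text());       // jQuery-style selector over the document
        }
        done(); // always call done() when your work in the callback is finished
    }
});

c.queue('http://example.com/'); // placeholder URL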
Schedule options
options.maxConnections: Number Size of the worker pool (Default 10).
options.rateLimit: Number Number of milliseconds to delay between each request (Default 0).
options.priorityRange: Number Range of acceptable priorities starting from 0 (Default 10).
options.priority: Number Priority of this request (Default 5). Lower values have higher priority.

Retry options
options.retries: Number Number of retries if the request fails (Default 3).
options.retryTimeout: Number Number of milliseconds to wait before retrying (Default 10000).
Server-side DOM options
options.jQuery: Boolean|String|Object Use cheerio with its default configuration to inject the document if true or 'cheerio'. Or use a customized cheerio by passing an object with parser options. Disable injecting the jQuery selector if false. If you have a memory leak issue in your project, use 'whacko', an alternative parser, to avoid it. (Default true)

Charset encoding
options.forceUTF8: Boolean If true, crawler will get the charset from the HTTP headers or the meta tag in the HTML and convert it to UTF-8 if necessary. Never worry about encoding anymore! (Default true)
options.incomingEncoding: String With forceUTF8: true, set the encoding manually (Default null) so that crawler will not have to detect the charset by itself. For example, incomingEncoding: 'windows-1255'. See all supported encodings.

Cache
options.skipDuplicates: Boolean If true, skips URIs that were already crawled, without even calling callback() (Default false). This is not recommended; it's better to handle deduplication outside Crawler, using seenreq.
HTTP headers
options.rotateUA: Boolean If true, userAgent should be an array, and crawler will rotate through it (Default false).
options.userAgent: String|Array If rotateUA is false but userAgent is an array, crawler will use the first one.
options.referer: String If truthy, sets the HTTP referer header.
options.removeRefererHeader: Boolean If true, preserves the originally set referer during redirects.
options.headers: Object Raw key-value headers.

HTTP/2
options.http2: Boolean If true, the request will be sent over the HTTP/2 protocol (Default false).

A short sketch combining several of these options follows below.
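To tie the options above together, here is a sketch, with illustrative values and placeholder URLs, that combines several of them in one constructor call:

const Crawler = require('crawler');

const c = new Crawler({
    maxConnections: 5,                        // worker pool size
    retries: 2,                               // retry a failed request twice
    retryTimeout: 5000,                       // wait 5s before retrying
    forceUTF8: true,                          // detect the charset and convert to UTF-8
    rotateUA: true,                           // rotate through the userAgent array
    userAgent: ['AgentA/1.0', 'AgentB/1.0'],  // illustrative values
    referer: 'http://example.com/',           // placeholder
    headers: { 'x-example-header': '1' },     // raw headers, illustrative
    callback: (error, res, done) => {
        if (!error) { console.log(res.statusCode); }
        done();
    }
});

// Lower priority values are crawled earlier (default priority is 5); placeholder URLs
c.queue({ uri: 'http://example.com/urgent', priority: 0 });
c.queue({ uri: 'http://example.com/later' });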
HTTPS over SOCKS5

const Agent = require('socks5-https-client/lib/Agent'); // the exact agent package name was garbled in the source; socks5-https-client is assumed
//...
const c = new Crawler({
    // rateLimit: 2000,
    maxConnections: 20,
    agentClass: Agent, // add a SOCKS5 agent
    method: 'GET',
    strictSSL: true,
    agentOptions: {
        socksHost: 'localhost',
        socksPort: 9050
    },
    // debug: true,
    callback: (error, res, done) => {
        if (error) {
            console.log(error);
        } else {
            // ...
        }
        done();
    }
});
Work with Cheerio or JSDOM

Crawler uses Cheerio by default instead of JSDOM. JSDOM is more robust; if you want to use JSDOM, you will have to require it (require('jsdom')) in your own script before passing it to crawler.

Working with Cheerio

jQuery: true // (default)
// OR
jQuery: 'cheerio'
// OR
jQuery: {
    name: 'cheerio',
    options: {
        normalizeWhitespace: true,
        xmlMode: true
    }
}
These parsing options are taken directly from htmlparser2; therefore, any options that can be used in htmlparser2 are valid in cheerio as well. The default options are:

{
    normalizeWhitespace: false,
    xmlMode: false,
    decodeEntities: true
}

For a full list of options and their effects, see the cheerio documentation and htmlparser2's options.
Work with JSDOM

In order to work with JSDOM, you will have to install it in your project folder (npm install jsdom) and pass it to crawler:

const jsdom = require('jsdom');
const Crawler = require('crawler');

const c = new Crawler({
    jQuery: jsdom
});
How to test

Crawler uses nock to mock requests, so the tests no longer rely on a live HTTP server (a small illustrative sketch follows the test commands below).
$ npm install
$ npm test
$ npm run cover # code coverage
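For illustration only, here is a minimal sketch, with a made-up host name, of the general idea: nock intercepts the HTTP request so the crawl runs against a canned response instead of the network.

const nock = require('nock');
const Crawler = require('crawler');

// stub http://test.crawler.example so no real network call is made
nock('http://test.crawler.example')
    .get('/')
    .reply(200, '<html><head><title>mocked</title></head><body></body></html>');

const c = new Crawler({
    callback: (error, res, done) => {
        if (!error) {
            console.log(res.$('title').text()); // prints "mocked"
        }
        done();
    }
});

c.queue('http://test.crawler.example/');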
After installing Docker, you can run:
# Builds the local test environment
$ docker build -t node-crawler .
# Runs tests
$ docker run node-crawler sh -c “npm install && npm test”
# You can also ssh into the container for easier debugging
$ docker run -i -t node-crawler bash
Rough todolist
Introduce zombie to deal with pages with complex AJAX
Refactor the code to be more maintainable
Make Sizzle tests pass (JSDOM bug?)
Promise support
Commander support
Middleware support
Node.js web scraping tutorial – LogRocket Blog
Editor’s note: This web scraping tutorial was last updated on 28 February 2021.
In this web scraping tutorial, we'll demonstrate how to build a web crawler in Node.js to scrape websites and store the retrieved data in a Firebase database. Our web crawler will perform the web scraping and data transfer using Node.js worker threads.
Here’s what we’ll cover:
What is a web crawler?
What is web scraping in Node.js?
Node.js workers: The basics
Communicating with worker threads
Building a web crawler
Web scraping in Node.js
Using worker threads for web scraping
Is web scraping legal?
A web crawler, often shortened to crawler or sometimes called a spider-bot, is a bot that systematically browses the internet typically for the purpose of web indexing. These internet bots can be used by search engines to improve the quality of search results for users.
In addition to indexing the world wide web, crawling can also be used to gather data. This is known as web scraping.
Use cases for web scraping include collecting prices from a retailer’s site or hotel listings from a travel site, scraping email directories for sales leads, and gathering information to train machine learning models.
The process of web scraping can be quite taxing on the CPU, depending on the site's structure and the complexity of the data being extracted. You can use worker threads to optimize the CPU-intensive operations required to perform web scraping in Node.js.
Installation
Launch a terminal and create a new directory for this tutorial:
$ mkdir worker-tutorial
$ cd worker-tutorial
Initialize the directory by running the following command:
$ yarn init -y
We need the following packages to build the crawler:
Axios, a promise-based HTTP client for the browser and Node.js
Cheerio, a lightweight implementation of jQuery which gives us access to the DOM on the server
Firebase database, a cloud-hosted NoSQL database. If you’re not familiar with setting up a Firebase database, check out the documentation and follow steps 1-3 to get started
Let’s install the packages listed above with the following command:
$ yarn add axios cheerio firebase-admin
Before we start building the crawler using workers, let’s go over some basics. You can create a test file in the root of the project to run the following snippets.
Registering a worker
A worker can be initialized (registered) by importing the Worker class from the worker_threads module like this:

// in the main thread
const { Worker } = require('worker_threads');

new Worker('./worker.js'); // path to the worker script (file name assumed; the original was lost in extraction)
Hello world
Printing out Hello World with workers is as simple as running the snippet below:
const { Worker, isMainThread } = require('worker_threads');

if (isMainThread) {
    new Worker(__filename);
} else {
    console.log('Worker says: Hello World'); // prints 'Worker says: Hello World'
}
This snippet pulls in the worker class and the isMainThread object from the worker_threads module:
isMainThread helps us know when we are either running inside the main thread or a worker thread
new Worker(__filename) registers a new worker with the __filename variable, which, in this case, is the current script file
When a new worker thread is spawned, there is a messaging port that allows inter-thread communications. Below is a snippet which shows how to pass messages between workers (threads):
const { Worker, isMainThread, parentPort } = require('worker_threads');

if (isMainThread) {
    const worker = new Worker(__filename);
    worker.once('message', (message) => {
        console.log(message); // prints 'Worker thread: Hello!'
    });
    worker.postMessage('Main Thread: Hi!');
} else {
    parentPort.once('message', (message) => {
        console.log(message); // prints 'Main Thread: Hi!'
        parentPort.postMessage('Worker thread: Hello!');
    });
}
In the snippet above, we send a message to the parent thread using parentPort.postMessage() after initializing a worker thread. Then we listen for a message from the parent thread using parentPort.once(). We also send a message to the worker thread using worker.postMessage() and listen for a message from the worker thread using worker.once().
Running the code produces the following output:
Main Thread: Hi!
Worker thread: Hello!
Let’s build a basic web crawler that uses Node workers to crawl and write to a database. The crawler will complete its task in the following order:
Fetch (request) HTML from the website
Extract the HTML from the response
Traverse the DOM and extract the table containing exchange rates
Format the table elements (tbody, tr, and td) and extract the exchange rate values
Store the exchange rate values in an object and send it to a worker thread using worker.postMessage()
Accept the message from the parent thread in the worker thread using parentPort.on()
Store message in Firestore (Firebase database)
Let's create two new files in our project directory:
main.js for the main thread
dbWorker.js for the worker thread
The source code for this tutorial is available here on GitHub. Feel free to clone it, fork it, or submit an issue.
In the main thread (main.js), we will scrape the IBAN website for the current exchange rates of popular currencies against the US dollar. We will import axios and use it to fetch the HTML from the site with a simple GET request.
We will also use cheerio to traverse the DOM and extract data from the table element. To know the exact elements to extract, we will open the IBAN website in our browser and load dev tools:
Inspecting the page in dev tools, we can see that the table element has the classes table table-bordered table-hover downloads. This is a great starting point, and we can feed it into our cheerio root element selector:
// main.js
const axios = require('axios');
const cheerio = require('cheerio');
const url = 'https://www.iban.com/exchange-rates'; // the IBAN exchange-rates page (the exact URL was lost in extraction)

fetchData(url).then((res) => {
    const html = res.data;
    const $ = cheerio.load(html);
    const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
    statsTable.each(function() {
        let title = $(this).find('td').text();
        console.log(title);
    });
});

async function fetchData(url) {
    console.log('Crawling data...');
    // make an http call to the url
    let response = await axios(url).catch((err) => console.log(err));

    if (response.status !== 200) {
        console.log('Error occurred while fetching data');
        return;
    }
    return response;
}
Running the code above with Node prints the raw text content of each table row.
Going forward, we will update main.js so that we can properly format our output and send it to our worker thread.
Updating the main thread
To properly format our output, we need to get rid of white space and tabs, since we will be storing the final output in JSON. Let's update main.js accordingly:
[...]
let workDir = __dirname + '/dbWorker.js'; // path to the worker script (name assumed)

const mainFunc = async () => {
    // fetch html data from the iban website
    let res = await fetchData(url);
    if (!res.data) {
        console.log('Invalid data Obj');
        return;
    }
    const $ = cheerio.load(res.data); // mount the html page to the root element
    let dataObj = new Object();
    const statsTable = $('.table.table-bordered.table-hover.downloads > tbody > tr');
    // loop through all table rows and get the table data
    statsTable.each(function() {
        let title = $(this).find('td').text(); // get the text in all the td elements
        let newStr = title.split('\t'); // convert the text (string) into an array
        newStr.shift(); // strip off the empty array element at index 0
        formatStr(newStr, dataObj); // format the array of strings and store it in an object
    });
    return dataObj;
}
mainFunc().then((res) => {
    // start the worker
    const worker = new Worker(workDir);
    console.log('Sending crawled data to dbWorker...');
    // send the formatted data to the worker thread
    worker.postMessage(res);
    // listen for a message from the worker thread
    worker.on('message', (message) => {
        console.log(message);
    });
});
function formatStr(arr, dataObj) {
    // regex to match all the words before the first digit
    let regExp = /[^A-Z]*(^\D+)/;
    let newArr = arr[0].split(regExp); // split array element 0 using the regExp rule
    dataObj[newArr[1]] = newArr[2]; // store in the object
}
In the snippet above, we are doing more than data formatting; after the mainFunc() has been resolved, we pass the formatted data to the worker thread for storage.
In the worker thread (dbWorker.js), we will initialize Firebase and listen for the crawled data from the main thread. When the data arrives, we will store it in the database and send a message back to the main thread to confirm that the data was stored successfully.
The snippet that takes care of the aforementioned operations can be seen below:
// dbWorker.js
const { parentPort } = require('worker_threads');
const admin = require('firebase-admin');

// firebase credentials
let firebaseConfig = {
    apiKey: "XXXXXXXXXXXX-XXX-XXX",
    authDomain: "XXXXXXXXXXXX-XXX-XXX",
    databaseURL: "XXXXXXXXXXXX-XXX-XXX",
    projectId: "XXXXXXXXXXXX-XXX-XXX",
    storageBucket: "XXXXXXXXXXXX-XXX-XXX",
    messagingSenderId: "XXXXXXXXXXXX-XXX-XXX",
    appId: "XXXXXXXXXXXX-XXX-XXX"
};

// Initialize Firebase
admin.initializeApp(firebaseConfig);
let db = admin.firestore();

// get the current date in DD-MM-YYYY format
let date = new Date();
let currDate = `${date.getDate()}-${date.getMonth()}-${date.getFullYear()}`;

// receive crawled data from the main thread
parentPort.once('message', (message) => {
    console.log('Received data from mainWorker...');
    // store the data received from the main thread in the database
    db.collection('Rates').doc(currDate).set({
        rates: JSON.stringify(message)
    }).then(() => {
        // send data back to the main thread if the operation was successful
        parentPort.postMessage('Data saved successfully');
    }).catch((err) => console.log(err));
});
Note: To set up a database on Firebase, please visit the Firebase documentation and follow steps 1-3 to get started.
Running main.js (which encompasses dbWorker.js) with Node logs the formatted data and the confirmation message from the worker.
You can then check your Firebase database, where you will see the crawled data.
Although web scraping can be fun, it can also be against the law if you use the data to commit copyright infringement. It is generally advised that you read the terms and conditions of the site you intend to crawl and learn its data crawling policy beforehand.
You should learn more about web crawling policy before undertaking your own web scraping project.
The use of worker threads does not guarantee your application will be faster, but it can appear that way if used efficiently, because it frees up the main thread by making CPU-intensive tasks less cumbersome there.
Conclusion
In this tutorial, we learned how to build a web crawler that scrapes currency exchange rates and saves them to a database. We also learned how to use worker threads to run these operations.
The source code for each of the following snippets is available on GitHub. Feel free to clone it, fork it or submit an issue.
Further reading
Interested in learning more about worker threads? You can check out the following links:
Worker threads
Node.js multithreading: What are Worker Threads, and why do they matter?
Going Multithread with Node.js
Simple bidirectional messaging in Node.js Worker Threads

Frequently Asked Questions about node web crawler

What is a crawler in Node.js?

Sending an HTTP request to a particular URL and then extracting the HTML of that web page to get useful information is known as crawling or web scraping. Modules to be used for crawling in Node.js: request, for sending the HTTP request to the URL, and cheerio, for parsing the DOM and extracting the HTML of the web page.
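As a minimal sketch of that flow (placeholder URL; the request package is deprecated but still works), the snippet below fetches a page with request and extracts its title with cheerio:

const request = require('request');
const cheerio = require('cheerio');

request('http://example.com/', (error, response, body) => {
    if (error || response.statusCode !== 200) {
        return console.error('Request failed:', error || (response && response.statusCode));
    }
    const $ = cheerio.load(body);   // parse the returned HTML
    console.log($('title').text()); // extract the page title
});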

What is node in web?

Node (or, more formally, Node.js) is an open-source, cross-platform runtime environment that allows developers to create all kinds of server-side tools and applications in JavaScript. The runtime is intended for use outside of a browser context (i.e., running directly on a computer or server OS).

What are spiders and crawlers?

A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results.
