Web Crawler Javascript

Building a simple web crawler with Node.js and JavaScript

All of us use search engines almost daily. When most of us talk about search engines, we really mean the World Wide Web search engines. A very superficial overview of a search engine suggests that a user enters a search term and the search engine gives a list of all relevant resources related to that search term. But, to provide the user with a list of resources the search engine should know that they exist on the Internet.
Search engines do not magically know what websites exist on the Internet. So this is exactly where the role of web crawlers comes into the picture.
What is a web crawler?
A web crawler is a bot that downloads and indexes content from all over the Internet. Its aim is to learn what every webpage on the Internet is about, so that the information can be retrieved easily when it is needed.
The web crawler is like a librarian organizing the books in a library making a catalog of those books so that it becomes easier for the reader to find a particular book. To categorize a book, the librarian needs to read its topic, summary, and some part of the content if required.
How does a web crawler work?
It is very difficult to know how many webpages exist on the Internet in total. A web crawler starts with a certain number of known URLs, and as it crawls those webpages it finds links to other webpages. A web crawler follows certain policies to decide what to crawl and how frequently to crawl.
Which webpages to crawl first is also decided by considering some parameters. For instance, webpages with a lot of visitors are a good place to start, since a search engine will want to have them indexed.
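The crawl loop described above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the `pages` object is a hypothetical in-memory stand-in for the web, and `crawl` and `getLinks` are names made up for this example, not part of any library.

```javascript
// Hypothetical in-memory "web": each URL maps to the links found on that page.
const pages = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/b', 'https://example.com/c'],
  'https://example.com/b': ['https://example.com/'],
  'https://example.com/c': [],
};

// Breadth-first crawl: start from the seed URLs, visit each page once,
// and queue every newly discovered link.
function crawl(seeds, getLinks, maxPages = 100) {
  const visited = new Set();
  const queue = [...seeds];
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    for (const link of getLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited];
}

const crawled = crawl(['https://example.com/'], (url) => pages[url] || []);
console.log(crawled);
```

A real crawler would replace `getLinks` with a function that downloads the page and parses its links, and would add a policy for how often to revisit each URL.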
Building a simple web crawler with Node.js and JavaScript
We will be using the modules cheerio and request.
Install these dependencies using the following commands
npm install --save cheerio
npm install --save request
The following code imports the required modules and makes a request to Hacker News.
We log the status code of the response to the console to see if the request was successful.
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');
request("...", (error, response, body) => {
  if (error) {
    console.log("Error: " + error);
    return; // response is undefined when the request fails
  }
  console.log("Status code: " + response.statusCode);
});
Note that the fs module is used to handle files and it is a built-in module.
We observe the structure of the data using the developer tools of our browser. We see that there are tr elements with the athing class.
We will go through all of these elements and get the title of the post by selecting the td.title child element and the hyperlink by selecting the a element. This task is accomplished by adding the following code after the console.log call in the previous code block.
var $ = cheerio.load(body);
$('tr.athing:has(td.votelinks)').each(function(index) {
  var title = $(this).find('td.title > a').text().trim();
  var link = $(this).find('td.title > a').attr('href');
  fs.appendFileSync('...', title + '\n' + link + '\n');
});
We also skip the posts related to hiring (if observed carefully, the tr element of such a post does not have a td.votelinks child element).
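The skip rule and the output layout can be tried out without any network access. The sketch below is an assumption-laden stand-in: the `rows` array, its `hasVoteLinks` flag and the `formatPosts` helper are hypothetical, and simply mimic rows that do or do not contain the vote-links child element.

```javascript
// Hypothetical rows, shaped like the data the crawler reads from each post.
const rows = [
  { title: 'Show HN: A tiny crawler', link: 'https://example.com/crawler', hasVoteLinks: true },
  { title: 'Example Co is hiring engineers', link: 'item?id=1', hasVoteLinks: false },
  { title: 'Ask HN: Favourite parser?', link: 'item?id=2', hasVoteLinks: true },
];

// Keep only rows that have a vote-links cell, then format each as
// "title\nlink\n", the same layout the crawler appends to its output file.
function formatPosts(rows) {
  return rows
    .filter((row) => row.hasVoteLinks)
    .map((row) => row.title + '\n' + row.link + '\n')
    .join('');
}

const output = formatPosts(rows);
console.log(output);
```

The hiring post is dropped because its row has no vote-links cell, which is the same condition the selector expresses.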
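The complete code now looks like the following:
request("...", function(error, response, body) {
  if (error) { return console.log("Error: " + error); }
  console.log("Status code: " + response.statusCode);
  var $ = cheerio.load(body);
  $('tr.athing:has(td.votelinks)').each(function(index) {
    var title = $(this).find('td.title > a').text().trim();
    var link = $(this).find('td.title > a').attr('href');
    fs.appendFileSync('...', title + '\n' + link + '\n');
  });
});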
The data is stored in the file named in the fs.appendFileSync call.
Your simple web crawler is ready!!
How to Crawl JavaScript Websites | Sitebulb.com

Crawling websites is not quite as straightforward as it was a few years ago, and this is mainly due to the rise in usage of JavaScript frameworks, such as Angular and React.
Traditionally, a crawler would work by extracting data from static HTML code, and up until relatively recently, most websites you would encounter could be crawled in this manner.
However, if you try to crawl a website built with Angular like this, you won’t get very far (literally). In order to ‘see’ the HTML of a web page (and the content and links within it), the crawler needs to process all the code on the page and actually render the content.
Rendering is a process carried out by the browser, taking the code (HTML, CSS, JS, etc.) and translating this into the visual representation of the web page you see on the screen.
Search engines (and crawling tools like Sitebulb) are able to do this en masse using a ‘headless’ browser, which is a browser that runs without the visual user interface. This works by building up the page content (i.e. ‘rendering the page’) then extracting the HTML after the page has rendered.
The key difference between extracting HTML before the page is rendered and after it is rendered is the influence of JavaScript.
Why? Because when JavaScript is fired, it can drastically change the page content. On sites that are particularly JavaScript heavy, most or all of the content will be changed by JavaScript.
Plenty of sites do not rely on JavaScript in this way, but when you encounter a site that does, there are certain things you need to take into account when setting up the crawler.
Table of contents:
How Google handles rendering
How Sitebulb handles rendering
Side effects of crawling with JavaScript
How to detect JavaScript websites
How to detect JavaScript dependence
Further reading
How Google handles rendering
Over the years, this rise in the prevalence of JavaScript has caused Google various headaches. For a long time they struggled to render JavaScript-heavy pages at scale, and their default advice was to utilise server-side or pre-rendering, instead of client-side rendering.
Since 2019, they have implemented an ‘evergreen Googlebot’, which means that Googlebot runs the latest Chromium rendering engine and keeps it constantly up-to-date (incidentally, we do exactly the same here at Sitebulb, so crawling with Sitebulb reflects exactly what Google sees).
Nowadays, rendering is built into Google’s crawling and indexing process at a fundamental level.
The important thing to note from this diagram is that the index gets updated after rendering. Additionally, consider that Google claim they basically render every single page they encounter.
This is important because it should affect how you think about (and potentially crawl) every website, not only the ones that have been built using a JavaScript framework.
Additional resources
This post is about crawling JavaScript websites, so further depth on ‘how rendering fits in with Google’ will lead us down multiple rabbit-holes.
However, it is a deep, complex and interesting topic that absolutely deserves the attention of technical SEOs, so here is some more reading for you to enjoy:
How JavaScript Rendering Affects Google Indexing
Rendering SEO manifesto – why JavaScript SEO is not enough
Googlebot & JavaScript: A Closer Look at the WRS (VIDEO PRESENTATION)
How Sitebulb handles rendering
Sitebulb offers two different ways of crawling:
HTML Crawler
Chrome Crawler
The HTML Crawler uses the traditional method of downloading the source HTML and parsing it, without rendering JavaScript.
The Chrome Crawler utilises headless Chromium (like Google) to render the page, then parses the rendered HTML. Since it takes time to compile and fire all the JavaScript in order to render the page, it is necessarily slower to crawl with the Chrome Crawler.
As we have mentioned above, however, some websites rely on client-side JavaScript and therefore can only be crawled with the Chrome Crawler.
Selecting the Chrome Crawler in the crawler settings will allow you to crawl JavaScript websites.
Trying to crawl a JavaScript website without rendering
As a brief aside, we’re first going to investigate what happens when you try to crawl a JavaScript website without rendering, which means selecting the ‘HTML Crawler’ in the settings.
Let’s take a look…
One page.
Why only one page? Because the response HTML (the stuff you can see with ‘View Source’) only contains a bunch of scripts and some fallback text.
You simply can’t see the meat and bones of the page – the product images, description, technical spec, video, and most importantly, links to other pages… everything a crawler needs in order to understand your page content.
On websites like this you absolutely need to use the Chrome Crawler to get back any meaningful crawl data.
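To see why a non-rendering crawl stops at one page, here is a minimal Node.js sketch of what such a crawler does with the response HTML. The two HTML strings and the `extractLinks` helper are hypothetical, and a regular expression stands in for a proper HTML parser.

```javascript
// Collect href values from <a> tags in a raw HTML string.
// A regular expression is enough for a sketch; a real crawler would use an HTML parser.
function extractLinks(html) {
  const links = [];
  const anchor = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = anchor.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Hypothetical response HTML for a client-side rendered page: scripts and fallback text only.
const responseHtml = `<html><head><script src="/app.js"></script></head>
<body><div id="root"></div><noscript>Please enable JavaScript.</noscript></body></html>`;

// Hypothetical HTML for the same page after rendering.
const renderedHtml = `<html><body><div id="root">
<a href="/products">Products</a> <a href="/support">Support</a>
</div></body></html>`;

console.log(extractLinks(responseHtml).length);
console.log(extractLinks(renderedHtml));
```

With no links in the response HTML there is nothing to add to the crawl queue, so the crawl ends after the first URL.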
How to crawl JavaScript websites with Sitebulb
Every time you set up a new Project in Sitebulb, you have the option of setting it up to use the HTML Crawler or the Chrome Crawler. If you are crawling a JavaScript website, this is the first step you need to cover:
Secondly, you will also need to consider the render timeout, as this affects how much of the page content Sitebulb is actually able to access.
You will find this in the Crawler Settings on the left hand side, and the Render Timeout dropdown is right underneath ‘Crawler Type’ on the right.
By default, this is set at 1 second, which is absolutely fine for most sites that do not have a high dependence on JavaScript. However, websites built using a JavaScript framework have a very high dependence on JavaScript, so this needs to be set with some care.
What is this render timeout?
The render timeout is essentially how long Sitebulb will wait for rendering to complete before taking an ‘HTML snapshot’ of each web page.
Justin Briggs published a post which is an excellent primer on handling JavaScript content for SEO, which will help us explain where the Render Timeout fits in.
I strongly advise you go and read the whole post, but at the very least, the screenshot below shows the sequence of events that occur when a browser requests a page that is dependent upon JavaScript rendered content:
The ‘Render Timeout’ period used by Sitebulb starts just after #1, the Initial Request. So essentially, the render timeout is the time you need to wait for everything to load and render on the page. Say you have the Render Timeout set to 4 seconds: this means that each page has 4 seconds for all the content to finish loading and any final changes to take effect.
Anything that changes after these 4 seconds will not be captured and recorded by Sitebulb.
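The effect of the cut-off can be modelled with a small simulation. This is not how Sitebulb is implemented; the `changes` timeline and the `snapshotLinks` function are hypothetical, and only illustrate that content arriving after the timeout is missing from the snapshot.

```javascript
// Each entry is a DOM change and the time (ms after the request) at which it lands.
// The timings and content are hypothetical.
const changes = [
  { atMs: 300, adds: ['/home'] },
  { atMs: 2500, adds: ['/products', '/support'] },
  { atMs: 4500, adds: ['/accessories'] },
];

// Return the links present in a snapshot taken when the render timeout expires.
function snapshotLinks(changes, renderTimeoutMs) {
  return changes
    .filter((change) => change.atMs <= renderTimeoutMs)
    .flatMap((change) => change.adds);
}

console.log(snapshotLinks(changes, 4000));
console.log(snapshotLinks(changes, 5000));
```

In this made-up timeline the link added at 4.5 seconds only appears once the timeout is raised to 5 seconds.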
Render timeout example
I’ll demonstrate with an example, again using the Roku site we looked at earlier.
In my first audit I used the HTML Crawler – 1 URL crawled
In my second audit I used the Chrome Crawler with a 3 second render timeout – 139 URLs crawled
In my third audit I used the Chrome Crawler with a 5 second render timeout – 144 URLs crawled
Digging into a little more detail about these two Chrome audits, there were 5 more internal HTML URLs found with the 5 second timeout. This means that, in the audit with a 3 second render timeout, the content which contains links to those URLs had not been loaded when Sitebulb took the snapshot.
I actually crawled it one more time after this with a 10 second render timeout, but there was no difference to the 5 second render timeout, which suggests that 5 seconds is sufficient to see all the content on this website.
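If you export the URL lists from two audits, finding what the shorter timeout missed is a simple set difference. The sketch below uses hypothetical URL lists and a made-up `missingUrls` helper.

```javascript
// Return the URLs present in `longer` but missing from `shorter`.
function missingUrls(shorter, longer) {
  const seen = new Set(shorter);
  return longer.filter((url) => !seen.has(url));
}

// Hypothetical crawl results for two render timeouts.
const crawl3s = ['/', '/products', '/support'];
const crawl5s = ['/', '/products', '/support', '/products/accessories'];

console.log(missingUrls(crawl3s, crawl5s));
```

An empty result, as with the 5 and 10 second audits described above, suggests the shorter timeout is already sufficient.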
On another example site, I experimented with not setting a render timeout at all, and crawling the site again with a 5 second timeout. Comparing the two Crawl Maps shows stark differences:
Clearly, this can have a profound impact upon your understanding of the website and its architecture, which underlines why it is very important to set the correct render timeout in order for Sitebulb to see all of the content.
Recommended render timeout
Understanding why the render timeout exists does not actually help us decide what to set it at.
Although Google have never published anything official about how long they wait for a page to render, most industry experts tend to concur that 5 seconds is generally considered to be ‘about right’.
Either way, all this will show you is an approximation of what Google may be seeing. If you want to crawl ALL the content on your site, then you’ll need to develop a better understanding of how the content on your website actually renders.
To do this, head to Chrome’s DevTools Console. Right click on the page and hit ‘Inspect’, then select ‘Network’ from the tabs in the Console, and then reload the page. I’ve positioned the dock to the right of my screen to demonstrate:
Keep your eye on the waterfall graph that builds, and the timings that are recorded in the summary bar at the bottom:
So we have 3 times recorded here:
DOMContentLoaded: 727 ms (= 0.727 s)
Load: 2.42 s
Finish: 4.24 s
You can find the definitions for ‘DOMContentLoaded’ and ‘Load’ from the image above that I took from Justin Briggs’ post. The ‘Finish’ time is exactly that, when the content is fully rendered and any changes or asynchronous scripts have completed.
If the website content depends on JavaScript changes, then you really need to wait for the ‘Finish’ time, so use this as a rule of thumb for determining the render timeout.
Bear in mind that so far we’ve only looked at a single page. To develop a better picture of what’s going on, you’d need to check a number of pages/page templates and check the timings for each one.
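One way to turn several measurements into a setting is to take the slowest ‘Finish’ time and round it up. This is only a rule-of-thumb sketch: the template names, the two extra timings and the `suggestRenderTimeout` function are assumptions for illustration.

```javascript
// Hypothetical 'Finish' timings (in seconds) for a few page templates.
const finishTimes = { home: 4.24, category: 3.1, product: 4.8 };

// Use the slowest template, rounded up to a whole second, as the render timeout.
function suggestRenderTimeout(finishTimes) {
  const slowest = Math.max(...Object.values(finishTimes));
  return Math.ceil(slowest);
}

console.log(suggestRenderTimeout(finishTimes));
```

Timings vary between page loads, so it is worth repeating the measurement a few times before settling on a value.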
If you are going to be crawling with the Chrome Crawler, we urge you to experiment further with the render timeout so you can set your Projects up to correctly crawl all your content every time.
Rendering data from Google Tag Manager
Some SEOs utilise Google Tag Manager (GTM) in order to dynamically change on-page elements, either as a full-blown optimisation solution, or as a proof-of-concept to justify budget for ‘proper’ dev work.
If you are unfamiliar with this, check out Dave Ashworth’s post for Organic Digital – How To: Do Dynamic Product Meta Data in Magento Using GTM – which describes how he used GTM to dynamically re-write and localise the titles and meta descriptions for thousands of pages, with impressive results.
Most other crawlers won’t be able to pick up the data inserted by GTM, which means they don’t allow you to actually audit this data. This is because by default they block tracking scripts, since these scripts can bloat audit data.
Here at Sitebulb, we have accounted for that too, and actually give you the option to turn this off, so you CAN collect on-page data dynamically inserted or changed using Google Tag Manager.
To do this, when setting up your audit, head over to the ‘URL Exclusions’ tab on the left hand menu:
Then scroll alllllll the way down to the section entitled ‘Block Third Party URLs’, then you need to untick the option marked ‘Block Ad and Tracking Scripts’, which will always be ticked by default.
And then when you go ahead and crawl the site, Sitebulb will correctly extract the GTM-altered meta data. Note that you may need to tweak the render timeout.
Here is what Dave had to say about his experiences using Sitebulb in his auditing workflow:
Side effects of crawling with JavaScript
Almost every website you will ever see uses JavaScript to some degree – interactive elements, pop-ups, analytics codes, dynamic page elements… all controlled by JavaScript.
However, most websites do not employ JavaScript to dynamically alter the majority of the content on a given web page. For websites like this, there is no real benefit in crawling with JavaScript enabled. In fact, in terms of reporting, there is literally no difference at all:
And there are actually a couple of downsides to crawling with the Chrome Crawler, for example:
1. Crawling with the Chrome Crawler means you need to fetch and render every single page resource (JavaScript, images, CSS, etc.), which is more resource intensive for both the local machine that runs Sitebulb and the server that the website is hosted on.
2. As a direct result of #1 above, crawling with the Chrome Crawler is slower than with the HTML Crawler, particularly if you have set a long render timeout. On some sites, and with some settings, it can end up taking 6-10x longer to complete.
So, unless you need to crawl with the Chrome Crawler because the website uses a JavaScript framework, or because you specifically want to see how the website responds to a JavaScript crawler, it makes sense to crawl with the HTML Crawler by default.
Note: there is one other reason you would choose the Chrome Crawler, and that is if you want to audit Performance or Accessibility, both of which require the use of the Chrome Crawler.
How to detect JavaScript websites
In this post I’ve used the phrase ‘JavaScript Websites’ for brevity, where what I actually mean is ‘websites that depend on JavaScript-rendered content’.
It is most likely that the type of websites you come across will be using one of the increasingly popular JavaScript frameworks, such as:
Angular
React
Ember
Backbone
Vue
Meteor
If you are dealing with a website running one of these frameworks, it is important that you understand as soon as possible that you are dealing with a website that is fundamentally different from a non-JavaScript website.
Client briefing
A client briefing is obviously the first port of call: a thorough briefing with the client or their dev team can save you time on discovery work.
However, whilst it is nice to think that every client briefing would give you this sort of information up front, I know from painful experience that they are not always forthcoming with seemingly obvious details…
Trying a crawl
Ploughing head first into an audit with the HTML Crawler is actually not going to cost you too much time: even the most ‘niche’ websites have more than a single URL, so a crawl that returns only one is easy to spot.
Whilst a single-URL crawl would not mean that you’re definitely dealing with a JavaScript website, it would be a pretty good indicator.
It is certainly worth bearing in mind though, in case you are a set-it-and-forget-it type, or you tend to leave Sitebulb on overnight with a queue of websites to audit… by the morning you’d be bitterly disappointed.
Manual inspection
You can also use Google’s tools to help you understand how a website is put together. Using Google Chrome, right click anywhere on a web page and choose ‘Inspect’ to bring up Chrome’s DevTools Console.
Then hit F1 to bring up the Settings. Scroll down to find the Debugger, and tick ‘Disable JavaScript’.
Then, leave the DevTools Console open and refresh the page. Does the content stay exactly the same, or does it all disappear?
The Roku site, for instance, provides extremely short shrift:
How to detect JavaScript dependence
As we’ve covered already, HTML crawling is both quicker and less resource-intensive, so it does make sense to use this as your default option most of the time. In general, we find it is helpful to understand how a website is put together, and if there is little to no dependence on JavaScript on the site, you can be confident using the HTML Crawler for all your audits on that site.
We’ve also explored ways to identify ‘JavaScript websites’ where basically all the content is loaded in with JavaScript. But what about sites where only some of the content changes after rendering?
Our Roku example above is actually a pretty obvious example of a website not working with JavaScript disabled. But consider instead that some websites load only a portion of the content in with JavaScript (e.g. an image gallery) – on that sort of website if you only ever crawled with the HTML Crawler you could be missing out on an important chunk of data.
Comparing response vs rendered HTML
This is where you can make use of Sitebulb’s unique report: Response vs Render, which is generated automatically whenever you use the Chrome Crawler.
What this does is render the page like normal, then run a comparison of the rendered HTML against the response HTML (i.e. the ‘View Source’ HTML). It will check for differences in terms of all the important SEO elements:
Meta robots
Canonical
Page title
Meta description
Internal links
External links
Then the report in Sitebulb will show you if JavaScript has changed or modified any of these important elements:
For the most comprehensive understanding of how this report works, check out our response vs render comparison guide.
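The idea behind such a comparison can be sketched in Node.js. This is not Sitebulb’s implementation: `extractSeoElements` and `compareResponseVsRender` are hypothetical helpers, the HTML strings are invented, and the regular expressions assume a fixed attribute order that a real parser would not need.

```javascript
// Pull a few SEO elements out of an HTML string with regular expressions.
function extractSeoElements(html) {
  const pick = (regex) => {
    const match = regex.exec(html);
    return match ? match[1].trim() : null;
  };
  return {
    title: pick(/<title[^>]*>([^<]*)<\/title>/i),
    metaRobots: pick(/<meta\s+name=["']robots["']\s+content=["']([^"']*)["']/i),
    canonical: pick(/<link\s+rel=["']canonical["']\s+href=["']([^"']*)["']/i),
  };
}

// List the elements whose value differs between the response and rendered HTML.
function compareResponseVsRender(responseHtml, renderedHtml) {
  const before = extractSeoElements(responseHtml);
  const after = extractSeoElements(renderedHtml);
  return Object.keys(before)
    .filter((key) => before[key] !== after[key])
    .map((key) => ({ element: key, response: before[key], rendered: after[key] }));
}

const responseHtml = '<head><title>Loading</title><link rel="canonical" href="https://example.com/p"></head>';
const renderedHtml = '<head><title>Blue Widget</title><link rel="canonical" href="https://example.com/p"><meta name="robots" content="noindex"></head>';

const differences = compareResponseVsRender(responseHtml, renderedHtml);
console.log(differences);
```

Here JavaScript has changed the page title and injected a meta robots tag, while the canonical is untouched, so only the first two are reported.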
Include JavaScript in your ‘discovery workflow’
When working with any new or unfamiliar website, part of your initial process involves discovery – what type of platform are they on, what kind of tracking/analytics are they using, how big is the website etc…
Our suggestion is that JavaScript should also enter this workflow, so you can be confident if rendering is required when crawling the site. Essentially the point of this is to determine the level of dependence upon JavaScript, and whether you need to render the pages in your audits moving forwards.
But also, knowing this could help you unpick issues with crawling or indexing, or affect how you tackle things like internal link optimisation.
A simple workflow could look like this:
Run an exploratory Sitebulb audit using the Chrome Crawler
Analyse the Response vs Render report to see if JavaScript is affecting any of the content during rendering
Include the results of this in your audit, and make a decision for future audits as to whether the Chrome Crawler is needed or not.
If you need further convincing that this is a good idea, just ask yourself ‘what would Aleyda do…?’
Further reading
We have a dedicated page with all the best resources for learning about JavaScript SEO, including guides, experiments and videos. Included below are some of our favourites for various topics.
Fundamental JavaScript SEO:
Core Principles of SEO for JavaScript by Justin Briggs
JavaScript & SEO: Making Your Bot Experience As Good As Your User Experience by Alexis Sanders
JavaScript and SEO: The Difference Between Crawling and Indexing by Barry Adams
The SEO’s Introduction to Rendering by Jamie Alberico
Advanced JavaScript SEO:
How to Diagnose and Solve JavaScript SEO Issues in 6 Steps by Tomek Rudzki
Rendering on the Web – The SEO Version by Jan-Willem Bobbink
“Rendering SEO” with Martin Splitt by Onely (Webinar)
Rendering SEO Manifesto – Why we need to go beyond JavaScript SEO by Bartosz Góralewicz
What We Do in the Shadow DOM by Jamie Alberico
Try out Sitebulb’s JavaScript crawling
If you’re looking for a way to crawl and render your own JavaScript website, you can download Sitebulb here, and try it free for 14 days.
How To Crawl JavaScript Websites - Screaming Frog

Introduction To Crawling JavaScript
Historically search engine bots such as Googlebot didn’t crawl and index content created dynamically using JavaScript and were only able to see what was in the static HTML source code.
However, there’s been a huge growth in the use of JavaScript, in frameworks such as AngularJS and React, and in single page applications (SPAs) and progressive web apps (PWAs).
This has meant Google in particular have evolved significantly, deprecating their old AJAX crawling scheme guidelines of escaped-fragment #! URLs and HTML snapshots in October 2015, and are now generally able to render and understand web pages like a modern-day browser.
While Google are generally able to crawl and index most JavaScript content, they recommend server-side rendering, pre-rendering or dynamic rendering rather than relying on client-side JavaScript as it’s ‘difficult to process JavaScript and not all search engine crawlers are able to process it successfully or immediately’.
When crawling and evaluating websites today, it’s essential to be able to read the DOM after JavaScript has come into play and constructed the web page, and to understand how it differs from the original response HTML.
Traditional website crawlers were not able to crawl JavaScript websites, until we launched the first ever JavaScript rendering functionality into our Screaming Frog SEO Spider software. This meant pages were fully rendered in a browser first, and the rendered HTML post-JavaScript is crawled.
We integrated the Chromium project library for our rendering engine to emulate Google as closely as possible.
In 2019 Google updated their web rendering service (WRS) which was previously based on Chrome 41 to be ‘evergreen’ and use the latest, stable version of Chrome – supporting over 1,000 more features.
The SEO Spider uses a slightly earlier version of Chrome, version 69 at the time of writing, but we recommend viewing the exact version within the app by clicking ‘Help > Debug’ and scrolling down to the ‘Chrome Version’ line as we update this frequently.
Hence, while rendering will obviously be similar, it won’t be exactly the same as there might be some small differences in supported features (there are arguments that the exact version of Chrome itself won’t be exactly the same, either). However, generally, the WRS supports the same web platform features and capabilities as the Chrome version it uses, and you can compare the differences between Chrome versions.
This guide contains the following 3 sections:
1) Why You Shouldn’t Crawl Blindly With JavaScript Enabled
2) How To Identify JavaScript
3) How To Crawl JavaScript Websites
If you already understand the basic principles of JavaScript and just want to crawl a JavaScript website, skip straight to our guide on configuring the Screaming Frog SEO Spider tool to crawl JavaScript sites. Or, read on.
Why You Shouldn’t Crawl Blindly With JavaScript Enabled
While it’s essential in auditing today, we recommend utilising JavaScript crawling selectively when required and only keeping this enabled by default with careful consideration.
You don’t have to identify whether the site itself is using JavaScript. You can just go ahead and crawl with JavaScript rendering enabled and sites that use JavaScript will be crawled. However, you should take care, as there are issues blindly crawling with JavaScript enabled.
First of all, JavaScript crawling is slower and more intensive for the server, as all resources (whether JavaScript, CSS, images etc.) need to be fetched to render each web page. This won’t be an issue for smaller websites, but for a large website with many thousands or more pages, this can make a huge difference.
If your site doesn’t rely on JavaScript to dynamically manipulate a web page significantly, then there’s often no need to waste time and resource.
More importantly, if you’re auditing a website you should know how it’s built and whether it’s relying on any client-side JavaScript for key content or links. JavaScript frameworks can be quite different to one another, and the SEO implications are very different to a traditional HTML site.
Core JavaScript Principles
While Google can typically crawl and index JavaScript, there are some core principles and limitations that need to be understood.
All the resources of a page (JS, CSS, imagery) need to be available to be crawled, rendered and indexed.
Google still require clean, unique URLs for a page, and links to be in proper HTML anchor tags (you can offer a static link, as well as calling a JavaScript function).
They don’t click around like a user and load additional events after the render (a click, a hover or a scroll for example).
The rendered page snapshot is taken when network activity is determined to have stopped, or over a time threshold. There is a risk if a page takes a very long time to render it might be skipped and elements won’t be seen and indexed.
Typically Google will render all pages, however they will not queue pages for rendering if they have a ‘noindex’ in the initial HTTP response or static HTML.
Finally, Google’s rendering is separate from indexing. Google initially crawls the static HTML of a website, and defers rendering until it has resource. Only then will it discover further content and links available in the rendered HTML. Historically this could take a week, but Google have made significant improvements to the point that the median time is now down to just 5 seconds.
It’s essential you know these things with JavaScript SEO, as you live and die by the render in rankings.
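The anchor tag principle above can be checked with a rough script. The sketch below is an illustration only: `classifyLinks` and the sample markup are hypothetical, and a regular expression is used in place of a real HTML parser.

```javascript
// Split clickable elements into links a crawler can follow (an <a> with an href)
// and elements that only navigate through a JavaScript handler.
function classifyLinks(html) {
  const crawlable = [];
  const scriptOnly = [];
  const tag = /<(a|span|div|button)\b([^>]*)>/gi;
  let match;
  while ((match = tag.exec(html)) !== null) {
    const attrs = match[2];
    const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(attrs);
    const isAnchor = match[1].toLowerCase() === 'a';
    if (isAnchor && href && !href[1].toLowerCase().startsWith('javascript:')) {
      crawlable.push(href[1]);
    } else if (/\bonclick\s*=/i.test(attrs)) {
      scriptOnly.push(match[0]);
    }
  }
  return { crawlable, scriptOnly };
}

const html = `
<a href="/products" onclick="track()">Products</a>
<span onclick="goTo('/offers')">Offers</span>
<a href="javascript:void(0)" onclick="goTo('/support')">Support</a>`;

const result = classifyLinks(html);
console.log(result.crawlable);
console.log(result.scriptOnly.length);
```

Only the first element offers a static link alongside its JavaScript handler; the other two would need a click to reveal their destination.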
Google Advice On JavaScript & Rendering Strategy
It’s important to remember that Google advises against relying on client-side JavaScript and recommend developing with progressive enhancement, building the site’s structure and navigation using only HTML and then improving the site’s appearance and interface with AJAX.
If you’re using a JavaScript framework, rather than relying on a fully client-side rendered approach, Google recommend using server-side rendering, pre-rendering or hybrid rendering which can improve performance for users and search engine crawlers.
Server-side rendering (SSR) and pre-rendering execute the page’s JavaScript and deliver a rendered initial HTML version of the page to both users and search engines.
Hybrid rendering (sometimes referred to as ‘Isomorphic’) is where rendering can take place on the server-side for the initial page load and HTML, and client-side for non critical elements and pages afterwards.
Many JavaScript frameworks such as React or Angular Universal allow for server-side and hybrid rendering.
Alternatively, a workaround to help crawlers is to use dynamic rendering. This can be particularly useful when changes can’t be made to the front-end codebase. Dynamic rendering means switching between client-side rendered for users and pre-rendered content for specific user agents (in this case, the search engines). This means crawlers will be served a static HTML version of the web page for crawling and indexing.
Dynamic rendering is seen as a stop-gap rather than a long-term strategy, as it doesn’t have the user experience or performance benefits of some of the above solutions. If you already have this set up, then you can test this functionality by switching the user-agent to Googlebot within the SEO Spider.
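At its simplest, the switch in dynamic rendering is a decision based on the User-Agent header. The following sketch assumes a hypothetical `BOT_PATTERNS` list and `chooseVariant` function; a production set-up would sit in the web server or a middleware layer and use a maintained list of crawlers.

```javascript
// Hypothetical list of crawler user-agent substrings to serve pre-rendered HTML to.
const BOT_PATTERNS = ['googlebot', 'bingbot'];

// Decide which version of a page to serve for a given User-Agent header.
function chooseVariant(userAgent) {
  const ua = (userAgent || '').toLowerCase();
  const isBot = BOT_PATTERNS.some((pattern) => ua.includes(pattern));
  return isBot ? 'prerendered' : 'client-side';
}

console.log(chooseVariant('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'));
console.log(chooseVariant('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36'));
```

Switching the user-agent in a crawler, as described above, exercises exactly this branch.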
For more information on how Google processes JavaScript, check out their JS basics guide.
JavaScript Indexing Complications
Even though Google are generally able to crawl and index JavaScript, there are further considerations.
Google have a two-phase indexing process, whereby they initially crawl and index the static HTML, and then return later when resources are available to render the page and crawl and index content and links in the rendered HTML.
The time between crawling and rendering could historically take up to a week, which would be problematic for websites that rely on timely content (such as publishers). However, the median time between crawling and rendering was announced to be just 5 seconds at Google’s Chrome Dev Summit in 2019.
If for some reason the render is not this quick, then elements in the original response (such as meta data and canonicals) can be used for the page, until Google gets around to rendering it when resources are available. All pages will be rendered unless they have a robots meta tag or header instructing Googlebot not to index the page. So the initial HTML response needs to be consistent, and should be audited, even if you rely on a client-side approach.
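Auditing the initial response for a noindex can be sketched as follows. The `hasNoindex` helper is hypothetical; it assumes header names have already been lower-cased (as Node's built-in http module does) and that the meta tag lists `name` before `content`.

```javascript
// Check whether the initial response tells crawlers not to index the page,
// either through an X-Robots-Tag header or a robots meta tag in the static HTML.
function hasNoindex(headers, html) {
  const headerValue = String(headers['x-robots-tag'] || '').toLowerCase();
  if (headerValue.includes('noindex')) return true;
  const meta = /<meta\s+name=["']robots["']\s+content=["']([^"']*)["']/i.exec(html);
  return meta !== null && meta[1].toLowerCase().includes('noindex');
}

console.log(hasNoindex({}, '<meta name="robots" content="noindex, follow">'));
console.log(hasNoindex({ 'x-robots-tag': 'noindex' }, '<html></html>'));
console.log(hasNoindex({}, '<meta name="robots" content="index, follow">'));
```

A page that only removes its noindex through client-side JavaScript would still be flagged here, which matches the behaviour described above: the page would not be queued for rendering.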
Other search engines like Bing struggle to render and index JavaScript at scale and due to the fragility of JavaScript, it’s fairly easy to experience errors hindering the render, and indexing of content. Feature detection should be used, and errors should be handled gracefully with a fallback.
The purpose of this guide is not actually to go into lots of detail about JavaScript SEO, but more specifically, how to identify and crawl JavaScript websites with a client-side approach using our Screaming Frog SEO Spider software.
How To Identify JavaScript Sites
Identifying a site built using a JavaScript framework can be pretty simple, however, identifying sections, pages or just smaller elements which are dynamically adapted using JavaScript can be far more challenging.
There’s a number of ways you’ll know whether the site is built using a JavaScript framework.
Crawling
This is a start point for many, and you can just go ahead and start a crawl of a website with the standard configuration. By default, the SEO Spider will crawl using the ‘old AJAX crawling scheme’, which means JavaScript is disabled, but the old AJAX crawling scheme will be adhered to if set up correctly by the website.
If the site uses JavaScript and is set up with escaped-fragment (#!) URLs and HTML snapshots as per Google’s old AJAX crawling scheme, then it will be crawled and URLs will appear under the ‘AJAX’ tab in the SEO Spider. This tab only includes pages using the old AJAX crawling scheme specifically, not every page that uses AJAX.
The AJAX tab shows both ugly and pretty versions of URLs, and like Google, the SEO Spider fetches the ugly version of the URL and maps the pre-rendered HTML snapshot to the pretty URL. Some AJAX sites or pages may not use hash fragments, so the meta fragment tag can be used to recognise an AJAX page for crawlers.
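The pretty-to-ugly mapping described above can be sketched in a few lines. This is a simplified illustration based on Google's old AJAX crawling scheme, not the SEO Spider's actual code: the function names are hypothetical, only a handful of special characters are escaped, and the meta fragment check assumes the `name` attribute comes before `content`.

```javascript
// Convert a "pretty" hash-bang URL into the "ugly" _escaped_fragment_ URL
// that a crawler would request under the old AJAX crawling scheme.
function toUglyUrl(prettyUrl) {
  const marker = prettyUrl.indexOf('#!');
  if (marker === -1) return null; // not a hash-bang URL

  const base = prettyUrl.slice(0, marker);
  const fragment = prettyUrl.slice(marker + 2);

  // Escape the characters that would otherwise alter the query string
  // (simplified: the full scheme lists a wider range of characters).
  const escaped = fragment.replace(/[%#&+ ]/g, (ch) =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'));

  const separator = base.includes('?') ? '&' : '?';
  return base + separator + '_escaped_fragment_=' + escaped;
}

// Pages without a hash fragment can opt in with a meta fragment tag instead.
function hasMetaFragmentTag(html) {
  return /<meta\s+name=["']fragment["']\s+content=["']!["']\s*\/?>/i.test(html);
}

console.log(toUglyUrl('https://example.com/#!key=value'));
```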
If the site is built using JavaScript but doesn’t adhere to the old crawling scheme or pre-render pages, then you may find only the homepage is crawled with a 200 OK response and perhaps a couple of JavaScript and CSS files, but not much else.
You’ll find that the page has virtually no ‘outlinks’ in the tab at the bottom of the tool, as they are not being rendered and hence can’t be seen.
In the example screen shot above, the ‘outlinks’ tab in the SEO Spider shows JS and CSS files on the page only.
Client Q&A
This should really be the first step. One of the simplest ways to find out about a website is to speak to the client and the development team and ask the question. What’s the site built in? What CMS is it using?
Pretty sensible questions and you might just get a useful answer.
Disable JavaScript
You can turn JavaScript off in your browser and view content available. This is possible in Chrome using the built-in developer tools, or if you use Firefox, the web developer toolbar plugin has the same functionality. Is content available with JavaScript turned off? You may just see a blank page.
It’s typically also useful to disable cookies and CSS during an audit, to help diagnose other crawling issues.
Audit The Source Code
This is a simple one: right click and view the raw HTML source code. Is there actually much text and HTML content? Often there are signs and hints to JS frameworks and libraries used. Are you able to see the content and hyperlinks rendered in your browser within the HTML source code?
You’re viewing the code before it’s processed by the browser, which is also what the SEO Spider will crawl when not in JavaScript rendering mode.
If you run a search and can’t find them within the source, then they will be dynamically generated in the DOM and will only be viewable in the rendered code.
If the body is pretty much empty like the above example, it’s a pretty clear indication.
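If you'd rather script this check than eyeball the source, the sketch below counts visible text and links in a raw HTML string. It is only a heuristic using regular expressions rather than a real HTML parser, and the function name summariseRawHtml is our own invention for the example.

```javascript
// Summarise how much crawlable content is in the raw (unrendered) HTML.
function summariseRawHtml(html) {
  // Look only at the <body>, if there is one.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  // Scripts and styles are not visible content, so drop them first.
  const withoutCode = body.replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, ' ');

  // Strip the remaining tags and collapse whitespace to get visible text.
  const text = withoutCode.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
  const linkCount = (withoutCode.match(/<a\b[^>]*\bhref\s*=/gi) || []).length;

  return { textLength: text.length, linkCount };
}

// A typical client-side rendered "shell": an empty root element and a bundle.
const shell = '<html><body><div id="root"></div>' +
  '<script src="/bundle.js"></script></body></html>';
console.log(summariseRawHtml(shell));
```

A text length and link count at or near zero in the raw source, for a page that shows plenty of content in the browser, is a strong hint that the content is generated client-side.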
Audit The Rendered Code
How different is the rendered code to the static HTML source? By right clicking and using ‘inspect element’ in Chrome, you can view the rendered HTML. You can often see the JS Framework name in the rendered code, like ‘React’ in the example below.
You will find that the content and hyperlinks are in the rendered code, but not the original HTML source code. This is what the SEO Spider will see, when in JavaScript rendering mode.
By clicking on the opening HTML element, then ‘copy > outerHTML’ you can compare the rendered source code, against the original source.
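Once you have both versions saved, the comparison can be partly automated. The following sketch lists hyperlinks that exist in the rendered HTML but not in the original source. The function names are hypothetical, and the regular expression only handles quoted href values, so treat it as a quick check rather than a full parser.

```javascript
// Collect the href values of all <a> elements in an HTML string.
function extractHrefs(html) {
  const hrefs = new Set();
  const pattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    hrefs.add(match[1]);
  }
  return hrefs;
}

// Links found in the rendered DOM that are missing from the raw source,
// i.e. links a non-rendering crawler would never discover.
function linksOnlyInRendered(rawHtml, renderedHtml) {
  const raw = extractHrefs(rawHtml);
  return [...extractHrefs(renderedHtml)].filter((href) => !raw.has(href));
}

const rawSource = '<body><div id="root"></div><a href="/about">About</a></body>';
const renderedSource = '<body><div id="root"><a href="/products/1">One</a>' +
  '<a href="/products/2">Two</a></div><a href="/about">About</a></body>';
console.log(linksOnlyInRendered(rawSource, renderedSource));
```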
Toolbars & Plugins
Various toolbars and plugins such as the BuiltWith toolbar, Wappalyzer and JS library detector for Chrome can help identify the technologies and frameworks being utilised on a web page at a glance.
These are not always accurate, but can provide some valuable hints, without much work.
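To give a feel for how this kind of detection can work, here is a small sketch that looks for well-known framework markers in a page's HTML. The list of markers is our own selection and is far from complete (some only appear in certain framework versions), so, like the plugins themselves, it provides hints rather than proof. It is not how any of the tools above are actually implemented.

```javascript
// A few markers that frameworks commonly leave in the HTML.
// This list is illustrative only and varies between framework versions.
const FRAMEWORK_HINTS = [
  { name: 'Angular', pattern: /\bng-version=/i },
  { name: 'AngularJS', pattern: /\bng-app\b/i },
  { name: 'React', pattern: /\bdata-reactroot\b|\bdata-reactid\b/i },
  { name: 'Next.js', pattern: /__NEXT_DATA__/ },
  { name: 'Nuxt', pattern: /__NUXT__|id=["']__nuxt["']/ },
  { name: 'Vue', pattern: /\bdata-v-[0-9a-f]{8}\b|\bdata-server-rendered\b/i },
];

// Return the names of every framework whose marker appears in the HTML.
function detectFrameworkHints(html) {
  return FRAMEWORK_HINTS
    .filter((hint) => hint.pattern.test(html))
    .map((hint) => hint.name);
}

console.log(detectFrameworkHints('<app-root ng-version="12.2.0"></app-root>'));
```

An empty result doesn't mean a page is free of JavaScript frameworks, which is why manual checks are still needed.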
Manual Auditing Is Still Required
These points should help you identify sites that are built using a JS framework fairly easily. However, further analysis is always recommended to discover JavaScript elements, with a manual inspection of page templates, auditing different content areas and elements which might require user interaction.
We see lots of e-commerce websites relying on JavaScript to load products onto category pages, which is often missed by webmasters and SEOs until they realise product pages are not being crawled in standard (non-rendering) crawls.
Additionally, you can support a manual audit by crawling a selection of templates and pages from across the website, with JavaScript both disabled and enabled, and analysing any differences in elements and content. Sometimes websites use variables for elements like titles, meta tags or canonicals, which are extremely difficult to pick up by the eye only.
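Differences in elements like titles and canonicals are much easier to spot programmatically. The sketch below compares a few key elements between the raw and rendered HTML of the same page. The function names are hypothetical, and the patterns assume the `rel` and `name` attributes appear before `href` and `content`, which won't hold for every site.

```javascript
// Pull out a few elements that are commonly set or changed by JavaScript.
function extractSeoElements(html) {
  const pick = (pattern) => {
    const match = html.match(pattern);
    return match ? match[1].trim() : null;
  };
  return {
    title: pick(/<title[^>]*>([\s\S]*?)<\/title>/i),
    canonical: pick(/<link\b[^>]*\brel=["']canonical["'][^>]*\bhref=["']([^"']*)["']/i),
    robots: pick(/<meta\b[^>]*\bname=["']robots["'][^>]*\bcontent=["']([^"']*)["']/i),
  };
}

// Report only the elements that differ between raw and rendered HTML.
function diffSeoElements(rawHtml, renderedHtml) {
  const raw = extractSeoElements(rawHtml);
  const rendered = extractSeoElements(renderedHtml);
  const changes = {};
  for (const key of Object.keys(raw)) {
    if (raw[key] !== rendered[key]) {
      changes[key] = { raw: raw[key], rendered: rendered[key] };
    }
  }
  return changes;
}

// Example: the raw HTML still contains an unfilled template variable.
const rawPage = '<head><title>{{ page.title }}</title>' +
  '<link rel="canonical" href="https://example.com/"></head>';
const renderedPage = '<head><title>Red Shoes</title>' +
  '<link rel="canonical" href="https://example.com/red-shoes"></head>';
console.log(diffSeoElements(rawPage, renderedPage));
```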
We recommend reading Justin Briggs’s guide to auditing JavaScript for SEO, which goes into far more practical detail about this analysis phase.
How To Crawl JavaScript Using The SEO Spider
Once you have identified client-side JavaScript you want to crawl, you’ll next need to switch the SEO Spider to JavaScript rendering mode. This will allow you to crawl dynamic, JavaScript rich websites and frameworks, such as Angular and React.
The following 8 steps should help you configure a crawl for most cases encountered.
1) Configure Rendering To ‘JavaScript’
To crawl a JavaScript website, open up the SEO Spider, click ‘Configuration > Spider > Rendering’ and change ‘Rendering’ to ‘JavaScript’.
2) Check Resources & External Links
Ensure resources such as images, CSS and JS are ticked under ‘Configuration > Spider’.
If resources are on a different subdomain, or a separate root domain, then ‘check external links‘ should be ticked, otherwise they won’t be crawled and hence rendered either.
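To see which of a page's resources would count as external, you can compare hostnames. This is a simplified sketch that treats any different hostname (including a different subdomain) as external. The function name is our own, and the SEO Spider's own rules for internal and external URLs may be more nuanced.

```javascript
// A resource is treated as external when its hostname differs from the
// page's hostname. Relative URLs are resolved against the page URL first.
function isExternalResource(pageUrl, resourceUrl) {
  const page = new URL(pageUrl);
  const resource = new URL(resourceUrl, page);
  return resource.hostname !== page.hostname;
}

const pageAddress = 'https://www.example.com/category/shoes';
const resources = [
  '/static/app.js',                         // same host
  'https://cdn.example.com/app.css',        // different subdomain
  'https://images.example-cdn.net/red.jpg', // separate root domain
];
console.log(resources.filter((url) => isExternalResource(pageAddress, url)));
```

If the resources needed to render the page show up in this list, they will only be fetched when external links are being checked.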
This is the default configuration in the SEO Spider, so you can simply click ‘File > Default Config > Clear Default Configuration’ to revert to this set-up.
3) Configure User-Agent & Window Size
You can configure both the user-agent under ‘Configuration > HTTP Header > User-Agent’ and window size by clicking ‘Configuration > Spider > Rendering’ in JavaScript rendering mode to your own requirements.
This is an optional step, as the window size is set to Googlebot’s desktop dimensions in the standard configuration. Google are expected to move to a mobile-first index soon, so if you’re performing a mobile audit you can configure the SEO Spider to mimic Googlebot for Smartphones.
4) Crawl The Website
Now type or paste in the website you wish to crawl in the ‘enter url to spider’ box and hit ‘Start’.
The crawling experience is quite different to a standard crawl, as it can take time for anything to appear in the UI to start with, then all of a sudden lots of URLs appear together at once. This is due to the SEO Spider waiting for all the resources to be fetched to render a page before the data is displayed.
5) Monitor Blocked Resources
Keep an eye on anything appearing under the ‘Blocked Resource’ filter within the ‘Response Codes’ tab. You can glance at the right-hand overview pane, rather than click on the tab specifically. If JavaScript, CSS or images are blocked via robots.txt (or don’t respond, or error), then this will impact rendering, crawling and indexing.
Blocked resources can also be viewed for each page within the ‘Rendered Page’ tab, adjacent to the rendered screen shot in the lower window pane. In severe cases, if a JavaScript site blocks JS resources completely, then the site simply won’t crawl.
If key resources which impact the render are blocked, then unblock them to crawl (or allow them using the custom robots.txt for the crawl). You can test different scenarios using both the exclude and custom robots.txt features.
The pages this impacts and the individual blocked resources can also be exported in bulk via the ‘Bulk Export > Response Codes > Blocked Resource Inlinks’ report.
6) View Rendered Pages
You can view the rendered page the SEO Spider crawled in the ‘Rendered Page’ tab which dynamically appears at the bottom of the user interface when crawling in JavaScript rendering mode. This populates the lower window pane when selecting URLs in the top window.
Viewing the rendered page is vital when analysing what a modern search bot is able to see and is particularly useful when performing a review in staging, where you can’t use Google’s own Fetch & Render in Search Console.
If you have adjusted the user-agent and viewport to Googlebot Smartphone, you can see exactly how every page renders on mobile for example.
If you spot any problems in the rendered page screen shots and it isn’t due to blocked resources, you may need to consider adjusting the AJAX timeout, or digging deeper into the rendered HTML source code for further analysis.
7) Compare Raw & Rendered HTML
You may wish to store and view HTML and rendered HTML within the SEO Spider when working with JavaScript. This can be set up under ‘Configuration > Spider > Extraction’ by ticking the appropriate ‘store HTML’ and ‘store rendered HTML’ options.
This then populates the lower window ‘view source’ pane, to enable you to compare the differences, and be confident that critical content or links are present within the DOM.
This is super useful for a variety of scenarios, such as debugging the differences between what is seen in a browser and in the SEO Spider, or just when analysing how JavaScript has been rendered, and whether certain elements are within the code.
8) Adjust The AJAX Timeout
Based upon the responses of your crawl, you can choose when the snapshot of the rendered page is taken by adjusting the ‘AJAX timeout‘ which is set to 5 seconds, under ‘Configuration > Spider > Rendering’ in JavaScript rendering mode.
Previous internal testing indicated that Googlebot takes their snapshot of the rendered page at 5 seconds, which many in the industry concurred with when we discussed it more publicly in 2016.
Our tests indicate Googlebot is willing to wait (approx) 5 secs for their snapshot of rendered content btw. Needs to be in well before then.
— Screaming Frog (@screamingfrog) October 13, 2016
In reality, this test was run via Google Search Console, and real-life Googlebot is more flexible than the above. It adapts based upon how long a page takes to load content, considering network activity, and things like caching play a part. However, Google obviously won’t wait forever, so content that you want to be crawled and indexed needs to be available quickly, or it simply won’t be seen. We’ve seen cases of misfiring JS causing the render to load much later, and entire websites plummeting in rankings due to pages suddenly being indexed and scored with virtually no content.
It’s worth noting that a crawl by our software will often be more resource intensive than a regular Google crawl over time. This might mean that the site response times are typically slower, and the AJAX timeout requires adjustment.
You’ll know this might need to be adjusted if the site fails to crawl properly, ‘response times’ in the ‘Internal’ tab are longer than 5 seconds, or web pages don’t appear to have loaded and rendered correctly in the ‘rendered page’ tab.
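To make the effect of the timeout concrete, here is a toy model of what ends up in a snapshot. It is not how the SEO Spider or Googlebot works internally: the function names, the `appearsAtMs` field and the sample timings are all hypothetical, chosen only to show that anything injected after the timeout is missed.

```javascript
// Each item is a piece of content that JavaScript adds to the page, with the
// number of milliseconds after the initial load at which it appears.
function contentVisibleAtSnapshot(injections, timeoutMs) {
  return injections
    .filter((item) => item.appearsAtMs <= timeoutMs)
    .map((item) => item.content);
}

// Flag a crawl where any response time exceeds the current timeout.
function needsLongerTimeout(responseTimesMs, timeoutMs) {
  return responseTimesMs.some((time) => time > timeoutMs);
}

const injections = [
  { content: 'navigation', appearsAtMs: 300 },
  { content: 'product list', appearsAtMs: 2500 },
  { content: 'reviews', appearsAtMs: 7000 }, // arrives after a 5 second snapshot
];

console.log(contentVisibleAtSnapshot(injections, 5000));
console.log(needsLongerTimeout([1200, 6400, 900], 5000));
```

In this model, raising the timeout to 8 seconds would capture the reviews as well, at the cost of a slower crawl.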
How To Crawl JavaScript Video
If you prefer video, then check out our tutorial on crawling JavaScript.
Closing Thoughts
The guide above should help you identify JavaScript websites and crawl them efficiently using the Screaming Frog SEO Spider tool in JavaScript rendering mode.
While we have performed plenty of research internally and worked hard to mimic Google’s own rendering capabilities, a crawler is still only ever a simulation of real search engine bot behaviour.
We highly recommend using log file analysis and Google’s own URL Inspection Tool, or using the relevant version of Chrome to fully understand what they are able to crawl, render and index, alongside a JavaScript crawler.
Additional Reading
Understand the JavaScript SEO Basics – From Google.
Core Principles of JS SEO – From Justin Briggs.
Progressive Web Apps Fundamentals Guide – From Builtvisible.
Crawling JS Rich Sites – From Onely.
If you experience any problems when crawling JavaScript, or encounter any differences between how we render and crawl, and Google, we’d love to hear from you. Please get in touch with our support team directly.

Frequently Asked Questions about web crawler javascript

What is Web crawling in JavaScript?

Introduction To Crawling JavaScript JS, single page applications (SPAs) and progressive web apps (PWAs). … URLs and HTML snapshots in October ’15, and are now generally able to render and understand web pages like a modern-day browser.

Can you use JavaScript for web scraping?

JavaScript is a great language to use for a web scraper: not only is Node fast, but you’ll likely end up using a lot of the same methods you’re used to from querying the DOM with front-end JavaScript.

Does Googlebot run JavaScript?

Googlebot processes JavaScript web apps in three main phases: Crawling. Rendering. Indexing.

About the author

proxyreview
