Cheerio Js Tutorial

cheerio

Fast, flexible & lean implementation of core jQuery designed specifically for the server.
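const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>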


Note
We are currently working on the 1.0.0 release of cheerio on the main branch. The source code for the last published version, 0.22.0, can be found here.
Installation
npm install cheerio
Features
❤ Familiar syntax:
Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast:
Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient.
❁ Incredibly flexible:
Cheerio wraps around the parse5 parser and can optionally use @FB55’s forgiving htmlparser2. Cheerio can parse nearly any HTML or XML document.
Cheerio is not a web browser
Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript which is common for a SPA (single page application). This makes Cheerio much, much faster than other solutions. If your use case requires any of this functionality, you should consider browser automation software like Puppeteer and Playwright or DOM emulation projects like JSDom.
API
Markup example we’ll be using:

<ul id="fruits">
  <li class="apple">Apple</li>
  <li class="orange">Orange</li>
  <li class="pear">Pear</li>
</ul>

This is the HTML markup we will be using in all of the API examples.
Loading
First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.
This is the preferred method:
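// ES6 or TypeScript:
import * as cheerio from 'cheerio';

// In other environments:
const cheerio = require('cheerio');

const $ = cheerio.load('<ul id="fruits">...</ul>');

$.html();
//=> <html><head></head><body><ul id="fruits">...</ul></body></html>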


Similar to web browser contexts, load will introduce <html>, <head>, and <body> elements if they are not already present. You can set load’s third argument to false to disable this.
const $ = cheerio.load('<ul id="fruits">...</ul>', null, false);

$.html();
//=> '<ul id="fruits">...</ul>'


Optionally, you can also load in the HTML by passing the string as the context:
$('ul', '<ul id="fruits">...</ul>');
Or as the root:
$('li', 'ul', '<ul id="fruits">...</ul>');
If you need to modify parsing options for XML input, you may pass an extra
object to .load():
const $ = cheerio.load('<ul id="fruits">...</ul>', {
  xml: {
    normalizeWhitespace: true,
  },
});
The options in the xml object are taken directly from htmlparser2, therefore any options that can be used in htmlparser2 are valid in cheerio as well. When xml is set, the default options are:
{
  xmlMode: true,
  decodeEntities: true, // Decode HTML entities.
  withStartIndices: false, // Add a `startIndex` property to nodes.
  withEndIndices: false, // Add an `endIndex` property to nodes.
}
For a full list of options and their effects, see domhandler and
htmlparser2’s options.
Using htmlparser2
Cheerio ships with two parsers, parse5 and htmlparser2. The
former is the default for HTML, the latter the default for XML.
Some users may wish to parse markup with the htmlparser2 library, and
traverse/manipulate the resulting structure with Cheerio. This may be the case
for those upgrading from pre-1.0 releases of Cheerio (which relied on
htmlparser2), for those dealing with invalid markup (because htmlparser2 is
more forgiving), or for those operating in performance-critical situations
(because htmlparser2 may be faster in some cases). Note that “more forgiving”
means htmlparser2 has error-correcting mechanisms that aren’t always a match
for the standards observed by web browsers. This behavior may be useful when
parsing non-HTML content.
To support these cases, load also accepts a htmlparser2-compatible data
structure as its first argument. Users may install htmlparser2, use it to
parse input, and pass the result to load:
// Usage as of htmlparser2 version 6:
const htmlparser2 = require('htmlparser2');
const dom = htmlparser2.parseDocument(document, options);

const $ = cheerio.load(dom);
If you want to save some bytes, you can use Cheerio’s slim export, which
always uses htmlparser2:
const cheerio = require('cheerio/lib/slim');
Selectors
Cheerio’s selector implementation is nearly identical to jQuery’s, so the API is very similar.
$( selector, [context], [root])
selector searches within the context scope which searches within the root scope. selector and context can be a string expression, DOM Element, array of DOM elements, or cheerio object. root is typically the HTML document string.
This selector method is the starting point for traversing and manipulating the document. Like jQuery, it’s the primary method for selecting elements in the document.
$('.apple', '#fruits').text();
//=> Apple

$('ul .pear').attr('class');
//=> pear

$('li[class=orange]').html();
//=> Orange
XML Namespaces
You can select with XML Namespaces but due to the CSS specification, the colon (:) needs to be escaped for the selector to be valid.
$('[xml\\:id="main"]');
Rendering
When you’re ready to render the document, you can call the html method on the “root” selection:
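$.root().html();
//=>  <html>
//      <head></head>
//      <body>
//        <ul id="fruits">
//          <li class="apple">Apple</li>
//          <li class="orange">Orange</li>
//          <li class="pear">Pear</li>
//        </ul>
//      </body>
//    </html>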
If you want to render the outerHTML of a selection, you can use the html utility function:
cheerio.html($('.pear'));
//=> <li class="pear">Pear</li>
You may also render the text content of a Cheerio object using the text static method:
const $ = cheerio.load('This is <em>content</em>.');
cheerio.text($('body'));
//=> This is content.
    Plugins
    Once you have loaded a document, you may extend the prototype or the equivalent fn property with custom plugin methods:
    const $ = cheerio.load('<html><body>Hello, <b>world</b>!</body></html>');
    $.prototype.logHtml = function () {
      console.log(this.html());
    };

    $('body').logHtml(); // logs "Hello, <b>world</b>!" to the console
    If you’re using TypeScript, you should add a type definition for your new method:
    declare module 'cheerio' {
      interface Cheerio<T> {
        logHtml(this: Cheerio<T>): void;
      }
    }
    The “DOM Node” object
    Cheerio collections are made up of objects that bear some resemblance to browser-based DOM nodes. You can expect them to define the following properties:
    tagName
    parentNode
    previousSibling
    nextSibling
    nodeValue
    firstChild
    childNodes
    lastChild
    Screencasts
    This video tutorial is a follow-up to Nettut’s “How to Scrape Web Pages with Node.js and jQuery”, using cheerio instead of JSDOM + jQuery. This video shows how easy it is to use cheerio and how much faster cheerio is than JSDOM + jQuery.
    Cheerio in the real world
    Are you using cheerio in production? Add it to the wiki!
    Sponsors
    Does your company use Cheerio in production? Please consider sponsoring this project! Your help will allow maintainers to dedicate more time and resources to its development and support.
    Backers
    Become a backer to show your support for Cheerio and help us maintain and improve this open source project.
    Special Thanks
    This library stands on the shoulders of some incredible developers. A special thanks to:
    • @FB55 for node-htmlparser2 & CSSSelect:
    Felix has a knack for writing speedy parsing engines. He completely re-wrote both @tautologistic’s node-htmlparser and @harry’s node-soupselect from the ground up, making both of them much faster and more flexible. Cheerio would not be possible without his foundational work.
    • @jQuery team for jQuery:
    The core API is the best of its class and despite dealing with all the browser inconsistencies the code base is extremely clean and easy to follow. Much of cheerio’s implementation and documentation is from jQuery. Thanks guys.
    • @visionmedia:
    The style, the structure, the open-source”-ness” of this library comes from studying TJ’s style and using many of his libraries. This dude consistently pumps out high-quality libraries and has always been more than willing to help or answer questions. You rock TJ.
    License
    MIT
    Cheerio tutorial - web scraping in JavaScript with ... - ZetCode


    last modified July 7, 2020
    This Cheerio tutorial shows how to do web scraping in JavaScript with the Cheerio
    module. Cheerio implements the core of jQuery designed for the server.
    Cheerio
    Cheerio is a fast, flexible, and lean implementation of core
    jQuery designed specifically for the server.
    In this tutorial we scrape HTML from a local web server. For the local
    web server, we use the local-web-server.






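    <html>
    <head>
        ...
        <title>Home page</title>
    </head>
    <body>
        <main>
            <h1>My website</h1>
            <p class="fpar">I am a JavaScript programmer.</p>
            <p>My hobbies are:</p>
            <ul>
                <li>Swimming</li>
                <li>Tai Chi</li>
                <li>Running</li>
                <li>Web development</li>
                <li>Reading</li>
                <li>Music</li>
            </ul>
        </main>
    </body>
    </html>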




    We will be working with this HTML file.
    Cheerio selectors
    In Cheerio, we use selectors to select tags of an HTML document.
    The selector syntax was borrowed from jQuery.
    The following is a partial list of available selectors:
    $("*") - selects all elements
    $("#first") - selects the element with id="first"
    $(".intro") - selects all elements with class="intro"
    $("div") - selects all <div> elements
    $("h2, div, p") - selects all <h2>, <div>, <p> elements
    $("li:first") - selects the first <li> element
    $("li:last") - selects the last <li> element
    $("li:even") - selects all even <li> elements
    $("li:odd") - selects all odd <li> elements
    $(":empty") - selects all elements that are empty
    $(":focus") - selects the element that currently has focus
    Installing Cheerio and other modules
    We install the cheerio module and two additional modules.
    $ nodejs -v
    v9.11.2
    We use Node version 9.11.2.
    $ sudo npm i cheerio
    $ sudo npm i request
    $ sudo npm i -g local-web-server
    We install cheerio, request, and
    local-web-server.
    $ ws
    Serving at t400:8000,,
    Inside the project directory, where we have the HTML
    file, we start the local web server. It automatically serves
    the file on three different locations.
    Cheerio title
    In the first example, we get the title of the document.
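    const cheerio = require('cheerio');
    const request = require('request');

    request({
        method: 'GET',
        url: 'http://localhost:8000'
    }, (err, res, body) => {
        if (err) return console.error(err);

        // Parse the response body into a queryable document
        let $ = cheerio.load(body);
        let title = $('title');
        console.log(title.text());
    });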
    The example prints the title of the HTML document.
    We include cheerio and request modules.
    With cheerio, we do web scraping. With request,
    we create GET requests.
    We create a GET request to the localhost which is served by our
    local web server. The resource is available in the body
    parameter.
    First, we load the HTML document. To mimic jQuery, we use the
    $ variable.
    The selector returns the title tag.
    console.log(title.text());
    With the text() method, we get the text of the title tag.
    $ node
    Home page
    The example prints the title of the document.
    Cheerio get parent element
    The parent element is retrieved with parent().
    // inside the request callback, after cheerio.load(body)
    let h1El = $('h1');
    let parentEl = h1El.parent();
    console.log(parentEl.get(0).tagName);
    We get the parent of the h1 element.
    main
    The parent element of h1 is main.
    Cheerio first & last element
    The first element of a cheerio object can be found with first(),
    the last element with last().
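    // inside the request callback, after cheerio.load(body)
    let main = $('main');
    let fel = main.children().first();
    let lel = main.children().last();
    console.log(fel.get(0).tagName);
    console.log(lel.get(0).tagName);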
    The example prints the first and last element of the main
    tag.
    We select the main tag.
    We get the first and the last element from the main children.
    We find out the tag names.
    h1
    ul
    The first tag of the main is h1, the last
    one is ul.
    Cheerio add element
    The append() method adds a new element at the end
    of the specified tag.
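    // inside the request callback, after cheerio.load(body)
    let ulEl = $('ul');
    ulEl.append('<li>Travel</li>');
    let lis = $('ul').html();
    let items = lis.split('\n');
    items.forEach((e) => {
        if (e) {
            console.log(e.replace(/(\s+)/g, ''));
        }
    });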
    In the example, we add a new list item to the ul element and
    print it to the console.
    We append a new hobby.
    We get the HTML of the ul tag.
    console.log(e.replace(/(\s+)/g, ''));
    We strip white spaces. Text data of elements contains lots of
    space.

    <li>Swimming</li>
    <li>TaiChi</li>
    <li>Running</li>
    <li>Webdevelopment</li>
    <li>Reading</li>
    <li>Music</li>
    <li>Travel</li>
    A new travel hobby was appended at the end of the list.
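Note that the replace() call above removes every run of whitespace, which is why Tai Chi is printed as TaiChi. If the spaces inside the text matter, trimming each line is a gentler alternative. The following is a minimal sketch in plain JavaScript; the html string is a hand-written stand-in for what $('ul').html() might return, not actual output.

```javascript
// Hand-written stand-in for the string returned by $('ul').html().
const html = [
  '',
  '            <li>Swimming</li>',
  '            <li>Tai Chi</li>',
  '            <li>Web development</li>',
  '        <li>Travel</li>',
].join('\n');

// Approach from the example: delete all whitespace, including inner spaces.
const stripped = html
  .split('\n')
  .filter((line) => line)
  .map((line) => line.replace(/(\s+)/g, ''));

// Alternative: remove only the indentation around each line.
const trimmed = html
  .split('\n')
  .map((line) => line.trim())
  .filter((line) => line.length > 0);

console.log(stripped);
console.log(trimmed);
```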
    Cheerio insert after element
    With after(), we can insert an element after a tag.
    // inside the request callback, after cheerio.load(body)
    $('main').after('<footer>This is a footer</footer>');
    console.log($.html());
    In the example, we insert a footer element after
    the main element.
    Cheerio loop over elements
    With each(), we can loop over elements.
    let hobbies = [];
    $('li').each(function (i, e) {
        hobbies[i] = $(this).text();
    });
    console.log(hobbies);
    The example loops over li tags of the ul
    and prints the text of the elements in an array.
    [ 'Swimming',
    'Tai Chi',
    'Running',
    'Web development',
    'Reading',
    'Music' ]
    This is the output.
    Cheerio get element attributes
    Attributes can be retrieved with the attr() function.
    // inside the request callback, after cheerio.load(body)
    let fpEl = $('h1 + p');
    let attrs = fpEl.attr();
    console.log(attrs);
    In the example, we get the attributes of the paragraph that is
    the immediate sibling of h1.
    { class: 'fpar' }
    The paragraph contains the fpar class.
    Cheerio filter elements
    We can use filter() to apply a filter on the elements.
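    // inside the request callback, after cheerio.load(body)
    let allEls = $('*');
    let filteredEls = allEls.filter(function (i, el) {
        // this === el
        return $(this).children().length > 3;
    });
    let items = filteredEls.get();
    items.forEach(e => {
        console.log(e.name);
    });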
    In the example, we find out all elements of the document that contain
    more than three children.
    The * selector selects all elements.
    On the retrieved elements, we apply a filter. An element is included
    in the filtered list only if it contains more than three children.
    console.log(e.name);
    We go through the filtered list and print the names of the elements.
    head
    main
    ul
    The head, main, and ul elements
    contain more than three children. The body is not included
    because it contains only one immediate child.
    In this tutorial, we have done web scraping in JavaScript with
    the Cheerio library.
    [Tutorial] Web Scraping with NodeJs and Cheerio - DEV ...


    In this article, we’ll cover the following topics:
    - What is Web Scraping?
    - What is Cheerio?
    - Scraping data with Cheerio and Axios (practical example)
    *A brief note: I’m not the Jedi Master in these subjects, but I’ve learned about this in the past months and now I want to share a little with you. If you are more familiar with these subjects feel free to correct me and enrich this post.
    What is Web Scraping?
    First, we need to understand Data Scraping and Crawlers.
    Data Scraping: The act of extracting (or scraping) data from a source, such as an XML file or a text file.
    Web Crawler: An agent that uses web requests to simulate the navigation between pages and websites.
    So, I like to think of Web Scraping as a technique that uses crawlers to navigate between web pages and then scrapes data from the HTML, XML, or JSON responses.
    What is Cheerio?
    Cheerio is an open-source library that will help us to extract relevant data from an HTML string.
    Cheerio has very rich docs and examples of how to use specific methods. It also has methods to modify an HTML, so you can easily add or edit an element, but in this article, we will only get elements from the HTML.
    Note that Cheerio is not a web browser: it does not make HTTP requests, so the HTML has to be fetched by other means.
    If you are familiar with jQuery, Cheerio syntax will be easy for you, because Cheerio uses jQuery selectors.
    You can check Cheerio’s docs here
    Scraping data with Cheerio and Axios
    Our target website in this article is Steam. We will get the Steam Weeklong Deals.
    If you inspect the page (Ctrl + Shift + I), you can see that the list of deals is inside a div with id="search_resultsRows":
    When we expand this div, we will notice that each item on this list is an <a> element inside the div with id="search_resultsRows":
    At this point, we know what web scraping is and we have some idea about the structure of the Steam site.
    So, let’s start coding!
    Before you start, make sure you have NodeJs installed on your machine. If you don’t, install it using your preferred package manager or download it from the official Node JS site by clicking here.
    First, create a folder for this project and navigate to the new folder:
    mkdir web-scraping-demo && cd web-scraping-demo
    Once in the new folder, you can run:
    yarn init -y
    or if you use npm:
    npm init
    To make HTTP requests I will use Axios, but you can use whatever library or API you want.
    Run:
    yarn add axios
    or, with npm:
    npm i axios
    After installing Axios, create a new file called scraper.js inside the project folder. Now create a function to make the request and fetch the HTML content.
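    // scraper.js
    const axios = require("axios").default;

    // Fetches the page and returns its HTML; logs and returns undefined on failure
    const fethHtml = async url => {
      try {
        const { data } = await axios.get(url);
        return data;
      } catch {
        console.error(
          `ERROR: An error occurred while trying to fetch the URL: ${url}`
        );
      }
    };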
    And here we start using Cheerio to extract data from the response, but first… We need to add Cheerio to our app:
    yarn add cheerio
    npm i cheerio
    Right, in the next block of code we will:
    1- Import cheerio and create a new function in the scraper.js file;
    2- Define the Steam page URL;
    3- Call our fethHtml function and wait for the response;
    4- Create a “selector” by loading the returned HTML into cheerio;
    5- Tell cheerio the path for the deals list, according to what we saw in the above image.
    const cheerio = require("cheerio");

    const scrapSteam = async () => {
      const steamUrl =
        "..."; // URL of the Steam Weeklong Deals page

      const html = await fethHtml(steamUrl);
      const selector = cheerio.load(html);

      // Here we are telling cheerio that the "<a>" collection
      // is inside a div with id 'search_resultsRows' and
      // this div is inside another with id 'search_result_container'.
      // So, 'searchResults' is a cheerio collection of "<a>" elements
      const searchResults = selector("body")
        .find("#search_result_container > #search_resultsRows > a");

      // Don't worry about this for now
      const deals = searchResults
        .map((idx, el) => {
          const elementSelector = selector(el);
          return extractDeal(elementSelector);
        })
        .get();

      return deals;
    };
    For this example, I will not get all the properties from each item. But you can get all the other properties as a challenge for you;)
    Note that for each <a> element in our deals list, we will call
    the extractDeal function that will receive our element “selector” as argument.
    The first property we will extract is the title. Look for the game title inside the HTML:
    Oh, now it’s time to implement our extractDeal function.
    const extractDeal = selector => {
      const title = selector
        .find(".responsive_search_name_combined")
        .find("div[class='col search_name ellipsis'] > span[class='title']")
        .text()
        .trim();

      return { title };
    };
    Using the same method, we can get the game release date:
    Inspecting the element on the Steam site:
    Then mapping the path in our function:
    const releaseDate = selector
      .find("div[class='col search_released responsive_secondrow']")
      .text()
      .trim();

    return { title, releaseDate };
    };
    Now we will get the deal’s link. As we saw before, every item of the deals list is an <a> element, so we just need to get its “href” attribute:
    const link = selector.attr("href").trim();

    return { title, releaseDate, link };
    };
    It’s time to get the prices. As we can see in the image below, the original price and the discounted price are inside the same div.
    So we will create a custom selector for this div with prices:
    const priceSelector = selector
      .find("div[class='col search_price_discount_combined responsive_secondrow']")
      .find("div[class='col search_price discounted responsive_secondrow']");
    And now we will get the original price inside the path “span > strike”:
    const originalPrice = priceSelector
      .find("span > strike")
      .text()
      .trim();

    return { title, releaseDate, originalPrice, link };
    };
    And finally, we will get the discounted price property. But… Notice that this value isn’t inside a specific HTML tag, so we have some different ways to get this value, but I will use a regular expression.
    // First I'll get the HTML from the cheerio object
    const pricesHtml = priceSelector.html().trim();
    // Then I'll get the groups that match this regex
    const matched = pricesHtml.match(/(<br>(.+\s[0-9].+.\d+))/);
    // Finally I'll take the last group's value
    const discountedPrice = matched[matched.length - 1];
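To make the regular expression step easier to follow, here is a small sketch in plain JavaScript that applies the same pattern to a hand-written string. The sample markup and prices are assumptions shaped like the price cell described above, and extractDiscountedPrice is a hypothetical helper name. It also guards against match() returning null, which happens when a row has no discounted price in this shape.

```javascript
// Hypothetical markup: the original price is struck through, the
// discounted price follows a <br> as bare text.
const pricesHtml = '<span><strike>R$ 19,99</strike></span><br>R$ 9,99';

const extractDiscountedPrice = (html) => {
  const matched = html.match(/(<br>(.+\s[0-9].+.\d+))/);
  // match() returns null when nothing matches, so guard before indexing.
  return matched ? matched[matched.length - 1].trim() : null;
};

const discounted = extractDiscountedPrice(pricesHtml);
const missing = extractDiscountedPrice('<span>Free</span>');

console.log(discounted, missing);
```

The pattern expects whitespace between the currency symbol and the first digit, so prices written without that space (for example $9.99) would need a different expression.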
    Right! Now we have scraped all the properties we want.
    Now we just need to export our scrapSteam function and then create our server.
    Our final scraper.js file contains the fethHtml, extractDeal, and scrapSteam functions built up in the steps above, with extractDeal now returning all five properties:
    return {
      title,
      releaseDate,
      originalPrice,
      discountedPrice,
      link
    };
    The file ends by exporting the scraper:
    module.exports = scrapSteam;
    So, we will create our Web API/server. I will use Hapi because we don’t need very advanced features for this example, but you are free to use Express, Koa, or whatever framework you want.
    yarn add @hapi/hapi
    npm i @hapi/hapi
    I copied and pasted the example from the Hapi documentation into a new file. Then, I created a route for “/deals”, imported and called our scrapSteam function:
    const Hapi = require("@hapi/hapi");
    const scrapSteam = require("./scraper");

    const init = async () => {
      const server = Hapi.server({
        port: 3000,
        host: "localhost"
      });

      // GET /deals runs the scraper and returns the result as JSON
      server.route({
        method: "GET",
        path: "/deals",
        handler: async (request, h) => {
          const result = await scrapSteam();
          return result;
        }
      });

      await server.start();
      console.log("Server running on %s", server.info.uri);
    };

    process.on("unhandledRejection", err => {
      console.log(err);
      process.exit(1);
    });

    init();
    Notes:
    1- Depending on when you are reading this article, it is possible to obtain different results based on current “Weeklong Deals”;
    2- Depending on where you are, the currency and price information may differ from mine;
    3- My results are shown in this format because I use Json Viewer extension with the Dracula theme.
    You can find the source code in my repo.
    I hope this article can help you someday. : D
    Feel free to share your opinion!


    About the author

    proxyreview
