Web Scraping Using XPath

Practical XPath for Web Scraping – ScrapingBee
06 November, 2019
7 min read
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or, in our case, an HTML document). Even though XPath is not a programming language in itself, it allows you to write expressions that can directly access a specific HTML element without having to walk through the entire HTML tree.
It looks like the perfect tool for web scraping, right? At ScrapingBee we love XPath!
In our previous article about web scraping with Python, we talked a little bit about XPath expressions. Now it's time to dig a bit deeper into this subject.
Why learn XPath
Knowing how to use basic XPath expressions is a must-have skill when extracting data from a web page.
It's more powerful than CSS selectors.
It allows you to navigate the DOM in any direction.
It can match text inside HTML elements (see the example just below).
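For instance, here is what that last point looks like in practice (the "Next page" link text is just an illustrative assumption):
//a[contains(text(), "Next page")]
This expression selects every a element whose text contains "Next page", something plain CSS selectors cannot do.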
Entire books have been written on XPath, and I don't have the pretension to explain everything in depth. This is an introduction to XPath, and we will see through real examples how you can use it for your web scraping needs.
But first, let’s talk a little about the DOM
Document Object Model
I am going to assume you already know HTML, so this is just a small reminder.
As you already know, a web page is a document containing text within tags, which add meaning to the document by describing elements like titles, paragraphs, lists, and links.
Let's see a basic HTML page to understand what the Document Object Model is.

<html>
  <head>
    <title>What is the DOM?</title>
  </head>
  <body>
    <h1>DOM 101</h1>
    <p>Web scraping is awesome!</p>
  </body>
</html>
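As a quick illustration, here is a minimal sketch of how an XPath expression addresses a node in that DOM (it uses the lxml library, which is an assumption on my part; this article itself uses Selenium later on):

from lxml import html

# Parse the example page and select the <h1> node's text with an XPath expression
page = """<html><head><title>What is the DOM?</title></head>
<body><h1>DOM 101</h1><p>Web scraping is awesome!</p></body></html>"""
tree = html.fromstring(page)
print(tree.xpath("//h1/text()"))  # ['DOM 101']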
A login form almost always contains a password input, i.e. an <input type="password"> tag. So we can select this password input with a simple: //input[@type='password']
Once we have this password input, we can use a relative path to select the username/email input. It will generally be the first preceding input that isn't hidden: .//preceding::input[not(@type='hidden')]
It's really important to exclude hidden inputs, because most of the time you will have at least one CSRF token hidden input. CSRF stands for Cross-Site Request Forgery. The token is generated by the server and is required in every form submission / POST request. Almost every website uses this mechanism to prevent CSRF attacks.
Now we need to select the enclosing form from one of the inputs: .//ancestor::form
And from the form, we can select the submit input/button: .//*[@type='submit']
Here is an example of such a function:
def autologin(driver, url, username, password):
    # driver is a Selenium WebDriver instance
    driver.get(url)
    password_input = driver.find_element_by_xpath("//input[@type='password']")
    password_input.send_keys(password)
    # The username/email field is usually the first non-hidden input before the password field
    username_input = password_input.find_element_by_xpath(".//preceding::input[not(@type='hidden')]")
    username_input.send_keys(username)
    # Walk up to the enclosing form, then down to its submit button
    form_element = password_input.find_element_by_xpath(".//ancestor::form")
    submit_button = form_element.find_element_by_xpath(".//*[@type='submit']")
    submit_button.click()
    return driver
Of course it is far from perfect; it won't work everywhere, but you get the idea.
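Here is a minimal usage sketch (the URL and credentials are placeholders; note that find_element_by_xpath is the Selenium 3 API current when this was written, while Selenium 4 replaces it with driver.find_element(By.XPATH, ...)):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes a chromedriver is available on your PATH
autologin(driver, "https://example.com/login", "user@example.com", "s3cret")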
Conclusion
XPath is very powerful when it comes to selecting HTML elements on a page, and often more powerful than CSS selectors.
One of the most difficult tasks when writing XPath expressions is not the expression itself, but making it precise enough to select the right element while remaining resilient enough to survive DOM changes.
At ScrapingBee, depending on our needs, we use XPath expressions or CSS selectors for our ready-made APIs. We will discuss the differences between the two in another blog post!
I hope you enjoyed this article. If you're interested in CSS selectors, check out this BeautifulSoup tutorial.
Happy Scraping!
Web Scraping using lxml and XPath in Python – GeeksforGeeks

Prerequisites: Introduction to Web Scraping
In this article, we will discuss the lxml Python library, which is built on top of the libxml2 XML parsing library written in C, and use it to scrape data from a webpage. Compared to other Python web scraping libraries like BeautifulSoup and Selenium, the lxml package has an advantage in terms of performance: reading and writing large XML files takes very little time, making data processing easier and much faster. We will be using the lxml library for web scraping and the requests library for making HTTP requests in Python. These can be installed from the command line using the pip package installer for Python.
Getting data from an element on the webpage using lxml requires the use of XPath. XPath works very much like a traditional file system.
[Diagram of a file system]
To access file 1: C:/File1
Similarly, to access file 2: C:/Documents/User1/File2
Now consider a simple web page:
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my page</h2>
  </body>
</html>
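Putting lxml and requests together, here is a hedged sketch of fetching and querying such a page (the URL is a placeholder, assumed to serve the HTML above; install the libraries first with pip install lxml requests):

import requests
from lxml import html

# Retrieve the page and parse the returned bytes with lxml's HTML parser
response = requests.get("https://example.com/mypage")
tree = html.fromstring(response.content)

# Like a file path, /html/body/h2 walks from the root down to the heading
print(tree.xpath("/html/body/h2/text()"))  # ['Welcome to my page']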

An Introduction To XPath: How To Get Started – Zyte

XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document and is actually one of the languages that you can use to extract web data using Scrapy. The other is CSS and while CSS selectors are a popular choice, XPath can actually allow you to do more.
With XPath, you can extract data based on text elements' contents, and not only on the page structure. So when you are scraping the web and you run into a hard-to-scrape website, XPath may just save the day (and a bunch of your time!).
This is an introductory tutorial that will walk you through the basic concepts of XPath, crucial to a good understanding of it, before diving into more complex use cases.
Note: You can use the XPath playground to experiment with XPath. Just paste the HTML samples provided in this post and play with the expressions.
The basics
Consider this HTML document:
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my page</h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
XPath handles any XML/HTML document as a tree. This tree's root node is not part of the document itself: it is in fact the parent of the document element node (html in the case of the HTML above). This is what the XPath tree for the HTML document looks like:
root
└── html (element)
    ├── head (element)
    │   └── title (element)
    │       └── "My page" (text)
    └── body (element)
        ├── h2 (element)
        │   └── "Welcome to my page" (text)
        ├── p (element)
        │   └── "This is the first paragraph." (text)
        └── comment: "this is the end"
As you can see, there are many node types in an XPath tree:
Element node: represents an HTML element, a.k.a. an HTML tag.
Attribute node: represents an attribute from an element node, e.g. the href attribute in <a href="https://example.com">example</a>.
Comment node: represents comments in the document (<!-- this is the end -->).
Text node: represents the text enclosed in an element node, e.g. "example" in <p>example</p>.
Distinguishing between these different types is useful to understand how XPath expressions work. Now let’s start digging into XPath.
Here is how we can select the title element from the page above using an XPath expression:
/html/head/title
This is what we call a location path. It allows us to specify the path from the context node (in this case the root of the tree) to the element we want to select, as we do when addressing files in a file system. The location path above has three location steps, separated by slashes. It roughly means: start from the ‘html’ element, look for a ‘head’ element underneath, and a ‘title’ element underneath that ‘head’. The context node changes in each step. For example, the head node is the context node when the last step is being evaluated.
However, we usually don’t know or don’t care about the full explicit node-by-node path, we just care about the nodes with a given name. We can select them using:
//title
This means: look in the whole tree, starting from the root of the tree (//) and select only those nodes whose name matches title. In this example, // is the axis and title is the node test.
In fact, the expressions we’ve just seen are using XPath’s abbreviated syntax. Translating //title to the full syntax we get:
/descendant-or-self::node()/child::title
So, // in the abbreviated syntax is short for the descendant-or-self axis, which means the current node or any node below it in the tree. This part of the expression is called the axis and it specifies a set of nodes to select from, based on their direction in the tree from the current context (downwards, upwards, on the same tree level). Other examples of axes are parent, child, ancestor, etc. We'll dig more into this later on.
The next part of the expression, node(), is called a node test, and it contains an expression that is evaluated to decide whether a given node should be selected or not. In this case, it selects nodes of all types. Then we have another axis, child, which means go to the child nodes from the current context, followed by another node test, which selects the nodes named title.
So, the axis defines where in the tree the node test should be applied and the nodes that match the node test will be returned as a result.
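As a quick sanity check (a sketch using lxml, an assumption on my part since this tutorial does not name a library), both the abbreviated and the full syntax select the same nodes:

from lxml import html

doc = html.fromstring("<html><head><title>My page</title></head><body></body></html>")
# //title is shorthand for the expression below; both return the same node list
assert doc.xpath("//title") == doc.xpath("/descendant-or-self::node()/child::title")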
You can test nodes against their name or against their type.
Here are some examples of name tests:
Expression / Meaning
/html
Selects the node named html, which is under the root.
/html/head
Selects the node named head, which is under the html node.
//title
Selects all the title nodes from the HTML tree.
//h2/a
Selects all a nodes that are directly under an h2 node.
And here are some examples of node type tests:
//comment()
Selects only comment nodes.
//node()
Selects any kind of node in the tree.
//text()
Selects only text nodes, such as “This is the first paragraph”.
//*
Selects all nodes, except comment and text nodes.
We can also combine name and node tests in the same expression. For example:
//p/text()
This expression selects the text nodes from inside p elements. In the HTML snippet shown above, it would select "This is the first paragraph.".
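To see these node tests in action, here is a small sketch (using lxml, not part of the original tutorial text, against the sample document above):

from lxml import html

doc = html.fromstring("""
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my page</h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
""")
print(doc.xpath("//p/text()"))        # ['This is the first paragraph.']
print(len(doc.xpath("//comment()")))  # 1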
Now, let's see how we can further filter and specify things. Consider this HTML document:
<html>
  <body>
    <ul>
      <li class="quote">Quote 1</li>
      <li class="quote">Quote 2 with <a href="#">link</a></li>
      <li class="quote">Quote 3 with <a href="#">another link</a></li>
      <li class="quote"><h2>Quote 4 title</h2></li>
    </ul>
  </body>
</html>
Say we want to select only the first li node from the snippet above. We can do this with:
//li[position() = 1]
The expression surrounded by square brackets is called a predicate and it filters the node-set returned by //li (that is, all li nodes from the document) using the given condition. In this case, it checks each node’s position using the position() function, which returns the position of the current node in the resulting node-set (notice that positions in XPath start at 1, not 0). We can abbreviate the expression above to:
//li[1]
Both XPath expressions above would select the following element:
<li class="quote">Quote 1</li>
Check out a few more predicate examples:
//li[position() mod 2 = 0]
Selects the li elements at even positions.
//li[a]
Selects the li elements that enclose an a element.
//li[a or h2]
Selects the li elements that enclose either an a or an h2 element.
//li[a[text() = "link"]]
Selects the li elements that enclose an a element whose text is "link". Can also be written as //li[a/text() = "link"].
//li[last()]
Selects the last li element in the document.
So, a location path is basically composed of steps, which are separated by / and each step can have an axis, a node test, and a predicate. Here we have an expression composed of two steps, each one with an axis, a node test, and a predicate:
//li[4]/h2[text() = "Quote 4 title"]
And here is the same expression, written using the non-abbreviated syntax:
/descendant-or-self::node()
/child::li[position() = 4]
/child::h2[text() = "Quote 4 title"]
We can also combine multiple XPath expressions into a single one using the union operator |. For example, we can select all a and h2 elements in the document above using this expression:
//a | //h2
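To make the predicates and the union operator concrete, here is a small lxml sketch run against the quotes document above:

from lxml import html

quotes = html.fromstring("""
<ul>
  <li class="quote">Quote 1</li>
  <li class="quote">Quote 2 with <a href="#">link</a></li>
  <li class="quote">Quote 3 with <a href="#">another link</a></li>
  <li class="quote"><h2>Quote 4 title</h2></li>
</ul>
""")
print(quotes.xpath("//li[1]/text()"))      # ['Quote 1']
print(len(quotes.xpath("//a | //h2")))     # 3 (two links plus one heading)
print(len(quotes.xpath("//li[a or h2]")))  # 3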
Now, consider this HTML document:
<html>
  <body>
    <h1>This is the first title</h1>
    <p>This is the first paragraph.</p>
    <h1>This is the second title</h1>
    <p>This is the second paragraph.</p>
    <a href="https://scrapinghub.com">Scrapinghub</a>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>
Now we want to extract only the first paragraph after each of the titles. To do that, we can use the following-sibling axis, which selects all the siblings after the context node. Siblings are nodes that are children of the same parent; for example, all children nodes of the body tag are siblings. This is the expression:
//h1/following-sibling::p[1]
In this example, the context node that the following-sibling axis is applied to is each of the h1 nodes from the page.
What if we want to select only the text that is right before the footer? We can use the preceding-sibling axis:
//div[@id='footer']/preceding-sibling::text()[1]
In this case, we are selecting the first text node before the footer div ("A single paragraph, with no markup").
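Here is how those two sibling axes behave in a quick lxml sketch against the document above:

from lxml import html

page = html.fromstring("""
<html><body>
  <h1>This is the first title</h1>
  <p>This is the first paragraph.</p>
  <h1>This is the second title</h1>
  <p>This is the second paragraph.</p>
  A single paragraph, with no markup
  <div id="footer"><p>Footer text</p></div>
</body></html>
""")
print(page.xpath("//h1/following-sibling::p[1]/text()"))
# ['This is the first paragraph.', 'This is the second paragraph.']
text = page.xpath("//div[@id='footer']/preceding-sibling::text()[1]")[0]
print(text.strip())  # A single paragraph, with no markup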
XPath also allows us to select elements based on their text content. We can use this feature, along with the parent axis, to select the parent of the p element whose text is "Footer text":
//p[text() = "Footer text"]/..
The expression above selects <div id="footer"><p>Footer text</p></div>. As you may have noticed, we used .. here as a shortcut to the parent axis.
As an alternative to the expression above, we could use:
//*[p/text() = "Footer text"]
It selects, from all elements, the ones that have a p child whose text is "Footer text", getting the same result as the previous expression.
You can find additional axes in the XPath specification.
Wrap up
XPath is very powerful, and this post is just an introduction to the basic concepts. If you want to learn more about it, check out these resources:
XPath tips from the web scraping trenches
And stay tuned, because we will post a series with more XPath tips from the trenches in the following months.

Frequently Asked Questions about Web Scraping Using XPath

How do you scrape using XPath?

Step-by-step approach:
1. Use requests.get to retrieve the web page with our data.
2. Use html.fromstring to parse the content using the lxml parser.
3. Create the correct XPath query and use the lxml xpath function to get the required element. (Oct 11, 2020)

What is XPath in web scraping?

XPath is a powerful language that is often used for scraping the web. It allows you to select nodes or compute values from an XML or HTML document and is actually one of the languages that you can use to extract web data using Scrapy. (Oct 27, 2016)

What is XPath in Python?

XPath is one of the locators used in Selenium to identify elements uniquely on a web page. It traverses the DOM to reach the desired element having a particular attribute, with or without a tag name. An XPath locator can be written as //tagname[@attribute='value']. (Jul 29, 2020)
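For illustration, here is a minimal Selenium sketch of that locator pattern (the URL is a placeholder, and this uses the current Selenium 4 find_element API):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")
# Locate an element with the //tagname[@attribute='value'] pattern described above
password_field = driver.find_element(By.XPATH, "//input[@type='password']")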

About the author

proxyreview

If you're an SEO / IM geek like us, then you'll love our updates and our website. Follow us for the latest news in the world of web automation tools & proxy servers!
