The Ultimate Guide to Web Scraping
by Hartley Brody
Learn how to avoid the most common pitfalls and collect the data you need.
The book is designed to walk you from beginner to expert, honing your skills and helping you become a master craftsman in the art of web scraping.
Recently Updated!
Over 40 pages of new content including more code samples and better coverage of advanced topics.
You’ll learn about:
The most common complaints about web scraping, and why they probably don’t matter for you.
How modern websites send information to a browser, and how you can intercept and parse it.
How to find all the data you need on someone else’s website.
Common traps and anti-scraping tactics (and how to thwart them).
How to write a well-behaved scraper and be a good scraping citizen.
The book has many working code samples in python that you can copy/paste to use yourself. It'll walk through the process of scraping data from real websites, step-by-step.
The downloadable ebook package includes all of the following formats:
Kindle
iPads
Desktops & Laptops
Table of Contents
The Ultimate Guide to Web Scraping contains the following chapters:
Introduction to Web Scraping
Web Scraping as a Legitimate Data Collection Tool
Understand Web Technologies: What Your Browser is Doing Behind the Scenes
Pattern Discovery: Finding the Right URLs that Return the Data You’re Looking For
Pattern Discovery: Finding the Structure in an HTML Document
Hands On: Building a Simple Web Scraper with Python
Hands On: Storing the Scraped Data & Keeping Track of Progress
Scraping Data that’s Not in the Response HTML
Avoiding Common Scraping Pitfalls, Good Scraping Etiquette & Other Best Practices
How to Troubleshoot and Fix Your Web Scraping Code Without Pulling Your Hair Out
A Handy, Easy-To-Reference Web Scraping Cheat Sheet
Web Scraping Resources: A Beginner-Friendly Sandbox and Online Course
About the Author
Over the years, I've scraped dozens of websites — from music blogs and fashion retailers to undocumented JSON endpoints. I've learned a thing or two along the way, so I wrote an article in 2012 called I Don't Need No Stinking API: Web Scraping For Fun and Profit.
That article has now been viewed almost 500,000 times. It's helped me meet people from all over the world who are trying to navigate the wild world of web scraping. After dozens of conversations via email and Twitter, I finally decided I'd write a book.
Join Over 2,000 Delighted Readers
If you're not 100% satisfied with your purchase, simply reply to your confirmation email and I'll refund your purchase.
<div> elements with a class of country. Then we could loop over each of those elements and look for the following:
1. <h3> elements with a class of country-name to find the country's name
2. <span> elements with a class of country-capital to find the country's capital
3. <span> elements with a class of country-population to find the country's population
4. <span> elements with a class of country-area to find the country's area
Note that we don't even really need to get into the differences between a <div> element and an <h3> or <span> element, what they mean or how they're displayed. We just need to look for the patterns in how the website is using them to mark up the information we're looking to scrape.
Implementing the Patterns
Now that we’ve found our patterns, it’s time to code them up. Let’s start with some code that makes
a simple HTTP request to the page we want to scrape:
import requests

url = "https://scrapethissite.com/pages/simple/"  # the countries sandbox page on Scrape This Site
r = requests.get(url)

print "We got a {} response code from {}".format(r.status_code, url)
Save this code in a file on your Desktop, and then run it from the command line using the following command:
python ~/Desktop/<your-file>.py
You should see it print out the following:
We got a 200 response code from https://scrapethissite.com/pages/simple/
If so, congratulations! You’ve made your first HTTP request from a web scraping program and
verified that the server responded with a successful response status code.
Next, let’s update our program a bit to take a look at the actual HTML response that we got back
from the server with our request:
# print out the HTML response text
print r.text
If you scroll back up through your console after you run that script, you should see all of the HTML text that came back from the server. This is the same as what you'd see in your browser if you right-click > "View Source".
Now that we’ve got the actual HTML text of the response, let’s pass it off to our HTML parsing
library to be turned into a nested, DOM-like structure. Once it’s formed into that structure, we’ll do
a simple search and print out the page’s title. We’ll be using the much beloved BeautifulSoup and
telling it to use the default HTML parser. We’ll also remove an earlier print statement that displayed
the status code.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")

print soup.find("title")
Now if we run this, we should see that it makes an HTTP request to the page and then prints out the page's <title> tag.
Next, let’s try adding in some of the HTML patterns that we found earlier while inspecting the site
in our web browser.
countries = soup.find_all("div", "country")
print "We found {} countries on this page".format(len(countries))
In that code, we're using the find_all() method of BeautifulSoup to find all of the <div> elements on the page with a class of country. The first argument that we pass to find_all() is the name of the element that we're looking for. The second argument we pass to the find_all() function is the class that we're looking for.
Taken together, this ensures that we only get back <div> elements with a class of country. There are other arguments and functions you can use to search through the soup, which are explained in more detail in Chapter 11.
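A few of those other search styles, for reference (these are all standard BeautifulSoup calls, run against the same soup object and page structure assumed above):
first_country = soup.find("div", "country")                            # just the first matching element
capitals = soup.find_all("span", attrs={"class": "country-capital"})   # keyword-style attribute filter
names = soup.select("div.country h3")                                  # CSS selector syntax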
Once we find all of the elements that match that pattern, we’re storing them in a variable called
countries. Finally, we’re printing out the length of that variable – basically a way to see how many
elements matched our pattern.
This is a good practice as you’re building your scrapers – making sure the pattern you’re using is
finding the expected number of elements on the page helps to ensure you’re using the right pattern.
If it found too many or too few, then there might be an issue that warrants more investigation.
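One lightweight way to bake that check into the script itself (the expected count of 250 is specific to this example page):
expected = 250
if len(countries) != expected:
    print "Warning: expected {} countries but found {}".format(expected, len(countries))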
If you run that code, you should see that it says:
We found 250 countries on this page
Now that we’ve implemented one search pattern to find the outer wrapper around each country, it’s
time to iterate over each country that we found and do some further extraction to get at the values
we want to store. We’ll start by just grabbing the countries’ names.
for country in countries:
    print country.find("h3").text.strip()
Here we’ve added a simple for-loop that iterates over each item in the countries list. Inside the
loop, we can access the current country we’re looping over using the country variable.
All that we're doing inside the loop is another BeautifulSoup search, but this time, we're only searching inside the specific country element (remember, each country is wrapped in its own <div class="country"> … </div>). We're looking for an <h3> element, then grabbing the text inside of it using the .text attribute, stripping off any extra whitespace around it with strip() and finally printing it to the console.
If you run that from the command line, you should see the full list of countries printed to the screen.
You’ve just built your first web scraper!
We can update it to extract more fields for each country like so:
# note: you will likely get errors when you run this!
for country in countries:
    name = country.find("h3").text.strip()
    capital = country.find("span", "country-capital").text.strip()
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    print "{} has a capital city of {}, a population of {} and an area of {}"\
        .format(name, capital, population, area)
If you try running that, it might start printing out information on the first few countries, but then it's likely to encounter an error when trying to print out some of the countries' information. Since some of the countries have foreign characters, they don't play nicely when doing things like putting them in a format string, printing them to the terminal or writing them to a file.
We'll need to update our code to add .encode("utf-8") to the extracted values that are giving us problems. In our case, those are simply the country name and the capital city. When we put it all together, it should look like this:
name = country.find("h3").text.strip().encode("utf-8")
capital = country.find("span", "country-capital").text.strip().encode("utf-8")
And there you have it! Only a dozen lines of python code, and we’ve already built our very first
web scraper that loads a page, finds all of the countries and then loops over them, extracting several
pieces of content from each country based on the patterns we discovered in our browser’s developer
tools.
One thing you’ll notice, that I did on purpose, is to build the scraper incrementally and test it often.
I’d add a bit of functionality to it, and then run it and verify it does what we expect before moving
on and adding more functionality. You should definitely follow this same process as you’re building
your scrapers.
Web scraping can be more of an art than a science sometimes, and it often takes a lot of guess and
check work. Use the print statement liberally to check on how your program is doing and then
make tweaks if you encounter errors or need to change things. I talk more about some strategies for
testing and debugging your web scraping code in Chapter 10, but it’s good to get into the habit of
making small changes and running your code often to test those changes.
Hands On: Storing the Scraped Data &
Keeping Track of Progress
In the last chapter, we built our very first web scraper that visited a single page and extracted
information about 250 countries. However, the scraper wasn’t too useful – all we did with that
information was print it to the screen.
In reality, it’d be nice to store that information somewhere useful, where we could access it and
analyze it after the scrape completes. If the scrape will visit thousands of pages, it’d be nice to check
on the scraped data while the scraper is still running, to check on our progress, see how quickly
we’re pulling in new data, and look for potential errors in our extraction patterns.
Store Data As You Go
Before we look at specific methods and places for storing data, I wanted to cover a very important
detail that you should be aware of for any web scraping project: You should be storing the
information that you extract from each page, immediately after you extract it.
It might be tempting to add the information to a list in-memory and then run some processing or aggregation on it after you have finished visiting all the URLs you intend to scrape, before storing it somewhere. The problem with this solution becomes apparent anytime you're scraping more than a few dozen URLs, and is twofold:
1. Because a single request to a URL can take 2-3 seconds on average, web scrapers that need to access lots of pages end up taking quite a while to run. Some back of the napkin math says that 100 requests will only take 4-5 minutes, but 1,000 requests will take about 45 minutes, 10,000 requests will take about 7 hours and 100,000 requests will take almost 3 days of running nonstop.
2. The longer a web scraping program runs for, the more likely it is to encounter some sort of
fatal error that will bring the entire process grinding to a halt. We talk more about guarding
against this sort of issue in later chapters, but it’s important to be aware of even if your script
is only going to run for a few minutes.
Any web scraper that’s going to be running for any length of time should be storing the extracted
contents of each page somewhere permanent as soon as it can, so that if it fails and needs to be
restarted, it doesn’t have to go back to square one and revisit all of the URLs that it already extracted
content from.
If you're going to do any filtering, cleaning, aggregation or other processing on your scraped data, do it separately in another script, and have that processing script load the data from the spreadsheet or database where you stored the data initially. Don't try to do them together; instead, scrape and save immediately, and then run your processing later.
This also gives you the benefit of being able to check on your scraped data as it is coming in. You
can examine the spreadsheet or query the database to see how many items have been pulled down
so far, refreshing periodically to see how things are going and how quickly the data is coming in.
Storing Data in a Spreadsheet
Now that we’ve gone over the importance of saving data right away, let’s look at a practical example
using the scraper we started working on in the last chapter. As a reminder, here is the code we ended
up with:
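(Reassembled from the snippets in the last chapter; the URL is the same example page used there.)
import requests
from bs4 import BeautifulSoup

url = "https://scrapethissite.com/pages/simple/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

countries = soup.find_all("div", "country")
print "We found {} countries on this page".format(len(countries))

for country in countries:
    name = country.find("h3").text.strip().encode("utf-8")
    capital = country.find("span", "country-capital").text.strip().encode("utf-8")
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    print "{} has a capital city of {}, a population of {} and an area of {}"\
        .format(name, capital, population, area)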
What we’d like to do is store the information about each country in a Comma Separated Value (CSV)
spreadsheet. Granted, in this example we’re only making one request, so the “save data right away
so that you don’t have to revisit URLs” is a bit of a moot point, but hopefully you’ll see how this
would be important if we were scraping many more URLs.
Let’s spend a minute thinking about what we’d want our spreadsheet to look like. We’d want the
first row to contain header information – the label for each of our columns: country name, capital,
population and area. Then, each row underneath that should contain information about a new
country, with the data in the correct columns. Our final spreadsheet should have 250 rows, one for each country, plus an extra row for the headers, for a total of 251 rows. There should be four columns, one for each field we're going to extract.
It’s good to spend a bit of time thinking about what our results should look like so that we can have a
way to tell if our scraper ran correctly. Even if we don’t know exactly how many items we’ll extract,
we should try to have a ballpark sense of the number (order of magnitude) so that we know how
large our finished data set will be, and how long we (roughly) expect our scraper to take to run until
completion.
Python comes with a built-in library for reading from and writing to CSV files, so you won't have to run any pip commands to install it. Let's take a look at a simple use case:
import csv

with open("/path/to/output.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["col #1", "col #2", "col #3"])
This creates a CSV file at the path you pass to open() (note that you'll likely have to change the full file path to point to an actual folder on your computer's hard drive). We open the file and then create a csv.writer() object, before calling the writerow() method of that writer object.
This sample code only writes a single row to the CSV with three columns, but you can imagine that
it would write many rows to a file if the writerow() function was called inside of a for-loop, like
the one we used to iterate over the countries.
Note that the indentation of the lines matters here, just like it did inside our for-loop in the last chapter. Once we've opened the file for writing, we need to indent all of the code that accesses that file. Once our code has gone back out a level of indentation, python will close the file for us and we won't be able to write to it anymore. In python, the number of spaces at the beginning of a line is significant, so make sure you're careful when you're nesting file operations, for-loops and if-statements.
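To make the scoping concrete, here's a tiny sketch (the file path is a placeholder):
import csv

with open("/path/to/output.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["written", "while", "open"])   # works: the file is still open here

# back at this indentation level, python has already closed the file, so
# writer.writerow(["too", "late"]) here would raise an "I/O operation on closed file" error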
Let’s look at how we can combine the CSV writing that we just learned with our previous scraper.
writer.writerow(["Name", "Capital", "Population", "Area"])
# ...
capital = country.find("span", "country-capital").text.strip().encode("utf-8")
# ...
writer.writerow([name, capital, population, area])
You'll see that we added the import csv statement to the top of the file with our other imports. Then we open our output file and create a csv.writer() for it. Before we even get to our scraping code, we're going to write a quick header row to the file, labeling the columns in the same order that we'll be saving them below inside our loop. Then we're indenting all the rest of our code over so that it's "inside" the file handling code and can write our output to the csv file.
Just as before, we make the HTTP request, pass the response HTML into BeautifulSoup to parse, look for all of the outer country wrapper elements, and then loop over them one-by-one. Inside our loop, we're extracting the name, capital, population and area, just as before, except this time we're now writing them to our CSV with the writer.writerow() function, instead of simply printing to the screen.
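Putting all of those pieces together, the complete script might look something like this (a sketch; the output path and URL are the same placeholders used above):
import csv
import requests
from bs4 import BeautifulSoup

with open("/path/to/output.csv", "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Capital", "Population", "Area"])

    r = requests.get("https://scrapethissite.com/pages/simple/")
    soup = BeautifulSoup(r.text, "html.parser")
    countries = soup.find_all("div", "country")

    for country in countries:
        name = country.find("h3").text.strip().encode("utf-8")
        capital = country.find("span", "country-capital").text.strip().encode("utf-8")
        population = country.find("span", "country-population").text.strip()
        area = country.find("span", "country-area").text.strip()
        writer.writerow([name, capital, population, area])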
If you save that into a text file on your Desktop, you can run it from the command line just as before.
You won't see any output at the command line since we don't have any print statements in our scraper. However, if you open the output file that you created using a spreadsheet tool like Excel, you should see all 251 rows that we were expecting, with the data in the correct columns.
CSV is a very simple file format and doesn't support formatting like bolding, colored backgrounds, formulas, or other things you might expect from a modern spreadsheet. But it's a universal format that can be opened by pretty much any spreadsheet tool, and can then be re-saved into a format that does support modern spreadsheet features, like an Excel file.
CSV files are a useful place to write data to for many applications. They’re easy files to open, edit
and work with, and are easy to pass around as email attachments or share with coworkers. If you
know your way around a spreadsheet application, you could use pivot tables or filters to run some basic queries and aggregation.
But if you really want a powerful place to store your data and run complex queries and relational
logic, you’d want to use a SQL database, like sqlite.
Storing Data in a SQLite Database
There are a number of well-known and widely used SQL databases that you may already be familiar with, like MySQL, PostgreSQL, Oracle or Microsoft SQL Server.
When you’re working with a SQL database, it can be local – meaning that the actual database lives
on the same computer where your scraper is running – or it could be running on a server somewhere
else, and you have to connect to it over the network to write your scraped information to it.
If your database isn’t local (that is, it’s hosted on a server somewhere) you should consider that
each time you insert something into it, you’re making a round-trip across the network from your
computer to the database server, similar to making an HTTP request to a web server. I usually try
to have the database live locally on the machine that I’m running the scrape from, in order to cut
down on extra network traffic.
For those not familiar with SQL databases, an important concept is that the data layout is very
similar to a spreadsheet – each table in a database has several pre-defined columns and each new
item that’s added to the database is thought of as a “row” being added to the table, with the values
for each row stored in the appropriate columns.
When working with a database, your first step is always to open a connection to it (even if it’s a
local database). Then you create a cursor on the connection, which you use to send SQL statements
to the database. Finally, you can execute your SQL statements on that cursor to actually send them
to the database and have them be implemented.
Unlike spreadsheets, you have to define your table and all of its columns ahead of time before you
can begin to insert rows. Also, each column has a well-defined data type. If you say that a certain
column in a table is supposed to be a number or a boolean (i.e. "True" or "False") and then try to insert
a row that has a text value in that column, the database will block you from inserting the incorrect
data type and return an error.
You can also set up unique constraints on the data to prevent duplicate values from being inserted.
We won’t go through all the features of a SQL database in this book, but we will take a look at some
basic sample code shortly.
The various SQL databases all offer slightly different features and use-cases in terms of how you
can get data out and work with it, but they all implement a very consistent SQL interface. For
our purposes, we’ll simply be inserting rows into our database using the INSERT command, so
the differences between the databases are largely irrelevant. We're using SQLite since it's a very lightweight database engine, easy to use, and comes built in with python.
Let’s take a look at some simple code that connects to a local sqlite database, creates a table, inserts
some rows into the table and then queries them back out.
import sqlite3

# setup the connection and cursor
conn = sqlite3.connect("scraped_data.db")  # the database filename here is illustrative
conn.row_factory = sqlite3.Row
conn.isolation_level = None  # auto-commit each statement
c = conn.cursor()

# create our table, only needed once
c.execute("""
    CREATE TABLE IF NOT EXISTS items (
        name text,
        price text,
        url text)
""")

# insert some rows (the URLs are illustrative)
insert_sql = "INSERT INTO items (name, price, url) VALUES (?, ?, ?);"
c.execute(insert_sql, ("Item #1", "$19.95", "http://example.com/item-1/"))
c.execute(insert_sql, ("Item #2", "$12.95", "http://example.com/item-2/"))

# get all of the rows in our table
items = c.execute("SELECT * FROM items")
for item in items.fetchall():
    print item["name"]
Just as we described, we're connecting to our database and getting a cursor before executing some commands on the cursor. The first execute() command is to create our table. Note that we include IF NOT EXISTS because otherwise SQLite would throw an error if we ran this script a second time: it would try to create the table again and discover that there was already an existing table with the same name.
We’re creating a table with three columns (name, price and URL), all of which use the column type
of text. It’s good practice to use the text column type for scraped data since it’s the most permissive
data type and is least likely to throw errors. Plus, the data we’re storing is all coming out of an HTML
text file anyways, so there’s a good chance it’s already in the correct text data type.
After we’ve created our table with the specified columns, then we can begin inserting rows into
it. You’ll see that we have two lines that are very similar, both executing INSERT SQL statements
with some sample data. You’ll also notice that we’re passing two sets of arguments to the execute()
command in order to insert the data.
The first is a SQL string with the INSERT command, the name of the table, the order of the columns
and then finally a set of question marks where we might expect the data to go. The actual data is
passed in inside a second argument, rather than inside the SQL string directly. This is to protect against accidental SQL injection issues.
If you build the SQL INSERT statement as a string yourself using the scraped values, you’ll run
into issues if you scrape a value that has an apostrophe, quotation mark, comma, semicolon or lots
of other special characters. Our sqlite3 library has a way of automatically escaping these sort of
values for us so that they can be inserted into the database without issue.
It's worth repeating: never build your SQL INSERT statements yourself directly as strings. Instead, use the ? as a placeholder where your data will go, and then pass the data in as a second argument. The sqlite3 library will take care of escaping it and inserting it into the string properly and safely.
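A quick before-and-after to make that concrete (name, price and url here stand in for values you've scraped):
# DON'T build the statement yourself -- this breaks (and is unsafe) as soon as a
# scraped value contains an apostrophe or quotation mark:
#   c.execute("INSERT INTO items (name, price, url) VALUES ('%s', '%s', '%s');" % (name, price, url))

# DO use ? placeholders and pass the values separately -- sqlite3 escapes them for you:
c.execute("INSERT INTO items (name, price, url) VALUES (?, ?, ?);", (name, price, url))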
Now that we've seen how to work with a database, let's go back to our earlier scraper and change it to insert our scraped data into our SQLite database, instead of a CSV.
CREATE TABLE IF NOT EXISTS countries (
    name text,
    capital text,
    population text,
    area text)

insert_sql = "INSERT INTO countries (name, capital, population, area) VALUES (?, ?, ?, ?);"
# ... inside the loop over countries ...
c.execute(insert_sql, (name, capital, population, area))
We've got our imports at the top of the file, then we connect to the database and set up our cursor, just like we did in the previous SQL example. When we were working with CSV files, we had to indent all of our scraping code so that it could still have access to the output CSV file, but with our SQL database connection, we don't have to worry about that, so our scraping code can stay at the normal indentation level.
After we've opened the connection and gotten the cursor, we set up our countries table with 4 columns of data that we want to store, all set to a data type of text. Then we make the HTTP request, create the soup, and iterate over all the countries that we found, executing the same INSERT statement over and over again, passing in the newly extracted information. You'll notice that we removed the .encode("utf-8") calls at the end of the name and capital extraction lines, since sqlite has better support for the unicode text that we get back from BeautifulSoup.
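Assembled into one script, the SQLite version might look something like this (a sketch; the database filename and URL are the same illustrative assumptions used above):
import sqlite3
import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("scraped_data.db")
conn.isolation_level = None  # auto-commit each INSERT as soon as it happens
c = conn.cursor()

c.execute("""
    CREATE TABLE IF NOT EXISTS countries (
        name text,
        capital text,
        population text,
        area text)
""")

insert_sql = "INSERT INTO countries (name, capital, population, area) VALUES (?, ?, ?, ?);"

r = requests.get("https://scrapethissite.com/pages/simple/")
soup = BeautifulSoup(r.text, "html.parser")

for country in soup.find_all("div", "country"):
    name = country.find("h3").text.strip()
    capital = country.find("span", "country-capital").text.strip()
    population = country.find("span", "country-population").text.strip()
    area = country.find("span", "country-area").text.strip()
    c.execute(insert_sql, (name, capital, population, area))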
Once your script has finished running, you can connect to your SQLite database and run a SELECT
* FROM countries command to get all of the data back. Running queries to aggregate and analyze
your data is outside the scope of this book, but I will mention the ever-handy SELECT COUNT(*)
FROM countries; query, which simply returns the number of rows in the table. This is useful for
checking on the progress of long-running scrapes, or ensuring that your scraper pulled in as many
items as you were expecting.
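For example, while a long scrape is still running, you can check the row count from a separate python shell (using the same illustrative database filename as above):
import sqlite3

conn = sqlite3.connect("scraped_data.db")
print conn.execute("SELECT COUNT(*) FROM countries;").fetchone()[0]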
In this chapter, we learned the two most common ways you’ll want to store your scraped data –
in a CSV spreadsheet or a SQL database. At this point, we have a complete picture of how to find
the pattern of HTTP requests and HTML elements we’ll need to fetch and extract our data, and
we know how to turn those patterns into python code that makes the requests and hunts for the
patterns. Now we know how to store our extracted data somewhere permanent.
For 99% of web scraping use-cases, this is a sufficient skill set to get the data we’re looking for. The
rest of the book covers some more advanced skills that you might need to use on especially tricky
websites, as well as some handy cheat sheets and tips for debugging your code when you run into
issues.
If you're looking for some more practice with the basics, head over to Scrape This Site, an entire web scraping portal that I built out in order to help new people learn and practice the art of web scraping. There's more sample data that you can scrape as well as a whole set of video courses you can sign up for that walk you through everything step-by-step.
Scraping Data that’s Not in the
Response HTML
For most sites, you can simply make an HTTP request to the page’s URL that you see in the bar at
the top of your browser. Occasionally though, the information we see on a web page isn’t actually
fetched the way we might initially expect.
The reality is that nowadays, websites can be very complex. While we generally think of a web page as having its own URL, most web pages actually require dozens of HTTP requests –
sometimes hundreds – to load all of their resources. Your browser makes all of these requests in the
background as it’s loading the page, going back to the web server to fetch things like CSS, Javascript,
fonts, images – and occasionally, extra data to display on the page.
Open your browser’s developer tools, click over to the “Network” tab and then try navigating to a
site like The New York Times. You'll see that there might be hundreds of different HTTP requests
that are sent, simply to load the homepage.
The developer tools show some information about the “Type” of each request. It also allows you to
filter the requests if you’re only interested in certain request types. For example, you could filter to
only show requests for Javascript files or CSS stylesheets.
175 HTTP Requests to Load the Homepage!
It's important to note that each one of these 175 resources that were needed to load the homepage required its own HTTP request. Just like the request to the main page URL, each of these requests has its own URL, request method and headers.
Similarly, each request returns a response code, some response headers and then the actual body of
the response. For the requests we’ve looked at so far, the body of the response has always been a large
blob of HTML text, but for these requests, often the response is formatted completely differently,
depending on what sort of response it is.
I mention all of this because it’s important to note that sometimes, the data that you see on a page that
you’re trying to scrape isn’t actually returned in the HTML response to the page’s URL. Instead, it’s
requested from the web server and rendered in your browser in one of these other “hidden” requests
that your browser makes behind the scenes as the page loads.
If you try to make a request to the parent page’s URL (the one you see in your browser bar) and
then look for the DOM patterns that you see in the “Inspect Element” window of your developer
tools, you might be frustrated to find that the HTML in the parent page’s response doesn’t actually
contain the information as you see it in your browser.
At a high-level, there are two ways in which a site might load data from a web server to render in your browser, without actually returning that data in the page's HTML. The first is that Javascript-heavy websites might make AJAX calls in the background to fetch data in a raw form from an API. The second is that sites might sometimes return iframes that embed some other webpage inside the current one.
Javascript and AJAX Requests
With many modern websites and web applications, the first HTTP request that’s made to the parent
page’s URL doesn’t actually return much information at all. Instead, it returns just a bit of HTML
that tells the browser to download a big Javascript application, and then the Javascript application
runs in your browser and makes its own HTTP requests to the backend to fetch data and then add
it to the DOM so that it’s rendered for the user to see.
The data is pulled down asynchronously – not at the same time as the main page request. This leads to the name AJAX, for Asynchronous Javascript and XML (even though the server doesn't actually need to return XML data, and often doesn't).
Some common indications that the site you’re viewing uses Javascript and AJAX to add content to
the page:
• so called “infinite scrolling” where more content is added to the page as you scroll
• clicking to another page loads a spinner for a few seconds before showing the new content
• the page doesn’t appear to clear and then reload when clicking around, the new content just
shows up
• the URL contains lots of information after a hashtag character (#)
It may sound complicated at first, and some web scraping professionals really get thrown into a tailspin when they encounter a site that uses Javascript. They think that they need to download all
of the site’s Javascript and run the entire application as if they were the web browser. Then they
need to wait for the Javascript application to make the requests to fetch the data, and then they
need to wait for the Javascript application to render that data to the page in HTML, before they can
finally scrape it from the page. This involves installing a ton of tools and libraries that are usually a
pain to work with and add a lot of overhead that slows things down.
The reality – if we stick to our fundamentals – is that scraping Javascript-heavy websites is actually
easier than scraping an HTML page, using the simple tools we already have. All we have to do is
inspect the requests that the Javascript application is triggering in our browser, and then instruct
our web scraper to make those same requests. Usually the response to these requests is in a nicely
structured data format like JSON (javascript object notation), which is even easier to parse and read
through than an HTML document.
In your browser's developer tools, pull up the "Network" tab, reload the page, and then click the request type filter at the top to only show the XHR requests. These are the AJAX requests that the Javascript application is making to fetch the data.
Once you’ve filtered down the list of requests, you’ll probably see a series of HTTP requests that
your browser has made in the background to load more information onto the page. Click through
each of these requests and look at the Response tab to see what content was returned from the server.
Inspecting an AJAX request in the “Network” tab of Chrome’s Developer Tools
Eventually, you should start to see the data that you’re looking to scrape. You’ve just discovered
your own hidden “API” that the website is using, and you can now direct your web scraper to use
it in order to pull down the content you’re looking for.
You’ll need to read through the response a bit to see what format the data is being returned in. You
might be able to tell by looking at the Content-Type Response header (under the “Headers” tab for
the request). Occasionally, AJAX requests will return HTML – in which case you would parse the
response with BeautifulSoup just as we have been.
Normally though, sites will have their AJAX requests return JSON formatted data. This is a much
more lightweight format than HTML and is usually more straightforward to parse. In fact, the
requests library that we’ve been using in python has a simple helper method to access JSON data
as a structured python object. Let’s take a look at an example.
Let’s say that we’ve found a hidden AJAX request URL that returns JSON-formatted data like this:
{
  "items": [{
    "name": "Item #1",
    "price": "$19.95",
    "permalink": "/item-1/"
  }, {
    "name": "Item #2",
    "price": "$12.95",
    "permalink": "/item-2/"
  },
  ...
  ]
}
We could build a simple scraper, like so:
import requests

# make the request to the "hidden" AJAX URL (the URL here is illustrative)
r = requests.get("http://example.com/api/items/")

# convert the response JSON into a structured python `dict()`
response_data = r.json()

for item in response_data["items"]:
    print item["price"]
The data is automatically parsed into a nested structure that we can loop over and inspect, without
having to do any of our own pattern matching or hunting around. Pretty straightforward.
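If you want to keep what you pull down from one of these hidden APIs, the same storage patterns from the earlier chapters apply. A minimal sketch, reusing the illustrative items table and database filename from the SQLite examples and mapping each item's permalink into the url column:
import sqlite3

conn = sqlite3.connect("scraped_data.db")
conn.isolation_level = None
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS items (name text, price text, url text)")

for item in response_data["items"]:
    c.execute("INSERT INTO items (name, price, url) VALUES (?, ?, ?);",
              (item["name"], item["price"], item["permalink"]))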
It takes a bit of work up-front to discover the “hidden” AJAX requests that the Javascript application
is triggering in the background, so that’s not as quick as simply grabbing the URL in the big bar at
the top of your browser. But once you find the correct URLs, it’s often much simpler to scrape data
from these hidden AJAX APIs.
Pages Inside Pages: Dealing with iframes
There's another sort of issue you'll run into where the parent page doesn't actually have the data that's displayed in its own HTML response. That's when the site you're scraping uses iframes.
Iframes are a special HTML element that allows a parent page to embed a completely different page (at a completely different URL) and load all of that page's content onto the parent page, inside a box. It's basically a way to open one web page inside another web page. This is commonly used for embedding content from one website inside another.
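When the content you want lives inside an iframe, one simple approach is to find the iframe's src attribute in the parent page and then request that URL directly, just like any other page. A minimal sketch (the URL is illustrative):
import requests
from bs4 import BeautifulSoup

parent = requests.get("http://example.com/page-with-iframe/")
parent_soup = BeautifulSoup(parent.text, "html.parser")

# find the embedded page's URL and request it directly
# (you may need to join a relative src against the parent page's URL)
iframe_url = parent_soup.find("iframe")["src"]
framed = requests.get(iframe_url)
framed_soup = BeautifulSoup(framed.text, "html.parser")

print framed_soup.find("title")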
Ultimate Guide to Web Scraping with Python Part 1: Requests …
Part one of this series focuses on requesting and wrangling HTML using two of the most popular Python libraries for web scraping: requests and BeautifulSoup.

After the 2016 election I became much more interested in media bias and the manipulation of individuals through advertising. This series will be a walkthrough of a web scraping project that monitors political news from both left and right wing media outlets and performs an analysis on the rhetoric being used, the ads being displayed, and the sentiment of certain texts. The notebook for this tutorial is available on GitHub.

The first part of the series will be getting media bias data and will focus on only working locally on your computer, but if you wish to learn how to deploy something like this into production, feel free to leave a comment and let me know.

You should already know: Python fundamentals – lists, dicts, functions, loops (learn on Coursera) – and basic HTML.

You will learn: requesting web pages, parsing HTML, saving and loading scraped data, and scraping multiple pages in a row.

Every time you load a web page you're making a request to a server, and when you're just a human with a browser there's not a lot of damage you can do. With a Python script that can execute thousands of requests a second if coded incorrectly, you could end up costing the website owner a lot of money and possibly bring down their site (see Denial-of-service attack (DoS)).

With this in mind, we want to be very careful with how we program scrapers to avoid crashing sites and causing damage. Every time we scrape a website we want to attempt to make only one request per page. We don't want to be making a request every time our parsing or other logic doesn't work out, so we need to parse only after we've saved the page locally.

If I'm just doing some quick tests, I'll usually start out in a Jupyter notebook because you can request a web page in one cell and have that web page available to every cell below it without making a new request. Since this article is available as a Jupyter notebook, you will see how it works if you choose that format.

Once we make a request and retrieve a web page's content, we can store that content locally with Python's open() function. To do so we need to use the argument wb, which stands for "write bytes". This lets us avoid any encoding issues when saving.

Below is a function that wraps the open() function to reduce a lot of repetitive coding later on:

def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)
save_html(r.content, 'google_com')

Assume we have captured the HTML from google.com in html, which you'll see later how to do. After running this function we will now have a file in the same directory as this notebook called google_com that contains the HTML.

To retrieve our saved file we'll make another function to wrap reading the HTML back into html. We need to use rb for "read bytes" in this case:

def open_html(path):
    with open(path, 'rb') as f:
        return f.read()
html = open_html('google_com')

The open function is doing just the opposite: read the HTML from google_com. If our script fails, the notebook closes, the computer shuts down, etc., we no longer need to request Google again, lessening our impact on their servers. While it doesn't matter much with Google since they have a lot of resources, smaller sites with smaller servers will benefit from this. I save almost every page and parse later when web scraping as a safety precaution.

Every site usually has a robots.txt on the root of their domain. This is where the website owner explicitly states what bots are allowed to do on their site. Simply go to the site's /robots.txt and you should find a text file that looks something like this:

User-agent: *
Crawl-delay: 10
Allow: /pages/
Disallow: /scripts/
# more stuff

The User-agent field is the name of the bot, and the rules that follow are what the bot should follow. Some robots.txt files will have many User-agents with different rules. Common bots are googlebot, bingbot, and applebot, all of which you can probably guess the purpose and origin of.

We don't really need to provide a User-agent when scraping, so User-agent: * is what we would follow. A * means that the following rules apply to all bots (that's us).

Crawl-delay tells us the number of seconds to wait between requests, so in this example we need to wait 10 seconds before making another request.

Allow gives us specific URLs we're allowed to request with bots, and vice versa for Disallow. In this example we're allowed to request anything in the /pages/ subfolder, which means any URL whose path starts with /pages/. On the other hand, we are disallowed from scraping anything from the /scripts/ subfolder.

Other times you'll see a * next to Allow or Disallow, which means you are either allowed or not allowed to scrape everything on the site. Sometimes there will be a disallow-all-pages rule followed by allowed pages, like this:

Disallow: *
Allow: /pages/

This means that you're not allowed to scrape anything except the subfolder /pages/. Essentially, you just want to read the rules in order, where the next rule overrides the previous rule.

This project will primarily be run through a Jupyter notebook, which is done for teaching purposes and is not the usual way scrapers are programmed. After showing you the pieces, we'll put it all together into a Python script that can be run from the command line or your IDE of choice.

With Python's requests (pip install requests) library we're getting a web page by using get() on the URL. The response r contains many things, but using r.content will give us the HTML. Once we have the HTML we can then parse it for the data we're interested in.

There's an interesting website called AllSides that has a media bias ratings table where users can agree or disagree with the ratings. Since there's nothing in their robots.txt that disallows us from scraping this section of the site, I'm assuming it's okay to go ahead and extract this data for our project. Let's request the first page:
import requests

url = 'https://www.allsides.com/media-bias/media-bias-ratings'
r = requests.get(url)
print(r.content[:100])

This prints the first hundred bytes of the raw response, just to confirm we got the page's HTML back.

Now we can parse that HTML with BeautifulSoup and pull out the pieces we want using its select() and select_one() methods, which take CSS selectors:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')

A quick reference for the selectors we'll be using:
• To select an element by its tag, use the naked name for the tag. E.g. select_one('a') gets an anchor/link element, select_one('body') gets the body element.
• .temp gets an element with a class of temp, e.g. select_one('.temp').
• #temp gets an element with an id of temp, e.g. select_one('#temp').
• .temp.example gets an element with both classes temp and example, e.g. select_one('.temp.example').
• .temp a gets an anchor element nested inside of a parent element with class temp, e.g. select_one('.temp a'). Note the space between .temp and a.
• .temp .example gets an element with class example nested inside of a parent element with class temp, e.g. select_one('.temp .example'). Again, note the space between .temp and .example. The space tells the selector that the class after the space is a child of the class before the space.
• Ids, such as #temp, are unique, so you can usually use the id selector by itself to get the right element. No need to do nested selectors when using ids.
There are many more selectors for doing various tasks, like selecting certain child elements, specific links, etc., that you can look up when needed. The selectors above get us pretty close to everything we would need for now.

A note on figuring out how to select certain elements: most browsers have a quick way of finding the selector for an element using their developer tools. In Chrome, we can quickly find selectors for elements by:
1. Right-clicking on the element, then selecting "Inspect" in the menu. Developer tools opens and highlights the element we right-clicked.
2. Right-clicking the code element in developer tools, hovering over "Copy" in the menu, then clicking "Copy selector" (the same menu can also copy an XPath, another type of selector).
Sometimes it'll be a little off and we need to scan up a few elements to find the right one.

Our data is housed in a table on AllSides, and by inspecting the header element we can find the code that renders the table and rows. What we need to do is select all the rows from the table and then parse out the information from each one. Simplifying the table's HTML, the structure looks something like this:

<table>
  <thead> ... column headers ... </thead>
  <tbody>
    <tr> ... one row per news source ... </tr>
    <tr> ... </tr>
  </tbody>
</table>
So to get each row, we just select all <tr> inside <tbody>:

rows = soup.select('tbody tr')

tbody tr tells the selector to extract all <tr> (table row) tags that are children of the <tbody> tag. If there were more than one table on this page we would have to make a more specific selector, but since this is the only table, we're good to go.

Now we have a list of HTML table rows that each contain four cells:
• News source name and link
• Bias data
• Agreement buttons
• Community feedback data

Below is a breakdown of how to extract the data from each one. Let's look at the first cell:
<td class="… source-title">
  <a href="…">ABC News</a>
</td>

The outlet name (ABC News) is the text of an anchor tag that's nested inside a <td> tag, which is a cell — or table data.

Getting the outlet name is pretty easy: just get the first row in rows and run a select_one off that object:

row = rows[0]
name = row.select_one('.source-title').text.strip()
print(name)

The only class we needed to use in this case was .source-title, since the cell's other class looks to be just a class each row is given for styling and doesn't provide any meaning.

Notice that we didn't need to worry about selecting the anchor tag a that contains the text. When we use .text it gets all the text in that element, and since "ABC News" is the only text, that's all we need to do. Bear in mind that using select or select_one will give you the whole element with the tags included, so we need .text to give us just the text between the tags. strip() ensures all the whitespace surrounding the name is removed. Many websites use whitespace as a way to visually pad the text inside elements, so using strip() is always a good idea.

You'll notice that we can run BeautifulSoup methods right off one of the rows. That's because the rows become their own BeautifulSoup objects when we make a select from another BeautifulSoup object. On the other hand, our name variable is no longer a BeautifulSoup object because we called .text.

We also need the link to this news source's page on AllSides. If we look back at the HTML we'll see that in this case we do want to select the anchor in order to get the href that contains the link, so let's do that:

allsides_page = row.select_one('.source-title a')['href']
allsides_page = 'https://www.allsides.com' + allsides_page
print(allsides_page)

The href is a relative path in the HTML, so we prepend the site's URL to make it a link we can request later.

Getting the link was a bit different than just selecting an element. We had to access an attribute (href) of the element, which is done using brackets, like how we would access a Python dictionary. This will be the same for other attributes of elements, like src in images and videos.

You can see that the bias rating is displayed as an image, so how can we get the rating in words? Looking at the HTML, notice that the link that surrounds the image has the text we need:
<a href="/media-bias/left-center">
  <img alt="Political News Media Bias Rating: Lean Left" src="…" />
</a>

We could also pull the alt attribute, but the link looks easier. Let's grab it:

bias = row.select_one('.views-field-field-bias-image a')['href']
bias = bias.split('/')[-1]
print(bias)

Here we selected the anchor tag by using the class name and tag together: .views-field-field-bias-image is the class of the <td> and a is for the anchor nested inside it.

After that we extract the href just like before, but now we only want the last part of the URL for the name of the bias, so we split on slashes and get the last element of that split (left-center).

The last thing to scrape is the agree/disagree ratio from the community feedback area. The HTML of this cell is pretty convoluted due to the styling, but here's the basic structure:
<div>
  <span class="agree">8241</span>/<span class="disagree">6568</span>
</div>

The numbers we want are located in two span elements in the last div. Both span elements have classes that are unique in this cell, so we can use them to make the selection:

agree = row.select_one('.agree').text
agree = int(agree)
disagree = row.select_one('.disagree').text
disagree = int(disagree)
agree_ratio = agree / disagree
print(f"Agree: {agree}, Disagree: {disagree}, Ratio {agree_ratio:.2f}")

Out: Agree: 8411, Disagree: 6662, Ratio 1.26

Using .text will return a string, so we need to convert the two numbers to integers in order to calculate the ratio.

Side note: if you've never seen this way of formatting print statements in Python, the f at the front allows us to insert variables right into the string using curly braces. The :.2f is a way to format floats to only show two decimal places.

If you look at the page in your browser you'll notice that they say how much the community is in agreement by using "somewhat agree", "strongly agree", etc., so how do we get that? If we try to select it:

print(row.select_one('.community-feedback-rating-page'))

It shows up as None because this element is rendered with Javascript, and requests can't pull HTML rendered with Javascript. We'll be looking at how to get data rendered with JS in a later article, but since this is the only piece of information that's rendered this way, we can manually recreate the logic.

To find the JS files they're using, just CTRL+F for ".js" in the page source and open the files in a new tab to look for that logic. It turned out the logic was located in the eleventh JS file, and they have a function that calculates the agreeance text and color based on the agree/disagree ratio. Let's make a function that replicates this logic:

def get_agreeance_text(ratio):
    if ratio > 3: return "absolutely agrees"
    elif 2 < ratio <= 3: return "strongly agrees"
    elif 1.5 < ratio <= 2: return "agrees"
    elif 1 < ratio <= 1.5: return "somewhat agrees"
    elif ratio == 1: return "neutral"
    elif 0.67 < ratio < 1: return "somewhat disagrees"
    elif 0.5 < ratio <= 0.67: return "disagrees"
    elif 0.33 < ratio <= 0.5: return "strongly disagrees"
    elif ratio <= 0.33: return "absolutely disagrees"
    else: return None

print(get_agreeance_text(2.5))

Now that we have the general logic for a single row and we can generate the agreeance text, let's create a loop that gets data from every row on the first page:

data = []
for row in rows:
    d = dict()
    d['name'] = row.select_one('.source-title').text.strip()
    d['allsides_page'] = 'https://www.allsides.com' + row.select_one('.source-title a')['href']
    d['bias'] = row.select_one('.views-field-field-bias-image a')['href'].split('/')[-1]
    d['agree'] = int(row.select_one('.agree').text)
    d['disagree'] = int(row.select_one('.disagree').text)
    d['agree_ratio'] = d['agree'] / d['disagree']
    d['agreeance_text'] = get_agreeance_text(d['agree_ratio'])
    data.append(d)

In the loop we can combine any multi-step extractions into one line to create the values in the least number of steps.

Our data list now contains a dictionary with the key information for every row:

{'name': 'ABC News', 'allsides_page': '…', 'bias': 'left-center', 'agree': 8411, 'disagree': 6662, 'agree_ratio': 1.2625337736415492, 'agreeance_text': 'somewhat agrees'}

Keep in mind that this is still only the first page. The list on AllSides is three pages long as of this writing, so we need to modify this loop to get the other pages.

Notice that the URLs for each page follow a pattern. The first page has no parameters on the URL, but the next pages do; specifically they attach a ?page=# to the URL, where '#' is the page number.

For now, the easiest way to get all pages is just to manually make a list of these three pages and loop over them. If we were working on a project with thousands of pages we might build a more automated way of constructing/finding the next URLs, but for now this works:

pages = [
    url,
    url + '?page=1',
    url + '?page=2'
]

According to AllSides' robots.txt we need to make sure we wait ten seconds before each request.

Our loop will: request a page, parse the page, wait ten seconds, and repeat for the next page.

Remember, we've already tested our parsing above on a page that was cached locally, so we know it works. You'll want to make sure to do this before making a loop that performs requests, to prevent having to re-loop if you forgot to parse something.

By combining all the steps we've done up to this point and adding a loop over pages, here's how it looks:

from time import sleep

data = []
for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.select('tbody tr')

    for row in rows:
        d = dict()
        # ... same extraction code as in the single-page loop above ...
        data.append(d)

    sleep(10)

Now we have a list of dictionaries for each row on all three pages.

To cap it off, we want to get the real URL to the news source, not just the link to their presence on AllSides. To do this, we will need to get each AllSides page and look for that link. If we go to ABC News' page there's a row of external links to Facebook, Twitter, Wikipedia, and the ABC News website.
