Scrapy Tutorial — Scrapy 2.5.1 documentation
In this tutorial, we’ll assume that Scrapy is already installed on your system.
If that’s not the case, see Installation guide.
We are going to scrape quotes.toscrape.com, a website
that lists quotes from famous authors.
This tutorial will walk you through these tasks:
Creating a new Scrapy project
Writing a spider to crawl a site and extract data
Exporting the scraped data using the command line
Changing spider to recursively follow links
Using spider arguments
Scrapy is written in Python. If you’re new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you’re new to programming and want to start with Python, the following books
may be useful to you:
Automate the Boring Stuff With Python
How To Think Like a Computer Scientist
Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers,
as well as the suggested resources in the learnpython-subreddit.
Creating a project¶
Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you’d like to store your code and run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Our first Spider¶
Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or a group of websites). They must subclass
Spider and define the initial requests to make,
optionally how to follow links in the pages, and how to parse the downloaded
page content to extract data.
This is the code for our first Spider. Save it in a file named quotes_spider.py
under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider
and defines some attributes and methods:
name: identifies the Spider. It must be
unique within a project, that is, you can’t set the same name for different
Spiders.
start_requests(): must return an iterable of
Requests (you can return a list of requests or write a generator function)
which the Spider will begin to crawl from. Subsequent requests will be
generated successively from these initial requests.
parse(): a method that will be called to handle
the response downloaded for each of the requests made. The response parameter
is an instance of TextResponse that holds
the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting
the scraped data as dicts and also finding new URLs to
follow and creating new requests (Request) from them.
How to run our spider¶
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
This command runs the spider with name quotes that we’ve just added, that
will send some requests for the quotes.toscrape.com domain. You will get an output
similar to this: … (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished) …
Now, check the files in the current directory. You should notice that two new
files have been created: quotes-1.html and quotes-2.html, with the content
for the respective URLs, as our parse method instructs.
Note
If you are wondering why we haven’t parsed the HTML yet, hold
on, we will cover that soon.
What just happened under the hood?¶
Scrapy schedules the scrapy.Request objects
returned by the start_requests method of the Spider. Upon receiving a
response for each one, it instantiates Response objects
and calls the callback method associated with the request (in this case, the
parse method), passing the response as argument.
A shortcut to the start_requests method¶
Instead of implementing a start_requests() method
that generates scrapy.Request objects from URLs,
you can just define a start_urls class attribute
with a list of URLs. This list will then be used by the default implementation
of start_requests() to create the initial requests
for your spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]
The parse() method will be called to handle each
of the requests for those URLs, even though we haven’t explicitly told Scrapy
to do so. This happens because parse() is Scrapy’s
default callback method, which is called for requests without an explicitly
assigned callback.
Storing the scraped data¶
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -O quotes.json
That will generate a quotes.json file containing all scraped items,
serialized in JSON.
The -O command-line switch overwrites any existing file; use -o instead
to append new content to any existing file. However, appending to a JSON file
makes the file contents invalid JSON. When appending to a file, consider
using a different serialization format, such as JSON Lines:
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it’s stream-like: you can easily
append new records to it. It doesn’t have the same problem as JSON when you run
the command twice. Also, as each record is a separate line, you can process big files
without having to fit everything in memory; there are tools like jq to help
do that at the command line.
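Because each JSON Lines record is one self-contained JSON object per line, the file can be processed record by record with nothing but the standard library. A rough sketch (the helper names and the quotes.jl file name here are illustrative, not part of the tutorial project):

```python
import json

def iter_jsonlines(path):
    """Yield one parsed record per line, never loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

def append_record(path, record):
    """Appending stays valid JSON Lines: just add one more line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

This is exactly why -o appending works for .jl files but breaks .json files: a JSON array has opening and closing brackets, while JSON Lines has no global structure to corrupt.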
In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an Item Pipeline. A placeholder file
for Item Pipelines has been set up for you when the project is created, in
tutorial/pipelines.py. Though you don’t need to implement any item
pipelines if you just want to store the scraped items.
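An item pipeline is, at its core, just a class exposing a process_item(item, spider) method that Scrapy calls for every yielded item. As an illustrative sketch, not the tutorial's code (the class name, field names, and threshold are hypothetical; a real Scrapy pipeline would raise scrapy.exceptions.DropItem instead of returning None):

```python
class DropShortQuotesPipeline:
    """Discard quotes shorter than a minimum length; pass the rest through cleaned."""
    MIN_LENGTH = 10

    def process_item(self, item, spider):
        text = item.get("text", "")
        if len(text) < self.MIN_LENGTH:
            # In real Scrapy code: raise scrapy.exceptions.DropItem("too short")
            return None
        item["text"] = text.strip()  # normalize whitespace before storage
        return item
```

To activate such a pipeline you would list it in the ITEM_PIPELINES setting with a priority number, as described in the Item Pipeline documentation.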
Following links¶
Let’s say, instead of just scraping the stuff from the first two pages
from quotes.toscrape.com, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links
from them.
First thing is to extract the link to the page we want to follow. Examining
our page, we can see there is a link to the next page with the following
markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
There is also an attrib property available
(see Selecting element attributes for more):
>>> response.css('li.next a')[0].attrib['href']
'/page/2/'
Let’s see now our spider modified to recursively follow the link to the next
page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to
the next page, builds a full absolute URL using the
urljoin() method (since the links can be
relative) and yields a new request to the next page, registering itself as
callback to handle the data extraction for the next page and to keep the
crawling going through all the pages.
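response.urljoin() resolves links the same way a browser would, relative to the page's own URL; the standard library's urllib.parse.urljoin shows the same behaviour on its own:

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/1/"

# A root-relative href resolves against the site root...
print(urljoin(base, "/page/2/"))  # https://quotes.toscrape.com/page/2/

# ...a path-relative href resolves against the current directory...
print(urljoin(base, "2/"))  # https://quotes.toscrape.com/page/1/2/

# ...while an already absolute URL passes through unchanged.
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```

This is why the spider can yield /page/2/ as extracted from the href and still end up requesting a full absolute URL.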
What you see here is Scrapy’s mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it’s
visiting.
In our example, it creates a sort of loop, following all the links to the next page
until it doesn’t find one – handy for crawling blogs, forums and other sites with
pagination.
A shortcut for creating Requests¶
As a shortcut for creating Request objects you can use response.follow:

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly – no
need to call urljoin. Note that response.follow just returns a Request
instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string;
this selector should extract necessary attributes:

    for href in response.css('ul.pager a::attr(href)'):
        yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href
attribute automatically. So the code can be shortened further:

    for a in response.css('ul.pager a'):
        yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use
response.follow_all instead:

    anchors = response.css('ul.pager a')
    yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

    yield from response.follow_all(css='ul.pager a', callback=self.parse)
More examples and patterns¶
Here is another spider that illustrates callbacks and following links,
this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page; it will follow all the links to the
author pages, calling the parse_author callback for each of them, and also
the pagination links with the parse callback as we saw before.
Here we’re passing callbacks to response.follow_all as positional
arguments to make the code shorter; it also works for Request.
The parse_author callback defines a helper function to extract and clean up the
data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don’t need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured by the setting
DUPEFILTER_CLASS.
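The idea behind the default duplicate filter can be sketched in plain Python: keep a set of fingerprints of URLs already scheduled and skip any request whose fingerprint is already present. This is a simplified stand-in; Scrapy's real filter fingerprints the whole request (method, body, canonicalized URL), not just the URL string:

```python
from hashlib import sha1

class SeenUrlFilter:
    """Minimal stand-in for a request dupe filter: remembers URL fingerprints."""
    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        """Return True if this URL was already scheduled; record it otherwise."""
        fp = sha1(url.encode("utf-8")).hexdigest()
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

With such a filter in front of the scheduler, yielding the same author page from fifty different quotes still produces only one actual request.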
Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links,
check out the CrawlSpider class for a generic
spider that implements a small rules engine that you can use to write your
crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page,
using a trick to pass additional data to the callbacks.
Using spider arguments¶
You can provide command line arguments to your spiders by using the -a
option when running them:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
These arguments are passed to the Spider’s __init__ method and become
spider attributes by default.
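Under the hood, the base Spider's __init__ simply copies keyword arguments onto the instance, which is why -a tag=humor surfaces as self.tag. A stripped-down, stdlib-only illustration of that mechanism (MiniSpider is a hypothetical stand-in, not a Scrapy class):

```python
class MiniSpider:
    """Mimics how Scrapy's Spider.__init__ turns -a options into attributes."""
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

spider = MiniSpider(tag="humor")    # like: scrapy crawl quotes -a tag=humor
tag = getattr(spider, "tag", None)  # the tutorial's lookup pattern
print(tag)  # humor
```

The getattr(self, 'tag', None) pattern in the spider below exists precisely because the attribute only appears when the argument was actually passed.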
In this example, the value provided for the tag argument will be available
via self.tag. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
If you pass the tag=humor argument to this spider, you’ll notice that it
will only visit URLs from the humor tag, such as https://quotes.toscrape.com/tag/humor.
You can learn more about handling spider arguments here.
Next steps¶
This tutorial covered only the basics of Scrapy, but there’s a lot of other
features not mentioned here. Check the What else? section in
Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the
command-line tool, spiders, selectors and other things the tutorial hasn’t covered like
modeling the scraped data. If you prefer to play with an example project, check
the Examples section.
How to Run Scrapy From a Script – Towards Data Science
Forget about Scrapy’s framework and write it all in a Python script. Scrapy is a great framework to use for scraping projects, but did you know there is a way to run Scrapy straight from a script? Looking at the documentation, there are two ways to run Scrapy: using the Scrapy API or the Scrapy command-line tool.
In this article, you will learn:
Why you would use Scrapy from a script
The basic script to use every time you want to access Scrapy from an individual script
How to specify customised Scrapy settings
How to specify HTTP requests for Scrapy to invoke
How to process those HTTP responses using Scrapy under one script
Why Use Scrapy from a Script?
Scrapy can be used for heavy-duty scraping work; however, there are a lot of projects that are quite small and don’t require the whole Scrapy framework. This is where using Scrapy in a Python script comes in. No need to use the whole framework: you can do it all from a Python script. Let’s see what the basics of this look like before fleshing out some of the necessary settings to scrape.
The key to running Scrapy in a Python script is the CrawlerProcess class. This is a class of the crawler module, and it provides the engine to run Scrapy within a Python script. Within the CrawlerProcess class code, Python’s Twisted framework is imported. Twisted is a Python framework that is used for input and output processes, like HTTP requests for example. It does this through what’s called a Twisted event reactor. Scrapy is built on top of Twisted! We won’t go into too much detail here, but needless to say, the CrawlerProcess class imports a Twisted reactor which listens for events like multiple HTTP requests. This is at the heart of how Scrapy works. CrawlerProcess assumes that a Twisted reactor is NOT used by anything else, like for example another spider.
With that, we have the code:

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Now for us to use the Scrapy framework, we must create our spider; this is done by creating a class which inherits from scrapy.Spider. scrapy.Spider is the most basic spider that we must derive from in all Scrapy projects. We have to give this spider a name for it to run. Spiders will require a couple of functions and a URL to scrape, but for this example we will omit this for the moment.
You see if __name__ == "__main__". This is used as a best practice in Python: when we write a script, we want it to be able to run the code but also to be able to import that code somewhere else.
We instantiate the class CrawlerProcess first to get access to the functions we want. CrawlerProcess has two functions we are interested in: crawl and start. We use crawl to start the spider we created. We then use the start function to start the Twisted reactor, the engine that processes and listens to the HTTP requests we want.
The Scrapy framework provides a list of settings that it will use automatically; however, for working with the Scrapy API we have to provide the settings explicitly. The settings we define are how we can customise our spiders. The spider class has a variable called custom_settings, which can be used to override the settings Scrapy automatically uses. We have to create a dictionary of our settings to do this, as the custom_settings variable is set to None by default.
You may want to use some or most of the settings Scrapy provides, in which case you could copy them from there. Alternatively, a list of the built-in settings can be found in the Scrapy documentation.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1
    }
We have shown how to create a spider and define the settings, but we haven’t specified any URLs to scrape, or how we want to specify the requests to the website we want to get data from (for example, parameters and headers).
When we create a spider, we also get a method called start_requests(). This will create the requests for any URL we want. There are two ways to use this method: 1) by defining the start_urls attribute, or 2) by implementing our own start_requests function. The shortest way is by defining start_urls. We define it as a list of URLs we want to get; by specifying this variable we automatically use the default start_requests() to go through each one of our URLs.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1
    }
    start_urls = ['URL1', 'URL2']

However, notice how if we do this, we can’t specify our headers or anything else we want to go along with the request? This is where implementing our own start_requests method comes in: we define the variables we want to go along with the request, then implement start_requests so we can make use of the headers we want, as well as decide where we want the response to go.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1
    }
    headers = {}

    def start_requests(self):
        yield scrapy.Request(url, headers=self.headers)

Here we use the Request class which, when given a URL, will make the HTTP request and return a response. (Note that scrapy.Request has no params argument; any query parameters must be encoded into the URL itself.)
You will notice how we didn’t specify a callback? That is, we didn’t specify where Scrapy should send the response to the requests we just told it to get. Let’s fix that: by default Scrapy expects the callback method to be the parse function, but it could be anything we want it to be.

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1
    }
    headers = {}

    def start_requests(self):
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print(response.body)

Here we have defined the function parse, which accepts a response variable; remember, this is created when we ask Scrapy to do the HTTP request.
We then ask Scrapy to print the response body. With that, we now have the basics of running Scrapy in a Python script. We can use all the same methods; we just have to do a bit of configuring first.
To recap:
Why might you use the Scrapy framework? When is importing Scrapy in a Python script useful?
What does the CrawlerProcess class do?
Can you recall the basic script used to start Scrapy within a Python script?
How do you add Scrapy settings in your Python script?
Why might you use a start_requests function instead of start_urls?
Please see my blog for further details about what I’m up to project-wise; for more tech/coding-related content, sign up to my newsletter. I’d be grateful for any comments, and if you want to collaborate or need help with Python, please do get in touch.
Web Scraping in Python using Scrapy (with multiple examples)
Overview
This article teaches you web scraping using Scrapy, a library for scraping the web using Python
Learn how to use Python for scraping Reddit & e-commerce websites to collect data
Introduction
The explosion of the internet has been a boon for data science enthusiasts. The variety and quantity of data that is available today through the internet is like a treasure trove of secrets and mysteries waiting to be solved. For example, you are planning to travel: how about scraping a few travel recommendation sites, pulling out comments about various things to do, and seeing which property is getting a lot of positive responses from the users! The list of use cases is endless.
Yet, there is no fixed methodology to extract such data and much of it is unstructured and full of noise.
Such conditions make web scraping a necessary technique for a data scientist’s toolkit. As it is rightfully said,
Any content that can be viewed on a webpage can be scraped. Period.
With the same spirit, you will be building different kinds of web scraping systems using Python in this article and will learn some of the challenges and ways to tackle them.
By the end of this article, you would know a framework to scrape the web and would have scraped multiple websites – let’s go!
Note- We have created a free course for web scraping using BeautifulSoup library. You can check it out here- Introduction to Web Scraping using Python.
Table of Contents
Overview of Scrapy
Write your first Web Scraping code with Scrapy
Set up your system
Scraping Reddit: Fast Experimenting with Scrapy Shell
Writing Custom Scrapy Spiders
Case Studies using Scrapy
Scraping an E-Commerce site
Scraping Techcrunch: Create your own RSS Feed Reader
1. Overview of Scrapy
Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.
As diverse as the internet is, there is no “one size fits all” approach to extracting data from websites. Many a time ad hoc approaches are taken, and if you start writing code for every little task you perform, you will eventually end up creating your own scraping framework. Scrapy is that framework.
With Scrapy you don’t need to reinvent the wheel.
Note: There are no specific prerequisites of this article, a basic knowledge of HTML and CSS is preferred. If you still think you need a refresher, do a quick read of this article.
2. Write your first Web Scraping code with Scrapy
We will first quickly take a look at how to setup your system for web scraping and then see how we can build a simple web scraping system for extracting data from Reddit website.
2.1 Set up your system
Scrapy supports both Python 2 and Python 3. If you’re using Anaconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and OS X.
To install Scrapy using conda, run:
conda install -c conda-forge scrapy
Alternatively, if you’re on Linux or Mac OSX, you can directly install scrapy by:
pip install scrapy
Note: This article will follow Python 2 with Scrapy.
2.2 Scraping Reddit: Fast Experimenting with Scrapy Shell
Recently there was a season launch of a prominent TV series (GoTS7) and the social media was on fire, people all around were posting memes, theories, their reactions etc. I had just learned scrapy and was wondering if it can be used to catch a glimpse of people’s reactions?
Scrapy Shell
I love the python shell, it helps me “try out” things before I can implement them in detail. Similarly, scrapy provides a shell of its own that you can use to experiment. To start the scrapy shell in your command line type:
scrapy shell
Woah! Scrapy wrote a bunch of stuff. For now, you don’t need to worry about it. In order to get information from Reddit (about GoT) you will have to first run a crawler on it. A crawler is a program that browses web sites and downloads content. Sometimes crawlers are also referred as spiders.
About Reddit
Reddit is a discussion forum website. It allows users to create “subreddits” for a single topic of discussion. It supports all the features that conventional discussion portals have like creating a post, voting, replying to post, including images and links etc. Reddit also ranks the post based on their votes using a ranking algorithm of its own.
A crawler needs a starting point to start crawling (downloading) content from. Let’s see: on googling “game of thrones Reddit” I found that Reddit has a subreddit exclusively for Game of Thrones at www.reddit.com/r/gameofthrones/; this will be the crawler’s start URL.
To run the crawler in the shell type:
fetch("https://www.reddit.com/r/gameofthrones/")
When you crawl something with scrapy it returns a “response” object that contains the downloaded information. Let’s see what the crawler has downloaded:
view(response)
This command will open the downloaded page in your default browser.
Wow that looks exactly like the website, the crawler has successfully downloaded the entire web page.
Let’s see how does the raw content looks like:
print response.text
That’s a lot of content but not all of it is relevant. Let’s create list of things that need to be extracted:
Title of each post
Number of votes it has
Number of comments
Time of post creation
Extracting title of posts
Scrapy provides ways to extract information from HTML based on css selectors like class, id etc. Let’s find the css selector for title, right click on any post’s title and select “Inspect” or “Inspect Element”:
This will open the the developer tools in your browser:
As it can be seen, the CSS class “title” is applied to all <p> tags that have titles. This will be helpful in filtering out titles from the rest of the content in the response object:
response.css(".title::text").extract()
Here css(..) is a function that helps extract content based on the CSS selector passed to it. The ‘.’ is used with title because it’s a CSS class. You also need to use ::text to tell your scraper to extract only the text content of the matching elements. This is done because scrapy otherwise returns the matching element along with the HTML code. Look at the following two examples:
Notice how “::text” helped us filter and extract only the text content.
Extracting Vote counts for each post
Now this one is tricky, on inspecting, you get three scores:
The “score” class is applied to all three, so it can’t be used alone; a unique selector is required. On further inspection, it can be seen that the selector that uniquely matches the vote count we need is the one that contains both “score” and “unvoted”.
When more than one selector is required to identify an element, we use them both. Also, since both are CSS classes, we have to use “.” with their names. Let’s try it out first by extracting the first element that matches:
response.css(".score.unvoted").extract_first()
See that the number of votes of the first post is correctly displayed. Note that on Reddit, the votes score is dynamic based on the number of upvotes and downvotes, so it’ll be changing in real time. We will add “::text” to our selector so that we only get the vote value and not the complete vote element. To fetch all the votes:
response.css(".score.unvoted::text").extract()
Note: Scrapy has two functions to extract the content: extract() and extract_first().
Dealing with relative time stamps: extracting time of post creation
On inspecting the post it is clear that the “time” element contains the time of the post.
There is a catch here though: this is only the relative time (16 hours ago, etc.) of the post. This doesn’t give any information about the date or time zone the time is in. In case we want to do some analytics, we won’t be able to know by which date we have to calculate “16 hours ago”. Let’s inspect the time element a little more:
The “title” attribute of time has both the date and the time in UTC. Let’s extract this instead:
response.css("time::attr(title)").extract()
The ::attr(attributename) is used to get the value of the specified attribute of the matching element.
Extracting Number of comments:
I leave this as a practice assignment for you. If you have any issues, you can post them here and the community will help you out.
So far:
response – An object that the scrapy crawler returns. This object contains all the information about the downloaded content.
response.css(..) – Matches the element with the given CSS selectors.
extract_first(..) – Extracts the “first” element that matches the given criteria.
extract(..) – Extracts “all” the elements that match the given criteria.
Note: CSS selectors are a very important concept as far as web scraping is considered, you can read more about it here and how to use CSS selectors with scrapy.
2.3 Writing Custom Spiders
As mentioned above, a spider is a program that downloads content from web sites or a given URL. When extracting data on a larger scale, you would need to write custom spiders for different websites since there is no “one size fits all” approach in web scraping owing to diversity in website designs. You also would need to write code to convert the extracted data to a structured format and store it in a reusable format like CSV, JSON, excel etc. That’s a lot of code to write, luckily scrapy comes with most of these functionality built in.
Creating a scrapy project
Let’s exit the scrapy shell first and create a new scrapy project:
scrapy startproject ourfirstscraper
This will create a folder “ourfirstscraper” with the following structure:
For now, the two most important files are:
settings.py – This file contains the settings you set for your project; you’ll be dealing a lot with it.
spiders/ – This folder is where all your custom spiders will be stored. Every time you ask scrapy to run a spider, it will look for it in this folder.
Creating a spider
Let’s change directory into our first scraper and create a basic spider “redditbot”:
scrapy genspider redditbot www.reddit.com/r/gameofthrones/
This will create a new spider “redditbot.py” in your spiders/ folder with a basic template:
Few things to note here:
name: Name of the spider, in this case it is “redditbot”. Naming spiders properly becomes a huge relief when you have to maintain hundreds of spiders.
allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed.
parse(self, response): This function is called whenever the crawler successfully crawls a URL. Remember the response object from earlier? This is the same response object that is passed to the parse(.. ).
After every successful crawl the parse(..) method is called, and so that’s where you write your extraction logic. Let’s add the logic we wrote earlier to extract titles, time, votes etc. in the parse function:
def parse(self, response):
#Extracting the content using css selectors
titles = response.css('.title::text').extract()
votes = response.css('.score.unvoted::text').extract()
times = response.css('time::attr(title)').extract()
comments = response.css('.comments::text').extract()
#Give the extracted content row wise
for item in zip(titles, votes, times, comments):
#create a dictionary to store the scraped info
scraped_info = {
‘title’: item[0],
‘vote’: item[1],
‘created_at’: item[2],
‘comments’: item[3], }
#yield or give the scraped info to scrapy
yield scraped_info
Note: Here yield scraped_info does all the magic. This line returns the scraped info(the dictionary of votes, titles, etc. ) to scrapy which in turn processes it and stores it.
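The zip-and-yield pattern above is plain Python: parallel lists of extracted fields are stitched row-wise into dictionaries. A quick stdlib demonstration with made-up values standing in for the scraped lists:

```python
# Hypothetical extracted lists, one entry per post, all the same length
titles = ["Post A", "Post B"]
votes = ["120", "87"]
times = ["2017-07-17 08:00:00 UTC", "2017-07-17 09:30:00 UTC"]
comments = ["45 comments", "12 comments"]

# zip() walks the lists in lockstep, producing one tuple per post
rows = [
    {"title": t, "vote": v, "created_at": ts, "comments": c}
    for t, v, ts, c in zip(titles, votes, times, comments)
]
print(rows[0]["vote"])  # 120
```

One caveat of this pattern: zip stops at the shortest list, so if one selector misses an element on some post, fields silently shift out of alignment.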
Save the file and head back to shell. Run the spider with the following command:
scrapy crawl redditbot
Scrapy would print a lot of stuff on the command line. Let’s focus on the data.
Notice that all the data is downloaded and extracted in a dictionary like object that meticulously has the votes, title, created_at and comments.
Exporting scraped data as a csv
Getting all the data on the command line is nice but as a data scientist, it is preferable to have data in certain formats like CSV, Excel, JSON etc. that can be imported into programs. Scrapy provides this nifty little functionality where you can export the downloaded content in various formats. Many of the popular formats are already supported.
Open the settings.py file and add the following code to it:
#Export as CSV Feed
FEED_FORMAT = “csv”
FEED_URI = "reddit.csv"
And run the spider:
scrapy crawl redditbot
This will now export all scraped data into a file reddit.csv. Let’s see how the CSV looks:
What happened here:
FEED_FORMAT: The format in which you want the data to be exported. Supported formats are: JSON, JSON lines, XML and CSV.
FEED_URI: The location of the exported file.
There are a plethora of forms that scrapy support for exporting feed if you want to dig deeper you can check here and using css selectors in scrapy.
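The CSV feed export boils down to writing one header row and then one row per scraped dictionary; the same result can be sketched with the standard csv module (the data and the reddit_demo.csv file name here are illustrative):

```python
import csv

# Stand-in for the dictionaries the spider yields
scraped = [
    {"title": "Post A", "vote": "120", "created_at": "2017-07-17 08:00 UTC", "comments": "45"},
    {"title": "Post B", "vote": "87", "created_at": "2017-07-17 09:30 UTC", "comments": "12"},
]

with open("reddit_demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "vote", "created_at", "comments"])
    writer.writeheader()       # header row, like the exporter's first line
    writer.writerows(scraped)  # one line per yielded item
```

Scrapy's exporter does essentially this for you, driven by the FEED_FORMAT and FEED_URI settings, so you never touch the file-writing code yourself.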
Now that you have successfully created a system that crawls web content from a link, scrapes(extracts) selective data from it and saves it in an appropriate structured format let’s take the game a notch higher and learn more about web scraping.
3. Case studies using Scrapy
Let’s now look at a few case studies to get more experience of scrapy as a tool and its various functionalities.
The advent of the internet and smartphones has been an impetus to the e-commerce industry. With millions of customers and billions of dollars at stake, the market has started seeing a multitude of players, which in turn has led to the rise of e-commerce aggregator platforms that collect and show you information about products from across multiple portals. For example, when planning to buy a smartphone, you would want to see the prices at different platforms in a single place. What does it take to build such an aggregator platform? Here’s my small take on building an e-commerce site scraper.
As a test site, you will scrape ShopClues for 4G-Smartphones
Let’s first generate a basic spider:
scrapy genspider shopclues
This is how the ShopClues web page looks:
The following information needs to be extracted from the page:
Product Name
Product price
Product discount
Product image
Extracting image URLs of the product
On careful inspection, it can be seen that the "data-img" attribute of the <img> tag can be used to extract image URLs:
response.css("img::attr(data-img)").extract()
Extracting product name from tags
Notice that the "title" attribute of the <img> tag contains the product's full name:
response.css("img::attr(title)").extract()
Similarly, the selectors ".p_price" and ".prd_discount" can be used for the price and the discount.
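As a rough illustration of what these attribute selectors pull out, here is a standard-library sketch (no scrapy, and the markup is invented) that collects the "title" and "data-img" attributes of every <img> tag:

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like a product listing
html = '''
<div class="product"><img title="Phone A 4G" data-img="http://example.com/a.jpg">
  <span class="p_price">Rs.4999</span><span class="prd_discount">20% Off</span></div>
<div class="product"><img title="Phone B 4G" data-img="http://example.com/b.jpg">
  <span class="p_price">Rs.5999</span><span class="prd_discount">10% Off</span></div>
'''

class ImgAttrExtractor(HTMLParser):
    """Collects the title and data-img attributes of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.titles, self.images = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            self.titles.append(d.get("title"))
            self.images.append(d.get("data-img"))

parser = ImgAttrExtractor()
parser.feed(html)
print(parser.titles)  # -> ['Phone A 4G', 'Phone B 4G']
```

In the actual spider, response.css does this matching for you in a single call.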
How to download product images?
Scrapy provides reusable images pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally).
The Images Pipeline has a few extra functions for processing images. It can:
Convert all downloaded images to a common format (JPG) and mode (RGB)
Generate thumbnails
Check the images' width/height to make sure they meet a minimum constraint
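These behaviours are switched on through settings. A minimal sketch of the corresponding settings.py entries (the sizes below are illustrative assumptions, not recommendations):

```python
# settings.py (sketch; sizes are made up for illustration)
IMAGES_THUMBS = {          # generate thumbnails in these sizes
    'small': (50, 50),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110    # drop images smaller than this
IMAGES_MIN_WIDTH = 110
```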
In order to use the images pipeline to download images, it needs to be enabled in the settings.py file. Add the following lines to the file:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'tmp/images/'
Here you are basically telling scrapy to use the 'Images Pipeline', and that the location for the images should be the folder 'tmp/images/'. The final spider would now be:
import scrapy

class ShopcluesSpider(scrapy.Spider):
    # name of spider
    name = 'shopclues'
    # list of allowed domains
    allowed_domains = ['']
    # starting url
    start_urls = ['']
    # location of csv file
    custom_settings = {
        'FEED_URI': 'tmp/'}

    def parse(self, response):
        # Extract product information
        titles = response.css('img::attr(title)').extract()
        images = response.css('img::attr(data-img)').extract()
        prices = response.css('.p_price::text').extract()
        discounts = response.css('.prd_discount::text').extract()

        for item in zip(titles, prices, images, discounts):
            scraped_info = {
                'title': item[0],
                'price': item[1],
                # sets the urls for scrapy to download images
                'image_urls': [item[2]],
                'discount': item[3]}
            yield scraped_info
A few things to note here:
custom_settings: This is used to set the settings of an individual spider. Remember that settings.py applies to the whole project, so here you tell scrapy that the output of this spider should be stored as a CSV file in the "tmp" folder.
scraped_info["image_urls"]: This is the field that scrapy checks for the images' links. If you set this field to a list of URLs, scrapy will automatically download and store those images for you.
On running the spider the output can be read from “tmp/”:
You also get the images downloaded. Check the folder “tmp/images/full” and you will see the images:
Also, notice that scrapy automatically adds the download path of the image on your system in the csv:
There you have your own little e-commerce aggregator!
If you want to dig in, you can read more about scrapy's Images Pipeline here.
Scraping Techcrunch: Creating your own RSS Feed Reader
Techcrunch is one of my favourite blogs, which I follow to stay abreast of news about startups and the latest technology products. Like many blogs nowadays, TechCrunch provides its own RSS feed here:. One of scrapy's features is its ability to handle XML data with ease, and in this part you are going to extract data from TechCrunch's RSS feed.
Create a basic spider:
scrapy genspider techcrunch
Let’s have a look at the XML, the marked portion is data of interest:
Here are some observations from the page:
Each article is present between <item></item> tags.
The title of the post is in <title></title> tags.
The link to the article can be found in <link></link> tags.
The author name is enclosed between the funny-looking <dc:creator></dc:creator> tags.
Overview of XPath and XML
XPath is a query syntax used to select nodes from an XML document. It can be used to traverse an XML document, and note that XPath follows the document's hierarchy.
Extracting title of post
Let's extract the title of the first post. Similar to css(), scrapy provides the xpath() function to deal with XPath. The following code should do it:
response.xpath("//item/title").extract_first()
Output:
u’
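To see this XPath in action without a network call, here is a small sketch using Python's standard-library ElementTree on an RSS-shaped snippet (the feed content is invented). Note that ElementTree's limited XPath dialect spells the descendant axis ".//" rather than "//", and namespaced tags like <dc:creator> need a prefix map:

```python
import xml.etree.ElementTree as ET

# A minimal RSS-shaped document (invented content)
rss = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <dc:creator>Alice</dc:creator>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <dc:creator>Bob</dc:creator>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)

# Rough equivalent of xpath("//item/title").extract_first()
first_title = root.find(".//item/title").text
print(first_title)  # -> First post

# Namespaced tags require the prefix-to-URI map
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
authors = [e.text for e in root.findall(".//item/dc:creator", ns)]
print(authors)  # -> ['Alice', 'Bob']
```

Scrapy's selectors handle the full XPath syntax for you, so this stdlib sketch is only meant to make the path expressions concrete.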