Web Crawling Using Selenium Python

Web Scraping using Selenium and Python - ScrapingBee

Updated:
08 July, 2021
9 min read
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems.
Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
It is still used for testing today, but it is also used as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
Clicking on buttons
Filling forms
Scrolling
Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application, and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine:
Chrome download page
Chrome driver binary
selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
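A minimal setup sketch (assuming Python 3 and the virtualenv tool are already installed; the environment name `venv` is arbitrary):

```shell
# Create and activate an isolated environment, then install Selenium
virtualenv venv
source venv/bin/activate
pip install selenium
```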
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')
This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code).
You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), for example on a server, see the following example:
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://google.com")
print(driver.page_source)
driver.quit()
The driver.page_source property will return the full page HTML code.
Here are two other interesting WebDriver properties:
driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
Tag name
Class name
IDs
XPath
CSS selectors
We recently published an article explaining XPath. Don’t hesitate to take a look if you aren’t familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.
A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C or on macOS Cmd + Shift + C instead of having to right click + inspect each time:
find_element
There are many ways to locate an element in Selenium.
Let's say that we want to locate the h1 tag in this HTML:

<html>
    <head> ... some stuff</head>
    <body>
        <h1 class="someclass" id="greatID">Super title</h1>
    </body>
</html>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
all_links = driver.find_elements_by_tag_name('a')
Some elements aren’t easily accessible with an ID or a simple class, and that’s when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
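If you want to get a feel for XPath before pointing it at a live browser, you can experiment on a static snippet. This is a minimal sketch using Python's standard-library ElementTree, which supports a limited subset of XPath; the sample markup mirrors the h1 example above:

```python
import xml.etree.ElementTree as ET

# A static stand-in for the page; in Selenium the browser holds the DOM instead
doc = "<html><body><h1 class='someclass' id='greatID'>Super title</h1></body></html>"
root = ET.fromstring(doc)

# Relative XPath, comparable to //h1 in Selenium
h1 = root.find(".//h1")
print(h1.text)  # Super title

# Predicate on an attribute, comparable to //h1[@class='someclass']
by_class = root.find(".//h1[@class='someclass']")
print(by_class.get("id"))  # greatID
```

ElementTree only understands a small XPath subset, but it is handy for checking that a predicate matches what you think it matches.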
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements. Here are the most useful:
Accessing the text of the element with element.text
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with element.send_keys('mypassword')
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
It can help you avoid honeypots (like accidentally filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:
<input type="hidden" name="custId" value="">
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That’s a classic honeypot.
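The bot-side defense is simply to skip anything a human couldn't see. Here is a hypothetical sketch of that filtering logic using plain dictionaries; the field names and types are made up, and in real Selenium code each entry would be a WebElement checked with element.is_displayed():

```python
# Hypothetical scraped form fields (illustration only, not a real page)
form_inputs = [
    {"name": "username", "type": "text"},
    {"name": "email", "type": "email"},
    {"name": "custId", "type": "hidden"},  # the honeypot field
]

def fillable_inputs(inputs):
    """Keep only inputs a legitimate user could actually see and fill."""
    return [i for i in inputs if i["type"] != "hidden"]

print([i["name"] for i in fillable_inputs(form_inputs)])  # ['username', 'email']
```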
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
Go to the login page using driver.get()
Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using element.click()
Should be easy right? Let’s see the code:
login = driver.find_element_by_xpath("//input").send_keys(USERNAME)
password = driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
submit = driver.find_element_by_xpath("//input[@value='login']").click()
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.
So, we’re going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check if the element is None, because all of the find_element_by_* methods raise an exception if the element is not found in the DOM.
So we have to use a try/except block and catch the NoSuchElementException exception:
# don't forget: from selenium.common.exceptions import NoSuchElementException
try:
    logout_button = driver.find_element_by_id("logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
We could easily take a screenshot using:
driver.save_screenshot('screenshot.png')
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.
Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it’s simple and we don’t have to worry about these issues.
If you need to make screenshots at scale, feel free to try our new Screenshot API here.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular and React for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
Use time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.
If you use time.sleep() you will probably use an arbitrary value. The problem is that you're then either waiting too long or not long enough.
Also, the website can load slowly on your local wifi connection, but will be 10 times faster on your cloud server.
With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()

This will wait five seconds for an element located by the ID "mySuperId" to be present on the page.
There are many other interesting expected conditions like:
element_to_be_clickable
text_to_be_present_in_element
You can find more information about this in the Selenium documentation.
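Conceptually, WebDriverWait just polls a condition until it returns something truthy or the timeout expires. A minimal pure-Python sketch of that idea (poll_until and fake_find_element are made-up names for illustration, not part of the Selenium API):

```python
import time

def poll_until(condition, timeout=5.0, interval=0.1):
    """Call `condition` repeatedly until it returns a truthy value,
    or raise TimeoutError once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)

# Simulate an element that only "appears" on the third poll
state = {"calls": 0}
def fake_find_element():
    state["calls"] += 1
    return "element" if state["calls"] >= 3 else None

print(poll_until(fake_find_element, timeout=2.0, interval=0.01))  # element
```

This is why WebDriverWait waits "the exact amount of time necessary": it returns as soon as the condition holds, instead of sleeping for a fixed period.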
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let’s say you want to take a screenshot of some information, but you first need to scroll a bit to see it.
You can easily do this with Selenium:
javaScript = "window.scrollBy(0, 1000);"
driver.execute_script(javaScript)
Using a proxy with Selenium Wire
Unfortunately, Selenium's proxy handling is quite basic. For example, it can't handle proxies with authentication out of the box.
To solve this issue, you need to use Selenium Wire.
This package extends Selenium’s bindings and gives you access to all the underlying requests made by the browser.
If you need to use Selenium with a proxy with authentication this is the package you need.
pip install selenium-wire
This code snippet shows you how to quickly use your headless browser behind a proxy.
# Install the Python selenium-wire library:
# pip install selenium-wire
from seleniumwire import webdriver
# Placeholder credentials and proxy address (replace with your own)
proxy_username = "USER_NAME"
proxy_password = "PASSWORD"
proxy_url = "my.proxy.host"
proxy_port = 8886

options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}

URL = "https://example.com"  # placeholder target page

driver = webdriver.Chrome(
    executable_path="YOUR-CHROME-EXECUTABLE-PATH",
    seleniumwire_options=options,
)
driver.get(URL)
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
chrome_options = webdriver.ChromeOptions()

### This blocks images and javascript requests
chrome_prefs = {
    "profile.default_content_setting_values": {
        "images": 2,
        "javascript": 2,
    }
}
chrome_options.experimental_options["prefs"] = chrome_prefs
###

driver = webdriver.Chrome(
    executable_path=DRIVER_PATH,
    chrome_options=chrome_options,
)
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don’t hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it may be a good idea to automate them with Selenium. Just don't forget this xkcd:
Web Scraping Using Selenium Python - Analytics Vidhya

Introduction:
Machine learning is fueling today's technological marvels such as driverless cars, space flight, and image and speech recognition. However, a data science professional would need a large volume of data to build a robust and reliable machine learning model for such business problems.
Data mining, or gathering data, is a very early step in the data science life cycle. Depending on business requirements, one may have to gather data from sources like SAP servers, logs, databases, APIs, online repositories, or the web.
Tools for web scraping like Selenium can scrape a large volume of data such as text and images in a relatively short time.
Table of Contents:
What is Web Scraping
Why Web Scraping
How Web Scraping is useful
What is Selenium
Setup & tools
Implementation of Image Web Scraping using Selenium Python
Headless Chrome browser
Putting it all together
End Notes
What is Web Scraping?
Web Scraping, also called "Crawling" or "Spidering", is a technique for gathering data automatically from an online source, usually a website. While Web Scraping is an easy way to get a large volume of data in a relatively short time frame, it adds stress to the server where the source is hosted.
This is also one of the main reasons why many websites don't allow scraping at all. However, as long as it does not disrupt the primary function of the online source, it is fairly acceptable.
Why Web Scraping?
There's a large volume of data lying on the web that people can utilize to serve their business needs. So, one needs some tool or technique to gather this information from the web, and that's where the concept of Web Scraping comes into play.
How is Web Scraping useful?
Web scraping can help us extract an enormous amount of data about customers, products, people, stock markets, etc.
One can utilize the data collected from websites such as e-commerce portals, job portals, and social media channels to understand customers' buying patterns, employee attrition behavior, customer sentiments, and so on.
The most popular libraries and frameworks used in Python for Web Scraping are BeautifulSoup, Scrapy & Selenium.
In this article, we'll talk about Web Scraping using Selenium in Python. And as the cherry on top, we'll see how we can gather images from the web that you can use to build training data for your deep learning project.
What is Selenium?
Selenium is an open-source web-based automation tool. Selenium is primarily used for testing in the industry, but it can also be used for web scraping. We'll use the Chrome browser, but you can try any browser; it's almost the same.
Now let us see how to use selenium for Web Scraping.
Setup & tools:
Installation:
Install selenium using pip
pip install selenium
Install selenium using conda
conda install -c conda-forge selenium
Download Chrome Driver:
To download web drivers, you can choose either of the below methods:
You can directly download the chrome driver from the ChromeDriver download page, or
you can download it automatically using this line of code: driver = webdriver.Chrome(ChromeDriverManager().install())
You can find the complete documentation on Selenium here. The documentation is very much self-explanatory, so make sure to read it to leverage Selenium with Python.
The following methods will help us find elements in a web page (these methods return a list):
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
Now let's write some Python code to scrape images from the web.
Implementation of Image Web Scraping using Selenium Python:
Step 1: Import libraries
import os
import selenium
from selenium import webdriver
import time
from PIL import Image
import io
import requests
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException
Step 2: Install Driver
#Install Driver
driver = webdriver.Chrome(ChromeDriverManager().install())
Step 3: Specify search URL
#Specify Search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Car'))
I've used this specific URL so you don't get in trouble for using licensed images or images with copyrights.
We're searching for Car in our search URL. Paste the link into the driver.get("Your Link Here") function and run the cell. This will open a new browser window for that link.
Step 4: Scroll to the end of the page
#Scroll to the end of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # sleep_between_interactions
These lines of code help us reach the end of the page. Then we give a sleep time of 5 seconds, so we don't run into a problem where we're trying to read elements from a page that is not yet loaded.
Step 5: Locate the images to be scraped from the page
#Locate the images to be scraped from the current page
imgResults = driver.find_elements_by_xpath("//img[contains(@class, 'Q4LuWd')]")
totalResults = len(imgResults)
Now we’ll fetch all the image links present on that particular page. We will create a “list” to store those links. So, to do that go to the browser window, right-click on the page, and select ‘inspect element’ or enable the dev tools using Ctrl+Shift+I.
Now identify any attribute, such as class or id, that is common across all these images.
In our case, class="Q4LuWd" is common across all these images.
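That common class is all the XPath in the next step needs; it just gets dropped into a contains() predicate. A tiny hypothetical helper makes the pattern explicit (note that contains() also matches substrings of longer class names, so it can over-match):

```python
def xpath_for_class(tag, class_name):
    """Build a contains-class XPath selector like the one used in this tutorial."""
    return f"//{tag}[contains(@class, '{class_name}')]"

print(xpath_for_class("img", "Q4LuWd"))  # //img[contains(@class, 'Q4LuWd')]
```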
Step 6: Extract the corresponding link of each Image
As we can see, the images shown on the page are still thumbnails, not the original images. So to download each image, we need to click each thumbnail and extract the relevant information corresponding to that image.
#Click on each Image to extract its corresponding link to download
img_urls = set()
for i in range(0, len(imgResults)):
    img = imgResults[i]
    try:
        img.click()
        time.sleep(2)
        actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
        for actual_image in actual_images:
            if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                img_urls.add(actual_image.get_attribute('src'))
    except (ElementClickInterceptedException, ElementNotInteractableException) as err:
        print(err)
So, in the above snippet of code, we’re performing the following tasks-
Iterate through each thumbnail and then click it.
Make our browser sleep for 2 seconds (:P).
Find the unique HTML tag corresponding to that image to locate it on the page.
We still get more than one result for a particular image, but all we're interested in is the link for that image to download.
So, we iterate through each result for that image, extract its 'src' attribute, and then check whether 'https' is present in the 'src', since a typical web link starts with 'https'.
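The filtering itself is ordinary set-building. Here is the same check run on a few hypothetical src values (the data: URI and the URL below are made up for illustration), showing why the 'https' test separates real links from inline thumbnail data:

```python
# Hypothetical src attributes collected from the result images
candidate_srcs = [
    "data:image/jpeg;base64,/9j/4AAQ",     # inline base64 thumbnail, not a link
    "https://example.com/full-image.jpg",  # a downloadable full-size URL
    None,                                  # an <img> without a src attribute
]

img_urls = set()
for src in candidate_srcs:
    if src and "https" in src:
        img_urls.add(src)

print(img_urls)  # {'https://example.com/full-image.jpg'}
```

Using a set also deduplicates the links, since the same src can be collected more than once.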
Step 7: Download & save each image in the Destination directory
os.chdir('C:/Qurantine/Blog/WebScrapping/Dataset1')
baseDir = os.getcwd()

for i, url in enumerate(img_urls):
    file_name = f"{i}.jpg"
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(baseDir, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")
Now, finally, you have extracted the images for your project.
Note: Once you have written proper code, the browser window is not important; you can collect data without a visible browser, which is called a headless browser. To do so, replace the driver setup above with the following code.
#Headless chrome browser
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
In this case, the browser runs without a visible window, which is very helpful when deploying a solution in production.
Let's put all this code in a function to make it more organized, and implement the same idea to download 100 images for each category (e.g. Cars, Horses).
And this time we'll write our code using the idea of headless Chrome.
Putting it all together:
Step 1 – Import all required libraries
os.chdir('C:/Qurantine/Blog/WebScrapping')
Step 2 – Install Chrome Driver
#Install driver
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
In this step, we’re installing a Chrome driver and using a headless browser for web scraping.
Step 3 – Specify search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
I’ve used this specific URL to scrape copyright-free images.
Step 4 – Write a function to take the cursor to the end of the page
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # sleep_between_interactions

This snippet of code will scroll down the page.
Step 5: Write a function to get the URL of each Image
#no license issues
def getImageUrls(name, totalImgs, driver):
    driver.get(search_url.format(q=name))
    img_urls = set()
    img_count = 0
    results_start = 0
    while img_count < totalImgs:
        scroll_to_end(driver)
        thumbnail_results = driver.find_elements_by_xpath("//img[contains(@class, 'Q4LuWd')]")
        for img in thumbnail_results[results_start:]:
            img.click()
            time.sleep(2)
            actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                    img_urls.add(actual_image.get_attribute('src'))
            img_count = len(img_urls)
            if img_count >= totalImgs:
                print(f"Found: {img_count} image links")
                break
        else:
            print("Found:", img_count, "looking for more image links...")
            load_more_button = driver.find_element_by_css_selector(".mye4qd")
            driver.execute_script("document.querySelector('.mye4qd').click();")
            results_start = len(thumbnail_results)
    return img_urls
This function returns a set of image URLs for each category (e.g. Cars, horses, etc.).
Step 6: Write a function to download each Image
def downloadImages(folder_path, file_name, url):
    image_content = requests.get(url).content
    image_file = io.BytesIO(image_content)
    image = Image.open(image_file).convert('RGB')
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'wb') as f:
        image.save(f, "JPEG", quality=85)

This snippet of code will download the image from each URL.
Step 7: Write a function to save each Image in the Destination directory
def saveInDestFolder(searchNames, destDir, totalImgs, driver):
    for name in list(searchNames):
        path = os.path.join(destDir, name)
        if not os.path.isdir(path):
            os.mkdir(path)
        print('Current Path', path)
        totalLinks = getImageUrls(name, totalImgs, driver)
        print('totalLinks', totalLinks)
        if totalLinks is None:
            print('images not found for:', name)
            continue
        for i, link in enumerate(totalLinks):
            file_name = f"{i}.jpg"
            downloadImages(path, file_name, link)

searchNames = ['Car', 'horses']
destDir = './Dataset2/'
totalImgs = 5
saveInDestFolder(searchNames, destDir, totalImgs, driver)
This snippet of code will save each image in the destination directory.
I've tried my best to explain Web Scraping using Selenium with Python as simply as possible. Please feel free to comment with your queries; I'll be more than happy to answer them.
You can clone my Github repository to download the whole code & data, click here!!
About the Author
Praveen Kumar Anwla
I've been working as a Data Scientist with product-based and Big 4 audit firms for almost 5 years now. I have been working on various NLP, machine learning & cutting-edge deep learning frameworks to solve business problems. Please feel free to check out my personal blog, where I cover topics from machine learning and AI to chatbots and visualization tools (Tableau, QlikView, etc.) & various cloud platforms like Azure, IBM & AWS.
