Extract Email From Website Python

How to Make an Email Extractor in Python


· 4 min read · Updated Sep 2021 · Ethical Hacking · Web Scraping
An email extractor or harvester is a type of software used to extract email addresses from online and offline sources, generating large lists of addresses. Although these extractors can serve legitimate purposes such as marketing campaigns, unfortunately they are mainly used to send spam and phishing emails.
Since the web is the major source of information on the Internet, in this tutorial you will learn how to build such a tool in Python to extract email addresses from web pages using the requests-html library.
Because many websites load their data using JavaScript instead of rendering the HTML directly, I chose the requests-html library, as it supports JavaScript-driven websites.
Related: How to Send Emails in Python using smtplib Module.
Alright, let's get started. First, we need to install requests-html:
pip3 install requests-html
Let’s start coding:
import re
from requests_html import HTMLSession
We need the re module here because we will be extracting emails from HTML content using regular expressions. If you're not sure what a regular expression is, it is basically a sequence of characters that defines a search pattern (check this tutorial for details).
I've grabbed the most widely used and accurate regular expression for email addresses from this Stack Overflow answer:
url = "https://example.com"  # placeholder; replace with the page you want to grab emails from
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
I know, it is very long, but it is the best pattern so far for how email addresses are expressed in their general form.
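As a quick sanity check, you can run the pattern on a throwaway string (the sample addresses below are made up for illustration):
# quick test of EMAIL_REGEX on a made-up string;
# finditer() is used because the pattern contains capturing groups
sample = "reach us at john.doe@example.com or sales@mail.example.org"
for m in re.finditer(EMAIL_REGEX, sample):
    print(m.group())
# john.doe@example.com
# sales@mail.example.org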
The url string is the URL we want to grab email addresses from. I'm using a website that generates random email addresses (and loads them using JavaScript).
Let’s initiate the HTML session, which is a consumable session for cookie persistence and connection pooling:
# initiate an HTTP session
session = HTMLSession()
Now let’s send the GET request to the URL:
# get the HTTP response
r = session.get(url)
If you’re sure that the website you’re grabbing email addresses from uses JavaScript to load most of the data, then you need to execute the below line of code:
# for JavaScript-driven websites
r.html.render()
This will reload the website in Chromium and replace the HTML content with an updated version, with the JavaScript executed. Of course, this takes some time, which is why you should only do it when the website loads its data using JavaScript.
Note: Executing the render() method for the first time will automatically download Chromium for you, so it will take a while.
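If the page needs extra time to populate, render() accepts optional parameters such as timeout and sleep (the values below are illustrative, not required):
# give Chromium up to 20 seconds, then pause 1 second so scripts can finish
r.html.render(timeout=20, sleep=1)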
Now that we have the HTML content and our email address regular expression, let’s do it:
for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
    print(re_match.group())
The finditer() method returns an iterator over all non-overlapping matches in the string. For each match, the iterator returns a match object, which is why we access the matched string (the email address) using the group() method.
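Since the same address can appear several times on a page, you may prefer to collect the matches into a set first; a small variation on the loop above:
# deduplicate matches by collecting them into a set
found = set()
for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
    found.add(re_match.group())
for email in found:
    print(email)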
Here is a result of my execution:
[email protected]
Awesome! With only a few lines of code, we can grab email addresses from any web page we want.
You can extend this code to build a crawler that extracts all of a website's URLs, runs this extraction on every page it finds, and saves the results to a file. However, some websites will detect that you're a bot rather than a human browsing the site and block your IP address; in that case, you'll need to use a proxy server, as sketched below. Let us know what you did with this in the comments!
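For the proxy case, HTMLSession inherits from requests.Session, so you can pass a proxies mapping to get(); a minimal sketch with a hypothetical proxy address:
# hypothetical proxy address; substitute your own proxy server
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
r = session.get(url, proxies=proxies)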
Want to Learn More about Web Scraping?
Finally, if you want to dig more into web scraping with different Python libraries, not just requests_html, the below courses will definitely be valuable for you:
Modern Web Scraping with Python using Scrapy Splash Selenium.
Web Scraping and API Fundamentals in Python 2021.
Read Also: How to Build an XSS Vulnerability Scanner in Python.
Happy Scraping ♥
PythonSnippets.Dev – Python easter egg – import this and the …
This is the second article in the series of Python scripts. In this article, we will see how to crawl all pages of a website and fetch all the emails.
Important: Please note that some sites may not want you to crawl them. Please honour their robots.txt file; in some cases, ignoring it may lead to legal action. This article is for educational purposes only. Readers are requested not to misuse it.
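The standard library can check robots.txt for you; a minimal sketch using urllib.robotparser (the URLs are placeholders):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()
# only crawl the page if the site's robots.txt allows it
print(rp.can_fetch("*", "https://example.com/some-page"))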
Instead of explaining the code separately, I have embedded comments in the source code lines and tried to explain the code wherever I felt it was needed. Please comment in case of any query.
You might need to install some packages like requests and BeautifulSoup for this script to work. It is recommended that you create a virtual environment and install packages in it.
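For example (beautifulsoup4 provides the bs4 module imported below, and lxml is the parser passed to BeautifulSoup):
pip install requests beautifulsoup4 lxml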
import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup

# starting url. replace google with your own url.
starting_url = 'https://www.google.com'

# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])

# set of already crawled urls for email
processed_urls = set()

# a set of fetched emails
emails = set()

# process urls one by one from unprocessed_url queue until queue is empty
while len(unprocessed_urls):
    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)

    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text, 'lxml')

    # once this document is parsed and processed, find and process all the anchors, i.e. linked urls in this document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not in unprocessed list nor in processed list yet
        if not link in unprocessed_urls and not link in processed_urls:
            unprocessed_urls.append(link)
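Note that on a large site this loop will effectively run forever. A simple guard is to cap the number of crawled pages and write the collected addresses to a file afterwards (a sketch; the limit is arbitrary):
MAX_URLS = 100  # arbitrary crawl budget

# change the loop condition above to also respect the budget:
# while len(unprocessed_urls) and len(processed_urls) < MAX_URLS:

# after the loop, persist the results
with open("emails.txt", "w") as f:
    f.write("\n".join(sorted(emails)))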
Constructive feedback is always welcome.
extract-emails – PyPI
Project description
Extract emails and LinkedIn profiles from a given website.
Documentation
Requirements
Python >= 3.7
Installation
pip install extract_emails
Simple Usage
from extract_emails.browsers.requests_browser import RequestsBrowser as Browser
from extract_emails import DefaultFilterAndEmailFactory as Factory
from extract_emails import DefaultWorker
browser = Browser()
url = "https://example.com"  # placeholder; replace with the website to scan
factory = Factory(website_url=url, browser=browser)
worker = DefaultWorker(factory)
data = worker.get_data()
print(data)
"""
[
    PageData(
        website='...',
        page_url='...',
        data={'email': ['...', '...']}),
    PageData(
        website='...',
        page_url='...',
        data={'email': ['...', '...']}),
]
"""
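Assuming each PageData object exposes the data dict shown in the output above (an assumption based on the printed structure, not on the library's documentation), you can flatten the results into a single set of addresses:
# collect every extracted email across all crawled pages
all_emails = set()
for page in data:
    all_emails.update(page.data.get("email", []))
print(all_emails)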