Parse HTML table to Python list? – Stack Overflow
I’d like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.
If, for example, I had an HTML table with three columns (marked by header tags), “Event”, “Start Date”, and “End Date” and that table had 5 entries, I would like to parse through that table to get back a list of length 5 where each element is a dictionary with keys “Event”, “Start Date”, and “End Date”.
Thanks for the help!
asked Jun 12 ’11 at 22:46
You should use some HTML parsing library like lxml:

from lxml import etree

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = etree.HTML(s).find("body/table")  # etree.HTML() wraps the fragment in html/body
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
prints
{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}
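Note that this relies on the header cells being the first row directly under the table element. Many real-world pages wrap rows in a tbody element, in which case iterating the table element directly yields the tbody rather than the rows. A small variant (my addition, not part of the original answer) that handles both layouts:

rows = iter(table.iter('tr'))  # iterate every tr descendant, with or without tbody
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))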
answered Jun 12 ’11 at 22:59
Sven Marnach
Hands down the easiest way to parse an HTML table is to use pandas.read_html() – it accepts both URLs and raw HTML.
import pandas as pd
url = r''  # URL of the page containing the table of interest
tables = pd.read_html(url)  # Returns list of all tables on page
sp500_table = tables[0]     # Select table of interest
Only downside is that read_html() doesn’t preserve hyperlinks.
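If you do need the links, one workaround (a sketch of my own, not part of this answer) is to parse the anchors separately with BeautifulSoup and attach them to the DataFrame. It assumes the page's first table is the one read_html returned and that each body row contains at most one link:

import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get(url).text
sp500_table = pd.read_html(html)[0]  # cell text only, no hrefs

soup = BeautifulSoup(html, 'html.parser')
links = [row.find('a')['href'] if row.find('a') else None
         for row in soup.find('table').find_all('tr')
         if row.find('td')]  # keep only body rows, skip the header
sp500_table['Link'] = links  # lengths match under the assumptions above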
answered Jul 14 ’17 at 23:48
zelusp
Sven Marnach's excellent solution is directly translatable to ElementTree, which is part of recent Python distributions:

from xml.etree import ElementTree as ET

table = ET.XML(s)  # s is the same table string as above
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

same output as Sven Marnach's answer…
Hugo
answered Sep 6 '11 at 6:46
If the HTML is not XML you can't do it with etree. But even then, you don't have to use an external library for parsing an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I've put the code of the simple derived HTMLParser class in a GitHub repo.
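The repo itself isn't reproduced here, but a minimal sketch of such a derived class (my simplified approximation, not the repo's actual code) might look like this:

from html.parser import HTMLParser

class HTMLTableParser(HTMLParser):
    """Collect every table on the page as a 2D list of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all parsed tables
        self._row = None   # cells of the row currently being built
        self._cell = None  # text fragments of the cell currently being built

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None and self.tables:
            self.tables[-1].append(self._row)
            self._row = None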
You can use that class (here named HTMLTableParser) the following way:
import urllib.request
from html_table_parser import HTMLTableParser

target = ''  # URL of the page to scrape

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
The output of this is a list of 2D lists representing tables. It might look like this:
[[[‘ ‘, ‘ Anmelden ‘]],
[[‘Land’, ‘Code’, ‘Für Kunden von’],
[‘Vereinigte Staaten’, ‘40404’, ‘(beliebig)’],
[‘Kanada’, ‘21212’, ‘(beliebig)’],…
[‘3424486444’, ‘Vodafone’],
[‘ Zeige SMS-Kurzwahlen für andere Länder ‘]]]
answered Mar 11 ’14 at 8:31
schmijos
Parsing HTML Tables in Python with BeautifulSoup and pandas
Something that seems daunting at first when switching from R to Python is replacing all the ready-made functions R has. For example, R has a nice CSV reader out of the box. Python users will eventually find pandas, but what about other R libraries like their HTML Table Reader from the xml package? That’s very helpful for scraping web pages, but in Python it might take a little more work. So in this post, we’re going to write a brief but robust HTML table parser.
Our parser is going to be built on top of the Python package BeautifulSoup. It’s a convenient package and easy to use. Our use will focus on the “find_all” function, but before we start parsing, you need to understand the basics of HTML terminology.
An HTML document is built out of a few fundamental pieces, the most basic being the tag. The general format of a tag is <tag attribute="value">content</tag>, and a tag can have attributes, each consisting of a property and a value. A tag we are interested in is the table tag, which defines a table in a website. This table tag has many elements. An element is a component of the page which typically contains content. For a table in HTML, the rows are designated by tr tags, and the column content of each row sits inside td tags. A typical example is

<table>
  <tr>
    <td>Hello!</td>
    <td>Table</td>
  </tr>
</table>
It turns out that most sites keep data you’d like to scrape in tables, and so we’re going to learn to parse them.
Parsing a Table in BeautifulSoup
To parse the table, we are going to use the Python library BeautifulSoup. It constructs a tree from the HTML and gives you an API to access different elements of the webpage.
Let’s say we already have our table object returned from BeautifulSoup. To parse the table, we’d like to grab a row, take the data from its columns, and then move on to the next row ad nauseam. In the next bit of code, we define a website that is simply the HTML for a table. We load it into BeautifulSoup and parse it, returning a pandas data frame of the contents.
import pandas as pd
from bs4 import BeautifulSoup

html_string = '''
  <table>
    <tr>
      <td> Hello! </td>
      <td> Table </td>
    </tr>
  </table>
'''

soup = BeautifulSoup(html_string, 'lxml')  # Parse the HTML as a string
table = soup.find_all('table')[0]  # Grab the first table

new_table = pd.DataFrame(columns=range(0, 2), index=[0])  # I know the size

row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_marker, column_marker] = column.get_text()
        column_marker += 1

new_table
As you can see, we grab all the tr elements from the table, followed by grabbing the td elements one at a time. We use the get_text() method on each td element (called column in each iteration) and put it into our Python object representing a table (it will eventually be a pandas DataFrame).
Now that we have our plan to parse a table, we need to figure out how to get the HTML in the first place. That's actually the easier part! We're going to use the requests package in Python.
import requests

url = ''  # URL of the page with the stats table
response = requests.get(url)
response.text[:100]  # Access the HTML with the text property
The output is just the first 100 characters of the raw HTML, so we know the request worked. From here we can hand response.text to BeautifulSoup and wrap the row-and-column logic from above into a reusable class:

class HTMLTableParser:

    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table['id'], self.parse_html_table(table))
                for table in soup.find_all('table')]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):
            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)

            # Handle column names if we find them
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())

        # Safeguard on Column Titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns, index=range(0, n_rows))

        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1

        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df
Let’s do an example where we scrape a table from a website. We initialize the parser object and grab the table using our code above:
hp = HTMLTableParser()
table = hp.parse_url(url)[0][1]  # Grabbing the table from the tuple
table.head()
   Rank          Player Team  Points  Games   Avg
0     1      Cam Newton  CAR   389.1     16  24.3
1     2       Tom Brady   NE   343.7     16  21.5
2     3  Russell Wilson  SEA   336.4     16  21.0
3     4   Blake Bortles  JAC   316.1     16  19.8
4     5   Carson Palmer  ARI   309.2     16  19.3
If you had looked at the URL above, you'd have seen that we were parsing QB stats from the 2015 season. Our data has been prepared in such a way that we can immediately start an analysis.
%matplotlib inline
import matplotlib.pyplot as plt

avg = table['Avg']
plt.hist(avg, bins=50)
plt.title('Average QB Points Per Game in 2015')
As you can see, this code may find its way into some scraper scripts once football season starts again, but it's perfectly capable of scraping any page with an HTML table. The code will actually scrape every table on a page, and you can just select the one you want from the resulting list. Happy scraping!
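One caveat: parse_url() above indexes table['id'], so a page containing any table without an id attribute will raise a KeyError. A hedged tweak (mine, not the original post's code) is to fall back to an empty string:

# inside HTMLTableParser.parse_url, replace the list comprehension with:
return [(table.get('id', ''), self.parse_html_table(table))
        for table in soup.find_all('table')]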

Reading HTML tables with Pandas – Practical Business Python
Introduction
The pandas read_html() function is a quick and convenient way to turn an HTML
table into a pandas DataFrame. This function can be useful for quickly incorporating tables
from various websites without figuring out how to scrape the site’s HTML.
However, there can be some challenges in cleaning and formatting the data before analyzing it. In this article, I will discuss how to use pandas read_html() to read and clean several Wikipedia HTML tables so that you can use them for further numeric analysis.
Basic Usage
For the first example, we will try to parse this table from the Politics section on
the Minnesota wiki page.
The basic usage of pandas read_html() is pretty simple and works well on many Wikipedia pages since the tables are not complicated. To get started, I am including some extra imports we will use for data cleaning for more complicated examples:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota')
The unique point here is that table_MN is a list of all the tables on the page:

print(f'Total tables: {len(table_MN)}')
Total tables: 38
With 38 tables, it can be challenging to find the one you need. To make the table selection easier, use the match parameter to select a subset of tables. We can use the caption "Election results from statewide races" to select the table:

table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match='Election results from statewide races')
len(table_MN)
1

df = table_MN[0]
df.head()
   Year     Office    GOP    DFL Others
0  2018   Governor  42.4%  53.9%   3.7%
1  2018    Senator  36.2%  60.3%   3.4%
2  2018    Senator  42.4%  53.0%   4.6%
3  2016  President  44.9%  46.4%   8.6%
4  2014   Governor  44.5%  50.1%   5.4%
Pandas makes it easy to read in the table and also handles the year column that spans multiple
rows. This is an example where it is easier to use pandas than to try to scrape it all yourself.
Overall, this looks ok until we look at the data types with df.info():
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Year 24 non-null int64
1 Office 24 non-null object
2 GOP 24 non-null object
3 DFL 24 non-null object
4 Others 24 non-null object
dtypes: int64(1), object(4)
memory usage: 1.1+ KB
We need to convert the GOP, DFL and Others columns to numeric values if we want to do any analysis. If we try:

df['GOP'].astype('float')

We get an error:

ValueError: could not convert string to float: '42.4%'
The most likely culprit is the % sign. We can get rid of it using pandas' replace() function. I covered this in some detail in a previous article.

df['GOP'].replace({'%': ''}, regex=True).astype('float')
Which looks good:
0     42.4
1     36.2
2     42.4
3     44.9
<...>
21    63.3
22    49.1
23    31.9
Name: GOP, dtype: float64
Note that I had to use the regex=True parameter for this to work since the % is a part of the string and not the full string value.

Now we can replace all the % values and convert the columns to numbers using pd.to_numeric() and apply():

df = df.replace({'%': ''}, regex=True)
df[['GOP', 'DFL', 'Others']] = df[['GOP', 'DFL', 'Others']].apply(pd.to_numeric)
2 GOP 24 non-null float64
3 DFL 24 non-null float64
4 Others 24 non-null float64
dtypes: float64(3), int64(1), object(1)
   Year     Office   GOP   DFL  Others
0  2018   Governor  42.4  53.9     3.7
1  2018    Senator  36.2  60.3     3.4
2  2018    Senator  42.4  53.0     4.6
3  2016  President  44.9  46.4     8.6
4  2014   Governor  44.5  50.1     5.4
This basic process works well. The next example is a little trickier.
More Advanced Data Cleaning
The previous example showed the basic concepts. Frequently more cleaning is needed.
Here is an example that was a little trickier. This example continues to use Wikipedia
but the concepts apply to any site that has data in an HTML table.
What if we wanted to parse the US GDP table shown below?
This one was a little harder to use match to get only one table, but matching on 'Nominal GDP' gets the table we want as the first one in the list.

table_GDP = pd.read_html('https://en.wikipedia.org/wiki/Economy_of_the_United_States', match='Nominal GDP')
df_GDP = table_GDP[0]
df_GDP.info()
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 0   Year                                              41 non-null  object
 1   Nominal GDP(in bil. US-Dollar)                    41 non-null  float64
 2   GDP per capita(in US-Dollar)                      41 non-null  int64
 3   GDP growth(real)                                  41 non-null  object
 4   Inflation rate(in percent)                        41 non-null  object
 5   Unemployment (in percent)                         41 non-null  object
 6   Budget balance(in % of GDP)[107]                  41 non-null  object
 7   Government debt held by public(in % of GDP)[108]  41 non-null  object
 8   Current account balance(in % of GDP)              41 non-null  object
dtypes: float64(1), int64(1), object(7)
memory usage: 3.0+ KB
Not surprisingly, we have some cleanup to do. We can try to remove the % like we did last time:

df_GDP['GDP growth(real)'].replace({'%': ''}, regex=True).astype('float')

Unfortunately we get this error:

ValueError: could not convert string to float: '−5.9\xa0'
The issue here is that we have a hidden character, \xa0, that is causing some errors. This is a "non-breaking Latin1 (ISO 8859-1) space".
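A quick way to see exactly what is in the offending cell is to print its repr(); this snippet is an illustrative check of my own, not from the original article:

val = df_GDP['GDP growth(real)'].iloc[0]  # the 2020 row that failed to convert
print(repr(val))  # '−5.9\xa0' (the trailing \xa0 is the non-breaking space)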
One option I played around with was directly removing the value using replace. It worked, but I worried about whether or not it would break with other characters in the future. After going down the unicode rabbit hole, I decided to use normalize to clean this value. I encourage you to read this article for more details on the rationale for my approach.
I also have found issues with extra spaces getting into the data in some of the other tables.
I built a small function to clean all the text values. I hope others will find this helpful:
def clean_normalize_whitespace(x):
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x
I can run this function on the entire DataFrame using applymap:

df_GDP = df_GDP.applymap(clean_normalize_whitespace)
Be cautious about using applymap: it is a very inefficient pandas function and its performance can be very slow, so you should be judicious in using it. You should not use it very often, but in this case the DataFrame is small and cleaning like this is tricky, so I think it is a useful trade-off.

One thing that applymap misses is the columns. Let's look at one column in more detail:
‘Government debt held by public(in\xa0% of GDP)[108]’
We have that dreaded \xa0 in the column names. There are a couple of ways we could go about cleaning the columns, but I'm going to use clean_normalize_whitespace() on them by converting the columns to a series and using apply to run the function. Future versions of pandas may make this a little easier.
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)
df_GDP.columns[7]
'Government debt held by public(in % of GDP)[108]'
Now we have some of the hidden characters cleaned out. What next? Let's try it out again:

ValueError: could not convert string to float: '−5.9 '

This one is really tricky. If you look really closely, you might be able to tell that the − looks a little different than the -. It's hard to see, but there is actually a difference between the unicode dash and minus. Ugh.
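For the skeptical, Python's unicodedata module can name the two characters (an illustrative check of my own, not part of the original article):

import unicodedata
unicodedata.name('−')  # 'MINUS SIGN' (U+2212), which Wikipedia uses
unicodedata.name('-')  # 'HYPHEN-MINUS' (U+002D), which float() expects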
Fortunately, we can use replace to clean that up too:

df_GDP['GDP growth(real)'].replace({'%': '', '−': '-'}, regex=True).astype('float')
0    -5.9
1     2.2
2     3.0
3     2.3
4     1.7
      ...
38   -1.8
39    2.6
40   -0.2
Name: GDP growth(real), dtype: float64
One other column we need to look at is the Year column. For 2020, it contains "2020 (est)", which we want to get rid of before converting the column to an int. I can add the pattern to the dictionary, but have to escape the parentheses since they are special characters in a regular expression:

df_GDP['Year'].replace({'%': '', '−': '-', '\(est\)': ''}, regex=True).astype('int')
0     2020
1     2019
2     2018
3     2017
4     2016
      ...
40    1980
Name: Year, dtype: int64
Before we wrap it up and assign these values back to our DataFrame, there is one other item to discuss. Some of these columns should be integers and some are floats. If we use pd.to_numeric() we don't have that much flexibility. Using astype() we can control the numeric type, but we don't want to have to manually type this for each column. The astype() function can take a dictionary of column names and data types. This is really useful and I did not know this until I wrote this article. Here is how we can define the column data type mapping:
col_type = {
    'Year': 'int',
    'Nominal GDP(in bil. US-Dollar)': 'float',
    'GDP per capita(in US-Dollar)': 'int',
    'GDP growth(real)': 'float',
    'Inflation rate(in percent)': 'float',
    'Unemployment (in percent)': 'float',
    'Budget balance(in % of GDP)[107]': 'float',
    'Government debt held by public(in % of GDP)[108]': 'float',
    'Current account balance(in % of GDP)': 'float'
}
Here's a quick hint. Typing this dictionary is slow. Use this shortcut to build up a dictionary of the columns with float as the default value:

dict.fromkeys(df_GDP.columns, 'float')
{'Year': 'float',
 'Nominal GDP(in bil. US-Dollar)': 'float',
 'GDP per capita(in US-Dollar)': 'float',
 ...}
I also created a single dictionary with the values to replace:

clean_dict = {'%': '', '−': '-', '\(est\)': ''}

Now we can call replace on this DataFrame, convert to the desired types and get our clean numeric values:

df_GDP = df_GDP.replace(clean_dict, regex=True).replace({'-n/a ': np.nan}).astype(col_type)
 0   Year                                              41 non-null  int64
 1   Nominal GDP(in bil. US-Dollar)                    41 non-null  float64
 2   GDP per capita(in US-Dollar)                      41 non-null  int64
 3   GDP growth(real)                                  41 non-null  float64
 4   Inflation rate(in percent)                        41 non-null  float64
 5   Unemployment (in percent)                         41 non-null  float64
 6   Budget balance(in % of GDP)[107]                  40 non-null  float64
 7   Government debt held by public(in % of GDP)[108]  41 non-null  float64
 8   Current account balance(in % of GDP)              40 non-null  float64
dtypes: float64(7), int64(2)
memory usage: 3.0 KB
Which looks like this now:
Year | Nominal GDP(in bil. US-Dollar) | GDP per capita(in US-Dollar) | GDP growth(real) | Inflation rate(in percent) | Unemployment (in percent) | Budget balance(in % of GDP)[107] | Government debt held by public(in % of GDP)[108] | Current account balance(in % of GDP)
2020 | 20234.0 | 57589 | -5.9 | 0.62 | 11.1 | NaN | 79.9 | NaN
2019 | 21439.0 | 64674 | 2.2 | 1.80 | 3.5 | -4.6 | 78.9 | -2.5
2018 | 20580.2 | 62869 | 3.0 | 2.40 | 3.9 | -3.8 | 77.8 | -2.4
2017 | 19519.4 | 60000 | 2.3 | 2.10 | 4.4 | -3.4 | 76.1 | -2.3
2016 | 18715.0 | 57878 | 1.7 | 1.30 | 4.1 | … | 76.4 | …
Just to prove it works, we can plot the data too:

plt.style.use('seaborn-whitegrid')
df_GDP.plot(x='Year', y=['Inflation rate(in percent)', 'Unemployment (in percent)'])
If you are closely following along, you may have noticed the use of a chained replace call: .replace({'-n/a ': np.nan}). The reason I put that in there is that I could not figure out how to get the n/a cleaned using the first dictionary replace. I think the issue is that I could not predict the order in which this data would get cleaned, so I decided to execute the replace in two stages. I'm confident that if there is a better way, someone will point it out in the comments.
Full Solution
Here is a compact example of everything we have done. Hopefully this is useful to others that
try to ingest data from HTML tables and use them in a pandas DataFrame:
import pandas as pd
import numpy as np
from unicodedata import normalize


def clean_normalize_whitespace(x):
    """ Normalize unicode characters and strip trailing spaces
    """
    if isinstance(x, str):
        return normalize('NFKC', x).strip()
    else:
        return x


# Read in the Wikipedia page and get the DataFrame
table_GDP = pd.read_html(
    'https://en.wikipedia.org/wiki/Economy_of_the_United_States',
    match='Nominal GDP')
df_GDP = table_GDP[0]

# Clean up the DataFrame and Columns
df_GDP = df_GDP.applymap(clean_normalize_whitespace)
df_GDP.columns = df_GDP.columns.to_series().apply(clean_normalize_whitespace)

# Determine numeric types for each column (col_type as defined above)

# Values to replace
clean_dict = {'%': '', '−': '-', '\(est\)': ''}

# Replace values and convert to numeric values
df_GDP = df_GDP.replace(clean_dict, regex=True).replace({'-n/a ': np.nan}).astype(col_type)
Summary
The pandas read_html() function is useful for quickly parsing HTML tables in pages – especially in Wikipedia pages. By the nature of HTML, the data is frequently not going to be as clean as you might need, and cleaning up all the stray unicode characters can be time consuming.
you might need and cleaning up all the stray unicode characters can be time consuming.
This article showed several techniques you can use to clean the data and convert it to the
proper numeric format. If you find yourself needing to scrape some Wikipedia or other HTML tables,
these tips should save you some time.
If this is helpful to you or you have other tips, feel free to let me know in the comments.