Scraping Tables From Webpages Using BeautifulSoup

Archit Narain G
2 min read · Jul 24, 2021

In this blog we will scrape a Wikipedia webpage and extract a table from it.

What Is BeautifulSoup?

BeautifulSoup is a Python library for pulling data out of HTML and XML files, and it is widely used for web scraping.

Modules required:

· pandas

· bs4

· requests

Note: They can be installed with pip (pip install <package_name>)
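For example, all three can be installed in one go (note that BeautifulSoup is published on PyPI as `beautifulsoup4`, even though it is imported as `bs4`):

```shell
pip install pandas beautifulsoup4 requests
```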

We will be using this Wikipedia page, which contains the list of countries by population. I will only be using the first 4 columns, as these columns contain the useful data.

So first we shall import the libraries and provide the url:

from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

The lines below fetch the webpage and parse its HTML:

html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, 'html.parser')
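To get a feel for what the parsed `soup` object lets us do, here is a minimal self-contained sketch on an inline HTML string (made-up markup, not the Wikipedia page):

```python
from bs4 import BeautifulSoup

# A tiny HTML document to illustrate what the parser gives us.
html_doc = "<html><body><h1>Hello</h1><p class='intro'>World</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Tags can be accessed as attributes, or searched for with find()/find_all().
print(soup.h1.text)                          # Hello
print(soup.find('p', class_='intro').text)   # World
```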

Now that the webpage has been parsed, we can find all the tables in it and also create an empty dataframe for our data:

tables = soup.find_all('table')
list_of_countries = pd.DataFrame(columns=['Country (or territory)', 'Region', 'Population', '% of world'])

The following lines iterate over the rows of the first table and pull out our four columns. Each row's data is then appended to the data frame.

# tables[0] selects the first table on the page
for row in tables[0].tbody.find_all("tr"):
    col = row.find_all('td')
    if col:  # header rows contain <th> cells instead of <td>, so they are skipped
        country = col[0].text.strip()
        region = col[1].text.strip()
        population = col[2].text.strip()
        per_of_world = col[3].text.strip()
        # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
        list_of_countries = pd.concat(
            [list_of_countries,
             pd.DataFrame([{'Country (or territory)': country,
                            'Region': region,
                            'Population': population,
                            '% of world': per_of_world}])],
            ignore_index=True)
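The same `tr`/`td` pattern can be seen end to end on a small inline table (made-up data), which also shows why the emptiness check skips the header row:

```python
from bs4 import BeautifulSoup

# A minimal table mimicking the structure we loop over above.
html = """
<table><tbody>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.table.tbody.find_all('tr'):
    cells = tr.find_all('td')
    if cells:  # the header row has <th> cells, so find_all('td') returns []
        rows.append({'Country': cells[0].text, 'Population': cells[1].text})

print(rows)  # [{'Country': 'A', 'Population': '10'}, {'Country': 'B', 'Population': '20'}]
```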

Now let’s view our results. The data is quite large, so I will be using the .head() method to view only the first 5 rows.

list_of_countries.head()

Our output:

[Image: the first five rows of the resulting dataframe]

The full code:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, 'html.parser')

tables = soup.find_all('table')
list_of_countries = pd.DataFrame(columns=['Country (or territory)', 'Region', 'Population', '% of world'])

# tables[0] selects the first table on the page
for row in tables[0].tbody.find_all("tr"):
    col = row.find_all('td')
    if col:  # header rows contain <th> cells instead of <td>, so they are skipped
        country = col[0].text.strip()
        region = col[1].text.strip()
        population = col[2].text.strip()
        per_of_world = col[3].text.strip()
        # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
        list_of_countries = pd.concat(
            [list_of_countries,
             pd.DataFrame([{'Country (or territory)': country,
                            'Region': region,
                            'Population': population,
                            '% of world': per_of_world}])],
            ignore_index=True)

list_of_countries.head()

We have now successfully extracted a table from a webpage. This process can be done for any other website as well.

Thanks For Reading
