How to Scrape a Wikipedia Table using Python & Beautiful Soup

Sateesh Babu
5 min read · May 8, 2019

When I thought of scraping a table from a Wikipedia page, I started by exploring the page's content and wrote a small scraper (program) using Beautiful Soup to collect the data from Wikipedia.

I learned a lot from this experience and I want to share it. You can find the finished script on my GitHub.

What’s Web Scraping?

Web scraping is a data mining technique that involves extracting content from a website. It's also called web data extraction, screen scraping, or web harvesting.

If you are a data enthusiast, it's fun, exciting, and satisfies your curiosity. The important rule of thumb is: be polite to the website, and don't get blocked or blacklisted.

The steps I prefer to follow are:

1. Define your Objective

2. Study the HTML tags of the website

3. Code in Python

  • Extract the content using ‘Requests’
  • Parse HTML using ‘Beautiful Soup’
  • Save to DataFrame (Pandas)
  • Do the required clean up
  • Finally, prepare visuals.

1. Define your Objective

My objective was to extract a specific table (countries and their populations) from the Wikipedia page 'List of countries and dependencies by population' and create a pie chart of the top five most populous countries in the world.

2. Study the HTML tags of the website

We have to understand the website's structure before scraping the data. I am not an HTML (HyperText Markup Language) expert or a web developer, but during this process I picked up a few HTML basics from the web: tags, attributes, and tables.

- Right-click on the web page, then click 'Inspect'.
- An Inspect Elements panel opens (on the right side of the page).
- Click the arrow box next to 'Elements'. When you hover over the web page, the corresponding HTML tag or attribute is highlighted in the Inspect Elements panel.
[Image: Wikipedia page and its Inspect Elements panel]
Another method is to right-click on the web page and click 'View page source'. Search for the table class, which is a wrapper of the HTML table. It's an important input for HTML parsing with Beautiful Soup.
[Image: HTML content of the Wikipedia page]

HTML table row in the web page

[Image: 1st row of the Wikipedia HTML table]

And here is its corresponding HTML markup, including hyperlinks:

[Image: HTML markup of the table's 1st row]

While parsing the HTML, capture the hyperlinks of "Official population clock" in a new column, as sketched below. Note: this HTML design may be different for other web pages.
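Jumping ahead a little, here is a minimal, self-contained sketch of pulling an href out of a table cell with Beautiful Soup (the HTML string below is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup, similar to the "Official population clock" column
html = '<td><a href="https://example.org/population-clock">Official population clock</a></td>'
cell = BeautifulSoup(html, 'html.parser')

link = cell.find('a')                  # first hyperlink in the cell, if any
print(link['href'] if link else None)  # https://example.org/population-clock
```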

Now the fun part starts: the Pythonic way.

3. Code in Python

Parse HTML using ‘Beautiful Soup’:

The following two Python libraries are important for HTML extraction and parsing:

  • requests for performing your HTTP requests to fetch web content.
  • BeautifulSoup4 for wrangling HTML content as per your requirements.
Import packages — Beautiful Soup & Requests
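A minimal version of those imports (pandas is included because the parsed rows are saved to a DataFrame later):

```python
import requests                # fetch the raw HTML over HTTP
from bs4 import BeautifulSoup  # parse the HTML
import pandas as pd            # hold the scraped table as a DataFrame
```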

By default, requests will keep waiting for a response indefinitely, so it is advised to set the timeout parameter. Using requests.Session() also helps when initiating multiple URL requests. If your request is successful, the expected HTTP response status code is 200. You may check the Wikipedia reference on HTTP status codes for important error codes like 404 (Not Found), 403 (Forbidden), and 408 (Request Timeout).

The BeautifulSoup constructor parses raw HTML strings and produces an object that mirrors the HTML document's structure.

URL of Wikipedia page, request & response
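Putting the above together, a sketch of the fetch-and-parse step might look like this (the 10-second timeout is an arbitrary choice):

```python
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

session = requests.Session()             # a session helps with multiple requests
response = session.get(url, timeout=10)  # don't wait indefinitely for a reply
print(response.status_code)              # 200 means the request succeeded

soup = BeautifulSoup(response.text, 'html.parser')
```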

Page Title:

Wikipedia Page Title
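A quick sanity check is printing the parsed page's title:

```python
print(soup.title.string)
# List of countries and dependencies by population - Wikipedia
```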

Scrape the right table:

Scrape table: 'wikitable sortable'
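The class name 'wikitable sortable' spotted while inspecting the page is what Beautiful Soup matches on; one way to match both classes is a CSS selector:

```python
# Match a <table> carrying both the 'wikitable' and 'sortable' classes
table = soup.select_one('table.wikitable.sortable')
```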

Get the table column attributes:
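A sketch of collecting the header cells, assuming the first table row holds the <th> headers:

```python
header_row = table.find('tr')  # first row of the table
headers = [th.get_text(strip=True) for th in header_row.find_all('th')]
print(headers)
```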

Get table rows:
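And a sketch of walking the remaining rows into a DataFrame; rows whose cell count doesn't line up with the headers (e.g. spanning rows) are simply skipped here:

```python
rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
    if len(cells) == len(headers):   # ignore rows that don't line up
        rows.append(cells)

df = pd.DataFrame(rows, columns=headers)
print(df.head())
```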

I did some clean-up for data visualization (a sketch follows the list):

  • Renamed the column names
  • Filtered the top 5 most populous countries
  • Converted the "Population" column from string to float
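A sketch of that clean-up; the original column name below is hypothetical and depends on what the scraped header row actually contains:

```python
# Hypothetical original header name; adjust to match the scraped headers
df = df.rename(columns={'Country or dependent territory': 'Country'})

top5 = df.head(5).copy()  # the wiki table is already sorted by population

# Drop thousands separators, then convert the strings to floats
top5['Population'] = top5['Population'].str.replace(',', '', regex=False).astype(float)
```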

Data Visualization:
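A minimal matplotlib sketch of the pie chart, assuming the top5 DataFrame from the clean-up step:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.pie(top5['Population'], labels=top5['Country'],
        autopct='%1.1f%%', startangle=90)
plt.title('Top 5 Most Populous Countries')
plt.show()
```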

Of the top 5 most populous countries, China takes the biggest slice of the pie with 39.3%; China also accounts for 18.1% of the world's population. The next most populated country is India, with 37.9% of the top-5 total and 17.5% of the world's population.

Run the code in Google Colab

https://github.com/Sateesh110/Rep_Medium/tree/master/A1_WikiTables_Scraping

Click "Open in Colab".

Conclusion

As data grows in volume, variety, and velocity, web scraping is becoming a data extraction vertical of its own. It's an invaluable and efficient way of gathering data, and often beats navigating a bad API. Thanks go to the web scraping tools available as hosted solutions, packages like Beautiful Soup, lxml, and Selenium WebDriver, and web scraping frameworks like Scrapy. If you want to add additional skill sets and find exploring new data techniques fun and exciting, then please give it a try.

Happy Web Scraping!


Sateesh Babu

Sr. Data Architect | Solution Consultant. A continuous learner, always curious about Machine Learning and AI.