Python Pandas web scraping

This is a short post about scraping tables from the web using the Pandas `pd.read_html` function.

If you want to do a more thorough job of web scraping, use Beautiful Soup. But for simple purposes, `pd.read_html` is a great little function for reading HTML tables from a URL.

I was using Jupyter Lab for some analyses on refereeing performance in the SPFL for season 2018-19. I wanted to show the final league table.

This simple piece of code helped.

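import pandas as pd

# read_html returns a list with one DataFrame per HTML table found on the page
tables = pd.read_html('https://www.soccerstats.com/latest.asp?league=scotland_2019')
table = tables[12]  # the league table was at index 12 when I ran this
print(table.head(3))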

`pd.read_html` returns a list of DataFrames, one for each table it finds at the URL. It took me a minute or so of trying different indexes to find the right one, which turned out to be index 12, and I stored it in the variable `table`.
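
Hunting for the right index by hand gets tedious on pages with many tables. Here is a minimal sketch of one alternative, assuming you already know a couple of column names in the table you want. The helper `find_tables` and the column names `GP` and `Pts` are my own illustrative choices, not something from the site, and the small list of DataFrames below is made-up data standing in for the result of `pd.read_html(url)`.

```python
import pandas as pd


def find_tables(tables, required_columns):
    """Return the indices of the DataFrames whose columns include all required names."""
    required = set(required_columns)
    matches = []
    for i, df in enumerate(tables):
        # Column labels can be non-strings (e.g. integers), so compare as strings.
        columns = {str(c) for c in df.columns}
        if required <= columns:
            matches.append(i)
    return matches


# Stand-in for the list returned by pd.read_html(url); the data here is made up.
tables = [
    pd.DataFrame({0: ["menu"], 1: ["links"]}),
    pd.DataFrame({"Unnamed: 0": [1], "Unnamed: 1": ["Team A"], "GP": [38], "Pts": [80]}),
]

candidates = find_tables(tables, ["GP", "Pts"])
print(candidates)
```

`pd.read_html` also accepts a `match` argument (a string or regular expression), which restricts the result to tables containing matching text. That can shorten the list before you start looking.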

The output was:

Screenshot of output

It’s a bit messy, with some unnamed columns and some data I’m not interested in, but we can clean this up.

# Drop the columns I don't need, then give the two unnamed columns proper names
table = table.drop(['form', 'last 8', 'CS', 'FTS'], axis=1)
table = table.rename({'Unnamed: 0': 'Position', 'Unnamed: 1': 'Team'}, axis=1)
print(table)

I’ve removed the columns I don’t need and renamed the two columns that Pandas had labelled ‘Unnamed’ because the source table gave them no header.
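
If the site changes its layout, a hard-coded list of columns to drop can raise a `KeyError`. As a rough sketch of a more forgiving version, the cleaning step can be wrapped in a function that ignores missing columns and leaves the original DataFrame untouched. The function name `clean_league_table` is hypothetical, and the `raw` DataFrame below is made-up data that only mimics the column names from the scraped table.

```python
import pandas as pd

# Made-up rows that mimic the column layout of the scraped table.
raw = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "Unnamed: 1": ["Team A", "Team B"],
    "GP": [38, 38],
    "form": ["W", "L"],
    "last 8": [2.0, 1.5],
    "CS": ["50%", "40%"],
    "FTS": ["10%", "20%"],
})


def clean_league_table(df):
    """Return a cleaned copy: unwanted columns dropped, unnamed columns renamed."""
    unwanted = ["form", "last 8", "CS", "FTS"]
    new_names = {"Unnamed: 0": "Position", "Unnamed: 1": "Team"}
    # errors="ignore" skips any listed column that is not present,
    # and rename silently ignores keys that do not match a column.
    return df.drop(columns=unwanted, errors="ignore").rename(columns=new_names)


cleaned = clean_league_table(raw)
print(cleaned)
```

Because both `drop` and `rename` return new DataFrames here, `raw` keeps its original columns, which makes it easy to rerun the cleaning step in a notebook.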

The clean output is now:

Screenshot of cleaned DataFrame

The code is self-explanatory and simple. Don’t forget to add a comment with any ideas you may have!
