I started writing web scrapers last week. If you don’t know, web scraper code can read web pages on the Internet and pull information from them.
I have to thank the Ontario Minister of Health for prompting me to do this. The Minister used to share COVID-19 information on twitter, but then chose recently to no longer do that. You can come to your own conclusions as to why she stopped. As for me, I was irritated by the move. Enough so that I decided to get the information and publish it myself.
Fortunately I had two things to start with. One, this great book: Automate the Boring Stuff with Python. There is a chapter in there on how to scrape web pages using Python and something called Beautiful Soup. Two, I had the minister’s own web site: https://covid-19.ontario.ca/. It had the data I wanted right there! I wrote a little program called covid.py to scrape the data from the page and put it all on one line of output which I share on twitter every day.
Emboldened by my success, I decided to write more code like this. The challenge is finding a web page where the data is clearly marked by some standard HTML. For example, the COVID data I wanted is associated with paragraph HTML tag and it has a class label of covid-data-block__title and covid-data-block__data. Easy.
My next bit of code was obit.py: this program scrapes the SaltWire web site (Cape Breton Post) for obituaries listed there, and writes it out into HTML. Hey, it’s weird, but again the web pages are easy to scrape. And it’s an easy way to read my hometown’s obits to see if any of my family or friends have died. Like the Covid data, the obit’s were associated with some html, this time it was a div statement of class sw-obit-list__item. Bingo, I had my ID to get the data.
My last bit of code was somewhat different. The web page I was scraping was on the web but instead of HTML it was a CSV file. In this case I wrote a program called icu.sh to get the latest ICU information on the province of Ontario. (I am concerned Covid is going to come roaring back and the ICUs will fill up again.) ICU.sh runs a curl command and in conjunction with the tail command gets the latest ICU data from an online CSV file. ICU.sh then calls a python program to parse that CSV data and get the ICU information I want.
I learned several lessons from writing this code. First, when it comes to scraping HTML, it’s necessary that the page is well formed and consistent. In the past I tried scraping complex web pages that were not and I failed. With the COVID data and the obituary data, those pages were that way and I succeeded. Second, not all scraping is going to be from HTML pages: sometimes there will be CSV or other files. Be prepared to deal with the format you are given. Third, once you have the data, decide how you want to publish / present it. For the COVID and ICU data, I present them in a simple manner on twitter. Just the facts, but facts I want to share. For the obit data, that is just fun and for myself. For that, I spit it into a temporary HTML file and open it in a browser to review.
If you want to see the code I wrote, you can go to my repo in Github. Feel free to fork the code and make something of your own. If you want to see some data you might want to play with, Toronto has an open data site, here. Good luck!
I’ve been meaning to learn Python, so I might need to explore this further soon.
Go for it! Find a source of information on the web, view source, and see if there are tags ( or that have some ways to easily identify them, then take my code and start breaking it down