My First Time With Web Scraping

July 11, 2023

“Look Ma! I’m a big boy now!”

That’s right, my friend.

I finally finished my Spotify Python Project (after 2 weeks), so I’m moving on to bigger and better things. Here’s my issue with the Spotify Project: it’s not something that’ll benefit me daily. There’s no real-life value in it.

So I’m starting a new project.

Nowadays, communication matters almost as much as your technical skills, if not more. So it’s a no-brainer to consistently learn and improve your communication skills, just as you would your math or programming skills.

But I’m a busy guy and don’t have time to surf the web every day.

That's why I’m automating the learning process by using Python to do the following:

  • Surf the web for content
  • Create a database
  • Write a script that automatically sends me an email every morning

Since I’m no "web scraping genius" yet, I’m just playing around with it for now. Here’s the first script I wrote yesterday…

The first web scraping script I've written
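It looks roughly like this (a sketch rather than the exact file, with placeholder URLs standing in for my real list):

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URLs; my real list is longer
    urls = [
        "https://example.com",
        "https://example.org",
    ]

    new_list = []  # URLs that actually handed over their content

    for url in urls:
        response = requests.get(url)
        if response.status_code == 200:
            new_list.append(url)               # 1. add the URL to new_list
            text = response.text               # 2. turn the content into a string
            soup = BeautifulSoup(text, "xml")  # 3. parse it as XML (needs lxml installed)
            print(soup.find("h1"))             # 4. first H1 header in the page

    print(len(urls), len(new_list))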

I’ve used the requests library before, but BeautifulSoup is new to me. If you don’t know how web scraping works with these libraries, here’s the TLDR-esque outline for you:

  • Requests: sends an HTTP request to the URL’s server, asking for the content on the site
  • BeautifulSoup: takes the content you’ve fetched and parses it, so you can pull out the details you want
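To see BeautifulSoup’s half of the job in isolation, here’s a toy example with hardcoded HTML (no network involved):

    from bs4 import BeautifulSoup

    html = "<html><body><h1>Big News</h1><p>Details here.</p></body></html>"
    soup = BeautifulSoup(html, "html.parser")

    print(soup.find("h1").text)  # Big News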

But here’s the catch…

Just because you request content doesn’t mean you’ll get it. A 200 status code means the server fulfilled your request. But some of the URLs on my list return a 403 (Forbidden) error, which means the server understood my request but refused to fulfill it.
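You can see the difference for yourself with a test service like httpbin.org, which just echoes back whatever status code you ask for:

    import requests

    # httpbin.org/status/<code> responds with that status code
    ok = requests.get("https://httpbin.org/status/200")
    denied = requests.get("https://httpbin.org/status/403")

    print(ok.status_code, denied.status_code)  # 200 403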

That’s why I’m creating the new_list variable.

It’s not essential to the scraping process, but it helps me understand which URLs do and don’t give me the content I want.

Moving on to the for loop: as I iterate through the list of URLs, whenever a request is fulfilled with a 200 status code, I do a few things:

  1. Add the URL to new_list
  2. Turn the content I receive into a string with response.text
  3. Use BeautifulSoup to parse the text as XML instead of HTML
  4. Find the first instance of an H1 header in the page

In the end, I print the length of each list to see how many URLs don’t give me access to their content.

As you can see, this code is nothing crazy.

But this came out of about an hour of study and research. So give me a few more days, and I’m gonna shock you.