Welcome to my first blog series! If you haven’t been following along, about two weeks ago I had a grand idea to make a web crawler that generates a word cloud in the shape of any image. Instead of pounding out 2000+ words to describe the whole code, I thought it might be better as a series. So today we will be talking about the web-crawling portion. If you want to jump to a particular section, see below.
- Part 1 – How to Crawl Your Website and Extract Key Words
- Part 2 – Working with Images using OpenCV – “Binarize”
- Part 3 – Creating the Word Cloud
For Part 1, we will focus on the crawler that:
- ingests a URL
- retrieves the source code from the given URL
- prints out the sentences
- tokenizes the sentences into words/characters
Sounds easy enough, right? I’m glad you disagree; let’s begin by defining a process for our code. Simple stuff first.
Coding always starts with a simple process. Next thing you know, you’ve spent a couple of days coding and testing, encountering problems you didn’t know existed, overcoming those obstacles, and sparking new ideas to make your code better. Luckily for you, I’ve pounded my head on the desk for hours so you won’t have to.
As I have written before in other blog posts, web-crawling is a BIG DEAL when it comes to SEO. If you aren’t familiar with the package list below, don’t worry, their use will become clear in context:
- URL Lib (urllib2) – is useful for opening web addresses (note: urllib2 is part of Python 2’s standard library; in Python 3 the same functionality lives in urllib.request)
- BeautifulSoup (bs4) – is a super-simple HTML parser
Web-crawling is kind of a messy business. You can spend hours fixing all the nuances just to find out the webpage gets updated the following week. We will be using urllib2 to open a web address, use BeautifulSoup to extract the website’s source code, save that source code in a text file, search for hyperlinks of the form <a href="hyperlink">, extract the content within <p>paragraph</p> tags, and ultimately save the words for use later in the word cloud generation.
Training Wheels are Off
First, let’s figure out how to open a webpage from Python. Urllib2 is a great package for this operation. (You don’t need to pip install it — it ships with Python 2’s standard library; in Python 3 the equivalent is urllib.request.) Calling urlopen(someURL).read() takes care of the work for you. All you need to do is wrap it in a function that accepts a web address. After opening the web address, the function should save the source to a file and return it.
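Here’s a minimal sketch of that fetch-and-save function, written against Python 3’s urllib.request (the modern home of urllib2’s urlopen). The names get_source and page_source.txt are my own choices, not from the original code:

```python
from urllib.request import urlopen  # Python 2's urllib2.urlopen lives here in Python 3

def get_source(url, save_path="page_source.txt"):
    """Open a web address, save its source code to a file, and return it."""
    # decode with errors="replace" so odd characters don't crash the read
    source = urlopen(url).read().decode("utf-8", errors="replace")
    with open(save_path, "w", encoding="utf-8") as f:
        f.write(source)
    return source
```

The errors="replace" decode also sidesteps the kind of special-character trouble we’ll run into later in the tutorial.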
Now that you can open a web address, we need to read the source code returned from the webpage. BeautifulSoup is a great HTML parser that is refreshingly simple to use. I normally reach for Python’s regex module, but for this tutorial I thought it would be good to try something new. BS4 allows you to search your website’s source code for tags and return the contents found. The find_all method is what we will be using to track down the hyperlinks and eventually the paragraph contents containing our words.
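To see what find_all returns, here’s a quick sketch parsing an inline HTML snippet rather than a live page (the snippet and variable names are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="https://example.com/post-1">Post 1</a>
  <p>Some sentence worth keeping.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a.get("href") for a in soup.find_all("a")]   # every <a> tag's href
paragraphs = [p.string for p in soup.find_all("p")]   # text inside each <p> tag
print(links)       # ['https://example.com/post-1']
print(paragraphs)  # ['Some sentence worth keeping.']
```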
The search find_all('a') will return every hyperlink available in the source code. Unfortunately, that means it comes with a bunch of noise. I won’t go into much detail about weeding out the noise, but you will need a way of tracking which sites you have already visited and which links you don’t really care about. I created two lists called visitedLinks and badList for determining which links are useful and which links I don’t want to waste time chasing. The badList list is also helpful for modifying strings that recur later.
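One way that filtering might look, using the visitedLinks and badList names from the post (the helper function and the sample badList entries are my own sketch):

```python
def is_useful(link, visitedLinks, badList):
    """Skip empty links, links we've already crawled, and links we never want."""
    if not link:
        return False
    if link in visitedLinks:
        return False
    # drop any link containing a fragment we've flagged as noise
    return not any(bad in link for bad in badList)

visitedLinks = ["https://example.com/"]
badList = ["#", "mailto:", "/tag/"]

print(is_useful("https://example.com/post-1", visitedLinks, badList))  # True
print(is_useful("mailto:me@example.com", visitedLinks, badList))       # False
print(is_useful("https://example.com/", visitedLinks, badList))        # False
```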
The Crux of Web-Crawling
Remember the function I created at the beginning of the tutorial? We’re going to use that function again as we find new hyperlinks to pursue. For each new hyperlink found in the first website’s source code, we can iterate through by opening each new hyperlink and grabbing its source code. This cycle is the crux of “web-crawling”. For our purposes, we are only interested in one webpage for now so we can limit our crawler to only open the first hyperlink. Review the code below.
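The cycle above can be sketched in miniature: open the starting page, find a hyperlink in its source, then open that page too. This is my own simplified version (function name included), limited to the first usable hyperlink as described; a real crawler would also apply the visited/bad-list checks from earlier:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawl_first_link(start_url):
    """Open the start page, follow its first hyperlink, and return that page's source."""
    source = urlopen(start_url).read().decode("utf-8", errors="replace")
    soup = BeautifulSoup(source, "html.parser")
    for a in soup.find_all("a"):
        href = a.get("href")
        if href and not href.startswith("#"):  # skip in-page anchors
            # the crux of web-crawling: feed each new link back into the opener
            return urlopen(href).read().decode("utf-8", errors="replace")
    return None
```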
You will find that we are searching for the <p> tag in the new page’s source code. This is where the sentences reside on the blog post. The <p> tag stands for “paragraph” and contains all the content that is written on a blog post. When we access p.string, it returns whatever sits between the <p></p> tags that were found. So naturally, we can crawl Automate My Life’s website and return the sentences found:
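One quirk worth knowing before you lean on p.string: it returns None whenever a <p> tag has nested children (a <b> or <a> inside it, say), while get_text() handles both cases. A small illustration with made-up HTML:

```python
from bs4 import BeautifulSoup

html = "<p>Plain sentence.</p><p>With a <b>nested</b> tag.</p>"
soup = BeautifulSoup(html, "html.parser")

for p in soup.find_all("p"):
    print(repr(p.string))  # None for the second <p>, which has child tags
    print(p.get_text())    # returns the full text for both
```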
I’ll bet you’re wondering what the “'ascii' codec can't encode character u'\u2019' in position 5: ordinal not in range(128)” error means. It shows up when Python 2 tries to push a Unicode character (here u'\u2019', a right single quote) through the default ASCII codec. No worries, I just wrapped it in a try/except to print out the problem so that the code can keep trucking. I would spend more time fixing it, but I’ll leave that one for you to ponder. (Hint: scikit-learn’s TfidfVectorizer can help.) For now, we can just use nltk.word_tokenize(someSentence) to split the sentence found in p.string into a list. I’ll just show you a quick graph of the words that occur more than once. It’s not perfect, because it doesn’t account for special characters, but you get the idea.
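If you’d rather not download NLTK’s punkt model just to follow along, a crude regex tokenizer plus collections.Counter reproduces the idea behind that graph. The sample sentences and the regex stand-in are my own, not nltk.word_tokenize itself:

```python
import re
from collections import Counter

sentences = [
    "Automate the boring stuff, then automate the rest.",
    "Web crawling is the crux of this tutorial.",
]

words = []
for sentence in sentences:
    # crude stand-in for nltk.word_tokenize: lowercase runs of letters/apostrophes
    words.extend(re.findall(r"[a-z']+", sentence.lower()))

counts = Counter(words)
repeats = {w: c for w, c in counts.items() if c > 1}
print(repeats)  # {'automate': 2, 'the': 3}
```

Those repeated words are exactly what feeds the word cloud in Part 3: the more often a word occurs, the bigger it’s drawn.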
Where To Next?
As promised, I showed you how to open a website, read its source code, and search for both <a href> and <p> tags. We also extracted the p.string contents (the sentences) found on our own website, Automate My Life. Now, for the next hat trick — Natural Language Processing.
Continue with Part 2 of the tutorial –> Working with Images Using OpenCV