Part 1 – How to Crawl Your Website and Extract Key Words

Welcome to my first blog series! If you haven't been following along, about two weeks ago I had a grand idea to build a web-crawler that generates a WordCloud from any website, shaped and colored by an input image. Instead of pounding out 2000+ words to describe the whole code, I thought it might be better to break it into a series. So today we will be talking about the web-crawling portion. If you want to jump to a particular section, use the headings below.

Objective

For Part 1, we will focus on the crawler that:

  1. ingests a URL
  2. retrieves the source code from that URL
  3. prints out the sentences it finds
  4. tokenizes the sentences into words/characters

Sounds easy enough, right? I'm glad you disagree; let's begin by defining a process for our code. Simple stuff first.



Coding always starts with a simple process. Next thing you know, you've spent a couple of days coding and testing, running into problems you didn't know existed, overcoming those obstacles, and sparking new ideas to make your code better. Luckily for you, I've already pounded my head on the desk for hours so you won't have to.

Automate My Life's Word Cloud Generator
=======================================
---Objective --------------------------
Create a web-crawler that ingests a url,
retrieves the words from said url
and generates a word cloud image
---------------------------------------
The Process
1. Create a simple web crawler
2. Save words to a file
3. Process a colored input image
4. Output a wordCloud
---------------------------------------


Web-Crawling 2.0

As I have written before in other blog posts, web-crawling is a BIG DEAL when it comes to SEO. If you aren't familiar with the packages below, don't worry; you will see how each one is used in context (the imports we rely on are shown right after the list):

  • urllib2 – opens web addresses (part of the Python 2 standard library)
  • BeautifulSoup (bs4) – a super-simple HTML parser
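
For reference, here are the imports this series leans on; installing the pip packages (beautifulsoup4 and nltk) is assumed, while urllib2 already ships with Python 2:

# urllib2 is in the Python 2 standard library; bs4 and nltk come from pip
# (pip install beautifulsoup4 nltk)
from urllib2 import urlopen
from bs4 import BeautifulSoup
import nltk
# word_tokenize also needs the punkt tokenizer data: nltk.download('punkt')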

Web-crawling is kind of a messy business. You can spend hours fixing all the nuances just to find out the webpage gets updated the following week. We will use urllib2 to open a web address and retrieve the website's source code, save that source code to a text file, use BeautifulSoup to search it for hyperlinks in <a href="..."> tags, extract the content within <p>paragraph</p> tags, and ultimately save the words for use later in the word cloud generation.

Training Wheels are Off

First, let's figure out how to open a webpage from Python. urllib2 is a great package for this operation, and it ships with Python 2's standard library, so there is nothing extra to install (BeautifulSoup, on the other hand, is a pip install beautifulsoup4 away). Calling urlopen(someURL).read() takes care of the work for you. All you need to do is wrap it in a function you can pass a web address to. After opening the web address, the function should save the source to a file and return it.



from urllib2 import urlopen

def openPageGetSource(url):
    """Opens a url and ingests the site's source code."""
    try:
        source = urlopen(url).read()
    except Exception as e:
        print str(e)
        return None
    # save the source code just in case you want to run offline
    saveFile = open('source.txt', 'w')
    saveFile.write(source)
    saveFile.write('\n')
    saveFile.close()
    # return the source so it can be parsed later
    return source

Now that you can open a web address, we need to parse the source code returned from the webpage. BeautifulSoup is a great HTML parser and it is refreshingly simple to use. I normally reach for Python's regex module (re), but for this tutorial I thought it would be good to try something new. bs4 lets you search the website's source code for tags and returns the contents it finds. The find_all method is what we will use to track down the hyperlinks and, eventually, the paragraph contents containing our words.
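
One step the snippets below take for granted: the raw source string has to be parsed into a BeautifulSoup object before you can search it. A minimal sketch (the 'html.parser' argument is my choice here; bs4 will also use lxml if it is installed):

# parse the raw HTML returned by openPageGetSource() into a searchable object
source = openPageGetSource("https://automatemylife.org")
soup = BeautifulSoup(source, 'html.parser')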

        for a in soup.body.find_all('a', href=True):

The search find_all('a') will return every hyperlink available in the source code. Unfortunately, that means it comes with a bunch of noise. I won't go into much detail about weeding out the noise, but you will need a way to track which pages you have already visited and which links you don't really care about. I created two lists, visitedLinks and badList, to decide which links are useful and which ones I don't want to waste time chasing. The badList list also comes in handy later for cleaning up strings that keep recurring.

link = a['href']
# first level data check: skip visited or blacklisted links,
# stay on automatemylife.org and avoid category pages
if link not in visitedLinks and link not in badList \
        and "automatemylife.org" in link and "-" in link \
        and "category" not in link:
    # second level data check

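The post doesn't show how visitedLinks and badList get set up, or what the second-level check looks like, so here is a minimal sketch of the bookkeeping; the helper name isUsefulLink and the example badList entries are my own placeholders:

# bookkeeping for the crawl: pages already fetched and links to ignore
visitedLinks = []
badList = ["#", "/feed/"]  # placeholder entries; add whatever noise you find

def isUsefulLink(link):
    """Repeats the first-level data check as a reusable helper."""
    return (link not in visitedLinks and link not in badList
            and "automatemylife.org" in link and "-" in link
            and "category" not in link)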



The Crux of Web-Crawling

Remember the function I created at the beginning of the tutorial? We're going to use that function again as we find new hyperlinks to pursue. For each new hyperlink found in the first website's source code, we open that link and grab its source code in turn. This cycle is the crux of "web-crawling". For our purposes we are only interested in one webpage for now, so we can limit our crawler to opening just the first hyperlink. Review the code below.

c += 1  # count how many links we have followed
# follow the link to get its source code
linkContent = openPageGetSource(link)
linkSC = BeautifulSoup(linkContent)
print "Fetched source from --> ", link
# add this link to the visited list so we never crawl it twice
visitedLinks.append(link)
print "\n-------------", link
for p in linkSC.find_all('p'):
    if p.string is not None:
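
The counter c is assumed to start at 0 before the link loop. The post doesn't show where the one-page limit is enforced; one straightforward way to do it (my assumption, not necessarily the author's exact code) is to break out of the link loop once the counter reaches the limit:

# stop after the first followed link; raise the limit to crawl deeper
if c >= 1:
    break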

You will find that we are now searching for the <p> tag in the new page's source code. This is where the sentences reside on the blog post: the <p> tag stands for "paragraph" and contains the content written in each post. When we print p.string, we get whatever text sits between the <p></p> tags that were found. (The check for p.string is not None matters because BeautifulSoup's .string returns None whenever a tag contains more than one child, such as nested tags.) So naturally, we can crawl Automate My Life's website and return the sentences found:


Found --> https://automatemylife.org/the-wonderful-world-of-adsense/
Fetched source from --> https://automatemylife.org/the-wonderful-world-of-adsense/
------------- https://automatemylife.org/the-wonderful-world-of-adsense/
Tokenizing....
2 weeks with AdSense, 38 new visitors, 9 Tweets, 2 Likes, 4 Followers and $4 in revenue is enough to demand a thoughtful blog post on the wonderful world of online advertisement. If you haven’t already noticed, I have now added Google AdSense to my blog website with the aspiring hopes my incessant, geek-ified ramblings will make me a moderately successful, part-time thousandaire. $4 means I’m in the money!
'ascii' codec can't encode character u'\u2019' in position 5: ordinal not in range(128)
'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
Tokenizing....
A business works off of operating costs plus profit margins and I am no different than any other business. Let’s first create a target number using a break-even analysis:

I'll bet you're wondering what "'ascii' codec can't encode character u'\u2019' in position 5: ordinal not in range(128)" means. In some cases, special characters (here a curly apostrophe) will mess with Python 2 printing a string. No worries: I just wrapped it in a try/except to print out the problem so that the code can keep trucking. I could spend more time fixing it, but I'll leave that one for you to ponder (hint: scikit-learn's TfidfVectorizer can help). For now, we can just use nltk.word_tokenize(someSentence) to split the sentence found in p.string into a list of tokens. I'll also show you a quick graph of the words that occur more than once. It's not perfect, because it doesn't account for special characters, but you get the idea.

                        
# extract and tokenize the paragraph text on the fetched page
for p in linkSC.find_all('p'):
    if p.string is not None:
        try:
            print "Tokenizing...."
            print p.string
            # split the sentence into word tokens for the word cloud later
            tokens = nltk.word_tokenize(p.string)
        except Exception as e:
            # special characters can break the ascii codec; log it and keep going
            print str(e)
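
The post doesn't show how the word graph was built, so here is a minimal sketch of one way to count repeated words; allWords is a hypothetical running list that each page's tokens would be appended to inside the loop above:

from collections import Counter

# allWords is assumed to be the running list of every token collected while
# crawling, e.g. allWords.extend(tokens) inside the paragraph loop
wordCounts = Counter(allWords)

# keep only real words that occur more than once, skipping bare punctuation
repeated = [(w, n) for w, n in wordCounts.items() if n > 1 and w.isalpha()]

for word, count in sorted(repeated, key=lambda pair: -pair[1]):
    print word, count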
[Image: Crawler word graph of the words occurring more than once (not perfect, but you get the idea)]

Where To Next?

As promised, I showed you how to open a website, read its source code, and search for both <a href> and <p> tags. We also extracted the p.string sentences found on our own website, Automate My Life. Now for the next hat trick: Natural Language Processing.

Continue with Part 2 of the tutorial –> Working with Images Using OpenCV

-j


