Python web crawler

For one of my (upcoming) projects I needed to write a simple web crawler. As I always do, I searched Google first, but couldn’t find anything simple enough, or good enough for me for that matter. So I spent an hour or so writing the most simplistic web crawler myself.

Application logic is extremely simple:

  • Retrieve the specified page
  • If it cannot be retrieved, try another URL from those stored in the DB
  • Store all URLs found on the page (<a href>) in the DB
  • Return one stored URL (and remove it from the DB)
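
The link-collecting step above can be sketched with the standard library alone. This is only an illustration: the listing further down uses BeautifulSoup for parsing, and the names `LinkCollector` and `extract_links` are hypothetical helpers, not part of the crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:  # skip <a> tags with a missing or empty href
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    """Return absolute URLs for all <a href> links in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Feeding it a page fetched from `http://example.com/a/b.html` would turn a link such as `c.html` into `http://example.com/a/c.html` before it is stored.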

I keep all the URLs in an in-memory database, but for more serious crawling consider storing the data in a file on disk, which is far more memory efficient. Considering the size of the internets, even if you store only URL strings, memory will not last for very long…
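
As a rough sketch of the on-disk option: `sqlite3.connect` accepts a file path in place of `':memory:'`, so the same two tables can live in a file. The function name `open_url_store` and the `primary key` on `url_visited` are my assumptions here, not something the listing below does.

```python
import sqlite3


def open_url_store(path):
    """Open (or create) a file-backed URL store with the crawler's two tables."""
    conn = sqlite3.connect(path)
    # "if not exists" lets the same file be reopened on a later run.
    conn.execute('create table if not exists url_stack (url text)')
    # Assumption: a primary key keeps visited-URL lookups fast as the table grows.
    conn.execute('create table if not exists url_visited (url text primary key)')
    return conn
```

With a file-backed database you need to call `conn.commit()` after inserts, otherwise the queued URLs are lost when the process exits.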

Anyway, happy crawling! (And let me know if there are any issues; so far it has worked fine for me.)

#!/usr/bin/env python3

# Requires the beautifulsoup4 package (imported as bs4).
import logging
import sqlite3
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

DATABASE  = ':memory:'
LOG_LEVEL = logging.INFO

class WebCrawler:
    def __init__(self, logging_level=LOG_LEVEL, database=DATABASE):
        logging.basicConfig(level=logging_level)
        self.conn = sqlite3.connect(database)
        # url_stack: URLs waiting to be crawled.
        # url_visited: every URL that has ever been queued.
        self.conn.execute('create table if not exists url_stack (url text)')
        self.conn.execute('create table if not exists url_visited (url text)')

    def _form_url(self, url, link):
        # urljoin handles absolute, root-relative and page-relative links.
        return urljoin(url, link)

    def _pop_from_db(self):
        logging.debug("Retrieving one URL from the DB...")
        res = self.conn.execute('select url from url_stack limit 1').fetchone()
        logging.debug("Query result: %s", res)
        if res is None:
            # Nothing left to crawl.
            return None
        url = res[0]
        self.conn.execute("delete from url_stack where url = ?", (url,))
        return url

    def _push_to_db(self, url):
        logging.debug("Inserting record into DB...")
        logging.debug("Check if it hasn't been visited yet")
        if not self.conn.execute('select url from url_visited where url=?', (url,)).fetchone():
            logging.debug("URL not found in list of visited URLs, inserting")
            logging.debug("URL to insert: %s", url)
            self.conn.execute('insert into url_stack values (?)', (url,))
            self.conn.execute('insert into url_visited values (?)', (url,))
        else:
            logging.debug("URL has already been visited or added for processing, skipping")

    def crawl(self, url):
        work_url = url
        logging.debug("Work URL: %s", work_url)
        # Stop once the DB has no more URLs to hand out.
        while work_url is not None:
            try:
                logging.debug("Trying to open and parse the URL...")
                page = urlopen(work_url)
                soup = BeautifulSoup(page, 'html.parser')
                logging.debug("Parsed successfully")
            except Exception:
                logging.debug("Failed to parse, attempting to get next URL from DB")
                work_url = self._pop_from_db()
                continue
            links = soup('a')
            logging.debug("Found total of %d links (<a href=...>)", len(links))
            for link in links:
                logging.debug("Processing link object: %s", link)
                # <a> tags without an href (or with an empty one) are skipped.
                href = link.get('href')
                if href:
                    self._push_to_db(self._form_url(work_url, href))
            logging.debug("Finished adding URLs")
            logging.debug("Getting a new URL for processing from DB")
            work_url = self._pop_from_db()
            if work_url is not None:
                logging.info("Found URL: %s", work_url)
                yield work_url

if __name__ == '__main__':
    wc = WebCrawler()
    for url in wc.crawl('http://www.google.com/'):
        pass

As you can see, it’s very easy to use. Effectively it is an infinite generator (well, depending on what you pass as the initial URL). It’s then up to you what you do with the resulting URLs…
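
Since the generator may never finish on its own, one way to keep a run bounded is `itertools.islice`. The sketch below uses a made-up `fake_crawl` generator as a stand-in for the real crawler so it runs without network access; `first_n_urls` is likewise a hypothetical helper name.

```python
from itertools import islice


def first_n_urls(url_generator, n):
    """Return at most n URLs from a (possibly infinite) URL generator."""
    return list(islice(url_generator, n))


def fake_crawl():
    """Stand-in for WebCrawler().crawl(...): yields made-up URLs forever."""
    i = 0
    while True:
        yield 'http://example.com/page%d' % i
        i += 1
```

With the real crawler the call would look like `first_n_urls(WebCrawler().crawl('http://www.google.com/'), 10)`, which stops after ten URLs instead of crawling indefinitely.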

One thing to bear in mind when you use it: this crawler pays absolutely no attention to the domain it’s crawling. It just blindly collects links, selects one and follows it. So depending on where a site’s links point, it may stay on one site for a while, or it may wander away quite quickly.

I wanted something that does not stick around for long on one site, so this suits me well. If you want something more pedantic, you may want to modify the code so that it leaves the current domain only when it has visited all pages and there are no new pages within the domain to analyse.
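
One possible way to make that modification is to prefer queued URLs on the current host when popping. This is only a sketch under my own assumptions: it adds a `host` column to `url_stack`, which the listing above does not have, and the names `make_stack`, `push` and `pop_preferring_host` are hypothetical.

```python
import sqlite3
from urllib.parse import urlparse


def make_stack():
    """Create an in-memory stack table with an extra host column (assumed schema)."""
    conn = sqlite3.connect(':memory:')
    conn.execute('create table url_stack (url text, host text)')
    return conn


def push(conn, url):
    """Queue a URL together with its host name."""
    conn.execute('insert into url_stack values (?, ?)', (url, urlparse(url).netloc))


def pop_preferring_host(conn, current_url):
    """Pop a URL on the same host as current_url if one is queued, else any URL."""
    host = urlparse(current_url).netloc
    row = conn.execute(
        'select rowid, url from url_stack where host = ? order by rowid limit 1',
        (host,)).fetchone()
    if row is None:
        # Nothing left on this host: fall back to the oldest queued URL.
        row = conn.execute(
            'select rowid, url from url_stack order by rowid limit 1').fetchone()
    if row is None:
        return None  # the stack is empty
    conn.execute('delete from url_stack where rowid = ?', (row[0],))
    return row[1]
```

The crawler would then leave a site only once no queued URL shares the current host, at the cost of one extra query per pop.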
