Application logic is extremely simple:
- Retrieve specified page
- If cannot retrieve, try another from stored in the DB
- Store all found URLs (<a href>) in the DB
- Return one (and remove from DB)
I store all pages in memory, but for more serious crawling, consider storing data on a physical file, that way it’ll be more memory efficient. Considering the size of the internets, even if you store only URL strings you will not last for very long…
Anyway, happy crawling! (and let me know if there are any issues, so far it worked fine for me)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 | #!/usr/bin/env python from BeautifulSoup import BeautifulSoup import urllib2, random, re, sqlite3, logging DATABASE = ':memory:' LOG_LEVEL = logging.INFO ABS_URL_RE = re.compile('(?P<url>https?://.+?/)') class WebCrawler: def __init__(self, logging_level=LOG_LEVEL, database=DATABASE): logging.basicConfig(level=logging_level) self.conn = sqlite3.connect(database) self.conn.execute('create table url_stack (url text)') self.conn.execute('create table url_visited (url text)') def _form_url(self, url, link): if link[0:4] == 'http': ret_url = link elif link[0] == '/': m = ABS_URL_RE.search(url) ret_url = "%s%s" % (m.group('url'), link[1:]) else: ret_url = "%s%s" % (url, link) return ret_url def _pop_from_db(self): logging.debug("Retrieving one URL from the DB...") res = self.conn.execute('select url from url_stack limit 1').fetchone() logging.debug("Query result: %s", res) url = res[0] self.conn.execute("delete from url_stack where url = ?", (url,)) return url def _push_to_db(self, url): logging.debug("Inserting record into DB...") logging.debug("Check if it hasn't been visited yet") if not self.conn.execute('select url from url_visited where url=?', (url,)).fetchone(): logging.debug("URL not found in list of visited URLs, inserting") logging.debug("URL to insert: %s" % url) self.conn.execute('insert into url_stack values (?)', (url,)) self.conn.execute('insert into url_visited values (?)', (url,)) else: logging.debug("URL has already been visited or added for processing, skipping") def crawl(self, url): work_url = url logging.debug("Work URL: %s" % work_url) while True: try: logging.debug("Trying to open and parse the URL...") page = urllib2.urlopen(work_url) soup = BeautifulSoup(page) logging.debug("Parsed successfuly") except: logging.debug("Failed to parse, attempting to get next URL from DB") work_url = self._pop_from_db() continue links = soup('a') logging.debug("Found total of %d links (<a href=...>)" % len(links)) for link in soup('a'): logging.debug("Processing link object: %s" % link) try: if link['href'] != '': self._push_to_db(self._form_url(work_url, link['href'])) except: logging.debug("An exception has occured, this may be ok (href attribute may be missing)") logging.debug(" ... but can also indicate error in insert code") logging.debug("Finished adding URLs") logging.debug("Getting a new URL for processing from DB") work_url = self._pop_from_db() logging.info("Found URL: %s" % work_url) yield work_url if __name__ == '__main__': wc = WebCrawler() for url in wc.crawl('http://www.google.com/'): pass |
As you can see it’s very easy to use. Effectively it is an infinite generator (well, depending what you pass as initial URL). It’s then up to you what you’re going to do with the resulting URL…
One thing to bear in mind when you use it: this crawler pays absolutely no attention to the domain it’s searching. It just blindly collects links, selects one and follows it. So depending on the site link relative location it may stay for a while on a site, or may just wonder away quite quickly.
I wanted to have something that does not stick around for long on one site, so this suits me well, if you want to have something more pedantic, you may want to modify the code, so that it leaves current domain only when it has visited all pages and there are no new pages within the domain to analyse.
Related posts: