<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>GrenadePod &#187; web</title>
	<atom:link href="http://www.grenadepod.com/tag/web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.grenadepod.com</link>
	<description>Dispersing the Seeds</description>
	<lastBuildDate>Mon, 22 Feb 2010 20:30:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=abc</generator>
		<item>
		<title>Python web crawler</title>
		<link>http://www.grenadepod.com/2009/12/13/python-web-crawler/</link>
		<comments>http://www.grenadepod.com/2009/12/13/python-web-crawler/#comments</comments>
		<pubDate>Sun, 13 Dec 2009 20:57:51 +0000</pubDate>
		<dc:creator>pulegium</dc:creator>
				<category><![CDATA[IT Technology]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.grenadepod.com/?p=633</guid>
		<description><![CDATA[For one of my (upcoming) projects I needed to write a simple web crawler. As I always do, I searched Google first, but couldn&#8217;t find anything simple enough, or good enough for me for that matter. So I spent an hour or so writing a very simple web crawler myself. Application logic is extremely simple: Retrieve specified page [...]


]]></description>
			<content:encoded><![CDATA[<p id="top">For one of my (upcoming) projects I needed to write a simple web crawler. As I always do, I searched Google first, but couldn&#8217;t find anything simple enough, or good enough for me for that matter. So I spent an hour or so writing a very simple web crawler myself.</p>
<p>Application logic is extremely simple:</p>
<ul>
<li>Retrieve specified page</li>
<li>If it cannot be retrieved, try another URL stored in the DB</li>
<li>Store all found URLs (&lt;a href&gt;) in the DB</li>
<li>Return one URL (and remove it from the DB)</li>
</ul>
<p>I keep the URL database in memory, but for more serious crawling consider storing the data in a file on disk, which is far more memory efficient. Considering the size of the internet, even if you store only URL strings, an in-memory store will not last very long&#8230;</p>
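<p>As a minimal sketch of the file-backed variant (written for Python 3, whereas the script below is Python 2): sqlite3.connect() accepts a file path in place of ':memory:'. The helper name open_url_store is made up for illustration. It uses "create table if not exists" because a plain "create table" raises an error when the file already holds the tables from an earlier run.</p>

```python
import sqlite3


def open_url_store(path):
    """Open (or create) an on-disk SQLite database holding the crawler tables.

    `open_url_store` is a hypothetical helper, not part of the original script.
    """
    conn = sqlite3.connect(path)
    # "if not exists" lets the same file be reopened on a later run.
    conn.execute('create table if not exists url_stack (url text)')
    conn.execute('create table if not exists url_visited (url text)')
    conn.commit()
    return conn
```

<p>The WebCrawler class below already takes a database argument, so passing a file path there has the same effect for a single run. Remember to commit after inserts, otherwise nothing reaches the file.</p>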
<p>Anyway, happy crawling! (And let me know if there are any issues; so far it has worked fine for me.)</p>

<div class="wp_syntax"><table><tr><td class="code"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">#!/usr/bin/env python</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">from</span> BeautifulSoup <span style="color: #ff7700;font-weight:bold;">import</span> BeautifulSoup
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span>, <span style="color: #dc143c;">re</span>, sqlite3, <span style="color: #dc143c;">logging</span>
&nbsp;
DATABASE   = <span style="color: #483d8b;">':memory:'</span>
LOG_LEVEL  = <span style="color: #dc143c;">logging</span>.<span style="color: black;">INFO</span>
ABS_URL_RE = <span style="color: #dc143c;">re</span>.<span style="color: #008000;">compile</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'(?P&lt;url&gt;https?://.+?/)'</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> WebCrawler:
    <span style="color: #ff7700;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, logging_level=LOG_LEVEL, database=DATABASE<span style="color: black;">&#41;</span>:
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">basicConfig</span><span style="color: black;">&#40;</span>level=logging_level<span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">conn</span> = sqlite3.<span style="color: black;">connect</span><span style="color: black;">&#40;</span>database<span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'create table url_stack (url text)'</span><span style="color: black;">&#41;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'create table url_visited (url text)'</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> _form_url<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, url, link<span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">if</span> link<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span>:<span style="color: #ff4500;">4</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'http'</span>:
            ret_url = link
        <span style="color: #ff7700;font-weight:bold;">elif</span> link<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span> == <span style="color: #483d8b;">'/'</span>:
            m = ABS_URL_RE.<span style="color: black;">search</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>
            ret_url = <span style="color: #483d8b;">&quot;%s%s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>m.<span style="color: black;">group</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'url'</span><span style="color: black;">&#41;</span>, link<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>:<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
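        <span style="color: #ff7700;font-weight:bold;">else</span>:
            <span style="color: #808080; font-style: italic;"># relative link: resolve against the directory of the current page</span>
            ret_url = <span style="color: #483d8b;">&quot;%s/%s&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span>url.<span style="color: black;">rsplit</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'/'</span>, <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>, link<span style="color: black;">&#41;</span>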
        <span style="color: #ff7700;font-weight:bold;">return</span> ret_url
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> _pop_from_db<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Retrieving one URL from the DB...&quot;</span><span style="color: black;">&#41;</span>
        res = <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select url from url_stack limit 1'</span><span style="color: black;">&#41;</span>.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Query result: %s&quot;</span>, res<span style="color: black;">&#41;</span>
        url = res<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
        <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;delete from url_stack where url = ?&quot;</span>, <span style="color: black;">&#40;</span>url,<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">return</span> url
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> _push_to_db<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, url<span style="color: black;">&#41;</span>:
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Inserting record into DB...&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Check if it hasn't been visited yet&quot;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> <span style="color: #ff7700;font-weight:bold;">not</span> <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'select url from url_visited where url=?'</span>, <span style="color: black;">&#40;</span>url,<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>.<span style="color: black;">fetchone</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>:
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;URL not found in list of visited URLs, inserting&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;URL to insert: %s&quot;</span> <span style="color: #66cc66;">%</span> url<span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'insert into url_stack values (?)'</span>, <span style="color: black;">&#40;</span>url,<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
            <span style="color: #008000;">self</span>.<span style="color: black;">conn</span>.<span style="color: black;">execute</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'insert into url_visited values (?)'</span>, <span style="color: black;">&#40;</span>url,<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">else</span>:
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;URL has already been visited or added for processing, skipping&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #ff7700;font-weight:bold;">def</span> crawl<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>, url<span style="color: black;">&#41;</span>:
        work_url = url
        <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Work URL: %s&quot;</span> <span style="color: #66cc66;">%</span> work_url<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">while</span> <span style="color: #008000;">True</span>:
            <span style="color: #ff7700;font-weight:bold;">try</span>:
                <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Trying to open and parse the URL...&quot;</span><span style="color: black;">&#41;</span>
                page = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>work_url<span style="color: black;">&#41;</span>
                soup = BeautifulSoup<span style="color: black;">&#40;</span>page<span style="color: black;">&#41;</span>
                <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Parsed successfuly&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">except</span>:
                <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Failed to parse, attempting to get next URL from DB&quot;</span><span style="color: black;">&#41;</span>
                work_url = <span style="color: #008000;">self</span>._pop_from_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">continue</span>
            links = soup<span style="color: black;">&#40;</span><span style="color: #483d8b;">'a'</span><span style="color: black;">&#41;</span>
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Found total of %d links (&lt;a href=...&gt;)&quot;</span> <span style="color: #66cc66;">%</span> <span style="color: #008000;">len</span><span style="color: black;">&#40;</span>links<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">for</span> link <span style="color: #ff7700;font-weight:bold;">in</span> links:
                <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Processing link object: %s&quot;</span> <span style="color: #66cc66;">%</span> link<span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                    <span style="color: #ff7700;font-weight:bold;">if</span> link<span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span> <span style="color: #66cc66;">!</span>= <span style="color: #483d8b;">''</span>:
                        <span style="color: #008000;">self</span>._push_to_db<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>._form_url<span style="color: black;">&#40;</span>work_url, link<span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">Exception</span>:
                    <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;An exception has occurred, this may be ok (href attribute may be missing)&quot;</span><span style="color: black;">&#41;</span>
                    <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;  ... but can also indicate error in insert code&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Finished adding URLs&quot;</span><span style="color: black;">&#41;</span>
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">debug</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Getting a new URL for processing from DB&quot;</span><span style="color: black;">&#41;</span>
            work_url = <span style="color: #008000;">self</span>._pop_from_db<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
            <span style="color: #dc143c;">logging</span>.<span style="color: black;">info</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Found URL: %s&quot;</span> <span style="color: #66cc66;">%</span> work_url<span style="color: black;">&#41;</span>
            <span style="color: #ff7700;font-weight:bold;">yield</span> work_url
&nbsp;
<span style="color: #ff7700;font-weight:bold;">if</span> __name__ == <span style="color: #483d8b;">'__main__'</span>:
    wc = WebCrawler<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">for</span> url <span style="color: #ff7700;font-weight:bold;">in</span> wc.<span style="color: black;">crawl</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'http://www.google.com/'</span><span style="color: black;">&#41;</span>:
        <span style="color: #ff7700;font-weight:bold;">pass</span></pre></td></tr></table></div>

<p>As you can see, it&#8217;s very easy to use. Effectively it is an infinite generator (well, depending on what you pass as the initial URL). It&#8217;s then up to you what you do with the resulting URLs&#8230;</p>
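<p>Because the generator may never finish, it helps to cap how many URLs are consumed. The sketch below (Python 3) does this with itertools.islice. Here fake_crawl is a made-up stand-in for WebCrawler.crawl() so the example runs without network access, and first_urls is a hypothetical helper name.</p>

```python
from itertools import islice


def fake_crawl(start_url):
    """Stand-in for WebCrawler.crawl(): yields made-up URLs forever."""
    n = 0
    while True:
        n += 1
        yield '%spage%d' % (start_url, n)


def first_urls(url_generator, limit):
    """Take at most `limit` URLs from a possibly endless generator."""
    return list(islice(url_generator, limit))


if __name__ == '__main__':
    for url in first_urls(fake_crawl('http://www.example.com/'), 3):
        print(url)
```

<p>The same first_urls call should work with the real crawl() generator, since islice only relies on the iterator protocol.</p>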
<p>One thing to bear in mind when you use it: this crawler pays absolutely no attention to the domain it&#8217;s crawling. It just blindly collects links, picks one and follows it. So, depending on where a site&#8217;s links point, it may stay on one site for a while, or it may wander away quite quickly.</p>
<p>I wanted something that does not stick around on one site for long, so this suits me well. If you want something more thorough, you may want to modify the code so that it leaves the current domain only once it has visited all of that domain&#8217;s pages and there are no new ones left to analyse.</p>
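<p>One possible shape for that modification, as a rough Python 3 sketch only: when picking the next URL, prefer pending URLs on the same host as the current page, and fall back to any other URL only when none are left. The function name pick_next_url and the plain list standing in for the url_stack table are assumptions made for illustration.</p>

```python
from urllib.parse import urlparse


def pick_next_url(current_url, pending):
    """Remove and return the next URL from `pending`.

    URLs on the same host as `current_url` are preferred; returns None
    when `pending` is empty. `pending` is a plain list used here in place
    of the url_stack table.
    """
    host = urlparse(current_url).netloc
    for index, candidate in enumerate(pending):
        if urlparse(candidate).netloc == host:
            return pending.pop(index)
    # Nothing left on this host: move on to whatever comes next, if anything.
    return pending.pop(0) if pending else None
```

<p>With the SQLite tables used in the crawler, the same preference could probably be expressed as a query that first looks for stacked URLs starting with the current scheme and host, and only then falls back to the plain "limit 1" select.</p>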


]]></content:encoded>
			<wfw:commentRss>http://www.grenadepod.com/2009/12/13/python-web-crawler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
