2009-08-13 03:26:12 +00:00
|
|
|
import urllib2
|
|
|
|
import logging
|
|
|
|
import re
|
2009-09-16 02:34:04 +00:00
|
|
|
import multiprocessing
|
2009-08-13 03:26:12 +00:00
|
|
|
|
|
|
|
class PageImporter(object):
    """Fetch a feed's original page, inject a <base> tag so relative links
    resolve against the feed's site, and persist the result on the feed.

    NOTE(review): this module targets Python 2 (`urllib2`, byte/unicode
    mixing in `rewrite_page`); the fixes below stay 2.6+-compatible.
    """

    def __init__(self, url, feed):
        # URL of the page to mirror.
        self.url = url
        # Feed model object; must expose `feed_link`, `page_data`, `save()`.
        self.feed = feed
        # Guards feed.save() against concurrent writers across processes.
        self.lock = multiprocessing.Lock()

    def fetch_page(self):
        """Download `self.url`; on success, rewrite the HTML and save it.

        HTTP and network errors are logged and swallowed (best-effort
        import) — this matches the original behavior.
        """
        request = urllib2.Request(self.url)
        try:
            response = urllib2.urlopen(request)
        # `except X as e` works on Python 2.6+ and 3.x, unlike the old
        # `except X, e` comma form used previously.
        except urllib2.HTTPError as e:
            logging.error('The server couldn\'t fulfill the request. Error: %s' % e.code)
        except urllib2.URLError as e:
            logging.error('Failed to reach server. Reason: %s' % e.reason)
        else:
            try:
                data = response.read()
            finally:
                # Don't leak the socket (the original never closed it).
                response.close()
            self.save_page(self.rewrite_page(data))

    def rewrite_page(self, response):
        """Return `response` with a <base href> injected right after the
        opening <head ...> tag.

        The fallback branch handles pages whose raw bytes can't be combined
        with the unicode base tag (e.g. latin-1 encoded content).
        """
        base_code = u'<base href="%s" />' % (self.feed.feed_link,)
        # Compile once instead of building the same pattern in both branches.
        head_re = re.compile(r'<head(.*?\>)')
        try:
            html = head_re.sub(r'<head\1 ' + base_code, response)
        # Narrowed from a bare `except:` so KeyboardInterrupt/SystemExit are
        # no longer swallowed; the substitution fails with a Unicode error
        # when `response` holds non-ascii bytes.
        except Exception:
            response = response.decode('latin1').encode('utf-8')
            html = head_re.sub(r'<head\1 ' + base_code, response)
        return html

    def save_page(self, html):
        """Store `html` on the feed and persist it while holding the lock."""
        self.feed.page_data = html
        # Context manager == acquire() / try / finally release().
        with self.lock:
            self.feed.save()
|