After using PodGrab for two years it became more and more buggy. It relies on a lot of native Python 2 libraries, so its code base is complex, hard to maintain and annoying to work with. That is mostly the reason I wrote a little tool to fetch all my favorite podcasts. The requirements:
- use feedparser for easy implementation
- sort feeds into subdirectories
- check for already existing episodes
- manage feeds easily
First we define Python 3 as the running environment and import some libraries. For this project we need feedparser, which will handle the RSS parsing part. You might be tempted to install it from PyPI, but don't: if you run this project with version 5.1.3 of feedparser as distributed on PyPI, you will get an error. There is a patched version around Google Code, but you can easily use this link. Unzip it and run "python3 setup.py install".
#!/usr/bin/env python3
import os, time, datetime
import urllib.request, urllib.error
import feedparser
from smtplib import SMTP_SSL as SMTP
from email.mime.text import MIMEText
In step two we define two variables. With 'FEEDLISTFILE' we meet requirement three: it names a simple text file containing all the feed links you want to parse, one URL per line. It should live in the same directory as this script. The second variable, 'FEEDDOWNLOADDIR', is the directory all files will be downloaded to.
FEEDLISTFILE = 'feeds.txt'
FEEDDOWNLOADDIR = './podcasts'
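A feeds.txt might look like this (the URLs are placeholders; the '#' line shows a temporarily excluded feed, as explained below):

```text
http://example.com/podcast-one/feed.rss
http://example.com/podcast-two/feed.rss
# http://example.com/paused-podcast/feed.rss
```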
Next we start with our class 'pypodcatcher' and a method 'getFeedsFromList', which takes the list of our feeds. Within this method we define two lists that collect information about new episodes and about errors.
Then we open the file with the feed URLs and read it into a list (lines 17 & 18). For every URL in the list we run the method 'loadFeeds', which handles parsing and downloading. In line 20 you see a restriction: only feeds that do not begin with '#' are parsed. This is useful when you want to exclude some feeds temporarily. After all feeds are parsed we can email the result to ourselves (line 23).
class pypodcatcher:

    def getFeedsFromList(self, feedlistfile):
        self.newepisodes = []
        self.errorlist = []
        fin = open(feedlistfile, 'r')
        feeds = fin.readlines()
        fin.close()
        for feed in feeds:
            if not feed.startswith('#'):
                self.loadFeeds(feed.strip())
        self.sendMail()
Now we go on with the 'loadFeeds' method. It takes the feed URL you want to parse. Parsing is done in line 24: you just call 'parse' from the feedparser library and the result is stored in 'feed'.
Above we defined a directory to store all downloaded files, but putting every file into the same directory would be messy and confusing. For this reason we create a subdirectory for every feed. To name them automatically we use the title of the feed, remove some special characters which are not allowed in a filename, and use os.path.join to build a full path including our 'FEEDDOWNLOADDIR' (line 25).
Like any file manager, Python will throw an exception if you try to create a directory which already exists. So we check whether our 'feeddldir' is already there (line 26) and create it if not (line 31).
Next we go through the list of episodes (line 28). Each file gets a unique name formatted like '<date_published>_<episode_title>.mp3' (lines 30 & 31). Some operating systems or filesystems limit the length of a file's full path. For this reason we cut every filename longer than 60 characters (lines 33-36).
Now we build the full path for our download, including download directory and filename (line 37), and check whether this file already exists. If it does not exist yet, we download it (lines 38-41) and add the title to the list of new episodes. When a download fails, we add the HTTP error code and the link to the error list.
Notice: the script checks whether a file already exists, but not whether it was downloaded completely. From time to time you should check your downloads for improper file sizes.
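As a rough guard against truncated downloads, you could compare the local file size with the 'length' attribute most feeds put on their enclosure tag. A minimal sketch (the helper name 'looks_complete' is my own, not part of the script):

```python
import os

def looks_complete(path, expected_length):
    # 'expected_length' would come from the enclosure's 'length' attribute;
    # feeds are not required to fill it in correctly, so treat it as a hint.
    return os.path.isfile(path) and os.path.getsize(path) == int(expected_length)
```

An episode failing this check would be a candidate for re-downloading.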
    def loadFeeds(self, feedurl):
        feed = feedparser.parse(feedurl)
        feeddldir = os.path.join(FEEDDOWNLOADDIR, feed['feed']['title'].replace('| ', '').replace('/', '-'))
        if not os.path.isdir(feeddldir):
            os.makedirs(feeddldir)
        for f in feed['entries']:
            rssfilename = ''
            timeentry = datetime.datetime.strptime(time.strftime("%a, %d %b %Y %H:%M:%S %z", f['published_parsed']), "%a, %d %b %Y %H:%M:%S %z")
            rssfilename = timeentry.strftime("%Y%m%d%H%M%S") + '_' + f['title'].replace('/', '-') + ".mp3"
            if len(rssfilename) > 60:
                rssfilename = rssfilename[0:60]
                if not rssfilename.endswith('.mp3'):
                    rssfilename = rssfilename + '.mp3'
            rssdlpath = os.path.join(feeddldir, rssfilename)
            if not os.path.isfile(rssdlpath):
                try:
                    # download the first enclosure of the entry
                    urllib.request.urlretrieve(f['enclosures'][0]['href'], rssdlpath)
                    self.newepisodes.append([feed['feed']['title'], f['title']])
                except urllib.error.HTTPError as e:
                    self.errorlist.append([e.code, e.reason, f['link']])
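The strptime/strftime round-trip in the middle of the loop can be confusing; time.strftime actually accepts the struct_time that feedparser provides directly, so the filename logic boils down to this (the sample struct_time stands in for f['published_parsed']):

```python
import time

# stand-in for f['published_parsed']: an episode published 2014-03-01 12:30:00
published_parsed = time.struct_time((2014, 3, 1, 12, 30, 0, 5, 60, -1))

stamp = time.strftime("%Y%m%d%H%M%S", published_parsed)
title = "My Episode/Title"
rssfilename = stamp + '_' + title.replace('/', '-') + ".mp3"
print(rssfilename)  # 20140301123000_My Episode-Title.mp3
```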
When the script is done parsing, it can send us an email. If you run the script as a cron job, it is very useful to get this information. I am not going to explain this part in depth.
    def sendMail(self):
        errorstring = '''\n\terrors:\n'''
        titlesstring = ''
        for er in self.errorlist:
            ecode, ereason, elink = er
            errorstring = errorstring + '''\t%s | %s | %s \n''' % (ecode, ereason, elink)
        for t in self.newepisodes:
            feedname, episodename = t
            titlesstring = titlesstring + '\tnew: %s, %s\n' % (feedname, episodename)
        text = '''
pypodcatcher is done and
%s episodes are new
%s
%s
''' % (str(len(self.newepisodes)), titlesstring, errorstring)
        msg = MIMEText(text, 'plain')
        msg['From'] = "pypodcatcher"
        msg['Subject'] = "podcatcher result"
        me = '<<from_email_address>>'
        msg['To'] = '<<to_email_address>>'
        conn = SMTP('mail.server.com')
        conn.sendmail(me, msg['To'], msg.as_string())
        conn.quit()
Finally we have a very simple script to parse RSS feeds and download podcasts. To run it, just add two more lines at the end:
if __name__ == '__main__':
    pypodcatcher().getFeedsFromList(FEEDLISTFILE)