Yeah. You're right, up until now everything was description. Let's make the class a little more interesting.
Have a look at the source of our problem:
This is where your data sits, in mbox format. We don't like to search in there, that's why we're here, right?
Let's get the mails first:
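    require 'open-uri'

    module MBoxParser

      # Downloads the mbox file at the given URL and returns the raw mails
      # as an array of strings.
      def self.parse(url)
        puts 'PARSING: ' + url
        # URI.open is required on Ruby 3.0+, where plain open() no longer accepts URLs.
        mbox = URI.open(url).read

        # Split at every "From ..." separator line and drop the empty first chunk.
        return mbox.split(/\n*^From\s.*\d{4}$/)[1..-1]
      end
    end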
Well, that was easy, right? For the less regex-savvy among us:
split(/\n*^From\s.*\d{4}$/) just says: "split wherever there are optional
newlines (\n*), followed by 'From' at the start of a line (^From), a
whitespace character (\s) and some other stuff (.*), up to the 4 digits
that end the line (\d{4}$)."
Basically, it will match lines like this one:
From name.surename at example.com Thu Jan 27 03:30:03 2005
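Since the very first line of the file matches too, split also splits there and returns an empty string as the first array element (the text in front of that first separator). That's why we only take the array values from the second entry onwards ([1..-1]).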
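To see the split in action without downloading anything, here is a small sketch. The two mails in sample_mbox are made-up sample data, not taken from a real archive; only the regex is the one from above.

```ruby
# Made-up sample data: two tiny mails in mbox format.
sample_mbox = <<~MBOX
  From alice at example.com Thu Jan 27 03:30:03 2005
  From: alice at example.com (Alice)
  Subject: First mail

  Hello!

  From bob at example.com Fri Jan 28 10:15:42 2005
  From: bob at example.com (Bob)
  Subject: Second mail

  Hi there!
MBOX

# Same regex as in MBoxParser.parse. The "From: ..." header lines are not
# separators, because 'From' is followed by a colon there, not by whitespace.
parts = sample_mbox.split(/\n*^From\s.*\d{4}$/)

# parts[0] is the empty text in front of the first separator line,
# so the actual mails start at index 1.
mails = parts[1..-1]

puts "chunks after split: #{parts.length}"
puts "mails:              #{mails.length}"
puts mails[0]
```

Note that each mail still starts with the newline that followed its separator line; that is harmless for what we do with it next.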
Wasn't that nice? I like Regex. ^_^
Now we have a nice utility function to download and parse text files in mbox format.
Next, we want to pull some information out of each email; so far we only have the raw text.
    require 'date'

    module ModEMailParser
      def parse_mail(raw)
        # Header and body are separated by the first blank line.
        header, body = raw.split(/\n\n/, 2)

        # String#[] with a regex and a group index returns that capture (or nil).
        from    = header[/^From:\s*(.*)$/, 1]
        id      = header[/^Message-ID:\s*<(.*)>\s*$/, 1]
        date    = header[/^Date:\s*(.*)\s*$/, 1]
        subject = header[/^Subject:\s*(.*)\s*$/, 1]

        time = DateTime.parse(date)

        return from, id, time, subject, header, body
      end
    end
Yeeeehaw, more regex fun!
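If you want to watch those header regexes work on their own, here is a minimal sketch. The mail in raw is invented sample data, and the code uses String#[] with a capture-group index, which yields the same value as matching first and reading $1 afterwards.

```ruby
require 'date'

# Invented sample mail: headers, a blank line, then the body.
raw = <<~MAIL
  From: alice at example.com (Alice)
  Date: Thu, 27 Jan 2005 03:30:03 +0100
  Subject: First mail
  Message-ID: <42.abc@example.com>

  Hello!

  Second paragraph.
MAIL

# Limit 2: only the first blank line splits, so blank lines
# inside the body stay where they are.
header, body = raw.split(/\n\n/, 2)

subject = header[/^Subject:\s*(.*)\s*$/, 1]
id      = header[/^Message-ID:\s*<(.*)>\s*$/, 1]
time    = DateTime.parse(header[/^Date:\s*(.*)\s*$/, 1])

puts subject
puts id
puts time.year
```

One thing to keep in mind: if a header is missing, the lookup returns nil, and DateTime.parse raises an error when handed nil. For a tidy mailing list archive that should not happen, but it's worth knowing.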
If you need more information on the email, why not just extend this function?
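As a rough idea of what such an extension could look like, the sketch below grabs two more header fields in the same style. The module name ModExtraHeaders, the method parse_extra_headers and the sample header are all hypothetical and only here for illustration; in practice you would probably just add the two lines to parse_mail and return the extra values.

```ruby
# Hypothetical helper, not part of the original code: extracts two more
# fields from the header text, using the same regex style as parse_mail.
module ModExtraHeaders
  def parse_extra_headers(header)
    # ^ anchors at the start of a line, so 'To:' does not match
    # inside 'In-Reply-To:' or 'Reply-To:'.
    to          = header[/^To:\s*(.*)$/, 1]
    in_reply_to = header[/^In-Reply-To:\s*<(.*)>\s*$/, 1]

    return to, in_reply_to
  end
end

# Invented sample header for trying it out.
header = <<~HEADER
  From: bob at example.com (Bob)
  To: list at example.com
  In-Reply-To: <123.abc@example.com>
  Subject: Re: First mail
HEADER

# Mix the module into a throwaway object, just like a model would include it.
parser = Object.new.extend(ModExtraHeaders)
to, in_reply_to = parser.parse_extra_headers(header)

puts to
puts in_reply_to
```

With In-Reply-To at hand you could later link replies to the mail they answer.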
This is one more utility function, which can be used by our model.
Save both code blocks in app/mail/mail_utils.rb.