How to build a mail search engine using Nitro

By Kashia.

Part 3 - The Mail Parsing

All that Nitro stuff is boring, gimme some action

Yeah. You're right, up until now everything was description. Let's make the class a little more interesting.

Have a look at the source of our problem:

This is where your data sits, in mbox format. We don't like to search in there, that's why we're here, right?

Mailbox parsing fun

Let's get the mails first:

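require 'open-uri'

module MBoxParser
  # Download the mbox file at +url+ and return an array of raw mails.
  def self.parse(url)
    puts 'PARSING: ' + url
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer handles URLs.
    mbox = URI.open(url).read

    # Split on the "From " separator lines and drop the empty first element.
    mbox.split(/\n*^From\s.*\d{4}$/)[1..-1]
  end
end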

Well, that was easy, right? For those less fluent in regex: split(/\n*^From\s.*\d{4}$/) just says: 'split wherever there are optional newlines (\n*), followed by 'From' at the start of a line (^From), one whitespace character (\s), some other stuff (.*), and four digits at the end of that line (\d{4}$).'

Basically, it will match lines like this one:

From name.surname at example.com Thu Jan 27 03:30:03 2005

The pattern also matches the very first line of the file, so split puts an empty string in front of the first mail. That's why we only take the array values from the second entry onwards ([1..-1]).
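To see the split in action without downloading anything, here is a minimal sketch. The two-message sample and the names SAMPLE_MBOX, SEPARATOR and split_mbox are made up for illustration; only the regex comes from the parser above.

```ruby
# A tiny, made-up mbox sample with two messages (placeholder addresses).
SAMPLE_MBOX = <<~MBOX
  From alice at example.com Thu Jan 27 03:30:03 2005
  From: alice at example.com
  Subject: First

  Hello

  From bob at example.com Fri Jan 28 10:15:00 2005
  From: bob at example.com
  Subject: Second

  World
MBOX

# Same separator pattern as in MBoxParser.parse.
SEPARATOR = /\n*^From\s.*\d{4}$/

# Hypothetical helper: split mbox text into raw mails.
def split_mbox(text)
  parts = text.split(SEPARATOR)
  # parts[0] is the (empty) text before the first separator line, so skip it.
  parts[1..-1]
end

mails = split_mbox(SAMPLE_MBOX)
puts "Found #{mails.length} mails"
mails.each { |mail| puts mail.strip.lines.first }
```

Note that 'From: alice ...' is not treated as a separator, because the pattern wants whitespace right after 'From'. A body line that starts with 'From ' and ends in four digits would still fool it, so treat this as good enough for a tutorial rather than a complete mbox parser.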

Wasn't that nice? I like Regex. ^_^

Now we have a nice utility function to download and parse text files in mbox format.

Mail parsing fun

Now we want to pull some information out of each email. So far we only have the raw text.

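require 'date'

module ModEMailParser
  # Split a raw mail into header and body and pull out a few header fields.
  def parse_mail(raw)
    # Header and body are separated by the first blank line.
    header, body = raw.split(/\n\n/, 2)

    # Each match leaves its first capture group in $1 (nil if nothing matched).
    header[/^From:\s*(.*)$/]; from = $1
    header[/^Message-ID:\s*<(.*)>\s*$/]; id = $1
    header[/^Date:\s*(.*)\s*$/]; date = $1
    header[/^Subject:\s*(.*)\s*$/]; subject = $1

    # Note: this raises if the mail has no Date header.
    time = DateTime.parse(date)

    return from, id, time, subject, header, body
  end
end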

Yeeeehaw, more regex fun!
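If the header[/.../]; x = $1 trick looks odd, here is a small sketch of what happens. The sample mail and the helper names (RAW_MAIL, split_mail, subject_via_global, subject_via_index) are invented for this example. It shows the header/body split and two equivalent ways of reading a capture group.

```ruby
# Made-up raw mail: two header lines, a blank line, then a body with its own blank line.
RAW_MAIL = "From: alice at example.com\nSubject: Hi there\n\nFirst body line\n\nSecond paragraph\n"

# The limit of 2 means: split only at the first blank line,
# so blank lines inside the body are left alone.
def split_mail(raw)
  raw.split(/\n\n/, 2)
end

# Style used in parse_mail: run the match, then read the first group from $1.
def subject_via_global(header)
  header[/^Subject:\s*(.*)$/]
  $1
end

# Equivalent form: String#[] takes the group index as a second argument.
def subject_via_index(header)
  header[/^Subject:\s*(.*)$/, 1]
end

header, body = split_mail(RAW_MAIL)
puts subject_via_global(header)
puts subject_via_index(header)
puts body
```

Both styles return the same string. The second one avoids the global variable, which is a matter of taste more than anything else.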

If you need more information on the email, why not just extend this function?
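One possible way to extend it, sketched under the assumption that you want a single lookup for any header. header_value is a hypothetical helper, not part of the original code, and the sample header is made up.

```ruby
# Hypothetical helper: return the value of one header line, or nil if it is missing.
# [ \t]* is used instead of \s* so an empty header cannot swallow the next line.
def header_value(header, name)
  match = header.match(/^#{Regexp.escape(name)}:[ \t]*(.*?)[ \t]*$/i)
  match ? match[1] : nil
end

# Made-up header block for demonstration.
SAMPLE_HEADER = <<~HEADER.chomp
  From: alice at example.com
  To: list at example.com
  Subject: Hello
HEADER

puts header_value(SAMPLE_HEADER, 'To')
puts header_value(SAMPLE_HEADER, 'In-Reply-To').inspect
```

Inside parse_mail you could then write something like to = header_value(header, 'To') and add it to the returned values. Folded headers (values that continue on the next line) are not handled by this sketch.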

This is one more utility function, which can be used by our model.

Save those two code blocks in app/mail/mail_utils.rb.
