Detecting a footer of an email

Submitted by Milos on Tue, 06/21/2016 - 13:22

This is the 5th blog post of the Google Summer of Code 2016 project - Mailhandler.

Implementing authentication and authorization for mail senders added a layer of security to the Mailhandler project. The module was extended to support both PGP-signed and unsigned messages.

The goal for the last week was to create a mail footer analyzer and to add support for node (content) type detection via the mail subject. The pull request has been created and is under review. The analyzer's purpose is to strip the footer/signature from the message body. As of now, two types of signature/footer separators are supported:

  • -- \n as the separator line between the body and the signature of a message recommended by RFC 3676
  • the On {day}, {month} {date}, {year} at {hour}:{minute} {AM|PM} pattern, which is trickier to match and is currently used by Gmail to separate the reply from the quoted original message.

First of all, we had to create the inmail.analyzer.footer config entity and the corresponding analyzer plugin, FooterAnalyzer. Since the footer, subject and content type properties are relevant to all types of mail messages supported by Mailhandler, they were put in the MailhandlerAnalyzerResultBase class.

FooterAnalyzer currently depends on the analysis result provided by MailhandlerAnalyzer. One plugin depends on the other in order to support PGP-signed messages: MailhandlerAnalyzer first analyzes the body of a signed (or unsigned) message and extracts the actual mail body. Next, FooterAnalyzer parses the processed body stored in MailhandlerAnalyzerResult. As mentioned above, the footer analyzer currently supports footers separated by -- \n and On {day}, {month} {date}, {year} at {hour}:{minute} {AM|PM} lines. The content after these lines is put into the footer property of the analyzer result. If the message body contains one of the supported separators, the detected footer is stripped from the actual message body.
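The splitting step described above can be sketched in a few lines. This is an illustrative Python sketch, not the FooterAnalyzer plugin itself (which is a Drupal/PHP plugin); the function name split_footer and both regular expressions are assumptions made for this example, and the Gmail pattern only covers the single-line English form quoted in this post.

```python
import re

# RFC 3676 signature separator: a line consisting of exactly "-- ".
SIG_SEPARATOR = re.compile(r"^-- $", re.MULTILINE)

# Assumed pattern for the Gmail-style line that introduces a quoted message,
# e.g. "On Tue, Jun 21, 2016 at 1:22 PM, Someone <s@example.com> wrote:".
GMAIL_REPLY_LINE = re.compile(
    r"^On [A-Za-z]{3,9}, [A-Za-z]{3,9} \d{1,2}, \d{4} "
    r"at \d{1,2}:\d{2} (?:AM|PM)\b.*$",
    re.MULTILINE,
)


def split_footer(body):
    """Return (body_without_footer, footer); footer is None if no separator is found."""
    body = body.replace("\r\n", "\n")  # normalize mail line endings
    candidates = [
        match
        for match in (SIG_SEPARATOR.search(body), GMAIL_REPLY_LINE.search(body))
        if match is not None
    ]
    if not candidates:
        return body, None
    # Use whichever supported separator appears first in the body.
    first = min(candidates, key=lambda match: match.start())
    footer = body[first.end():].strip()       # content after the separator line
    stripped = body[:first.start()].rstrip()  # body with the footer removed
    return stripped, footer
```

For instance, split_footer("Hello\n-- \nMilos") yields the body "Hello" and the footer "Milos". Picking the first separator is a design choice of this sketch only; other policies are equally defensible.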

Furthermore, content type detection via the message subject has been implemented. As we are going to support creating comments via email in the following weeks, we had to create a “protocol” that allows us to differentiate between nodes and comments. We agreed to add [{entity_type}][{bundle}] before the actual message subject. For now, only the node entity type and its bundle (content/node type) are parsed and extracted. All checks on the analyzed message happen in the handler plugin (MailhandlerNode). The handler plugin checks whether the configured content type is set to “Detect” mode and, if so, takes the parsed content type and creates an entity of that node type.
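As a rough illustration of the [{entity_type}][{bundle}] subject convention, here is a minimal Python sketch. It is not the Mailhandler code; the function name parse_subject and the regular expression are hypothetical, and the sketch assumes that entity type and bundle names consist of letters, digits and underscores.

```python
import re

# Assumed form: "[entity_type][bundle] actual subject".
SUBJECT_PREFIX = re.compile(
    r"^\s*\[(?P<entity_type>\w+)\]\[(?P<bundle>\w+)\]\s*(?P<subject>.*)$",
    re.DOTALL,
)


def parse_subject(subject):
    """Return (entity_type, bundle, subject); the first two are None without a prefix."""
    match = SUBJECT_PREFIX.match(subject)
    if match is None:
        return None, None, subject.strip()
    return (
        match.group("entity_type"),
        match.group("bundle"),
        match.group("subject").strip(),
    )
```

With this sketch, a subject such as "[node][article] Hello world" is split into the entity type "node", the bundle "article" and the remaining subject "Hello world", while a subject without the prefix is returned unchanged.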

This week, students and their mentors are requested to submit mid-term evaluations. The evaluation is a summary of the project after 5 weeks of work. With FooterAnalyzer finished, Mailhandler is now capable of processing signed (and unsigned) emails, extracting the actual body and creating a node of the detected node type for an authorized user.

The plan for the next week is to extend the project with validation support. We will use entity (node) validation and extend content type validation to bundle validation too. Also, I will work on splitting the Mailhandler analyzer into smaller analyzers and adapting the handler to the changes.

Comments

Submitted by Antonio (not verified) on Mon, 06/27/2016 - 16:44

Hi Miloš,

I dealt with email footers in the past and they are not that trivial to detect.

First you have to define clearly the concept of "mail footer" in your context, and that is not easy if you want to handle any email with no assumptions.

For instance let's define a mail footer as:
"The original quoted message" OR "a signature block"

I split the definition because the text added by gmail as On {day}, {month} {date}, {year} at {hour}:{minute} {AM|PM} is semantically different from a signature separator: it is a marker for the beginning of the original quoted message.

The original quoted message happens to be at the bottom when people top-post, but as soon as they bottom-post, or (as I prefer) use the interleaved style, the marker above no longer acts as a footer separator.

IMHO it's not even trivial to decide whether a signature block is started by the first or the last "-- \n" separator: signatures can contain that separator themselves, yet some might argue that the footer starts at the first "-- \n" separator.

Anyways, my point is that software should not try to be too smart, especially on free text written by humans.

Of course there is still use for the code you wrote in a controlled scenario, where the person who writes the emails knows the limitations of the system; so just make sure to document these limitations. :)

Thanks,
Antonio