Friday, January 4, 2008

Email threading headaches (threadaches?)

Abstract: "Waaaaah Outlook, waaaaah Blackberry."

I am working on a project that has me sorting email roughly into threads: not necessarily a tree of replies, but flat threads (à la Gmail). The goal is to ingest email by script and then use XML-RPC to either attach each message to an existing email thread ticket in Trac on another machine, or create a new thread ticket.

The easy answer is simply to examine the "In-Reply-To" and/or "References" headers and keep an index of Message IDs, handily discussed in RFC 822 (In-Reply-To and References), RFC 2822 (In-Reply-To and References), and RFC 1036 (References, USENET-style). In fact, über-geek, Netscaper, nightclub owner, and all-around smart guy Jamie Zawinski posted his own guide to message threading. Jamie's article is fantastically interesting, especially because he gives some analysis of historical message archives to determine what "real" headers look like.
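As a rough sketch of that easy answer, here is what a Message-ID index might look like in Python using the standard `email` module. The names `FlatThreader` and `referenced_ids` are made up for illustration; this is not the algorithm from Jamie's guide (which builds a full tree), just a flat thread-number assignment.

```python
import re
from email import message_from_string

# Message-IDs look like <local@domain>; pull out every one in a header value.
MSGID_RE = re.compile(r"<[^<>\s]+>")


def referenced_ids(msg):
    """Return every Message-ID named in In-Reply-To and References."""
    ids = []
    for header in ("In-Reply-To", "References"):
        ids.extend(MSGID_RE.findall(msg.get(header, "")))
    return ids


class FlatThreader:
    """Hypothetical helper: maps Message-IDs to flat thread numbers."""

    def __init__(self):
        self.thread_of = {}   # Message-ID -> thread number
        self.next_thread = 1

    def add(self, raw):
        """Ingest one raw message and return the thread number it lands in."""
        msg = message_from_string(raw)
        refs = referenced_ids(msg)

        # Join the first thread that any referenced Message-ID belongs to.
        thread = None
        for mid in refs:
            if mid in self.thread_of:
                thread = self.thread_of[mid]
                break

        # No known ancestor: start a new thread.
        if thread is None:
            thread = self.next_thread
            self.next_thread += 1

        own_id = (msg.get("Message-ID") or "").strip()
        if own_id:
            self.thread_of[own_id] = thread
        # Remember referenced IDs as well, so replies to messages we never
        # saw ourselves still end up in the same thread.
        for mid in refs:
            self.thread_of.setdefault(mid, thread)
        return thread
```

A message that arrives with neither header and an unknown Message-ID simply starts a new thread, which is all this approach can do on its own.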

So this is great, right? Well, almost. Jamie's guide even gives the warning that
"[o]f course these numbers are very much dependent on the sample set, which, in this case, was probably skewed toward Unix users, and/or toward people who had been on the net for quite some time (due to the age of the archives I checked.)"

Problem the first: my project deals in large part with Microsoft Exchange and Blackberry users. Okay, well I have my own corpus of email to mine, and I can verify that Outlook/Exchange does an exceptional job of following the RFC rules.

Problem the second: Blackberry via Exchange, however, does not play nicely with In-Reply-To and References headers.

Here is an example message structure. This is the shape of the thread that I am examining:
  • Subject: Query
    From A, with no agent header but including the header "X-MimeOLE: Produced By Microsoft Exchange V6.5" and some message ID, maybe "<0001@exch04.example.lcl>". Generated with Outlook.
    • Subject: Re: Query
      From me, with agent header "User-Agent: Thunderbird 1.5.0.7 (Windows/20060909)", a valid In-Reply-To and a valid References header, and some message ID "<0a3f@example.com>". Generated, obviously, with Thunderbird.
      • Subject: Re: Query
        From P, again no agent header but the same X-MimeOLE header as A's query. There is no In-Reply-To or References header, but there are two headers, "Thread-Topic" and "Thread-Index". Has a valid Message ID. This message was sent via Blackberry using their enterprise server, I think, or maybe via the desktop sync (Poor Man's Push) application. Not sure.
        • Subject: Re: Query
          From me, with valid and correct In-Reply-To and References headers, and a logical Message ID.
      • Subject: Re: Query
        From A, again using Outlook. This Exchange message does have valid In-Reply-To and References headers, and has "Thread-Topic" and "Thread-Index" headers.
Okay, so what do we learn from this exercise? Well, Thunderbird respects headers and includes the appropriate identification header fields. And this makes sense, since Thunderbird code ostensibly came from Mr. Zawinski at some point in some way, being that it was a fork of Netscape mail. The second thing is that Outlook also observes the identifier fields. Blackberries, however, not so much. Oh, and since there are no User-Agent headers, you cannot immediately differentiate a Blackberry/Exchange message from an Outlook/Exchange message. Exchange also includes the "Thread-Topic" and "Thread-Index" fields. The "Thread-Topic" field is static throughout the thread (and is the original message subject), while the "Thread-Index" changes for each message.
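Since "Thread-Topic" stays constant across the thread, one tempting fallback is to use it as a weak grouping key for Exchange-originated messages that lack In-Reply-To and References. The sketch below assumes only what I observed in my own sample (the X-MimeOLE value and the presence of Thread-Topic); the function name `exchange_thread_hint` is made up, and Thread-Index is deliberately left unparsed.

```python
from email import message_from_string


def exchange_thread_hint(raw):
    """Return a weak grouping key for Exchange-originated mail, else None.

    Assumption: the message carries the X-MimeOLE header seen in my sample
    and a Thread-Topic header holding the original subject.
    """
    msg = message_from_string(raw)
    if "Microsoft Exchange" not in msg.get("X-MimeOLE", ""):
        return None
    topic = msg.get("Thread-Topic")
    if topic is None:
        return None
    # Collapse whitespace and case so trivial differences do not split threads.
    return " ".join(topic.split()).lower()
```

This is only a hint: two unrelated conversations that happen to share a subject get the same key, so it would need to be combined with something else (sender, date window, body text) before attaching a message to a ticket.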

One interesting thing to note is that the Blackberry uses base64 encoding for all MIME parts, including the plain text portion of the email. Outlook includes plain text as actual plain text, at least in version 2003.
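This matters for any text comparison, because the body has to be decoded before it can be compared. A minimal sketch with Python's standard `email` module, where `plain_text_body` is a name I made up: `get_payload(decode=True)` undoes base64 or quoted-printable according to the part's Content-Transfer-Encoding, so the Blackberry and Outlook forms come out the same.

```python
from email import message_from_string


def plain_text_body(raw):
    """Return the first text/plain part as a str, whatever its encoding."""
    msg = message_from_string(raw)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True reverses the Content-Transfer-Encoding (base64, etc.).
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back rather than fail.
            return payload.decode("utf-8", errors="replace")
    return ""
```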

So here's the rub: how can I get the Blackberry reply back into the message thread? And I don't even need it in the "correct" thread structure, just into a flat thread. Gmail can do it, how?

Off the top of my head, Gmail probably uses the Subject field (sans "Re:", and so on) to find what might be similar messages, and then uses some fuzzy text block matching algorithm to determine whether it's an actual reply. Gmail already does this fuzzy comparison (or something like it) to do its "Show Hidden Text" feature in replies. In this way, Gmail was able to handle the Blackberry reply listed above.
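To be clear, I have no idea what Gmail actually does; this is just a sketch of my guess, using `difflib` from the Python standard library. The function names, the prefix list, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Leading "Re:", "RE:", "Fwd:", "Fw:" (possibly repeated, possibly "Re[2]:").
PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*(\[\d+\])?\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace, lowercase."""
    return " ".join(PREFIX_RE.sub("", subject).split()).lower()


def quoted_overlap(reply_body, earlier_body):
    """Fraction of the earlier message's text that reappears in the reply."""
    a = " ".join(earlier_body.split())
    # Drop ">" quote markers so re-wrapped quoted text still lines up.
    b = " ".join(reply_body.replace(">", " ").split())
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a)


def looks_like_reply(reply_subject, reply_body,
                     earlier_subject, earlier_body, threshold=0.8):
    """Guess whether one message is a reply that quotes the other."""
    if normalize_subject(reply_subject) != normalize_subject(earlier_subject):
        return False
    return quoted_overlap(reply_body, earlier_body) >= threshold
```

The character-level comparison is quadratic in the worst case, which is exactly the kind of expense that makes this unattractive to run against every stored message.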

Interestingly, however, if the Blackberry reply has a different To: address than previously seen in the thread, the threading breaks. So, for example, if I get a message from P and reply with my @gmail.com email, and then P replies to my @example.com email using his Blackberry, the thread is broken in Gmail at the third message. So Gmail's behavior is not entirely consistent.

The Blackberry reply-matching is pretty important to my project. Critical, actually. It seems like the only way to do it so far is by doing some complex (and expensive) text matching. The fuzzy search algorithm is attractive, but since I want to use this in some ticketing system like Trac, I would probably need to use some second corpus of text to perform the matching (remember, for me Trac is accessed by XML-RPC between two machines, so exhaustive ticket searches are very expensive).
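One way that second corpus might look is a small local index on the ingesting machine, so that candidate lookups never touch Trac. This is a sketch using `sqlite3` from the Python standard library; the `TicketIndex` class, its table layout, and the idea of keying on a normalized subject are all assumptions of mine, not anything Trac provides.

```python
import sqlite3


class TicketIndex:
    """Hypothetical local index of ingested messages and their ticket IDs."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            " message_id TEXT PRIMARY KEY,"
            " subject_key TEXT,"
            " body TEXT,"
            " ticket_id INTEGER)"
        )
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS seen_subject ON seen (subject_key)"
        )

    def record(self, message_id, subject_key, body, ticket_id):
        """Remember which ticket a message was attached to."""
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO seen VALUES (?, ?, ?, ?)",
                (message_id, subject_key, body, ticket_id),
            )

    def ticket_for_message_id(self, message_id):
        """Cheap exact lookup for In-Reply-To / References matching."""
        row = self.db.execute(
            "SELECT ticket_id FROM seen WHERE message_id = ?", (message_id,)
        ).fetchone()
        return row[0] if row else None

    def candidates(self, subject_key):
        """Stored (ticket_id, body) pairs to run the text matching against."""
        return self.db.execute(
            "SELECT ticket_id, body FROM seen WHERE subject_key = ?"
            " ORDER BY rowid DESC",
            (subject_key,),
        ).fetchall()
```

With something like this, the XML-RPC call to Trac is only needed once per message, to create or update the ticket, and the expensive comparison runs over a handful of local candidates instead of every ticket.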

I went back and looked at more Blackberry messages. The thread example above is originally from 2006, and in the meantime P upgraded his Blackberry from something pre-Curve to an 8830. The behavior is the same, in that it does not respect In-Reply-To or References. I checked emails from two other Blackberry users, with the same results.

So, I will trudge forward and shoot to use the fuzzy text matching if I can. Maybe something will fix itself in the meantime (not likely). But my guess is that the reason Blackberry threading works in Gmail is its (incredibly) smart thread finding.

2 comments:

Jim said...

It's April 2008 and Yahoo Mail still leaves out In-Reply-To and References headers. Amazing.

Tony said...

I'm working on a similar project, and found this issue myself. I'm now looking for a workaround that doesn't involve a fuzzy solution.

I don't suppose you found any other headers that are respected and returned by Yahoo or Blackberries?