如何从电子邮件中删除引用的文本并仅显示新文本

2022-09-03 05:25:12

我正在解析电子邮件。当我看到对电子邮件的回复时,我想删除引用的文本,以便我可以将文本附加到上一封电子邮件中(即使它是回复)。

通常,您会看到以下内容:

第一封电子邮件(对话开始)

This is the first email

第2封邮件(首先回复)

This is the second email

Tim said:
This is the first email

此输出将仅为“这是第二封电子邮件”。尽管不同的电子邮件客户端引用文本的方式不同,但如果有某种方法可以仅获取新的电子邮件文本,那也是可以接受的。


答案 1

我使用以下正则表达式来匹配引用文本的引线(最后一个是计数的正则表达式):

  /** general spacers for time and date */
  private static final String spacers = "[\\s,/\\.\\-]";

  /** matches times */
  private static final String timePattern  = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";

  /** matches day of the week */
  private static final String dayPattern   = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";

  /** matches day of the month (number and st, nd, rd, th) */
  private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";

  /** matches months (numeric and text) */
  private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:uary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
                                              "|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";

  /** matches years (only 1000's and 2000's, because we are matching emails) */
  private static final String yearPattern  = "(?:[1-2]?[0-9])[0-9][0-9]";

  /** matches a full date */
  private static final String datePattern     = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
                                                "(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
                                                 spacers + "+" + yearPattern;

  /** matches a date and time combo (in either order) */
  private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
                                                "(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";

  /** matches a leading line such as
   * ----Original Message----
   * or simply
   * ------------------------
   */
  private static final String leadInLine    = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";

  /** matches a header line indicating the date */
  private static final String dateLine    = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";

  /** matches a subject or address line */
  private static final String subjectOrAddressLine    = "((?:from)|(?:subject)|(?:b?cc)|(?:to))|:.*\n";

  /** matches gmail style quoted text beginning, i.e.
   * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
   */
  private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";


  /** matches the start of a quoted section of an email */
  private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
                                                                        "(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
                                                                        gmailQuotedTextBeginning + ")"
                                                                      );

我知道在某些方面这是矫枉过正(而且可能很慢!),但它效果很好。如果您发现任何与此不匹配的内容,请告诉我,以便我可以改进它!


答案 2

查看谷歌对此的专利:http://www.google.com/patents/US7222299

总之,他们对文本的部分内容(大概是类似句子的东西)进行哈希处理,然后在以前的消息中查找与哈希值匹配的内容。超级快,他们可能也将其用作线程算法的输入。真是个好主意!


推荐