Annotating Needles In The Haystack Without Looking: Product Information Extraction From Emails

Weinan Zhang, Amr Ahmed, Jie Yang, Vanja Josifovski, Alex J. Smola
DOI: https://doi.org/10.1145/2783258.2788580
2015-01-01
Abstract: Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., purchases, events) into templates. Extracting structured data from B2C emails allows users to track important information on various devices. However, it also poses several challenges: short response times are required despite massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means that no one except the receiver can review the messages or any information derived from them. In this paper we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on how to annotate product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structured learning methods, such as Conditional Random Fields (CRFs), solve this problem well. To accomplish this task, we propose a hybrid approach that trains a CRF model using the labels predicted by binary classifiers (weak learners). Because the weak learners can perform poorly, we apply the Expectation Maximization (EM) algorithm to the CRF to remove label noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over both the weak learners and plain CRFs.
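The idea of treating weak-learner predictions as noisy labels and cleaning them with EM can be sketched on a toy scale. The example below is not the paper's EM-CRF: it swaps the CRF for a much simpler naive Bayes token model, and the feature function, the capitalisation-based weak learner, the fixed weak-learner accuracy `weak_acc`, and the sample emails are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

def features(tokens, i):
    """Hypothetical features for token i: its word and its left neighbour."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return [("word", tokens[i].lower()), ("prev", prev)]

def weak_label(token):
    """Stand-in weak learner (an assumption): capitalised tokens are 'product'."""
    return 1 if token[:1].isupper() else 0

def em_denoise(sentences, n_iter=10, weak_acc=0.8, smooth=0.1):
    """Return P(true label = product) for every token, in reading order.

    The true label is latent; the weak label is assumed to agree with it
    with probability weak_acc (a fixed, assumed noise rate).
    """
    data = [(features(s, i), weak_label(s[i]))
            for s in sentences for i in range(len(s))]
    if not data:
        return []
    vocab = defaultdict(set)  # distinct values seen per feature slot
    for feats, _ in data:
        for slot, value in feats:
            vocab[slot].add(value)
    # Initialise the posteriors from the weak labels alone.
    q = [weak_acc if y == 1 else 1.0 - weak_acc for _, y in data]
    for _ in range(n_iter):
        # M-step: soft counts give the class prior and feature likelihoods.
        n = [0.0, 0.0]
        counts = [defaultdict(float), defaultdict(float)]
        for (feats, _), q1 in zip(data, q):
            for z, w in ((0, 1.0 - q1), (1, q1)):
                n[z] += w
                for f in feats:
                    counts[z][f] += w
        # E-step: posterior over the latent true label of each token.
        new_q = []
        for feats, y in data:
            log_p = []
            for z in (0, 1):
                lp = math.log((n[z] + smooth) / (n[0] + n[1] + 2 * smooth))
                lp += math.log(weak_acc if y == z else 1.0 - weak_acc)
                for f in feats:
                    lp += math.log((counts[z][f] + smooth) /
                                   (n[z] + smooth * len(vocab[f[0]])))
                log_p.append(lp)
            m = max(log_p)  # subtract the max for numerical stability
            e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
            new_q.append(e1 / (e0 + e1))
        q = new_q
    return q

# Made-up sample emails, tokenised by hand.
emails = [
    ["Your", "order", "of", "Acme", "Blender", "has", "shipped"],
    ["Thanks", "for", "buying", "Acme", "Blender", "today"],
]
probs = em_denoise(emails)
```

No individual token is ever hand-labelled here, which mirrors the constraint in the abstract. In the paper's setting the token model would be a CRF, so the E-step would compute posteriors over whole label sequences rather than independent tokens.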