Making Sense of Thousands of Email Messages
IQSS has developed a research tool that facilitates grouping and organizing large sets of digital documents, helping the user make sense of them through an interactive process. We propose to build an ingest front end that will allow the Harvard Library to use the tool to understand, organize, and label email archives and similar digital content worth preserving.
Mercè Crosas, Director of Product Development, Institute for Quantitative Social Science
Andrea Goethals, Manager of Digital Preservation and Repository Services, Harvard Library Office for Information Systems
Wendy Gogel, Manager of Digital Content and Projects, Harvard Library Office for Information Systems
March 2012 Update
This project started in January. During the first part of the year, we worked on optimizing the text clustering application (named “Consilience”) so that users can quickly navigate through possible clustering solutions and choose their preferred clustering, or partition, for a given document set.
The IQSS and Library teams met in February to review the format of the email messages and to differentiate between the actual text to be analyzed and the metadata fields. The Library team identified a first set of about 500 email messages to work on.
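As an illustration of separating the text to be analyzed from the metadata fields, here is a minimal sketch using Python's standard `email` package. It is not the project's actual ingest code. The function name `split_email` and the choice of header fields are assumptions made for this example, and it assumes each message is available as raw RFC 5322 text.

```python
from email import message_from_string
from email.policy import default

# Hypothetical selection of header fields to treat as metadata.
METADATA_FIELDS = ("From", "To", "Date", "Subject")


def split_email(raw):
    """Return (metadata, text) for one raw email message."""
    msg = message_from_string(raw, policy=default)
    # Keep only the headers that are present in this message.
    metadata = {
        name: str(msg[name])
        for name in METADATA_FIELDS
        if msg[name] is not None
    }
    # Prefer the plain-text body; fall back to an empty string.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return metadata, text


raw_message = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Subject: Meeting\n"
    "\n"
    "Hello team,\n"
    "the meeting is on Friday.\n"
)
metadata, text = split_email(raw_message)
```

Keeping the headers in a separate dictionary means the clustering step only sees the message body, while fields such as sender and subject remain available for labeling.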
The IQSS team is now reformatting this first set and extracting the metadata. In the coming weeks, the reformatted document set will be processed through the following steps: 1) calculate the word count in each email document, 2) run more than 100 clustering algorithms for the set, and 3) calculate new meaningful clustering solutions in the Bell’s space.
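The first step, counting words in each email document, can be sketched as follows. This is a simplified illustration under stated assumptions, not the Consilience implementation: the tokenizer (lowercasing and splitting on runs of letters, digits, and apostrophes) and the function names `word_counts` and `document_term_matrix` are hypothetical.

```python
import re
from collections import Counter


def word_counts(text):
    """Count word occurrences in one document (assumed simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row of counts per document."""
    counts = [word_counts(d) for d in docs]
    # Shared, sorted vocabulary across all documents.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix


docs = ["Budget meeting on Friday", "Friday budget review"]
vocab, matrix = document_term_matrix(docs)
```

A matrix of this shape, with one row per email and one column per word, is a common input format for clustering algorithms, so it would be a natural hand-off to the second step.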