
enron_xls's Introduction

Enron Spreadsheet Corpus

The Enron Corpus is a massive database of emails amassed in the investigation of the former Enron Corporation. The original corpus is available as a series of PST email archives. The emails include tens of thousands of spreadsheets.

Various sources point to a version of the dataset that includes all attachments. Those links have since been removed, but the dataset was preserved on the Internet Archive.

The original dataset included personally identifiable information such as birth dates and Social Security numbers. Nuix produced a cleaned dataset and made it available to the community. Unfortunately, that dataset has since been removed.

The spreadsheets in this dataset are in their original format, including BIFF2, TSV, semicolon-delimited values, SYLK, and HTML files saved as XLS. They have been de-duplicated by MD5 hash.
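Because the files keep their original format, a file named .xls is not necessarily a binary workbook. The sketch below shows one way to guess the real format from the leading bytes. The function name sniffFormat and its return labels are hypothetical and not part of this repository, and the delimiter checks are rough heuristics rather than a full parser.

```javascript
// Hypothetical helper (not part of this repository): guess a spreadsheet's
// real format from its first bytes, regardless of the file extension.
function sniffFormat(buf) {
  // Compound File Binary (OLE2) container, used by BIFF5/BIFF8 .xls files
  const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
  if (buf.length >= 8 && CFB_MAGIC.every((b, i) => buf[i] === b)) return "cfb";

  // BIFF2 stream starts with a BOF record: type 0x0009, length 0x0004
  if (buf.length >= 4 && buf[0] === 0x09 && buf[1] === 0x00 &&
      buf[2] === 0x04 && buf[3] === 0x00) return "biff2";

  // Remaining candidates are text formats; inspect the first 1 KiB only
  const head = buf.subarray(0, 1024).toString("latin1");

  // SYLK files begin with an ID record
  if (head.startsWith("ID;P")) return "sylk";

  // HTML saved with an .xls extension
  if (/^\s*<(!doctype\s+html|html|table)/i.test(head)) return "html";

  // Heuristic: look at the delimiter used on the first line
  const firstLine = head.split(/\r?\n/)[0];
  if (firstLine.includes("\t")) return "tsv";
  if (firstLine.includes(";")) return "semicolon";

  return "unknown";
}
```

A caller could read each file with fs.readFileSync and group the corpus by the returned label before handing files to a parser.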

Files

Methodology

EDRM Dataset

The included parse.mjs script runs in Node.js 16 and automates downloading the spreadsheets from the archive.

The actual dataset is a series of ZIP files. It is possible to cherry-pick from the ZIP archives without having to download the entire 74 GB archive.
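One general way to cherry-pick from a remote ZIP file, assuming the server honors HTTP Range requests, is to fetch only the tail of the file, locate the end-of-central-directory (EOCD) record, and then request just the byte ranges that are needed. This is a sketch of that technique and not necessarily what parse.mjs does. The function names are hypothetical, and ZIP64 archives (larger than 4 GiB) are not handled.

```javascript
// Hypothetical helpers illustrating range-based ZIP access.
// Assumption: the tail buffer holds the last bytes of the ZIP file,
// fetched with a request such as "Range: bytes=-65557".
const EOCD_SIGNATURE = 0x06054b50; // bytes "PK\x05\x06", read little-endian
const EOCD_MIN_SIZE = 22;          // EOCD record size with an empty comment

// Scan backwards for the EOCD record and read the central directory location.
function findCentralDirectory(tail) {
  for (let i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (tail.readUInt32LE(i) !== EOCD_SIGNATURE) continue;
    return {
      entries: tail.readUInt16LE(i + 10), // total number of entries
      size: tail.readUInt32LE(i + 12),    // central directory size in bytes
      offset: tail.readUInt32LE(i + 16),  // central directory start offset
    };
  }
  return null; // no EOCD found (truncated tail, or not a ZIP file)
}

// Build the Range header value covering `size` bytes starting at `offset`.
function rangeHeader(offset, size) {
  return `bytes=${offset}-${offset + size - 1}`;
}
```

With the central directory in hand, each entry lists a file name and the offset of its local header, so a second ranged request can retrieve a single spreadsheet without downloading the rest of the archive.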

Nuix Dataset

Starting from the cleaned email set, each PST file was downloaded and processed using the excellent pst-extractor Node module.

Every available XLS attachment (these emails predate XLSX, which was introduced in 2007) was extracted, and the files were de-duplicated based on MD5 checksum.

Any duplicates of files in the EDRM set were removed.
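The de-duplication steps above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the dedupe function, the { name, data } record shape, and the knownHashes parameter are all assumptions made for the example.

```javascript
const { createHash } = require("node:crypto");

// Hex-encoded MD5 digest of a Buffer
function md5(buf) {
  return createHash("md5").update(buf).digest("hex");
}

// Hypothetical helper: keep the first file seen for each MD5 hash.
// `files` is an array of { name, data } records (assumed shape).
// `knownHashes` optionally holds hashes from another set (for example the
// EDRM files), so anything already present there is skipped as well.
function dedupe(files, knownHashes = new Set()) {
  const seen = new Set(knownHashes);
  const unique = [];
  for (const file of files) {
    const hash = md5(file.data);
    if (seen.has(hash)) continue; // duplicate content, drop it
    seen.add(hash);
    unique.push({ ...file, md5: hash });
  }
  return unique;
}
```

Passing the hashes of the EDRM files as knownHashes when processing the Nuix attachments would yield only the spreadsheets that are new to the combined set.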

References

