nicbet / mailboxminer

A tool for mining mail repositories stored as MBOX archives into a PostgreSQL database. Features include robustness to different MBOX file formats, detection of message encoding and conversion to UTF-8, unpacking of MIME messages, separate storage of attachments, automatic plain-text copies of HTML emails (useful for data mining), and resolution of multiple identities (aliases are matched by name heuristics), among others.

License: GNU General Public License v3.0

Java 100.00%

mailboxminer's People

Contributors

bramadams, nicbet

mailboxminer's Issues

Large mbox file: Exception in thread "main" java.lang.OutOfMemoryError, heap increase doesn't work

Processing /target/tmp/mail.mbox
Exception in thread "main" java.lang.OutOfMemoryError
	at java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:214)
	at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:206)
	at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:173)
	at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:538)
	at java.base/java.lang.StringBuilder.append(StringBuilder.java:174)
	at ca.queensu.cs.sail.mailboxmina2.common.MailboxParser.parseMessages(MailboxParser.java:103)
	at ca.queensu.cs.sail.mailboxmina2.main.modules.InsertModule.storeInDatabase(InsertModule.java:121)
	at ca.queensu.cs.sail.mailboxmina2.main.modules.InsertModule.run(InsertModule.java:88)
	at ca.queensu.cs.sail.mailboxmina2.main.Main.main(Main.java:79)

The mbox file is 2.7 GB. I've increased the heap size to 12000M (see below; this is on a MacBook with 16 GB of RAM), but the error persists.

scripts:

echo "Creating database ${DB_NAME}"
java -Xmx4096M -jar $CWD/mailboxmina2-0.0.1-SNAPSHOT.jar -connection $URL/ -username $USER -password $PASS -dbname $DB_NAME -module create -drop true

echo "Populating database ${DB_NAME}"
# -Djavax.activation.debug=true
java -Xmx12000M -jar $CWD/mailboxmina2-0.0.1-SNAPSHOT.jar -connection $URL/$DB_NAME -username $USER -password $PASS -module insert -debug -logfile ${BASE}-insert.log -verbosity 3 -path $TMP
rm -Rf $TMP

echo "Unifying duplicate email addresses in database ${DB_NAME}"
java -Xmx12000M -jar $CWD/mailboxmina2-0.0.1-SNAPSHOT.jar -connection $URL/$DB_NAME -username $USER -password $PASS -module persons -debug -logfile ${BASE}-persons.log -verbosity 3

echo "Recovering threads from database ${DB_NAME}"
java -Xmx12000M -jar $CWD/mailboxmina2-0.0.1-SNAPSHOT.jar -connection $URL/$DB_NAME -username $USER -password $PASS -module threads -debug -logfile ${BASE}-threads.log -verbosity 3
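
The trace above fails in AbstractStringBuilder.hugeCapacity, which is where Java reports that a StringBuilder would have to grow past the maximum array length (roughly 2^31 characters). That limit is independent of -Xmx, which would explain why raising the heap does not help if the parser accumulates the whole 2.7 GB file in a single builder. That reading of MailboxParser.parseMessages is an assumption based on the trace, not on its source. A minimal sketch of the alternative, reading line by line and handing off each message as soon as it is complete, might look like the following. The class StreamingMboxSplitter and its split method are hypothetical names, not part of MailboxMiner, and the separator rule (a line starting with "From ") is a simplification that ignores the differences between mbox variants.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of MailboxMiner.
public class StreamingMboxSplitter {

    /**
     * Reads an mbox stream line by line and passes each message to the
     * callback as soon as the next separator line is seen.
     * Returns the number of messages emitted.
     */
    public static int split(Reader source, Consumer<String> onMessage) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        StringBuilder current = new StringBuilder();
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // Simplified rule: a line starting with "From " begins a new message.
            if (line.startsWith("From ") && current.length() > 0) {
                onMessage.accept(current.toString());
                count++;
                current.setLength(0); // reuse the buffer for the next message
            }
            current.append(line).append('\n');
        }
        // Emit the last message, which has no separator after it.
        if (current.length() > 0) {
            onMessage.accept(current.toString());
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory sample; a real run would pass a FileReader or similar.
        String sample =
              "From a@example.org Mon Jan  1 00:00:00 2000\nSubject: one\n\nbody one\n"
            + "From b@example.org Tue Jan  2 00:00:00 2000\nSubject: two\n\nbody two\n";
        List<String> messages = new ArrayList<>();
        int n = split(new StringReader(sample), messages::add);
        System.out.println(n + " messages");
    }
}
```

With this shape, the buffer only ever holds one message, so memory use is bounded by the largest single message rather than by the size of the mbox file.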

Message body can't be cleaned

Hi Nicolas,
I tried to clean the message body with the following command:

#Cleaning message body
    java -Xmx1024M -jar /home/hema/MailboxMiner-mvn-build/MailboxMiner2/target/mailboxmina2-0.0.1-SNAPSHOT.jar -connection //localhost:5432/test_db -username hema -password PASSWORD -module clean -debug -logfile test-clean.log -verbosity 3 -path PATH/TO/FILE

The field body_text_cleaned_text of the bodies table is empty.

It seems that the clean module either has no implementation or is missing.

Records won't get inserted in PostgreSQL

Hi,
I followed the steps under 'usage' to reconstruct the database from the dump.
Then I ran the import.sh file with the required arguments.
Here is what I did:

[root@localhost dist]# sh import.sh /home/hema/test_data //localhost:5432 testdb hema password

I get the following output:

Unpacking mbox files to /home/hema/MailboxMiner-master/MailboxMiner2/dist/tmp
Creating database testdb
[LOG] Connection to PostgreSQL 9.4.6 successful!
Dropping existing database testdb
Error dropping database testdb : ERROR: database "testdb" is being accessed by other users
Detail: There is 1 other session using the database.
Creating database testdb
Error creating database testdb : ERROR: database "testdb" already exists
Connection closed!
Populating database testdb
Running insertion module...
Processing /home/hema/MailboxMiner-master/MailboxMiner2/dist/tmp/2000-July.mbox.gz.mbox
InsertModule finished successfully!
Unifying duplicate email addresses in database testdb
0
PersonalitiesModule finished successfully!
Recovering threads from database testdb
Running threads module...
ThreadsModule finished successfully!

There are no error messages displayed. But when I view the tables in pgAdmin (a GUI for PostgreSQL), the records have not been inserted. Please advise: where have I gone wrong?

Note: I have attached the log.

log.zip
