govcert-lu / eml_parser
python eml parser module
Home Page: http://eml-parser.readthedocs.io/
License: GNU Affero General Public License v3.0
my attachment parse result:
{'filename': 'part-000',
'size': 430,
'hash':
{'md5': '65c2b2c519925d7c6df9a990f03c80ca',
'sha1': '2746fe51cc7d21e8f701a1c86261801d44a27513',
'sha256': '9fc8fde257977393b0...0960f45fdb7', 'sha512': 'c31...db65'}, 'raw': b'VGhpcyBpcyBhbiBhdXRvbWF0aWNhbGx5IGdlbmVyYXRlZCBtZXNzYWdlIGZyb20gU2VuZEdyaWQuDQoNCkknbSBzb3JyeSB0byBoYXZlIHRvIHRlbGwgeW91IHRoYXQgeW91ciBtZXNzYWdlIHdhcyBub3QgYWJsZSB0byBiZQ0KZGVsaXZlcmVkIHRvIG9uZSBvZiBpdHMgaW50ZW5kZWQgcmVjaXBpZW50cy4NCg0KSWYgeW91IHJlcXVpcmUgYXNzaXN0YW5jZSB3aXRoIHRoaXMsIHBsZWFzZSBjb250YWN0IFNlbmRHcmlkIHN1cHBvcnQuDQoNCm51enplbDoyMDUzNTc6PGZvb2JhckBmb29iYXIuY29tPiA6IDE5OC4zNy4xNTIuMzQgOiBteDAzLm1haWwuZ29vLm5lLmpwOlsyMTAuMTY1LjEwLjFdIDogNTUwIDUuMS4xIHNpZD1pMDFLMW4wMGwwa24xRW0wMSBBZGRyZXNzIHJlamVjdGVkIGZvb2JhckBmb29iYXIuY29tLiBbY29kZT0yOF0gIGluIFJDUFQgVE8NCg==',
'content_header':
{'content-type': ['text/plain'],
'content-disposition': ['inline'],
'content-transfer-encoding': ['7bit'],
'content-description': ['Notification']}}
But the file name shown in Microsoft Outlook is 189844630.
Which file name is right?
The raw .eml is:
------------=_1395792079-24137-58419
Content-Type: message/rfc822; name="189844630"
Content-Disposition: inline; filename="189844630"
Content-Description: Undelivered Message
Content-Transfer-Encoding: 7bit
I think 189844630 is the right value. Maybe the parser returned the wrong one?
Hi, is it possible to get the content of an email?
Hey there,
First, thanks for this great library that helps me for over 6 months now.
I recently encountered a problem when I used eml-parser to extract data from more than 20k emails. After a while, my Python process started to eat all of my 32 GB of RAM. At first everything was fine; then suddenly, every second, hundreds of megabytes were being added to the Python process in which eml-parser runs.
I found out which eml file was responsible for the problem, but for confidentiality reasons I can't upload it. The most important thing I can tell you, and I think it's the main problem, is that the eml file has an attachment that is itself another eml file. I think the library may get confused and try to parse the eml recursively and indefinitely.
The problem doesn't appear when eml-parser is initialised like this:
parser = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False, parse_attachments=False)
If I set parse_attachments to True, it runs forever and eats all my RAM.
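To check whether a given file really contains nested emails before handing it to the parser, a depth-limited walk with the standard library can help. This is only a sketch: the helper name `attached_message_depth` and its `limit` parameter are made up for illustration and are not part of eml_parser.

```python
import email
from email import policy
from email.message import EmailMessage


def attached_message_depth(msg, limit=10, _level=0):
    """Return how deeply message/rfc822 parts are nested inside msg.

    Raises ValueError once the nesting exceeds `limit`, so a pathological
    file fails fast instead of consuming memory.
    """
    if _level > limit:
        raise ValueError(f"attached messages nested deeper than {limit}")
    if not msg.is_multipart():
        return _level
    # The children of a message/rfc822 part belong to the attached email,
    # so they sit one level deeper than the part itself.
    is_attached_email = msg.get_content_type() == "message/rfc822"
    child_level = _level + 1 if is_attached_email else _level
    return max(
        (attached_message_depth(part, limit, child_level) for part in msg.get_payload()),
        default=_level,
    )


# Build a tiny email that carries another email as an attachment.
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")

outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(inner)

parsed = email.message_from_bytes(outer.as_bytes(), policy=policy.default)
depth = attached_message_depth(parsed)
print("nesting depth:", depth)
```

Files whose depth exceeds the limit could then be parsed with parse_attachments=False, as in the workaround above.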
Is there any way to extract the email body?
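As a point of comparison, the body can also be read with Python's standard `email` package alone. The sketch below does not use eml_parser at all; the sample message and the helper name `extract_body_text` are invented for illustration.

```python
import email
from email import policy

# A made-up, minimal message used only for this example.
RAW = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"This is the body.\r\n"
)


def extract_body_text(raw_bytes):
    """Return the preferred text body (plain first, then HTML) or ''."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""


body_text = extract_body_text(RAW)
print(body_text.strip())
```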
pip install eml_parser==1.17.0
ERROR: Could not find a version that satisfies the requirement eml_parser==1.17.0 (from versions: 0.9, 1.0, 1.1, 1.3, 1.4, 1.5,
1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.11.1, 1.11.2, 1.11.4, 1.11.5, 1.11.6, 1.11.7)
ERROR: No matching distribution found for eml_parser==1.17.0
If I use pip install eml_parser, I get version 1.11.7, which leads to this error:
AttributeError: module 'eml_parser' has no attribute 'EmlParser'
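Since pip only offered versions up to 1.11.7 for this interpreter, it can help to confirm which release is actually installed before calling the class-based API. A small sketch using the standard library follows; the assumption that `EmlParser` is missing simply because the installed release is older is a guess based on the output above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("eml_parser")
print("eml_parser version:", version or "not installed")
```

If the reported version is an old one, upgrading pip and the Python interpreter may be what is needed before a newer release becomes installable.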
I've been trying to parse EMLs (which this library is amazing for), but I have issues getting attachment content. It seems it's always None (null in the JSON):
{
"content_header": {
"content-disposition": [
"attachment; filename=\"rekha resume3.doc\""
],
"content-transfer-encoding": [
"base64"
],
"content-type": [
"application/msword; name=\"rekha resume3.doc\""
],
"x-attachment-id": [
"f_inzzd1g90"
]
},
"extension": "doc",
"filename": "rekha resume3.doc",
"hash": {
"md5": "019ba196161169e6028d2c4761663c49",
"sha1": "eb59d3fba44bde4585e885d0923a0727d18a0ab4",
"sha256": "4510a62b2e582167ebbabe67c79e9ab54040c68f4c67b6434a60ca78fe8d502a",
"sha512": "d4c35801db7940fe858f52d8bba92bb12ad07d23630e446cbae789414f70df9f759c642961cca196f7b979b61d736c72c9355f4112bd6cc1b9bd579ab2afa76f"
},
"raw": null,
"size": 80642
}
]
I initially parsed the content as:
msg = eml_parser.eml_parser.decode_email_b(fdata, include_raw_body=True, include_attachment_data=True)
per the instructions on the readthedocs pages. I can't seem to get back the actual content of the attachment (it does give the hashes, so it does have the file data).
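When the raw field is populated, it appears to hold base64 text (the first report on this page shows such a value). Under that assumption, the content can be decoded and checked against the reported hash as sketched below; `decode_attachment` is a hypothetical helper and the sample dictionary is invented.

```python
import base64
import hashlib


def decode_attachment(attachment):
    """Decode the base64 'raw' field and verify it against the sha256 hash.

    Returns None when no raw data was included in the parse result.
    """
    raw = attachment.get("raw")
    if raw is None:
        return None
    data = base64.b64decode(raw)
    expected = attachment.get("hash", {}).get("sha256")
    if expected and hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("decoded data does not match the reported sha256")
    return data


# Invented sample shaped like the attachment entries shown above.
payload = b"hello attachment"
sample = {
    "filename": "part-000",
    "size": len(payload),
    "hash": {"sha256": hashlib.sha256(payload).hexdigest()},
    "raw": base64.b64encode(payload),
}
decoded = decode_attachment(sample)
missing = decode_attachment({"filename": "x.doc", "raw": None})
```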
I have a problem using the eml-parser library.
My environment is:
WSL on Windows (Ubuntu 18.04)
Python 3.7 and 3.6
Jupyter notebook
In Jupyter, this error message comes up:
-> module 'eml_parser' has no attribute 'EmlParser'
when I use code like this (here is my code):
import datetime
import json
import eml_parser
def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial
with open('YourECG.eml', 'rb') as fhdl:
    raw_email = fhdl.read()
ep = eml_parser.EmlParser()
parsed_eml = ep.decode_email_bytes(raw_email)
print(json.dumps(parsed_eml, default=json_serial))
Thank you, a lot!
In a bounced message I got:
The undelivered mail returned to sender has Content-Type: text/plain; charset=us-ascii
The attached returned message has Content-Type: text/plain; charset=iso-8859-1
eml_parser returns the following error:
`Traceback (most recent call last):
File "/home/joao/Projects/email-engine/eml_parser_test.py", line 18, in
parsed_eml = eml_parser.eml_parser.decode_email_b(message, include_raw_body=False, include_attachment_data=True)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 417, in decode_email_b
return parse_email(msg, include_raw_body, include_attachment_data, pconf, parse_attachments=parse_attachments)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 893, in parse_email
report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 206, in traverse_multipart
attachments.update(traverse_multipart(part, counter, include_attachment_data)) # type: ignore
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 203, in traverse_multipart
prepare_multipart_part_attachment(msg, counter, include_attachment_data)) # type: ignore
File "/home/joao/Projects/email-engine/venv/lib64/python3.7/site-packages/eml_parser/eml_parser.py", line 249, in prepare_multipart_part_attachment
data = bytes(payload[0])
File "/usr/lib64/python3.7/email/message.py", line 164, in bytes
return self.as_bytes()
File "/usr/lib64/python3.7/email/message.py", line 178, in as_bytes
g.flatten(self, unixfrom=unixfrom)
File "/usr/lib64/python3.7/email/generator.py", line 116, in flatten
self._write(msg)
File "/usr/lib64/python3.7/email/generator.py", line 195, in _write
self._write_headers(msg)
File "/usr/lib64/python3.7/email/generator.py", line 418, in _write_headers
self._fp.write(self.policy.fold_binary(h, v))
File "/usr/lib64/python3.7/email/policy.py", line 200, in fold_binary
folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
File "/usr/lib64/python3.7/email/policy.py", line 214, in _fold
return self.header_factory(name, ''.join(lines)).fold(policy=self)
File "/usr/lib64/python3.7/email/headerregistry.py", line 258, in fold
return header.fold(policy=policy)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 157, in fold
return _refold_parse_tree(self, policy=policy)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 2698, in _refold_parse_tree
part.ew_combine_allowed, charset)
File "/usr/lib64/python3.7/email/_header_value_parser.py", line 2785, in _fold_as_ew
encoded_word = _ew.encode(to_encode_word, charset=encode_as)
File "/usr/lib64/python3.7/email/_encoded_words.py", line 222, in encode
bstring = string.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)
Process finished with exit code 1`
An option to NOT parse attached messages would be great; alternatively, just changing the encoding used to open the attachment would work, I think.
I've attached the message too.
Thanks in advance.
amostra_eml.txt
import datetime
import json
import eml_parser
def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial
with open('sample.eml', 'rb') as fhdl:
    raw_email = fhdl.read()
parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
print(json.dumps(parsed_eml, default=json_serial))
AttributeError Traceback (most recent call last)
in ()
13 raw_email = fhdl.read()
14
---> 15 parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
16
17 print(json.dumps(parsed_eml, default=json_serial))
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eml_parser/eml_parser.py in decode_email_b(eml_file, include_raw_body, include_attachment_data, pconf, policy)
320 msg = email.message_from_bytes(eml_file, policy=policy)
321
--> 322 return parse_email(msg, include_raw_body, include_attachment_data, pconf)
323
324
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/eml_parser/eml_parser.py in parse_email(msg, include_raw_body, include_attachment_data, pconf)
426 # parse and decode subject
427 subject = msg.get('subject', '')
--> 428 headers_struc['subject'] = eml_parser.decode.decode_field(subject)
429
430 # If parsing had problem... report it...
AttributeError: module 'eml_parser' has no attribute 'decode'
Hello,
I need help parsing a particular EML file that cannot be attached here due to confidentiality. If the EML is still needed after this information, I can send it via email or in some other way.
OS: Debian 9.2
Python: 3.5.3
eml_parser: 1.8
file-magic: 0.3.0
EML generated by: User-agent: Microsoft-MacOutlook/10.c.0.180410
Has attachments: Yes
test.py (code from readme)
import datetime
import json
import eml_parser
def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial
with open('sample.eml', 'rb') as fhdl:
    raw_email = fhdl.read()
parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
print(json.dumps(parsed_eml, default=json_serial))
Running python3.5 test.py returns an error:
Traceback (most recent call last):
File "test.py", line 13, in <module>
parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 317, in decode_email_b
return parse_email(msg, include_raw_body, include_attachment_data, pconf)
File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 774, in parse_email
report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)
File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 196, in traverse_multipart
attachments.update(traverse_multipart(part, counter, include_attachment_data)) # type: ignore
File "/usr/local/lib/python3.5/dist-packages/eml_parser/eml_parser.py", line 234, in traverse_multipart
attachments[file_id]['mime_type'] = magic_none.buffer(data)
File "/usr/local/lib/python3.5/dist-packages/magic.py", line 152, in buffer
return str(r, 'utf-8')
TypeError: coercing to str: need a bytes-like object, NoneType found
Do you have an idea without reviewing the EML file?
Thank you!
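Without the file it is hard to be sure, but one way a None can reach the magic call is a part that is itself a multipart container: in the standard library, `get_payload(decode=True)` returns None for such parts. The sketch below only demonstrates that behaviour and a defensive guard; `payload_or_empty` is a hypothetical helper, and whether this is the actual cause here is an assumption.

```python
from email.message import EmailMessage


def payload_or_empty(part):
    """Return the decoded payload, or b'' when the part has none to decode."""
    data = part.get_payload(decode=True)
    return data if data is not None else b""


msg = EmailMessage()
msg.set_content("body text")
msg.add_attachment(b"abc", maintype="application",
                   subtype="octet-stream", filename="a.bin")

# The container itself is multipart, so decoding it yields None.
container_payload = msg.get_payload(decode=True)
guarded = payload_or_empty(msg)
attachment_bytes = payload_or_empty(next(msg.iter_attachments()))
```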
Hello. I've been trying to use this package, but I can't get it working because of an error in a dependency. I'm working with Python 3.6.1 on Windows 7. Below is the full traceback. Let me know if I can provide additional info.
Traceback (most recent call last):
File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\exception.py", line 35, in inner
response = get_response(request)
File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\base.py", line 128, in _get_response
response = self.process_exception_by_middleware(e, request)
File "C:\Program Files (x86)\Python\lib\site-packages\django\core\handlers\base.py", line 126, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "C:/Users/USER/Desktop/Antoine/GitHub/ChronoManager\Chronos\views\views.py", line 18, in index
import eml_parser
File "C:\Program Files (x86)\Python\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
from . import eml_parser
File "C:\Program Files (x86)\Python\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
import magic
File "C:\Program Files (x86)\Python\lib\site-packages\magic.py", line 23, in <module>
_libraries['magic'] = _init()
File "C:\Program Files (x86)\Python\lib\site-packages\magic.py", line 20, in _init
return ctypes.cdll.LoadLibrary(find_library('magic'))
File "C:\Program Files (x86)\Python\lib\ctypes\__init__.py", line 426, in LoadLibrary
return self._dlltype(name)
File "C:\Program Files (x86)\Python\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
TypeError: bad argument type for built-in operation
$ PYTHONPATH=.. python3 ../examples/recursively_extract_attachments.py
You are using python-magic, though this module requires file-magic. Disabling magic usage due to incompatibilities.
Parsing: sample_body_data.eml
Traceback (most recent call last):
File "../examples/recursively_extract_attachments.py", line 25, in <module>
for a_id, a in m['attachments'].items():
KeyError: 'attachments'
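A defensive lookup avoids the KeyError when a message has no attachments or when the key name differs. The sketch below assumes the parsed result is a plain dictionary and that the key is 'attachment' (singular, as in the tracebacks elsewhere on this page), holding either a mapping or a list; `attachments_of` is a hypothetical helper, not part of the example script.

```python
def attachments_of(parsed):
    """Return the attachments of a parsed result as a list of dicts."""
    found = parsed.get("attachment") or parsed.get("attachments") or []
    if isinstance(found, dict):
        # Assumed older layout: a mapping of id -> attachment.
        return list(found.values())
    return list(found)


no_attachments = attachments_of({"header": {}, "body": []})
as_list = attachments_of({"attachment": [{"filename": "a.txt"}]})
as_dict = attachments_of({"attachment": {"id1": {"filename": "b.txt"}}})
```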
import eml_parser
Traceback (most recent call last):
File "", line 1, in
File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self.system_import(name, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser_init.py", line 8, in
from . import eml_parser
File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\eml_parser\eml_parser.py", line 63, in
import magic
File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\helpers\pydev_pydev_bundle\pydev_import_hook.py", line 20, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 23, in
_libraries['magic'] = _init()
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\site-packages\magic.py", line 20, in init
return ctypes.cdll.LoadLibrary(find_library('magic'))
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes_init.py", line 426, in LoadLibrary
return self.dlltype(name)
File "C:\ProgramData\Anaconda3\envs\pycharm_testing\lib\ctypes_init.py", line 348, in init
self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None
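The final error says that `find_library('magic')` returned None, meaning the libmagic shared library could not be located on this system. A minimal sketch of checking for that case before loading is shown below; `load_shared_library` is a hypothetical helper, not code from magic.py.

```python
import ctypes
import ctypes.util


def load_shared_library(name):
    """Load a shared library by short name, or return None if it is absent."""
    path = ctypes.util.find_library(name)
    if path is None:
        # Nothing was found on the library search path.
        return None
    return ctypes.CDLL(path)


libmagic = load_shared_library("magic")
print("libmagic available:", libmagic is not None)
```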
Environment:
Python 2.7.10
virtualenv
eml-parser installed via pip
Steps to reproduce:
$ pip install eml-parser
Collecting eml-parser
Using cached eml_parser-1.7-py2.py3-none-any.whl
Requirement already satisfied: cchardet in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: python-dateutil in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: file-magic in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: typing in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from eml-parser)
Requirement already satisfied: six>=1.5 in /x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages (from python-dateutil->eml-parser)
Installing collected packages: eml-parser
Successfully installed eml-parser-1.7
(connect_to_cloud) my_pc:ccccc xxxxx$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
import eml_parser
Traceback (most recent call last):
File "", line 1, in
File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/init.py", line 8, in
from . import eml_parser
File "/x/y/virtuals/connect_to_cloud/lib/python2.7/site-packages/eml_parser/eml_parser.py", line 82
def get_raw_body_text(msg: email.message.Message) -> typing.List[typing.Tuple[typing.Any, typing.Any, typing.Any]]:
^
SyntaxError: invalid syntax
Hello,
I'm getting this error on some EML files with the newest version of the library.
With the old version 1.11.7, I don't have this problem.
python3
Python 3.8.7 (default, Dec 30 2020, 10:13:09)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eml_parser
>>> parsed_eml = eml_parser.eml_parser.decode_email("1.eml",
include_raw_body=True, include_attachment_data=False, parse_attachments=True)
>>> parsed_eml
{'body': [{'uri': ['....
python3
Python 3.8.7 (default, Dec 30 2020, 10:13:09)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import eml_parser
>>> parser = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=False, parse_attachments=True)
>>> parsed_eml = parser.decode_email("1.eml")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 979, in decode_email
return decode_email_b(eml_file=raw_email,
File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 1040, in decode_email_b
return ep.decode_email_bytes(eml_file)
File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 192, in decode_email_bytes
return self.parse_email()
File "/usr/local/lib/python3.8/site-packages/eml_parser/eml_parser.py", line 315, in parse_email
parsed_routing = eml_parser.routing.parserouting(received_line_flat)
File "/usr/local/lib/python3.8/site-packages/eml_parser/routing.py", line 164, in parserouting
out[item.strip()] = cleanline(reparseg.group(item.strip())) # type: ignore
AttributeError: 'NoneType' object has no attribute 'group'
I sent an EML with this problem to George's email address.
Regards
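The AttributeError indicates that a regular expression did not match and `.group()` was then called on None. The general guard looks like the sketch below; the pattern is a deliberately simplified, hypothetical one and is not the expression used in routing.py.

```python
import re

# Hypothetical, simplified pattern for a Received line.
RECEIVED_RE = re.compile(
    r"from\s+(?P<sender>\S+)\s+by\s+(?P<receiver>\S+)", re.IGNORECASE
)


def parse_received(line):
    """Return the named groups of a Received line, or {} if it does not match."""
    match = RECEIVED_RE.search(line)
    if match is None:
        return {}
    return {name: value.strip() for name, value in match.groupdict().items()}


ok = parse_received("from mail.example.org by mx.example.com with ESMTP")
unmatched = parse_received("some unexpected received line")
```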
Hi,
I'm trying to parse eml files with text attachments, example :
--===============0219148833355454106==
Content-Type: text/plain; Name="text.txt"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="text.txt"
TG9yZW0gaXBzdW0gZG9sb3Igc2l0IGFtZXQsIGNvbnNlY3RldHVyIGFkaXBpc2NpbmcgZWxpdCwg
c2VkIGRvIGVpdXNtb2QgdGVtcG9yIGluY2lkaWR1bnQgdXQgbGFib3JlIGV0IGRvbG9yZSBtYWdu
YSBhbGlxdWEuIFV0IGVuaW0gYWQgbWluaW0gdmVuaWFtLCBxdWlzIG5vc3RydWQgZXhlcmNpdGF0
aW9uIHVsbGFtY28gbGFib3JpcyBuaXNpIHV0IGFsaXF1aXAgZXggZWEgY29tbW9kbyBjb25zZXF1
YXQuIER1aXMgYXV0ZSBpcnVyZSBkb2xvciBpbiByZXByZWhlbmRlcml0IGluIHZvbHVwdGF0ZSB2
ZWxpdCBlc3NlIGNpbGx1bSBkb2xvcmUgZXUgZnVnaWF0IG51bGxhIHBhcmlhdHVyLiBFeGNlcHRl
dXIgc2ludCBvY2NhZWNhdCBjdXBpZGF0YXQgbm9uIHByb2lkZW50LCBzdW50IGluIGN1bHBhIHF1
aSBvZmZpY2lhIGRlc2VydW50IG1vbGxpdCBhbmltIGlkIGVzdCBsYWJvcnVtLg==
The problem seems to occur when parsing the message, at line 850 of eml_parser.py:
if ('content-disposition' in lower_keys and msg.get_content_disposition() != 'inline') \
        or msg.get_content_maintype() != 'text':
It always enters via the second condition (except when the attachment is text).
Thanks.
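For illustration, one possible rule is to let an explicit Content-Disposition of attachment win over the content type, so that text attachments are kept as attachments while inline text stays in the body. This is a sketch with the standard library, not the library's current logic; `classify_part` is a hypothetical name.

```python
from email.message import EmailMessage


def classify_part(part):
    """Classify a leaf MIME part as 'attachment' or 'body'."""
    if part.get_content_disposition() == "attachment":
        return "attachment"
    if part.get_content_maintype() == "text":
        # No disposition, or inline: treat text as part of the body.
        return "body"
    return "attachment"


msg = EmailMessage()
msg.set_content("Hello")
msg.add_attachment("Lorem ipsum dolor sit amet", filename="text.txt")

inline_html = EmailMessage()
inline_html.set_content("<p>hi</p>", subtype="html", disposition="inline")

kinds = [classify_part(part) for part in msg.iter_parts()]
inline_kind = classify_part(inline_html)
```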
Hello,
First of all I would like to say that I have been using eml_parser to extract data from emails for a while now and so far I am very happy with it!
One thing that I am struggling with is to distinguish the body of attached emails from a certain email. I have looked through the json multiple times but cannot figure out a good way.
To parse attachments and the full body I am using:
ep = eml_parser.EmlParser(include_raw_body=True, include_attachment_data=True, parse_attachments=True)
parsed_eml = ep.decode_email_bytes(raw_email)
This works great for emails that do not have attachments, or emails that have attachments but no attachments that are eml files themselves. If an email is attached to an email, I can correctly read that it is there as it is listed in
parsed_eml["attachment"]
But the body of this attached email is appended to the body of the email. So
parsed_eml["body"][0]["content"]
would give me the body of the main email in text format. If the main email also has html, I can retrieve the content using
parsed_eml["body"][1]["content"]
And the attached email's body in text format can then be retrieved using
parsed_eml["body"][2]["content"]
And if the attached email also has HTML then I can retrieve it with
parsed_eml["body"][3]["content"]
...etc
This is fine, but problems arise when the main email is only in text format. Because then the first body (body[0]) is the main email, and body[1] would already give me the attached email's body instead of the html of the main email. I currently do not see how I can distinguish the bodies from each other to determine to which email they belong.
I hope I was able to describe the problem in a clear way.
Thank you for your time
Patrick
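One way to keep bodies apart is to record, while walking the MIME tree, which email each text part belongs to. The sketch below uses only the standard library rather than eml_parser's output; `collect_bodies` and the numbering scheme (0 for the main email, 1, 2, ... for attached emails) are assumptions made for the example.

```python
import email
import itertools
from email import policy
from email.message import EmailMessage


def collect_bodies(msg, owner=0, _ids=None):
    """Return (owner, content_type, text) for each text body.

    owner 0 is the main email; attached emails get 1, 2, ... in walk order.
    """
    ids = _ids if _ids is not None else itertools.count(1)
    content_type = msg.get_content_type()
    bodies = []
    if content_type == "message/rfc822" and msg.is_multipart():
        # The payload is a list holding the attached email.
        for attached in msg.get_payload():
            bodies.extend(collect_bodies(attached, next(ids), ids))
    elif msg.is_multipart():
        for part in msg.get_payload():
            bodies.extend(collect_bodies(part, owner, ids))
    elif (msg.get_content_maintype() == "text"
          and msg.get_content_disposition() != "attachment"):
        bodies.append((owner, content_type, msg.get_content()))
    return bodies


# Main email in plain text only; attached email has text and HTML.
inner = EmailMessage()
inner.set_content("inner text")
inner.add_alternative("<p>inner html</p>", subtype="html")

outer = EmailMessage()
outer.set_content("outer text")
outer.add_attachment(inner)

parsed = email.message_from_bytes(outer.as_bytes(), policy=policy.default)
owners = [(owner, ctype) for owner, ctype, _ in collect_bodies(parsed)]
```

With the owner recorded next to each body, the text-only main email no longer shifts the meaning of the list positions.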
This issue refers to #4
I used your example code and one eml file. But for some reason, every time I run it, I get a different result. What could be the reason for this?
Hi. Thanks for this module!
By the way, could you show a simple example of the workflow?
Personally, I need to save all attachments from a bunch of .eml files.
Thank you in advance!
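Pending an example based on the module itself, the same task can be sketched with Python's standard `email` package. Nothing below uses eml_parser; `save_attachments`, the directory names and the sample message are all invented for illustration, and the fallback name `part-000` merely mirrors the naming seen in parse results on this page.

```python
import email
import tempfile
from email import policy
from email.message import EmailMessage
from pathlib import Path


def save_attachments(eml_path, out_dir):
    """Write every attachment of one .eml file into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(eml_path, "rb") as handle:
        msg = email.message_from_binary_file(handle, policy=policy.default)
    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the final path component to avoid writing outside out_dir.
        name = Path(part.get_filename() or f"part-{index:03d}").name
        data = part.get_payload(decode=True)
        if data is None:  # e.g. an attached email, which has no single payload
            continue
        target = out_dir / name
        target.write_bytes(data)
        saved.append(target)
    return saved


# Usage: create a sample .eml in a temporary directory, then extract from it.
workdir = Path(tempfile.mkdtemp())
sample = EmailMessage()
sample["Subject"] = "report"
sample.set_content("see attachment")
sample.add_attachment(b"abc", maintype="application",
                      subtype="octet-stream", filename="a.bin")
(workdir / "sample.eml").write_bytes(sample.as_bytes())

saved = [path
         for eml in sorted(workdir.glob("*.eml"))
         for path in save_attachments(eml, workdir / "attachments")]
```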
When running the example usage, Python returned this error message:
AttributeError: module 'eml_parser' has no attribute 'EmlParser'
Trying to install on OSX Mojave gives me this error:
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1
I'm using eml_parser with TheHive project and all analysis fails with
Unexpected Error: list index out of range
I'm not sure where else to find any other logging or info to troubleshoot this.
Hello,
I don't think this is an issue on your side, but you might be able to advise/make the code more generic on different platforms.
When trying to execute one simple example:
import datetime
import json
import eml_parser
def json_serial(obj):
    if isinstance(obj, datetime.datetime):
        serial = obj.isoformat()
        return serial
with open('sample-message.eml', 'rb') as fhdl:
    raw_email = fhdl.read()
parsed_eml = eml_parser.eml_parser.decode_email_b(raw_email)
print(json.dumps(parsed_eml, default=json_serial))
it fails saying
AttributeError: dlsym(RTLD_DEFAULT, magic_open): symbol not found
I am using OSX, python3.6
Could you please help on this?
Thanks in advance.
Hi,
My context requires me to use python-magic instead of file-magic.
When using eml_parser, I got the error AttributeError: module 'magic' has no attribute 'open'.
I noticed that in eml_parser.py, instead of having:
try:
    import magic
except ImportError:
    magic = None
    magic_mime = None
    magic_none = None
else:
    # MAGIC_MIME_TYPE gives the real mime-type
    magic_mime = magic.open(magic.MAGIC_MIME_TYPE)
    magic_mime.load()
    # MAGIC_NONE gives the meta-information on the analysed file
    magic_none = magic.open(magic.MAGIC_NONE)
    magic_none.load()
If I put:
import magic
magic = None
magic_mime = None
magic_none = None
the parsing works, but I don't get the mime info, which is fine for my use case.
Do you know a way to handle the wrong magic module (python-magic), in other words, to parse the eml even with python-magic instead of file-magic?
Thanks in advance.
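One possible approach is to inspect which attributes the imported module exposes and disable magic usage when it is not the expected flavour. The sketch below assumes that python-magic is recognisable by its `from_buffer` function and file-magic by `open`; `detect_magic_flavour` is a hypothetical helper, and the stand-in objects only imitate the two modules.

```python
import types


def detect_magic_flavour(module):
    """Guess which 'magic' package a module object comes from."""
    if module is None:
        return "none"
    if hasattr(module, "from_buffer"):
        # Assumption: only python-magic exposes from_buffer().
        return "python-magic"
    if hasattr(module, "open") and hasattr(module, "MAGIC_MIME_TYPE"):
        return "file-magic"
    return "unknown"


# Stand-ins that imitate the public surface of the two packages.
fake_file_magic = types.SimpleNamespace(open=lambda flags: None, MAGIC_MIME_TYPE=16)
fake_python_magic = types.SimpleNamespace(from_buffer=lambda data, mime=False: "")

flavours = [
    detect_magic_flavour(None),
    detect_magic_flavour(fake_file_magic),
    detect_magic_flavour(fake_python_magic),
]
```

A caller could then set magic_mime and magic_none to None for anything other than "file-magic", which matches the manual workaround described above.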
Is there a reason why, for example, "from" or "message-id" are lists instead of just strings?
A simple import eml_parser fails with version 1.6:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-dbfbe77cbceb> in <module>()
----> 1 import eml_parser
~/.virtualenvs/XXXXX/lib/python3.5/site-packages/eml_parser/__init__.py in <module>()
7
8 from . import eml_parser
----> 9 from . import version
10
11 __version__ = '1.6'
ImportError: cannot import name 'version'
v1.5 works (it does not contain the problematic import statement).
Probably a missing file. Since the code for v1.6 is not on github, I can't provide a fix.
Hi, the parser always shows me that the body of every mail is empty. Everything else, like parsing headers or attachments, works.
Edit: the email body appears to be listed as two attachments (one plain-text file and one HTML file).
Currently the code does not handle the bug mentioned here: https://bugs.python.org/issue30681
The following part of the code breaks since we are not catching TypeError and ValueError.
753: try:
754: raw_body.append((encoding, raw_body_str, msg.items()))
755: except AttributeError:
https://github.com/GOVCERT-LU/eml_parser/blob/master/eml_parser/eml_parser.py#L753-L755
Sample eml that can raise this bug
From: <[email protected]>
Orig-Date: Wed Jul 2020 23:11:43 +0100
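The kind of failure involved can be reproduced with the standard library's date parser, which raises on unparseable input (TypeError or ValueError, depending on the Python version). The sketch below shows the broader except clause in isolation; `safe_parse_date` is a hypothetical helper and not the code at the linked lines.

```python
from email.utils import parsedate_to_datetime


def safe_parse_date(value):
    """Parse an RFC 2822 style date, returning None when it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


good = safe_parse_date("Wed, 01 Jul 2020 23:11:43 +0100")
bad = safe_parse_date("not a date")
```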
I get empty To, From, Subject and Body of an email when parsed through eml_parser, but the eml has all these when opened in a mail client.
8E662D2D05B0.zip
Hello,
This is another question about the parser.
I am trying to parse the famous Enron dataset EML files:
https://archive.org/download/edrm.enron.email.data.set.v2.xml
(full folder: https://archive.org/download/edrm.enron.email.data.set.v2.xml/edrm-enron-v2_harris-s_xml.zip)
Unfortunately, there seem to be messages that are "not fully/correctly parsed".
For instance the attached file contains all the needed data (Subject, From address, etc) but these do not appear in the parsing results.
3.287079.LTUWB1UEUURY0AMLCGNSEUNK52PXR2CPB.eml.zip
Again, this is just a question to check if this is due to a wrong format of the EML file, or to something I am using incorrectly in the parser.
In advance, thank you for your answer.
Best regards.
Hello,
I'm using IDLE with python 3.6.1 on Windows 10
I installed the eml_parser library, then I tried to execute the example code from your page.
I get the following error:
Traceback (most recent call last):
File "C:\Users\980\Desktop\temp\Thunderbird export\x-spam data extractor.py", line 17, in
ep = eml_parser.EmlParser()
AttributeError: module 'eml_parser' has no attribute 'EmlParser'
Can you please help me?
Thanks
Bruno
I've tried to parse an email with the following particular part-message:
--_----------=_1235550737204165
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset="utf-8"
# HTML CODE HERE
This part is recognized as an attachment, but it isn't one, because it's the actual text of the email.
I found a solution by editing lines 116-117 in eml_parser.py as follows:
Original
if ('content-disposition' not in msg and msg.get_content_maintype() == 'text') or (
filename.endswith('.html') or filename.endswith('.htm')):
Modified
if (msg.is_attachment() == False and msg.get_content_maintype() == 'text') or (
filename.endswith('.html') or filename.end
Could this solution work for you? Or does this break something I've not thought about?
Hello there,
Using the latest version of the package, I get an error on a particular EML with the following subject:
Subject: [MANAGER COMME UN COACH 1 - Presentiel] Une évaluation à
=?ISO-8859-1?Q ?compl=E9ter=20?=sur votre plateforme de formation
Here is the full stack trace:
<ipython-input-28-eb0e5e7f17ee> in extract_email_to_txt_file(eml_file, destination)
29
30 # convert eml raw file to an iterable object
---> 31 parsed_eml = ep.decode_email(eml_file)
32
33 # write subject and content of the email to a txt file
/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in decode_email(self, eml_file, ignore_bad_start)
151 raw_email = fp.read()
152
--> 153 return self.decode_email_bytes(raw_email, ignore_bad_start=ignore_bad_start)
154
155 def decode_email_bytes(self, eml_file: bytes, ignore_bad_start: bool = False) -> dict:
/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in decode_email_bytes(self, eml_file, ignore_bad_start)
190 self.msg = email.message_from_bytes(_eml_file, policy=self.policy)
191
--> 192 return self.parse_email()
193
194 def parse_email(self) -> dict:
/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/eml_parser/eml_parser.py in parse_email(self)
223
224 # parse and decode subject
--> 225 subject = self.msg.get('subject', '')
226 headers_struc['subject'] = eml_parser.decode.decode_field(subject)
227
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/message.py in get(self, name, failobj)
469 for k, v in self._headers:
470 if k.lower() == name:
--> 471 return self.policy.header_fetch_parse(k, v)
472 return failobj
473
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/policy.py in header_fetch_parse(self, name, value)
161 # We can't use splitlines here because it splits on more than \r and \n.
162 value = ''.join(linesep_splitter.split(value))
--> 163 return self.header_factory(name, value)
164
165 def fold(self, name, value):
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in __call__(self, name, value)
587
588 """
--> 589 return self[name](name, value)
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in __new__(cls, name, value)
195 def __new__(cls, name, value):
196 kwds = {'defects': []}
--> 197 cls.parse(value, kwds)
198 if utils._has_surrogates(kwds['decoded']):
199 kwds['decoded'] = utils._sanitize(kwds['decoded'])
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/headerregistry.py in parse(cls, value, kwds)
270 @classmethod
271 def parse(cls, value, kwds):
--> 272 kwds['parse_tree'] = cls.value_parser(value)
273 kwds['decoded'] = str(kwds['parse_tree'])
274
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_header_value_parser.py in get_unstructured(value)
1100 if value.startswith('=?'):
1101 try:
-> 1102 token, value = get_encoded_word(value)
1103 except errors.HeaderParseError:
1104 # XXX: Need to figure out how to register defects when
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_header_value_parser.py in get_encoded_word(value)
1046 value = ''.join(remainder)
1047 try:
-> 1048 text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
1049 except ValueError:
1050 raise errors.HeaderParseError(
/usr/local/Caskroom/miniconda/base/lib/python3.7/email/_encoded_words.py in decode(ew)
176 # Recover the original bytes and do CTE decoding.
177 bstring = cte_string.encode('ascii', 'surrogateescape')
--> 178 bstring, defects = _cte_decoders[cte](bstring)
179 # Turn the CTE decoded bytes into unicode.
180 try:
KeyError: 'q\t'
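The traceback above shows the exception escaping from the standard library's header parsing when `msg.get('subject', '')` is called. One defensive pattern, sketched here as an assumption and not as what eml_parser actually does, is to fall back to the raw stored header value when parsing raises. The helper name `safe_get_header` is hypothetical.

```python
import email
from email import policy


def safe_get_header(msg, name, default=""):
    """Return the parsed header, or its raw stored value if parsing raises."""
    try:
        return msg.get(name, default)
    except Exception:
        # For example, the KeyError raised while decoding a malformed encoded word.
        for key, raw_value in msg.raw_items():
            if key.lower() == name.lower():
                return raw_value
        return default


# A well-formed message goes through the normal path.
msg = email.message_from_bytes(b"Subject: hello\n\nbody", policy=policy.default)
subject = safe_get_header(msg, "subject")
```

The fallback returns the undecoded header text, so callers would still need to treat it as untrusted input.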
I think the email regex should be tuned ...
[
'6f@k', '6f@k', '%@3', '%@3', 'i@0', '%@3', 'i@0', 'i@0', '/8r@gbj', 'i@0', '/8r@gbj', '/8r@gbj', '8@c', '8@c', '8@c', 'n@jg', 'xf9}8q@f', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'n@jg', 'xf9}8q@f', 'bt@b2x', 'bt@b2x', '5pa@ao', '5pa@ao', 'b2h$xa@vljqq', '8@d', 'b2h$xa@vljqq', '8@d', 'b2h$xa@vljqq', '8@d', '/8@d', '/8@d', 'm8y@v', '/8@d', 'm8y@v', '/8@d', 'm8y@v', 'm8y@v', 'r@gez', 'r@gez', 'zx%f@p', 'zx%j@s', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'zx%f@p', 'zx%j@s', 'kc@pl', 'kc@pl', 'ke@n', 'ke@n', 'ke@n', 'w@o', 'w@o', 'w@o', 'w@o', 'w@o', 'h.@oa', 'h.@oa', '5.@q', '8@d', '5.@q', '8@d', '5.@q', '8@d', '5.@q', '8@d', '^/#@tc', '^/#@tc', '/65@kt', '/65@kt', '5@kt', '9@o', 'g@x', 'g@x', '0#@p1', 'p@x', 'xt@5w', '0#@p1', 'p@x', 'xt@5w', 'xt@5w', 'i@sr', 'i@sr', '6z@n', '6z@n', '.@-ks', '.@-ks', "$'@cjd", 'ng@mxjj', "$'@cjd", 'ng@mxjj', "$'@cjd", 'ng@mxjj', '01@djll', '60@a7cf', '60@a7cf', '60@a7cf', '60@a7cf', '60@a7cf', '}@9j', '}@9j', '}@9j', '}@9j', '}@9j', 'l@el', 'l@el', 'l@el', 'l@el', 'p@y', 'p@y', 'p@y', 'p@y', 'ou@m0', 'y@64', '}@3', 'y@64', '}@3', 'y@64', '}@3', '}@3', 'nhyfhu@b', '&@he', 'nhyfhu@b', '&@he', 'nhyfhu@b', '&@he', '&@he', '&@he', '.t*@oh', '.t*@oh', '!}qv@k', '!}qv@k', 'ljd@z', 'ljd@z', 'ljd@z', 'ljd@z', 'c=@i', 'c=@i', 'm@qcoyub', 'c=@i', 'm@qcoyub', 'm@qcoyub', 'm@qcoyub', '&@rm', '&@rm', 'o$@t', 'o$@t', 'o$@t', 'o$@t', 'o$@t', '=e%@gyw', 'cq@u', 'cq@u', '*_ioj5@2', '*_ioj5@2', '*_ioj5@2', '#tid*{4|@8', '#tid*{4|@8', 'ny@f', 'ny@f', 'ny@f', 'fn8b@i', '0}.`@ne.9u', '0}.`@ne.9u', 'qq}@i', ',@u6xr', 'qq}@i', ',@u6xr', 'qq}@i', ',@u6xr', 'a@zf', 'a@zf', '3v@e', '}s_@a', 'a@zf', '3v@e', '}s_@a', '3v@e', '}s_@a', '3v@e', '}s_@a', 'gf@b', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'gf@b', 'ud@qt', 'nis@eond', '-f@ynw', 'nis@eond', '-f@ynw', 'nis@eond', '-f@ynw',](url)
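As a rough illustration of what "tuning" could mean, here is a minimal sketch of a stricter filter applied to candidates like the ones listed above. The pattern is hypothetical and is not the regex eml_parser ships: it assumes a valid domain has at least one dot and ends in an alphabetic TLD of two or more letters.

```python
import re

# Hypothetical stricter pattern (an assumption, not eml_parser's own regex).
STRICT_EMAIL = re.compile(
    r"^[a-z0-9._%+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}$",
    re.IGNORECASE,
)


def plausible_emails(candidates):
    """Keep only candidates that look like real addresses under the stricter pattern."""
    return [c for c in candidates if STRICT_EMAIL.match(c)]


# A few of the false positives from the list above, plus one made-up valid address.
samples = ["6f@k", "%@3", "xf9}8q@f", "0}.`@ne.9u", "user@example.com"]
kept = plausible_emails(samples)
```

A filter this strict would also drop addresses on dotless internal hosts, so whether it fits depends on the mail being analysed.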
The include_attachment_data flag should be implemented like include_raw_body, but the only reference to the flag in eml_parser.eml_parser.parse_email is in line 875:
report_struc['attachment'] = traverse_multipart(msg, 0, include_attachment_data)
This doesn't stop the processing of the attachment when include_attachment_data=False, which affects processing time (when you don't want the attachment) and can throw a lot of binascii errors.
I suggest an if include_attachment_data: around the block from line 874-890 in eml_parser.eml_parser.
Thanks!
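The suggestion amounts to skipping attachment traversal entirely when the flag is off. A minimal sketch of that idea, using only the standard library: `traverse_parts` and `build_report` are hypothetical stand-ins, not eml_parser's real `traverse_multipart` or `parse_email`.

```python
import email


def traverse_parts(msg):
    """Hypothetical stand-in for attachment traversal; decodes every named part."""
    return [
        {"filename": part.get_filename(), "raw": part.get_payload(decode=True)}
        for part in msg.walk()
        if part.get_filename()
    ]


def build_report(msg, include_attachment_data=False):
    report = {"header": dict(msg.items())}
    # Only walk and decode attachments when the caller asked for them.
    if include_attachment_data:
        report["attachment"] = traverse_parts(msg)
    return report


raw = (
    b"Subject: demo\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="b"\n'
    b"\n"
    b"--b\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"hi\n"
    b"--b\n"
    b'Content-Type: application/octet-stream; name="a.bin"\n'
    b'Content-Disposition: attachment; filename="a.bin"\n'
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"aGVsbG8=\n"
    b"--b--\n"
)
msg = email.message_from_bytes(raw)
without_attachments = build_report(msg)
with_attachments = build_report(msg, include_attachment_data=True)
```

With the guard in place, no base64 decoding happens at all when the flag is off, which is where the binascii errors would otherwise come from.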
Hi,
I am using Python 3, but I am unable to import eml_parser.
Error:
Traceback (most recent call last):
File "", line 1, in <module>
File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/eml_parser/__init__.py", line 8, in <module>
from . import eml_parser
File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/eml_parser/eml_parser.py", line 59, in <module>
import magic
File "/Users/rameshchowdeshetty/venv/lib/python3.4/site-packages/magic.py", line 61, in <module>
_open = _libraries['magic'].magic_open
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ctypes/__init__.py", line 364, in __getattr__
func = self.__getitem__(name)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/ctypes/__init__.py", line 369, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: dlsym(RTLD_DEFAULT, magic_open): symbol not found
Hi @sim0nx,
First of all, thanks for the module; it has made our procedures much easier. While testing it, I noticed that it sometimes does not work as expected. I am using the following:
m = eml_parser.decode_email(k, include_raw_body=True)
to extract the data from the eml file and get the URLs. However, with the same code and the same eml file, about half of the time the body text is extracted and all the data is processed correctly, and the other half of the time it is not. The header is always processed, though.
As no error is raised, I have no idea what the source of the problem is. Have you seen this behavior before, or do you have any idea what the problem could be?
Thanks in advance :)
In test_headeremail2list_2, it mentions Python bug 27257. However, bug 27257 appears to be related to empty groups in the header, not to issues with the obsolete period. With Python 3.7, I do not have any issues with the decoded value, unless eml_parser should include address groups.
eml_parser/tests/test_emlparser.py
Line 131 in f98980a
From the bug:
To: unlisted-recipients: ;,
""@pop.kundenserver.de (no To-header on input)
The current output below appears to be the expected output.
'to': ['@pop.kundenserver.de']
From the RFC:
To: A Group:Ed Jones [email protected],[email protected],John [email protected];
Again, the current output below appears to be the expected output.
'to': ['[email protected]', '[email protected]', '[email protected]']
I have not found a related issue in the Python bug tracker, but perhaps something like the following in _header_value_parser.py would be appropriate to prevent the exception:
Have you seen this error before?
$ python eml_parser.py
Traceback (most recent call last):
File "eml_parser.py", line 515, in <module>
main()
File "eml_parser.py", line 507, in main
m = decode_email(msgfile)
File "eml_parser.py", line 318, in decode_email
fp = open(eml_file)
TypeError: coercing to Unicode: need string or buffer, NoneType found
In the simple URL regex, many URLs that don't include the scheme in the href or src are skipped.
<a target="_blank" href="www.wikipedia.org">
Wikipedia (opens in new tab)
</a>
Should URLs like this be extracted?
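If such URLs should be extracted, one option is to read the `href` and `src` attribute values directly instead of relying on a scheme-anchored regex. This is only a sketch using the standard library's `html.parser`; the class and function names are hypothetical and this is not how eml_parser currently extracts URLs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect raw href/src attribute values, with or without a scheme."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


links = extract_links(
    '<a target="_blank" href="www.wikipedia.org">Wikipedia (opens in new tab)</a>'
)
```

This also picks up relative paths and anchors, so some filtering would still be needed afterwards.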
I received a mail where the header looked like below:
...
Date: Fri, 19 Feb 2021 19:36:50 +0000
Message-ID:
<[email protected]>MIME-Version: 1.0
Accept-Language: en-US, en-GB
...
After parsing it:
ep = eml_parser.EmlParser(include_raw_body=True)
parsed_eml = ep.decode_email_bytes(raw_email)
The value of parsed_eml["header"]["header"]["message-id"][0]
looked like this:
<[email protected]>MIME-Version: 1.0
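For what it's worth, this can be reproduced with the standard library alone, assuming the raw message really has no line break between the Message-ID value and "MIME-Version". In the sketch below the message and the address `id@example.com` are made up, since the original address is not shown.

```python
import email

# Made-up message where the line break after the Message-ID value is missing.
raw = (
    "Date: Fri, 19 Feb 2021 19:36:50 +0000\n"
    "Message-ID:\n"
    " <id@example.com>MIME-Version: 1.0\n"
    "Accept-Language: en-US, en-GB\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)
message_id = msg["Message-ID"]      # contains the "MIME-Version: 1.0" text
mime_version = msg["MIME-Version"]  # no separate MIME-Version header exists
```

Under that assumption the parser is treating the folded line as one header value, so any clean-up would have to happen on the extracted value.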
Hello.
Thank you for the library.
I ran into a problem while using it.
ep = eml_parser.EmlParser(parse_attachments=False)
parsed_eml = ep.decode_email_bytes(raw_email)
I am trying to parse an eml file like this, but for one specific eml file the call never returns.
If I forcibly stop it, the following traceback is shown:
File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 192, in decode_email_bytes
return self.parse_email()
File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 431, in parse_email
list_observed_urls = self.get_uri_ondata(body_slice)
File "/usr/local/lib/python3.7/site-packages/eml_parser/eml_parser.py", line 641, in get_uri_ondata
for match in eml_parser.regex.url_regex_simple.findall(body):
I'm not sure why this is happening.
Is there a workaround?
The sample is attached in txt format.
Hello. This is my first time making an issue, so be easy on me. I have tried importing eml_parser on 3.6.1, 3.6.2, and 3.6.3 and when I import eml_parser I get an error. I posted the error below. Any help would be greatly appreciated.
Traceback (most recent call last):
File "getError.py", line 5, in <module>
import eml_parser
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\eml_parser\__init__.py", line 8, in <module>
from . import eml_parser
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\eml_parser\eml_parser.py", line 63, in <module>
import magic
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\magic.py", line 23, in <module>
_libraries['magic'] = _init()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\magic.py", line 20, in _init
return ctypes.cdll.LoadLibrary(find_library('magic'))
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
return self._dlltype(name)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None
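The last frames show `find_library('magic')` returning None, which is then passed to `LoadLibrary`. A quick way to check whether the libmagic shared library is visible to Python at all, sketched with the standard library only (the helper name `libmagic_path` is made up):

```python
from ctypes.util import find_library


def libmagic_path():
    """Return the name or path ctypes resolves for libmagic, or None if not found."""
    return find_library("magic")


path = libmagic_path()
if path is None:
    print("libmagic was not found; the 'magic' module cannot load it.")
else:
    print(f"libmagic found: {path}")
```

If this prints that the library was not found, the problem lies with the libmagic installation on the system and not with eml_parser itself.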
Hello,
Did you forget to extend this list with the result of the previous iteration?
eml_parser/eml_parser/eml_parser.py
Line 431 in a72d5c2
Regards,
Problem:
file-magic from requirements.txt results in AttributeError: module 'magic' has no attribute 'magic_open'.
Solution:
Use python-magic instead of file-magic.
I am hitting what appears to be an infinite loop when trying to parse the following eml file:
eml_sample.txt
import eml_parser

# Read the sample as raw bytes; the context manager closes the file.
with open("eml_sample.txt", "rb") as f:
    data = f.read()

result = eml_parser.eml_parser.decode_email_b(data, include_raw_body=True)
The problem is with the following line. The email body contains a URL that is immediately followed by multiple occurrences of \xef\xbf\xbd (the UTF-8 encoding of the U+FFFD replacement character).
xcxcxcxcxcxcxcxcxcxc
http://xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxx.com������������������������������������������������
If I remove those, the parsing is just fine. It is also fine if I remove only some of them, but the parsing takes longer. So I assume that it is not an infinite loop, but rather that the input throws off the regex and causes excessive backtracking.
Is it a problem with the regex, or with the way I am reading that file (not decoding Unicode characters)?
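Since removing those bytes makes parsing finish, a blunt workaround is to strip them before handing the data to the parser. This is only a sketch: the helper name is hypothetical, it alters the message content, and it does not fix the regex itself.

```python
# UTF-8 encoding of U+FFFD, the replacement character.
REPLACEMENT_CHAR_UTF8 = "\ufffd".encode("utf-8")


def strip_replacement_chars(data: bytes) -> bytes:
    """Remove every UTF-8 encoded U+FFFD sequence from the raw message bytes."""
    return data.replace(REPLACEMENT_CHAR_UTF8, b"")


# Made-up sample that mimics a URL followed by a run of replacement characters.
sample = b"http://example.com" + REPLACEMENT_CHAR_UTF8 * 48 + b"\r\n"
cleaned = strip_replacement_chars(sample)
```

The cleaned bytes can then be passed to the same decode call as before.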