email.message_from_bytes heavy memory use #115512

Open
cnicodeme opened this issue Feb 15, 2024 · 4 comments
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

cnicodeme commented Feb 15, 2024

Bug report

Bug description:

Hi!

Investigating some memory issues on my lambda, I discovered some odd usage coming from email.message_from_bytes.

When parsing an .eml that contains almost no text but a 30 MB attachment, memory usage jumps by more than 238 MB: nine times the size of the file!

Here is my test:

from email import message_from_bytes
import resource

# ru_maxrss is the peak resident set size; on Linux it is reported in kilobytes
print('Init ram: {}kb'.format(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))

data = None
with open('file.eml', 'rb') as f:
    data = f.read()

print('File loaded: {}kb'.format(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))
print('    (file size: {}kb)'.format(len(data) / 1024))

mail = message_from_bytes(data)

print('After message_from_bytes: {}kb'.format(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss))

And the output:

Init ram: 7168kb
File loaded: 37120kb
    (file size: 29900kb)
After message_from_bytes: 279296kb

The EML in question contains an attachment (a CSV file) encoded in Base64. I suspect that BytesParser is converting that content to binary data, but I find it surprising that doing so takes nine times the file size.
Wouldn't it be faster and more efficient to decode only on access, with a way to not decode at all (getting the raw Base64)?

(Maybe there is already and I missed it?)

I tested this in:

  • Python 3.10.13
  • Python 3.12.1

And got the same results.

CPython versions tested on:

3.10

Operating systems tested on:

Linux

Linked PRs

cnicodeme added the type-bug label Feb 15, 2024
gaogaotiantian (Member) commented

I was wondering how much of that memory usage is actually used to keep the data. I can't tell because I don't have the eml file, but you could try a memory profiler; there might be significant overhead for "converting" the data (not storing it). It would also be interesting to see whether gc.collect() helps in this case, and whether a second parse of the same file increases memory usage by the same amount.
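
For example, something along these lines (a sketch, assuming file.eml is the problematic message; tracemalloc separates retained allocations from the transient peak):

import gc
import tracemalloc
from email import message_from_bytes

with open('file.eml', 'rb') as f:
    data = f.read()

tracemalloc.start()
mail = message_from_bytes(data)
current, peak = tracemalloc.get_traced_memory()
print('first parse: current={}kb, peak={}kb'.format(current // 1024, peak // 1024))

del mail
gc.collect()
# If current drops back down here, the peak was mostly conversion overhead
print('after gc.collect(): current={}kb'.format(tracemalloc.get_traced_memory()[0] // 1024))

mail = message_from_bytes(data)
current, peak = tracemalloc.get_traced_memory()
print('second parse: current={}kb, peak={}kb'.format(current // 1024, peak // 1024))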

cnicodeme (Author) commented

Sorry for my late reply here.

I'm not comfortable using a memory profiler or digging into gc.collect(), but I can say that a basic email containing a PDF attachment will do the job.
Ideally, a fairly heavy PDF, around 20 MB, will clearly show the jump in memory usage.

I hope this helps :)
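
For anyone without a suitable .eml at hand, something like this (a sketch using only stdlib email APIs; random bytes stand in for the PDF) should generate a reproducer:

import os
from email.message import EmailMessage

msg = EmailMessage()
msg['From'] = 'sender@example.com'
msg['To'] = 'recipient@example.com'
msg['Subject'] = 'large attachment test'
msg.set_content('almost no body text')
# 20 MB of random bytes standing in for the PDF; Base64-encoded on serialization
msg.add_attachment(os.urandom(20 * 1024 * 1024),
                   maintype='application', subtype='pdf',
                   filename='big.pdf')

with open('file.eml', 'wb') as f:
    f.write(msg.as_bytes())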

JAJames commented Apr 17, 2025

> I was wondering how much of that memory usage is actually used to keep the data. I can't tell because I don't have the eml file, but you could try a memory profiler; there might be significant overhead for "converting" the data (not storing it). It would also be interesting to see whether gc.collect() helps in this case, and whether a second parse of the same file increases memory usage by the same amount.

@gaogaotiantian I've been doing some analysis and fixes for this, and for a vaguely similar memory usage issue in _header_value_parser. The analysis has been driven largely by a combination of memray and tracemalloc. In the rest of this post, "overhead ratio" refers to the peak byte size of allocated memory during the function call (i.e. tracemalloc.get_traced_memory()[1]) divided by the byte size of the buffer passed to the function (i.e. the bytes passed to message_from_bytes or get_address_list).
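
As a rough sketch of that measurement (not the exact harness I used):

import tracemalloc
from email import message_from_bytes

def overhead_ratio(raw):
    # Peak traced allocation during the parse, divided by the input size
    tracemalloc.start()
    try:
        message_from_bytes(raw)
        return tracemalloc.get_traced_memory()[1] / len(raw)
    finally:
        tracemalloc.stop()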

message_from_bytes

The main memory usage issue with message_from_bytes comes from an excessive number of string copies, which is especially noticeable with long lines of text, because feedparser currently constructs one line at a time and passes it to the rest of the feedparser. This is especially rough with the StringIO usage, which has to allocate four times the size of the actual underlying byte string, since StringIO is apparently backed by a UTF-32 buffer. As the email buffer gets larger, say 10 MiB, the overhead ratio approaches a little over 9.0. This means that if you have an email in memory with a 1 GiB attachment, for example, you can expect an additional 9 GiB to be allocated when calling message_from_bytes. I have more exact stats on my work laptop.
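
The StringIO factor is easy to check in isolation (a sketch; the text itself is allocated before tracing starts, so only the StringIO buffer is counted):

import io
import tracemalloc

text = 'a' * (10 * 1024 * 1024)  # 10 MiB of ASCII text
tracemalloc.start()
buf = io.StringIO()
buf.write(text)
# Roughly 4.0 on CPython, since the internal buffer holds 4 bytes per character
print(tracemalloc.get_traced_memory()[0] / len(text))
tracemalloc.stop()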

I have some changes around this that basically amount to removing the StringIO usage, as well as enabling the parser to avoid line-by-line parsing when no boundaries are present or possibly present. This is extremely helpful for emails with large attachments or large body parts. For a 10 MiB email, this was measured to reduce the memory overhead ratio down to about 1.01, or a roughly 80 MiB reduction, if memory (ha) serves correctly. I think this is about as good as can reasonably be achieved without rewriting the feedparser entirely, or moving some of the code into C. It's a slightly large change, but nothing crazy.

I hope to have a PR open for those changes sometime in the next few days or weeks, since I already have approval from work to submit them to CPython. I've tested the changes against about a million emails and observed no differences in behavior, other than both memory usage and runtime performance improving. I should have more exact stats when I open the PR.

_header_value_parser

get_address_list has a dramatically larger overhead ratio issue, which comes from an extremely large number of string copies. It seems the original author of _header_value_parser believed that string slicing would not result in copies (maybe it didn't at the time?), or that the buffer being copied would be very small. In practice, however, this results in a very large number of copies of the underlying buffer being held in memory to perform the parse. For example, an email with 6000 randomly generated addresses in an address header results in an overhead ratio of around 3155x.
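
For context: str slicing in CPython always copies (there are no string views), so every value[pos:] tail taken while parsing is a full copy of the remainder. For example:

import tracemalloc

value = 'a' * (1024 * 1024)  # stand-in for a large header value
tracemalloc.start()
tails = [value[i:] for i in range(0, len(value), 64 * 1024)]
# Each tail is an independent copy; peak is roughly 8.5x the original string
print(tracemalloc.get_traced_memory()[1] / len(value))
tracemalloc.stop()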

I currently have some changes that bring this down to the 30s, but I hope for some further reductions before opening a pull request. Unfortunately, this has been a substantially more involved refactoring effort, but I'm happy to say the changes I have so far require no test changes and are localized to just _header_value_parser.py.

The main change is to avoid slicing the value in get_address_list and all of the other methods in its call tree. Instead, I pass around the entire value unmodified, adding optional start and end parameters to each method, similar to other parsing-oriented functions in Python. Since these methods all return a slice as part of their return tuple, I've had to add _get_<whatever>_spans() variants of each of them to avoid that slice; the existing methods are now thin wrappers implemented in terms of those variants (sketched below). This greatly reduces the number of string copies created and held in memory, and is responsible for the bulk of the memory usage reduction: I think this change alone brought the overhead ratio down to something like 150x.

There are also other improvements, such as replacing duplicate copies of the same exact token with references to the previous instance of that token, specific to parsing address lists. It'll also go through torture testing similar to what the message_from_bytes changes went through before I open a PR for it.
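
As a minimal illustration of that span-based pattern (names hypothetical; this is not the actual _header_value_parser code):

def _get_token_span(value, start=0, end=None):
    # Return (token_start, token_end) indices into `value` without slicing it
    if end is None:
        end = len(value)
    pos = start
    while pos < end and not value[pos].isspace():
        pos += 1
    return start, pos

def get_token(value):
    # Existing-style API, now a thin wrapper over the span variant;
    # the only slice happens here, at the public boundary
    start, stop = _get_token_span(value)
    return value[start:stop], value[stop:]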

I had originally attempted to merely split email addresses using a regular expression and pass those to get_address and similar methods, to produce a more minimal change. However, after running into issues with some logic around semicolons immediately preceding a comma, I abandoned that approach. If a more minimal change is required (say, for backporting), it might still be feasible, although I think adding support for start and end parameters has so far proven an overall good change to have.

If I should open a new issue for the email._header_value_parser changes, please let me know. Otherwise, I might just link it to this issue, albeit still as a separate PR.

Thanks,
Jessica

gaogaotiantian (Member) commented

Hi @JAJames, thanks for the detailed analysis. I'm not an expert in the email lib, but I do know it doesn't have a dedicated owner and is essentially in maintenance mode. There are a lot of things in that library we're not thrilled about, but we also have concerns about refactoring it, as the code is quite complicated.

Here's my suggestion: don't try to get the best possible outcome. Make something that is easy to prove equivalent. 10x sounds bad, but a 1 GB attachment is just super rare. If you do need to deal with super large attachments, you should use a more dedicated (and actively updated) library.

It's not easy to find core devs to review code changes in the email library, and it will be even harder for a huge refactor. A simple change that brings 10x down to 5x is much more likely to be merged than a complicated and delicate change that brings 10x down to 1.01x.
