email.message_from_bytes heavy memory use #115512
Comments
I was wondering how much of that memory usage is actually used to keep the data. I won't be able to tell because I don't have the file.
Sorry for my late reply here. I'm not at ease using a memory profiler or looking at gc.collect(), but I can say that a basic email containing a PDF file will do the job. I hope this helps :)
@gaogaotiantian I've been doing some analysis and fixes for this and a vaguely similar memory usage issue in `_header_value_parser`. The main memory usage issue with `message_from_bytes` is its internal `StringIO` buffering. I have some changes around this that basically amount to removing the `StringIO` usage, as well as enabling the parser to avoid line-by-line parsing when no boundaries are present or possibly present. This is extremely helpful for emails with large attachments or large body parts. For a 10 MiB email, this was measured to reduce memory overhead down to about 1.01x the input size, or roughly an 80 MiB reduction, if memory (ha) serves correctly. I think this is about as good as can reasonably be achieved without rewriting the parser. I hope to have a PR open for those changes sometime in the next few days or week, since I already got approval from work to PR those to CPython. I've tested the changes against about a million emails and observed no differences in behavior, other than memory usage and runtime performance both improving. I should have more exact stats when I open the PR.
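For context on the line-by-line parsing mentioned above: `message_from_bytes` is built on the standard library's incremental feed parser, which accumulates the bytes it is fed and splits them into lines before handing them to the parsing state machine. A minimal sketch of that API:

```python
# Sketch: email.message_from_bytes sits on top of the feed parser,
# which buffers fed bytes and splits them into lines internally.
from email.parser import BytesFeedParser

parser = BytesFeedParser()
parser.feed(b"Subject: hi\r\n")
parser.feed(b"\r\n")
parser.feed(b"body text")
msg = parser.close()

assert msg["Subject"] == "hi"
assert msg.get_payload() == "body text"
```

Because every chunk is re-split into lines, large single-part bodies pay the buffering cost even when no MIME boundary could possibly occur, which is the case the changes above target.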
I currently have some changes that bring this down to the 30s, but I hope to have some further reductions before opening a pull request. Unfortunately, this has been a substantially more involved refactoring effort to bring memory usage down, but I'm happy to say the changes I have so far require no test changes and are so far localized to just `_header_value_parser`. I had originally attempted to merely split email addresses using a regular expression and pass those to the existing parser. Let me know if I should open a new issue for the `_header_value_parser` changes. Thanks,
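For reference, the `_header_value_parser` path discussed above is exercised whenever a message is parsed with the modern policy: structured headers such as To are parsed into `Address` objects through that module. A small sketch (the 100-address header is a synthetic example, not from the issue):

```python
# Sketch: with email.policy.default, address headers are parsed into
# structured Address objects via email._header_value_parser, the module
# discussed above. The 100-address To: header here is synthetic.
from email import message_from_bytes, policy

to_line = b", ".join(b"user%d@example.com" % i for i in range(100))
raw = b"To: " + to_line + b"\r\n\r\nbody"
msg = message_from_bytes(raw, policy=policy.default)

addresses = msg["To"].addresses
assert len(addresses) == 100
assert str(addresses[0]) == "user0@example.com"
```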
Hi @JAJames, thanks for the detailed analysis. I'm not an expert in the email lib, but I do know it does not have a dedicated owner, and it's kind of in maintenance mode. There are a lot of things in that library that we are not thrilled about, but we also have concerns about refactoring it, as the code is quite complicated. Here's my suggestion: do not try to get the best possible outcome. Make something that's easy to prove equivalent. 10x sounds bad, but a 1 GB attachment is just super rare. If you do need to deal with super large attachments, you should use a more dedicated (and updated) library. It's not easy to find core devs to review code changes in the email library, and it's going to be even harder if it's a huge refactor. A simple change that brings 10x down to 5x is much more likely to be merged than a complicated and delicate change that brings 10x down to 1.01x.
Bug report
Bug description:
Hi!
While investigating some memory issues on my Lambda, I discovered an odd usage coming from `email.message_from_bytes`.
When opening an .eml that contains almost no text but a 30 MB attachment, memory usage jumps by 238 MB! Nine times the size of the file!
Here's what my test looked like:
And the output:
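The original snippet and its output were not preserved in this copy of the issue; the following is a hypothetical reconstruction of that kind of measurement using `tracemalloc` (the helper name and file path are illustrative, not from the report):

```python
# Hypothetical reconstruction of the measurement described above:
# report the file size and the peak Python memory allocated while
# email.message_from_bytes parses the message.
import email
import tracemalloc

def parse_and_measure(path):
    with open(path, "rb") as f:
        raw = f.read()
    tracemalloc.start()
    msg = email.message_from_bytes(raw)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"file size: {len(raw)} bytes, peak while parsing: {peak} bytes")
    return len(raw), peak
```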
The EML in question contains an attachment (a CSV file) encoded in Base64. I suspect that `BytesParser` is converting that content to binary data, but I find it surprising that doing this takes 9 times the file size. Wouldn't it be faster and more efficient to convert it only when accessed, with a way to not convert it at all (getting it raw, in Base64)?
(Maybe there is already and I missed it?)
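For what it's worth, the decoding itself is already deferred: the parsed message stores the attachment as the raw Base64 text, and decoding to bytes happens only when `get_payload(decode=True)` is called. A minimal sketch:

```python
# Sketch: the parsed message keeps the attachment as raw Base64 text;
# decoding to bytes happens only when get_payload(decode=True) is called.
import base64
import email

raw = (b"Content-Type: application/octet-stream\r\n"
       b"Content-Transfer-Encoding: base64\r\n"
       b"\r\n" + base64.encodebytes(b"attachment bytes"))
msg = email.message_from_bytes(raw)

b64_text = msg.get_payload()          # raw Base64 string, undecoded
data = msg.get_payload(decode=True)   # decoded only on access
assert data == b"attachment bytes"
assert base64.b64decode(b64_text) == data
```

So the memory blow-up reported here comes from the parsing machinery itself rather than from eagerly decoding the attachment.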
I tested this in:
And got the same results.
CPython versions tested on:
3.10
Operating systems tested on:
Linux
Linked PRs