A modular rewrite featuring numerous enhancements and bug fixes #3

Open
wants to merge 1 commit into
base: main

@9ao9ai9ar 9ao9ai9ar commented Aug 6, 2024

The only downsides of my rewrite that I can think of are the higher memory usage (by about 50%) and increased dependencies.

@mhdadk
Owner

mhdadk commented Aug 22, 2024

Hey @9ao9ai9ar, thank you very much for this PR! It seems like you really cleaned up the code. Adding tests was an especially nice touch.

Before merging this PR, because it is quite large (49 files changed), I'd like to review it in more detail to make sure it is OK. Unfortunately, I'm not sure when I will have the time in the near future to do this. I'm going to keep this PR open until I do so.

Nevertheless, I wanted to write to you to give you some positive feedback on your work.

@9ao9ai9ar
Author

Thanks for your words of encouragement. I will be pushing out minor commits in the meantime, like updating the README and upgrading dependencies (Pydantic v2.9.0 is just around the corner, promising to fix the high memory consumption).

I do have an idea to improve the output format, but I will hold off on that work until next year. First, all non-body fields of the posts would move into the (non-standard but widely supported) YAML metadata section at the beginning of the Markdown files. Second, individual answers should probably be split into separate files, which keeps them as close to the original as possible and avoids modifying the Markdown content by inserting section markers. Lastly, the licenses of the posts (currently there are 3 versions of CC in use) should be included and linked. I can open an issue if you're interested in further discussion.
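
To make the front matter idea concrete, here is a minimal sketch in Python. It assumes the metadata values are simple scalars (strings, numbers, booleans) and that keys are plain identifiers; it relies on JSON scalars also being valid YAML, so no YAML library is needed. The function name `render_post` and the field names are hypothetical and not taken from this PR.

```python
import json


def render_post(metadata: dict, body: str) -> str:
    """Return a Markdown document that starts with a YAML front matter block.

    Hypothetical helper: the real field names and layout are still undecided.
    """
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps quotes and escapes strings in a way YAML also accepts.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    # The body is appended unchanged, so the original Markdown is preserved.
    return "\n".join(lines) + "\n\n" + body


example = render_post(
    {"title": "Example question", "score": 3, "license": "CC BY-SA 4.0"},
    "Body text stays untouched.\n",
)
print(example)
```

Keeping the license as a front matter field is one possible way to cover the third point; linking it to the license text would need an extra field.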

@mhdadk
Owner

mhdadk commented Aug 23, 2024

All these changes sound good to me. I can have a look at them as well when you push them (and when I have some more time).

@9ao9ai9ar 9ao9ai9ar changed the title from "A complete rewrite, but keeping the format you've defined for now" to "A modular rewrite bringing numerous enhancements and bug fixes" Nov 8, 2024
@9ao9ai9ar 9ao9ai9ar changed the title from "A modular rewrite bringing numerous enhancements and bug fixes" to "A modular rewrite featuring numerous enhancements and bug fixes" Nov 8, 2024
@9ao9ai9ar
Author

I have three pieces of news to share with you, one good, one bad, and one neutral:

  1. I've thought of another important feature to add to this project: an option to download all images, either as files or as base64-encoded data blobs. The data dumps save only the image URLs, so there is still a risk of URL rot and of losing the pictures. I can probably start the work in the second half of the year.
  2. The "slight" increase in dependencies has negatively affected the ability to run the script on some platforms. For example, I wasn't able to get the project to build on NetBSD due to compile errors in PyO3. Pydantic 2.9 also didn't fix the high memory consumption; they are tackling the memory issues again in Pydantic 2.11, but if that release makes no progress either, I might have to reconsider incorporating this library in the code, which would be a shame as I kind of like their API.
  3. Since this is more a rewrite than a refactor, piecemeal commits are less important in my humble opinion. I intend to do a git rebase to squash the commits into one, so I will mark this pull request as draft for the time being. In the future, I will open feature branches to work on and do the rebases there.
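
As a rough illustration of the base64 option in the first item, the sketch below turns already-downloaded image bytes into a `data:` URI that could replace the image URL in a Markdown file. It is only a sketch under assumptions: `to_data_uri` is a hypothetical name, the MIME type is guessed from the file name, and fetching the image is left out entirely.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URI (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Placeholder bytes stand in for a real downloaded image.
uri = to_data_uri(b"abc", "picture.png")
print(uri)
```

Base64 grows the payload by about a third, so saving images as separate files may be the better default for large archives.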

@9ao9ai9ar 9ao9ai9ar marked this pull request as draft February 23, 2025 09:57
@mhdadk
Owner

mhdadk commented Mar 2, 2025

Nice work!

Some common development practices are introduced to the project:
* data (de)serialization and validation, via Pydantic
* API client code generation, via OpenAPI
* continuous integration, via a suite of tests and utilities

Known bugs, such as posts on meta sites not being backed up and
mhdadk#1, are fixed. Rate limiting is handled more carefully, and quota
usage is reduced by using static filters; a few CLI options are also added.
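
One possible shape for the rate limiting mentioned above, as a minimal sketch only. It assumes the API is the Stack Exchange API, whose response wrapper can carry a `backoff` field (seconds to wait before repeating the same call) and a `quota_remaining` field. The helper names and the `min_interval` default are hypothetical and do not describe the code in this PR.

```python
def seconds_to_wait(response: dict, min_interval: float = 0.1) -> float:
    """Return how long to pause before the next request (hypothetical helper).

    Honors the server-provided `backoff` value when present, and otherwise
    falls back to a small fixed delay between requests.
    """
    backoff = response.get("backoff", 0)
    return max(float(backoff), min_interval)


def quota_exhausted(response: dict) -> bool:
    """Return True when the response reports no remaining quota."""
    # If the field is missing, assume quota is still available.
    return response.get("quota_remaining", 1) <= 0


print(seconds_to_wait({"backoff": 10, "quota_remaining": 250}))
print(quota_exhausted({"quota_remaining": 0}))
```

A caller would sleep for `seconds_to_wait(...)` between requests and stop early once `quota_exhausted(...)` is true.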
@9ao9ai9ar 9ao9ai9ar marked this pull request as ready for review April 14, 2025 18:20