Commit 8be8d51

TXT Document Support
Describe basic features of PyMuPDF support for plain text files.
1 parent bbe897e commit 8be8d51

File tree

6 files changed

+561
-6
lines changed

README.md

Lines changed: 5 additions & 6 deletions
@@ -5,23 +5,22 @@ This repository contains demos and examples to help you create PDF, XPS, and eBo
55

66
Some examples were initially created in the early days of the package. API changes implemented over time may have caused discrepancies in the scripts. We may not update them every time an update is released, so there's no guarantee they all will work as originally expected. If you look at the scripts as what they are intended to be, examples, then they will give you a good start.
77

8-
Up to PyMupdf 1.18.x, methods and attributes have been renamed according to the snake-case standard. For the time being including versions 1.19.x, old and new names will coexist. For example, `doc.newPage()` can be used as well as `doc.new_page()` to create a new page.
8+
## "TXT" Documents
9+
PyMuPDF now (v1.23.x) also supports **plain text files** as a `Document`, like PDF, XPS, EPUB etc. They will behave just like any other document: you can search and extract text, render pages as Pixmaps etc.
910

10-
> In versions 1.19.x, a deprecation warning is issued if camel-case names are used. In newest versions only snake-case names are valid. You may want to use the `alias-changer.py` script in this folder to keep your code up-to-date. Alternatively, use `fitz.restore_aliases()`.
11+
This offers ways to access program sources, markdown documents and basically any file, as long as it is encoded in ASCII, UTF-8 or UTF-16.
1112

12-
## OCR Support
13-
Starting with version 1.19.0, PyMuPDF supports MuPDF's integrated Tesseract OCR features. Over time, we will add examples for using this.
13+
Please navigate to folder [text-documents](https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/text-documents) for details.
1414

15-
There are nonetheless also other ways to use OCR tools in PyMuPDF scripts.
1615

16+
## OCR Support
1717
There are now two demo examples in the new folder [OCR](https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/OCR) which use MuPDF OCR, Tesseract OCR and `easyocr` respectively.
1818

1919
To see more "interactive" demos of the new OCR features, please also have a look at the notebook collection in the [jupyter-notebooks](https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/jupyter-notebooks) folder.
2020

2121
## Advanced TOC Handling
2222
Handling of table of contents (TOC) has been significantly improved in v1.18.6. I have therefore created another new [folder](https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/advanced-toc) dealing specifically with this subject.
2323

24-
2524
## Font Replacement
2625
New for PyMuPDF v1.17.6 is the ability to replace selected fonts in existing PDFs. This is a set of two scripts and their documentation in [this](https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/font-replacement) folder.
2726

text-documents/README.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
# "TXT" Document Support
2+
3+
PyMuPDF supports files in plain text formats as a `Document`.
4+
5+
They can be opened as usual: text can be extracted and searched for, pages can be rendered as Pixmaps, and more.
6+
7+
You can use this for general text files, program sources, markdown documents and many more, as long as they are encoded in ASCII, UTF-8 or UTF-16.
8+
9+
File extensions ".txt" and ".text" are natively recognized. Files with other extensions can be opened via the `filetype` argument: `doc = fitz.open("myscript.py", filetype="txt")`.
10+
11+
TXT documents are "reflowable", so [doc.is_reflowable](https://pymupdf.readthedocs.io/en/latest/document.html#Document.is_reflowable) will return `True`. You can re-layout them via [doc.layout()](https://pymupdf.readthedocs.io/en/latest/document.html#Document.layout), or open them with the additional, layout-oriented options: a rectangle, width, height and font size.
12+
13+
Like with e-books (EPUB, MOBI, etc.), there is **_no fixed page size_** (dimension - width / height). At open time, the defaults `width=400`, `height=600` and `fontsize=11` will be used.
14+
15+
When extracting text details, a Courier-equivalent monospaced font will be reported for characters from the extended Latin range. For other scripts (CJK, Hindi, Tamil, Thai etc.), font names matching the occurring characters will be reported instead.
16+
17+
> The monospaced font is used at least for Latin text. If, however, the text contains characters from the CJK Unicode range, other fonts will automatically be considered, making things more complicated.
18+
19+
If only (extended) Latin characters occur (usually the case in program text), it is easy to predict the number of characters per line:
20+
21+
* Each character of the Courier font has a width of `0.6 * fontsize`.
22+
23+
* There exist left and right margins of `2 * fontsize` each. A character written in the left-most position will have a bbox where `x0 = 2 * fontsize`. The last character's bbox will end before start of the right margin, `x1 <= page.rect.width - 2 * fontsize`.
24+
25+
* The default page (width of 400 points, font size 11) therefore can contain up to 53 characters per line: `int((400 - 4 * 11) / (0.6 * 11))`.
26+
27+
* If you want to lay out your program source conveniently (e.g. 80 characters per line), you could do the following:
28+
- Page `width = (4 + 80 * 0.6) * 11`, which is 572.
29+
- An "A4" rectangle with a width of 595 points can contain up to 83 characters per line. "Letter" rectangles even yield up to 86 characters per line.
30+
- `doc = fitz.open("myscript.py", filetype="txt", rect=fitz.paper_rect("letter"))` should normally give you a layout with only a few "unintended" line breaks.
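The arithmetic in the list above can be sketched as two small helpers. The function names are hypothetical and not part of PyMuPDF; they only restate the stated margins (`2 * fontsize` each) and the Courier character width (`0.6 * fontsize`), so they apply to (extended) Latin text only.

```python
def chars_per_line(page_width: float, fontsize: float = 11) -> int:
    """Estimate how many monospaced Latin characters fit on one line."""
    margin = 2 * fontsize        # left and right margin, each
    char_width = 0.6 * fontsize  # width of one Courier character
    return int((page_width - 2 * margin) / char_width)


def width_for_chars(n_chars: int, fontsize: float = 11) -> float:
    """Page width needed for n_chars characters per line plus both margins."""
    return (4 + n_chars * 0.6) * fontsize


print(chars_per_line(400))  # default page width
print(chars_per_line(595))  # "A4" width
print(chars_per_line(612))  # "Letter" width
print(width_for_chars(80))  # width for 80 characters at font size 11
```

The results match the figures quoted above: 53, 83 and 86 characters per line, and a width of 572 points for 80 characters.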
31+
32+
Other, rarely used document methods may receive more attention with TXT documents: [Document.make_bookmark()](https://pymupdf.readthedocs.io/en/latest/document.html#Document.make_bookmark) and [Document.find_bookmark()](https://pymupdf.readthedocs.io/en/latest/document.html#Document.find_bookmark).
33+
34+
While accessing pages, you may decide to re-layout the document and then want to quickly find your previous location again:
35+
36+
```python
bm = doc.make_bookmark()  # compute current position in document
doc.layout(width=612, height=792, fontsize=10)  # layout document
chapter, pno = doc.find_bookmark(bm)  # retrieve new location
```
41+
42+
Method `find_bookmark()` returns a location in `(chapter, pno)` style. TXT documents only have one chapter, so `chapter = 0`.
43+
44+
`bm` is an integer with a special internal structure, which must not be modified in any way.

text-documents/any-file.ipynb

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Open Arbitrary Files as TXT Document\n",
8+
"\n",
9+
"PyMuPDF can open plain text files as a `fitz.Document`. Because of its support for UTF-8 (and UTF-16) encoding, a wide range of files can be opened as if they were text files.\n",
10+
"\n",
11+
"Here is a small example where we create a PDF and then open it again as a text file.\n",
12+
"\n",
13+
"> **NOTE:** Files containing a NULL byte will currently not be processed completely, as this is regarded as an EOF marker."
14+
]
15+
},
16+
{
17+
"cell_type": "code",
18+
"execution_count": 24,
19+
"metadata": {},
20+
"outputs": [
21+
{
22+
"name": "stdout",
23+
"output_type": "stream",
24+
"text": [
25+
"%PDF-1.7\n",
26+
"%µ¶\n",
27+
"1 0 obj\n",
28+
"<</Type/Catalog/Pages 2 0 R>>\n",
29+
"endobj\n",
30+
"2 0 obj\n",
31+
"<</Type/Pages/Count 1/Kids[3 0 R]>>\n",
32+
"endobj\n",
33+
"3 0 obj\n",
34+
"<</Type/Page/MediaBox[0 0 595 842]/Rotate 0/Resources<</Font<</helv 4 0 R>>>>/Parent 2 0 R/Contents 5 0 R>>\n",
35+
"endobj\n",
36+
"4 0 obj\n",
37+
"<</Type/Font/Subtype/Type1/BaseFont/Helvetica/Encoding/WinAnsiEncoding>>\n",
38+
"endobj\n",
39+
"5 0 obj\n",
40+
"<</Length 81>>\n",
41+
"stream\n",
42+
"q\n",
43+
"BT\n",
44+
"/helv 11 Tf\n",
45+
"1 0 0 1 100 742 Tm\n",
46+
"[(Just some arbitrary content.)] TJ\n",
47+
"ET\n",
48+
"Q\n",
49+
"q\n",
50+
"Q\n",
51+
"endstream\n",
52+
"endobj\n",
53+
"xref\n",
54+
"0 6\n",
55+
"0000000000 65536 f \n",
56+
"0000000016 00000 n \n",
57+
"0000000062 00000 n \n",
58+
"0000000114 00000 n \n",
59+
"0000000238 00000 n \n",
60+
"0000000327 00000 n \n",
61+
"trailer\n",
62+
"<</Size 6/Root 1 0 R/ID[<CC2A084CB9885ADABC51CE10CFC725E8><8F2C15D6C784DF5A97728DB5403FCAB8>]>>\n",
63+
"startxref\n",
64+
"457\n",
65+
"%%EOF\n",
66+
"\n"
67+
]
68+
}
69+
],
70+
"source": [
71+
"import fitz\n",
72+
"\n",
73+
"doc = fitz.open() # make a new empty PDF\n",
74+
"page = doc.new_page() # give it a page ... with some content\n",
75+
"page.insert_text((100, 100), \"Just some arbitrary content.\")\n",
76+
"page.clean_contents()\n",
77+
"# save the PDF to a file\n",
78+
"doc.save(\"test.pdf\", garbage=4, expand=True) # prevent premature EOF (0x00)\n",
79+
"doc.close()\n",
80+
"\n",
81+
"# make a TXT Document from the PDF\n",
82+
"doc = fitz.open(\"test.pdf\", filetype=\"txt\", rect=fitz.paper_rect(\"letter\"), fontsize=8)\n",
83+
"page = doc[0]\n",
84+
"print(page.get_text())"
85+
]
86+
}
87+
],
88+
"metadata": {
89+
"kernelspec": {
90+
"display_name": "Python 3",
91+
"language": "python",
92+
"name": "python3"
93+
},
94+
"language_info": {
95+
"codemirror_mode": {
96+
"name": "ipython",
97+
"version": 3
98+
},
99+
"file_extension": ".py",
100+
"mimetype": "text/x-python",
101+
"name": "python",
102+
"nbconvert_exporter": "python",
103+
"pygments_lexer": "ipython3",
104+
"version": "3.12.0"
105+
}
106+
},
107+
"nbformat": 4,
108+
"nbformat_minor": 2
109+
}
