Commit 1b1405a

Add files via upload
1 parent 4858b1e commit 1b1405a

1 file changed

+137
-0
lines changed

pdfminer.ipynb

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# How to convert PDF files to text files\n",
8+
"---\n",
9+
"There are several ways to convert a PDF to a text file, such as Python libraries, OCR software, or Linux command-line tools. Here I will show two methods for PDF conversion. \n",
10+
"\n",
11+
"- Use a Python package called pdfminer to convert a PDF file to a text file \n",
12+
"- Use Linux commands to do the same job "
13+
]
14+
},
15+
{
16+
"cell_type": "markdown",
17+
"metadata": {},
18+
"source": [
19+
"## 1. Use the Python package pdfminer to convert a PDF to a text file \n",
20+
"- Install the Python package "
21+
]
22+
},
23+
{
24+
"cell_type": "code",
25+
"execution_count": null,
26+
"metadata": {
27+
"scrolled": true
28+
},
29+
"outputs": [],
30+
"source": [
31+
"!pip install pdfminer==20110515"
32+
]
33+
},
34+
{
35+
"cell_type": "markdown",
36+
"metadata": {},
37+
"source": [
38+
"- Write a function that extracts the text of a PDF and returns it as a string "
39+
]
40+
},
41+
{
42+
"cell_type": "code",
43+
"execution_count": null,
44+
"metadata": {},
45+
"outputs": [],
46+
"source": [
47+
"from cStringIO import StringIO\n",
48+
"from pdfminer.pdfinterp import PDFResourceManager, process_pdf\n",
49+
"from pdfminer.converter import TextConverter\n",
50+
"from pdfminer.layout import LAParams"
51+
]
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": null,
56+
"metadata": {},
57+
"outputs": [],
58+
"source": [
59+
"def to_txt(pdf_path):\n",
60+
"    input_ = open(pdf_path, 'rb')\n",
61+
"    output = StringIO()\n",
62+
"\n",
63+
"    manager = PDFResourceManager()\n",
64+
"    converter = TextConverter(manager, output, laparams=LAParams())\n",
65+
"    process_pdf(manager, converter, input_)\n",
66+
"\n",
67+
"    return output.getvalue() "
68+
]
69+
},
70+
{
71+
"cell_type": "code",
72+
"execution_count": null,
73+
"metadata": {
74+
"scrolled": false
75+
},
76+
"outputs": [],
77+
"source": [
78+
"import os; to_txt(os.path.expanduser('~/location_of_the_pdf_file/a.pdf'))"
79+
]
80+
},
81+
{
82+
"attachments": {},
83+
"cell_type": "markdown",
84+
"metadata": {},
85+
"source": [
86+
"## 2. Use Linux commands to convert a single PDF or a batch of PDFs to text files \n",
87+
"References: \n",
88+
"- https://www.howtogeek.com/228531/how-to-convert-a-pdf-file-to-editable-text-using-the-command-line-in-linux\n",
89+
"- https://askubuntu.com/questions/211870/how-to-convert-all-pdf-files-to-text-within-a-folder-with-one-command"
91+
]
92+
},
93+
{
94+
"cell_type": "code",
95+
"execution_count": null,
96+
"metadata": {},
97+
"outputs": [],
98+
"source": [
99+
"# check whether the poppler-utils package is installed \n",
100+
"dpkg -s poppler-utils \n",
101+
"\n",
102+
"# if it is not installed, install it \n",
103+
"sudo apt-get install poppler-utils \n",
104+
"\n",
105+
"# convert pdf to txt file \n",
106+
"pdftotext Your_pdf_file_location/sample.pdf Your_txt_file_location/sample.txt \n",
107+
"\n",
108+
"# preserve the layout of the original document \n",
109+
"pdftotext -layout sample.pdf sample.txt \n",
110+
"\n",
111+
"# convert a batch of PDFs to text files \n",
112+
"for file in *.pdf; do pdftotext -layout \"$file\"; done "
113+
]
114+
}
115+
],
116+
"metadata": {
117+
"kernelspec": {
118+
"display_name": "Python [conda env:py27]",
119+
"language": "python",
120+
"name": "conda-env-py27-py"
121+
},
122+
"language_info": {
123+
"codemirror_mode": {
124+
"name": "ipython",
125+
"version": 2
126+
},
127+
"file_extension": ".py",
128+
"mimetype": "text/x-python",
129+
"name": "python",
130+
"nbconvert_exporter": "python",
131+
"pygments_lexer": "ipython2",
132+
"version": "2.7.16"
133+
}
134+
},
135+
"nbformat": 4,
136+
"nbformat_minor": 2
137+
}
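
The batch loop in section 2 could also be driven from Python, which keeps the whole workflow in one notebook. The following is only a rough sketch under a few assumptions: it targets Python 3 (the notebook itself uses a Python 2.7 kernel), it needs `pdftotext` from poppler-utils on the PATH, and the helper names `pdftotext_command` and `convert_folder` are made up for illustration. Python does not expand `~` in paths on its own, so the sketch calls `os.path.expanduser` explicitly.

```python
import os
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list for one PDF (hypothetical helper)."""
    # Expand a leading "~" because neither open() nor subprocess does it.
    pdf = Path(os.path.expanduser(str(pdf_path)))
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout.
        cmd.append("-layout")
    # Write the text next to the PDF, with the extension swapped to .txt.
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def convert_folder(folder, keep_layout=True):
    """Run pdftotext on every *.pdf in folder and return the .txt paths."""
    written = []
    root = Path(os.path.expanduser(str(folder)))
    for pdf in sorted(root.glob("*.pdf")):
        cmd = pdftotext_command(pdf, keep_layout)
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        subprocess.run(cmd, check=True)
        written.append(cmd[-1])
    return written
```

Calling `convert_folder('~/location_of_the_pdf_file')` would then mirror the shell `for` loop. Passing the arguments as a list instead of a shell string avoids quoting problems with file names that contain spaces.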
