python - Convert PDF with columns to text
In Unix or Windows, I want to convert a dictionary into a Python dictionary. I copied the contents of a PDF dictionary and put them in an .rtf file, intending to read them with Python. However, it gives something like:
a /e/ noun human blood type of abo system, containing antigen (note: some- 1 type can donate people of same group or of ab group, , can receive blood people type or type o.)
aa
abdominal distension /bdɒmn(ə)l ds tenʃ(ə)n/ noun condition in abdo-
men stretched because of gas or fluid
a
abdominal distension
aa abbr alcoholics anonymous
It has squashed the columns of the PDF into a strange mishmash. How can I convert the PDF to text so that the columns are respected? In other words, the desired output is:
a /e/ noun human blood type of abo system, containing antigen (note: some- 1 type can donate people of same group or of ab group, , can receive blood people type or type o.)
aa abbr alcoholics anonymous
...and so on.
You have two options for extracting the text:
- Direct text extraction of each page as-is.
- Split each page in two along the column gap, and extract the text from each half separately.
For the first option, I'd suggest you first try pdftotext with the parameter -layout. (There are other tools, such as TET, the Text Extraction Toolkit from the PDFlib folks, which you can try if pdftotext isn't enough.)
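Since the goal is to read the text from Python, the first option could be wrapped roughly as follows. This is only a minimal sketch: it assumes pdftotext is installed and on the PATH, and the function names `pdftotext_command` and `extract_text` are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return its output as a string.

    Assumes the pdftotext binary is available; raises CalledProcessError
    if the tool exits with an error.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,
    )
    # pdftotext writes UTF-8 unless told otherwise with -enc
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `extract_text("dictionary.pdf", first_page=8, last_page=8)` would return the layout-preserved text of page 8, where `dictionary.pdf` is a placeholder file name.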
If you want to follow the road of the second option, using Ghostscript or other tools, you may want to check out the answers to the following questions:
- Linux-based tool to chop PDFs into multiple pages (Super User)
- Convert PDF 2 sides per page to 1 side per page (Super User)
- How can I split a PDF's pages down the middle? (Super User)
- Cropping a PDF using Ghostscript 9.01 (Stack Overflow)
- Split one PDF page into two (Stack Overflow)
- PDF - Remove White Margins (Stack Overflow)
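As a lighter-weight variation on the second option, pdftotext itself can restrict extraction to a rectangle with its -x, -y, -W and -H crop options, so each column can be extracted without first producing a split PDF. The sketch below is not the Ghostscript approach from the linked answers. It assumes the default resolution of 72 dpi (so crop units equal PDF points), a page size you have looked up yourself, and a column gap at the horizontal middle of the page. `column_boxes` and `crop_command` are hypothetical helper names.

```python
def column_boxes(page_width, page_height, split_x=None):
    """Return (x, y, width, height) crop boxes for the left and right halves.

    split_x is the horizontal position of the column gap; it defaults to
    the middle of the page, which is an assumption about the layout.
    """
    if split_x is None:
        split_x = page_width // 2
    left = (0, 0, split_x, page_height)
    right = (split_x, 0, page_width - split_x, page_height)
    return left, right


def crop_command(pdf_path, page, box):
    """Build a pdftotext call that extracts only the given box of one page."""
    x, y, width, height = box
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),
        "-x", str(x), "-y", str(y),
        "-W", str(width), "-H", str(height),
        "-layout", pdf_path, "-",
    ]


# Example with a hypothetical US Letter page (612 x 792 points):
left_box, right_box = column_boxes(612, 792)
commands = [crop_command("dictionary.pdf", 8, box) for box in (left_box, right_box)]
```

Running the two commands in order (for example with `subprocess.run`) and concatenating their output would give the left column followed by the right column. The split position usually needs adjusting per document.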
pdftotext -layout
You can try the command line tool pdftotext. You'll have to decide if it is "good enough" for your purpose.
The following command extracts the text from page 8 (the first page with the dual column layout) and prints it to <stdout>:
$ pdftotext -f 8 -l 8 -layout \
    dictionary+of+medical+terms+4th+ed.-+\(malestrom\).pdf - \
  | head -n 30
This results in:
medicine.fm page 1 thursday, november 20, 2003 4:26 pm /e/ noun human blood type of abo abdominal distension /bdɒmn(ə)l ds abdominal distension system, containing antigen (note: some- tenʃ(ə)n/ noun condition in abdo- 1 type can donate people of men stretched because of gas or fluid same group or of ab group, , can receive abdominal pain /b dɒmn(ə)l pen/ noun abdominal pain blood people type or type o.) pain in abdomen caused indigestion or aa aa abbr alcoholics anonymous more serious disorders & e /e ənd /, & e department /e ənd abdominal viscera /bdɒmn(ə)l vsərə/ & e abdominal viscera d pɑ tmənt/ noun same accident , plural noun organs contained in emergency department abdomen, e.g. stomach, liver , intes- & e medicine /e ənd med(ə)sn/ & e medicine tines abdominal wall /b dɒmn(ə)l wɔ l/ noun abdominal wall noun medical procedures used in & e de- partments muscular tissue surrounds abdomen abdomino- /bdɒmnəυ/ prefix referring abdomino-
Note the use of -layout! Without it, the extracted text looks like this:
medicine.fm page 1 thursday, november 20, 2003 4:26 pm /e/ noun human blood type of abo system, containing antigen (note: somea
one type can donate people of same group or of ab group, , can receive blood people type or type o.) aa abbr alcoholics anonymous & e /e ənd /, & e department /e ənd d pɑ tmənt/ noun same accident , emergency department & e medicine /e ənd med(ə)sn/ noun medical procedures used in & e deaa
a & e & e medicine partments ab /e bi / noun human blood type of abo system, containing , b antigens ab
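Because -layout keeps the two columns side by side on each line, the text can then be cut apart in Python at a fixed character offset. The following is only a rough sketch: `split_layout_columns` is a made-up helper, the offset `split_at` has to be found by inspecting the output, and page headers or marginal guide words that cross the column boundary would still need separate handling.

```python
def split_layout_columns(layout_text, split_at):
    """Split layout-preserved text into (left_column, right_column).

    Every line is cut at character offset split_at; blank fragments are
    dropped so each column reads as continuous text.
    """
    left_lines, right_lines = [], []
    for line in layout_text.splitlines():
        left = line[:split_at].rstrip()
        right = line[split_at:].strip()
        if left:
            left_lines.append(left)
        if right:
            right_lines.append(right)
    return "\n".join(left_lines), "\n".join(right_lines)


# Tiny made-up sample with the column gap at offset 14
sample = "alpha one     beta one\nalpha two     beta two"
left_column, right_column = split_layout_columns(sample, 14)
```

Reading `left_column` first and `right_column` second restores the reading order of a two-column page.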
I noted that on page 8 the file uses, but has not embedded, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.
This does not pose a problem for text extraction, since these fonts use /WinAnsiEncoding.
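However, there are other fonts, which are embedded as a subset. These fonts use a /Custom encoding, and not every one of them provides a /ToUnicode table (the EuropeanPi-Three subset on this page lacks one). This table is required for reliable text extraction (back-translating glyph names to character names).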
What I said can be seen in this table:
$ pdffonts -f 8 -l 8 dictionary+of+medical+terms+4th+ed.-+\(malestrom\).pdf
name                           type        encoding      emb sub uni object ID
------------------------------ ----------- ------------- --- --- --- ---------
Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
Courier                        Type 1      WinAnsi       no  no  no    1507  0
Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
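If you want to run this check from Python, the pdffonts listing can be parsed by using the dashed separator line to find the column boundaries. This is a sketch under assumptions: it expects the seven-column layout shown above, it would misread rows where a very long font name overflows its column, and `parse_pdffonts` and `fonts_without_tounicode` are hypothetical names.

```python
import re

COLUMNS = ["name", "type", "encoding", "emb", "sub", "uni", "object_id"]


def parse_pdffonts(output):
    """Parse pdffonts output into a list of dicts, one per font."""
    lines = output.splitlines()
    # The row of dashes marks where each fixed-width column starts and ends
    sep_index = next(i for i, line in enumerate(lines) if line.startswith("---"))
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_index])]
    fonts = []
    for line in lines[sep_index + 1:]:
        if not line.strip():
            continue
        fields = [line[start:end].strip() for start, end in spans]
        fonts.append(dict(zip(COLUMNS, fields)))
    return fonts


def fonts_without_tounicode(fonts):
    """Return names of embedded fonts that have no /ToUnicode table."""
    return [f["name"] for f in fonts if f["emb"] == "yes" and f["uni"] == "no"]
```

Fonts returned by `fonts_without_tounicode` are the ones whose text is likely to come out garbled or incomplete.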
It so happens that I hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains more detail.