python - Convert PDF with columns to text -


In Unix or Windows, I want to convert a PDF dictionary into a Python dictionary. I copied the contents of the PDF dictionary and put them in an .rtf file, intending to read them with Python. However, it gives something like this:

a /e/ noun human blood type of abo system, containing antigen (note: some- 1 type can donate people of same group or of ab group, , can receive blood people type or type o.)
aa
abdominal distension /􏰄b􏰁dɒmn(ə)l ds 􏰂tenʃ(ə)n/ noun condition in abdo-
men stretched because of gas or fluid
a
abdominal distension
aa abbr alcoholics anonymous

It has squashed the columns of the PDF into a strange mishmash. How do I convert the PDF to text so that the columns are respected? In other words, the desired output is:

a /e/ noun human blood type of abo system, containing antigen (note: some- 1 type can donate people of same group or of ab group, , can receive blood people type or type o.)
aa abbr alcoholics anonymous

...and so on.

You have two options for extracting the text:

  1. Direct text extraction of each page as-is.
  2. Split each page in two along the gap between the columns, and extract the text from each half separately.

For the first option, I'd suggest you first try pdftotext with the parameter -layout. (There are other tools, such as TET, the Text Extraction Toolkit from the pdflib folks, which you can try if pdftotext isn't enough.)

To follow the road of the second option using Ghostscript and other tools, you may want to check out the answers to related questions on splitting PDF pages.
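As a rough illustration of the second option in Python, here is a minimal sketch that does not use Ghostscript. Instead it swaps in the crop-area flags (-x, -y, -W, -H) of Poppler's pdftotext to read the left and right halves of a page separately. The function names are hypothetical, the page size is assumed to be known already (for example from pdfinfo) and rounded to whole points, and the column gap is assumed to sit at the exact middle of the page.

```python
import subprocess


def column_crop_args(page_width, page_height):
    """Return pdftotext crop flags for the left and right half of a page.

    Sizes are integers in pixels at pdftotext's default resolution of
    72 dpi, which corresponds to PDF points.
    Assumption: the column gap is at the horizontal middle of the page.
    """
    half = page_width // 2
    left = ["-x", "0", "-y", "0",
            "-W", str(half), "-H", str(page_height)]
    right = ["-x", str(half), "-y", "0",
             "-W", str(page_width - half), "-H", str(page_height)]
    return left, right


def extract_columns(pdf_path, page, page_width, page_height):
    """Extract one page column by column: left half first, then right half."""
    texts = []
    for crop in column_crop_args(page_width, page_height):
        cmd = ["pdftotext", "-f", str(page), "-l", str(page),
               "-layout", *crop, pdf_path, "-"]
        done = subprocess.run(cmd, capture_output=True,
                              encoding="utf-8", check=True)
        texts.append(done.stdout)
    return "\n".join(texts)
```

Calling `extract_columns("dictionary.pdf", 8, 595, 842)` (an A4-sized page is only an example here) would return the left column's text followed by the right column's. If the real gap is off-centre, the split position would need adjusting.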


pdftotext -layout

You can try the command line tool pdftotext. You'll have to decide whether it is "good enough" for your purpose.

The following command extracts the text from page 8 (the first page with the dual-column layout) and prints it to <stdout>:

$ pdftotext -f 8 -l 8 -layout \
      dictionary+of+medical+terms+4th+ed.-+\(malestrom\).pdf - \
  | head -n 30

results in:

medicine.fm page 1 thursday, november 20, 2003 4:26 pm

/e/ noun human blood type of abo                            abdominal distension /bdɒmn(ə)l ds
                                                            abdominal distension
system, containing antigen (note: some-                     tenʃ(ə)n/ noun condition in abdo-
1 type can donate people of                                 men stretched because of gas or fluid
same group or of ab group, , can receive                    abdominal pain /b dɒmn(ə)l pen/ noun
                                                            abdominal pain
blood people type or type o.)                               pain in abdomen caused indigestion or
aa
aa abbr alcoholics anonymous                                more serious disorders
& e /e ənd                       /, & e department /e ənd   abdominal viscera /bdɒmn(ə)l    vsərə/
& e                                                         abdominal viscera
         d pɑ            tmənt/ noun same accident ,
                                                            plural noun organs contained in
emergency department                                        abdomen, e.g. stomach, liver , intes-
& e medicine /e ənd                                   med(ə)sn/
& e medicine
                                                            tines
                                                            abdominal wall /b dɒmn(ə)l wɔ
                                                                                       l/ noun
                                                            abdominal wall
noun medical procedures used in & e de-
                                                            partments                                                muscular tissue surrounds abdomen
                                                            abdomino- /bdɒmnəυ/ prefix referring
                                                            abdomino-

Note the use of -layout! Without it, the extracted text looks like this:

medicine.fm page 1 thursday, november 20, 2003 4:26 pm /e/ noun human blood type of abo system, containing antigen (note: somea

one type can donate people of same group or of ab group, , can receive blood people type or type o.) aa abbr alcoholics anonymous & e /e ənd /, & e department /e ənd d pɑ tmənt/ noun same accident , emergency department & e medicine /e ənd med(ə)sn/ noun medical procedures used in & e deaa

a & e & e medicine partments ab /e bi / noun human blood type of abo system, containing , b antigens ab
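Since the question wants to end up in Python, the same pdftotext call can be driven from a script. The following is only a minimal sketch, assuming pdftotext is installed and on the PATH; the function names are hypothetical, and the `layout` switch exists just to make the with/without -layout comparison above easy to reproduce.

```python
import subprocess


def pdftotext_command(pdf_path, page, layout=True):
    """Build the argument list for extracting a single page to stdout."""
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if layout:
        # Keep the physical layout so the two columns stay side by side.
        cmd.append("-layout")
    # A trailing "-" tells pdftotext to write to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def extract_page(pdf_path, page, layout=True):
    """Run pdftotext and return the extracted text of one page."""
    done = subprocess.run(pdftotext_command(pdf_path, page, layout),
                          capture_output=True, encoding="utf-8", check=True)
    return done.stdout
```

For example, `extract_page("dictionary.pdf", 8)` would correspond to the -layout command shown earlier (minus the `head`), and `extract_page("dictionary.pdf", 8, layout=False)` to the variant without it.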

I noted that, on page 8, the file uses but has not embedded the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.

This does not pose a problem for text extraction, since these fonts use /WinAnsiEncoding.

However, there are other fonts, embedded as subsets. These fonts use a Custom encoding, and not all of them provide a /ToUnicode table. This table is required for reliable text extraction (back-translating glyph names to character names).

What I said can be seen in this table:

$ pdffonts -f 8 -l 8 dictionary+of+medical+terms+4th+ed.-+\(malestrom\).pdf
name                           type        encoding      emb sub uni object ID
------------------------------ ----------- ------------- --- --- --- ---------
Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
Courier                        Type 1      WinAnsi       no  no  no    1507  0
Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
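To check many pages or files for this problem, the pdffonts listing can be scanned from Python. This is a minimal sketch with hypothetical function names. It assumes the output starts with the two header lines shown above, that font names contain no spaces, and that the encoding is a single word, so each row is read from the right-hand side because the type column ("Type 1C") may itself contain a space.

```python
def parse_pdffonts(output):
    """Parse pdffonts output into a list of dicts, one per font row."""
    fonts = []
    # Skip the column header line and the dashed separator line.
    for line in output.splitlines()[2:]:
        tokens = line.split()
        if len(tokens) < 8:
            continue  # blank or unexpected line
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),  # e.g. "Type 1C"
            "encoding": tokens[-6],
            "emb": tokens[-5],
            "sub": tokens[-4],
            "uni": tokens[-3],
            # tokens[-2:] are the object number and generation
        })
    return fonts


def fonts_lacking_tounicode(output):
    """Names of fonts with a Custom encoding and no /ToUnicode table."""
    return [f["name"] for f in parse_pdffonts(output)
            if f["encoding"] == "Custom" and f["uni"] == "no"]
```

Any font this flags is a candidate for unreliable extraction, along the lines described above; the standard WinAnsi fonts are deliberately left out of the result.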

It so happened that I hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains more detail.
