How to Get Data from PDFs using pdfminer

Why parse PDFs with pdfminer?

Honestly, this was the first thing I found that worked quickly for me. Other parsers were more complicated to use and/or didn’t do what I needed them to do. There’s no reason another package can’t do this though.

Why parse PDFs at all?

Often, you’ll get data from coworkers in .pdf form. This is visually appealing and easy to casually skim through, but an absolute nightmare to get data from. For example, I receive about 50 .pdf files every two weeks and need to extract data from tables on the first and fifth pages. Nobody wants to sit for a couple of hours copying and pasting from two different areas in 50 documents. Here’s my hacky way of peeking into a .pdf file and extracting what I want.

Extracting data from .pdfs

Get pdfminer up and running

Important note- You will have to repeat the install step below every time you move this package into a different directory. I won’t tell you how long that took me to remember the other day. Let’s just say it motivated me to make this handy reference post.

Download and initialize the software in the pdfminer-20140328 directory-

First, download pdfminer here.

Unzip it, then initialize the package from inside the pdfminer-20140328 directory by following the instructions the pdfminer creators have posted on their git site here. Here they are for convenience:

(Not comfortable with using terminal? See my post on How to Use Terminal here.)

$ python setup.py install

Test the software-

$ pdf2txt.py samples/simple1.pdf 

Check that the output from this command looks like the following:

Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d

Look at the .pdf file using pdfminer

I am sure there is a more elegant way to do this…but that’s a super low bar because this method is about as graceful as a tap-dancing whale. That said, this quick and dirty way works for me. Basically, I use pdfminer to dump all the data into a .txt file. It’s ugly to look at, but it’s easy to parse. Because my .pdf files are almost always in the same format, I only had to do this process once and I was done.

For convenience, the bare bones of the script are here for you to copy and paste.


import csv  #used later when writing the extracted data to a .csv file
import os   #needed for os.system below
import sys

directory_with_files_of_interest = "my_dir"
file_to_convert_to_txt = "test_file.pdf"
converted_filename = "test_file.txt"
#scroll over so you don't miss cut off text here
os.system("python pdfminer-20140328/tools/pdf2txt.py -o %s %s/%s" %(converted_filename, directory_with_files_of_interest, file_to_convert_to_txt))

#take a look at the contents (the with block closes the file when done)
with open(converted_filename, "rt") as converted_file:
     for line in converted_file:
          print(line)
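If only a few pages matter (the first and fifth, in the example above), one possible variation is to ask pdf2txt.py for just those pages. The sketch below only builds the command; it assumes the `-p` option takes a comma-separated list of page numbers counted from 1, as described in the pdfminer documentation, and `build_pdf2txt_command` is a made-up helper name, not part of pdfminer.

```python
import subprocess

def build_pdf2txt_command(pdf_path, txt_path, pages=None,
                          script="pdfminer-20140328/tools/pdf2txt.py"):
    """Return the argument list for a pdf2txt.py call, optionally limited to some pages."""
    command = ["python", script, "-o", txt_path]
    if pages:
        # Assumption: -p takes comma-separated page numbers, counted from 1.
        command += ["-p", ",".join(str(page) for page in pages)]
    command.append(pdf_path)
    return command

command = build_pdf2txt_command("my_dir/test_file.pdf", "test_file.txt", pages=[1, 5])
print(" ".join(command))
# To actually run the conversion: subprocess.check_call(command)
```

Passing the arguments as a list rather than one formatted string also sidesteps trouble with spaces in file names.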

Now you can see what lines have valuable information for you and exactly where your data of interest is. All you have to do now is parse the data to extract what you want.
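What the parsing step looks like depends entirely on your files, so treat this as a rough sketch rather than a recipe. It assumes each value of interest sits on the same line as a known label, which will not hold for every .pdf (pdfminer often puts table labels and values on separate lines). The labels and sample lines here are invented for illustration.

```python
import re

def extract_labeled_values(lines, labels):
    """Return {label: number} using the first line that starts with each label."""
    found = {}
    for line in lines:
        text = line.strip()
        for label in labels:
            if label in found or not text.startswith(label):
                continue
            # Take the first number that appears after the label text.
            match = re.search(r"-?\d+(?:\.\d+)?", text[len(label):])
            if match:
                found[label] = float(match.group())
    return found

# Invented lines standing in for the contents of a converted .txt file.
sample_lines = ["Report 7\n", "\n", "Total samples: 42\n", "Mean depth 3.5 m\n"]
values = extract_labeled_values(sample_lines, ["Total samples", "Mean depth"])
print(values)
```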

Optional- write extracted data to neat .csv file

Once you make your own script to get the data you want, it’s handy to put this data into a .csv file. For my setup, I iterate through all the .pdf files, saving data to a nested dictionary as I go so each .pdf file name is a key in the dictionary (file1, file2 are the examples of this that I use below).
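The loop described above might look something like the following. This is a sketch under a few assumptions: the .pdf files have already been converted to .txt files sitting in one directory, and `extract` is whatever parsing function you wrote for your own files. `collect_data` and `count_lines` are hypothetical names, and `count_lines` is only a stand-in so the example runs.

```python
import glob
import os
import tempfile

def collect_data(txt_directory, extract):
    """Build {file_stem: extract(lines)} from every .txt file in a directory."""
    data_dict = {}
    for path in sorted(glob.glob(os.path.join(txt_directory, "*.txt"))):
        stem = os.path.splitext(os.path.basename(path))[0]
        with open(path, "rt") as handle:
            data_dict[stem] = extract(handle.readlines())
    return data_dict

def count_lines(lines):
    # Hypothetical stand-in for your own parsing logic.
    return {"line_count": len(lines)}

# Demo with two throwaway files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for name, body in [("file1", "a\nb\n"), ("file2", "c\n")]:
    with open(os.path.join(demo_dir, name + ".txt"), "w") as handle:
        handle.write(body)

data_dict = collect_data(demo_dir, count_lines)
print(data_dict)
```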

When I’m done going through all the files I’m interested in, I write the dictionary of data to a .csv file using the following script.

#scroll side to side so you don't miss cut off text
#the inner keys must match fieldnames exactly, or DictWriter raises a ValueError
#{...} is a placeholder for file2's data, which has the same keys as file1's
data_dict = {'file1': {'column_heading_label_1': 5 , 'column_heading_label_2': 3, 'column_heading_label_3':7} , 'file2': {...} }
name_of_output_file = "extracted_data.csv"
with open(name_of_output_file, 'w') as csvfile:
     fieldnames = ['column_heading_label_1', 'column_heading_label_2', 'column_heading_label_3']
     writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
     writer.writeheader()
     for key in data_dict:
          writer.writerow(data_dict[key])

You should see a .csv file appear in your working directory with the name you assigned it in the variable name_of_output_file. There will be as many rows of data as there are keys in your data_dict. The column labels in this particular example are column_heading_label_1, column_heading_label_2, and column_heading_label_3.

Thus the final table consists of the 3 columns labeled above and 3 rows (one for the header of column names, two for the two .pdf files’ data).
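One thing a table like this does not record is which .pdf each row came from, since the dictionary keys are never written out. If you want that, a possible variation is to add a leading column for the file name. The sketch below is written for Python 3 and writes to an in-memory buffer so it runs on its own; the `file_name` column and the helper name are made up for illustration, and the numbers are invented.

```python
import csv
import io

def write_rows_with_file_names(data_dict, fieldnames, csvfile):
    """Write one row per file, with the file name in a leading 'file_name' column."""
    writer = csv.DictWriter(csvfile, fieldnames=["file_name"] + list(fieldnames))
    writer.writeheader()
    for file_name in sorted(data_dict):
        row = dict(data_dict[file_name])  # copy so the original dict is untouched
        row["file_name"] = file_name
        writer.writerow(row)

# Invented example data in the same nested shape as data_dict.
example_data = {
    "file1": {"label_1": 5, "label_2": 3},
    "file2": {"label_1": 1, "label_2": 2},
}
buffer = io.StringIO()
write_rows_with_file_names(example_data, ["label_1", "label_2"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(name_of_output_file, 'w', newline='')` in place of the buffer.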

Important note-

Though .pdf files might look identical to the naked eye, they can look very different when examined as a .txt file. USE EXTREME CAUTION WHEN EXTRACTING DATA FROM .PDF FILES.
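One cheap safeguard, sketched here as a suggestion rather than anything pdfminer provides, is to check that every file yielded every label you expected before trusting the output. `find_missing_labels` is a hypothetical helper and the data is invented.

```python
def find_missing_labels(data_dict, expected_labels):
    """Return {file_name: [labels that were not extracted]} for incomplete files."""
    problems = {}
    for file_name, values in data_dict.items():
        missing = [label for label in expected_labels if label not in values]
        if missing:
            problems[file_name] = missing
    return problems

# Invented example: file2 is missing one of the expected labels.
extracted = {
    "file1": {"label_1": 5, "label_2": 3},
    "file2": {"label_1": 1},
}
problems = find_missing_labels(extracted, ["label_1", "label_2"])
print(problems)
```

Any file that shows up in the result is worth opening as a .txt file and checking by eye.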

Good luck!