PDF Sharing is Not Data Sharing–A Public Service Announcement

by Brian Timoney

 

“More tears are shed over answered prayers than unanswered ones.”

– Truman Capote

 

Like many readers of this blog, I was heartened by last week’s Executive Order from the President of the United States declaring “Making Open and Machine Readable the New Default for Government Information.”  Finally, tangible progress on the rocky road leading to Transparency, Accountability, and Economic Benefit.

But as I finished said Executive Order, my heart sank. For it didn’t include an explicit disqualification of the PDF format as meeting the “machine readable” threshold.

No criminal classifications either.

Nor mandatory jail sentences.

In a word: Weak.

‘The Heart of Nerd Darkness’

 

If you’ve never personally suffered the travails of extracting usable data from a PDF, I strongly recommend Jeremy Merrill’s wrenching first-person account of compiling usable, ready-for-analysis data from some 2 million records trapped inside PDFs. Similarly painful experiences led to Caitlin Rivers’ post “‘Send Me Your Data–PDF is Fine,’ Said No One Ever…” which includes helpful guidelines for would-be data publishers.

How bad is the problem? People have admitted online to reading PDF data aloud and having a colleague key the data into a spreadsheet.

This is not the world Gutenberg intended.

Sins of Omission, Sins of Commission

With PDFs fulfilling the deep human need to impose one’s print layout on others, we’re tempted to forgive their progenitors for they know not what they do.

And then there’s Orange County, California.

In a long-running case against the Sierra Club over the county’s right to charge $475,000 for its parcel database, the county defended the access granted to citizens by observing that information about each of the county’s 640,000 parcels was available as a freely downloadable PDF.

Lovely.

Who among the GIS crowd has not re-digitized features from a PDF while internally raging with the knowledge that the source data already exists in digital form?

Exactly.  What makes the PDF so infuriating is what makes it so beloved among the passive-aggressive set dedicated to being only semi-helpful.

 

The Scanned Image of Data Inside a PDF

The horror.

The horror.

 

 

—Brian Timoney

 
 
UPDATE: Steve Romalewski blogged about his experience with NYC open data published as PDFs in 2011.

* “PDF Sharing is Not Data Sharing” was the title of a talk given by Victoria Smith-Campbell at the 2013 DRCOG Regional Data Summit wherein she recounted her vast experience re-tracing wildfire boundaries from PDFs so as to enable re-distribution as KML.

 

Shadow photo courtesy of antonychammond’s Flickr stream