
Study: PDF, HTML files dominate Data.gov

Data.gov relies heavily on HTML and PDF files, leading two George Mason University researchers to question whether the federal government’s data repository is achieving what it set out to accomplish.

By Greg Otto

June 13, 2016


In a paper published in April, Anne Washington and David Morar of George Mason’s School of Policy, Government and International Affairs combed through the entire Data.gov catalog to determine which file formats were available and whether those formats were the most convenient for the intended audience.

The researchers measured files hosted on Data.gov against the five-star open data scheme advocated by internet pioneer Tim Berners-Lee. The system ranks data stored in PDF and HTML formats at the lower end of the scale. Four- and five-star data formats, those that can be linked together much as URLs are hyperlinked across the internet, sit at the upper end.

Washington and Morar modified the five-star structure to account for the number of files on Data.gov posted in obscure file formats. These files, typically in formats used by word processing or mapping programs, were given zero stars. Unstructured formats, such as HTML or PDF, received one star. Proprietary files, such as Microsoft Word or Excel files, were given two stars. Structured machine-readable formats, such as XML or CSV files, were given three stars. Files that contained uniform resource identifiers were ranked the highest, with four stars.
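The modified scale described above amounts to a lookup from file format to star count. The following is a minimal sketch of that idea, not the researchers' actual method: the format labels, the `STAR_RATINGS` table and the `rate_format` helper are all hypothetical, and treating RDF as a four-star format is an assumption based on it being a URI-based format.

```python
# Hypothetical lookup table based on the modified scale described in the
# article. The format labels are illustrative, not the study's categories.
STAR_RATINGS = {
    "html": 1,  # unstructured
    "pdf": 1,   # unstructured
    "doc": 2,   # proprietary (Microsoft Word)
    "xls": 2,   # proprietary (Microsoft Excel)
    "xml": 3,   # structured, machine-readable
    "csv": 3,   # structured, machine-readable
    "rdf": 4,   # assumption: a format built on uniform resource identifiers
}


def rate_format(file_format: str) -> int:
    """Return the star rating for a format label.

    Assumption: any format missing from the table is treated as an
    obscure format and gets zero stars.
    """
    return STAR_RATINGS.get(file_format.strip().lower(), 0)


if __name__ == "__main__":
    for label in ["PDF", "csv", "xls", "rdf", "shp"]:
        print(label, rate_format(label))
```

A lookup table keeps the rule set easy to audit. A real classifier would also have to decide how to handle files whose declared format does not match their contents.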


Researchers found that of the 244,000 files on Data.gov, more than 30 percent (77,217) are posted in HTML. The second-most popular file format is XML, at 17 percent (42,846). PDFs came in third at 14 percent (34,381), while two lesser-known file formats — ODF and Octet Stream — rounded out the top five.

More than 60 percent of Data.gov’s files were given a one-star rating. Formats that earned three stars — meaning the files are open and machine-readable — finished second, with 23 percent of all Data.gov files falling into this category.

Only 18,347 files — 7 percent — were found to meet the four-star criteria.
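The percentages above can be checked against the reported counts. This is a rough sketch that assumes the 244,000 total is a rounded figure, so the computed shares are approximate. The `share` helper and the dictionary keys are illustrative names, not terms from the study.

```python
# Counts as reported in the article. The total is a rounded figure,
# so the shares computed here are approximations.
TOTAL_FILES = 244_000

counts = {
    "HTML": 77_217,
    "XML": 42_846,
    "PDF": 34_381,
    "four-star": 18_347,
}


def share(count: int, total: int = TOTAL_FILES) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


# Round to one decimal place for comparison with the article's figures.
shares = {name: round(share(n), 1) for name, n in counts.items()}

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
```

Under that assumption the shares come out to roughly 31.6, 17.6, 14.1 and 7.5 percent. That matches the article's "more than 30 percent" and suggests its other figures were rounded down to whole numbers.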

The study’s authors found that agencies have embraced publishing information to Data.gov in formats that a wide swath of the public can use. However, the study points out that the government may be more focused on informing the “English-literate public than the data literate who want machine-readable information.”

“If the goal of open government data is machine readable structured file, there may be a legitimate concern about the large number of PDF and HTML files,” the report reads. “The innovators and the data entrepreneurs expect structure machine-readable data.”

Congress is pushing for machine-readable data to be the government’s default format. In April, groups in the House and Senate introduced a bill that calls on agencies to create an inventory of all enterprise data, determine what can be released publicly, and post it with open licenses and in machine-readable formats.

The authors also conclude that the government is going to have to decide how to reach both average users and techies alike.

“Governments attempt to satisfy both the average user, with simple accessible formats, and the sophisticated data consumer, with structured machine-readable formats,” the report reads. “Open government data has established an important pattern of considering both the least and the most sophisticated users. This study suggests that we need a broader conversation about who the data audience will be in the context of open government.”

You can download the full study here.

