In our increasingly fast-paced world, the buzzword du jour, “big data,” looks like the winner for the fastest transition from a cool, new, vacuous, poorly defined concept to mainstream acceptance and implementation.
A year or two ago, if you had mentioned Hadoop, listeners would have thought you had the hiccups (or had just sneezed). Now even the non-Technorati use the term in casual conversation.
This became clear even to me when over the last month:
- I was a ‘table leader’ (meaning I coordinated the conversation at one of some 10-12 tables) at an ACT-IAC Congressional Education Working Group at the Rayburn Office Building discussing the use of big data and its legislative policy implications
- I appeared twice on radio shows to talk about my top three news stories of the week relating to the federal government (especially IT, in my case), and each time I selected an article about big data
So, when confronted with a new trend, thought, idea, movie, or playoff baseball game, I did what any loyal fan of the Big Bang Theory would do: I decided to write a blog entry.
In particular, I wanted to summarize what big data is, why it has become important now, and what some of the major implications are for the federal space.
What is it?
Alistair Croll, writing in Forbes in late August, sets the stage in “Big Data Grows Up: Three Spaces to Watch Once the Hype Subsides.”
Croll notes that everything seems to be labeled “big data” since:
- Everything is on the Internet
- The Internet has a lot of data
- Therefore, everything is big data
As with most humor, there is an element of truth in this summary. Before going into some of the whys, intuitively we can see that the amount of data that has become accessible is overwhelming.
On one of the radio shows, during a break, a question came up about a legal detail associated with qualified and unqualified financial data. Everyone's first reaction was to reach for a phone and look up the definition. It is remarkable that almost all of us assume that on any question or open issue, at any time of day or night, from (almost) any location, it is possible to just access the Internet and find at least some aspects of the answer.
Big data can be looked at from two different perspectives, depending on the audience: technology or usage.
In large part, my interests are focused on usage. However, we should not ignore the fact that there are important technical hurdles that must be overcome in order to be successful at utilizing big data.
To the technology strategist, processing a trillion records needs different kinds of software and a different approach than processing a million (or even a billion) records. Having enough physical storage, determining how to index the data, and dealing with large volumes of unstructured data coming from multiple sources are all hard problems. Even figuring out how to move or access these enormous data sources from one place to another is a difficult task.
From a usage perspective, big data allows us to move, as Croll notes, borrowing a phrase from Donald Rumsfeld, from studying known unknowns to unknown unknowns. This is a very dramatic change.
To explain, historical business intelligence or analytics has been the study of discovering the unknown aspects of an issue we wanted to understand better. For example, what kinds of customers bought our product and what patterns of purchase could I use to optimize my supply chain? Or, what was the efficiency of my grant program? How long did it take from grant application to grant provisioning?
We now have the opportunity, using new forms of analytics, to discover patterns, and thus questions (and associated answers), that we were unaware even existed: the unknown unknowns.
In my opinion, there are three broad reasons why big data has become so important.
First, of course, is the ever present Internet. As I mentioned earlier, we assume these days that all information is available all the time. This is increasingly an accurate assumption.
There is a great graphic from Domo, created in June 2012 and thus probably already outdated, but here are a few of its ‘every minute’ figures:
- Google receives over 2 million search queries
- Facebook users share 684,478 pieces of content
- Twitter users send over 100,000 tweets
- YouTube users upload 48 hours of new video
- 571 new websites are created
- 204,166,667 emails are sent (it feels like most of them to me)
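To get a feel for the scale, the per-minute figures above can be multiplied out to a full day. This is only a rough sketch: it assumes each rate holds steady around the clock, and it treats the "over" figures as lower bounds. The names `PER_MINUTE` and `per_day` are my own, chosen for illustration.

```python
# Per-minute figures as quoted above from the Domo graphic (June 2012).
# Figures stated as "over N" are treated as N, so results are lower bounds.
PER_MINUTE = {
    "Google search queries": 2_000_000,
    "Facebook pieces of content shared": 684_478,
    "Tweets sent": 100_000,
    "Hours of YouTube video uploaded": 48,
    "New websites created": 571,
    "Emails sent": 204_166_667,
}

MINUTES_PER_DAY = 24 * 60


def per_day(per_minute):
    """Scale a per-minute rate to a per-day total, assuming a constant rate."""
    return per_minute * MINUTES_PER_DAY


for name, rate in PER_MINUTE.items():
    print(f"{name}: {per_day(rate):,} per day")
```

Under that constant-rate assumption, the email figure alone works out to roughly 294 billion messages a day.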
Second is the growth of the Internet of Things (IoT). Originally, most of the traffic on the Internet was generated by people.
Even if people use devices that generate a lot of data, like cameras, there is still a limit to the amount of data that people can create. Increasingly, however, the sources (and the recipients) of the data are automated devices of one sort or another:
- Health sensors
- Traffic and other cameras in public spaces
- Electricity and other energy usage sensors
- Intelligent grid, river, or transportation sensors
- Did I mention sensors?
The point is that almost everything is becoming an Internet-connected generator of information about itself; the key words are ‘generator’ and ‘connected.’ As we have learned over the last few years, the ability to ‘mash’ together multiple data sets can produce unexpected and very useful results.
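For readers who have not seen a ‘mash-up’ in practice, here is a minimal sketch of the idea: two data sets collected independently are joined on a shared key. Everything in it is hypothetical. The neighborhood names, the readings, and the `mash` function are invented purely for illustration and do not come from any real sensor feed.

```python
# Hypothetical sensor summaries keyed by neighborhood.
# All names and values are invented for illustration only.
traffic_counts = {"Downtown": 1200, "Riverside": 300, "Uptown": 800}
energy_use_kwh = {"Downtown": 5400, "Riverside": 2100, "Harbor": 1500}


def mash(left, right):
    """Join two data sets on the keys they have in common."""
    shared = sorted(left.keys() & right.keys())
    return {key: (left[key], right[key]) for key in shared}


combined = mash(traffic_counts, energy_use_kwh)
for place, (vehicles, kwh) in combined.items():
    print(f"{place}: {vehicles} vehicles, {kwh} kWh")
```

Only the neighborhoods present in both data sets survive the join, which is one reason shared identifiers matter so much when combining sources.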
Third, people are much more connected to the Internet through mobile devices and thus have access to all of this information.
Looking at mobile phone use around the world is pretty interesting.
There are over 7 billion people in the world. There are almost 6 billion mobile cellular subscriptions and over 1 billion mobile broadband (smartphone or tablet) subscriptions. Even allowing for multiple subscriptions per person in some cases, this still means that a majority of the people in the world have cell phones, and close to 15% have a smartphone. In the U.S., close to 100% have cell phones and close to 30% have smartphones.
These are remarkable statistics. Increasingly the way people access the Internet and receive information is through their mobile devices.
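The worldwide percentages quoted above come from simple division, sketched here with the rounded figures from the text. Because some people hold more than one subscription, subscriptions per 100 people overstates the share of people who actually own a device, so these numbers are best read as upper bounds. The constant and function names are my own.

```python
# Rounded figures as quoted in the text.
WORLD_POPULATION = 7_000_000_000
MOBILE_SUBSCRIPTIONS = 6_000_000_000
MOBILE_BROADBAND_SUBSCRIPTIONS = 1_000_000_000


def subscriptions_per_100_people(subscriptions, population):
    """Subscriptions per 100 people; an upper bound on the share of people covered."""
    return 100 * subscriptions / population


cellular = subscriptions_per_100_people(MOBILE_SUBSCRIPTIONS, WORLD_POPULATION)
broadband = subscriptions_per_100_people(
    MOBILE_BROADBAND_SUBSCRIPTIONS, WORLD_POPULATION
)

print(f"Mobile cellular: {cellular:.1f} per 100 people")
print(f"Mobile broadband: {broadband:.1f} per 100 people")
```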
My experience in the federal government is that there is a constant push and pull between improving how well a mission goal is implemented, let us call that efficiency, and achieving specific program goals, let us call that effectiveness.
Too much emphasis, in my opinion, is placed on optimizing efficiency, e.g., cost reductions, albeit for understandable reasons. Improving efficiency is generally easy to measure and has fewer political implications. However, the reality is that if IT professionals want to generate support, the most powerful way of doing that is to improve effectiveness in a demonstrable fashion.
It is for this reason that big data has resonated so quickly within the federal government. While there may be disagreement on many aspects of government performance, almost everyone would agree that one of the functions of government is to collect data, for example Census or Labor statistics, analyze that data and disseminate the results. Big data touches all of these bases.
I would note two final points.
First, as far as I can tell, and it is possible that I am being unfair, program performance measurements have not been a concept that has generated much acceptance. As big data efforts become more mature, the ability to perform these measurements and report the results will increase with the potential for conflicted political implications.
Second, some initial key steps are needed at the data infrastructure level in order to really utilize big data. The Office of Management and Budget should consider increasing its emphasis on data standards, such as the National Information Exchange Model, which allows data to be shared more easily across federal, state and local governments.
Optimizing federal use of big data will take political will from OMB and the political leadership within agencies and departments, both to support program measurements including finding the unknown unknowns, and to support data standardization as a strong foundation.