Big Data Cloud
Wednesday, February 11, 2015
Adopting a big data strategy presents four challenges for public sector organizations. This is the first entry of a two-part blog post that identifies those challenges (talent management, interoperability, trust in the data, and cyber infrastructure) and proposes a few solutions to help mitigate the risks they present.

Introduction

This entry is the second in a series of blog posts discussing issues that the government faces with implementing big data. In the first blog post in this series, I described why data should be defined as "big" based on complexity of the data, not volume alone.

In this entry, I address the first two of four major challenges for government in the implementation of big data: talent management and interoperability. In my next blog entry, I address trust and cyber infrastructure.

As part of this research, I spoke with a number of leaders who are helping agencies and organizations navigate these emerging technologies. These leaders included:

  • Frank Baitman, the Chief Information Officer for the U.S. Department of Health and Human Services. He previously served as a White House Entrepreneur-In-Residence at the Food and Drug Administration and the Chief Information Officer for the U.S. Social Security Administration.
  • Wendy Wigen, the Technical Coordinator for the Big Data Senior Steering Group run by the National Coordination Office's Network and Information Technology Research and Development Program.
  • Three subject matter experts from IBM:
    • Andras Szakal, Vice President and Chief Technology Officer for IBM U.S. Federal.
    • Tim Paydos, Director of IBM's World Wide Public Sector Big Data Industry Team.
    • Brian Murrow, a leader in IBM's Business Analytics and Strategy Team.

Challenge 1: Attract and Retain a Workforce with Big Data Skills

Each expert indicated that workforce, also known as talent management, is a challenge that the public sector faces in adopting big data technologies. While the workforce is a significant challenge, not many solutions present themselves. Mr. Baitman surmised that the public sector needs to get creative in compensation to attract and retain a workforce with the requisite skills in data structure, data science, and data analysis. Alternatively, several private sector organizations provide pro bono work, which helps the public sector gain access to these skills. For example, Splunk4Good and DataKind are two groups of data scientists from the private sector working together to contribute to society and the public sector for a greater good. Additionally, organizations can create talent development programs similar to the military's Training With Industry (TWI) fellowship or pay for additional education. (Full disclosure: in return for the skills I add to my analyst toolkit during my year of TWI, I incur an additional commitment in service to the Army, ensuring that the organization benefits from the development of my talent.)

Challenge 2: Interoperability Across Data Sources, Organizations, and Domains

Another significant challenge that organizations face when starting down a path to the higher end of the analytic spectrum (prescriptive and cognitive) and the use of big data analytic tools is interoperability: across data sources, across organizations, and across domains. My research pointed to two ways to address the interoperability challenge: data governance and changes to laws. According to Ms. Wigen, data governance is the foundation for interoperability. Although new analytic tools are capable of handling unstructured data, there is still a need for structure and governance.

A promising application of big data technologies is the merging of data stores across organizations and/or domains, internally and possibly externally. Ms. Wigen provided two examples. First, NASA and the Forest Service coordinate and share data about weather and ground conditions to better assess forest fire risks. Second, in the response to the Deepwater Horizon oil spill, analysts integrated weather data, oceanography data, and plant data to determine where to send clean-up crews. Data stores from different elements of an organization may follow multiple data structures, but all of the data is related. When organizations merge data with multiple data structures, they essentially have unstructured or, at best, semi-structured data. The technologies created for big data make this merge possible because they can handle unstructured data.
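To make the idea of merging related data with different structures more concrete, here is a minimal Python sketch. The record layouts, field names, and values are hypothetical and do not come from NASA, the Forest Service, or any real system. It assumes only that each source can identify a region and a date, and it keeps everything else as source-specific attributes, which yields a semi-structured result.

```python
from collections import defaultdict

# Hypothetical records from two sources with different field names.
weather_records = [
    {"region": "R1", "date": "2015-02-01", "humidity_pct": 12, "wind_mph": 25},
    {"region": "R2", "date": "2015-02-01", "humidity_pct": 55, "wind_mph": 5},
]
ground_records = [
    {"area": "R1", "observed": "2015-02-01", "fuel_moisture_pct": 6},
    {"area": "R3", "observed": "2015-02-01", "fuel_moisture_pct": 20},
]


def normalize(record, source, region_key, date_key):
    """Wrap a record in a shared envelope (region, date, source).

    All other fields are kept untouched under "attributes", so each
    source keeps its own structure.
    """
    attributes = {
        key: value
        for key, value in record.items()
        if key not in (region_key, date_key)
    }
    return {
        "region": record[region_key],
        "date": record[date_key],
        "source": source,
        "attributes": attributes,
    }


def merge(*batches):
    """Group normalized records by (region, date), keyed by source."""
    merged = defaultdict(dict)
    for batch in batches:
        for rec in batch:
            merged[(rec["region"], rec["date"])][rec["source"]] = rec["attributes"]
    return dict(merged)


combined = merge(
    [normalize(r, "weather", "region", "date") for r in weather_records],
    [normalize(r, "ground", "area", "observed") for r in ground_records],
)
```

Note that the sketch only works because both sources agree on what a region and a date mean. That agreement is a governance decision, not something the tooling supplies.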

Solution 1: Up-front Agreements to Manage Expectations
As organizations begin to build big data stores, Mr. Paydos noted it is critical to have discussions and agreements about data governance up front to manage expectations. He has seen instances where the original provider of a dataset no longer needs it and stops producing it, unaware that another consumer has come to depend on that data. Ms. Wigen shared that the NASA and Forest Service effort required a lot of coordination on data requirements and data governance ahead of time. They hosted a number of working groups to understand the problem based on history and combined that understanding with knowledge about the available data.

Mr. Murrow added that he believes data governance is a task that an organization should contract out. A data governance program includes definitions, usage agreements, and database modeling. Mr. Murrow pointed out that the development of a data governance program is a surge requirement, not an enduring one, so public sector organizations do not need to build this capability in house. Instead, organizations should use experts in data governance to develop a transparent, enterprise solution. Mr. Baitman disagreed, saying that government has the responsibility to protect the data and cannot outsource that responsibility. Perhaps a blended approach would be best: initially bring in subject matter experts to consult and help design the data management program, then train and staff a data governance maintenance program from within the organization.
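The producer and consumer dependency problem Mr. Paydos described can be illustrated with a small Python sketch. This is only an illustration under my own assumptions: the class names, dataset name, and agency names are hypothetical, and a real usage agreement would cover far more than a list of consumers. The point is simply that a producer cannot know who depends on its data unless that dependency is recorded somewhere.

```python
from dataclasses import dataclass, field


@dataclass
class SharingAgreement:
    """Hypothetical record of who produces and who consumes a dataset."""
    dataset: str
    producer: str
    consumers: set = field(default_factory=set)


class AgreementRegistry:
    """Minimal in-memory registry of data-sharing agreements."""

    def __init__(self):
        self._agreements = {}

    def register(self, dataset, producer):
        self._agreements[dataset] = SharingAgreement(dataset, producer)

    def subscribe(self, dataset, consumer):
        self._agreements[dataset].consumers.add(consumer)

    def consumers_to_notify(self, dataset):
        """Return the consumers a producer should consult before it
        stops producing the dataset."""
        return sorted(self._agreements[dataset].consumers)


registry = AgreementRegistry()
registry.register("ground_conditions", producer="Agency A")
registry.subscribe("ground_conditions", consumer="Agency B")
registry.subscribe("ground_conditions", consumer="Agency C")

affected = registry.consumers_to_notify("ground_conditions")
```

Here `affected` lists the consumers that would be surprised if Agency A quietly stopped producing the dataset.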

Solution 2: Implement a Data Governance Program
Once the agreements are in place, Mr. Paydos stated that one of the best methods for implementing a data governance program is metadata management. This technique uses metadata, or data that describes the data, to control the structure of the data warehouse and to control access. Mr. Baitman pointed out that data governance is something organizations have to continue to build upon and learn as they go, since there is no perfect answer for every organization. Ms. Wigen added that a data governance program establishes a common set of data definitions and standards for how data is collected and stored. Data governance is critical to data being reusable and valuable. However, establishing a data governance program is not trivial; it requires humans in the loop, namely data scientists and subject matter experts. This interoperability challenge highlights again the significant challenge of developing the workforce. In Ms. Wigen's steering group, Google shared that there are thousands of people behind the scenes working on data management, metadata, and linguistics to get accurate search results.
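As a rough illustration of metadata management, the Python sketch below uses a small metadata catalog to control both structure (which fields exist and what type they hold) and access (which roles may see each field). The catalog entries, field names, and roles are hypothetical, and this is not a description of any IBM product or agency system. It is only a sketch of the general technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Metadata describing one field: its definition, type, and access."""
    name: str
    dtype: type
    description: str
    allowed_roles: frozenset


# Hypothetical catalog: the shared data definitions for one data store.
CATALOG = {
    "patient_id": FieldMetadata(
        "patient_id", str, "Internal patient identifier",
        frozenset({"clinician"}),
    ),
    "zip_code": FieldMetadata(
        "zip_code", str, "Five-digit postal code of residence",
        frozenset({"clinician", "analyst"}),
    ),
    "visit_count": FieldMetadata(
        "visit_count", int, "Number of visits in the reporting period",
        frozenset({"clinician", "analyst"}),
    ),
}


def validate(record):
    """Check a record against the catalog and return a list of problems."""
    errors = []
    for name, value in record.items():
        meta = CATALOG.get(name)
        if meta is None:
            errors.append(f"{name}: not defined in catalog")
        elif not isinstance(value, meta.dtype):
            errors.append(f"{name}: expected {meta.dtype.__name__}")
    return errors


def view_for(record, role):
    """Return only the cataloged fields that the given role may see."""
    return {
        name: value
        for name, value in record.items()
        if name in CATALOG and role in CATALOG[name].allowed_roles
    }


record = {"patient_id": "p-001", "zip_code": "20001", "visit_count": 3}
analyst_view = view_for(record, "analyst")
```

In this sketch, changing a definition or an access rule means editing the catalog rather than the code that reads the data. Deciding what belongs in the catalog still requires the data scientists and subject matter experts mentioned above.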

Solution 3: Update Laws
The final solution to the interoperability challenge is to update laws to reflect advances in technology and data security. Mr. Baitman highlighted that some of the biggest challenges in opening data stores and sharing health research between organizations and domains are the laws in place that prevent sharing data. He stated that the Affordable Care Act made huge strides in breaking down some of the legal barriers to helping doctors better treat patients. Other changes in legislation have enabled doctors and researchers to more effectively treat and cure diseases. He believes that laws are too restrictive and that a set of guiding principles would be more conducive to opening data stores and increasing interoperability. For example, President Obama signed an Executive Order in May 2013 as part of the Open Government Initiative that led to the opening of many data stores, which are now available on www.data.gov. The datasets now available to the public are enabling and forcing data scientists and analysts to learn how to make them interact in ways that increase the value of data that used to be locked up in silos behind firewalls.

Prelude to Future Big Data Blogs

In future Big Data Blog Series, I will cover:

***The ideas and opinions presented in this paper are those of the author and do not represent an official statement by IBM, the U.S. Department of Defense, U.S. Army, or other government entity.***