Part 1: What is Big Data?

In healthcare today, it seems that a big data perfect storm has arisen out of a confluence of factors:

  • Healthcare in the United States is entering a ‘post-EMR’ phase of deployment.  Organizations are now keen to gain insights, and to institute organizational change, from the vast amounts of data being collected in their EMR systems.
  • Participants in the healthcare ecosystem are trying to reduce the cost and improve the quality of care by applying advanced analytics to both internally and externally generated data.
  • Technological advances enable larger volumes of structured and unstructured data to be managed and analyzed through faster, more efficient and cheaper computing (processors, storage, and advanced software) and through pervasive computing (telecomputing, mobile devices and sensors).

Though almost all can agree that healthcare has big data challenges, there is no agreement on what makes a data challenge a big data challenge.

The term ‘big data’ has become extremely common in everyday business vocabulary.  It seems that every industry is focused on implementing big data strategies and projects.  However, when you read articles about big data, it becomes clear that there are varying definitions of big data.  This looseness leads to considerable confusion about what constitutes big data and how organizations should address their big data challenges. 

As a sweeping generalization, there are two definitions of big data: a narrower definition and a broader definition. The narrower definition focuses on data tasks that conventional structured relational databases cannot easily handle.  This narrower definition, generally favored by the technical community, frequently includes some discussion of Hadoop, MapReduce, NoSQL or similar technologies that can be employed to solve the problem of making large volumes of unstructured data useful for analytical purposes.

[Image 1: What is Big Data, Brial Gaffney, 03-2014]

The broader definition of big data, favored by the press and general public, encompasses data-intensive efforts irrespective of whether the data are structured or not.  In fairness, the narrower definition is likely the more correct one.  However, both interpretations have meaning and relevance to HIMSS members.  We do not propose to resolve the debate over what exactly constitutes big data.  Instead, we intend to provide insights into the meanings of the two definitions and how, regardless of definition, big data has relevance in the healthcare industry.

The wider, more generally used definition of big data does not have to rely on the newer, non-relational database technologies.  As the term becomes more commonly used, the meaning of big data seems to morph beyond the stricter technical definition to include any large-scale data analytics effort.  Those communicating from this perspective are not necessarily concerned with whether newer technologies such as Hadoop and NoSQL databases are being used; they are focused instead on leveraging large amounts of data, or on real-time access and response to those data.

Many media articles do not discuss the use of these newer technologies.  This is not a criticism but rather reinforces the notion that the general media does not typically delve into the nuances of big data. Instead, it focuses on the realization that there is tremendous value in the volumes of data that are available today.

Consider an organization with a relational database covering 10 million patients; it might mine those terabytes of data by applying analytical tools to fields in the database to predict readmissions.  The technical solution relies on nothing other than conventional techniques, but it still fits the generalized notion of a big data problem.  Similarly, as point-of-care monitoring devices come online in more healthcare delivery settings, the transmission, processing, storage, and access to those data also fall under the concept of a big data problem. In this case, it is the near-real-time analysis of the data that might provide early warning of adverse outcomes.
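The conventional flavor of this problem can be sketched with ordinary SQL against a relational store.  The sketch below uses Python's built-in sqlite3 module as a stand-in for a production data warehouse; the table name, column names, and the simple "more than one admission" rule are all hypothetical, and a real readmission model would apply a trained predictor rather than a hand-written threshold.

```python
import sqlite3

# In-memory database standing in for a conventional patient data warehouse.
# Schema and rows are illustrative only, not from any real system.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE admissions (
        patient_id     INTEGER,
        admit_date     TEXT,
        length_of_stay INTEGER,
        diagnosis_code TEXT
    )
""")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?, ?)",
    [
        (1, "2014-01-05", 3, "I50"),  # heart failure admission
        (1, "2014-01-20", 5, "I50"),  # same patient, readmitted
        (2, "2014-02-10", 2, "J18"),  # single admission
    ],
)

# A purely conventional query: flag patients with repeat admissions
# for readmission-risk review.  No non-relational technology required.
flagged = conn.execute("""
    SELECT patient_id, COUNT(*) AS n_admits
    FROM admissions
    GROUP BY patient_id
    HAVING COUNT(*) > 1
""").fetchall()
print(flagged)  # [(1, 2)]
```

The point of the sketch is that everything here is standard relational tooling, yet scaled to millions of patients it is exactly the kind of effort the broader definition calls a big data problem.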

For some HIMSS members, the wider definition of big data is nothing new and they have had data intensive efforts in place for decades that rely on sophisticated data warehouses and related technology.  For these organizations, “big data” is already part of the fabric of their organizational infrastructure.

For other, smaller HIMSS organizations, the journey towards leveraging big data has just begun.  These organizations can still benefit from the wider view of big data by instituting data focused efforts.  Some of those efforts might rely on the time tested relational and star schema database (i.e., data warehouses) approaches while others might necessitate the use of newer technologies like Hadoop.

The other, narrower interpretation of big data applies when massive amounts of both structured and unstructured data are available and need to be mined and analyzed. Sometimes the data may even reside in multiple locations, so that the computing needs to be taken to the data instead of the other way around.  In these scenarios there are sets of information processing challenges that stretch the boundaries of conventional data processing and analysis techniques.   Under this definition, simply having a large relational database with millions of patients does not automatically create a big data problem. The big data problem arises from the need to tag and analyze large volumes of unstructured data that may be distributed across multiple locations.

While there are many technologies that can be used to support and perform advanced analytical information processing, one of the technologies most frequently cited in big data discussions is Apache Hadoop.  Hadoop is a fairly large collection of software that includes a distributed file system. Hadoop “is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”  Hadoop provides various software components within this framework so that users can build out scalable software systems.  These scalable systems are well suited to tackling information challenges that require massive parallel computing to crunch through extremely large amounts of data.

The Hadoop framework allows streams of data to be processed and analyzed in parallel, reducing the time to solve analytics problems.  For example, consider again the problem of understanding the information in clinical notes.  Streams of unstructured data need to be rapidly interpreted and tagged for potentially relevant topics.  Breaking these textual streams into multiple chunks and processing them in parallel makes perfect sense.  In some settings, where the amount of data is large and the turnaround time needs to be fast, Hadoop or other parallel technologies are the only current way to solve the challenge.
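The map-and-reduce pattern that Hadoop popularized can be sketched in a few lines of plain Python.  The clinical terms and sample note fragments below are hypothetical; a real deployment would run the same two phases across a cluster of machines rather than local worker processes, but the shape of the computation is the same: break the text stream into chunks, tag each chunk independently, then merge the results.

```python
from collections import Counter
from multiprocessing import Pool

# Hypothetical clinical terms to tag in free-text notes.
TERMS = {"dyspnea", "edema", "fever"}

def map_chunk(text):
    """Map phase: tag one chunk of note text, emitting per-term counts."""
    words = text.lower().split()
    return Counter(w for w in words if w in TERMS)

def reduce_counts(partials):
    """Reduce phase: merge per-chunk counts into one overall tally."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # A stream of unstructured notes, broken into chunks for parallel work.
    chunks = [
        "patient reports dyspnea on exertion",
        "mild edema noted no fever",
        "fever resolved dyspnea improving",
    ]
    # Each chunk is tagged independently, so the map phase parallelizes.
    with Pool(2) as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))
```

Because each chunk is processed with no knowledge of the others, adding machines (or processes) shortens the map phase almost linearly, which is precisely why this pattern suits large, fast-turnaround text-tagging workloads.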

While Hadoop itself is an open source platform provided by Apache, there are many vendors now providing commercialized Hadoop solutions targeted to both specific industries and specific software challenges.  To further blur the definition of big data, traditional systems vendors are now enabling their relational databases and other familiar tools to run on top of Hadoop.

Another way of defining big data, which admittedly is not terribly precise, is based on the three (and sometimes four) V’s:

  • Volume: Very large amounts of data, usually from terabytes to petabytes and beyond
  • Variety: Structured and unstructured data
  • Velocity: Data that is created at a rapid pace and analyzed in near real-time
  • And sometimes Veracity: Data that may not be fully accurate or trustworthy

These four V’s have been used to classify an information challenge as a true big data problem.  For example, considering again the clinical notes example in healthcare, one can easily see how this sort of information challenge could push the boundaries of the four V’s and become a big data problem.

The definition of big data can be confusing.  Whether one defines big data from the narrow, technical perspective or the wider, more generally accepted one, the relevance of big data to healthcare organizations lies in the many ways that data can be leveraged to improve the state of healthcare. For HIMSS members, an understanding of the non-relational, big data technologies can be helpful when thinking about how to address business and clinical challenges.  Not every organization is ready to develop true big data solutions, but every organization can benefit from a better focus on data management and analysis initiatives.

What constitutes big data today may not constitute big data tomorrow.  As technology improves, computational problems that currently require massive amounts of parallel storage and processor power often become problems that can be solved with simpler and less costly hardware/software combinations.  Just as the supercomputers of 20 years ago are now matched by the personal computers of today, the big data problems of today will be matched by conventional computer systems in the years to come.
