Data Science

Data Lakes: Insights on Diving Into Raw Data

At a time when information is gold, the value of raw data should not be understated. But with so much of it in existence, where does it all go?

That’s where data lakes come in. They are cost-effective storage repositories that hold structured and unstructured data from a variety of sources, allowing multiple points of access and collection while preserving the original raw data for future use. Beyond collecting and storing data, ensuring that it is secure and of high quality is critical.

“The progress we have made into data exchange has led to the creation of data lakes by organizations, repositories of aggregated data that can be leveraged and analyzed to inform care, improve processes and lower costs,” said Katie Crenshaw, senior manager, informatics, HIMSS. “When people thought of interoperability in the past, they thought of transactional exchange, or one system exchanging one document with another. This was often in a more traditional clinical environment. What we are seeing today is something more like a lake of data, where data sources—such as EHRs and public health data—are making information available for access by the right people.”

The first implementations were created to handle web data for organizations like Google. The approach eventually extended to other types of big data, such as clickstream data, server logs, social media, geolocation coordinates, and machine and sensor data.

Using this vast amount of data benefited web data organizations by increasing speed and quality of search and providing more detailed information to inform advertising and audience analysis.

The Potential of Raw Data

Many organizations investing in enterprise data warehouses have begun creating data lakes to complement warehouses. Businesses dive into the lakes to unveil market trends, patterns and customer insights. Though they bring their own set of challenges, data lakes are filled with potential to help drive interoperability and quality information to support decision-making.

“This approach to data really allows the industry to embrace the concept of health and not just healthcare,” said Bronwen Huron, senior manager, informatics, HIMSS. “New stakeholders are able to access this data and add to it, making the overall system smarter and more robust.”

It also has immense research potential, Huron noted. “A challenge for research today is finding large enough patient populations that meet specific criteria but whose data can be de-identified. If you are selecting for a specific genetic condition in a specific region, you may have only a few patients from whom to select. By accessing a lake of data and only pulling the demographics and data you need, you may be able to find many more people who meet your research criteria and strengthen your findings.”
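The idea of pulling only the demographics and data you need can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `PatientRecord` fields, the `select_cohort` helper and the sample values are hypothetical and do not come from HIMSS or any real system. Leaving out direct identifiers such as names is only one step toward de-identification, not a complete method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    # Hypothetical record layout, invented for this example.
    patient_id: str
    name: str
    region: str
    condition: str
    birth_year: int


def select_cohort(records, condition, region,
                  fields=("region", "condition", "birth_year")):
    """Return only the requested fields for records matching the criteria.

    Direct identifiers (patient_id, name) are excluded by default.
    """
    return [
        {field: getattr(record, field) for field in fields}
        for record in records
        if record.condition == condition and record.region == region
    ]


# Made-up sample data standing in for records held in a data lake.
sample_records = [
    PatientRecord("p1", "Alex", "north", "condition_a", 1980),
    PatientRecord("p2", "Sam", "south", "condition_a", 1975),
    PatientRecord("p3", "Jo", "north", "condition_b", 1990),
    PatientRecord("p4", "Lee", "north", "condition_a", 1968),
]

cohort = select_cohort(sample_records, condition="condition_a", region="north")
```

Here `cohort` holds two rows, each limited to region, condition and birth year. The raw records stay untouched in the lake, so a different study could pull a different slice from the same source.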

Machine learning and AI are also positioned to benefit greatly from data lakes, Crenshaw added. “AI and machine learning solutions require large amounts of data to optimize the information or role they provide, so having access to them will push [both] to new capability levels.”

Getting the Most From Your Raw Data

In an interview with HIMSS TV, Amy Compton-Phillips, executive vice president and chief clinical officer at Providence, emphasized the importance of recruiting the right talent to manage data effectively. “We do have a data lake—so we dump all of our data, structured, unstructured, into a big pool of information,” she explained. “And then we have really skilled people that pull that out in usable chunks as a way to provide information. We were surprised that there was very little correlation between cost and quality and that even the best clinicians had opportunity to get better.”

“Having the information staged altogether… is a huge step forward, but that information is an amorphous blob unless you have somebody skilled to be able to query it and then structure it,” said Compton-Phillips. “And so I think talent is critical to get that structuring.” This is why, now more than ever, data scientists play a critical role in care delivery.

Ensuring Your Data’s Quality

While organizations are seeing the benefits of utilizing this storage format, opportunity remains to improve the quality and secure sharing of the data within.

“If you imagine this as a literal lake, it is supplied by streams,” said Huron. “If those streams are dirty, they can contaminate the lake and make the water murky and hard to see through. If the data that is added to this lake is inaccurate or outdated, it will be important for systems and users to be able to date and source this information to better understand how to contextualize it.”

“We need to be able to ask ourselves, is this data directly from the patient or provider? Or is this data that was based on other data that was old or incomplete and now duplicated in the record?”
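One way to picture dating and sourcing information is to attach provenance to every value as it enters the lake. The Python sketch below is a hypothetical illustration only: the `Observation` fields, the source labels and the one-year staleness threshold are assumptions chosen for the example, not an established standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Observation:
    # Hypothetical provenance-tagged value, invented for this example.
    value: str
    source: str                         # assumed labels: "patient", "provider", "derived"
    recorded_on: date                   # when the value was captured
    derived_from: Optional[str] = None  # id of the upstream record, if any


def needs_review(obs: Observation, today: date, max_age_days: int = 365) -> bool:
    """Flag values that are old or that were copied from other data."""
    is_stale = (today - obs.recorded_on).days > max_age_days
    is_secondhand = obs.source == "derived"
    return is_stale or is_secondhand


# Made-up sample values with a fixed reference date for repeatability.
reference_day = date(2024, 6, 1)
fresh = Observation("no known allergies", "patient", date(2024, 5, 1))
old = Observation("no known allergies", "provider", date(2020, 1, 15))
copied = Observation("no known allergies", "derived", date(2024, 5, 20),
                     derived_from="rec-001")

flags = [needs_review(obs, reference_day) for obs in (fresh, old, copied)]
```

With a date and a source on each value, a user can tell whether it came directly from the patient or provider, or whether it was carried forward from older data.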

Ensuring high data quality starts with organizations making a commitment to maintaining data and requires industry-wide coordination and standardization of definitions and data models.

An organization first needs to have an understanding of the data it produces and the data it needs to inform care and business insights. This awareness can allow for monitoring of and improvements to documentation and aggregation for data quality.

And at an industry level, we also need common standards and definitions—down to the data element level—to ensure consistent interpretation of the data captured, Crenshaw explained. This work is occurring and growing as stakeholders continue to realize the value of the vast amount of data made available in data lakes.
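To make the point about data element definitions concrete, here is a small Python sketch of how records from different sources might be mapped onto shared element names. The element names, aliases and `harmonize` helper are hypothetical and are not drawn from any published standard. A real effort would rely on agreed industry definitions rather than an ad hoc table like this one.

```python
# Hypothetical shared definitions: canonical element name -> known local aliases.
COMMON_ELEMENTS = {
    "birth_date": {"dob", "date_of_birth", "birthdate"},
    "postal_code": {"zip", "zipcode", "postcode"},
}


def build_alias_index(common_elements):
    """Map every alias, and each canonical name itself, to its canonical name."""
    index = {}
    for canonical, aliases in common_elements.items():
        index[canonical] = canonical
        for alias in aliases:
            index[alias] = canonical
    return index


def harmonize(record, alias_index):
    """Rename known fields to canonical names and list the fields left unmapped."""
    harmonized, unmapped = {}, []
    for key, value in record.items():
        canonical = alias_index.get(key.strip().lower())
        if canonical is None:
            unmapped.append(key)
        else:
            harmonized[canonical] = value
    return harmonized, unmapped


alias_index = build_alias_index(COMMON_ELEMENTS)

# Made-up record from one source system with its own field names.
source_record = {"DOB": "1980-01-01", "zip": "12345", "fav_color": "blue"}
harmonized, unmapped = harmonize(source_record, alias_index)
```

Fields that cannot be mapped are reported rather than silently dropped, which gives data stewards a list of elements that still lack a common definition.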