The challenges of building data inventories at scale

Organizations that process personal data are being required to comply with a wave of breach notification laws and sweeping privacy and data protection regulations such as the EU’s General Data Protection Regulation (GDPR). Among these requirements are enabling rights of access, portability, and deletion; managing and tracking consents; implementing robust incident response and notification measures; and mapping and legitimizing cross-border personal data transfers. Taken together, these and other data privacy requirements point to a growing organizational imperative: knowing what types of personal data the organization holds, to whom the data belongs, with whom it is shared and to whom it is disclosed, when and for what purposes it is processed, and where it is transferred and stored (and these are just the highlights!).

As the GDPR effective date nears, many organizations are scrambling to inventory and map their personal data assets in order to answer these and other questions about their personal data stocks. For organizations that create and process extraordinarily high volumes of personal data across multiple systems and locations, however, discovering and inventorying every relevant aspect of those data stocks can seem insurmountable. These organizations face a seemingly intractable and, unfortunately, all too familiar problem: building personal data inventories at scale.

Until recently, the privacy tech market was bereft of the kind of data inventory and mapping solutions that organizations could adopt to effectively address the problem of scale. In their absence, organizations have been forced to rely on manual and semi-automated approaches that simply cannot meet the scaling demands of large, complex, multinational organizations that create, collect, process and/or store large volumes and types of personal data. The problem of scale for these organizations falls into two related sets of problems: complex and often messy IT environments, and the vast amounts and types of personal data records processed and stored in those environments. Let’s briefly unpack each of these problems.

Complex and messy IT environments: The first set of challenges facing organizations trying to inventory at scale is the variety and volume of data stores containing personal data. For example, personal data may be stored in on-premises data stores, cloud data stores, or both; it may be stored in data warehouses, data lakes, or even IoT devices; and it may reside in structured and unstructured repositories. Simply identifying an organization’s personal data stores can be challenging. Add the complexity of needing to inventory both the personal data elements themselves and important contextual data, such as the purpose for processing and any associated consents and/or processing restrictions, and it is easy to see how organizations struggle to build personal data inventories at scale.
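
To make this two-sided inventory concrete, here is a minimal sketch, written in Python with entirely hypothetical field names, of what a single inventory entry for one data store might record: the personal data elements found, paired with the contextual data (purposes, consents, restrictions) that gives them legal meaning.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataStoreEntry:
        """One row of a personal data inventory: what was found, and its context."""
        store_name: str                  # hypothetical, e.g. "customer-db-eu-1"
        platform: str                    # on-premises, cloud, data lake, IoT gateway, ...
        structure: str                   # "structured" or "unstructured"
        data_categories: List[str] = field(default_factory=list)      # e.g. ["contact", "financial"]
        processing_purposes: List[str] = field(default_factory=list)  # why the data is processed
        consents: List[str] = field(default_factory=list)             # consent records on file
        restrictions: List[str] = field(default_factory=list)         # processing restrictions

    entry = DataStoreEntry(
        store_name="customer-db-eu-1",
        platform="on-premises",
        structure="structured",
        data_categories=["contact", "financial"],
        processing_purposes=["billing"],
        consents=["terms-of-service-v3"],
    )
    print(entry)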

Voluminous numbers and types of records: The second and related set of factors that makes inventorying at scale so challenging is the vast number and variety of personal data records stored in the kinds of messy IT environments described above. Many organizations hold terabytes, and sometimes even petabytes, of data associated with potentially millions of identities. Now consider the added complexity that disparate elements of data pertaining to a single individual often reside in multiple repositories across the IT environment. To meet their privacy obligations at scale, organizations need a solution that can consolidate not only the relevant bits of data (e.g., medical and financial information) under each unique identity but, just as importantly, the contextual data (e.g., relevant privacy notices and consents) that organizations need to fulfill their privacy obligations.
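
As a rough illustration of what consolidation at the identity level involves, the following sketch assumes simplified records and uses an email address as the sole matching key, merging records from several hypothetical repositories under one identity while keeping contextual data (here, a consent marker) separate from the personal data itself. Real solutions must of course handle far messier matching.

    from collections import defaultdict

    # Hypothetical records as they might be extracted from three different systems.
    crm_records = [{"email": "jane@example.com", "name": "Jane Doe", "consent": "marketing-2018-01"}]
    billing_records = [{"email": "jane@example.com", "card_last4": "4242"}]
    support_records = [{"email": "jane@example.com", "ticket": "Complaint about billing"}]

    def consolidate(repositories):
        """Group records from every repository under a single identity key."""
        identities = defaultdict(lambda: {"sources": [], "data": {}, "context": {}})
        for source, records in repositories.items():
            for record in records:
                identity = identities[record["email"]]  # email as the matching key
                identity["sources"].append(source)
                for field, value in record.items():
                    # Treat consent markers as contextual data, everything else as personal data.
                    bucket = "context" if field == "consent" else "data"
                    identity[bucket][field] = value
        return dict(identities)

    inventory = consolidate({"crm": crm_records, "billing": billing_records, "support": support_records})
    print(inventory["jane@example.com"])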

Given both sets of factors, it is easy to see why manual, semi-automated, and even fully automated solutions such as data loss prevention (DLP) tools are simply ill-suited for discovering and consolidating personal data at the identity level. They cannot provide the data visibility and granularity organizations need to follow through on their privacy obligations. So is there a set of capabilities we can identify that would allow organizations to effectively inventory at scale? Consider the following five capabilities:

  • Emphasis on machine-discovered who, what, where, when and why – Where possible, solutions that inventory at scale should keep human involvement to a bare minimum. The more automated the tool, the more repositories it can scan (regardless of platform) and the more accurate the resulting picture of the organization’s personal data. In turn, the more complete the data map, the more useful it is for decision-makers determining what is and is not permissible with respect to the use of the personal data. Although human involvement will remain indispensable for the immediate future, next-generation solutions should leverage state-of-the-art automation wherever and whenever possible. (A minimal scanning sketch appears after this list.)
  • Inferring purpose from categorization of system and data – As noted above, data inventories do not by themselves resolve the organization’s privacy obligations. Rather, the inventory should be understood as a foundational step on the road to privacy compliance. Next-generation tools should be able to pull disparate pieces of personal data from different systems while also providing an extra layer of context (e.g., information about notices and consents) that includes the information necessary to understand the organization’s privacy requirements. For example, the tool should be able to inventory and consolidate personal data across multiple platforms, including CRM, customer support, and analytics platforms. Data from the same individual may flow through each of these platforms, but without the additional contextual data for each channel, the organization simply cannot be sure it is meeting its privacy obligations. (The second sketch after this list illustrates purpose inference.)
  • Use context to identify the meaning of data elements – Organizations regularly collect and process bits of unstructured personal information: meeting records, medical images, or various files such as PDFs. Without sufficient context, it can be very difficult for organizations to understand what specific obligations they have with regard to what can and cannot be done with these data elements. Next-generation solutions should be able to identify the context of the data elements stored throughout the network. For example, the tool should be able to run a query identifying US citizen health data across the IT environment, providing the organization with the context it needs to identify the specific set of privacy obligations around the use of that data. (The third sketch after this list shows such a query.)
  • No preconceptions superimposed on the discovery process – Traditional automated solutions such as DLP have a built-in limitation: they can only discover data elements that match a pre-programmed query. In other words, these types of solutions can only discover what is already known about the organization’s systems and data. To inventory at scale, next-generation solutions will need to go beyond the already known, not only automatically discovering the schema of structured and unstructured data stores but also matching and consolidating the appropriate personal and contextual data with the appropriate identity. (The final sketch after this list gives a toy version of schema discovery.)
  • Strong data visualization capabilities – For an effective user experience, which often leads to increased usage and value generation, it is essential that the user interface allow users to run queries, reports, and analytical scenarios quickly and accurately. This can be achieved through BI-like tools running on a high-performance “identity inventory” repository enriched with metadata and master data for stronger reporting and querying. As with any big data solution, performance, quick response rates, and high data quality are the foundations of user traction.
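
To ground the first capability, here is a minimal sketch of machine discovery: scanning text drawn from arbitrary repositories against simple pattern detectors and recording what kind of personal data lives where. The repository names and regular expressions are illustrative assumptions; production tools rely on far richer detection, including the machine learning techniques mentioned in the closing paragraph.

    import re

    # Illustrative detectors; production tools use much richer classifiers.
    DETECTORS = {
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def scan_repository(name, documents):
        """Return (repository, data_type) findings for every detector hit."""
        findings = []
        for doc in documents:
            for data_type, pattern in DETECTORS.items():
                if pattern.search(doc):
                    findings.append((name, data_type))
        return findings

    # Hypothetical repositories standing in for scanned systems.
    print(scan_repository("hr-file-share", ["Jane Doe, SSN 123-45-6789"]))
    print(scan_repository("crm", ["Contact: jane@example.com, 555-867-5309"]))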
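
The second capability, inferring purpose from the categorization of system and data, can be pictured as a lookup that attaches a default processing purpose to each finding based on the category of the system where it was found. The mapping below is a hypothetical illustration, not a statement of what any regulation or product prescribes.

    # Hypothetical defaults: the category of the source system suggests a purpose.
    PURPOSE_BY_SYSTEM_CATEGORY = {
        "crm": "marketing and sales outreach",
        "customer_support": "service delivery and complaint handling",
        "analytics": "product usage analysis",
    }

    def infer_purpose(system_category, data_type):
        """Attach an inferred processing purpose, falling back to human review."""
        purpose = PURPOSE_BY_SYSTEM_CATEGORY.get(system_category, "unknown - needs human review")
        return {"data_type": data_type, "system": system_category, "inferred_purpose": purpose}

    print(infer_purpose("crm", "email"))
    print(infer_purpose("payroll", "bank_account"))  # unmapped category falls back to review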
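
The third capability reduces, at its simplest, to answering context-dependent questions over the consolidated inventory. Assuming each identity record carries attributes such as citizenship and a set of data categories (hypothetical fields, in the spirit of the consolidation sketch above), the bullet’s example query, finding US citizen health data, becomes a straightforward filter.

    # Hypothetical consolidated identity inventory.
    inventory = [
        {"id": "jane@example.com", "citizenship": "US", "categories": {"health", "contact"}, "stores": ["ehr-archive"]},
        {"id": "lars@example.org", "citizenship": "DE", "categories": {"financial"}, "stores": ["billing-db"]},
    ]

    def find(records, citizenship, category):
        """Return identities matching both a citizenship attribute and a data category."""
        return [r for r in records if r["citizenship"] == citizenship and category in r["categories"]]

    for record in find(inventory, "US", "health"):
        print(record["id"], "->", record["stores"])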
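
Finally, the fourth capability, discovery without a superimposed query, can be approximated by inferring what a column contains from sampled values rather than from a pre-programmed list of known field names. This toy version flags candidate identity keys for the matching step; real solutions would use statistical and machine learning classifiers rather than the crude checks shown here.

    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def discover_schema(table_sample):
        """Infer, per column, whether sampled values look like personal data."""
        schema = {}
        for column, values in table_sample.items():
            if all(EMAIL.fullmatch(v) for v in values):
                schema[column] = "email (candidate identity key)"
            elif all(v.replace(".", "").isdigit() for v in values):
                schema[column] = "numeric (inspect: could be an identifier or amount)"
            else:
                schema[column] = "text (needs contextual classification)"
        return schema

    # A sampled table whose column names alone reveal nothing.
    sample = {
        "col_a": ["jane@example.com", "lars@example.org"],
        "col_b": ["184.50", "20.00"],
        "col_c": ["complaint", "refund"],
    }
    print(discover_schema(sample))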

The good news is that a recent spate of promising technology solutions leveraging artificial intelligence (AI) and machine learning has emerged in the market. Not only do these solutions operate at the identity level, discovering and then consolidating relevant personal and contextual data into inventories that are readily accessible for the organization’s compliance requirements; they also hold the promise of building these inventories at scale.