Personal Data discovery after the GDPR deadline
By Roland Bullivant
The GDPR deadline is now behind us and, perhaps unsurprisingly, levels of compliance vary widely across the organisations affected. However, meeting GDPR requirements is not a one-time effort. It requires ongoing vigilance so that, as an organisation's business and its relationships with Data Subjects change, its compliance efforts adapt to match.
For those who are more advanced, the process of maintaining compliance will continue and they have a good basis from which to work. For other organisations there are possibly many steps still to take on the road to compliance.
For both groups, building and/or maintaining an accurate catalogue or repository of Personal Data will remain a priority. After all, if you do not know where Personal Data is located across your IT landscape, or cannot be sure that the catalogue is up to date, it is very difficult to pinpoint data lineage or to identify where system changes are required in the context of GDPR.
Many organisations are seeking to record the location of Personal Data centrally in order to aid efficiency and compliance. Some have used a manual approach, whilst others employ spreadsheet or other home-grown solutions to do this and to help them understand how data flows through their organisation. Others have invested in solutions from Data Catalogue or Data Glossary vendors for the same reason as well as to make enterprise data more accessible and easy to understand in a business context for users.
The physical process of finding where Personal Data exists can take many forms. At its heart, however, is the need to access and understand the data definitions (the metadata) which are the foundation of any IT system. Other mechanisms, such as data profiling (examining the format of the data itself), can also prove valuable.
The challenge with metadata of course is that it is not the same for every system and application. There is a wide diversity of metadata types held in different ways. Some systems give up their metadata easily, some have relatively little metadata and some have large data landscapes which are more complex, and difficult to access and understand.
For example, finding an attribute called "Date of Birth" in a relational database with only 20 tables and one of them having a column called "Date of Birth" is much easier than trying to find the same information from 90,000 tables in an SAP system with no business descriptions in the database catalogue.
What follows is a brief summary of the most common methods of finding Personal Data, together with some of the advantages and drawbacks in certain scenarios and IT system types.
1. Searching through documentation
This is a natural place to start and, for some systems, might provide valuable insight into the metadata and other data definitions which will be useful in determining the location of Personal Data. However, it is likely to be of only limited use in anything but smaller systems, possibly home-grown applications with simple data structures. It will also only be of real value if the documentation has been kept in step with any changes made to the data model.
In addition, for large ERP and CRM packages, the task of searching through documentation to find individual tables and attributes from potentially thousands of pages will be very challenging and time consuming. Also, any useful information found will likely need to be re-keyed before it can be shared with other tools.
2. Manual database investigation
Scouring the relational database (RDBMS) system catalogue for any information which could provide a clue as to whether the tables contain Personal Data can be very practical in certain circumstances.
It will work for small packages or systems which are limited in scope or have perhaps been developed in-house. In these cases, establishing what is contained in a modest number of tables and attributes is fairly easy, as long as the naming has been done with clarity in mind.
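As a minimal sketch of this kind of catalogue search, the Python below lists every table in a database and flags columns whose names hint at Personal Data. It uses an in-memory SQLite database purely for illustration; the table names, the `PERSONAL_DATA_HINTS` pattern and the helper functions are assumptions made for this example, not part of any particular product, and other RDBMSs expose their catalogues differently.

```python
import re
import sqlite3

# Hypothetical name fragments that often hint at Personal Data.
# This list is an assumption for the example; tune it for your own systems.
PERSONAL_DATA_HINTS = re.compile(
    r"(birth|dob|e_?mail|phone|name|address|postcode)", re.IGNORECASE
)


def find_candidate_columns(conn):
    """Return (table, column) pairs whose column names match a hint."""
    candidates = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            column = row[1]
            if PERSONAL_DATA_HINTS.search(column):
                candidates.append((table, column))
    return candidates


def build_demo_database():
    """Create a tiny, made-up schema to search."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER PRIMARY KEY, full_name TEXT, date_of_birth TEXT, email TEXT)"
    )
    conn.execute(
        "CREATE TABLE product (id INTEGER PRIMARY KEY, sku TEXT, price REAL)"
    )
    return conn


if __name__ == "__main__":
    for table, column in find_candidate_columns(build_demo_database()):
        print(f"{table}.{column}")
```

A search like this depends entirely on meaningful column names, which is why it suits small, clearly named schemas and struggles with large packages whose catalogues carry no business descriptions.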
3. Application and tool specialists
Technical specialists may have the most familiarity with the underlying data models of source systems plus access to vendor or third party technical tools which can help them to identify Personal Data attributes. They are a good source of information, but they are often in short supply and may not be easily available.
As with other discovery approaches this method is probably more useful with smaller systems and there may be challenges in providing the results to whatever form of catalogue or glossary is being used.
Some ERP and CRM package vendors provide tools that can be used by technical specialists to try to locate the required information. However, as above, there may not be an effective mechanism to share that information more widely or with the glossary.
4. External consultants
If an organisation has limited resources then hiring external consultants can be an option. As well as the potential cost implications however, there may be some delay whilst they familiarise themselves with the source system's metadata and, if a package, its customisations, before being able to offer insight.
There is a potential downside to this approach as it may limit in-house knowledge levels in the long run.
5. Metadata-driven software
Deploying software to automate the process of identifying metadata associated with Personal Data across an entire IT ecosystem can be a faster and more effective method. Many data catalogue and governance products already have facilities to connect to source systems and import their metadata directly. Automating this process reduces the opportunity for error as only very limited manual intervention is necessary.
One category of system for which this approach will not be effective, unless specialist capabilities exist in those products, is CRM and ERP packaged applications. This is because of the size, complexity, inaccessibility and level of customisation of their data landscapes. There are a few advanced self-service metadata discovery tools available, such as Safyr®. These make the underlying package data structures available to users, who can search for Personal Data attributes, subset them into appropriate groupings and share the results with Data Catalogue or Governance products, or use them in Excel.
To illustrate the scale of the Personal Data Discovery challenge for these systems, Silwood Technology recently conducted research into five of the largest and most widely used information application packages.
This revealed that SAP has more than 900,000 fields, JD Edwards more than 140,000 and Microsoft Dynamics AX 2012 100,000 - all of which may (or may not) contain personal information that requires detection and risk assessment.
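Whichever discovery method is used, the findings need to reach the catalogue, glossary or spreadsheet without being re-keyed. A minimal sketch of one way to do that is to write them out as CSV, which Excel and many other tools can read. The field names and the `findings_to_csv` helper below are assumptions made for this example, not the import format of any specific product.

```python
import csv
import io

# Hypothetical record layout for a Personal Data finding.
FIELDNAMES = ["system", "table", "column", "reason"]


def findings_to_csv(findings):
    """Serialise a list of finding dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(findings)
    return buffer.getvalue()


if __name__ == "__main__":
    example = [
        {
            "system": "CRM",
            "table": "customer",
            "column": "date_of_birth",
            "reason": "column name match",
        }
    ]
    print(findings_to_csv(example))
```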
6. Internet search
Locating metadata relating to Personal Data via Internet search is a viable, low-cost option, but it is highly labour intensive and its accuracy is questionable.
This method is only of real value when data models have been published by vendors. It is common for example to find metadata definitions for well-known social media platforms and machine or sensor systems in this way.
Data models relating to some ERP and CRM packages can also be found online. Some are provided by the vendors and some have been published by customers. These may provide a pointer to where Personal Data is located in your systems. However, it is important to compare the version found on the internet with how your own system has been implemented.
7. Best guess and hypothesis testing
Many companies also use best guess or hypothesis testing methods to try to find the tables and attributes that they need. However, relying on data observation, insight and the search for an appropriate starting point is a strategy that can be frustrating, time consuming and potentially inaccurate.
8. Data modelling and profiling
Data modelling tools can reverse engineer metadata from RDBMS and other sources and create data models from the tables, fields and relationships they find. This may provide access to information relating to Personal Data.
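To give a flavour of what reverse engineering involves, the sketch below reads tables, columns and foreign key relationships from a database catalogue and assembles them into a simple in-memory model. It is an illustration only: it uses SQLite, and the schema and the `reverse_engineer` helper are assumptions made for this example. Commercial modelling tools do considerably more.

```python
import sqlite3


def reverse_engineer(conn):
    """Build {table: {"columns": [...], "references": [...]}} from the catalogue.

    Each reference is a (column, parent_table, parent_column) tuple.
    """
    model = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        columns = [
            row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        # PRAGMA foreign_key_list rows:
        # (id, seq, table, from, to, on_update, on_delete, match)
        references = [
            (row[3], row[2], row[4])
            for row in conn.execute(f'PRAGMA foreign_key_list("{table}")')
        ]
        model[table] = {"columns": columns, "references": references}
    return model


def build_demo_database():
    """Create a made-up two-table schema with one relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, date_of_birth TEXT)"
    )
    conn.execute(
        "CREATE TABLE sales_order ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customer(id), "
        "total REAL)"
    )
    return conn


if __name__ == "__main__":
    for table, details in reverse_engineer(build_demo_database()).items():
        print(table, details)
```

Following relationships in this way shows which other tables link to one that holds Personal Data, provided the relationships are actually declared in the database.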
Data profiling software lets users examine data formats to determine if they are likely to contain Personal Data. Sometimes, these tools incorporate machine learning or other analysis techniques to uncover the relevant information.
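A very simple form of format-based profiling can be sketched with regular expressions: sample the values in a column and report which known formats most of them match. The two patterns, the 80% threshold and the `profile_column` helper below are assumptions chosen for illustration; real profiling tools use far richer rules and, as noted above, sometimes machine learning.

```python
import re

# Illustrative patterns only; deliberately loose and far from exhaustive.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def profile_column(values, threshold=0.8):
    """Return names of patterns matched by at least `threshold` of non-empty values."""
    non_empty = [
        str(v).strip() for v in values if v is not None and str(v).strip()
    ]
    if not non_empty:
        return []
    matched = []
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            matched.append(name)
    return matched


if __name__ == "__main__":
    print(profile_column(["a@example.com", "b@example.org", None]))
    print(profile_column(["1980-01-31", "1975-12-01"]))
    print(profile_column(["SKU-1", "SKU-2"]))
```

A date format alone does not prove that a column holds a date of birth, so results like these are best treated as candidates for human review rather than as confirmed findings.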
What is clear is that there is no one solution for Personal Data discovery which fits all source types. However, using software to streamline as much of the process as possible will make complying with GDPR easier, faster and less risky than might otherwise be the case.
Roland Bullivant is at Silwood Technology Ltd