
When Big Data Meets Personal Data – Anonymisation Plays Matchmaker

The volume of accessible data is growing by the day, and big data is attractive: it promises insights that many assume have the potential to unlock business opportunities and to improve products and services, driving business growth. To unlock personal data for big data analytics, businesses are increasingly turning to anonymisation. However, anonymisation should not be relied on as an effective antidote to data protection obligations such as consent, protection, and transfer limitation. Anonymisation as a process has its own inherent risks, and businesses should carefully weigh these against the expected benefits of the data processing before deciding whether to proceed.

Introduction

In this age of big data and data-driven decision making, information can increasingly be extracted systematically from data sets previously too large, fast-moving, or complex to be analysed manually or with available technology. Big data is attractive. It promises insights that many executives and investors assume have the potential to unlock business opportunities and to improve products and services, driving business growth.1 It is no wonder that big data is often touted as the harbinger of a new era in business development. Linked to big data is an increasing proclivity of businesses towards using de-identified or anonymised data for business analytics. However, anonymisation should not be relied on as an effective antidote to data protection obligations such as consent, protection, and transfer limitation. Anonymisation as a process has its own inherent risks, and businesses should carefully weigh these against the expected benefits of the data processing before deciding whether to proceed.

Growth of Data Resources

Day to day, the amount of data available for analytics and processing is increasing. Social media is a key contributor, especially of data relating to identifiable individuals. A large swathe of the world’s population has interacted with platforms like Facebook, Twitter, Instagram, LinkedIn and WhatsApp, and activities like posting comments, uploading photos, or livestreaming all create data. Social media is not the only source of data about identifiable individuals, merely one of the largest pools. As we go about our daily lives, wearables like smart watches detect and record our heart rate and physical activity. Depending on the apps we install, our mobile devices can control the lighting, air conditioners, washing machines, security systems and other appliances that we use. We shop, trade, and bank online, and we rely on Grab, GoJek, Google Maps and GPS for travelling. All these interactions and transactions leave a digital trail and are recorded as data. Some of these data are confidential, others are publicly available. Yet others are de-identified or anonymised before they are sold or traded.

Each piece of data may represent only a limited or fleeting segment of an individual’s life. But when these diverse sources of data are pieced together, we start to get a more complete view of the individual: what the individual likes to eat, where the individual goes, which brands the individual prefers, or who the individual interacts with. In short, the more data we can piece together about a person, the richer the profile we build, and the more accurately we can predict how that person might react to scenarios or stimuli. Such profiles of individuals or groups of individuals can be a powerful tool for businesses, greatly aiding the delivery of better, more customised, and relevant products, but their existence may also create discomfort for individuals who know their data protection or privacy rights.2

Limits on the Use of Personal Data

How might individuals limit what businesses can do with the data acquired about them? At this point, big data encounters the bodyguard of its best friend, personal data. That bodyguard is the set of data protection or data privacy laws around the world, which regulate what can or cannot be done with personal data.3 While there are variations in the precise definitions across jurisdictions, countries have generally regarded information that identifies an individual, whether directly or indirectly, as personal data or personally identifiable information. The laws across jurisdictions also usually share more commonalities than differences in their key principles. The use of data generally must be underpinned by consent or some other legally recognised basis. Personal data should not be stored indefinitely but should be disposed of when no valid purpose continues to be served by its retention. It should be stored securely. It also either should not leave the jurisdiction in which it was collected, or, where permitted to leave, should do so on conditions equivalent to or better than the standards of protection in the jurisdiction from which the data is transferred. With these requirements in place, it can be difficult and costly for a business to carry out big data analytics over multiple sources of data while complying with all the legal obligations that attach to the data.

Playing matchmaker between big data and the legitimate use of personal data is anonymisation, a commonly offered panacea for the obstacles to conducting analytics over identifiable data where consent or some other valid legal basis is challenging to establish. There are many methods of anonymising data, including pseudonymisation, aggregation, value replacement, masking, data suppression and data recoding or generalisation.4 The suitability of each method, or combination of methods, depends in turn on a complex matrix of factors such as the nature, purpose, and size of the data set. However, while again being conscious of variations across jurisdictions,5 and regardless of the method employed,6 anonymised data is data which does not relate to an identified or identifiable person. Consequently, most jurisdictions recognise that anonymised data is not personal data. Anonymisation therefore presents itself as a valuable tool for extracting insights through data analytics without compromising the privacy of individuals. It also serves a secondary function as a method by which an organisation can fulfil its obligation to protect personal data under applicable laws.
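
To make a few of the methods above concrete, the following is a minimal sketch in Python of masking, generalisation (recoding) and data suppression applied to toy records. The field names, masking rule and age bands are illustrative assumptions, not techniques prescribed by the PDPC or any other regulator.

```python
records = [
    {"nric": "S1234567A", "age": 34, "postal_code": "238801", "salary": 5200},
    {"nric": "S7654321B", "age": 36, "postal_code": "238823", "salary": 4800},
]

def mask(value: str, visible: int = 1) -> str:
    """Masking: replace all but the last `visible` characters with '*'."""
    return "*" * (len(value) - visible) + value[-visible:]

def generalise_age(age: int, band: int = 10) -> str:
    """Generalisation/recoding: coarsen an exact age into a band, e.g. 34 -> '30-39'."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

def anonymise(record: dict) -> dict:
    return {
        "nric": mask(record["nric"]),              # masking
        "age": generalise_age(record["age"]),      # generalisation / recoding
        "postal_code": record["postal_code"][:2],  # recoding to a coarser area
        # "salary" is omitted entirely: data suppression
    }

print([anonymise(r) for r in records])
```

Even this simple example hints at the trade-off discussed below: the coarser the recoding, the lower the reidentification risk, but the less analytical value the data retains.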

Yet, anonymisation can only take big data analytics so far. Anonymising data is challenging: in today’s context, it is difficult to reduce data to the point where it is no longer identifiable. The Internet provides an ever-increasing pool of publicly available or accessible data, and analysis technology is ever more widely available. These factors collectively translate into a heightened risk of reidentification. In 2008, an anonymised Netflix dataset of film ratings was reidentified by comparing the ratings with public scores on the Internet Movie Database.7 In a 2019 study, 99.98 per cent of Americans were correctly reidentified in any available “anonymised” dataset by using just 15 characteristics, including age, gender, and marital status.8 In a country with a population the size of Singapore’s, the probability of reidentification is likely even higher. In light of the above, a very relevant issue for a data analyst about to process anonymised data is the degree of anonymisation necessary before data protection or data privacy laws cease to apply. This in turn translates into legal risk. A related question is how the risks of reidentification can be mitigated and managed.
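
The Netflix result is an instance of a linkage attack: an “anonymised” release is joined to an auxiliary public source on attributes the two share. The sketch below illustrates the mechanics with invented data and field names; real attacks use fuzzy matching over far larger auxiliary sources.

```python
# An "anonymised" release: direct identifiers removed, quasi-identifiers kept.
anonymised_release = [
    {"record_id": "r1", "age_band": "30-39", "postal_district": "23", "rating": 5},
    {"record_id": "r2", "age_band": "40-49", "postal_district": "31", "rating": 2},
]

# A public auxiliary source, e.g. scraped social media profiles.
public_profiles = [
    {"name": "Alice Tan", "age_band": "30-39", "postal_district": "23"},
]

QUASI_IDENTIFIERS = ("age_band", "postal_district")

def link(release, auxiliary, keys):
    """Pair each released record with every public profile matching on all keys."""
    return [
        (rec, profile)
        for rec in release
        for profile in auxiliary
        if all(rec[k] == profile[k] for k in keys)
    ]

for rec, profile in link(anonymised_release, public_profiles, QUASI_IDENTIFIERS):
    print(f"{profile['name']} is plausibly record {rec['record_id']}")
```

When a released record matches exactly one profile, the “anonymised” rating is effectively re-attached to a named individual.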

Definitions of Anonymous Data

On one end of the spectrum, views are stringent, and anonymisation must be effectively irreversible. In Opinion 05/2014 on Anonymisation Techniques, the Article 29 Working Party took the view, since criticised as near impossible to achieve,9 that an effective anonymisation solution “prevents all parties from singling out an individual in a dataset, from linking two records within a dataset (or between two separate datasets) and from inferring any information in such dataset”.10 In a similar vein, China’s PIPL provides at Article 73 that anonymisation refers to the process by which personal information is processed so that it is impossible (无法) to identify a specific natural person and cannot be recovered.11

By contrast, many jurisdictions, including Singapore, acknowledge that true anonymisation is notoriously difficult to achieve and opt for a risk-based approach instead. But even then, there are significant variations across jurisdictions as to the degree to which data must be transformed before it is considered anonymised.

In Singapore, Chapter 3 of the PDPC’s Guidelines provides that anonymisation “refers to the process of converting personal data into data that cannot be used to identify any particular individual, and can be reversible or irreversible”. It goes on to recognise explicitly that data “that has been anonymised is not personal data”. In drawing the bright line between anonymised data and personal data, the Guidelines adopt a risk-based approach, providing that data “would not be considered anonymised if there is a serious possibility that an individual could be re-identified”. Where the data involved is of a highly sensitive nature, even if there is a less than serious possibility of an individual being identified from the data, the organisation should carefully consider whether using or disclosing such data would be appropriate.12

By contrast, Recital 26 of the GDPR defines anonymous information as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. To determine whether a natural person is identifiable, account should be taken of “all the means reasonably likely to be used”, considering “all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments”. Further, Recital 26 provides that personal data which have undergone pseudonymisation, and which could be attributed to a natural person by the use of additional information, should be considered information on an identifiable natural person. Under the GDPR, pseudonymisation is defined as the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and subject to technical and organisational measures to ensure non-attribution to an identified or identifiable individual.
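
In code terms, pseudonymisation in the GDPR sense often looks like keyed tokenisation: identifiers are replaced with stable tokens that cannot be reversed without the “additional information”, here a secret key that must be stored separately under technical and organisational safeguards. The sketch below is a minimal illustration; the key handling is an assumption, and a real deployment would use a dedicated key management service.

```python
import hashlib
import hmac

# The "additional information": kept separately from the pseudonymised data.
SECRET_KEY = b"store-me-separately-under-strict-access-controls"

def pseudonymise(identifier: str) -> str:
    """Derive a stable pseudonym: the same identifier always yields the same
    token, so records remain linkable for analytics without exposing the
    underlying identifier."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

print(pseudonymise("S1234567A"))  # a stable 16-hex-character token
```

The point Recital 26 makes is that data pseudonymised this way remains personal data for anyone who could plausibly obtain the key, which is why pseudonymisation alone rarely clears the anonymisation bar.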

The differences in language appear slight, but at the margins, especially if pseudonymisation is the technique of choice, the choice of words could make all the difference: what is considered anonymised in one jurisdiction may not pass muster in another. If it turns out that what one was processing was not anonymised data but personal data, and steps had not been taken to comply with key obligations under applicable laws in areas such as consent, protection or transfer limitation, enforcement penalties could apply. Other obligations, such as the need to give effect to data subject rights like access and information requests or, in some jurisdictions, erasure and objections to profiling, would also kick in.

Mitigating Risks of Reidentification

Once the standard to which personal data must be anonymised has been correctly identified, one then has to turn to the next aspect of legal risk management, i.e., mitigation. This would ordinarily take the form of assessing and documenting the probability of reidentification. Such steps would be useful for demonstrating to regulators and enforcement agencies that due diligence had been undertaken to ensure that the data being processed is sufficiently anonymised.

In its updated draft guidance on anonymisation, pseudonymisation and privacy-enhancing technologies,13 the Information Commissioner’s Office in the UK recommends that applying a “motivated intruder” test “is a good starting point to consider identifiability risk”. The test involves assessing “whether an intruder would be able to achieve identification if they were motivated to attempt it”. The draft defines a motivated intruder as “a person who starts without any prior knowledge but wishes to identify an individual from whose personal data the anonymous information is derived”. It assumes that a motivated intruder is someone who is reasonably competent, has access to appropriate resources and uses investigative techniques. The intruder could also be determined, with a particular reason to want to identify individuals, such as an investigative journalist, disgruntled spouse, former employee, stalker, or industrial spy. For higher-value data, potential intruders with stronger capabilities, tools, and resources, such as state actors, may need to be considered as well. To be clear, passing the motivated intruder test does not rule out reidentification entirely. For example, it ordinarily does not require assessing the reidentification probability vis-à-vis an intruder with specialist knowledge, such as an experienced hacker, or an intruder who is not motivated but happens to have access to the combination of data that unlocks the anonymisation performed, e.g., a wife who sees an anonymised data set and can infer which entry corresponds to her husband based on indirect identifiers such as birth date, height and weight.

Another commonly taken step in mitigating identifiability risk is the k-anonymity test. As explained by Khaled El Emam and Fida Kamal Dankar,14 a k-anonymised data set has the property that each record is similar to at least k-1 other records on the potentially identifying variables. For example, if k=5 and the potentially identifying variables are age and gender, then a k-anonymised data set has at least five records for each value combination of age and gender. The k-anonymity test remains a popular means of demonstrating that a data set is not identifiable, despite academic criticism that k-anonymised datasets are subject to certain vulnerabilities where there is little diversity in the sensitive attributes within such records, or where the attacker has background knowledge and can draw the appropriate inferences.15 A minimal sketch of measuring k follows.
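
Measuring k is straightforward: group records by their combination of potentially identifying variables and take the smallest group size. The following sketch uses invented data and checks it against a hypothetical k >= 5 requirement.

```python
from collections import Counter

records = [
    {"age_band": "30-39", "gender": "F", "diagnosis": "flu"},
    {"age_band": "30-39", "gender": "F", "diagnosis": "asthma"},
    {"age_band": "40-49", "gender": "M", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ("age_band", "gender")

def k_anonymity(dataset, keys):
    """Return k: the size of the smallest group of records sharing the same
    values on the potentially identifying variables."""
    groups = Counter(tuple(rec[k] for k in keys) for rec in dataset)
    return min(groups.values())

k = k_anonymity(records, QUASI_IDENTIFIERS)
print(f"The data set is {k}-anonymous on {QUASI_IDENTIFIERS}")
# Here k=1: the lone ("40-49", "M") record is unique, so a k>=5 rule fails.
```

Note that a high k alone does not answer the homogeneity criticism above: if all five records in a group share the same diagnosis, the sensitive attribute leaks even though no single record can be singled out.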

Conclusion

In the world of big data and digital innovation that we live in today, data analytics seems set to become an ever more ubiquitous part of commercial life. However, before rushing off to invest in a data analytics team, it is useful to evaluate how the potential upsides of the endeavour weigh against the costs of legal compliance. It is sometimes tempting to take the cynical view that identification risk assessments are unnecessary because it is nigh impossible for an individual to prove that his or her data has been insufficiently anonymised or reidentified after anonymisation (think maternity ads turning up on social media feeds after a visit to the pharmacy or e-commerce website to purchase a pregnancy test kit), but a number of high-profile exposés16 show otherwise.

As discussed above, anonymising data carries inherent risks. The literature on how to mitigate such risks is also growing as businesses and governments grapple with how to fairly unlock the value of data without compromising the privacy of the individual. Considering this, identification risk assessment should become part of the regular parlance of any unit that regularly processes anonymised data. Specifically, considerations should include the following. First, is the data analytics tail wagging the business dog? The outcome of the data analytics should bring competitive value to the business that at least outweighs the risk of identification. As observed by Andrei Hagiu and Julian Wright in the Harvard Business Review, providers of consumer products will not build strong competitive positions based on data analytics alone, unless the value added by customer data is high and lasting, the data is proprietary and leads to product improvements that are hard to copy, or the data-enabled learning creates network effects.17 Second, is there capacity to choose the jurisdiction in which the data analytics is performed? If there is, some degree of forum selection could help to reduce risk exposure, given the variations in the definitions of anonymisation across jurisdictions. Third, what method or methods of anonymisation and associated risk mitigations would be most suitable, considering the purpose of the analysis, the cost of applying the method and the likelihood of identification? Bearing in mind that zero risk may be both infeasible and costly to pursue, this is ultimately a balancing exercise, and the data analyst needs to factor both upside and downside into what on the surface is a purely technical consideration. Big data might often meet personal data, but the matchmaker does not always earn her keep.

Endnotes
1 Some common applications of data analytics include sharpening an AI model’s ability to accurately classify and react to changing variables, recalculating risk portfolios, determining root cause of failures, detecting fraudulent behaviour, spotting anomalies, and converting data into insights.
2 In Singapore, individuals have rights to request access to their personal data and to rectify or correct data that is inaccurate or incomplete. There is also a right to request data to be ported to a third party in a common machine-readable format, although the details of that right are still being worked out by the regulator. In other jurisdictions, additional rights may include the right to request erasure of data, the right to restrict processing of data and the right to object to automated decision making.
3 Some jurisdictions have adopted comprehensive laws. For example, Singapore has enacted the Personal Data Protection Act 2012 (“PDPA”), the EU has adopted the General Data Protection Regulation (EU) 2016/679 (“GDPR”) and in China the Personal Information Protection Law of the People’s Republic of China (“PIPL”) was passed at the 30th meeting of the Standing Committee of the 13th National People’s Congress on 20 August 2021 and came into force on 1 November 2021. In other jurisdictions, laws may be piecemeal and sector-based.
4 For an introduction to some of these common methods, see the Personal Data Protection Commission’s (“PDPC”) Advisory Guidelines on the Personal Data Protection Act for Selected Topics, issued 24 September 2013, revised 4 October 2021, Chapter 3, paras 3.8-3.9. (https://www.pdpc.gov.sg/-/media/Files/PDPC/PDF-Files/Advisory-Guidelines/AG-on-Selected-Topics/Advisory-Guidelines-on-the-PDPA-for-Selected-Topics-4-Oct-2021.pdf?la=en, last accessed 26 January 2022).
5 Discussed further below.
6 The notable exception being pseudonymisation for which some jurisdictions have specific rules.
7 Narayanan, A., & Shmatikov, V. Robust de-anonymisation of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, pp 111-125. (https://dl.acm.org/doi/10.1109/SP.2008.33, last accessed 26 January 2022).
8 Rocher L, Hendrickx JM, de Montjoye YA. Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications, 2019 Jul;10(1), p 3069. (https://doi.org/10.1038/s41467-019-10933-3, last accessed 26 January 2022).
9 Khaled El Emam, Cecilia Álvarez. A critical appraisal of the Article 29 Working Party Opinion 05/2014 on data anonymisation techniques, International Data Privacy Law, 2015 Feb; Vol 5(1), pp 73–87. (https://doi.org/10.1093/idpl/ipu033, last accessed 26 January 2022).
10 Article 29 Working Party, Opinion 05/2014 on Anonymisation Techniques, p 9. (https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf, last accessed 26 January 2022).
11 The original text reads, “匿名化,是指个人信息经过处理无法识别特定自然人且不能复原的过程。”
12 Supra, note 4.
13 https://ico.org.uk/media/about-the-ico/documents/4018606/chapter-2-anonymisation-draft.pdf, last accessed 26 January 2022.
14 El Emam, Khaled, and Fida Kamal Dankar. Protecting privacy using k-anonymity, Journal of the American Medical Informatics Association, 2008; Vol 15(5), pp 627-37.
15 Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. l-diversity: Privacy beyond k-anonymity, ACM Transactions on Knowledge Discovery from Data (TKDD), 2007; Vol 1(1), 3-es. (https://personal.utdallas.edu/~muratk/courses/privacy08f_files/ldiversity.pdf, last accessed 26 January 2022).
16 The Netflix example has already been discussed above. In another example, in the mid-1990s, the governor of Massachusetts assured the public that records which the state had released summarising every state employee’s hospital visits had been properly scrubbed. A graduate student obtained the data and used the governor’s zip code, birthday, and gender to identify his medical history, diagnosis, and prescriptions. For more examples, see Boris Lubarsky. Reidentification of “Anonymised” Data. 2017 1 Geo. L. Tech. Rev. 202.
17 Andrei Hagiu and Julian Wright. When Data Creates Competitive Advantage. Harvard Business Review, Jan–Feb 2020. (https://hbr.org/2020/01/when-data-creates-competitive-advantage, last accessed 26 January 2022).

Adjunct Assistant Professor, NUS, Faculty of Law
Group Data Protection Officer, Vice President, Frasers Property Limited
E-mail: [email protected]