Anonymisation Techniques for Health and Population Data

By Molulaqhooa Linda Maoyi

With the increasing adoption of Health and Demographic Surveillance Systems (HDSS) in sub-Saharan Africa, the generation and capture of health and population data have grown rapidly in recent years. In South Africa, the South African Population Research Infrastructure Network (SAPRIN) brings together three HDSS sites: Agincourt, DIMAMO and the Africa Health Research Institute, which capture births, deaths, migrations, and household socio-demographic data. The three HDSS sites are located in the Mpumalanga, Limpopo and KwaZulu-Natal provinces (see Fig. 1).

Fig. 1: SAPRIN HDSS node locations

For a developing country like South Africa, HDSSs can be a valuable source of public health data to support policy-making and original health and population research. However, surveillance data may contain highly sensitive information that cannot easily be shared and therefore needs to be protected and anonymized to ensure privacy. But what does it mean to anonymize data?


Data Anonymisation

Data anonymization is the process of safeguarding private or sensitive information by deleting or encrypting identifiers that link a person to stored data. For example, you can run personally identifiable information such as names, ID numbers, and addresses through a data anonymization process that preserves the data but conceals its source. However, even after identifiers are removed, attackers can use de-anonymization methods to reverse the anonymization process. Because data typically pass through numerous sources, some of which are publicly accessible, de-anonymization techniques can cross-reference those sources and reveal personal information.

In South Africa, the Protection of Personal Information Act (POPIA) describes a specific set of rules to protect user data and create transparency. Although POPIA is strict, it allows companies to collect anonymous data without consent, use it for any purpose, and store it indefinitely, provided the company removes all identifiers from the data. At SAPRIN, we also take serious measures to de-identify the surveillance data we collect. For instance, any personally identifiable information (PII) is removed when processing and producing our publicly available datasets, and unique individual identifiers are converted to a range of random integers. However, I would like to note that anonymization differs from de-identification: de-identification refers to removing or replacing personal identifiers in a dataset in such a way that only an authorized third party can re-establish the link between an individual and their data record.

Anonymisation Techniques

Several techniques for achieving anonymization have been proposed in scientific literature. In essence, the three main objectives of data anonymization are to preserve:

  • Data utility (measured by the amount of loss caused by the anonymization technique, e.g. information loss).
  • Data privacy (measured by the conformity of the data to the privacy model constraints).
  • Data truthfulness (measured by each anonymized record corresponding to a single record in the original table).


Below, I give an overview of some common data anonymization techniques.

● Data perturbation— This alters the original dataset slightly, for example by rounding numbers and introducing random noise. The base used for rounding must be proportional to the range of the values: a base that is too small may result in weak anonymization, whereas a base that is too large may diminish the dataset's usability. For example, a base of 5 is a proportionate choice for rounding quantities like age or house number.
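A minimal sketch of this idea in Python (the base of 5 and the noise range of ±2 are illustrative choices, not SAPRIN parameters):

```python
import random

def perturb(value, base=5, noise=2):
    """Round to the nearest multiple of `base`, then add small random noise.
    `base` and `noise` must be tuned to the scale of the attribute."""
    rounded = base * round(value / base)
    return rounded + random.randint(-noise, noise)

# An exact age of 37 becomes a value between 33 and 37,
# so the published figure no longer matches the original record.
print(perturb(37))
```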

● Data masking— This method entails concealing data with altered values. You can, for example, substitute the characters of a value with a symbol like "@" or "*," making reverse engineering or detection very difficult.
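A simple masking helper might look like the following (the sample ID number is fictitious, and the choice of keeping the first two characters is arbitrary):

```python
def mask(value, keep=2, symbol="*"):
    """Replace all but the first `keep` characters with a mask symbol."""
    return value[:keep] + symbol * (len(value) - keep)

# A fictitious 13-digit ID number, masked beyond its first two digits:
print(mask("8001015009087"))  # -> 80***********
```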

● Data pseudonymization— This is a data management and de-identification strategy that substitutes personal identifiers with fictitious identities or pseudonyms, such as replacing "Simiso Ndlovu" with "John Mathebula." Pseudonymization maintains statistical correctness and data integrity, allowing the modified data to be used for training, development, testing, and analytics while maintaining data privacy.
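One way to sketch this is with a lookup table that assigns each real name a stable random pseudonym. The mapping table is the sensitive piece: it must be stored separately under access control so that only an authorized party can re-link records.

```python
import secrets

pseudonyms = {}  # real identifier -> pseudonym; keep under access control

def pseudonymise(name):
    """Return a stable random pseudonym for `name`, creating one on first use."""
    if name not in pseudonyms:
        pseudonyms[name] = f"P-{secrets.token_hex(4)}"
    return pseudonyms[name]

# The same person always receives the same pseudonym, so joins and
# longitudinal analyses still work on the pseudonymised data.
print(pseudonymise("Simiso Ndlovu"))
```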

● Data generalization— This is the process of removing some of the data to make it less identifiable. Data can be transformed into a group of ranges or a large region with defined limits. For example, you can remove the house number from an address but not the street name. Again, the goal is to remove some of the identifiers while maintaining some degree of data accuracy.
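Generalizing a numeric attribute into ranges can be sketched like this (the ten-year bin width is an illustrative choice):

```python
def generalise_age(age, width=10):
    """Replace an exact age with a range `width` years wide."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

# An exact age of 37 is generalised to the range 30-39.
print(generalise_age(37))  # -> 30-39
```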

● Data swapping—Also known as shuffling and permutation, this method is used to rearrange dataset attribute values so that they no longer correspond to the original records. For example, swapping columns containing identifier values, such as date of birth, may have a greater influence on anonymization than swapping membership type data.
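A minimal sketch of attribute shuffling, assuming records stored as a list of dictionaries: one column's values are permuted across rows, so each value survives in the dataset but is detached from its original record.

```python
import random

def swap_column(records, column, seed=None):
    """Shuffle the values of one attribute across all records."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

people = [
    {"id": 1, "dob": "1990-01-01"},
    {"id": 2, "dob": "1985-06-15"},
    {"id": 3, "dob": "2000-12-31"},
]
print(swap_column(people, "dob", seed=42))
```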

● Synthetic data—Rather than modifying or sharing the real dataset and jeopardizing privacy and security, this approach produces an artificial dataset. The procedure entails developing statistical models based on trends discovered in the original dataset. To construct the synthetic data, you can, for example, utilize standard deviations, medians, linear regression, or other statistical approaches.
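As a deliberately simplistic sketch, one column can be synthesised by fitting a normal model to the original values and sampling from it; real synthetic-data generators model joint distributions across attributes, not single columns in isolation.

```python
import random
import statistics

def synthesise(values, n):
    """Draw `n` artificial values from a normal model fitted to `values`."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [random.gauss(mu, sigma) for _ in range(n)]

# Fictitious example: generate 100 synthetic ages from 5 observed ones.
ages = [30, 35, 40, 45, 50]
print(synthesise(ages, 5))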

Although these techniques work well, they are not without their shortcomings. The anonymized data may still be subject to several attacks, such as linkage attacks, background knowledge attacks, attribute disclosure attacks and membership disclosure attacks. For example, linkage attacks are a classical attack on relational datasets in which an adversary re-identifies a record in an anonymized dataset by combining quasi-identifiers from different sources and linking them to an individual. Therefore, care and due diligence need to be taken to guard against these attacks. In my next blog, I will detail each of these types of attacks and how to mitigate them.
