Introduction to Data Anonymization

You have most likely heard the term “data anonymization” if you work with data, directly or indirectly. In this post, we will learn about this crucial part of the data processing pipeline, which targets data privacy at its core.

But wait a minute. Before delving into data anonymization, let’s first address the question: why does it matter?

Why does data anonymization matter?

Well, to start with, data anonymization matters because it honors people’s right to privacy, a fundamental human right included in the United Nations’ Universal Declaration of Human Rights (Article 12):

No one shall be subjected to arbitrary interference with his privacy, family, home, or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.

Although this article does not explicitly mention data privacy, it is inherent in the term privacy.

In today’s digital age, we increasingly contribute, knowingly or unknowingly, to data collection. Some of that data is highly personal (e.g., credit card number, phone number), also referred to as personal data, which makes data privacy a paramount concern.

Personally Identifiable Information (PII)

A formal definition of personal data is any information relating to an identified or identifiable living person (GDPR 2016, India’s Digital Personal Data Protection Act 2023). This is also referred to as Personally Identifiable Information (PII) (refer to this post for more detailed information on PII). Some examples are name, email address, and passport number.

The first and foremost reason data anonymization matters is that it helps protect personal data, respecting people’s privacy and thus upholding the fundamental right to privacy. However, we don’t live in a perfect world where people always do what is ethical and right (if we did, we would not need laws at all). So, to ensure the protection of personal data, data privacy regulations have been drafted and enacted across the world (refer to this link for more information on data privacy laws in different countries). Examples include the GDPR in the European Union, the Digital Personal Data Protection Act in India, and the Personal Information Protection Law in China.

The second reason to care about data anonymization is that it helps organizations comply with data privacy regulations. Violating such legislation can result not only in monetary loss but also in a loss of customer trust.

Non-compliance with data privacy legislation can cost a fortune. Take, for example, the 746 million euro penalty imposed on Amazon for not processing personal data in accordance with the GDPR. To avoid such a situation, and to respect people’s privacy, we need an approach that strikes a balance between the utility of data and the preservation of privacy. This is where data anonymization comes in.

Data anonymization

Data anonymization removes personal information from data while preserving, as far as possible, its utility for the objective at hand, thereby greatly reducing the risks associated with data privacy.

Tip

More importantly, data privacy regulations such as the GDPR do not apply to properly anonymized data, which considerably reduces the compliance burden on organizations (provided the anonymization is conducted correctly).

Data anonymization techniques can be broadly categorized into two main types: anonymization and pseudonymization.

The following section explores both approaches, highlights their key differences, and illustrates their applications with real-world examples.

Anonymization vs. Pseudonymization

Data anonymization and pseudonymization are both methods of removing personal information, but they differ in two key ways:

The level of anonymity they provide
Whether the process can be reversed

The figure below highlights the main differences between these two approaches.

Example

Let’s take an example to understand this further. Consider the following table, which contains Personally Identifiable Information (PII).

Original data

Name	Gender	Date of Birth	Zipcode
John Smith	Male	1985-03-15	90210
Emily Davis	Female	1990-07-22	10001
Michael Lee	Male	1978-11-05	30301

Pseudonymized data

Pseudonymization replaces direct personal identifiers (like names) with artificial keys or pseudonyms. The mapping between original identifiers and pseudonyms is preserved, enabling valuable use cases such as targeted marketing campaigns while still protecting raw personal data.

Pseudonymized Name	Gender	Date of Birth	Zipcode
User_1A7B	Male	1985-03-15	90210
User_2C8D	Female	1990-07-22	10001
User_3E6F	Male	1978-11-05	30301

Tip

The mapping from the original PII to the pseudonymized version is reversible, which means individuals can still be re-identified by anyone who holds the mapping.
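To make the reversibility concrete, here is a minimal Python sketch of pseudonymization. The function names `pseudonymize` and `reidentify`, and the choice of random four-character hex suffixes, are assumptions made for illustration and not a prescribed method. The point it shows is that whoever keeps the mapping can always get back to the original name.

```python
import secrets

def pseudonymize(records, key="Name"):
    """Swap `key` for a random pseudonym and return (records, mapping)."""
    mapping = {}  # original value -> pseudonym; keeping this makes the step reversible
    used = set()  # pseudonyms already handed out, to avoid collisions
    pseudonymized = []
    for record in records:
        original = record[key]
        if original not in mapping:
            pseudonym = "User_" + secrets.token_hex(2).upper()
            while pseudonym in used:
                pseudonym = "User_" + secrets.token_hex(2).upper()
            used.add(pseudonym)
            mapping[original] = pseudonym
        row = {"Pseudonymized Name": mapping[original]}
        row.update({k: v for k, v in record.items() if k != key})
        pseudonymized.append(row)
    return pseudonymized, mapping

def reidentify(pseudonym, mapping):
    """Reverse lookup: only possible for whoever holds the mapping."""
    reverse = {p: name for name, p in mapping.items()}
    return reverse[pseudonym]

records = [
    {"Name": "John Smith", "Gender": "Male", "Date of Birth": "1985-03-15", "Zipcode": "90210"},
    {"Name": "Emily Davis", "Gender": "Female", "Date of Birth": "1990-07-22", "Zipcode": "10001"},
    {"Name": "Michael Lee", "Gender": "Male", "Date of Birth": "1978-11-05", "Zipcode": "30301"},
]

pseudonymized, mapping = pseudonymize(records)
for row in pseudonymized:
    print(row)
```

In practice the mapping would be stored separately from the pseudonymized table and under stricter access control, since it is the only thing standing between the pseudonyms and the real names.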

Anonymized data

In the anonymized version of the data, the mapping is irreversible. For example, names are replaced with alternate strings without retaining any information about the mapping, which means we have no way of converting an anonymized ID back to the original PII using only the information available in the table.

Anonymized ID	Gender	Date of Birth	Zipcode
ANON_001	Male	1985-03-15	90210
ANON_002	Female	1990-07-22	10001
ANON_003	Male	1978-11-05	30301
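The same idea can be sketched in Python. This is a minimal illustration, assuming a hypothetical helper called `anonymize`: it drops the name, assigns a sequential ID, and never builds a mapping. Shuffling the rows first is an extra assumption on my part, so that the ID does not simply encode the original row order.

```python
import random

def anonymize(records, key="Name"):
    """Drop `key` and assign sequential IDs; no mapping is created or kept."""
    rows = [{k: v for k, v in r.items() if k != key} for r in records]
    random.shuffle(rows)  # assumption: hide the original row order
    return [
        {"Anonymized ID": f"ANON_{i:03d}", **row}
        for i, row in enumerate(rows, start=1)
    ]

records = [
    {"Name": "John Smith", "Gender": "Male", "Date of Birth": "1985-03-15", "Zipcode": "90210"},
    {"Name": "Emily Davis", "Gender": "Female", "Date of Birth": "1990-07-22", "Zipcode": "10001"},
    {"Name": "Michael Lee", "Gender": "Male", "Date of Birth": "1978-11-05", "Zipcode": "30301"},
]

anonymized = anonymize(records)
for row in anonymized:
    print(row)
```

Note that the remaining columns (gender, date of birth, zipcode) are untouched, so removing the name alone does not guarantee anonymity.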

However, this de-identified data, combined with other publicly available information, could still lead to re-identification.

Important

Although the direct personal identifier in the table has been anonymized, re-identification is still possible using additional information.

For example, Professor Latanya Sweeney (who also proposed the k-anonymity method) showed that “87% of the U.S. Population are uniquely identified by {date of birth, gender, ZIP}”. This should make us think again about the risks associated with our anonymized version of the data.
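A small Python sketch can show how such a linkage works on our toy table. Everything here is illustrative: the `public_list` stands in for a hypothetical public dataset that happens to contain names alongside the same three attributes, and the helper names `unique_fraction` and `link` are my own.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("Gender", "Date of Birth", "Zipcode")

def unique_fraction(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    combos = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)

def link(anonymized, public, quasi_identifiers=QUASI_IDENTIFIERS):
    """Join the two tables on the quasi-identifiers; keep unambiguous matches."""
    index = {}
    for person in public:
        combo = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(combo, []).append(person["Name"])
    matches = {}
    for row in anonymized:
        names = index.get(tuple(row[q] for q in quasi_identifiers), [])
        if len(names) == 1:
            matches[row["Anonymized ID"]] = names[0]
    return matches

anonymized = [
    {"Anonymized ID": "ANON_001", "Gender": "Male", "Date of Birth": "1985-03-15", "Zipcode": "90210"},
    {"Anonymized ID": "ANON_002", "Gender": "Female", "Date of Birth": "1990-07-22", "Zipcode": "10001"},
    {"Anonymized ID": "ANON_003", "Gender": "Male", "Date of Birth": "1978-11-05", "Zipcode": "30301"},
]

# Hypothetical public dataset that includes names (illustration only).
public_list = [
    {"Name": "John Smith", "Gender": "Male", "Date of Birth": "1985-03-15", "Zipcode": "90210"},
    {"Name": "Emily Davis", "Gender": "Female", "Date of Birth": "1990-07-22", "Zipcode": "10001"},
    {"Name": "Michael Lee", "Gender": "Male", "Date of Birth": "1978-11-05", "Zipcode": "30301"},
]

print(unique_fraction(anonymized))
print(link(anonymized, public_list))
```

Because every row in the toy table has a unique combination of gender, date of birth, and zipcode, each anonymized ID links back to exactly one name.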

To mitigate re-identification risks in supposedly ‘anonymized’ data, robust techniques such as k-anonymity (mentioned above) and generalization have been developed. When properly implemented, these methods make it far harder to trace personal data back to individuals, as illustrated in the example below.

Anonymized ID	Gender	Age Group	Generalized Zipcode
ANON_001	Male	35-44	90XXX
ANON_002	Female	30-34	10XXX
ANON_003	Male	45-54	30XXX
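Generalization like this can be sketched in a few lines of Python. The helpers below (`age_on`, `age_group`, `generalize_zip`, `smallest_group`) are hypothetical names, the fixed reference date and the decade-wide age bands are assumptions for illustration, and the bands therefore differ slightly from the ones in the table above. The last helper reports the size of the smallest group sharing the same generalized values, which is the "k" in k-anonymity.

```python
from collections import Counter
from datetime import date

def age_on(dob_iso, today):
    """Age in whole years on `today` for an ISO date-of-birth string."""
    dob = date.fromisoformat(dob_iso)
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def age_group(age, width=10):
    """Map an age to a fixed-width band, e.g. 39 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zipcode, keep=2):
    """Keep the first `keep` digits and mask the rest, e.g. '90210' -> '90XXX'."""
    return zipcode[:keep] + "X" * (len(zipcode) - keep)

def generalize(records, today):
    return [
        {
            "Anonymized ID": r["Anonymized ID"],
            "Gender": r["Gender"],
            "Age Group": age_group(age_on(r["Date of Birth"], today)),
            "Generalized Zipcode": generalize_zip(r["Zipcode"]),
        }
        for r in records
    ]

def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of rows sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

anonymized = [
    {"Anonymized ID": "ANON_001", "Gender": "Male", "Date of Birth": "1985-03-15", "Zipcode": "90210"},
    {"Anonymized ID": "ANON_002", "Gender": "Female", "Date of Birth": "1990-07-22", "Zipcode": "10001"},
    {"Anonymized ID": "ANON_003", "Gender": "Male", "Date of Birth": "1978-11-05", "Zipcode": "30301"},
]

# Assumed reference date, fixed so the example is reproducible.
generalized = generalize(anonymized, today=date(2025, 1, 1))
k = smallest_group(generalized, ("Gender", "Age Group", "Generalized Zipcode"))
for row in generalized:
    print(row)
print("k =", k)
```

On a table this small, every row can still be unique after generalization (k = 1). With a realistic dataset you would keep coarsening the bands, or suppress outlier rows, until k reaches a threshold you are comfortable with.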

In the upcoming posts, we’ll peel back the layers of data anonymization!

Stay tuned for more privacy-preserving insights! 🔍🔒