PSEUDONYMIZATION VS. ANONYMIZATION AND HOW THEY HELP WITH GDPR 

 

In data security, there is a clear distinction between two terms: pseudonymization and anonymization. The two techniques differ in one significant respect. Pseudonymization substitutes the data subject’s identity so that additional information is required to re-identify the data subject. Anonymization, as the name implies, destroys any way of identifying the data subject. It is important to understand the distinction because the GDPR classifies the two kinds of data differently.

Let’s take an example from pencil production. Say we have 20 pencils produced by an anonymous company. We have no way of knowing whether the same company made all 20 pencils or whether they came from 15, 16, 17, or even 20 different manufacturers; all of the producers are anonymous. Now say we have 20 other pencils by Richmond Woods Stationery (a pencil production company). We know that all 20 pencils were produced by the same company, and we also know that Richmond Woods Stationery is really Royce Brooks. Royce therefore made the pencils under a pseudonym.

The table below illustrates tokenization in practice. A token is provided for each distinct name, so access to additional information is required to re-identify the data.

Name    Anonymized  Token/Pseudonym
Justin  XXXXXX      Espoins
Dave    XXXXXX      Jums
James   XXXXXX      Poqqa
David   XXXXXX      Zwpvs
Avery   XXXXXX      Poqqa
Jim     XXXXXX      Zwp

 

In the table above, we assume that we do not know the data subjects’ identities when we look at the pseudonymized data. Still, we can correlate entries with specific subjects, because records that carry the same token reference the same person. We can get back to the real identity if we can re-identify the data via the token lookup tables. With the anonymized data, however, we only know how many records there are, and there is no method to re-identify the data.

In short, pseudonymization is a method of masking identity by substituting a reversible and consistent value. Anonymization is the destruction of identifiable data.
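The difference can be sketched in a few lines of Python. This is a minimal illustration only, assuming a simple in-memory lookup table; the names `TokenLookup`, `pseudonymize`, `reidentify`, and `anonymize` are hypothetical and do not describe any particular product.

```python
import secrets


class TokenLookup:
    """Hypothetical lookup-table pseudonymizer (illustration only)."""

    def __init__(self):
        self._token_by_name = {}
        self._name_by_token = {}

    def pseudonymize(self, name):
        # Consistent: the same name always receives the same token.
        if name not in self._token_by_name:
            token = secrets.token_hex(4)
            # Regenerate in the unlikely case of a collision.
            while token in self._name_by_token:
                token = secrets.token_hex(4)
            self._token_by_name[name] = token
            self._name_by_token[token] = name
        return self._token_by_name[name]

    def reidentify(self, token):
        # Reversible, but only for someone who holds the lookup table.
        return self._name_by_token[token]


def anonymize(name):
    # Irreversible: every name collapses to the same mask,
    # so records can no longer be told apart or traced back.
    return "XXXXXX"
```

Anyone without the lookup table sees only the tokens, which is why the table itself is the "additional information" that must be kept separately and protected.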

We should be concerned about “indirect re-identification” with anonymization. Returning to the pencil example, an anonymous pencil may be indirectly identified by analyzing each company’s production style. It may be hard to recover the producer’s name, but we could still tell that the same company made certain pencils because they share a production style. And if the producer has also released a pencil under its own name, we can compare the style and design of that pencil with the anonymous ones and identify the producer when they match.

For instance, suppose an organization retains its customers’ purchase history records but anonymizes the easily identifiable fields, such as name and address. It may still be possible to identify a record indirectly, since humans are creatures of habit.

Every morning, Monday to Friday, Jack goes to the same coffee shop, buys the same coffee and waffles for breakfast, and pays with his debit card. On Thursday night, he always withdraws $100 from the ATM close to his office because Friday night is party night with his buddies.

Jack’s behavior would allow us to indirectly re-identify him (all of these transactions reference the same person because we can recognize his predictable behavior) even if the organization has “anonymized” Jack’s personally identifiable data (destroyed his name, address, etc.). Therefore, the data set has not been properly anonymized. To anonymize this data effectively, we may have to use additional methods that hide individual behavior. For example, we might only store records based on some grouping:

“40 people went to this coffee shop every morning.”  

“100 people got money from this ATM every Sunday.”  

“A total of $200,000 was taken from this ATM on Thursday.”  

“40 people bought waffles today.”

Now the data has been anonymized because we have no way of seeing Jack’s predictable behavior pattern. Rixon Technology’s Enterprise Vaultless Tokenization is an excellent way to accomplish both pseudonymization and anonymization of data. However, full anonymization should be undertaken by expert statisticians, data scientists, and similar specialists, and should be tailored to the individual organization that retains the data.
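The grouping idea described above can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout (day, place, purchase), the sample rows, and the function name `aggregate_visits` are all hypothetical, and a real data set would need expert review before being treated as anonymized.

```python
from collections import Counter


def aggregate_visits(records):
    """Count records per (day, place) and discard the row-level detail."""
    return Counter((day, place) for day, place, _purchase in records)


# Hypothetical name-stripped transactions: (day, place, purchase).
records = [
    ("Mon", "coffee shop", "coffee and waffles"),
    ("Mon", "coffee shop", "tea"),
    ("Tue", "coffee shop", "coffee and waffles"),
    ("Thu", "ATM", "withdraw $100"),
    ("Thu", "ATM", "withdraw $40"),
]

summary = aggregate_visits(records)
for (day, place), count in sorted(summary.items()):
    print(f"{count} transaction(s) at the {place} on {day}")
```

Only the counts are kept, so the recurring row-level pattern (the same purchase at the same place every weekday) is no longer visible in the stored data.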