De-identification and re-identification of PII in large-scale datasets using Cloud DLP  |  Cloud Architecture Center  |  Google Cloud (2023)

This document discusses how to use Cloud Data Loss Prevention (Cloud DLP) to create an automated data transformation pipeline to de-identify sensitive data like personally identifiable information (PII). De-identification techniques like tokenization (pseudonymization) let you preserve the utility of your data for joining or analytics while reducing the risk of handling the data by obfuscating the raw sensitive identifiers. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified replicas. Cloud DLP enables transformations such as redaction, masking, tokenization, bucketing, and other methods of de-identification. When a dataset hasn't been characterized, Cloud DLP can also inspect the data for sensitive information by using more than 100 built-in classifiers.

This document is intended for a technical audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy, without the need to be an expert.

Reference architecture

The following diagram shows a reference architecture for using Google Cloud products to add a layer of security to sensitive datasets by using de-identification techniques.

[Diagram: reference architecture for de-identification and re-identification of PII in large-scale datasets using Cloud DLP]

The architecture consists of the following:

  • Data de-identification streaming pipeline: De-identifies sensitive data in text using Dataflow. You can reuse the pipeline for multiple transformations and use cases.

  • Configuration (DLP template and key) management: A managed de-identification configuration that is accessible only to a small group of people (for example, security admins) to avoid exposing de-identification methods and encryption keys.

  • Data validation and re-identification pipeline: Validates copies of the de-identified data and uses a Dataflow pipeline to re-identify data at a large scale.

Helping to secure sensitive data

One of the key tasks of any enterprise is to help ensure the security of its users' and employees' data. Google Cloud provides built-in security measures to facilitate data security, including encryption of stored data and encryption of data in transit.

Encryption at rest: Cloud Storage

Maintaining data security is critical for most organizations. Unauthorized access to even moderately sensitive data can damage the trust, relationships, and reputation that you have with your customers. Google encrypts data stored at rest by default: any object uploaded to a Cloud Storage bucket is encrypted using a Google-managed encryption key. If your dataset uses a pre-existing encryption method and requires a non-default option before uploading, Cloud Storage provides other encryption options. For more information, see Data encryption options.

Encryption in transit: Dataflow

When your data is in transit, the at-rest encryption isn't in place. In-transit data is protected by secure network protocols referred to as encryption in transit. By default, Dataflow uses Google-managed encryption keys. The tutorials associated with this document use an automated pipeline that uses the default Google-managed encryption keys.

Cloud DLP data transformations

There are two main types of transformations performed by Cloud DLP:

  • recordTransformations
  • infoTypeTransformations

Both the recordTransformations and infoTypeTransformations methods can de-identify and encrypt sensitive information in your data. For example, you can transform the values in the US_SOCIAL_SECURITY_NUMBER column to be unidentifiable, or use tokenization to obscure them while keeping referential integrity.

The infoTypeTransformations method enables you to inspect for sensitive data and transform the findings. For example, if you have unstructured or free-text data, the infoTypeTransformations method can help you identify an SSN inside a sentence and encrypt the SSN value while leaving the rest of the text intact. You can also define custom infoTypes.

The recordTransformations method enables you to apply a transformation configuration per field when using structured or tabular data. With the recordTransformations method, you can apply the same transformation across every value in that field, for example, hashing or tokenizing every value in a column whose field or header name is SSN.

Within the recordTransformations method, you can also mix in infoTypeTransformations that apply only to the values in the specified fields. For example, you can use an infoTypeTransformations method inside a recordTransformations method for the field named comments to redact any findings for US_SOCIAL_SECURITY_NUMBER that are found inside the text in the field.
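
As a rough illustration of this mixing, the following Python sketch builds a de-identification configuration as a plain dictionary in the JSON shape of the DLP API. The field names SSN and comments come from the examples above; the choice of masking for the SSN column and the helper fields_with_inspection are assumptions made for illustration only, not part of the reference architecture.

```python
import json

# Sketch of a de-identification config in the JSON shape of the DLP API.
# The SSN field is transformed directly; the comments field is inspected
# first, and only US_SOCIAL_SECURITY_NUMBER findings are redacted.
deidentify_config = {
    "recordTransformations": {
        "fieldTransformations": [
            {
                # Structured column: mask every value, no inspection needed.
                "fields": [{"name": "SSN"}],
                "primitiveTransformation": {
                    "characterMaskConfig": {"maskingCharacter": "#"}
                },
            },
            {
                # Free-text column: nested infoTypeTransformations.
                "fields": [{"name": "comments"}],
                "infoTypeTransformations": {
                    "transformations": [
                        {
                            "infoTypes": [{"name": "US_SOCIAL_SECURITY_NUMBER"}],
                            "primitiveTransformation": {"redactConfig": {}},
                        }
                    ]
                },
            },
        ]
    }
}


def fields_with_inspection(config: dict) -> list:
    """Hypothetical helper: names of fields that are inspected before transformation."""
    names = []
    for field_transformation in config["recordTransformations"]["fieldTransformations"]:
        if "infoTypeTransformations" in field_transformation:
            names.extend(f["name"] for f in field_transformation["fields"])
    return names


print(json.dumps(deidentify_config, indent=2))
print(fields_with_inspection(deidentify_config))
```

Only the comments field triggers inspection in this sketch, which keeps the more expensive detection step confined to the free-text column.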

In increasing order of complexity, the de-identification processes are as follows:

  • Redaction: Remove the sensitive content with no replacement of content.
  • Masking: Replace the sensitive content with fixed characters.
  • Encryption: Replace sensitive content with encrypted strings, possibly reversibly.
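
To make the difference between the first two processes concrete, here is a minimal local sketch in Python. It does not call Cloud DLP: the regular expression is a deliberately simplified stand-in for a real SSN detector, and the function names redact and mask are hypothetical. Encryption is left out because it requires a key, which is discussed later in this document.

```python
import re

# Simplified SSN pattern for illustration only; it is an assumption of this
# sketch and far less robust than Cloud DLP's built-in detectors.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Redaction: remove each match with no replacement."""
    return SSN_PATTERN.sub("", text)


def mask(text: str, masking_character: str = "#") -> str:
    """Masking: replace every character of each match with a fixed character."""
    return SSN_PATTERN.sub(lambda m: masking_character * len(m.group()), text)


sample = "SSN on file: 123-45-6789."
print(redact(sample))  # SSN on file: .
print(mask(sample))    # SSN on file: ###########.
```

Redaction changes the length of the text, while masking keeps it, which can matter when downstream systems expect fixed-width values.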

Working with delimited data

Often, data consists of records delimited by a selected character, with fixed types in each column, like a CSV file. For this class of data, you can apply de-identification transformations (recordTransformations) directly, without inspecting the data. For example, you can expect a column labeled SSN to contain only SSN data. You don't need to inspect the data to know that the infoType detector is US_SOCIAL_SECURITY_NUMBER. However, free-form columns labeled Additional Details can contain sensitive information, but the infoType class is unknown beforehand. For a free-form column, you need to inspect the data with infoType detectors (infoTypeTransformations) before applying de-identification transformations. Cloud DLP allows both of these transformation types to co-exist in a single de-identification template.

Cloud DLP includes more than 100 built-in infoType detectors. You can also create custom types or modify built-in infoType detectors to find sensitive data that is unique to your organization.
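
Delimited records are typically passed to Cloud DLP as a table item with headers and rows. The following sketch, which uses only the Python standard library, shows one way to convert CSV text into that JSON shape. The function name csv_to_dlp_table and the sample data are hypothetical, and the sketch assumes that every cell is sent as a string value.

```python
import csv
import io


def csv_to_dlp_table(csv_text: str) -> dict:
    """Convert CSV text into a DLP-style content item with a table.

    The first CSV row is treated as the header; every cell is sent as a
    string value (an assumption of this sketch).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return {
        "table": {
            "headers": [{"name": name} for name in header],
            "rows": [
                {"values": [{"stringValue": cell} for cell in row]}
                for row in reader
            ],
        }
    }


sample_csv = "SSN,Additional Details\n123-45-6789,Customer asked for a callback\n"
item = csv_to_dlp_table(sample_csv)
print(item["table"]["headers"])
```

Because the header names travel with the data, field-level transformations can refer to columns such as SSN by name.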

Determining transformation type

Determining when to use the recordTransformations or infoTypeTransformations method depends on your use case. Because using the infoTypeTransformations method requires more resources and is therefore more costly, we recommend using this method only for situations where the data type is unknown. You can evaluate the costs of running Cloud DLP using the Google Cloud pricing calculator.

For examples of transformation, this document refers to a dataset that contains CSV files with fixed columns, as demonstrated in the following table.

Column name | Inspection infoType (custom or built-in) | DLP transformation type
Card Number | Not applicable | Deterministic encryption (DE)
Card Holder's Name | Not applicable | Deterministic encryption (DE)
Card PIN | Not applicable | Crypto hashing
SSN (Social Security Number) | Not applicable | Masking
Age | Not applicable | Bucketing
Job Title | Not applicable | Bucketing
Additional Details | Built-in: |

This table lists the column names and describes which type of transformation is needed for each column. For example, the Card Number column contains credit card numbers that need to be encrypted; however, they don't need to be inspected, because the data type (infoType) is known.

The only column where an inspection transformation is recommended is the Additional Details column. This column is free-form and might contain PII, which, for the purposes of this example, should be detected and de-identified.

The examples in this table present five different de-identification transformations:

  • Two-way tokenization: Replaces the original data with a token that is deterministic, preserving referential integrity. You can use the token to join data or use the token in aggregate analysis. You can reverse or de-tokenize the data using the same key that you used to create the token. There are two methods for two-way tokenization:

    • Deterministic encryption (DE): Replaces the original data with a base64-encoded encrypted value and doesn't preserve the original character set or length.
    • Format-preserving encryption with FFX (FPE-FFX): Replaces the original data with a token generated by using format-preserving encryption in FFX mode. By design, FPE-FFX preserves the length and character set of the input text. To do so, it omits authentication and an initialization vector, both of which would expand the length of the output token. Other methods, like DE, provide stronger security and are recommended for tokenization use cases unless length and character-set preservation are strict requirements, such as backward compatibility with legacy data systems.
  • One-way tokenization, using cryptographic hashing: Replaces the original value with a hashed value, preserving referential integrity. However, unlike two-way tokenization, a one-way method isn't reversible. The hash value is generated by using a SHA-256-based message authentication code (HMAC-SHA-256) on the input value.

  • Masking: Replaces the original data with a specified character, either partially or completely.

  • Bucketing: Replaces a more identifiable value with a less distinguishing value.

  • Replacement: Replaces original data with a token or the name of the infoType if detected.
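
The one-way tokenization idea can be sketched locally with the Python standard library. This is not Cloud DLP's implementation and its output is not guaranteed to match a DLP token; it only illustrates how an HMAC-SHA-256 over the input value yields a deterministic, non-reversible token. The function name one_way_token and the demo key are assumptions for illustration.

```python
import base64
import hashlib
import hmac


def one_way_token(value: str, key: bytes) -> str:
    """Return a base64-encoded HMAC-SHA-256 of the value under the given key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


# Demo key for illustration only; a real key should be generated securely
# and never stored in source code.
DEMO_KEY = b"demo-key-do-not-use-in-production"

token_a = one_way_token("1234", DEMO_KEY)
token_b = one_way_token("1234", DEMO_KEY)
token_c = one_way_token("4321", DEMO_KEY)

# The same input always maps to the same token, so joins still work.
print(token_a == token_b)
# Different inputs map to different tokens.
print(token_a != token_c)
```

Because the token depends on the key, changing the key changes every token, which is why key rotation requires re-tokenizing the dataset.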

Method selection

Choosing the best de-identification method can vary based on your use case. For example, if a legacy app is processing the de-identified records, then format preservation might be important. If you're dealing with strictly formatted 10-digit numbers, FPE preserves the length (10 digits) and character set (numeric) of an input for legacy system support.

However, if strict formatting isn't required for legacy compatibility, as is the case for values in the Card Holder's Name column, then DE is the preferred choice because it has a stronger authentication method. Both FPE and DE enable the tokens to be reversed or de-tokenized. If you don't need de-tokenization, then cryptographic hashing preserves referential integrity, but the tokens can't be reversed.

Other methods, like masking, bucketing, date-shifting, and replacement, are good for values that don't need to retain full integrity. For example, an age value (for example, 27) that is bucketed to an age range (20-30) can still be analyzed, while the reduced uniqueness makes it less likely that the value leads to the identification of an individual.
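
The age example can be reproduced with a few lines of Python. This is a local sketch of fixed-size bucketing, not a call to Cloud DLP; the function name and the labels used for out-of-range values are assumptions chosen for readability.

```python
def fixed_size_bucket(value: int, lower_bound: int, upper_bound: int, bucket_size: int) -> str:
    """Map a number to the label of the fixed-size range that contains it."""
    if value < lower_bound:
        return f"<{lower_bound}"   # illustrative label for values below the range
    if value >= upper_bound:
        return f"{upper_bound}+"   # illustrative label for values above the range
    start = lower_bound + ((value - lower_bound) // bucket_size) * bucket_size
    return f"{start}-{start + bucket_size}"


print(fixed_size_bucket(27, 0, 100, 10))   # 20-30
print(fixed_size_bucket(30, 0, 100, 10))   # 30-40
print(fixed_size_bucket(105, 0, 100, 10))  # 100+
```

Wider buckets reduce uniqueness further but also reduce analytical precision, so the bucket size is a trade-off to tune per column.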

Token encryption keys

For cryptographic de-identification transformations, a cryptographic key, also known as a token encryption key, is required. The token encryption key that is used for de-identification encryption is also used to re-identify the original value. The secure creation and management of token encryption keys are beyond the scope of this document. However, there are some important principles to consider that are used later in the associated tutorials:

  • Avoid using plaintext keys in the template. Instead, use Cloud KMS to create a wrapped key.
  • Use separate token encryption keys for each data element to reduce the risk of compromising keys.
  • Rotate token encryption keys. Although you can rotate the wrapped key, rotating the token encryption key breaks the integrity of the tokenization. When the key is rotated, you need to re-tokenize the entire dataset.
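
The first two principles can be sketched as configuration. The following Python builds deterministic-encryption settings that reference a key wrapped by Cloud KMS, with a separate wrapped key per data element. The project, key ring, key names, surrogate names, and wrapped-key strings are all placeholders invented for this sketch, and both helper functions are hypothetical.

```python
# Placeholder Cloud KMS key name; replace with your own resource name.
KMS_KEY_NAME = (
    "projects/example-project/locations/global/"
    "keyRings/example-ring/cryptoKeys/example-key"
)


def deterministic_encryption_config(wrapped_key: str, kms_key_name: str, surrogate: str) -> dict:
    """Build a deterministic-encryption transformation that uses a KMS-wrapped key."""
    return {
        "cryptoDeterministicConfig": {
            "cryptoKey": {
                "kmsWrapped": {
                    "wrappedKey": wrapped_key,      # base64 of the wrapped token encryption key
                    "cryptoKeyName": kms_key_name,  # KMS key that can unwrap it
                }
            },
            "surrogateInfoType": {"name": surrogate},
        }
    }


def uses_wrapped_key(transformation: dict) -> bool:
    """Check that the transformation references only a KMS-wrapped key."""
    crypto_key = transformation["cryptoDeterministicConfig"]["cryptoKey"]
    return set(crypto_key) == {"kmsWrapped"}


# One wrapped key per data element; the values are placeholders, not real keys.
card_number = deterministic_encryption_config(
    "PLACEHOLDER_WRAPPED_KEY_CARD_NUMBER", KMS_KEY_NAME, "CARD_NUMBER_TOKEN"
)
card_holder = deterministic_encryption_config(
    "PLACEHOLDER_WRAPPED_KEY_CARD_HOLDER", KMS_KEY_NAME, "CARD_HOLDER_TOKEN"
)

print(uses_wrapped_key(card_number), uses_wrapped_key(card_holder))
```

Keeping the raw token encryption key out of the configuration means that reading the template alone is not enough to re-identify data; access to the KMS key is also needed.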

Cloud DLP templates

For large-scale deployments, use Cloud DLP templates to accomplish the following:

  • Enable security control with Identity and Access Management (IAM).
  • Decouple configuration information, and how you de-identify that information, from the implementation of your requests.
  • Reuse a set of transformations. You can use the de-identify and re-identify templates over multiple datasets.
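
The decoupling described in the list above can be sketched as two request bodies: one that stores the configuration as a template, and one that de-identifies data by referring to the template by name. The Python below only builds dictionaries in the JSON shape of the DLP API; the helper names, project ID, and template ID are hypothetical, and no API call is made.

```python
def create_template_body(template_id: str, display_name: str, deidentify_config: dict) -> dict:
    """Request body that stores a de-identification config as a template."""
    return {
        "templateId": template_id,
        "deidentifyTemplate": {
            "displayName": display_name,
            "deidentifyConfig": deidentify_config,
        },
    }


def deidentify_request_body(template_name: str, item: dict) -> dict:
    """Request body that de-identifies an item using a stored template."""
    return {"deidentifyTemplateName": template_name, "item": item}


# Placeholder configuration and identifiers for illustration.
config = {
    "infoTypeTransformations": {
        "transformations": [
            {"primitiveTransformation": {"replaceWithInfoTypeConfig": {}}}
        ]
    }
}
template_body = create_template_body("example-template", "Example de-identify template", config)
request_body = deidentify_request_body(
    "projects/example-project/deidentifyTemplates/example-template",
    {"value": "My email is user@example.com"},
)

# The data request carries no transformation details, only the template name.
print("deidentifyConfig" in request_body)
```

With this split, the people who run the pipeline need permission to use the template, but not to read or change the transformation methods and keys it contains.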


The final component of the reference architecture is viewing and working with the de-identified data in BigQuery. BigQuery is Google's data warehouse tool that includes serverless infrastructure, BigQuery ML, and the ability to run Cloud DLP as a native tool. In the example reference architecture, BigQuery serves as a data warehouse for the de-identified data and as a backend to an automated re-identification data pipeline that can share data through Pub/Sub.

To learn more about advanced applications of BigQuery ML and sensitive data, see Considerations for sensitive data within machine learning datasets.

What's next

  • Learn about using Cloud DLP to inspect storage and databases for sensitive data.
  • Learn about other pattern recognition solutions.
  • Explore reference architectures, diagrams, tutorials, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.


Article information

Author: Catherine Tremblay

Last Updated: 01/20/2023
