The Right to Be Forgotten (RTBF), mandated by regulations like GDPR and CCPA, requires organizations to permanently delete personally identifiable information (PII) upon request within a defined timeframe.
RTBF poses a particular challenge in a data lake. A data lake is traditionally designed to store all data indefinitely, acting as an immutable storage layer where raw, unaltered data is kept for future analysis. This contradicts RTBF, which requires that some data, under certain conditions, be deleted permanently and not just made invisible.
Personally Identifiable Information
PII (Personally Identifiable Information) refers to any data that can be used to uniquely identify an individual. It includes direct identifiers (like names or Social Security numbers) and indirect identifiers (like IP addresses or location data) that, when combined, can reveal a person’s identity.
Sensitive PII, Direct Identifiers
Information that explicitly identifies a person and that, if exposed, could pose a risk of identity theft or fraud.
- Full name
- Social Security Number (SSN) / National Identification Number like the Danish CPR
- Passport number
- Driver’s license number
- Credit card number
- Personal phone number
- Personal email address
- Home address
- Biometric data (fingerprints, retina scans, facial recognition)
Quasi PII, Indirect Identifiers
Information that poses little risk on its own but that, when combined, could uniquely identify a person.
- Date of birth
- ZIP code
- IP address
- Device identifiers (IMEI, MAC address)
- Browsing history
- Geolocation data
- Employment details
PII vs. Sensitive Personal Data
PII is often confused with Sensitive Personal Data. Under GDPR, sensitive data (known as special categories of personal data) is subject to stricter processing rules than general PII. It includes:
- Racial/ethnic origin
- Political opinions
- Religious beliefs
- Health records
- Sexual orientation
- Genetic/biometric data (if used for identification)
RTBF Conditions
This is the tricky part, because it quickly becomes a judgement call: when must an RTBF request be honored?
- The data is no longer necessary for the original purpose it was collected.
- The individual withdraws consent, and there is no other legal basis to retain it.
- The individual objects to processing, and there are no overriding legitimate grounds.
- The data was processed unlawfully (e.g., collected without proper consent).
- The data must be erased due to a legal obligation (e.g., regulatory requirements).
But it is also important to understand when an RTBF request does not have to be honored.
- Legal or regulatory compliance requires retention (e.g., financial, tax, health records).
- The data is needed for public interest (e.g., research, scientific studies).
- The data is required for the exercise of free expression or journalism.
- It is necessary for legal claims (e.g., defending against lawsuits).
- The request is excessive or unreasonable (e.g., repeated requests without justification).
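The grounds and exemptions above can be encoded as data, so that a rule engine evaluates every request the same way. The following is a minimal sketch, not a legal decision procedure: `Ground`, `Exemption`, `ErasureRequest`, and `must_erase` are hypothetical names, and the all-or-nothing rule is a simplifying assumption (in practice an exemption may cover only part of a person's data, and borderline cases need human review).

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Ground(Enum):
    """Reasons an erasure request must be honored."""
    NO_LONGER_NECESSARY = auto()
    CONSENT_WITHDRAWN = auto()
    OBJECTION_UPHELD = auto()
    UNLAWFUL_PROCESSING = auto()
    LEGAL_OBLIGATION_TO_ERASE = auto()


class Exemption(Enum):
    """Reasons a request may be refused."""
    LEGAL_RETENTION = auto()
    PUBLIC_INTEREST = auto()
    FREE_EXPRESSION = auto()
    LEGAL_CLAIMS = auto()
    EXCESSIVE_REQUEST = auto()


@dataclass
class ErasureRequest:
    subject_id: str
    grounds: set = field(default_factory=set)
    exemptions: set = field(default_factory=set)


def must_erase(request: ErasureRequest) -> bool:
    """Assumed rule: erase when at least one ground applies and no exemption does."""
    return bool(request.grounds) and not request.exemptions
```

For example, a request with `Ground.CONSENT_WITHDRAWN` and no exemptions evaluates to `True`, while the same request with `Exemption.LEGAL_RETENTION` attached evaluates to `False`.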
There are deadlines to meet in order to be compliant.
- GDPR: Must respond to the request within one month (can be extended by two further months, to three months in total, for complex or numerous requests).
- CCPA: Must process the request within 45 days (can extend by another 45 days if necessary).
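To track these deadlines per request, the due date can be computed at intake. This is a rough sketch using only the standard library; `add_months` and `response_deadline` are hypothetical helpers, and the way a month is counted here (same day of the following month, clamped to the month's last day) is an assumption that should be checked against legal guidance.

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def response_deadline(received: date, regulation: str, extended: bool = False) -> date:
    """Return the latest response date for an erasure request."""
    if regulation == "GDPR":
        # One month, or three months in total when the extension applies.
        return add_months(received, 3 if extended else 1)
    if regulation == "CCPA":
        # 45 days, or 90 days in total when the extension applies.
        return received + timedelta(days=90 if extended else 45)
    raise ValueError(f"unknown regulation: {regulation}")


print(response_deadline(date(2024, 1, 31), "GDPR"))  # 2024-02-29
print(response_deadline(date(2024, 1, 31), "CCPA"))  # 2024-03-16
```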
Data ingestion and segregation
The first step is to implement segregation in the ingestion layer, splitting data by type (PII, Sensitive PII, Quasi-PII, Sensitive, and non-PII data elements) to enforce security, compliance, and governance. These five distinct layers should be linkable using a surrogate key, allowing downstream systems to consume the fully combined dataset or, in RTBF scenarios, access only the anonymized portions of the data. By clustering PII, Sensitive PII, Quasi-PII, and Sensitive data into separate layers, we can delete data from these layers without jeopardizing analytical processes, as the anonymized portions of the data remain intact. It also enables the enforcement of special access controls on sensitive data, preventing it from being included in general information modules.
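The idea can be illustrated with plain Python dictionaries standing in for the layer tables. This is only a sketch under assumptions: the column names, the `CLASSIFICATION` mapping, and the functions `ingest`, `erase`, and `combined_view` are all hypothetical, and a real lake would use tables keyed by the surrogate key rather than in-memory dictionaries.

```python
import uuid

LAYERS = ("pii", "sensitive_pii", "quasi_pii", "sensitive", "non_pii")
ERASABLE_LAYERS = LAYERS[:-1]  # everything except the non-PII layer

# Hypothetical column classification; a real lake would read this from a catalog.
CLASSIFICATION = {
    "customer_number": "pii",
    "email": "sensitive_pii",
    "zip_code": "quasi_pii",
    "health_status": "sensitive",
    "basket_value": "non_pii",
}


def new_store() -> dict:
    """Create one empty table (dict keyed by surrogate key) per layer."""
    return {layer: {} for layer in LAYERS}


def ingest(store: dict, record: dict) -> str:
    """Split one raw record across the layers under a shared surrogate key."""
    key = str(uuid.uuid4())
    for column, value in record.items():
        layer = CLASSIFICATION[column]  # unclassified columns fail fast (KeyError)
        store[layer].setdefault(key, {})[column] = value
    return key


def erase(store: dict, key: str) -> None:
    """Honor an RTBF request by dropping the key from every erasable layer."""
    for layer in ERASABLE_LAYERS:
        store[layer].pop(key, None)


def combined_view(store: dict, key: str) -> dict:
    """Join whatever remains for a surrogate key across all layers."""
    merged = {}
    for layer in LAYERS:
        merged.update(store[layer].get(key, {}))
    return merged
```

After `erase`, `combined_view` returns only the non-PII columns for that key, so analytical joins on the surrogate key keep working. Note that finding the right surrogate key for a data subject needs a lookup from identity to key, and that lookup is itself PII, so it belongs in an erasable layer.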

- Deletion: data is deleted by a specialized rule engine, enabling us to stay compliant in the Bronze layer.
- Pseudonymization: tokenized data for privacy and retention. Instead of storing raw PII, store tokens or hashed values in the data lake, for example a hashed email address. The benefit is RTBF compliance without full deletion or exposure of the data, so analytics can still use the anonymized data.
- Surrogate keys: assigning a surrogate key makes data relatable across the 5 data layers.
- Governance: enables us to enforce table- and column-level security (Databricks Unity Catalog / Table ACLs).
- Segregation: raw data is split into 5 buckets, depending on its sensitivity or PII classification.
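As an illustration of the hashed email example, here is a minimal sketch of pseudonymization with a keyed hash. It swaps in HMAC-SHA256 for a plain hash, because unkeyed hashes of email addresses can be reversed with a dictionary of known addresses. The function names and `TOKEN_KEY` are hypothetical, and where the key is stored and rotated is left as an assumption.

```python
import hashlib
import hmac
import secrets


def normalize_email(email: str) -> str:
    """Trim and lowercase so the same address always yields the same token."""
    return email.strip().lower()


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 token; the same input and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-deployment secret; in practice it would come from a secret manager.
TOKEN_KEY = secrets.token_bytes(32)

token = pseudonymize(normalize_email(" Alice@Example.com "), TOKEN_KEY)
```

Because the token is deterministic for a given key, it can still be used for joins and distinct counts. Without the key, tokens cannot be recomputed from known email addresses, which is why the key must be protected at least as carefully as the raw PII.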