Data De-Identification
Data Analytics and AI
The ability to analyze data effectively has become more powerful than ever before. AI-driven data analytics can unlock insights that drive innovation and improve outcomes. However, this data still needs to be protected, especially when dealing with sensitive information that could compromise individual privacy. Balancing the promise of AI with the need to protect privacy is a critical challenge. Data de-identification allows organizations to leverage the many uses of AI while also minimizing potential risk.
De-Identifying Data
AI services have great capabilities that can increase the quality and efficiency of data analytics. However, data analytics often deals with sensitive data that should not be shared without an appropriate privacy agreement in place. Because of this concern, data can be de-identified so that private information is never exposed to an AI service while the organization still benefits from the service's strengths.
Personal Data
Personal data is any information related to an identifiable person. Personal data includes a wide variety of direct identifiers, as well as indirect identifiers.
Direct Identifiers vs Indirect Identifiers
Direct identifiers are anything that can directly identify an individual (full name, social security number, etc). Indirect identifiers are pieces of information that do not identify an individual on their own, but can be used alongside other identifiers to identify an individual (race, ethnicity, age, zip code, birthday, etc). Neither direct nor indirect identifiers should be present in data that is being shared with an AI service.
De-Identified Data vs Anonymized Data
Within the privacy world, it is important to note the difference between de-identified and anonymized data. De-identified data is NOT the same as anonymized data, and should never be presented or marketed as such. Data can only be claimed as anonymized when the entirety of the data across an organization contains no direct or indirect identifiers that could lead to the identity of any person.
Types of Analyses
Once data has been de-identified, there are many types of analyses that can be performed in order to draw conclusions from the data. Common types of analyses used across different career fields are listed below:
- Market Basket Analysis
- Determining which products users frequently purchase together.
- Market Trend Analysis
- Identifying trends in a given market to anticipate demand and adapt strategies.
- Campaign Performance Analysis
- Evaluating success of marketing campaigns in terms of conversions, engagement, or recruitments.
- Sentiment Analysis
- Analyzing customer or employee feedback, reviews, or engagement to understand how they feel about the brand, product, or service.
- Cost-Benefit Analysis
- Comparing costs to the benefit of an action or decision.
- Sales or Usage Forecasting
- Predicting future sales or usage based on historical data and market conditions.
- Process Efficiency Analysis
- Examining workflows to identify bottlenecks or areas of improvement.
- Survey Analysis
- Drawing insights from customer preferences in order to direct areas of improvement or focus.
De-identification Checklist
This checklist should be used before inputting data into an AI platform. Sensitive information should only be processed in an AI tool when it is absolutely necessary, and protected by a prior agreement.
1. Determine goal of data analysis or processing.
2. Review goal with privacy principles in mind.
3. Determine direct and indirect identifiers in the data set.
4. Determine minimal acceptable data utility.
5. Select a de-identification method (see Methods for De-Identification Tab).
6. Evaluate risk of re-identification (the process of reversing the de-identification of data, thereby linking previously de-identified information back to specific individuals).
7. Import sample data to test data transformation.
8. Evaluate usability of data output.
9. Refine as necessary and apply data transformations to entire data set.
10. Export the transformed data and use as intended.
Methods for Data De-Identification
Note: Below are summaries of each data de-identification method. Please ensure you thoroughly understand any given method before using it with data.
There are many ways to de-identify data. Certain methods work better with certain kinds of data, and becoming familiar with which methods work best with which data sets is a great way to leverage the capabilities of AI while still adhering to BYU's privacy policies.
Note: It is important to recognize that using de-identification does not remove all risk related to processing personal data, but instead greatly reduces the risk when completed effectively.
Omission and Sub-Sampling
Omission is the simplest method of data de-identification: the practice of not including direct or indirect identifiers in the data set at all. Sub-sampling shares only a sample of the full data set with the AI model, reducing the probability of re-identification.
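As a minimal sketch of both techniques in Python (the records, field names, and values below are invented purely for illustration):

```python
import random

# Hypothetical records; "name" and "zip" stand in for identifier fields.
records = [
    {"name": "Ann", "zip": "84601", "score": 88},
    {"name": "Ben", "zip": "84602", "score": 75},
    {"name": "Cal", "zip": "84604", "score": 91},
    {"name": "Dee", "zip": "84606", "score": 67},
]

# Omission: build the shared data set without the identifier fields.
OMITTED_FIELDS = {"name", "zip"}
omitted = [{k: v for k, v in r.items() if k not in OMITTED_FIELDS}
           for r in records]

# Sub-sampling: share only a random subset of the omitted records.
random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.sample(omitted, k=2)
```

In practice the omitted fields would be chosen from the full inventory of direct and indirect identifiers in the data set, not hard-coded like this.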
Suppression
Suppression is another basic method of data de-identification and involves removing values or columns from the data to reduce the risk of re-identification. Suppression includes removing quasi-identifiers, which are identifiers that do not uniquely identify an individual in most cases but can be used in combination with other quasi-identifiers to do so (ex. job title is not a quasi-identifier unless it is a unique job and the company information is also present in the data set).
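A small Python sketch of cell-level suppression, where a quasi-identifier value is blanked rather than the whole column being dropped (the field names here are illustrative assumptions):

```python
# Hypothetical records; the unique "Chief Archivist" title could identify
# someone once combined with the department column.
records = [
    {"job_title": "Accountant", "dept": "Finance", "tenure": 4},
    {"job_title": "Chief Archivist", "dept": "Library", "tenure": 12},
]

QUASI_IDENTIFIERS = {"job_title"}

def suppress(record, fields=QUASI_IDENTIFIERS):
    """Replace quasi-identifier values with None, keeping other fields."""
    return {k: (None if k in fields else v) for k, v in record.items()}

suppressed = [suppress(r) for r in records]
```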

Generalization
Generalization is the method of broadening the data set to reduce the direct link to individuals. This can be accomplished by reporting values as ranges instead of unique values (ex. the zip code 84057 could be generalized to the category 84000-84500). Generalization can be applied to entire data sets or just specific records as needed.
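The ZIP code example above can be sketched in Python; the bucket width of 500 is an assumption taken from the example, not a recommended value:

```python
def generalize_zip(zip_code, width=500):
    """Report a ZIP code as a range instead of a unique value."""
    z = int(zip_code)
    low = (z // width) * width  # snap down to the start of the bucket
    return f"{low:05d}-{low + width:05d}"
```

For example, `generalize_zip("84057")` yields the category `"84000-84500"`, matching the example in the text. The same idea applies to ages, dates, and salaries reported as ranges.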

Swapping
Swapping is a more complicated method of data de-identification and includes exchanging values between records, within defined levels of generalization (ex. you could exchange the hometown of two individuals if you were studying correlation between age and county of residence). Swapping should be handled with care to ensure that results of the data set are correct and authentic.
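A rough Python sketch of swapping one field across records while leaving the fields under study untouched (the data and field names are invented for illustration):

```python
import random

records = [
    {"age": 21, "county": "Utah"},
    {"age": 34, "county": "Salt Lake"},
    {"age": 22, "county": "Cache"},
    {"age": 35, "county": "Davis"},
]

def swap_field(records, field, rng):
    """Shuffle one field's values across records; other fields stay put."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]

swapped = swap_field(records, "county", random.Random(0))
```

Note the caution from the text: swapping the very field whose correlation you are studying would corrupt the results, so the swapped field must be one that is not part of the analysis.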

Random Noise Addition
Random noise addition is the process of adding random data or columns to the data set so that records no longer match any individual person exactly. It can also be done by adding or subtracting random amounts from less important numeric fields, further obscuring identifying characteristics. A related approach replaces certain values with category labels in a manner that is consistent throughout the data set (ex. ages 18-21 are given category A, 22-26 are category B, etc.).
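The numeric-perturbation variant can be sketched in Python; the noise scale of 2.0 is an arbitrary illustration, and a real choice would depend on how much distortion the analysis can tolerate:

```python
import random

def add_noise(values, scale, rng):
    """Perturb each numeric value by a random amount in [-scale, scale]."""
    return [v + rng.uniform(-scale, scale) for v in values]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
ages = [23, 45, 31, 60]
noisy = add_noise(ages, scale=2.0, rng=rng)
```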

Aggregation
Aggregation is the process of taking individual data sets and combining them in order to analyze trends while protecting individual privacy by using groups of individuals with similar characteristics rather than isolating one individual at a time. A key to aggregation is examining what groupings of individuals will be effective to meet analytical goals.
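A minimal Python sketch of aggregation, reporting counts per group instead of one record per individual (the records and groupings here are illustrative assumptions):

```python
from collections import Counter

# Individual-level records that will only ever be reported as group counts.
records = [
    {"college": "Engineering", "employed": True},
    {"college": "Engineering", "employed": False},
    {"college": "Business", "employed": True},
    {"college": "Business", "employed": True},
]

# Aggregate: count individuals per (college, employment status) group.
counts = Counter((r["college"], r["employed"]) for r in records)
```

As the text notes, choosing the groupings is the key step: groups must be coarse enough to protect individuals (very small groups are themselves a re-identification risk) yet fine enough to answer the analytical question.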

Redaction
Redaction erases or masks identifiers in all data records, using techniques like pixelation or blacking out.
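For text records, redaction can be sketched with pattern matching (pixelation applies to images and is out of scope here). The two patterns below are illustrative; real redaction must cover every identifier format actually present in the data:

```python
import re

# Example patterns: US social security numbers and email addresses.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact(text):
    """Black out matched identifiers with a fixed placeholder."""
    text = SSN_RE.sub("[REDACTED]", text)
    return EMAIL_RE.sub("[REDACTED]", text)

note = "Contact cougar@example.edu, SSN 123-45-6789, about the survey."
```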

Pseudonymization
Pseudonymization is the process of replacing names and other identifying information in the data with pseudonyms (ex. an individual named Cougar is recorded as "r543c" and voted yes on the poll). Pseudonymization can be difficult, as a balance needs to be struck between risk and usefulness of the data.
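A minimal Python sketch that assigns each distinct name a random pseudonym, consistently across the data set (the names and pseudonym format are illustrative):

```python
import secrets

def build_pseudonyms(names):
    """Map each distinct name to a random pseudonym, reused consistently."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = "p" + secrets.token_hex(3)  # e.g. "p4f2a1c"
    return mapping

names = ["Cougar", "Cosmo", "Cougar"]
mapping = build_pseudonyms(names)
pseudonyms = [mapping[n] for n in names]
```

The mapping table itself is sensitive: if it leaks, re-identification is trivial, so it must be stored securely and separately from the shared data.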

Hashing
Hashing applies a mathematical algorithm that transforms a string of characters into a new fixed-length value representing the original (ex. a payment method ending in 1234 becomes the hash akd427).
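A short Python sketch using SHA-256 from the standard library (the input value and salt are invented; the short hash in the text above is illustrative, while real SHA-256 digests are 64 hex characters). Salting is an added precaution against precomputed lookup tables:

```python
import hashlib

def hash_value(value, salt):
    """One-way hash of a value; the salt resists precomputed lookups."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

# Hashing the same value with the same salt always yields the same digest,
# so hashed values can still be joined or counted across records.
h1 = hash_value("4111-1111-1111-1234", salt="per-dataset-secret")
h2 = hash_value("4111-1111-1111-1234", salt="per-dataset-secret")
```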

Tokenization
Tokenization is a process of substituting a piece of sensitive data with a non-sensitive substitute called a token. Tokens can be created via hashing, a randomly generated identifier, etc.
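A toy Python sketch of a token vault using randomly generated tokens; real tokenization systems keep the lookup table in a hardened, access-controlled store, which this illustration does not attempt:

```python
import secrets

class TokenVault:
    """Minimal vault: sensitive value -> opaque token, with a private
    lookup table so authorized users can resolve tokens later."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value):
        if value not in self._to_token:
            token = secrets.token_urlsafe(8)  # random, meaningless token
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token):
        return self._to_value[token]

vault = TokenVault()
t = vault.tokenize("123-45-6789")
```

Unlike a hash, a token carries no mathematical relationship to the original value; security rests entirely on protecting the vault.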

Encryption
Encryption is a process of scrambling information so it can only be read by someone with the "key" to unscramble the information (ex. an exported data set can be encrypted so that only an authorized recipient who holds the key can read it). Displaying only the last four digits of a number such as a social security number is masking, a form of redaction, rather than encryption.
Red Flags in Data De-identification
Getting familiar with all the definitions and methods for data de-identification can sometimes feel overwhelming. In this section, some issues will briefly be discussed to explain how to navigate these topics. One important thing to note: If anyone claims their de-identification methods have removed ALL risk related to processing of data, they are not to be trusted.
Data Still Contains Identifiable Details
- If a dataset still includes names, addresses, phone numbers, or email addresses, it has not been properly de-identified.
- Example: Even indirect details (like exact birthdates, small location areas, or unique job titles) can be used to re-identify someone.
Overly Specific Information Remains
- If data includes rare job titles, small groups (e.g., "Only 3 people in this city have this degree"), or unique combinations of traits, it may still be traceable.
- Example: Instead of listing “Professor of Ancient Egyptian Art at XYZ University,” the data should generalize to “University Faculty.”
Data Can Be Matched to Other Public Information
- If a dataset could easily be cross-referenced with public records, social media, or online profiles, it is not truly de-identified.
- Example: If a dataset includes gender, zip code, and birth year, someone could match it with voter records and identify individuals.
"Anonymous" Data That Still Allows Tracking
- If a dataset claims to be anonymous but still assigns a unique code or identifier to each person, it may not be fully de-identified.
- Example: A study replaces names with ID numbers, but if the same ID is used across multiple datasets, someone could figure out who it belongs to.
Incomplete or Weak De-identification Techniques
- Simply removing names is not enough—proper methods like data masking, generalization, and pseudonymization should be used.
- Example: If a company claims data is de-identified but has not applied these proper measures, the data could be at risk.
Lack of Transparency About the Process
- If an organization won’t explain how they de-identify data or what safeguards they use, they may not be following best practices.
- Example: A company says your data is "anonymized" but doesn’t clarify how. This could mean they’re not using proper techniques.
Data Shared Without Proper Protections
- If your de-identified data is widely shared with third parties, there is a higher risk of re-identification.
- Example: An organization does not have strict policies on how de-identified data is stored, shared, and used.
"Opting Out" Is Difficult or Impossible
- If a company collects and de-identifies your data but does not let you opt out or request deletion, it could be a warning sign.
- Example: A website tracks your behavior and claims the data is de-identified, but there’s no way to prevent your data from being included.
Re-Identification Risks Are Not Addressed
- Even if data is de-identified, companies should monitor for re-identification risks and take steps to prevent them.
- Example: If a dataset is later combined with other information and people start being re-identified, the organization should have a process to fix it.
Common Questions About Data De-Identification
- Does de-identification mean my personal data is deleted?
No, de-identification doesn’t delete your data. It removes or changes personal details so that no one can tell the data belongs to you.
- Can someone still figure out who I am from de-identified data?
It depends on the robustness of the de-identification methodology. If de-identified data is combined with other information, someone might be able to guess who it belongs to. That’s why strong privacy measures should be used.
- Why do universities and companies use de-identified data?
They use it to study trends, improve services, and make decisions while protecting people’s privacy. For example, universities might analyze student performance without revealing anyone’s identity.
- How is de-identification different from just removing names?
Simply removing names isn’t enough. Other details, like birth dates or locations, could still reveal identities. De-identification changes or removes multiple pieces of information to make sure data stays private.
- Can de-identified data be re-identified later?
In some cases, yes, especially if the data is combined with other sources. That’s why organizations use extra steps, like grouping data or limiting access, to prevent this.
- What are some real-world examples of de-identification?
Hospitals remove patient names from medical records before sharing them for research. Schools analyze student test scores without showing individual results. Companies track website visits without storing users’ personal details.
- Do all types of personal data need to be de-identified?
Not always. Some data, like publicly available information, doesn’t need de-identification. But sensitive data such as health records, financial details, and student records must be protected.
- Is de-identification only used for research, or are there other uses?
It’s used for many purposes, like public health reporting, improving services, and detecting fraud, all while keeping personal details private.
- How does de-identification affect the accuracy of data?
It depends on the method used. Some details may be generalized (like showing age groups instead of exact ages), but the overall data can still be useful for analysis.
- What should I do if I think my de-identified data was misused?
You can report your concerns to the organization that collected the data. Many institutions have privacy officers who handle data protection issues.
Key Terms
Personal Data: Any information related to an identifiable person. Personal data includes a wide variety of direct identifiers, as well as indirect identifiers.
Direct Identifiers: Any information that can directly identify an individual (full name, social security number, etc.).
Indirect Identifiers: Identifiers that do not identify an individual on their own, but can be used alongside other identifiers to identify an individual (race, ethnicity, age, zip code, birthday, etc.).
De-identified Data: A data set that has undergone a data de-identification method in order to remove or hide all direct and indirect identifiers.
Quasi-identifiers: Identifiers that do not uniquely identify an individual in most cases but can be used in combination with other quasi-identifiers to do so (ex. job title is not a quasi-identifier unless it is a unique job and the company information is also present in the data set).
Re-identification: The process of reversing the de-identification of data, thereby linking previously de-identified information back to specific individuals.