How to evaluate PII Detection output with Presidio Evaluator

Nguyen Trang
7 min read · Aug 28, 2024


Personally Identifiable Information (PII) refers to any information that can be used to identify an individual. Examples of PII include names, phone numbers, email addresses, and social security numbers. In the digital age, where data is constantly being collected and shared, the protection of PII has become a critical issue.

The PII detection process involves identifying, extracting, and masking PII across different data sources, and it plays a crucial role in protecting individuals’ privacy. However, the effectiveness of PII detection models largely depends on accurate evaluation methods. Responsible AI practices likewise stress the importance of accurate and reliable PII detection to avoid potential misuse or mishandling of sensitive data.

In this article, we will explore the token-based PII evaluation method and walk through its implementation using the presidio-evaluator package.

Content overview:

  1. What is token-based evaluation, and what are its key metrics?
  2. How to evaluate a PII detection model using presidio-evaluator
  3. Limitations of token-based evaluation

What is token-based evaluation and what are its key metrics?

Token-based evaluation assesses the performance of a PII detection model at the token level, rather than at the level of the whole document or entity. It checks how precisely each token that might be PII is identified in the text. For example, in the sentence “My name is Nguyen Trang”, this method involves tokenizing the sentence and then evaluating how well the model recognizes individual tokens like “Nguyen” and “Trang” as PII. As a result, this approach helps pinpoint exactly which parts of the PII the model detects correctly and can provide insights into specific areas for improvement.
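To make this concrete, here is a minimal sketch in plain Python that compares ground-truth and predicted tags token by token (the predicted tags are invented purely for illustration):

tokens = ["My", "name", "is", "Nguyen", "Trang"]
ground_truth = ["O", "O", "O", "PERSON", "PERSON"]
predicted = ["O", "O", "O", "PERSON", "O"]  # suppose the model misses "Trang"

# Compare the tags token by token
for token, truth, pred in zip(tokens, ground_truth, predicted):
    status = "match" if truth == pred else "mismatch"
    print(f"{token:>8}  truth={truth:<8} predicted={pred:<8} -> {status}")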

Key metrics of token-based evaluation

By comparing the ground-truth and predicted entity for each token, we can count the following outcomes:

· True Positive (TP): The model correctly matches a token with its entity, like identifying “John” as a Name.

· True Negative (TN): The model correctly leaves a non-PII token unlabeled, such as not tagging “the” as a Name.

· False Positive (FP): The model incorrectly labels a non-entity token, like mistaking “table” for an Animal.

· False Negative (FN): The model fails to recognize a true entity, such as missing “John” as a Name.

Then, critical derived metrics include:

· Precision: The ratio of correctly predicted positive observations to the total predicted positives; high precision means low false positives.

· Recall: The ratio of correctly predicted positive observations to all actual positives; high recall means low false negatives.

· F1 Score: The harmonic mean of Precision and Recall, which balances the two into a single accuracy measure.

The example below demonstrates how these metrics are calculated, starting from tokenization and ending with precision, recall, and F1 score.

Example of token-based evaluation
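As a minimal illustration, the following sketch derives precision, recall, and F1 from hypothetical token-level counts:

# Hypothetical token-level counts, for illustration only
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 8 / 10 = 0.80
recall = tp / (tp + fn)     # 8 / 12 = 0.67 (rounded)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")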

Evaluating a PII detection model using the presidio-evaluator package

Now that you understand the basic concepts of token-based evaluation, I want to showcase how to perform it using presidio-evaluator. Presidio is an open-source framework from Microsoft for detecting and anonymizing PII in text and images. The presidio-evaluator package extends Presidio’s capabilities, providing tools for generating data and evaluating custom PII detection models.

Setting up the environment

First, let’s set up our environment. Ensure you have Python installed, then install the necessary packages:

pip install presidio-evaluator
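The steps below also rely on spaCy for tokenization; if the default English model is not already installed, you may need to download it as well:

python -m spacy download en_core_web_sm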

Import libraries

Then, we need to import the necessary libraries. InputSample and Span are classes from the presidio-evaluator library that we will use to represent our data and its ground-truth spans of PII, respectively.

# Import libraries
from presidio_evaluator import InputSample, Span

Load data

Next, we will load our data. This data should be in the form of a list of InputSample objects. Each InputSample object represents a piece of text and contains a list of Span objects describing the ground-truth spans of PII in that text.

# Define the data as an InputSample object
sample = InputSample(
    full_text="My name is Trang Nguyen. I live in France.",
    spans=[
        Span(start_position=11,  # starting index of the PII in full_text
             end_position=23,    # ending index (exclusive) of the PII in full_text
             entity_value="Trang Nguyen",
             entity_type="PERSON"),
        Span(start_position=35,  # starting index of the PII in full_text
             end_position=41,    # ending index (exclusive) of the PII in full_text
             entity_value="France",
             entity_type="LOCATION"),
    ],
    create_tags_from_span=True,
)

In this example, I create an instance of InputSample whose full_text contains just two sentences: “My name is Trang Nguyen. I live in France.”. The ground-truth spans are declared in the `spans` parameter. Since the sample does not yet contain the ground-truth token tags needed for evaluation, we set `create_tags_from_span=True` so that `full_text` is tokenized and each token is tagged from the spans. By default, tokenization uses the `en_core_web_sm` spaCy model (`token_model_version`) and the IO tagging schema. You can change these parameters to use a different model and schema for tokenization.
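As a sketch of how those defaults can be overridden (assuming your installed version of InputSample exposes a `scheme` keyword alongside `token_model_version`; check its signature to confirm), you could request the BIO schema like this:

# Sketch: tokenize with an explicit spaCy model and the BIO schema.
# The `scheme` keyword name is an assumption; verify it against the
# InputSample signature in your installed presidio-evaluator version.
sample_bio = InputSample(
    full_text="My name is Trang Nguyen. I live in France.",
    spans=sample.spans,
    create_tags_from_span=True,
    scheme="BIO",
    token_model_version="en_core_web_sm",
)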

Now let’s iterate over the tokens and their ground-truth tags. In the code snippet below, the sample.tokens list contains the individual words and punctuation from the InputSample object’s full_text, and the sample.tags list holds a tag indicating each token’s entity type, such as ‘PERSON’ for a person entity or ‘O’ for non-PII tokens.

from collections import Counter

# Print each pair of token and ground-truth tag in the InputSample
for token, tag in zip(sample.tokens, sample.tags):
    print({token: tag})

# Count the number of tokens per entity
entity_counter = Counter()
for tag in sample.tags:
    entity_counter[tag] += 1
print("Count per entity:")
print(entity_counter.most_common())

Output:

{My: 'O'}
{name: 'O'}
{is: 'O'}
{Trang: 'PERSON'}
{Nguyen: 'PERSON'}
{.: 'O'}
{I: 'O'}
{live: 'O'}
{in: 'O'}
{France: 'LOCATION'}
{.: 'O'}
Count per entity:
[('O', 8), ('PERSON', 2), ('LOCATION', 1)]

Evaluate the model

Once our data is loaded, we can proceed to evaluate our PII detection model. In this example, we use Presidio Analyzer as the model to detect PII.

from presidio_evaluator.models import PresidioAnalyzerWrapper
model = PresidioAnalyzerWrapper()
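PresidioAnalyzerWrapper is just one implementation; to evaluate your own detector, you can wrap it in the same interface. Below is a minimal sketch of a toy custom model, assuming the BaseModel contract in which predict receives an InputSample and returns one tag per token (the class and its heuristic are invented for illustration):

from typing import List
from presidio_evaluator import InputSample
from presidio_evaluator.models import BaseModel

class ToyEmailModel(BaseModel):
    """A toy detector that tags any token containing '@' as an email."""

    def predict(self, sample: InputSample) -> List[str]:
        # Return exactly one tag per token; "O" marks non-PII tokens.
        return ["EMAIL_ADDRESS" if "@" in token.text else "O"
                for token in sample.tokens]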

Before conducting the evaluation, let’s examine how the model identifies the PII in the InputSample defined in the preceding step.

pii_prediction = model.predict(sample)
print("PII detection output by using PresidioAnalyzerWrapper model")
print(sample.full_text)
print(pii_prediction)

Output:

PII detection output by using PresidioAnalyzerWrapper model
My name is Trang Nguyen. I live in France.
['O', 'O', 'O', 'PERSON', 'PERSON', 'O', 'O', 'O', 'O', 'LOCATION', 'O']

As you can see, in this case the model correctly classified all tokens. Now, to evaluate the PII detection performance, we create an Evaluator object and pass the model to it. The Evaluator object evaluates this model on all samples in the dataset and calculates the model’s overall score:

from presidio_evaluator.evaluation import Evaluator
evaluator = Evaluator(model=model)

# Evaluate the results on all samples in the dataset
evaluation_results = evaluator.evaluate_all(dataset=[sample])
# Calculate the score based on evaluation_results
results = evaluator.calculate_score(evaluation_results)

Analyze the results

After evaluating the model, we can analyze the results. We first convert the evaluation results into a confusion matrix, a table commonly used to describe the performance of a classification model, and then print the per-entity precision and recall scores defined earlier.

import pandas as pd

entities, confmatrix = results.to_confusion_matrix()
print("Confusion matrix:")
print(pd.DataFrame(confmatrix, columns=entities, index=entities))
print("Precision and recall")
print(results)

Output:

Confusion matrix:
          LOCATION  O  PERSON
LOCATION         1  0       0
O                0  6       0
PERSON           0  0       2
Precision and recall
Entity      Precision    Recall    Number of samples
LOCATION      100.00%   100.00%                    1
PERSON        100.00%   100.00%                    2
PII           100.00%   100.00%                    3
PII F measure: 100.00%

Token-based evaluation limitations

Although token-based evaluation is a straightforward method that is both easy to implement and interpret, it does come with certain limitations. First, because it calculates metrics at the token level, it assigns equal weight to every token in a text, so multi-token entities (like phone numbers) carry more weight simply because they break down into more tokens, as the sketch below illustrates. Second, this method does not effectively measure the level of privacy protection. For instance, if a person’s name is mentioned multiple times in a given document, it is only truly “protected” if every mention is masked. Moreover, failing to detect a direct identifier, such as a person’s name, is far more harmful from a privacy perspective than failing to detect an indirect identifier.
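A small sketch with invented tags makes the weighting problem visible: missing a one-token name barely dents token-level recall, even though it is the more harmful miss:

# Hypothetical ground truth: a 1-token PERSON and a 5-token PHONE_NUMBER
ground_truth = ["PERSON"] + ["PHONE_NUMBER"] * 5
# Suppose the model misses the name but catches every phone-number token
predicted = ["O"] + ["PHONE_NUMBER"] * 5

tp = sum(t == p != "O" for t, p in zip(ground_truth, predicted))
fn = sum(t != "O" and t != p for t, p in zip(ground_truth, predicted))

# Prints 83%, even though only 1 of the 2 entities was fully detected
print(f"Token-level recall: {tp / (tp + fn):.0%}")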

In conclusion, while token-based evaluation offers a simple and interpretable method for assessing model performance, it is not without its shortcomings. The equal weighting of each token and the inability to effectively gauge privacy protection are significant limitations. These drawbacks underscore the need for more nuanced evaluation techniques to better address the complexities of privacy protection and entity recognition. By leveraging tools such as confusion matrices and precision and recall metrics, we can gain a deeper understanding of our model’s strengths and weaknesses, ultimately guiding us toward more robust and privacy-aware solutions.

Please note that the views expressed in this article are my own and do not necessarily reflect those of Microsoft, my current employer.

References:

· microsoft/presidio (GitHub): Context-aware, pluggable and customizable data protection and de-identification SDK for text and images

· microsoft/presidio-research (GitHub): Data-science tooling for developing new Presidio recognizers and for evaluating the entire system, specific PII recognizers, or PII detection models

· microsoft/dstoolkit-e2e-presidio-evaluation (GitHub): notebooks/003-lab-pii-detection-evaluation.ipynb
