Conceptual diagram showing the BERT-powered middleware

Adding a Safety Layer in Generative AI Systems

Robert McDermott


Preventing toxic behavior and disclosure of sensitive information

Background

When implementing generative AI-based solutions, it is important to prevent unethical uses such as identity attacks, threats of violence, obscenity, or other harmful language. Another common concern is data leakage to third-party services or the use of sensitive data, such as Personally Identifiable Information (PII) or Protected Health Information (PHI), in systems not authorized to process such information.

This article discusses concepts that can be used to create lightweight middleware that sits between your user interface or APIs and the backend language model, to detect, filter, block, or de-identify content from human inputs or outputs from LLMs.

While LLMs can be fine-tuned to avoid responding to toxic, unethical, illegal prompts, or those containing protected health information, there are hundreds (if not thousands) of LLMs to choose from, each with varying alignments or safeguards (or sometimes none at all). Using small, fast BERT models (which use an “encoder-only” transformer architecture) between your users and the backend language model ensures a consistent and reliable safety layer, regardless of which backend LLM or service is used. BERT models are compact and typically add only a second or two of latency to the overall response time of your generative AI application.

Although I provide some code examples in this article, there are no ready-to-use middleware solutions provided. You will need to determine how to implement these concepts within your specific systems and use cases.

Detecting harmful content with ‘detoxify’

👉 NOTE: I don’t provide any examples of toxic, violent, identity-attacking, or obscene text in this section, both to avoid offending anyone and to keep this article from being automatically flagged as violating a policy. You can use the provided code to do your own testing.

We will first look at how to automate the detection of toxicity. We’ll be using the detoxify package, which is built on a small BERT model created as part of a toxic comment classification challenge on Kaggle.

The first thing we’ll need to do is install the detoxify Python module and the other requirements:

pip install detoxify transformers pandas torch

To make it easy for folks to try out, I’ve created a simple command-line utility that accepts text input and returns the toxicity score, along with an exit code indicating whether the toxicity level exceeds the configurable threshold. An exit code of “0” means the toxicity level is below the threshold, while “1” indicates it is above the threshold.

import sys
import pandas as pd
import argparse
from detoxify import Detoxify

def toxicity_score():
    parser = argparse.ArgumentParser(description='Detect toxic language using Detoxify model')
    parser.add_argument('input', type=str, help='Text input to analyze for toxicity')
    parser.add_argument('--model', type=str, default='unbiased', choices=['unbiased', 'original'], help='Model to use')
    parser.add_argument('--device', type=str, default='cpu', choices=['cpu', 'cuda', 'mps'], help='Device to use')
    parser.add_argument('--threshold', type=float, default=0.5, help='Toxicity threshold (default: 0.5)')
    args = parser.parse_args()
    results = Detoxify(args.model, device=args.device).predict(args.input)
    results = pd.DataFrame([results]).round(4)
    return results, args.threshold

if __name__ == '__main__':
    results, threshold = toxicity_score()
    print(results)
    toxicity = list(results.get('toxicity', 0))[0]
    sys.exit(0 if toxicity < threshold else 1)

The very first time you run the script, it will automatically download a small (~500MB) BERT transformer model. By default, the script runs the model on your CPU. To run it on your GPU, provide the --device mps flag on Apple Silicon systems (M1 or later) or --device cuda if you have an Nvidia GPU.

This model scores the input text across the following categories:

  • Toxicity
  • Severe Toxicity
  • Obscenity
  • Identity Attack
  • Insult
  • Threat
  • Sexually Explicit

Usage of the provided script:

python toxicity-score.py "text to check and score for toxicity"

Examples of checking and scoring text

Non-toxic input example
Toxic input example

You can take this example and expand it to score a collection of text, such as sentences in a document, messages in a chat thread, or any other collection of text you want to check; a small batch-scoring sketch follows the results table. The table below shows the results of scoring over a dozen items, a mix of toxic inputs and safe ones:

toxicity  severe_toxicity  obscene  identity_attack   insult   threat  sexual_explicit
 0.97665          0.01943  0.07709          0.05396  0.08322  0.92437          0.01017
 0.98105          0.01184  0.40984          0.62485  0.89753  0.00624          0.01359
 0.00051          0.00000  0.00003          0.00008  0.00011  0.00002          0.00001
 0.98189          0.09601  0.95745          0.02437  0.45696  0.00250          0.96087
 0.95797          0.00583  0.07051          0.89684  0.75983  0.00574          0.00293
 0.96898          0.01161  0.01417          0.86536  0.18288  0.66921          0.00611
 0.00076          0.00000  0.00003          0.00015  0.00018  0.00004          0.00001
 0.70879          0.00048  0.00348          0.01227  0.00717  0.55968          0.00098
 0.97653          0.01512  0.03611          0.13018  0.08783  0.92815          0.00676
 0.99224          0.13720  0.95218          0.02334  0.39137  0.03205          0.96928
 0.00053          0.00000  0.00003          0.00008  0.00012  0.00002          0.00002
 0.98120          0.00511  0.04703          0.94185  0.89853  0.00200          0.00529
 0.96088          0.00006  0.00219          0.00199  0.93293  0.00014          0.00059
 0.20143          0.00028  0.00231          0.26141  0.01058  0.00097          0.00264
 0.00074          0.00000  0.00009          0.00008  0.00010  0.00004          0.00005

This BERT model is small, so it runs fast; it can check input text in about a second, even on a CPU.
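To illustrate the batch idea mentioned above, here is a minimal sketch that scores several messages in one call. The example messages are made up; it relies on Detoxify’s predict() accepting a list of strings, which the package supports.

# Minimal sketch: score a small batch of messages in one call.
# The messages below are made up for illustration.
import pandas as pd
from detoxify import Detoxify

messages = [
    "Thanks for the update, see you at the meeting.",
    "Could you review my pull request when you get a chance?",
]

model = Detoxify('unbiased', device='cpu')   # load the model once, reuse it for every check
scores = pd.DataFrame(model.predict(messages)).round(4)
scores.insert(0, 'text', messages)           # keep the original text next to its scores
print(scores)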

Implementing as Middleware

In the previous section, we explored the Detoxify module and the associated BERT model for detecting and scoring the toxicity of provided text inputs. In this section, I’ll show an example of how this could be used as middleware between a user-facing chat app and the backend LLM.

This is not a stand-alone working example; you’ll need to figure out how to fit it into your own application. The following is the critical section of a filter class registered on the pipelines server, which has been integrated with an Open WebUI AI chat application:

async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
    # This filter is applied to input before it is sent to the selected LLM.
    user_message = body["messages"][-1]["content"]
    # Detect level of toxicity
    toxicity = self.model.predict(user_message)
    if toxicity["toxicity"] > 0.8:
        raise Exception("""⚠ Your prompt violates company standards of professional conduct.
💣 Repeated violations may result in loss of system access or other actions.
👉 Review the standards here https://company.com/compliance-office/standards.html""")
    # Otherwise, pass the unmodified request body through to the LLM
    return body

With this filter in place and integrated with the application I’m using (Open WebUI), it will intercept any user inputs (inlet) before they reach the selected LLM (Llama 3.1 in the example below) and display a notice indicating that the user has violated the company’s standards of professional conduct:

Example of the toxicity filter in action in an internal ChatGPT-like application

This example detects and blocks the use of toxic language, but logging could be added to gather metrics on users' violations of the acceptable use policy. Depending on the frequency or severity of the violations, the consequences could range from a warning to loss of system access.
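As a sketch of what that logging might look like, the inlet filter above could record the violation before raising the exception. The logger name and logged fields below are illustrative, not part of Open WebUI:

import logging
from typing import Optional

audit_log = logging.getLogger("toxicity_audit")   # illustrative logger name

async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
    user_message = body["messages"][-1]["content"]
    scores = self.model.predict(user_message)
    if scores["toxicity"] > 0.8:
        # Record who triggered the filter and how badly, then block the prompt
        audit_log.warning("toxicity violation user=%s score=%.2f",
                          (user or {}).get("email", "unknown"), scores["toxicity"])
        raise Exception("⚠ Your prompt violates company standards of professional conduct.")
    return body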

Detecting, Blocking or De-identifying PII and PHI

👉 NOTE: The clinical note used as an example in this section is fictional. Also, this section only explores how such a check could be done; check with your compliance office (or equivalent) before attempting to use anything like this in use cases that process actual protected health information.

We can apply the same method used in the previous section on toxicity to detect, block, or de-identify PII and PHI before it is sent to the backend LLM. For instance, you may want to allow your staff to use a ChatGPT-like system but need to prevent them from sending any PHI to the LLM hosted by third-party vendors like OpenAI, Azure, AWS, or Anthropic. While you may have a written policy prohibiting the use of PHI in such systems, it still relies on users adhering to that policy. By implementing middleware using this method, you can not only prevent unauthorized use of PHI but also log any attempts to violate the policy.

As with toxicity detection, we’ll be using a BERT-style model, in this case one trained on medical records specifically to detect PHI. The repository containing more information about this model is titled “Robust DeID: De-Identification of Medical Notes using Transformer Architectures,” and there are two models available on Huggingface:

  • obi/deid_roberta_i2b2 (the default in the scripts below)
  • obi/deid_bert_i2b2

As in the toxicity section, we’ll need to install the transformers and torch Python modules to use these models:

pip install transformers torch

As before, I’ve created a simple command-line utility to make it easy to try out. The following script takes text input and reports any PII or PHI elements found in the provided text:

import argparse
from transformers import pipeline

def detect(note, model, device):
    pipe = pipeline("token-classification", model=model, device=device)
    results = pipe(note)
    for res in results:
        if res['score'] > 0.1:
            print(f"Entity: {res['entity']}, Score: {res['score']:.2f}, Start: {res['start']}, End: {res['end']}")

def main():
    parser = argparse.ArgumentParser(description="Detect PHI entities in medical notes")
    parser.add_argument('note', type=str, help='The medical note text to process for entity detection')
    parser.add_argument('--model', type=str, default='obi/deid_roberta_i2b2', help='The model to use for token classification')
    parser.add_argument('--device', type=str, default='cpu', choices=['cpu', 'cuda', 'mps'], help='The device to use for inference (cpu, cuda, or mps)')
    args = parser.parse_args()
    detect(args.note, args.model, args.device)

if __name__ == '__main__':
    main()

The very first time you run this script, it will download a ~1GB BERT-style transformer model; it only needs to be downloaded once. By default, the script runs the model on your CPU. To run it on your GPU, provide the --device mps flag on Apple Silicon systems (M1 or later) or --device cuda if you have an Nvidia GPU.

Usage:

python phi-detection.py "Text to detect and report PHI"

Example with fake PHI:

Example running the detection script against fictional PHI

The output of the script provides the type of each identifying ‘entity’, the probability score (1.0 = 100%), and the ‘start’ and ‘end’ character offsets of the flagged tokens in the provided text (tokens often don’t map to complete words). The following is an example of the output; a short sketch after the listing shows how the offsets map back to the original text:

Entity: U-DATE, Score: 1.00, Start: 40, End: 42
Entity: U-DATE, Score: 1.00, Start: 42, End: 43
Entity: U-DATE, Score: 0.83, Start: 43, End: 45
Entity: B-PATIENT, Score: 1.00, Start: 98, End: 102
Entity: L-PATIENT, Score: 1.00, Start: 103, End: 105
Entity: L-PATIENT, Score: 1.00, Start: 105, End: 110
Entity: U-AGE, Score: 1.00, Start: 112, End: 114
Entity: B-LOC, Score: 1.00, Start: 158, End: 161
Entity: I-LOC, Score: 1.00, Start: 162, End: 166
Entity: L-LOC, Score: 1.00, Start: 167, End: 172
Entity: U-PHONE, Score: 1.00, Start: 208, End: 211
Entity: U-PHONE, Score: 0.88, Start: 211, End: 212
Entity: U-PHONE, Score: 0.77, Start: 212, End: 215
Entity: B-STAFF, Score: 1.00, Start: 290, End: 295
Entity: I-STAFF, Score: 1.00, Start: 296, End: 297
Entity: L-STAFF, Score: 1.00, Start: 298, End: 303
Entity: U-HOSP, Score: 0.86, Start: 360, End: 361
Entity: L-ID, Score: 0.43, Start: 361, End: 363
Entity: L-ID, Score: 0.97, Start: 363, End: 364
Entity: L-ID, Score: 0.34, Start: 364, End: 366
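Since ‘start’ and ‘end’ are character offsets into the input, the exact text that was flagged can be recovered with a simple slice. A tiny illustrative example (the sentence and offsets below are made up, not taken from the output above):

# Illustrative only: recover a flagged span from its character offsets
note = "Seen in clinic on 07/01/19 by Dr. Wilson."
flagged = {"entity": "U-DATE", "start": 18, "end": 26}    # as reported by the pipeline
print(note[flagged["start"]:flagged["end"]])              # prints: 07/01/19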

Now that we have code capable of detecting and identifying PHI elements in a given text input, we can create a filter similar to the one in the previous section. This filter will detect and block the submission of such data before it is sent to the backend LLM, which might be hosted by a third party.
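A minimal sketch of what that filter’s inlet method might look like, following the same pipelines pattern as the toxicity example. It assumes the filter class loads the token-classification pipeline into self.pipe at startup; the 0.1 threshold and the message wording are illustrative:

from typing import Optional

async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
    # Assumes self.pipe was created at startup, e.g.:
    # self.pipe = pipeline("token-classification", model="obi/deid_roberta_i2b2", device="cpu")
    user_message = body["messages"][-1]["content"]
    entities = [e for e in self.pipe(user_message) if e["score"] > 0.1]
    if entities:
        kinds = sorted({e["entity"].split("-")[-1] for e in entities})
        raise Exception(f"⚠ Your prompt appears to contain PHI/PII ({', '.join(kinds)}) "
                        "and was not sent to the language model.")
    return body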

Example in a ChatGPT-like application:

Example of a user-facing ChatGPT-like AI application detecting and blocking the use of PHI

De-identification

With the scores and start/end offsets of the sensitive tokens identified, we can use that information to de-identify the provided text. The following script is a command-line utility that takes text input and returns a de-identified version of the same text. If no identifying information is detected, the text is returned unchanged.

import argparse
from transformers import pipeline

def detect(note, model, device):
    pipe = pipeline("token-classification", model=model, device=device)
    NER = []
    results = pipe(note)
    for res in results:
        if res['score'] > 0.1:
            NER.append(res)
    return NER

def mask_text(text, entities):
    entities_sorted = sorted(entities, key=lambda x: x['start'])
    masked_text = ""
    last_end = 0
    current_mask = None
    for entity in entities_sorted:
        start = entity['start']
        end = entity['end']
        entity_type = entity['entity']

        if current_mask is None:
            masked_text += text[last_end:start]
            current_mask = entity_type
        elif current_mask != entity_type or start > last_end:
            masked_text += f"[{current_mask}]"
            masked_text += text[last_end:start]
            current_mask = entity_type
        last_end = end

    if current_mask is not None:
        masked_text += f"[{current_mask}]"
    masked_text += text[last_end:]
    return masked_text

def main():
    parser = argparse.ArgumentParser(description="Detect PHI entities in medical notes")
    parser.add_argument('note', type=str, help='The medical note text to process for entity detection')
    parser.add_argument('--model', type=str, default='obi/deid_roberta_i2b2', help='The model to use for token classification')
    parser.add_argument('--device', type=str, default='cpu', choices=['cpu', 'cuda', 'mps'], help='The device to use for inference (cpu, cuda, or mps)')
    args = parser.parse_args()
    entities = detect(args.note, args.model, args.device)
    deidentified_text = mask_text(args.note, entities)
    print(deidentified_text)

if __name__ == '__main__':
    main()

For testing purposes, we are going to use the following fictional clinical note as input to the above script:

Consult Note Pt: Ulysses Ogrady MC #0937884 Date: 07/01/19 Williams Ct M OSCAR, JOHNNY Hyderabad, WI 62297 HISTORY OF PRESENT ILLNESS: The patient is a 77-year-old-woman with long standing hypertension who presented as a Walk-in to me at the Brigham Health Center on Friday. Recently had been started q.o.d. on Clonidine since 01/15/19 to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. Wilson to follow. SOCIAL HISTORY: Lives alone, has one daughter living in Nantucket. Is a non-smoker, and does not drink alcohol. HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. Wilson, was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By 07-22-19 the patient had better rate control and blood pressure control but remained in atrial fibrillation. On 08.03.19, the patient was felt to be medically stable.

When provided to our de-identification script as input, we get the following de-identified output:

Consult Note Pt: [B-PATIENT][I-PATIENT] [L-PATIENT] [U-HOSP] #[U-ID] Date: [U-DATE][L-DATE] [B-PATIENT] [I-PATIENT] [I-PATIENT] [I-PATIENT] [L-PATIENT] [U-LOC], [U-LOC] [U-LOC][L-LOC] HISTORY OF PRESENT ILLNESS: The patient is a [U-AGE]-year-old-woman with long standing hypertension who presented as a Walk-in to me at the [B-HOSP] [I-HOSP] [L-HOSP] on [U-DATE]. Recently had been started q.o.d. on Clonidine since [U-DATE][L-DATE] to taper off of the drug. Was told to start Zestril 20 mg. q.d. again. The patient was sent to the Unit for direct admission for cardioversion and anticoagulation, with the Cardiologist, Dr. [U-STAFF] to follow. SOCIAL HISTORY: Lives alone, has one daughter living in [U-LOC][L-LOC]. Is a non-smoker, and does not drink alcohol. HOSPITAL COURSE AND TREATMENT: During admission, the patient was seen by Cardiology, Dr. [U-STAFF], was started on IV Heparin, Sotalol 40 mg PO b.i.d. increased to 80 mg b.i.d., and had an echocardiogram. By [U-DATE]-[L-DATE] the patient had better rate control and blood pressure control but remained in atrial fibrillation. On [U-DATE].[U-DATE].[L-DATE], the patient was felt to be medically stable.

It correctly detected all the identifying elements of the fictional clinical note while retaining the remaining information, which still conveys the medically significant details of the note.

Rather than simply detecting and blocking use of PHI in ChatGPT-like applications, we can use this method to de-identify the provided text before sending it to the LLM. In the example below, I prompted the LLM to echo back the text I provided, and as shown, it received the de-identified version of the fictional clinical note.
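In middleware terms, this just means rewriting the message in the inlet instead of raising an exception. A hypothetical sketch, reusing the detect() and mask_text() functions from the script above (assumed here to be saved as deidentify.py; the module name is illustrative):

from typing import Optional
from deidentify import detect, mask_text   # hypothetical module: the de-identification script above

async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
    user_message = body["messages"][-1]["content"]
    entities = detect(user_message, "obi/deid_roberta_i2b2", "cpu")
    # Replace the prompt with the masked version before it leaves our infrastructure
    body["messages"][-1]["content"] = mask_text(user_message, entities)
    return body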

Example of de-identification in a ChatGPT-like application

While this works, the de-identified output is ugly: all the redaction placeholders remain in the text, which makes it hard to read. Since the LLM now only receives the redacted text, we can simply ask it to summarize the provided fictional clinical note with the redaction placeholders removed.

Prompt: "Summarize the following text with all the redacted text in square brackets removed: <clinical note goes here>"

We provided the chat application with the full, PHI-laden fictional clinical note, and it responded with a summary of the de-identified information:

Example of de-identified and summarized fictional clinical note

That looks like a good de-identified summary, likely better than equivalent systems or manual methods used in clinical or research environments (speculation on my part). Initially, I thought it was making up the “every other day”, “daily”, and “twice daily” details, as I didn’t see them in the fictional clinical note. After some investigation, however, I learned that those terms are the English equivalents of “q.o.d.”, “q.d.”, and “b.i.d.”, the Latin-based abbreviations that physicians use. Clearly, I’m not a medical professional, and LLMs are smarter than I am 🤣. With that resolved, I believe the summary accurately captures the information from the original fictional clinical note and makes it easier to read. This type of clear, de-identified summarization could be useful for clinical research.

Conclusion

In this article, I provided examples of how a safety layer could be added to generative AI applications by serving as middleware between user-facing applications and backend LLMs. In practice, I’ve implemented both the toxicity and PHI-blocking features in the customized ChatGPT-like platform I personally use.

The toxicity filter works very well with minimal to no false positives. It has been enabled for several months on the system I use daily, and I can’t recall ever encountering a false positive.

The PHI detector, while effective at detecting personally identifiable information such as names, dates, addresses, phone numbers, and medical IDs, became annoying over time, so I disabled it for my general-purpose ChatGPT-like use cases, such as summarization, coding assistance, and document Q&A. In those contexts it’s hard to avoid common entities like a person’s name (even a historical figure’s), dates, or locations. I think this method is best suited to specific use cases where data security is crucial, or to tasks like the de-identification and summarization of medical notes shown in this article, rather than to enforcing a company policy in a general-purpose chat application, where there will be far too many false positives.
