Sanitize Documents in the Browser with AI Tuned for PII
In the digital age, data is akin to currency, flowing through networks with the potential to unlock immense value or cause significant harm. As we navigate this landscape, the principle of ‘need to know’ becomes paramount in safeguarding sensitive information. It is not just about protecting data; it is about smartly managing its flow to prevent unnecessary exposure.
Enter the concept of clientless data sanitization at the edge: automatically sanitizing data in the user’s browser. By applying data sanitization and risk management protocols at the source, we ensure that only essential data traverses our networks, reducing the risk of spillage. Because a distilled AI word classification model runs in the browser itself, the client does not need to install anything to enable the functionality.
The encoder-only transformer architecture and the ease of fine-tuning make it possible to create models specific to PII, CUI, PHI, and similar categories. With the advent of technologies like ONNX and ONNX Runtime (ORT), AI models can now run directly in the browser, bringing powerful data sanitization capabilities to the client side. This means sensitive documents can be automatically sanitized by the very individuals responsible for their creation, so that only sanitized data is shared. This approach not only enhances security but also streamlines operations, because data is cleaned before it ever leaves the user's machine.
As we continue to push AI to the edge, document sanitization seemed like an ideal use case, so we put together this quantized PII Named Entity Recognition (NER) model, which provides the flexibility, accuracy, and speed to sanitize on the client side.
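To make the sanitization step concrete, here is a minimal sketch of how word-level NER predictions could be turned into a redacted document. It is an illustration under stated assumptions, not the model's actual post-processing: the `TokenPrediction` type, the `merge_spans` and `redact` helpers, and the `NAME` and `EMAIL` labels are all hypothetical, and the predictions are hard-coded rather than produced by a model. It is written in Python for brevity; the same logic ports directly to JavaScript for use in the browser.

```python
from dataclasses import dataclass


@dataclass
class TokenPrediction:
    start: int  # character offset where the token begins
    end: int    # character offset just past the token
    label: str  # BIO tag such as "B-EMAIL", "I-EMAIL", or "O" (hypothetical label set)


def merge_spans(predictions):
    """Merge a B- token and its following I- tokens into (start, end, type) spans."""
    spans = []
    inside = None  # entity type of the span currently being extended, if any
    for p in predictions:
        if p.label == "O":
            inside = None
            continue
        prefix, _, entity = p.label.partition("-")
        if prefix == "I" and inside == entity:
            # Continuation of the current entity: extend its end offset.
            start, _, _ = spans[-1]
            spans[-1] = (start, p.end, entity)
        else:
            # A B- tag (or a stray I- tag) starts a new span.
            spans.append((p.start, p.end, entity))
            inside = entity
    return spans


def redact(text, spans):
    """Replace each span with a [TYPE] placeholder, right to left so offsets stay valid."""
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text


# Hard-coded example predictions standing in for real model output.
text = "Contact Jane Doe at jane@example.com today."
predictions = [
    TokenPrediction(8, 12, "B-NAME"),
    TokenPrediction(13, 16, "I-NAME"),
    TokenPrediction(17, 19, "O"),
    TokenPrediction(20, 36, "B-EMAIL"),
]
print(redact(text, merge_spans(predictions)))
# prints: Contact [NAME] at [EMAIL] today.
```

Replacing entities with typed placeholders rather than deleting them keeps the document readable while still removing the sensitive values.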
PII AI Technologies
Ronathan PII Colab Notebook: Fine-tunes an NER model for PII recognition. The notebook creates a quantized model that can be used in a browser. It uses the Hugging Face Python libraries to fine-tune and export the model.
Hugging Face Dataset: ai4privacy/pii-masking-200k. I spent half an hour looking for PII datasets on Hugging Face, and there are larger ones. However, I was using a Google free tier, so I only trained on 60K.
DistilBERT: A smaller, faster, cheaper version of BERT with 40% fewer parameters. It is designed for on-device computation and intended for transfer learning.
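To see why a distilled and quantized model suits browser delivery, a rough back-of-the-envelope estimate helps. The sketch below assumes approximate published parameter counts (about 110 million for BERT-base and about 66 million for DistilBERT) and counts only weight storage, at 4 bytes per parameter for 32-bit floats versus 1 byte for 8-bit integers. Real exported model files will differ somewhat, so treat the numbers as an order-of-magnitude guide.

```python
BERT_BASE_PARAMS = 110_000_000   # approximate published figure (assumption)
DISTILBERT_PARAMS = 66_000_000   # approximate published figure (assumption)


def size_mb(params, bytes_per_param):
    """Estimated weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bytes_per_param / 1_000_000


reduction = 1 - DISTILBERT_PARAMS / BERT_BASE_PARAMS
print(f"Parameter reduction: {reduction:.0%}")                       # 40%
print(f"DistilBERT float32: {size_mb(DISTILBERT_PARAMS, 4):.0f} MB")  # 264 MB
print(f"DistilBERT int8:    {size_mb(DISTILBERT_PARAMS, 1):.0f} MB")  # 66 MB
```

Under these assumptions, 8-bit quantization cuts the download to roughly a quarter of the float32 size, which matters when the model has to be fetched by the browser.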
Upcoming Articles:
In the next article I will put up a Node.js or ASP.NET project to demonstrate using ORT to perform inference with the model in the browser.
This is the second article in a cluster about AI & NLP in the browser. The previous article, on semantic chunking, is HERE.