50M active parameters is impressively light. Is there a similarly light model on the prompt injection side? Most of the mainstream ones seem heavier.
I'm surprised nobody else has commented on this. This is a very straightforward and useful thing for a small locally runnable model to do.
> The model is available today under the Apache 2.0 license on Hugging Face and Github.
Bringing the Open back to OpenAI...
There are some interesting technical details in this release:
> Privacy Filter is a bidirectional token-classification model with span decoding. It begins from an autoregressive pretrained checkpoint and is then adapted into a token classifier over a fixed taxonomy of privacy labels. Instead of generating text token by token, it labels an input sequence in one pass and then decodes coherent spans with a constrained Viterbi procedure.
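The release doesn't spell out the decoding step, but "constrained Viterbi" over a BIO-style label set usually means: take per-token label scores from the classifier, forbid invalid transitions (an `I-X` tag can only follow `B-X` or `I-X`), and pick the highest-scoring valid path, then collapse it into spans. A minimal sketch, with a made-up two-type taxonomy rather than the model's actual privacy labels:

```python
import math

# Illustrative label set; the real Privacy Filter taxonomy is larger.
LABELS = ["O", "B-EMAIL", "I-EMAIL", "B-PHONE", "I-PHONE"]

def allowed(prev, cur):
    """BIO constraint: I-X may only follow B-X or I-X of the same type."""
    if cur.startswith("I-"):
        return prev is not None and prev[0] in "BI" and prev[2:] == cur[2:]
    return True

def viterbi(emissions):
    """emissions: one dict per token mapping label -> log-score.
    Returns the highest-scoring label path that satisfies the BIO constraints."""
    best = {l: (emissions[0][l], [l]) for l in LABELS if allowed(None, l)}
    for em in emissions[1:]:
        nxt = {}
        for cur in LABELS:
            cands = [(s + em[cur], path + [cur])
                     for prev, (s, path) in best.items() if allowed(prev, cur)]
            if cands:
                nxt[cur] = max(cands)
        best = nxt
    return max(best.values())[1]

def spans(labels):
    """Collapse a BIO label sequence into (start, end, type) spans."""
    out, start = [], None
    for i, l in enumerate(labels):
        if l.startswith("B-") or l == "O":
            if start is not None:
                out.append((start, i, labels[start][2:]))
                start = None
        if l.startswith("B-"):
            start = i
    if start is not None:
        out.append((start, len(labels), labels[start][2:]))
    return out
```

The constraint is what makes the output spans coherent: even if the raw argmax at some position is an orphan `I-` tag, the decoder is forced onto a valid path.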
> The released model has 1.5B total parameters with 50M active parameters.
> [To build it] we converted a pretrained language model into a bidirectional token classifier by replacing the language modeling head with a token-classification head and post-training it with a supervised classification objective.
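The conversion they describe is mechanically simple: keep the pretrained backbone, swap the vocabulary-sized LM head for a small classification head, and run without the causal mask so every token attends to full context. A toy PyTorch sketch of that surgery (the model, sizes, and label count here are all invented stand-ins, not the released checkpoint):

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, NUM_PRIVACY_LABELS = 1000, 64, 5

class TinyLM(nn.Module):
    """Stand-in for a pretrained autoregressive checkpoint."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)  # next-token prediction head

    def forward(self, ids, causal=True):
        h = self.embed(ids)
        mask = None
        if causal:  # autoregressive pretraining uses a causal mask
            mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.lm_head(self.backbone(h, mask=mask))

model = TinyLM()  # imagine loading pretrained weights here
# Replace the LM head with a token-classification head over the privacy taxonomy.
model.lm_head = nn.Linear(HIDDEN, NUM_PRIVACY_LABELS)

ids = torch.randint(0, VOCAB, (2, 16))
# Bidirectional: drop the causal mask so every token sees full context;
# post-training would then apply per-token cross-entropy (not shown).
logits = model(ids, causal=False)  # shape (batch, seq, NUM_PRIVACY_LABELS)
```

Labeling the whole sequence in one forward pass like this, instead of generating token by token, is where the speed comes from: cost is one encoder pass regardless of how many spans get flagged.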