Hacker News

OpenAI model for masking personally identifiable information (PII) in text

31 points by tanelpoder today at 12:14 AM | 8 comments

Comments

stratos123 today at 9:30 AM

There are some interesting technical details in this release:

> Privacy Filter is a bidirectional token-classification model with span decoding. It begins from an autoregressive pretrained checkpoint and is then adapted into a token classifier over a fixed taxonomy of privacy labels. Instead of generating text token by token, it labels an input sequence in one pass and then decodes coherent spans with a constrained Viterbi procedure.

> The released model has 1.5B total parameters with 50M active parameters.

> [To build it] we converted a pretrained language model into a bidirectional token classifier by replacing the language modeling head with a token-classification head and post-training it with a supervised classification objective.
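
For anyone unfamiliar with the "constrained Viterbi" step in the first quote, here is a minimal sketch of the general technique, not the released implementation. It assumes a BIO tagging scheme with a made-up label set and uses hard validity constraints only (an `I-X` label may only follow `B-X` or `I-X`); the quotes don't say which scheme or transition scores the actual model uses. The scores in the example are invented to show a case where picking the best label per token independently would give an invalid sequence.

```python
import math

# Hypothetical BIO label set; the real privacy taxonomy is not given in the thread.
LABELS = ["O", "B-NAME", "I-NAME", "B-EMAIL", "I-EMAIL"]


def allowed(prev: str, curr: str) -> bool:
    """Assumed BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(scores):
    """Return the highest-scoring label sequence that satisfies `allowed`.

    scores[t][k] is the per-token score (e.g. a log-probability) of
    LABELS[k] at position t, as produced by a token classifier in one pass.
    """
    if not scores:
        return []
    n = len(LABELS)
    # A sequence may not start inside a span, so I- labels get -inf at t = 0.
    best = [
        -math.inf if LABELS[k].startswith("I-") else scores[0][k]
        for k in range(n)
    ]
    backptrs = []
    for t in range(1, len(scores)):
        new_best, ptrs = [], []
        for k in range(n):
            # Best-scoring predecessor among those allowed before label k.
            prev_score, prev_j = max(
                (best[j], j) for j in range(n) if allowed(LABELS[j], LABELS[k])
            )
            new_best.append(prev_score + scores[t][k])
            ptrs.append(prev_j)
        best = new_best
        backptrs.append(ptrs)
    # Backtrack from the best final label.
    k = max(range(n), key=lambda i: best[i])
    path = [k]
    for ptrs in reversed(backptrs):
        k = ptrs[k]
        path.append(k)
    path.reverse()
    return [LABELS[i] for i in path]


# Invented scores for three tokens, e.g. "Alice Smith wrote".
# Per-token argmax would give O, I-NAME, O, which is not a valid BIO sequence.
example_scores = [
    [-0.6, -0.9, -3.0, -5.0, -5.0],
    [-2.0, -1.5, -0.3, -5.0, -5.0],
    [-0.1, -3.0, -3.0, -5.0, -5.0],
]
print(constrained_viterbi(example_scores))  # ['B-NAME', 'I-NAME', 'O']
```

The point of decoding this way is that the model still labels every token in a single forward pass, and the dynamic program afterwards only has to repair span boundaries, which is cheap compared with generating text token by token.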

Havoc today at 9:38 AM

50M effective parameters is impressively light. Is there a similarly light model on the prompt injection side? Most of the mainstream ones seem heavier.

hiAndrewQuinn today at 2:39 AM

I'm surprised nobody else has commented on this. This is a very straightforward and useful thing for a small locally runnable model to do.

7777777phil today at 6:14 AM

> The model is available today under the Apache 2.0 license on Hugging Face and GitHub.

Bringing back the Open to OpenAI..

y0eswddl today at 3:37 AM

[flagged]
