Hacker News

plaidthunder today at 6:15 PM

It seems like there's an opportunity to embed identity information into tokens themselves, the way we embed sequence information. The trouble is... it's quite a challenge to train. Sequence is easy to derive for any corpus of data, but identity is not.

https://usize.github.io/blog/2026/april/why-no-ai-coworkers....

> In similar fashion to how sequence information is embedded within input tensors, an approach called “Instructional Segment Embedding” [2] adds a parallel embedding channel for identity information. This gives models real awareness of provenance. And it works. But they only tested three fixed categories: system, user, data.

Interesting paper that touches on the idea here: https://arxiv.org/abs/2410.09102
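
To make the "parallel embedding channel" idea concrete, here is a rough sketch of how I read it, not the paper's actual implementation. Each token's input vector is the sum of a token embedding, a position embedding, and a segment (identity) embedding. The table sizes, the random initialization, and the names `SEGMENTS` and `embed` are all made up for illustration; in a real model the tables would be learned.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
VOCAB_SIZE, MAX_LEN, DIM = 16, 8, 4

# The three fixed categories mentioned in the quoted post.
SEGMENTS = {"system": 0, "user": 1, "data": 2}


def make_table(rows, dim, rng):
    """Build a rows x dim table of random values (stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
token_table = make_table(VOCAB_SIZE, DIM, rng)
position_table = make_table(MAX_LEN, DIM, rng)
segment_table = make_table(len(SEGMENTS), DIM, rng)


def embed(token_ids, segment_names):
    """Return one vector per token: token + position + segment embedding."""
    if len(token_ids) != len(segment_names):
        raise ValueError("each token needs exactly one segment label")
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_names)):
        seg_row = segment_table[SEGMENTS[seg]]
        vectors.append(
            [t + p + s for t, p, s in zip(token_table[tok], position_table[pos], seg_row)]
        )
    return vectors
```

The same token at the same position gets a different input vector depending on who it is attributed to, which is the whole point. It also shows the training problem: position comes for free from `enumerate`, but the segment labels have to be supplied from somewhere.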


Replies

echelon today at 6:29 PM

Could you assign certain subject matters a score in the training data, construct a unified token space that contains these rankings, and then mark conversations as "dirty" if they veer into that subject matter?
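
Something like this toy sketch is what I have in mind, assuming the topic of each turn is already known (which is the hard part and is not shown). The topic names, scores, threshold, and the `mark_dirty` helper are all hypothetical.

```python
# Hypothetical per-topic sensitivity scores; real ones would come from labeled training data.
TOPIC_SCORES = {"credentials": 0.9, "payments": 0.7, "weather": 0.0}
DIRTY_THRESHOLD = 0.5  # assumed cutoff, not derived from anything


def mark_dirty(turn_topics, scores=TOPIC_SCORES, threshold=DIRTY_THRESHOLD):
    """Return one flag per turn; once a turn crosses the threshold, later turns stay dirty."""
    dirty = False
    flags = []
    for topic in turn_topics:
        # Unknown topics default to a score of 0.0 in this sketch.
        if scores.get(topic, 0.0) >= threshold:
            dirty = True
        flags.append(dirty)
    return flags
```

With these made-up scores, `mark_dirty(["weather", "credentials", "weather"])` gives `[False, True, True]`: the flag is sticky, so a conversation does not become clean again by changing the subject.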