> embedding a pre-trained KGE model into a transformer model
Do you have any good pointers (literature, code, etc.) on the mechanics of this?
We also did something similar in our NTULM paper at Twitter https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX
We used it in non-generative language models like BERT, but it should help with generative models as well.
Check out PyKEEN [0] and go wild. I like to train a bunch of random KGE models and "overfit" them to the extreme; in my mind, overfitting is the point for this task, since you want dense, compressed knowledge.

Then take an existing pretrained (but small) LLM (like tinyllama or smaller, or just use whatever karpathy repo is most fun at the moment and train some gpt2 equivalent) and resize its input and output embeddings. Resizing the input is only necessary if you're adding extra metadata on input, but make sure you untie the input/output weights either way.

To wire the KGE in, you can add a linear layer extension to the transformer blocks, pass it up as some sort of residual, etc. - honestly, just find a way to shove it in: detach the KGE from the computation graph and put something learnable between it and wherever you're connecting it, like a couple of linear layers and a ReLU.

The output side is more important: you can have some indicator logit(s) to determine whether to "read" from the detached graph or sample the outputs of the LLM. Or just always do both and interpret it. Rough sketches of each piece below.
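For the PyKEEN side, something along these lines (a minimal sketch; the dataset, model, and hyperparameters are placeholders, and the exact representation API differs a bit between PyKEEN versions):

```python
from pykeen.pipeline import pipeline

# Deliberately "overfit" a small KGE model: generous embedding dim, lots of
# epochs, no early stopping. Dataset/model/dims here are just placeholders.
result = pipeline(
    dataset="Nations",
    model="TransE",
    model_kwargs=dict(embedding_dim=256),
    training_kwargs=dict(num_epochs=2000),
    random_seed=0,
)

kge_model = result.model
entity_to_id = result.training.entity_to_id  # entity label -> row index

# Pull out the entity table as a plain tensor (detached, so it never gets
# gradients once it's bolted onto the LLM). Recent PyKEEN exposes this via
# entity_representations; older releases name it differently.
entity_vectors = kge_model.entity_representations[0](indices=None).detach().cpu()
print(entity_vectors.shape)  # (num_entities, 256)
```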
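For resizing and untying the LLM's embeddings, roughly this (using HuggingFace transformers as an example; the checkpoint and the special token are arbitrary, and some checkpoints already ship with untied weights):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any small causal LM works
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# Only needed if you're feeding extra metadata/markers on the input side.
tokenizer.add_special_tokens({"additional_special_tokens": ["<kg>"]})
model.resize_token_embeddings(len(tokenizer))

# Untie input/output embeddings so the output head can drift away from the
# input table. (Check the config: some checkpoints are already untied.)
model.config.tie_word_embeddings = False
old_head = model.get_output_embeddings()
new_head = nn.Linear(old_head.in_features, old_head.out_features, bias=False)
new_head.weight.data.copy_(old_head.weight.data)
model.set_output_embeddings(new_head)
```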
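And for the "shove it in" part, one possible shape of the learnable bridge plus the output-side indicator logit (plain PyTorch; the class name and the exact wiring point are made up, the only real constraints are that the KGE table stays frozen and the bridge is trainable):

```python
import torch
import torch.nn as nn

class KGEBridge(nn.Module):
    """Frozen KGE lookup + a small learnable bridge into the LLM's hidden states."""

    def __init__(self, entity_vectors: torch.Tensor, d_model: int):
        super().__init__()
        # Detached/frozen KGE table: (num_entities, kge_dim), never updated.
        self.kge = nn.Embedding.from_pretrained(entity_vectors, freeze=True)
        kge_dim = entity_vectors.shape[1]
        # "Something learnable between it and wherever you're connecting it":
        # just a couple of linear layers and a ReLU.
        self.bridge = nn.Sequential(
            nn.Linear(kge_dim, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Output-side indicator logit: per position, read from the graph or
        # fall back to sampling the LLM's own distribution.
        self.read_gate = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor, entity_ids: torch.Tensor):
        # hidden_states: (batch, seq, d_model), entity_ids: (batch, seq)
        kge_vecs = self.kge(entity_ids)                    # frozen lookup
        fused = hidden_states + self.bridge(kge_vecs)      # residual-style injection
        read_prob = torch.sigmoid(self.read_gate(fused))   # (batch, seq, 1)
        return fused, read_prob
```

At decode time you can mix the two distributions with the gate, e.g. p = read_prob * p_graph + (1 - read_prob) * p_lm (pointer-network style), or just always compute both and interpret read_prob after the fact.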
[0] https://pykeen.readthedocs.io/en/stable/index.html