Hacker News

tricorn · today at 5:46 AM

Don't optimize the language to fit the tokens, optimize the tokens to fit the language. Tokenization is just a means of compressing the text: use a large corpus of code in the target language to drive the tokenizer, then do the training using those tokens. More important is having a language where the model can make valid predictions about what effective code will look like. Models are "good" at Python because they see so much of it. To determine which language is most appropriate for an AI to work with, you'd need to train multiple models, each with a tokenizer optimized for one language and training data targeted at that language.

One language I've had good success with, despite models having seen relatively little of it in training, is Tcl/Tk. The language is essentially a wordy version of Lisp (despite Stallman's disdain for it), so it is extremely introspective, with the ability to modify the language from within itself. It also has a very robust, extensible sandbox, and it is reasonably efficient for an interpreted language.

I've written a scaffold that uses Tcl as the sole tool-calling mechanism, and despite a lot of training pulling the model toward Python and JSON, it does fairly well. Unfortunately I'm limited in the models I can use because I have an old 8GB GPU, but I was surprised at how well it manages the Tcl sandbox with just a relatively small system prompt. Tcl is a very regular language with a very predictable structure, and it seems ideal for an LLM to use for tool calling, note taking, context trimming, delegation, etc. I haven't been working on this long, but it is close to the point where the model will be able to start extending its own environment (with anything potentially dangerous requiring human intervention).
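
As a rough illustration of the introspection I mean (the command names here are made up for the example, not from my scaffold), you can inspect, rename, and wrap a command from inside the language itself:

    # Define a command, then inspect and redefine it at runtime.
    proc greet {name} { return "hello, $name" }
    puts [info body greet]      ;# read the proc's own source
    rename greet greet_orig     ;# move the original aside
    proc greet {name} {         ;# redefine it, wrapping the original
        return "[greet_orig $name]!"
    }
    puts [greet world]          ;# -> hello, world!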
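
And a minimal sketch of the sandboxed tool-calling idea, using a safe interp with a single aliased command (add_note and the tool surface are just for illustration, not my actual scaffold):

    # Create a safe interpreter: no file access, exec, sockets, etc.
    set sandbox [interp create -safe]

    # Expose one "tool" by aliasing a host command into the sandbox.
    proc add_note {text} {
        # A real scaffold would append this to the agent's note store.
        puts "NOTE: $text"
    }
    interp alias $sandbox add_note {} add_note

    # Evaluate model-generated Tcl inside the sandbox; errors are
    # caught instead of touching the host interpreter.
    set modelOutput {add_note "user prefers Tcl examples"}
    if {[catch {interp eval $sandbox $modelOutput} result]} {
        puts "sandbox error: $result"
    }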