lefra • today at 7:05 AM • 1 reply

A million alternatives is peanuts. Restricting the search space to text files with 37 possible symbols (letters, numbers, space), a million different files can be generated with just 4 symbols.

A trillion is 8 symbols. You still haven't reached the end of your first import statement.

I just took a random source file on my computer. It has about 8000 characters. With the same 37-symbol alphabet, the number of possible 8000-character files has about 12,500 digits.
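The arithmetic above can be checked with a short sketch. It assumes the 37-symbol alphabet from the comment; the function names are made up for illustration. The digit count uses a logarithm because printing the full number is impractical (recent Python versions also refuse to convert integers that long to strings by default).

```python
import math

ALPHABET_SIZE = 37  # 26 letters + 10 digits + space, as in the comment


def symbols_needed(target: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Smallest length n such that alphabet**n >= target (exact integer math)."""
    n = 0
    count = 1
    while count < target:
        count *= alphabet
        n += 1
    return n


def digits_in_file_count(length: int, alphabet: int = ALPHABET_SIZE) -> int:
    """Number of decimal digits in alphabet**length, via log10."""
    return math.floor(length * math.log10(alphabet)) + 1


print(symbols_needed(10**6))       # 4  (37**4 = 1,874,161)
print(symbols_needed(10**12))      # 8  (37**8 is about 3.5 trillion)
print(digits_in_file_count(8000))  # 12546
```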

At this point, restricting the search space to syntactically valid programs (how do you even randomly generate that?) won't make a difference.

Replies

johndough • today at 10:52 AM

> restricting the search space to syntactically valid programs (how do you even randomly generate that?)

By using a grammar. Here is an example of how to generate only valid JSON with llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
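To illustrate the general idea of generating from a grammar, here is a minimal sketch that expands a toy grammar at random. It is not the llama.cpp grammar linked above, and it does not constrain an LLM's sampling; the grammar (a small JSON subset), the `generate` function, and the depth limit are all assumptions made for this example. Every string it produces parses as JSON.

```python
import json
import random

# Hypothetical toy grammar for a small JSON subset.
# Keys are nonterminals; each maps to a list of alternatives, and each
# alternative is a sequence of symbols. Symbols that are not keys are literals.
GRAMMAR = {
    "value": [["object"], ["array"], ["string"], ["number"],
              ["true"], ["false"], ["null"]],
    "object": [["{", "}"], ["{", "members", "}"]],
    "members": [["pair"], ["pair", ",", "members"]],
    "pair": [["string", ":", "value"]],
    "array": [["[", "]"], ["[", "elements", "]"]],
    "elements": [["value"], ["value", ",", "elements"]],
    "string": [['"', "chars", '"']],
    "chars": [[], ["char", "chars"]],
    "char": [[c] for c in "abc "],
    "number": [["0"], ["nonzero"], ["nonzero", "digit"]],
    "nonzero": [[d] for d in "123456789"],
    "digit": [[d] for d in "0123456789"],
}


def generate(symbol: str, rng: random.Random,
             depth: int = 0, max_depth: int = 8) -> str:
    """Expand `symbol` by picking random alternatives until only literals remain."""
    if symbol not in GRAMMAR:
        return symbol  # literal text
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest alternative so expansion ends.
        chosen = min(alternatives, key=len)
    else:
        chosen = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in chosen)


rng = random.Random(0)
for _ in range(5):
    text = generate("value", rng)
    json.loads(text)  # raises ValueError if the text were not valid JSON
    print(text)
```

A real constrained decoder works differently: it masks out tokens that would violate the grammar at each sampling step, rather than expanding rules at random.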

> A trillion is 8 symbols. You still haven't reached the end of your first import statement.

Since LLMs use tokens from a vocabulary instead of characters, the number is likely somewhere in the lower billions for the first import statement.
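One way to arrive at a figure like this, as a rough sketch: with a vocabulary of V tokens, there are V**n uniformly possible sequences of n tokens. The vocabulary size of 32,000 and the two-token statement length below are assumptions chosen for illustration, not numbers from the comment.

```python
def uniform_sequences(vocab_size: int, num_tokens: int) -> int:
    """Number of distinct token sequences of a given length, all equally likely."""
    return vocab_size ** num_tokens


VOCAB_SIZE = 32_000  # assumed vocabulary size, for illustration only

for n in (1, 2, 3):
    print(n, f"{uniform_sequences(VOCAB_SIZE, n):,}")
# Two tokens already give 1,024,000,000 sequences, about a billion.
```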

But of course, LLMs do not sample from a uniform random distribution, so there are even fewer likely possibilities.
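A hedged way to quantify "fewer likely possibilities" is perplexity: the exponential of a distribution's entropy, which equals the number of outcomes of a uniform distribution with the same entropy. The two distributions below are made up for illustration and do not come from any real model.

```python
import math


def effective_choices(probs):
    """Perplexity exp(H): size of a uniform distribution with the same entropy."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


# Hypothetical next-token distributions over 1000 candidates.
uniform = [1 / 1000] * 1000            # every candidate equally likely
peaked = [0.9] + [0.1 / 999] * 999     # one candidate takes 90% of the mass

print(effective_choices(uniform))  # close to 1000
print(effective_choices(peaked))   # fewer than 3 effective choices
```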
