This would be more interesting if it were generalized. With an exact hash, even a one-character difference results in a miss.
If it could analyze my blog and then find people with similar ideas, that would be incredibly useful.
True! That would be a more powerful approach. Here I kept it quite basic since I was not very familiar with the tooling. I do apply lowercasing of text + some whitespace stripping in order to increase the number of collisions a bit.
Edit: any other "quick hacks" to increase the number of collisions are welcome :)
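For illustration, here is a minimal sketch of that kind of normalize-then-hash step (Python, with made-up helper names, and SHA-256 standing in for whatever hash the project actually uses):

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, and collapse runs of whitespace
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def thought_hash(text: str) -> str:
    # Hash the normalized form so trivial differences still collide
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

print(thought_hash("Hello, World!") == thought_hash("hello   world"))  # True
```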
That is a problem. Also, a long paragraph would likely never hash the same because of a single comma or capital letter, so the builder of this would need to cap the length of a thought and make all thoughts lowercase with punctuation stripped.
To be really honest, they could take a look at bao. (I used it for an eerily similar project to this one, though it's great that this is getting traction! I do feel like the Scuttlebutt protocol might be a good fit for most use cases as well.)
Bao lets two pieces of text share a hash over the first n units of content, so you can loop over each successive word and check how long their common hash prefix is; that length becomes the measure of similarity.
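A rough sketch of that prefix idea, leaving bao itself aside and just rehashing each word-prefix with SHA-256 (bao/BLAKE3's tree structure would avoid rehashing from scratch; the function names here are made up for illustration):

```python
import hashlib

def prefix_hashes(text: str) -> list[str]:
    # One hash per word-prefix: hash(words[:1]), hash(words[:2]), ...
    words = text.lower().split()
    return [
        hashlib.sha256(" ".join(words[: i + 1]).encode("utf-8")).hexdigest()
        for i in range(len(words))
    ]

def common_prefix_length(a: str, b: str) -> int:
    # Count leading words whose prefix hashes agree; use that as the similarity score
    matches = 0
    for ha, hb in zip(prefix_hashes(a), prefix_hashes(b)):
        if ha != hb:
            break
        matches += 1
    return matches

print(common_prefix_length(
    "hashing is a lossy way to compare thoughts quickly",
    "hashing is a lossy way to compare ideas quickly",
))  # 7
```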
One issue might be when a word changes at the start and the rest is similar, but I feel like bao could/does support that as well. My knowledge of bao is pretty rusty (get the pun? It's written in Rust), but I am sure this idea is technically possible & I hope someone experienced in the field can say more about it.
https://github.com/oconnor663/bao — oconnor663's bao talks on YouTube are very good, worth a watch & worth a star (though they do mention it's a little less formally analyzed cryptographically, iirc, but it's still pretty robust imo).