Correct me if I'm wrong, but looking at [1] it seems to be specifically using function definiti...

klibertp • today at 6:11 PM • 0 replies • view on HN

Correct me if I'm wrong, but looking at [1] it seems to be specifically using function definitions (I'm guessing this works with functions, methods, and lambdas (the "<unknown>" part)?) as units of repetition. If yes, that's fine, but I would seriously consider adding some settings to allow the user to control that granularity. Sometimes, the repeated code is a conditional branch within larger functions (i.e., "every else:" or "every except Ex:" looks the same). If the functions are large enough, the dissimilarity of the rest of the body would (probably?) cause such things to be missed.

I would also consider - perhaps as a separate pass, with scoring set differently - to analyze comments (especially docstrings in Python). If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstring to be structured similarly (at least more than normal code), so some tweaking would be needed.

Finally: very nice project, congrats! :)

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...

alt Hacker News