logoalt Hacker News

xlayntoday at 1:37 AM0 repliesview on HN

I published the results for devstral... results folder of the github https://github.com/alainnothere/llm-circuit-finder/tree/main...

I'm using the following configuration --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp I did also try humaneval but something in the harness is missing and failed...

notice that I'm running 50 tests for each task, mostly because of time limitation as it takes like two hours to validate the run for the base model and the modified one.

I'll also try to publish the results of the small tests harness when I'm testing the multiple layers configurations, for reference this is phi-4-Q6_K.gguf, still running, I'm now giving more importance to the Reason factor, the reason factor comes from running a small subset of all the problems in the task config above

Initially I tried the approach of the highest math/eq but in resulted in models that were less capable overall with the exception of math, and math like in the original research is basically how good was the model at giving you the answer of a really though question, say the cubic root of some really large number... but that didn't translate to the model being better at other tasks...

  Config  | Lyr | Math   | EQ    | Reas   | Math Δ  | EQ Δ  | Reas Δ  | Comb Δ
  --------|-----|--------|-------|--------|---------|-------|---------|-------
  BASE    |   0 | 0.7405 | 94.49 | 94.12% |     --- |   --- |     --- |    ---
  (6,9)   |   3 | 0.7806 | 95.70 | 94.12% | +0.0401 | +1.21 |  +0.00% |  +1.21
  (9,12)  |   3 | 0.7247 | 95.04 | 94.12% | -0.0158 | +0.55 |  +0.00% |  +0.55
  (12,15) |   3 | 0.7258 | 94.14 | 88.24% | -0.0147 | -0.35 |  -5.88% |  -6.23
  (15,18) |   3 | 0.7493 | 95.74 | 88.24% | +0.0088 | +1.25 |  -5.88% |  -4.63
  (18,21) |   3 | 0.7204 | 93.40 | 94.12% | -0.0201 | -1.09 |  +0.00% |  -1.09
  (21,24) |   3 | 0.7107 | 92.97 | 88.24% | -0.0298 | -1.52 |  -5.88% |  -7.41
  (24,27) |   3 | 0.6487 | 95.27 | 88.24% | -0.0918 | +0.78 |  -5.88% |  -5.10
  (27,30) |   3 | 0.7180 | 94.65 | 88.24% | -0.0225 | +0.16 |  -5.88% |  -5.73
  (30,33) |   3 | 0.7139 | 94.02 | 94.12% | -0.0266 | -0.47 |  +0.00% |  -0.47
  (33,36) |   3 | 0.7104 | 94.53 | 94.12% | -0.0301 | +0.04 |  +0.00% |  +0.04
  (36,39) |   3 | 0.7017 | 94.69 | 94.12% | -0.0388 | +0.20 |  +0.00% |  +0.20
  (6,10)  |   4 | 0.8125 | 96.37 | 88.24% | +0.0720 | +1.88 |  -5.88% |  -4.01
  (9,13)  |   4 | 0.7598 | 95.08 | 94.12% | +0.0193 | +0.59 |  +0.00% |  +0.59
  (12,16) |   4 | 0.7482 | 93.71 | 88.24% | +0.0076 | -0.78 |  -5.88% |  -6.66
  (15,19) |   4 | 0.7617 | 95.16 | 82.35% | +0.0212 | +0.66 | -11.76% | -11.10
  (18,22) |   4 | 0.6902 | 92.27 | 88.24% | -0.0504 | -2.23 |  -5.88% |  -8.11
  (21,25) |   4 | 0.7288 | 94.10 | 88.24% | -0.0117 | -0.39 |  -5.88% |  -6.27
  (24,28) |   4 | 0.6823 | 94.57 | 88.24% | -0.0583 | +0.08 |  -5.88% |  -5.80
  (27,31) |   4 | 0.7224 | 94.41 | 82.35% | -0.0181 | -0.08 | -11.76% | -11.84
  (30,34) |   4 | 0.7070 | 94.73 | 94.12% | -0.0335 | +0.23 |  +0.00% |  +0.23
  (33,37) |   4 | 0.7009 | 94.38 |100.00% | -0.0396 | -0.12 |  +5.88% |  +5.77
  (36,40) |   4 | 0.7057 | 94.84 | 88.24% | -0.0348 | +0.35 |  -5.88% |  -5.53
  (6,11)  |   5 | 0.8168 | 95.62 |100.00% | +0.0762 | +1.13 |  +5.88% |  +7.02
  (9,14)  |   5 | 0.7245 | 95.23 | 88.24% | -0.0160 | +0.74 |  -5.88% |  -5.14
  (12,17) |   5 | 0.7825 | 94.88 | 88.24% | +0.0420 | +0.39 |  -5.88% |  -5.49
  (15,20) |   5 | 0.7832 | 95.86 | 88.24% | +0.0427 | +1.37 |  -5.88% |  -4.52
  (18,23) |   5 | 0.7208 | 92.42 | 88.24% | -0.0197 | -2.07 |  -5.88% |  -7.95
  (21,26) |   5 | 0.7055 | 92.89 | 88.24% | -0.0350 | -1.60 |  -5.88% |  -7.48
  (24,29) |   5 | 0.5825 | 95.04 | 94.12% | -0.1580 | +0.55 |  +0.00% |  +0.55
  (27,32) |   5 | 0.7088 | 94.18 | 88.24% | -0.0317 | -0.31 |  -5.88% |  -6.19
  (30,35) |   5 | 0.6787 | 94.69 | 88.24% | -0.0618 | +0.20 |  -5.88% |  -5.69
  (33,38) |   5 | 0.6650 | 94.96 | 88.24% | -0.0755 | +0.47 |  -5.88% |  -5.41
  (6,12)  |   6 | 0.7692 | 95.39 | 94.12% | +0.0287 | +0.90 |  +0.00% |  +0.90
  (9,15)  |   6 | 0.7405 | 94.65 | 94.12% | -0.0000 | +0.16 |  +0.00% |  +0.16
  (12,18) |   6 | 0.7582 | 94.57 | 88.24% | +0.0177 | +0.08 |  -5.88% |  -5.80
  (15,21) |   6 | 0.7828 | 93.52 | 88.24% | +0.0423 | -0.98 |  -5.88% |  -6.86
  (18,24) |   6 | 0.7308 | 92.93 | 94.12% | -0.0097 | -1.56 |  +0.00% |  -1.56
  (21,27) |   6 | 0.6791 | 92.54 | 82.35% | -0.0615 | -1.95 | -11.76% | -13.72