So the sane solution here is just leaving unreliable stuff to humans and reliable to machines. Especially so when human wellbeing and freedom are at the stake.
To define the line between the two, calculate the percentage of cases when mainstream CPUs return anything but integer 4 after addition of integer 2 and integer 2, and use that as the threshold to define "reliable".