That problem went viral weeks ago so is no longer a valid test. At the time it was consistently tripping up all the SOTA models at least 50% of the time (you also have to use a sample > 1 given huge variation from even the exact same wording for each attempt).
The large hosted model providers always "fix" these issues as best as they can after they become popular. It's a consistent pattern repeated many times now, benefitting from this exact scenario seemingly "debunking" it well after the fact. Often the original behavior can be replicated after finding sufficient distance of modified wording/numbers/etc from the original prompt.
That problem went viral weeks ago so is no longer a valid test. At the time it was consistently tripping up all the SOTA models at least 50% of the time (you also have to use a sample > 1 given huge variation from even the exact same wording for each attempt).
The large hosted model providers always "fix" these issues as best as they can after they become popular. It's a consistent pattern repeated many times now, benefitting from this exact scenario seemingly "debunking" it well after the fact. Often the original behavior can be replicated after finding sufficient distance of modified wording/numbers/etc from the original prompt.