Thirty years on, new findings about the Pentium FDIV bug are still turning up: "Intel's $475 million error: the silicon behind the Pentium division bug".
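A quick recap of what happened back then: Intel introduced a new division algorithm in the Pentium that relied on a lookup table, but a faulty script left five of the table's entries missing, which produced wrong results: "The Truth Behind the Pentium Bug (1995)".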
An engineer prepared the lookup table on a computer and wrote a script in C to download it into a PLA (programmable logic array) for inclusion in the Pentium's FPU. Unfortunately, due to an error in the script, five of the 1066 table entries were not downloaded. To compound this mistake, nobody checked the PLA to verify the table was copied correctly.
Every report at the time said there were 5 bad entries, and that Intel then filled them in.
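But thirty years later, Ken Shirriff examined the CPU die itself at high resolution and found a number of interesting things. The first is that 16 entries were missing, not 5. Eleven of them never trigger the bug, though, so anyone working backwards from software behavior would indeed have seen only 5: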
However, my analysis shows that 16 entries were omitted due to a mathematical mistake in the definition of the lookup table. Five of the missing entries trigger the bug— also called the FDIV bug after the floating-point division instruction "FDIV"—while 11 of the missing entries have no effect.
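The second is how it was fixed. Back then everyone assumed Intel had simply put the missing values back, but in fact Intel did more than correct the values: it also refactored this part of the design, and the circuit ended up with fewer transistors than before: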
However, the updated PLA (below) shows something entirely different. The updated PLA is exactly the same size as the original PLA. However, about 1/3 of the terms were removed from the PLA, eliminating hundreds of transistors. Only 74 of the PLA's 120 rows are used, and the rest are left empty. (The original PLA had 8 empty rows.) How could removing terms from the PLA fix the problem?
The explanation is that Intel didn't just fill in the five missing table entries with the correct value of 2. Instead, Intel filled all the unused table entries with 2, as shown below. This has two effects. First, it eliminates any possibility of hitting a mistakenly-empty entry. Second, it makes the PLA equations much simpler.
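To make the "fill every unused entry with 2" idea concrete, here is a minimal toy sketch in Python. It is not the Pentium's real table: the table size, the `LOWER`, `TRUE_UPPER`, and `BUGGY_UPPER` bounds, and the helper names `build` and `wrong_cells` are all made up for illustration. It only models the situation where a miscalculated upper bound leaves some needed cells empty (0).

```python
# Toy model only: a tiny made-up table, not the Pentium's real lookup table.
ROWS, COLS = 8, 4
LOWER = [3, 3, 4, 4]        # hypothetical: first row of the "+2" region in each column
TRUE_UPPER = [5, 6, 6, 7]   # hypothetical: last row that really needs +2
BUGGY_UPPER = [5, 5, 6, 6]  # hypothetical: upper bound as computed with a mistake


def build(upper, fill_unused):
    """Build the table; cells above `upper` are treated as unused."""
    table = {}
    for c in range(COLS):
        for r in range(ROWS):
            if r < LOWER[c]:
                table[(r, c)] = 1  # stand-in for the other, lower-valued regions
            elif r <= upper[c]:
                table[(r, c)] = 2
            else:
                # Unused cells: left empty (0), or filled with 2 as in the fix.
                table[(r, c)] = 2 if fill_unused else 0
    return table


def wrong_cells(table):
    """Cells that genuinely need +2 but hold something else."""
    return [
        (r, c)
        for c in range(COLS)
        for r in range(LOWER[c], TRUE_UPPER[c] + 1)
        if table[(r, c)] != 2
    ]


original = build(BUGGY_UPPER, fill_unused=False)
fixed = build(BUGGY_UPPER, fill_unused=True)

print(wrong_cells(original))  # cells left empty by the buggy bound
print(wrong_cells(fixed))     # empty list: no needed cell can be missing
```

In this toy, the buggy bound leaves two needed cells empty in `original`, while `fixed` has none, even though it was built from the same buggy bound. Filling the unused cells also means "is this cell +2?" reduces to the single comparison `r >= LOWER[c]`, which is the flavor of simplification the quote describes for the PLA equations.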
A small case of reverse engineering setting the record straight, thirty years after the fact...