Abhinav Upadhyay’s Post

Abhinav Upadhyay, Senior Software Engineer

As I continue to learn about the x86 microarchitecture, here are some of the significant changes Intel introduced in their Intel Core processors (in 2010) to improve performance.

- 2nd-level branch predictor: aids applications with a large code footprint, where the existing predictor might not have been sufficient.
- Macrofusion: a test/cmp instruction can be fused with the conditional jump that follows it, reducing the number of ops in the pipeline and the latency of such code (sketch 1 below).
- Loop stream detector (LSD): loops with a short body but a large number of iterations execute the same instructions continuously. Instead of fetching and decoding them repeatedly, the LSD caches the decoded loop instructions in a buffer. This avoids the use of the branch predictor and the fetch/decode units, and saves power (sketch 2 below).
- Larger out-of-order window: the window of instructions that the out-of-order execution engine scans grew from 96 to 128, for more instruction throughput.
- Reduced branch misprediction penalty: when the predictor makes a mistake, the pipeline needs to be flushed, the wrong-path instructions retired, and instructions from the correct branch fetched. In previous generations, fetching the correct branch's instructions could not begin until the wrong-path instructions were retired. With these Core generation processors that wait is not required: instructions from the correct path can start to be fetched and allocated as soon as the miss is detected, although the flush is still needed (sketch 3 below).
- 2nd-level TLB: a unified second-level TLB helps applications with large code and data sizes, such as databases.
- 1GB page sizes: the 32nm processors can also support pages of 1GB. (I was not aware of this; sketch 4 below.)
- Unaligned memory access optimization: the 16-byte SSE vector load/store instructions used to have increased latency when the address was not 16-byte aligned. With these processors, an unaligned access has the same latency as an aligned one (sketch 5 below).
- Reduced latency of cache-line-split memory accesses: when the data being accessed straddles two cache lines, the latency is now lower (sketch 6 below).
- Fewer pipeline stalls from LOCK/XCHG instructions: these are required for atomic memory operations. Previously the memory bus would be locked until the locking instruction finished. Now, younger load ops are allowed to execute as long as they do not overlap the memory under the lock, which can improve the performance of multithreaded code (sketch 7 below).
- No resource sharing when only one thread is running on a Hyper-Threading-enabled core.

------------------------------

All of this is from 2010; things might be a bit different in present-generation processors.
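To make a few of these concrete, here are some minimal sketches. They are my own illustrations, and all function names and data are made up. Sketch 1, macrofusion: a loop whose control typically compiles (e.g. gcc -O2 on x86-64; the exact assembly depends on compiler and flags) to an adjacent cmp + conditional-jump pair, the pattern the decoder can fuse into a single micro-op.

```c
#include <stddef.h>

/* The loop control below typically compiles to an adjacent pair like
 *     cmp rdx, rsi
 *     jne .L3
 * which the front end can fuse into a single micro-op (macrofusion),
 * so the loop bookkeeping occupies one pipeline slot instead of two. */
long sum(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```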
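Sketch 2, the loop stream detector: the kind of loop that qualifies, a tiny body iterated many times. The exact buffer capacity varies by generation, so treat this purely as an illustration; note that aggressive unrolling can grow the body past the buffer and forfeit the benefit.

```c
#include <stddef.h>

/* A short body iterated many times: after the first pass the LSD can
 * replay the already-decoded micro-ops from its buffer, leaving the
 * fetch and decode units idle for the remaining iterations. */
void scale(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}
```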
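Sketch 3, why the misprediction penalty matters: a data-dependent branch that is essentially random on unsorted input, so roughly half the iterations pay the flush-and-refetch cost described above. Sorting the input first makes the same loop far faster, because the branch becomes predictable.

```c
#include <stddef.h>

/* With random bytes the predictor is wrong about half the time, and
 * each miss flushes the pipeline. On sorted input the branch is
 * almost perfectly predictable and the loop runs much faster. */
long count_big(const unsigned char *v, size_t n) {
    long c = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 128)   /* unpredictable on random input */
            c++;
    return c;
}
```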
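Sketch 4, requesting a 1GB page on Linux with mmap. This assumes a kernel with 1GB hugepages reserved ahead of time (for example via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages); the fallback #define covers older headers that lack MAP_HUGE_1GB.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)   /* 2^30 bytes; 26 is MAP_HUGE_SHIFT */
#endif

int main(void) {
    size_t len = 1UL << 30;       /* one 1GB page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    ((char *)p)[0] = 1;           /* touch it: one TLB entry now maps 1GB */
    munmap(p, len);
    return 0;
}
```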
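Sketch 5, unaligned SSE access: _mm_loadu_ps (movups) where older cores would have pushed you toward the aligned _mm_load_ps (movaps). On these cores the unaligned form carries no extra latency as long as the access does not straddle a cache line; the horizontal-sum idiom here is incidental.

```c
#include <xmmintrin.h>

/* p does not have to be 16-byte aligned: movups now costs the same as
 * movaps on an aligned address (cache-line splits aside). */
float sum4(const float *p) {
    __m128 v  = _mm_loadu_ps(p);                /* unaligned 16-byte load */
    __m128 hi = _mm_movehl_ps(v, v);            /* [v2, v3, v2, v3] */
    __m128 t  = _mm_add_ps(v, hi);              /* [v0+v2, v1+v3, ..] */
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1)); /* (v0+v2) + (v1+v3) */
    return _mm_cvtss_f32(t);
}
```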
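Sketch 6, a cache-line-split access: an 8-byte access starting at byte 60 of a 64-byte-aligned buffer covers bytes 60..67 and therefore spans two lines. memcpy keeps the C strictly legal; compilers still emit a single unaligned 8-byte move for it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = aligned_alloc(64, 128);  /* buf starts a cache line */
    if (!buf) return 1;
    uint64_t v = 42;
    memcpy(buf + 60, &v, sizeof v);      /* store spans bytes 60..67: split */
    memcpy(&v, buf + 60, sizeof v);      /* load crosses the same boundary */
    free(buf);
    return (int)v;
}
```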
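Sketch 7, the LOCK/XCHG point: a minimal test-and-set spinlock using C11 atomics. On x86, atomic_exchange compiles to XCHG, which is implicitly LOCKed; the change described above lets younger loads that do not touch the locked address keep executing beneath it.

```c
#include <stdatomic.h>

static atomic_int lock_word = 0;

/* atomic_exchange compiles to XCHG on x86, which locks implicitly.
 * Loads issued after it that do not overlap lock_word can now proceed
 * instead of stalling behind the locked operation. */
void spin_lock(void) {
    while (atomic_exchange(&lock_word, 1))
        ;  /* spin until we swap 0 -> 1 */
}

void spin_unlock(void) {
    atomic_store(&lock_word, 0);
}
```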
