As I continue to learn about x86 microarchitecture, here are some of the significant changes Intel introduced in their Core processors (around 2010) to improve performance.

- 2nd-level branch predictor: aids applications with a large code footprint, where the first-level predictor may not be sufficient.
- Macrofusion: a test/cmp instruction can be fused with the following conditional jump, reducing the number of ops in the pipeline and the latency of such code.
- Loop stream detector (LSD): loops with a short body but a large number of iterations execute the same instructions over and over. Instead of fetching and decoding them repeatedly, the LSD caches the decoded loop instructions in a buffer. This bypasses the branch predictor and the fetch/decode units, saving power.
- Larger out-of-order window: the window of instructions the out-of-order execution engine scans grew from 96 to 128, exposing more instruction throughput.
- Reduced branch-miss penalty: when the predictor makes a mistake, the pipeline must be flushed, the wrong-path instructions retired, and instructions from the correct path fetched. In previous generations, fetching of the correct path could not begin until the wrong-path instructions had retired. With the Core generation this wait is no longer required: instructions from the correct path can be fetched and allocated as soon as the branch miss is detected, although the flush is still needed.
- 2nd-level TLB: a unified second-level TLB helps applications with large code and data footprints, such as databases.
- 1 GB page size: the 32 nm microarchitecture can also support pages of size 1 GB. (I was not aware of this.)
- Unaligned memory access optimization: the 16-byte SSE vector load/store instructions used to have higher latency when the address was not 16-byte aligned. With the Core generation, unaligned access has the same latency as aligned access.
- Reduced latency for cache-line-split accesses: accessing data that straddles two cache lines is now cheaper.
- Fewer pipeline stalls from LOCK/XCHG instructions: these are required for atomic memory operations. Previously, the memory bus would be locked until the locking instruction finished. Now, younger load ops are allowed to execute as long as they do not overlap the memory under the lock, which can improve the performance of multithreaded code.
- No resource sharing when only one thread is running on an HT-enabled core.

------------------------------ All of this is from 2010; things might be a bit different in present-generation processors.
Abhinav Upadhyay’s Post
-
What affects your personal computer or server's throughput far more than the CPU's core frequency or number of cores? Memory latency, obviously. But what is it? Let's understand with the help of the attached animation.

Observe it closely and you will notice that the wood cutter spends far less time cutting the log than waiting for the next log to arrive. Assume for the moment that the conveyor belt carrying the fresh logs is running at its peak speed with the maximum possible load, and that the cutter yells for a new log the moment a fresh one reaches her. The period between finishing one log and the next log arriving is analogous to memory latency.

Now let's shift our perspective to a very basic digital computer. Say your computer has a routine wired into it which checks a specific address (hard-coded in the circuit) for an instruction to execute. It loads the instruction, checks for the data referenced in it, and eventually asks the RAM for that data. Think of the channel (the circuit connecting the RAM and the CPU chip) as the wood cutter's conveyor belt. The data takes its sweet time to reach the CPU, just like the log on the belt. Until the CPU has read and decoded the instruction, it doesn't know what data it will need, so it is impractical to load the data before or alongside the instruction. For every instruction, we spend far more time waiting for the request to reach the RAM controller and for the data to come back, constrained largely by the speed at which signals travel through the circuits (which in turn is fundamentally bounded by c, the speed at which light travels in vacuum). On a modern computer, data from RAM makes it to the CPU's registers in roughly 60-120 ns, and the core is doing mostly nothing while it waits.

A modern CPU, whether in a mobile, laptop, or desktop, far exceeds a 1 GHz clock speed. 1 giga is 10^9, or 1 billion, so in theory a core can process on the order of a billion simple instructions per second. But if each instruction requires a round trip to RAM for data, this practically comes down to about billion/hundred = 10 million instructions. This kind of CPU idleness is often called the memory wall. Tricks like caching (the most recently used data is kept in SRAM on the CPU chip) and prefetching with branch prediction (heuristically figuring out which instructions will execute next so the needed data can be requested in advance; this works especially well for loops) are deployed to improve performance. Net improvements range from 20% to 60%+ depending on the application. CPUs also support fused instructions like Fused Multiply-Add (FMA), which combines a multiply and an add that usually execute back to back into a single operation, cutting instruction count and latency. I hope you learned something new today and will advocate for more efficient software and hardware.
-
Primary components of a CPU

Together, these parts carry out the numerous and intricate activities necessary for computer functioning, ensuring that instructions are handled quickly and accurately.

1. Arithmetic Logic Unit (ALU): The ALU is a crucial part that performs logical and arithmetic operations. In addition to carrying out logical operations like AND, OR, and NOT, it handles basic computations like addition, subtraction, multiplication, and division. These functions are essential for decision-making and processing tasks.

2. Control Unit (CU): The CU retrieves instructions from memory, decodes them, and coordinates the ALU, registers, and other parts to carry them out. It also controls the flow of data within the CPU and between the CPU and other components of the computer.

3. Registers: Registers are compact, quick-to-access storage locations where instructions and data are temporarily stored. They hold everything from addresses and intermediate results to raw data and instructions, and they are vital to the speed and efficiency of the CPU's operations since they provide instant access to critical data. Common types include:
- Accumulator (ACC): stores the result of arithmetic and logical operations.
- Program Counter (PC): keeps track of the address of the next instruction to be executed.
- Instruction Register (IR): holds the current instruction being executed.
- Data Registers: store data that is used or manipulated by the ALU.
- Address Registers: hold memory addresses, such as the Memory Address Register (MAR); the closely related Memory Data Register (MDR) holds the data being transferred to or from that address.

4. Cache: The cache stores copies of frequently used instructions and data from main memory to enable quick access to them. For efficiency, it is arranged into levels: the L1 cache, located closest to the CPU cores, is the smallest and fastest; the L2 cache is larger and marginally slower than L1 and can be private to a core or shared by several cores; the L3 cache is larger and slower than L2 and is usually shared by all CPU cores. These cache levels help lower latency and keep the CPU operating at peak efficiency.

5. Bus Interface Unit (BIU): The BIU controls how the CPU communicates with the system's memory and I/O devices. It manages data transport over the system buses as well as addressing, reading from, and writing to memory and I/O devices. By handling these transfers efficiently, the BIU maintains seamless interaction between the CPU and other components, which is critical for overall system performance.

#TechInsightsWithOlympusTech
-
### Understanding How the CPU and Registers Work The **Central Processing Unit (CPU)** is often called the brain of a computer. It performs all the processing tasks by executing instructions from programs stored in memory. At the heart of the CPU's operations are **registers**, which play a crucial role in managing data and instructions. Let’s explore how these components work together. --- #### **The Role of the CPU** The CPU is responsible for performing four key operations, often referred to as the **fetch-decode-execute cycle**: 1. **Fetch**: The CPU retrieves an instruction from the computer's memory (RAM). The address of the instruction is stored in the **program counter (PC)**. 2. **Decode**: The CPU interprets the instruction to determine what action is needed. 3. **Execute**: The CPU performs the specified operation, such as calculations or data transfers. 4. **Store**: The result of the operation is saved in a register or sent to memory. --- #### **What Are Registers?** Registers are small, high-speed storage locations inside the CPU. They store data temporarily and are used to speed up data processing. Registers can hold instructions, addresses, or intermediate data for operations. Unlike RAM, registers are extremely fast but limited in size. --- #### **Types of Registers in a CPU** 1. **General-Purpose Registers**: - Used for arithmetic, logic, and data manipulation. - Examples: `AX`, `BX`, `CX`, `DX` in x86 architecture. 2. **Special-Purpose Registers**: - **Program Counter (PC)**: Holds the memory address of the next instruction. - **Instruction Register (IR)**: Stores the currently executing instruction. - **Stack Pointer (SP)**: Points to the top of the stack in memory. - **Status Register**: Holds flags that indicate the status of operations (e.g., zero flag, carry flag). 3. **Accumulator Register**: - A special register used for arithmetic and logic operations. - The result of operations is often stored here. 
--- #### **How Registers and the CPU Work Together** 1. **Instruction Fetching**: - The **program counter** provides the address of the instruction to the memory. - The instruction is fetched and stored in the **instruction register**. 2. **Instruction Execution**: - Data required for the operation is loaded into **general-purpose registers** from memory. - The **arithmetic logic unit (ALU)** processes the data using inputs from registers. - Results are temporarily stored back in registers before being written to memory. 3. **Control and Coordination**: - The **control unit** ensures instructions are fetched and executed in the right order, using registers for temporary data handling.
-
The CPU (Central Processing Unit) is the primary component of a computer that performs most of the data processing and executes instructions from the operating system and applications. It is essentially the "brain" of the computer, responsible for:
1. Executing instructions (programs)
2. Performing calculations and data processing
3. Controlling other components (memory, input/output devices)
4. Managing data transfer between components

The CPU consists of several key components:
1. Control Unit (CU): retrieves and decodes instructions
2. Arithmetic Logic Unit (ALU): performs calculations and logical operations
3. Registers: small amount of on-chip memory for temporary storage
4. Cache Memory: small, fast memory for frequently accessed data

CPU performance is measured by:
1. Clock Speed (GHz): how many clock cycles per second
2. Number of Cores: multiple processing units for parallel execution
3. Cache Size: amount of fast memory for data storage
4. Instruction Set Architecture (ISA): the language the CPU understands

Examples of CPUs include:
1. Intel Core i5, i7
2. AMD Ryzen 5, 7
3. Apple M1

In summary, the CPU is the central component of the computer.
-
Day 3: Let's explore the Decode Stage of the LC3

The Instruction Register comprises:
- A 4-bit opcode, which tells us which operation to perform.
- Register fields: the LC3 has a total of 8 registers, R0-R7.
- Source Register 1 (SR1): a 3-bit field in the instruction that tells the CPU which register to use as the first source of data (operand) for an operation.
- Source Register 2 (SR2): another 3-bit field that specifies the second source register, used as the second operand in operations like addition or subtraction.
- Destination Register (DR): a 3-bit field that indicates where the result of the operation should be stored, i.e. which register the CPU should write the output to.

What's the Decode Stage? The Decode stage interprets the fetched instruction and prepares control signals for the subsequent stages.

Inputs:
- clock, reset [1 bit]: basic timing and initialization.
- instr_dout [16 bits]: instruction data from memory. During the fetch stage the CPU requests an instruction from instruction memory (IMEM) at the address in the PC; once fetched, instr_dout holds the instruction and passes it to the decode stage.
- npc_in [16 bits]: the NPC value from the Fetch stage. It carries the address of the next instruction to the execute stage, ensuring the CPU is prepared for the next step in execution.
- psr [3 bits]: the program status register contains the condition flags that record the outcome of the most recent operation, such as an arithmetic result or a value loaded into a register. The flags are Negative (result was negative), Zero (result was zero), and Positive (result was positive); they help the CPU act on the result of recent calculations or data operations.
- enable_decode [1 bit]: enables the decode unit. If 1, the decode stage works normally; if 0, it is paused and does not process any instructions.

Outputs:
- IR [16 bits]: the Instruction Register.
- npc_out [16 bits]: reflects npc_in.
- E_control [6 bits]: controls the Execute unit. It determines which input values are sent to the ALU and which operation the ALU performs.
- W_control: determines writeback, i.e. which result is written to the register file during the writeback stage. The choices are: the ALU output (e.g. addition or subtraction); an LEA result, where the address calculated relative to the PC is written to the register; or memory output, where a value loaded from memory is written to the register.
- Mem_control [7 bits]: selects the correct states or actions to take during memory access, whether reading from or writing to memory.

Understanding how instructions are decoded is essential for correct execution and the subsequent stages. #DecodeStage #SystemVerilog #LC3Microcontroller #DigitalDesign
-
Hardware Vulnerability in Apple’s M-Series Chips: It’s yet another hardware side-channel attack: The threat resides in the chips’ data memory-dependent prefetcher, a hardware optimization that predicts the memory addresses of data that running code is likely to access in the near future. By loading the contents into the CPU cache before it’s actually needed, the DMP, as the feature is abbreviated, reduces latency between the main memory and the CPU, a common bottleneck in modern computing. DMPs are a relatively new phenomenon found only in M-series chips and Intel’s 13th-generation Raptor Lake microarchitecture, although older forms of prefetchers have been common for years...
-
🔍 TOPIC: Cache Memory - Speeding Up Computing Processes

💡 What is Cache Memory?
Cache memory is a small but ultra-fast type of memory located between the CPU and main memory (RAM). Its primary function is to store frequently accessed data and instructions, allowing the CPU to retrieve them faster than accessing main memory. This optimization significantly boosts overall system performance by minimizing the time the CPU spends waiting for data from RAM.

💡 How Does Cache Memory Work?
Cache memory operates using three key principles:
1️⃣ Temporal Locality: frequently accessed data is likely to be requested again soon, so it is kept in the cache for quick retrieval.
2️⃣ Spatial Locality: data located near recently accessed data is also likely to be accessed soon and is therefore cached.
3️⃣ Pre-fetching: the system anticipates future requests and loads data into the cache before it is requested.

💡 Types of Cache Memory:
1️⃣ L1 Cache (Level 1): the fastest and smallest cache, integrated directly into the CPU.
2️⃣ L2 Cache (Level 2): larger than L1 but slightly slower, often located on the CPU chip or close to it.
3️⃣ L3 Cache (Level 3): shared by multiple cores in multi-core processors; larger but slower than L1 and L2.

💡 Cache Organization:
Cache memory can be organized in three ways:
1. Direct-Mapped Cache: each block of main memory maps to exactly one cache line.
2. Fully-Associative Cache: any block of memory can be placed in any cache line, offering flexibility at the cost of higher complexity.
3. Set-Associative Cache: a compromise between the two, where each block can be mapped to a specific set of cache lines.

💡 Why is Cache Memory Critical?
In modern computing the CPU operates at incredibly high speeds, but accessing data from main memory is slower. Cache memory bridges this gap, ensuring the processor operates efficiently, leading to improved application performance and reduced latency.

💡 Challenges and Future of Cache Memory:
1️⃣ Cache Coherency: ensuring consistency of data across multiple caches in multi-core processors.
2️⃣ Cache Size vs. Speed: larger caches are slower, and striking the right balance between size and speed is an ongoing challenge.
3️⃣ Power Consumption: optimizing cache design for minimal energy use in mobile and embedded systems.

#CacheMemory #ComputerArchitecture #CPU #DataProcessing #TechInnovation #Engineering #ComputerScience #PerformanceOptimization
-
🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures. https://lnkd.in/g6rNKYry
GitHub - huggingface/optimum-intel: 🤗 Optimum Intel: Accelerate inference with Intel optimization tools
-
userland@localhost:~$ nvim multithmath.f95
userland@localhost:~$ gfortran multithmath.f95 -fopenmp -ffree-form -o mm
userland@localhost:~$ ./mm
 Using 8 threads.
 n = 500
 Matrix multiplication completed in 0.48200661400005629 seconds.
userland@localhost:~$ cat multithmath.f95
program multithreading_demo
  use omp_lib   ! OpenMP library for timing and parallelization features
  implicit none
  integer, parameter :: n = 500
  real(8), dimension(n,n) :: A, B, C
  integer :: i, j, k
  real(8) :: start_time, end_time
  integer :: thread_count

  ! Initialize matrices A and B
  call random_number(A)
  call random_number(B)
  C = 0.0

  ! Query number of available threads
  thread_count = omp_get_max_threads()
  print *, "Using ", thread_count, " threads."
  print *, " n = ", n

  ! Start time measurement using the OpenMP timer
  start_time = omp_get_wtime()

  ! Matrix multiplication parallelized over the (j,i) index pairs;
  ! each thread owns distinct C(i,j) entries, so no reduction is needed
  !$omp parallel do collapse(2) private(i, j, k) shared(A, B, C)
  do j = 1, n
    do i = 1, n
      do k = 1, n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end do
    end do
  end do
  !$omp end parallel do

  ! End time measurement
  end_time = omp_get_wtime()

  ! Output the execution time
  print *, "Matrix multiplication completed in ", end_time - start_time, " seconds."
end program multithreading_demo
userland@localhost:~$ lscpu
Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               Qualcomm
  Model name:            Kryo V2
    Model:               4
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            0xa
    CPU max MHz:         1900.8000
    CPU min MHz:         300.0000
    BogoMIPS:            38.40
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
  Model name:            Cortex-A73
    Model:               0
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            r1p0
    CPU max MHz:         2400.0000
    CPU min MHz:         300.0000
    BogoMIPS:            38.40
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not