

# **Event Based Analysis**

Stephen Blair-Chappell

Intel Compiler Labs

# This training relies on you owning a copy of the following...

### Parallel Programming with Parallel Studio XE Stephen Blair-Chappell & Andrew Stokes

### Wiley ISBN: 9780470891650

#### Part I: Introduction

- 1: Parallelism Today
- 2: An Overview of Parallel Studio XE
- 3: Parallel Studio XE for the Impatient




#### Part II: Using Parallel Studio XE

- 4: Producing Optimized Code
- 5: Writing Secure Code
- 6: Where to Parallelize
- 7: Implementing Parallelism
- 8: Checking for Errors
- 9: Tuning Parallelism
- 10: Advisor-Driven Design
- 11: Debugging Parallel Applications
- 12: Event-Based Analysis with VTune Amplifier XE

#### Part III: Case Studies

- 13: The World's First Sudoku 'Thirty-Niner'
- 14: Nine Tips to Parallel Heaven
- 15: Parallel Track-Fitting in the CERN Collider
- 16: Parallelizing Legacy Code





### **VTune Amplifier is a simple tool**



Imagine you have a cool car and you want it to go a little faster or be more fuel-efficient.

Everything you need is right here.

Like other simple tools, VTune can provide basic information on the performance of your engine.



### **VTune Amplifier is a complex tool**



However, if you want your car to win a race...

Your tool set has to be much more complex so that you can analyze every aspect of how the engine works. You also need to be proficient in both the tool's functionality and the engine internals!



# **Testing the Health of an Application**

Does it run fast?



Does it get through lots of work?



Is any part of the code inefficient?





# A program's performance

A program's performance can be impacted by

- System-wide activity
- Application Heuristics
- CPU architecture

Any analysis should be done in this order.




# Are you Sick?

Fever?

High Pressure?

Aches & Pains







# 'Is my program unwell?'

- Number of cycles (clock ticks) a program consumes
- Number of retired instructions
- CPI (Cycles Per Instruction)
  - CPI = number of cycles / number of retired instructions
  - Low is good, high is bad
  - Theoretical best is 0.25\*\*\*
  - Anything below 1.00 is pretty good

\*\*\*NOTE: On Xeon Phi the best CPI is 0.5
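As a rough illustration of the arithmetic above, the sketch below computes CPI from two raw counts and applies the "below 1.00" rule of thumb. The function and macro names are hypothetical and are not part of VTune; the counts would come from whatever profiler output you have.

```c
#include <assert.h>
#include <stdint.h>

/* Best-case CPI figures quoted on the slide (0.25 means 4 instructions per cycle). */
#define BEST_CPI          0.25
#define BEST_CPI_XEON_PHI 0.5

/* CPI = clock ticks consumed / instructions retired. */
static double cycles_per_instruction(uint64_t cycles, uint64_t retired_instructions)
{
    assert(retired_instructions > 0); /* CPI is undefined without retired instructions */
    return (double)cycles / (double)retired_instructions;
}

/* Rule of thumb from the slide: anything below 1.00 is pretty good. */
static int cpi_is_healthy(double cpi)
{
    return cpi < 1.0;
}
```

For example, 1000 cycles and 4000 retired instructions give a CPI of 0.25, which matches the theoretical best quoted above.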





# Using CPI can be misleading

- Some optimisation steps can lead to an increase in CPI
- Always keep an eye on the fundamentals!
  - How long did my program take to run?





# All programs consume cycles

These cycles consist of

Cycles where instructions are usefully executed

Cycles when nothing happens

Cycles where instructions are executed, but the results never used

### Goal of performance tuning is to reduce each of these





Copyright© 2012, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.





```
#include <stddef.h>
#include <ittnotify.h>

int main(void)
{
    __itt_domain* pD = __itt_domain_create("Time");
    pD->flags = 1; // enable domain

    for (int i = 0; i < 100000; i++) {
        // mark the beginning of the frame
        __itt_frame_begin_v3(pD, NULL);
        // simulate frames with different timings
        // (volatile keeps the compiler from removing the empty loops)
        for (volatile int j = 0; j < 30000; j++); // a delay
        for (volatile int j = 0; j < 11200; j++); // another delay
        // mark the end of the frame
        __itt_frame_end_v3(pD, NULL);
    }
    return 0;
}
```



# Some tips to get you up and running on VTune



# Linux – VTune is not recognised

• You must *source* the environment script to set up the path!

source /opt/intel/vtune\_amplifier\_xe/amplxe-vars.sh

• To start the GUI from the prompt:

amplxe-gui &





# Linux – problem accessing the sample driver!

[Screenshot: the Choose Analysis Type pane with Lightweight Hotspots selected, showing the following error.]

"Problem accessing the sampling driver. The driver may need to be (re)started. See Installing the Sampling Driver help topic to learn how to configure the sampling driver."

# Linux – problem accessing the sample driver!

1. Check whether the driver is loaded

cd /opt/intel/vtune\_amplifier\_xe/sepdk/prebuilt/

./insmod-sep3 -q

*NOTE: The default installation expects users to be in the group 'vtune'*

2. If not, reload it

./insmod-sep3 -r -g democenter

3. If the NMI watchdog is causing a problem, disable it

"Warning: NMI watchdog timer is enabled. Turn off the nmi\_watchdog timer before running sampling."

echo 0 > /proc/sys/kernel/nmi\_watchdog

4. If not available, rebuild (see next slide)






# Linux – problem accessing the sample driver!

4. If not available, rebuild

cd /opt/intel/vtune\_amplifier\_xe/sepdk/src

./build-driver -ni --install-dir=../prebuilt

5. Load the driver

cd ../prebuilt

./insmod-sep3 -r -g democenter

6. To load automatically on reboot

./boot-script --install -g democenter






# Windows – no accurate CPU time collection.

[Screenshot: analysis configuration pane for a user-mode sampling and tracing collection, showing the following warning.]

"Highly accurate CPU time collection is disabled for this analysis. To enable this feature, run the product with the administrative privileges."

CPU sampling interval, ms: 10

### Start the program in Administrator mode




# **Architectural Analysis not available**



### You are highlighting the wrong analysis type!





## Linux – The source editor won't open

Set the EDITOR or VISUAL environment variable!

export EDITOR=gedit

or

export EDITOR=vi








# Intel® Xeon Phi™ Coprocessor Performance Monitoring Unit

Visualizing Performance Opportunities using Intel<sup>®</sup> VTune<sup>™</sup> Amplifier



Once hot spots have been identified, performance events can be examined to identify poor resource use.

- Two performance counters per thread are currently available on the coprocessor
- VTune<sup>™</sup> Amplifier event multiplexing enables sampling more than two events at a time

Focus optimization on hot spots displaying problematic events.

- Indicators of resource use can be derived from performance events
- Many useful indicators are pulled together in the General Exploration analysis



# Cycles Per Instruction (CPI), a standard measure, has some special kinks

- Threads on each Intel<sup>®</sup> Xeon Phi<sup>™</sup> core share a clock
  - If all 4 HW threads are active, each gets ¼ of the total cycles
- Multi-stage instruction decode requires two threads to utilize the whole core; one thread only gets half
- With two ops per cycle (U-V-pipe dual issue):

| Threads<br>per Core | Best CPI<br>per Core | Best CPI<br>per Thread |
|---------------------|----------------------|------------------------|
| 1                   | 1.0                  | 1.0                    |
| 2                   | 0.5                  | 1.0                    |
| 3                   | 0.5                  | 1.5                    |
| 4                   | 0.5                  | 2.0                    |

• To get the thread CPI, multiply the core CPI by the number of active threads
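The table can be reproduced with a few lines of C. This is only a sketch of the slide's arithmetic: the function names are hypothetical, and the best-case per-core values (1.0 with one thread, 0.5 with two or more) are taken straight from the table rather than from any measurement.

```c
#include <assert.h>

/* Best achievable CPI per core, as listed in the table:
   a single thread can use only half of the core. */
static double best_cpi_per_core(int active_threads)
{
    assert(active_threads >= 1 && active_threads <= 4);
    return active_threads == 1 ? 1.0 : 0.5;
}

/* Thread CPI = core CPI multiplied by the number of active threads. */
static double cpi_per_thread(double cpi_per_core, int active_threads)
{
    assert(active_threads >= 1 && active_threads <= 4);
    return cpi_per_core * active_threads;
}
```

With four active threads, `cpi_per_thread(best_cpi_per_core(4), 4)` gives 2.0, the last row of the table.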

5/28/2014




Copyright © 2014, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.

# As an efficiency metric, CPI must be considered carefully: it IS a ratio

• Changes in CPI absent major code changes can indicate general latency gains/losses

| Metric            | Formula                                               | Investigate if       |
|-------------------|-------------------------------------------------------|----------------------|
| CPI per<br>Thread | CPU_CLK_UNHALTED/<br>INSTRUCTIONS_EXECUTED            | > 4.0, or increasing |
| CPI per<br>Core   | (CPI per Thread) / Number of<br>hardware threads used | > 1.0, or increasing |

- Note the effect on CPI from applied optimizations
- Reduce high CPI through optimizations that target latency
  - Better prefetch
  - Increase data reuse through better blocking

Two more examples why absolute CPI value is less important than changes

Scaling data from a typical lab workload:

| Metric            | 1 hardware<br>thread /<br>core | 2<br>hardware<br>threads / core | 3 hardware<br>threads /<br>core | 4 hardware<br>threads /<br>core |
|-------------------|--------------------------------|---------------------------------|---------------------------------|---------------------------------|
| CPI per<br>Thread | 5.24                           | 8.80                            | 11.18                           | 13.74                           |
| CPI per Core      | 5.24                           | 4.40                            | 3.73                            | 3.43                            |

### Observed CPIs from several tuned workloads:



# Efficiency Metric: Compute to Data Access Ratio

Measures an application's computational density and its suitability for Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessors

| Metric                             | Formula                                                   | Investigate if                            |
|------------------------------------|-----------------------------------------------------------|-------------------------------------------|
| Vectorization<br>Intensity         | VPU_ELEMENTS_ACTIVE /<br>VPU_INSTRUCTIONS_EXECUTED        |                                           |
| L1 Compute to<br>Data Access Ratio | VPU_ELEMENTS_ACTIVE /<br>DATA_READ_OR_WRITE               | < Vectorization<br>Intensity              |
| L2 Compute to<br>Data Access Ratio | VPU_ELEMENTS_ACTIVE /<br>DATA_READ_MISS_OR_WRITE_MISS | < 100x L1 Compute to<br>Data Access Ratio |

Increase computational density through vectorization and by reducing data access (see cache issues; also, DATA ALIGNMENT!)
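The three formulas in the table can be sketched in C as follows. The struct and function names are hypothetical; the fields simply hold the event counts named in the table, however you obtained them.

```c
#include <assert.h>
#include <stdint.h>

/* Raw event counts, named after the events in the table. */
typedef struct {
    uint64_t vpu_elements_active;
    uint64_t vpu_instructions_executed;
    uint64_t data_read_or_write;
    uint64_t data_read_miss_or_write_miss;
} phi_counters;

static double ratio(uint64_t num, uint64_t den)
{
    assert(den > 0); /* the metric is undefined for a zero count */
    return (double)num / (double)den;
}

static double vectorization_intensity(const phi_counters *c)
{
    return ratio(c->vpu_elements_active, c->vpu_instructions_executed);
}

static double l1_compute_to_data(const phi_counters *c)
{
    return ratio(c->vpu_elements_active, c->data_read_or_write);
}

static double l2_compute_to_data(const phi_counters *c)
{
    return ratio(c->vpu_elements_active, c->data_read_miss_or_write_miss);
}

/* "Investigate if" column: L1 ratio below the vectorization intensity. */
static int l1_ratio_low(const phi_counters *c)
{
    return l1_compute_to_data(c) < vectorization_intensity(c);
}

/* "Investigate if" column: L2 ratio below 100x the L1 ratio. */
static int l2_ratio_low(const phi_counters *c)
{
    return l2_compute_to_data(c) < 100.0 * l1_compute_to_data(c);
}
```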

# Problem Area: L1 Cache Usage

• Significantly affects data access latency and therefore application performance

| Metric         | Formula                                                    | Investigate if |
|----------------|------------------------------------------------------------|----------------|
| L1<br>Misses   | DATA_READ_MISS_OR_WRITE_MISS +<br>L1_DATA_HIT_INFLIGHT_PF1 |                |
| L1 Hit<br>Rate | (DATA_READ_OR_WRITE – L1 Misses) /<br>DATA_READ_OR_WRITE   | < 95%          |

- Tuning Suggestions:
  - Software prefetching
  - Tile/block data access for cache size
  - Use streaming stores
  - If using a 4K access stride, you may be experiencing conflict misses
  - Examine Compiler prefetching (Compiler-generated L1 prefetches should not miss)

\*Tuning suggestions that require a deeper understanding of architectural tradeoffs and application data handling details are highlighted with this "ninja" notation.
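A minimal sketch of the two L1 formulas from the table, assuming the event counts have already been collected. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* L1 Misses = DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1 */
static uint64_t l1_misses(uint64_t data_read_miss_or_write_miss,
                          uint64_t l1_data_hit_inflight_pf1)
{
    return data_read_miss_or_write_miss + l1_data_hit_inflight_pf1;
}

/* L1 Hit Rate = (DATA_READ_OR_WRITE - L1 Misses) / DATA_READ_OR_WRITE */
static double l1_hit_rate(uint64_t data_read_or_write, uint64_t misses)
{
    assert(data_read_or_write > 0 && misses <= data_read_or_write);
    return (double)(data_read_or_write - misses) / (double)data_read_or_write;
}

/* Threshold from the table: investigate if the hit rate is below 95%. */
static int l1_needs_investigation(double hit_rate)
{
    return hit_rate < 0.95;
}
```

With 1000 accesses, 30 misses and 10 in-flight prefetch hits, the hit rate is 96%, which is above the threshold.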



# Problem Area: Data Access Latency

| Metric                         | Formula                                                                                        | Investigate if |
|--------------------------------|------------------------------------------------------------------------------------------------|----------------|
| Estimated<br>Latency<br>Impact | (CPU_CLK_UNHALTED<br>– EXEC_STAGE_CYCLES<br>– DATA_READ_OR_WRITE)<br>/ DATA_READ_OR_WRITE_MISS | >145           |

- Tuning Suggestions:
  - Software prefetching
  - Tile/block data access for cache size
  - Use streaming stores
  - Check cache locality: turn off prefetching and use CACHE\_FILL events; reduce sharing if needed/possible
  - If using a 64K access stride, you may be experiencing conflict misses
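The Estimated Latency Impact formula can be sketched as below. The parameter names follow the event names exactly as they are written in the table; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated Latency Impact =
 *   (CPU_CLK_UNHALTED - EXEC_STAGE_CYCLES - DATA_READ_OR_WRITE)
 *   / DATA_READ_OR_WRITE_MISS
 */
static double estimated_latency_impact(uint64_t cpu_clk_unhalted,
                                       uint64_t exec_stage_cycles,
                                       uint64_t data_read_or_write,
                                       uint64_t data_read_or_write_miss)
{
    double numerator;
    assert(data_read_or_write_miss > 0);
    /* Subtract in floating point so that noisy sampled counts cannot
       wrap around as unsigned integers. */
    numerator = (double)cpu_clk_unhalted
              - (double)exec_stage_cycles
              - (double)data_read_or_write;
    return numerator / (double)data_read_or_write_miss;
}

/* Threshold from the table: investigate if the value exceeds 145. */
static int latency_needs_investigation(double impact)
{
    return impact > 145.0;
}
```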

# Problem Area: TLB Usage

• Also affects data access latency and therefore application performance

| Metric                           | Formula                                     | Investigate<br>if: |
|----------------------------------|---------------------------------------------|--------------------|
| L1 TLB miss ratio                | DATA_PAGE_WALK/DATA_READ_OR_WRITE           | > 1%               |
| L2 TLB miss ratio                | LONG_DATA_PAGE_WALK<br>/ DATA_READ_OR_WRITE | > .1%              |
| L1 TLB misses per<br>L2 TLB miss | DATA_PAGE_WALK /<br>LONG_DATA_PAGE_WALK     | > 100x             |

- Tuning Suggestions:
  - Improve cache usage & data access latency
  - If L1 TLB miss/L2 TLB miss is high, try using large pages
  - For loops with multiple streams, try splitting into multiple loops
  - If data access stride is a large power of 2, consider padding between arrays by one 4 KB page
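The three TLB ratios can be computed together, as in this sketch. The struct and function names are hypothetical, and reporting 0 when there are no L2 TLB misses is an arbitrary choice made here to avoid a division by zero.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    double l1_miss_ratio;  /* DATA_PAGE_WALK / DATA_READ_OR_WRITE */
    double l2_miss_ratio;  /* LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE */
    double l1_per_l2_miss; /* DATA_PAGE_WALK / LONG_DATA_PAGE_WALK */
} tlb_metrics;

static tlb_metrics tlb_usage(uint64_t data_page_walk,
                             uint64_t long_data_page_walk,
                             uint64_t data_read_or_write)
{
    tlb_metrics m;
    assert(data_read_or_write > 0);
    m.l1_miss_ratio = (double)data_page_walk / (double)data_read_or_write;
    m.l2_miss_ratio = (double)long_data_page_walk / (double)data_read_or_write;
    /* Undefined without L2 TLB misses; report 0 in that case (assumption). */
    m.l1_per_l2_miss = long_data_page_walk
        ? (double)data_page_walk / (double)long_data_page_walk
        : 0.0;
    return m;
}

/* The slide suggests large pages when L1 misses per L2 miss is high (> 100x). */
static int suggests_large_pages(const tlb_metrics *m)
{
    return m->l1_per_l2_miss > 100.0;
}
```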

# Problem Area: VPU Usage

Indicates whether an application is vectorized successfully and efficiently

| Metric                     | Formula                                            | Investigate if   |
|----------------------------|----------------------------------------------------|------------------|
| Vectorization<br>Intensity | VPU_ELEMENTS_ACTIVE /<br>VPU_INSTRUCTIONS_EXECUTED | <8 (DP), <16(SP) |

- Tuning Suggestions:
  - Use the Compiler vectorization report!
  - For data dependencies preventing vectorization, try using Intel<sup>®</sup> Cilk<sup>™</sup> Plus #pragma SIMD (if safe!)
  - Align data and tell the Compiler!
  - Restructure code if possible: Array notations, AOS->SOA
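To make the AOS->SOA suggestion concrete, here is a minimal sketch in plain C. The type names, the array size `N` and the helper functions are all made up for illustration; whether the compiler actually vectorizes the loop should still be checked in the vectorization report.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024 /* illustrative capacity */

/* Array of structures: the x values of successive points are 12 bytes apart,
   so a loop over x alone accesses memory with a stride. */
typedef struct {
    float x, y, z;
} point_aos;

/* Structure of arrays: each component is contiguous (unit stride),
   which is the layout vector loads and stores prefer. */
typedef struct {
    float x[N];
    float y[N];
    float z[N];
} points_soa;

static void aos_to_soa(const point_aos *in, points_soa *out, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* A unit-stride reduction over one component of the SoA layout. */
static float sum_x(const points_soa *p, size_t n)
{
    float sum = 0.0f;
    assert(n <= N);
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```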

# Problem Area: Memory Bandwidth

• Can increase data latency in the system or become a performance bottleneck

| Metric              | Formula                                                                                                                | Investigate if                                                                |
|---------------------|------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Memory<br>Bandwidth | (UNC_F_CH0_NORMAL_READ +<br>UNC_F_CH0_NORMAL_WRITE+<br>UNC_F_CH1_NORMAL_READ +<br>UNC_F_CH1_NORMAL_WRITE) X<br>64/time | < 80GB/sec<br>(practical peak<br>140GB/sec)<br>(with 8 memory<br>controllers) |

- Tuning Suggestions:
  - Improve locality in caches
  - Use streaming stores
  - Improve software prefetching
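The bandwidth formula multiplies the number of memory reads and writes by 64 bytes and divides by the elapsed time. The sketch below assumes that the four counts are already summed across all memory controllers and that 1 GB means 10^9 bytes; both are assumptions, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Memory bandwidth in GB/sec from the four uncore event counts in the table.
   Each counted access is multiplied by 64 bytes, as in the formula. */
static double memory_bandwidth_gb_per_s(uint64_t ch0_normal_read,
                                        uint64_t ch0_normal_write,
                                        uint64_t ch1_normal_read,
                                        uint64_t ch1_normal_write,
                                        double seconds)
{
    double accesses;
    assert(seconds > 0.0);
    accesses = (double)ch0_normal_read + (double)ch0_normal_write
             + (double)ch1_normal_read + (double)ch1_normal_write;
    return accesses * 64.0 / seconds / 1e9; /* assumes 1 GB = 1e9 bytes */
}

/* Threshold from the table: investigate if below 80 GB/sec. */
static int bandwidth_needs_investigation(double gb_per_s)
{
    return gb_per_s < 80.0;
}
```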

# Final caution: coprocessor collections can generate dense volumes of data

#### Example: DGEMM on 60+ cores

[Screenshot: Intel VTune Amplifier XE 2013, Memory Bandwidth analysis, Bottom-up tab with a read-bandwidth timeline for package\_0, grouped by function.]

| Function             | CPU Time  | Module             |
|----------------------|-----------|--------------------|
| [dgemm_linux_native] | 5314.070s | dgemm_linux_native |
| [vmlinux]            | 54.130s   | vmlinux            |
| [dropbearmulti]      | 0.310s    | dropbearmulti      |

#### Tip: Use a CPU Mask to reduce data volume while maintaining equivalent accuracy.

# The life of a program instruction



# For more information on Intel® Xeon Phi™ and VTune™ Amplifier XE

Optimization on the coprocessor:

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

Coprocessor Performance Monitoring Unit: http://software.intel.com/sites/default/files/forum/278102/intelr-xeon-phitm-pmu-rev1.01.pdf

For general information: http://software.intel.com/mic-developer



# Thank You



# Backup

# **Suggested Order of Fixing Problems**

| Priority | Problem                                  |
|----------|------------------------------------------|
| 1        | Cache misses                             |
| 2        | Contested access                         |
| 3        | Other data access issues                 |
| 4        | Allocation Stalls                        |
| 5        | Microcode Assists                        |
| 6        | Branch Mispredictions and Machine Clears |
| 7        | Other Front-end stalls                   |

See slides in backup section for a more detailed description



# **Cache Misses**

- **Why:** Cache misses raise the CPI of an application. Focus on long-latency data accesses coming from 2<sup>nd</sup> and 3<sup>rd</sup> level misses.
- **How:** General Exploration profile, Metrics: *LLC Hit*, *LLC Miss*
- **What Now:** If either metric is highlighted for your hotspot, consider reducing misses:
  - Use the cacheline replacement analysis outlined in the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section **B.3.4.2**
  - Use software prefetch instructions
  - Block data accesses to fit into cache
  - Use local variables for threads
  - Pad data structures to cacheline boundaries
  - Change your algorithm to reduce data storage



#### B.3.4.2 Cache-line Replacement Analysis

When an application has many cache misses, it is a good idea to determine where cache lines are being replaced at the highest frequency. The instructions responsible for high amount of cache replacements are not always where the application is spending the majority of its time, since replacements can be driven by the hardware prefetchers and store operations which in the common case do not hold up the pipe-line. Typically traversing large arrays or data structures can cause heavy cache line replacements.

#### Required events

L1D.REPLACEMENT - Replacements in the 1st level data cache.

L2\_LINES\_IN.ALL - Cache lines being brought into the L2 cache.

OFFCORE\_RESPONSE.DATA\_IN\_SOCKET.LLC\_MISS\_LOCAL.DRAM\_0 - Cache lines being brought into the LLC.

#### Usages of events

Identifying the replacements that potentially cause performance loss can be done at process, module, and function level. Do it in two steps:

- Use the precise load breakdown to identify the memory hierarchy level at which loads are satisfied and cause the highest penalty.
- Identify, using the formulas below, which portion of code causes the majority of the replacements in the level below the one that satisfies these high penalty loads.

For example, if there is a high penalty due to loads hitting the LLC, check the code which is causing replacements in the L2 and the L1. In the formulas below, the numerators are the replacements accounted for a module or function. The sum of the replacements in the denominators is the sum of all replacements in a cache level for all processes. This enables you to identify the portion of code that causes the majority of the replacements.

#### L1D Cache Replacements

%L1D.REPLACEMENT = L1D.REPLACEMENT / SumOverAllProcesses(L1D.REPLACEMENT);

#### L2 Cache Replacements

%L2.REPLACEMENT = L2\_LINES\_IN.ALL / SumOverAllProcesses(L2\_LINES\_IN.ALL);

#### L3 Cache Replacements

%L3.REPLACEMENT = OFFCORE\_RESPONSE.DATA\_IN\_SOCKET.LLC\_MISS\_LOCAL.DRAM\_0 / SumOverAllProcesses(OFFCORE\_RESPONSE.DATA\_IN\_SOCKET.LLC\_MISS\_LOCAL.DRAM\_0);
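All three formulas have the same shape: the event count attributed to one module or function, divided by the sum of that event over all processes. A minimal sketch, with a hypothetical function name and with the per-process totals passed in as an array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Share of replacements caused by one module or function:
 *   count / SumOverAllProcesses(event)
 * 'count' is the event count attributed to the code of interest and
 * 'per_process' holds the totals of the same event for every process. */
static double replacement_share(uint64_t count,
                                const uint64_t *per_process, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += per_process[i];
    assert(total > 0 && count <= total);
    return (double)count / (double)total;
}
```

The same function can be applied to L1D.REPLACEMENT, L2\_LINES\_IN.ALL or the OFFCORE\_RESPONSE event, depending on which cache level is being examined.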





# **Contested Accesses**

- **Why:** Sharing modified data among cores can raise the latency of data access
- **How:** General Exploration profile, Metric: *Contested Accesses*
- **What Now:**
  - If the metric is highlighted for your hotspot, locate the source code line(s) generating HITMs by viewing the source. Look for the MEM\_LOAD\_UOPS\_LLC\_HIT\_RETIRED.XSNP\_HITM\_PS event, which will tag to the next instruction after the one that generated the HITM.
  - Then use knowledge of the code to determine whether real or false sharing is taking place. Make appropriate fixes:
    - For real sharing, reduce sharing requirements
    - For false sharing, pad variables to cacheline boundaries
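Padding for false sharing can be sketched with C11 `alignas`. The 64-byte line size, the type names and the helper function are assumptions made for illustration; check the line size of your target processor.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Prone to false sharing: a and b usually sit in the same cache line,
   so two threads updating them keep stealing the line from each other. */
struct counters_packed {
    long a;
    long b;
};

/* Each counter starts on its own cache line; the alignment also pads
   the struct size up to a full line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

struct counters_padded {
    struct padded_counter a;
    struct padded_counter b;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "padded counter should occupy exactly one cache line");

/* Returns non-zero if two addresses fall in the same cache line. */
static int same_cache_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

The padded layout trades memory for isolation: each counter now costs a full line instead of the size of a `long`.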

### **Hit Modified Data**





… a cache line in a cache of another core and the cache line has not been modified.

MEM\_LOAD\_UOPS\_LLC\_HIT\_RETIRED.XSNP\_HITM\_PS - Counts demand loads that hit a cache line in the cache of another core and the cache line has been written to by that other core. This event is important for many performance bottlenecks that can occur in multi-threaded applications, such as lock contention and false sharing.






#### **Back-End Bound**

### Other Data Access Issues: Blocked Loads Due to No Store Forwarding

- **Why:** If it is not possible to forward the result of a store through the pipeline, dependent loads may be blocked
- **How:** General Exploration profile, Metric: *Loads Blocked by Store Forwarding*
- **What Now:**
  - If the metric is highlighted for your hotspot, investigate:
    - View source and look at the LD\_BLOCKS\_STORE\_FORWARD event. Usually this event tags to the next instruction after the attempted load that was blocked. Locate the load, then try to find the store that cannot forward, which is usually within the prior 10-15 instructions. The most common case is that the store is to a smaller memory space than the load. Fix the store by storing to the same size or larger space as the ensuing load.





### 2.2.4.4 Store Forwarding

If a load follows a store and reloads the data that the store writes to memory, the Intel Core microarchitecture can forward the data directly from the store to the load. This process, called store to load forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory.

The following rules must be met for store to load forwarding to occur:

- The store must be the last store to that address prior to the load.
- The store must be equal or greater in size than the size of data being loaded.
- The load cannot cross a cache line boundary.
- The load cannot cross an 8-Byte boundary. 16-Byte loads are an exception to this rule.
- The load must be aligned to the start of the store address, except for the following exceptions:
  - An aligned 64-bit store may forward either of its 32-bit halves
  - An aligned 128-bit store may forward any of its 32-bit quarters
  - An aligned 128-bit store may forward either of its 64-bit halves

Software can use the exceptions to the last rule to move complex structures without losing the ability to forward the subfields.
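The rules quoted above can be turned into a small predicate for reasoning about a given store/load pair. This is a simplified model written for illustration, not a description of what any particular processor does: it ignores the "last store to that address" rule, assumes a 64-byte cache line, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Returns non-zero if, under the quoted rules, a load of load_size bytes at
   load_addr could be forwarded from a store of store_size bytes at store_addr. */
static int can_forward(uint64_t store_addr, unsigned store_size,
                       uint64_t load_addr, unsigned load_size)
{
    uint64_t load_last = load_addr + load_size - 1;
    uint64_t offset;
    int store_aligned;

    assert(store_size > 0 && load_size > 0);

    /* The load must lie inside the stored bytes (so store size >= load size). */
    if (load_addr < store_addr || load_addr + load_size > store_addr + store_size)
        return 0;
    /* The load cannot cross a cache line boundary (64 bytes assumed). */
    if (load_addr / 64 != load_last / 64)
        return 0;
    /* The load cannot cross an 8-byte boundary, except for 16-byte loads. */
    if (load_size != 16 && load_addr / 8 != load_last / 8)
        return 0;
    /* The load must start at the store address... */
    if (load_addr == store_addr)
        return 1;

    /* ...except for the listed exceptions for aligned stores. */
    offset = load_addr - store_addr;
    store_aligned = (store_addr % store_size) == 0;
    if (store_aligned && store_size == 8 && load_size == 4 && offset == 4)
        return 1; /* either 32-bit half of an aligned 64-bit store */
    if (store_aligned && store_size == 16 && load_size == 4 && offset % 4 == 0)
        return 1; /* any 32-bit quarter of an aligned 128-bit store */
    if (store_aligned && store_size == 16 && load_size == 8 && offset == 8)
        return 1; /* either 64-bit half of an aligned 128-bit store */
    return 0;
}
```

The most common failure described on the slide, a store narrower than the ensuing load, corresponds to `can_forward(0x1000, 4, 0x1000, 8)` returning 0.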





### Other Data Access Issues: Cache Line Splits

- **Why:** Multiple cache line splits can result in load penalties.
- **How:** General Exploration profile, Metric: *Split Loads, Split Stores*
- **What Now:**
  - If the metric is highlighted for your hotspot, investigate by viewing the metrics at the source code level. The split load event, MEM\_UOP\_RETIRED.SPLIT\_LOADS\_PS, should tag to the next executed instruction after the one causing the split. If the split store ratio is greater than .01 at any source address, it is worth investigating.
  - To fix these issues, ensure your data is aligned. Especially watch out for mis-aligned 256-bit AVX store operations.
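Whether a particular access is split across cache lines is simple address arithmetic. The helper below is a sketch that assumes a 64-byte cache line; the name is hypothetical. A 256-bit AVX store is 32 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Returns non-zero if an access of 'size' bytes starting at 'addr'
   touches two cache lines, i.e. is a split load or store. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}
```

A 32-byte store at a 32-byte-aligned address never splits, while the same store at offset 0x30 within a line does.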



### Other Data Access Issues: 4K Aliasing

- Why: Aliasing conflicts result in having to re-issue loads.
- How: General Exploration profile, Metric: 4K Aliasing
- What Now:
  - If this metric is highlighted for your hotspot, investigate at the sourcecode level.
  - Fix these issues by changing the alignment of the load. Try aligning data to 32 bytes, changing offsets between input and output buffers (if possible), or using 16-Byte memory accesses on memory that is not 32-Byte aligned.



# 11.8 **4K ALIASING**

4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that it loads from a different memory location with a 4-KByte offset between them. For example, a load to linear address 0x400020 follows a store to linear address 0x401020.

The load and store have the same value for bits 5 - 11 of their addresses and the accessed byte offsets should have partial or complete overlap.

4K aliasing may have a five-cycle penalty on the load latency. This penalty may be significant when 4K aliasing happens repeatedly and the loads are on the critical path. If the load spans two cache lines it might be delayed until the conflicting store is committed to the cache. Therefore 4K aliasing that happens on repeated unaligned Intel AVX loads incurs a higher performance penalty.

To detect 4K aliasing, use the LD\_BLOCKS\_PARTIAL.ADDRESS\_ALIAS event that counts the number of times Intel AVX loads were blocked due to 4K aliasing.

To resolve 4K aliasing, try the following methods in the following order:

- Align data to 32 Bytes.
- Change offsets between input and output buffers if possible.
- Use 16-Byte memory accesses on memory which is not 32-Byte aligned.
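A candidate store/load pair can be screened for 4K aliasing by comparing the offsets of the two accesses within their 4-KByte pages. This is a simplified model of the description above: it ignores accesses that wrap past the end of a page, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if a load may suffer 4K aliasing with an earlier store:
   the two accesses are in different 4-KByte pages, but the byte ranges they
   touch overlap once only the low 12 address bits are considered. */
static int may_4k_alias(uintptr_t store_addr, size_t store_size,
                        uintptr_t load_addr, size_t load_size)
{
    uintptr_t store_off = store_addr & 0xFFF;
    uintptr_t load_off = load_addr & 0xFFF;

    assert(store_size > 0 && load_size > 0);

    /* Same page: any overlap is a real dependency, not aliasing. */
    if ((store_addr >> 12) == (load_addr >> 12))
        return 0;
    return store_off < load_off + load_size && load_off < store_off + store_size;
}
```

The example from the text, a load from 0x400020 following a store to 0x401020, is flagged by this check.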



### Other Data Access Issues: DTLB Misses

**Why:** First-level DTLB Load misses (Hits in the STLB) incur a latency penalty. Second-level misses require a page walk that can affect your application's performance.

**How:** General Exploration profile, Metric: *DTLB Overhead* 

- If this metric is highlighted for your hotspot, investigate at the source-code level.
- To fix these issues:
  - Target data locality to TLB size
  - Use Extended Page Tables (EPT) on virtualized systems
  - Try large pages (database/server apps only)
  - Increase data locality by using better memory allocation or Profile-Guided Optimization







## **Allocation Stalls**

**Why:** Certain types of instructions can cause allocation stalls because they take longer to retire. These increase latencies overall.

**How:** General Exploration Profile, Metric: *LEA Stalls, Flags Merge Stalls* 

- If this metric is highlighted for your hotspot, investigate at the source-code level.
- Try to eliminate uses of 3-operand LEA instructions. Look for certain uses of an LEA instruction (see section 3.5.1.3 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual) or partial register use (see section 3.5.2.4 of the same manual) and fix them.



# **Microcode Assists**

**Why:** Assists from the microcode sequencer can have long latency penalties.

How: General Exploration Profile, Metric: Assists

- If this metric is highlighted for your hotspot, re-sample using the additional assist events to determine the cause.
- If FP\_ASSISTS.ANY / INST\_RETIRED.ANY is significant, check for denormals. To fix, enable FTZ and/or DAZ if using SSE/AVX instructions, or scale your results up or down, depending on the problem.
- If ((OTHER\_ASSISTS.AVX\_TO\_SSE\_NP\*75) / CPU\_CLK\_UNHALTED.THREAD) or ((OTHER\_ASSISTS.SSE\_TO\_AVX\_NP\*75) / CPU\_CLK\_UNHALTED.THREAD) is greater than 0.1, reduce transitions between SSE and AVX code.
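
The two ratios above are simple arithmetic on raw event counts, and can be sketched as follows. The function names are hypothetical, and the counts are assumed to have been exported from a profiling run; the weight of 75 and the 0.1 threshold are taken directly from the formulas above.

```c
#include <stdint.h>

/* FP_ASSISTS.ANY / INST_RETIRED.ANY: share of retired instructions
   that needed a floating-point assist. Returns 0 if no instructions
   were counted. */
double fp_assist_ratio(uint64_t fp_assists_any, uint64_t inst_retired_any)
{
    if (inst_retired_any == 0) {
        return 0.0;
    }
    return (double)fp_assists_any / (double)inst_retired_any;
}

/* (transitions * 75) / CPU_CLK_UNHALTED.THREAD, where transitions is
   either OTHER_ASSISTS.AVX_TO_SSE_NP or OTHER_ASSISTS.SSE_TO_AVX_NP. */
double transition_cost(uint64_t transitions, uint64_t clk_unhalted_thread)
{
    if (clk_unhalted_thread == 0) {
        return 0.0;
    }
    return ((double)transitions * 75.0) / (double)clk_unhalted_thread;
}

/* Return 1 if the transition cost exceeds the 0.1 threshold. */
int transitions_need_attention(uint64_t transitions,
                               uint64_t clk_unhalted_thread)
{
    return transition_cost(transitions, clk_unhalted_thread) > 0.1;
}
```

For example, 2,000 transitions in 1,000,000 unhalted cycles give a cost of 0.15, which is above the threshold, while 1,000 transitions give 0.075, which is not.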





# **Branch Mispredicts**

**Why:** Mispredicted branches cause pipeline inefficiencies due to wasted work or instruction starvation (while waiting for new instructions to be fetched).

**How:** General Exploration Profile, Metric: *Branch Mispredict* 

- If this metric is highlighted for your hotspot, try to reduce misprediction impact:
  - Use compiler options or profile-guided optimization (PGO) to improve code generation.
  - Apply hand-tuning, for example by hoisting the most popular targets in branch statements.
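
Hoisting the most popular target can be sketched as below. The message kinds and cost values are hypothetical, and the example assumes profiling has shown `MSG_DATA` to be by far the most frequent case. Both functions return the same results; only the order of the tests differs. An optimizing compiler, especially with PGO, may perform this reordering itself, so treat it as an illustration of the idea rather than a guaranteed win.

```c
/* Hypothetical message kinds. Assumption: MSG_DATA dominates at run time. */
enum msg_kind { MSG_ERROR, MSG_CONTROL, MSG_DATA };

/* Before: the rare cases are tested first, so the common case has to
   fall through two comparisons before it is handled. */
int cost_before(enum msg_kind k)
{
    if (k == MSG_ERROR) {
        return 100;
    }
    if (k == MSG_CONTROL) {
        return 10;
    }
    return 1; /* MSG_DATA */
}

/* After: the most popular target is hoisted to the first test, so the
   hot path is decided by a single comparison. */
int cost_after(enum msg_kind k)
{
    if (k == MSG_DATA) {
        return 1;
    }
    if (k == MSG_CONTROL) {
        return 10;
    }
    return 100; /* MSG_ERROR */
}
```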





#### Cancelled

### **Machine Clears**

**Why:** Machine clears cause the pipeline to be flushed and the store buffers emptied, resulting in a significant latency penalty.

**How:** General Exploration Profile, Metric: *Machine Clears* 

### Now What:

- If this metric is highlighted for your hotspot, try to determine the cause using the specific events:
  - If MACHINE\_CLEARS.MEMORY\_ORDERING is significant, investigate at the source-code level. This could be caused by 4K aliasing conflicts or contention on a lock (both covered earlier).
  - If MACHINE\_CLEARS.SMC is significant, the clears are being caused by self-modifying code, which should be avoided.



## **Front-end Stalls**

**Why:** Front-end stalls (at the Issue stage of the pipeline) can cause instruction starvation, which may lead to stalls at the execute stage in the pipeline.

**How:** General Exploration profile, Metric: *Front-end Bound Pipeline Slots* 

- If this metric is highlighted for your hotspot, try using better code layout and generation techniques:
  - Try using profile-guided optimizations (PGO) with your compiler
  - Use linker ordering techniques (/ORDER with Microsoft's linker, or a linker script with GCC)
  - Use switches that reduce code size, such as /O1 or /Os





## Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



