Hello there

My current technology stack: .NET 9, Python, TypeScript, and Azure.

I develop microservices and Terraform infrastructure of various sizes, and I share my challenges and key learnings here.

About

The views expressed in this blog are my own and do not reflect my employer's. I am not responsible for any consequences of using the information provided. This blog is for educational purposes only, not for commercial use. Readers should apply their own judgment.

Beyond Benchmarks: Measuring .NET 9 Performance in Real-World Applications

March 16, 2025 · Dipankar Haldar


Introduction

Performance in software development is often measured using traditional benchmarks like throughput, latency, and memory usage. However, these metrics don't always reflect real-world performance issues. In .NET 9, Microsoft has introduced several performance optimizations, but how do I truly measure their impact beyond standard benchmarks?

This article takes a fresh perspective on measuring .NET 9 performance—focusing on practical metrics that developers rarely explore. Here I dive into code locality, branch prediction efficiency, cache friendliness, and micro-architectural considerations that go beyond just "faster loops" and "fewer memory allocations."


1. The Hidden Costs of Code Locality in .NET 9

Modern processors execute instructions much faster than they can fetch them from memory. Code locality—how well-related pieces of code fit into the CPU's instruction cache (L1, L2, L3)—plays a critical role in performance.

How to Measure Code Locality in .NET 9?
  • Use Event Tracing for Windows (ETW) and PerfView to analyze instruction cache misses.
  • Check that tiered compilation and dynamic PGO are enabled (both are on by default in .NET 8+, and can be toggled with environment variables):
    DOTNET_TieredCompilation=1
    DOTNET_TieredPGO=1

    These settings control how aggressively hot code is re-optimized at runtime; the JitInfo sketch after this list shows one way to observe that work.
  • Profile with Intel VTune to examine instruction fetch latency.
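
As a lightweight, in-process complement to ETW traces, System.Runtime.JitInfo (available since .NET 7) reports how much JIT work has happened so far; sampling it before and after a warm-up loop gives a rough feel for how much tiered recompilation a hot path triggers. A minimal sketch, where HotPath is a placeholder for your own workload:

using System;
using System.Runtime;

static class JitActivityProbe
{
    static void Main()
    {
        // Snapshot JIT activity before exercising the hot path.
        long methodsBefore = JitInfo.GetCompiledMethodCount();
        TimeSpan jitTimeBefore = JitInfo.GetCompilationTime();

        // Warm up the workload so tiered compilation can promote hot methods.
        long sink = 0;
        for (int i = 0; i < 200_000; i++)
            sink += HotPath(i);

        long methodsAfter = JitInfo.GetCompiledMethodCount();
        TimeSpan jitTimeAfter = JitInfo.GetCompilationTime();

        Console.WriteLine($"Methods jitted during warm-up: {methodsAfter - methodsBefore}");
        Console.WriteLine($"JIT time during warm-up: {(jitTimeAfter - jitTimeBefore).TotalMilliseconds:F2} ms (sink={sink})");
    }

    // Placeholder workload; substitute the code you actually want to observe.
    static long HotPath(int x) => (x * 31L) ^ (x >> 3);
}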
Real-World Example: Optimizing Large Switch Statements

Large switch statements can suffer from poor branch prediction and code locality. Replacing a long chain of cases with table-driven dispatch, and letting .NET 9's Dynamic PGO (Profile-Guided Optimization) reorder the remaining hot paths based on profile data, can noticeably improve performance.

static void ProcessInput(int option)
{
    switch (option)
    {
        case 1: DoTaskA(); break;
        case 2: DoTaskB(); break;
        case 3: DoTaskC(); break;
        default: DoDefaultTask(); break;
    }
}

Optimized with a delegate-based jump table:

// Index 0 handles the old default case; indexes 1-3 map to cases 1-3 of the original switch.
static readonly Action[] tasks = { DoDefaultTask, DoTaskA, DoTaskB, DoTaskC };
static void ProcessInputOptimized(int option)
{
    // A single bounds check plus an array index replaces the chain of compares and jumps.
    (option >= 0 && option < tasks.Length ? tasks[option] : DoDefaultTask)();
}

📊 Performance Gain: This reduces branch mispredictions and enhances code locality by leveraging array indexing rather than a chain of conditional jumps.


2. Unlocking the Power of Hardware Branch Prediction

In high-performance .NET applications, one of the least explored performance bottlenecks is branch misprediction. The CPU guesses the outcome of a branch before execution; if it guesses wrong, it wastes cycles.

How to Measure Branch Prediction in .NET 9?
  • Use perf on Linux or Windows Performance Recorder (WPR) to track branch mispredictions per 1000 instructions (BenchmarkDotNet's hardware-counter diagnoser, sketched after this list, reports the same counter per benchmark operation).
  • Analyze Branch Target Buffer (BTB) statistics in tools like Intel VTune.
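
If you already use BenchmarkDotNet, its hardware-counter diagnoser can attribute branch mispredictions to individual benchmarks. A minimal sketch, assuming the BenchmarkDotNet.Diagnostics.Windows package is installed and the process runs elevated on Windows; counter availability varies by CPU, and the JIT may turn trivial branches into conditional moves, so treat this as a template rather than a guaranteed demonstration:

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]
public class BranchPredictionBench
{
    private int[] _numbers = null!;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _numbers = new int[100_000];
        for (int i = 0; i < _numbers.Length; i++)
            _numbers[i] = rng.Next();
    }

    [Benchmark(Baseline = true)]
    public int CountEvens_Branchy()
    {
        int count = 0;
        foreach (var n in _numbers)
        {
            if (n % 2 == 0)   // data-dependent branch: hard to predict on random input
                count++;
        }
        return count;
    }

    [Benchmark]
    public int CountEvens_Branchless()
    {
        int count = 0;
        foreach (var n in _numbers)
            count += ~n & 1;  // 1 for even, 0 for odd, with no conditional jump
        return count;
    }
}

// Run with: BenchmarkDotNet.Running.BenchmarkRunner.Run<BranchPredictionBench>();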
Fixing Misprediction in .NET 9

Consider this example:

bool ShouldProcess(int x)
{
    return x % 2 == 0;
}

void ProcessNumbers(int[] numbers)
{
    foreach (var num in numbers)
    {
        if (ShouldProcess(num))
        {
            DoWork(num);
        }
    }
}

Optimized for Branch Prediction:

void ProcessNumbersOptimized(int[] numbers)
{
    // NOTE: stackalloc is only safe for small inputs; for larger arrays,
    // rent a buffer from ArrayPool<int>.Shared instead.
    Span<int> evenNumbers = stackalloc int[numbers.Length];
    int count = 0;

    // First pass: collect the matching elements up front...
    foreach (var num in numbers)
    {
        if ((num & 1) == 0)
            evenNumbers[count++] = num;
    }

    // ...so the work loop below runs as a tight, perfectly predictable loop.
    for (int i = 0; i < count; i++)
    {
        DoWork(evenNumbers[i]);
    }
}

📊 Performance Gain: Separating filtering from processing keeps the hot work loop free of data-dependent branches and improves pipeline efficiency.


3. Cache Line Awareness: The Performance Killer No One Talks About

Many .NET performance problems are not CPU-bound but rather cache-bound. When data isn't in the L1/L2 cache, your code stalls waiting for memory fetches.

How to Measure Cache Efficiency in .NET 9?
  • Use Intel PCM or AMD uProf to measure cache line evictions.
  • Try .NET 9 Native AOT (Ahead-Of-Time compilation), which fixes the code layout up front and removes JIT warm-up effects from your measurements.
  • Use Cachegrind (Valgrind) to simulate how the application interacts with the cache hierarchy. A simple traversal-order experiment (see the sketch after this list) also makes cache effects visible.
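
Even without special tooling, cache-line effects are easy to observe: walking a 2D array row by row touches memory sequentially, while walking it column by column jumps a full row ahead on every access and keeps missing the cache. A minimal sketch; the matrix size is arbitrary and a Stopwatch is only a rough gauge, so use a proper benchmark harness for publishable numbers:

using System;
using System.Diagnostics;

static class CacheTraversalDemo
{
    const int N = 4096;

    static void Main()
    {
        var matrix = new int[N, N];

        Console.WriteLine($"Row-major sum:    {Time(SumRowMajor, matrix)} ms (sequential, cache-friendly)");
        Console.WriteLine($"Column-major sum: {Time(SumColumnMajor, matrix)} ms (strided, cache-hostile)");
    }

    static long SumRowMajor(int[,] m)
    {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += m[row, col];   // consecutive addresses: each cache line is fully used
        return sum;
    }

    static long SumColumnMajor(int[,] m)
    {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += m[row, col];   // stride of N*4 bytes: a new cache line on almost every access
        return sum;
    }

    static long Time(Func<int[,], long> body, int[,] m)
    {
        body(m);                      // warm-up pass
        var sw = Stopwatch.StartNew();
        body(m);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}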
Real-World Example: Optimizing Data Structures

Consider this structure:

struct MyStruct
{
    public byte A;
    public int B;
    public byte C;
    public int D;
}

With the default sequential layout, the runtime pads each byte field so the following int stays 4-byte aligned: the struct occupies 16 bytes even though its fields only need 10. In a large array, that padding wastes cache-line space on every element.

Optimized using Struct Packing

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct MyPackedStruct
{
    public byte A; public int B;
    public byte C; public int D;
}

📊 Performance Gain: Packing removes the padding, so more elements fit into each cache line and scans over large arrays of these structs incur fewer cache misses. The trade-off is potentially slower unaligned access on some platforms, so measure before committing to it.
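
To confirm the difference on your runtime, compare the managed sizes directly. A small sketch; the expected numbers assume the mixed-size fields above and that the runtime honors the Pack setting for these blittable structs:

using System;
using System.Runtime.CompilerServices;

static class StructSizeCheck
{
    static void Main()
    {
        // Expected: 16 bytes once alignment padding is added after each byte field.
        Console.WriteLine($"MyStruct:       {Unsafe.SizeOf<MyStruct>()} bytes");

        // Expected: 10 bytes with Pack = 1, since the padding is removed.
        Console.WriteLine($"MyPackedStruct: {Unsafe.SizeOf<MyPackedStruct>()} bytes");
    }
}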


4. Quantifying the Impact of Async Overhead

While async programming is great for scalability, it is easy to overlook the hidden costs of allocations and context switching.

How to Measure Async Overhead in .NET 9?
  • Use dotnet-trace to measure thread switching and context delays.
  • Profile thread pool contention with Concurrency Visualizer. To put a number on the allocation side of the overhead per call, see the sketch after this list.
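
BenchmarkDotNet's built-in memory diagnoser is a simple way to quantify the allocation cost of async methods. A minimal sketch; the two benchmark bodies are deliberately trivial placeholders, and the interesting output is the Allocated column:

using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class AsyncOverheadBench
{
    [Benchmark(Baseline = true)]
    public async Task AlwaysAwaits()
    {
        // Always yields, so the async state machine is boxed and allocated on every call.
        await Task.Yield();
    }

    [Benchmark]
    public ValueTask CompletesSynchronously()
    {
        // Completes without ever awaiting, so nothing is allocated on this path.
        return ValueTask.CompletedTask;
    }
}

// Run with: BenchmarkDotNet.Running.BenchmarkRunner.Run<AsyncOverheadBench>();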
Reducing Async Overhead with Task Pooling

Instead of allocating new tasks every time:

async Task ProcessAsync()
{
    await Task.Delay(100); // allocates a delay Task plus the boxed async state machine on every call
}

Use ValueTask and ConfigureAwait(false):

async ValueTask ProcessOptimizedAsync()
{
    // ValueTask avoids an extra allocation when the method can complete synchronously;
    // ConfigureAwait(false) skips capturing and resuming on the synchronization context.
    await Task.Delay(100).ConfigureAwait(false);
}

📊 Performance Gain: ValueTask cuts allocations on paths that complete synchronously, and ConfigureAwait(false) avoids resuming on a captured context; the sketch below shows where ValueTask pays off most.
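
The ValueTask win is largest when a method usually completes synchronously, for example on a cache hit. A minimal sketch; the cache field, symbol parameter, and loader method are hypothetical names used only for illustration:

using System.Collections.Concurrent;
using System.Threading.Tasks;

public class PriceService
{
    private readonly ConcurrentDictionary<string, decimal> _cache = new();

    public ValueTask<decimal> GetPriceAsync(string symbol)
    {
        // Hot path: a cache hit completes synchronously and allocates no Task at all.
        if (_cache.TryGetValue(symbol, out var price))
            return new ValueTask<decimal>(price);

        // Cold path: fall back to a real async operation and cache the result.
        return new ValueTask<decimal>(LoadAndCacheAsync(symbol));
    }

    private async Task<decimal> LoadAndCacheAsync(string symbol)
    {
        await Task.Delay(100).ConfigureAwait(false); // placeholder for a real I/O call
        var price = 42m;                             // placeholder value
        _cache[symbol] = price;
        return price;
    }
}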


Conclusion: The Future of Performance Optimization in .NET 9

Measuring performance in .NET 9 should go beyond simple microbenchmarks. Focusing on:

  • Code locality
  • Branch prediction efficiency
  • Cache alignment
  • Async overhead reduction

gives a far more realistic picture of how an application behaves in production. Finally:
  1. Use PerfView, VTune, and dotnet-trace to analyze real-world performance bottlenecks.
  2. Optimize branch prediction and code locality in performance-critical areas.
  3. Experiment with .NET 9 Profile-Guided Optimizations (PGO).
  4. Measure cache friendliness in high-throughput applications.

💡 Performance is not just about speed—it's about efficiency.