dotnet · EgorBo · Oct 26, 2022 · Jun 23, 2022 · Jun 23, 2022 · Jun 24, 2022
diff --git a/docs/design/features/DynamicPgo-InstrumentedTiers-Plaintext-opt.png b/docs/design/features/DynamicPgo-InstrumentedTiers-Plaintext-opt.png
diff --git a/docs/design/features/DynamicPgo-InstrumentedTiers-Plaintext.png b/docs/design/features/DynamicPgo-InstrumentedTiers-Plaintext.png
diff --git a/docs/design/features/DynamicPgo-InstrumentedTiers.md b/docs/design/features/DynamicPgo-InstrumentedTiers.md
@@ -0,0 +1,79 @@
+# Instrumented Tiers
+
+_Disclaimer: the functionality described in this doc is still in the preview stage and is not enabled by default even for `DOTNET_TieredPGO=1`._
+
+[#70941](https://github.com/dotnet/runtime/pull/70941) introduced new opt-in strategies for Tiered Compilation + TieredPGO mainly to address
+two existing limitations of the current design:
+1) R2R code never benefits from Dynamic PGO as it's not instrumented and is promoted straight to Tier1 when it's hot
+2) Instrumentation in Tier0 comes with a big overhead and it's better to only instrument hot Tier0 code (whether it's ILOnly or R2R)
+
+A good example explaining both problems is this TechEmpower benchmark (plaintext-plaintext):
+
+![Plaintext](DynamicPgo-InstrumentedTiers-Plaintext.png)
+
+Legend:
+* Red    - `DOTNET_TieredPGO=0`, `DOTNET_ReadyToRun=1` (default)
+* Black  - `DOTNET_TieredPGO=1`, `DOTNET_ReadyToRun=1`
+* Yellow - `DOTNET_TieredPGO=1`, `DOTNET_ReadyToRun=0`
+
+The yellow line provides the highest level of performance (RPS) by sacrificing startup speed (and, hence, the time it takes to process the first request). This happens because the benchmark is quite simple and most of its code is already prejitted, so we can only instrument it when we completely drop R2R and compile everything from scratch. It also explains why the black line (Dynamic PGO enabled but still relying on R2R) didn't show much improvement. With the separate instrumentation tier for hot R2R code we achieve "Yellow"-level performance while maintaining the same startup speed as before. Also, for the mode where we have to compile a lot of code to Tier0, switching to the "instrument only hot Tier0 code" strategy shows a ~8% time-to-first-request reduction across all TE benchmarks.
+
+![Plaintext](DynamicPgo-InstrumentedTiers-Plaintext-opt.png)
+(_predicted results according to local runs of crank with custom binaries_)
+
+# Tiered compilation workflow in TieredPGO mode
+
+The following diagram explains how the instrumentation for hot R2R code works under the hood when TieredPGO is enabled (it's disabled by default):
+
+```mermaid
+flowchart
+    prestub(.NET Function) -->|Compilation| hasAO{"Marked with<br/>[AggressiveOpts]?"}
+    hasAO-->|Yes|tier1ao["JIT to <b><ins>Tier1</ins></b><br/><br/>(that attribute is extremely<br/> rarely a good idea)"]
+    hasAO-->|No|hasR2R
+    hasR2R{"Is prejitted (R2R)<br/>and ReadyToRun==1?"} -->|No| istrTier0Q
+
+    istrTier0Q{"<b>TieredPGO_Strategy:</b><br/>Instrument only<br/>hot Tier0 code?"}
+    istrTier0Q-->|No, always instrument tier0|tier0
+    istrTier0Q-->|Yes, only hot|tier000
+    tier000["JIT to <b><ins>Tier0</ins></b><br/><br/>(not optimized, not instrumented,<br/> with patchpoints)"]-->|Running...|ishot555
+    ishot555{"Is hot?<br/>(called >30 times)"}
+    ishot555-.->|No,<br/>keep running...|ishot555
+    ishot555-->|Yes|tier0
+
+    hasR2R -->|Yes| R2R
+    R2R["Use <b><ins>R2R</ins></b> code<br/><br/>(optimized, not instrumented,<br/>with patchpoints)"] -->|Running...|ishot1
+    ishot1{"Is hot?<br/>(called >30 times)"}-.->|No,<br/>keep running...|ishot1
+    ishot1--->|"Yes"|instrumentR2R
+
+    instrumentR2R{"<b>TieredPGO_Strategy:</b><br/>Instrument hot<br/>R2R'd code?"}
+    instrumentR2R-->|Yes, instrument R2R'd code|istier1inst
+    instrumentR2R-->|No, don't instrument R2R'd code|tier1nopgo["JIT to <b><ins>Tier1</ins></b><br/><br/>(no dynamic profile data)"]
+
+    tier0["JIT to <b><ins>InstrumentedTier</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]-->|Running...|ishot5
+    tier1pgo2["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
+    tier1pgo2_1["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
+
+    istier1inst{"<b>TieredPGO_Strategy:</b><br/>Enable optimizations<br/>for InstrumentedTier?"}-->|"No"|tier0_1
+    istier1inst--->|"Yes"|tier1inst["JIT to <b><ins>InstrumentedTierOptimized</ins></b><br/><br/>(optimized, instrumented, <br/>with patchpoints)"]
+    tier1inst-->|Running...|ishot5_1
+    ishot5{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2
+    ishot5-.->|No,<br/>keep running...|ishot5
+
+
+    ishot5_1{"Is hot?<br/>(called >30 times)"}
+    ishot5_1-.->|No,<br/>keep running...|ishot5_1
+    ishot5_1{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2_1
+
+    tier0_1["JIT to <b><ins>InstrumentedTier</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]
+    tier0_1-->|Running...|ishot5_1
+```
+(_VSCode doesn't support mermaid diagrams, consider installing external add-ins_)
+
+## Pros & cons of using optimizations inside the instrumented tiers
+
+Pros:
+* Lower overhead from instrumentation (and thanks to optimizations we _can_ optimize probes and emit fewer of them)
+* Optimized code is able to inline methods so we won't be producing new compilation units for even small methods
+
+Cons:
+* Currently, we won't instrument inlinees -> we'll probably miss a lot of opportunities and produce a less accurate profile, leading to a less optimized final tier
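
The flowchart in the new doc can be read as a small state machine. The sketch below is only an illustration of that diagram, not the runtime's implementation: `Tier`, `Strategy`, `InitialTier`, and `NextTier` are hypothetical names, and the "is hot" check (more than 30 calls in the diagram) is assumed to have already fired when `NextTier` is called.

```cpp
#include <cassert>

// Hypothetical names for illustration; the runtime's real types differ.
enum class Tier { R2R, Tier0, Instrumented, InstrumentedOptimized, Tier1 };

// The three yes/no questions asked by the TieredPGO_Strategy nodes in the diagram.
struct Strategy
{
    bool onlyHotTier0;       // instrument only hot Tier0 code?
    bool instrumentHotR2R;   // instrument hot R2R'd code?
    bool optimizedInstrTier; // enable optimizations for the instrumented tier?
};

// Tier a method starts in when it is first compiled.
Tier InitialTier(bool aggressiveOpts, bool usableR2R, const Strategy& s)
{
    if (aggressiveOpts) return Tier::Tier1; // [AggressiveOpts] goes straight to Tier1
    if (usableR2R) return Tier::R2R;        // prejitted code is used as-is
    return s.onlyHotTier0 ? Tier::Tier0 : Tier::Instrumented;
}

// Tier a method is promoted to once it has been found hot.
Tier NextTier(Tier current, const Strategy& s)
{
    switch (current)
    {
        case Tier::R2R:
            if (!s.instrumentHotR2R) return Tier::Tier1; // Tier1 without dynamic profile data
            return s.optimizedInstrTier ? Tier::InstrumentedOptimized : Tier::Instrumented;
        case Tier::Tier0:
            return Tier::Instrumented; // hot uninstrumented Tier0 gets instrumented next
        case Tier::Instrumented:
        case Tier::InstrumentedOptimized:
        case Tier::Tier1:
            return Tier::Tier1; // profile collected (or already final)
    }
    assert(!"unknown tier");
    return Tier::Tier1;
}
```

Under these assumptions, a hot R2R method with both R2R instrumentation and optimized instrumentation enabled goes R2R, then InstrumentedOptimized, then Tier1, which matches the right-hand branch of the diagram.
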
diff --git a/docs/design/features/DynamicPgo.md b/docs/design/features/DynamicPgo.md
@@ -257,9 +257,9 @@ If we confidently could identify the top N% of methods (say 5%) then one could i
 R2R methods bypass Tier0 and so don't get instrumentation in the current TieredPGO prototype. We probably don't want to instrument the code in the R2R image. And many of these R2R methods are key framework methods that are important for performance. So we need to find a way to get data for these methods.
 
 There are a few basic ideas:
-* Leverage IBC. If there is IBC data in the R2R image then we can make that data available to the JIT. It may not be as relevant as in-process collected data, but it's quite likely better than synthetic data or no data.
-* Sampled instrumentation for R2R methods. Produce an instrumented version and run it every so often before the method gets promoted to Tier1. This may be costly, especially if we have to use unoptimized methods for instrumentation, as we'll do quite a bit of extra jitting.
-* Make R2R methods go through Tier0 on their way to Tier1. Likely introduces an unacceptable perf hit.
+1) Leverage IBC. If there is IBC data in the R2R image then we can make that data available to the JIT. It may not be as relevant as in-process collected data, but it's quite likely better than synthetic data or no data.
+2) Sampled instrumentation for R2R methods. Produce an instrumented version and run it every so often before the method gets promoted to Tier1. This may be costly, especially if we have to use unoptimized methods for instrumentation, as we'll do quite a bit of extra jitting.
+3) Make R2R methods go through a separate instrumentation tier on their way to Tier1, see [DynamicPgo-InstrumentedTiers.md](DynamicPgo-InstrumentedTiers.md) prototype.
 
 #### Dynamic PGO, QuickJitForLoops, OSR
 

diff --git a/src/coreclr/debug/daccess/request.cpp b/src/coreclr/debug/daccess/request.cpp
@@ -1183,6 +1183,12 @@ HRESULT ClrDataAccess::GetTieredVersions(
                 case NativeCodeVersion::OptimizationTierOptimized:
                     nativeCodeAddrs[count].OptimizationTier = DacpTieredVersionData::OptimizationTier_Optimized;
                     break;
+                case NativeCodeVersion::OptimizationTierInstrumented:
+                    nativeCodeAddrs[count].OptimizationTier = DacpTieredVersionData::OptimizationTier_InstrumentedTier;
+                    break;
+                case NativeCodeVersion::OptimizationTierInstrumentedOptimized:
+                    nativeCodeAddrs[count].OptimizationTier = DacpTieredVersionData::OptimizationTier_InstrumentedTierOptimized;
+                    break;
                 }
             }
             else if (pMD->IsJitOptimizationDisabled())

diff --git a/src/coreclr/inc/clrconfigvalues.h b/src/coreclr/inc/clrconfigvalues.h
@@ -612,6 +612,26 @@ RETAIL_CONFIG_STRING_INFO(INTERNAL_PGODataPath, W("PGODataPath"), "Read/Write PG
 RETAIL_CONFIG_DWORD_INFO(INTERNAL_ReadPGOData, W("ReadPGOData"), 0, "Read PGO data")
 RETAIL_CONFIG_DWORD_INFO(INTERNAL_WritePGOData, W("WritePGOData"), 0, "Write PGO data")
 RETAIL_CONFIG_DWORD_INFO(EXTERNAL_TieredPGO, W("TieredPGO"), 0, "Instrument Tier0 code and make counts available to Tier1")
+
+// TieredPGO_Strategy values:
+//
+// 0) Instrument any non-prejitted code
+// 1) Instrument any non-prejitted code and only hot R2R code
+// 2) Instrument any non-prejitted code and only hot R2R code (use optimizations in the instrumented tier for hot R2R)
+// 3) Instrument only hot non-prejitted code and only hot R2R code
+// 4) Instrument only hot non-prejitted code and only hot R2R code (use optimizations in the instrumented tier for hot R2R)
+//
+//
+// Pros & cons of using optimizations inside the instrumented tiers (mode '2' and '4')
+// Pros:
+//   * Lower overhead from instrumentation (and thanks to optimizations we _can_ optimize probes and emit fewer of them)
+//   * Optimized code is able to inline methods so we won't be producing new compilation units for even small methods
+//
+// Cons:
+//   * Currently, we won't instrument inlinees -> we'll probably miss a lot of opportunities and produce a less accurate profile
+//     leading to a less optimized final tier
+//
+RETAIL_CONFIG_DWORD_INFO(UNSUPPORTED_TieredPGO_Strategy, W("TieredPGO_Strategy"), 0, "Strategy for TieredPGO, see comments in clrconfigvalues.h")
 #endif
 
 ///
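
The five `TieredPGO_Strategy` values listed in the comment above can be thought of as three independent switches. The following is a minimal sketch of that reading, assuming the comment is the full specification; `PgoStrategy`, `IsValidStrategy`, and `DecodeStrategy` are hypothetical helpers and do not exist in the runtime.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded form of the TieredPGO_Strategy knob.
struct PgoStrategy
{
    bool instrumentOnlyHotTier0;  // modes 3 and 4
    bool instrumentHotR2R;        // modes 1 through 4
    bool optimizeInstrumentedR2R; // modes 2 and 4
};

// Only the values 0..4 are described in clrconfigvalues.h.
constexpr bool IsValidStrategy(uint32_t value)
{
    return value <= 4;
}

// Maps the raw config value onto the three switches described in the comment.
PgoStrategy DecodeStrategy(uint32_t value)
{
    assert(IsValidStrategy(value));
    return PgoStrategy{
        value >= 3,               // instrument only hot non-prejitted code
        value >= 1,               // instrument hot R2R code
        value == 2 || value == 4  // use optimizations in the instrumented tier for hot R2R
    };
}
```

With this decoding, the default value `0` leaves all three switches off, which corresponds to the existing behavior of instrumenting any non-prejitted code and never instrumenting R2R code.
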

diff --git a/src/coreclr/inc/dacprivate.h b/src/coreclr/inc/dacprivate.h
@@ -610,6 +610,8 @@ struct MSLAYOUT DacpTieredVersionData
         OptimizationTier_OptimizedTier1,
         OptimizationTier_ReadyToRun,
         OptimizationTier_OptimizedTier1OSR,
+        OptimizationTier_InstrumentedTier,
+        OptimizationTier_InstrumentedTierOptimized,
     };
 
     CLRDATA_ADDRESS NativeCodeAddr;

diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
@@ -9097,6 +9097,16 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
         }
 #endif
 
+        bool IsInstrumented() const
+        {
+            return jitFlags->IsSet(JitFlags::JIT_FLAG_BBINSTR);
+        }
+
+        bool IsInstrumentedOptimized() const
+        {
+            return IsInstrumented() && jitFlags->IsSet(JitFlags::JIT_FLAG_TIER1);
+        }
+
         // true if we should use the PINVOKE_{BEGIN,END} helpers instead of generating
         // PInvoke transitions inline. Normally used by R2R, but also used when generating a reverse pinvoke frame, as
         // the current logic for frame setup initializes and pushes
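
The two new predicates combine existing JIT flags rather than introducing a new one. A self-contained mock of that idea is sketched below; `MockJitFlags` and the bit positions are made up for illustration and do not match the real `JitFlags` layout in `jitee.h`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit indices; the real JitFlags enum uses different values.
enum MockJitFlag : unsigned
{
    MOCK_FLAG_BBINSTR = 0, // collect basic block profile data
    MOCK_FLAG_TIER0   = 1,
    MOCK_FLAG_TIER1   = 2,
};

class MockJitFlags
{
public:
    void Set(MockJitFlag flag) { m_bits |= (1ULL << flag); }
    bool IsSet(MockJitFlag flag) const { return (m_bits & (1ULL << flag)) != 0; }
    uint64_t GetRawFlags() const { return m_bits; } // same idea as the new GetRawFlags()

private:
    uint64_t m_bits = 0;
};

// Instrumented: the compilation was asked to emit profile probes.
bool IsInstrumented(const MockJitFlags& flags)
{
    return flags.IsSet(MOCK_FLAG_BBINSTR);
}

// Instrumented and optimized: probes are requested for a Tier1 compilation.
bool IsInstrumentedOptimized(const MockJitFlags& flags)
{
    return IsInstrumented(flags) && flags.IsSet(MOCK_FLAG_TIER1);
}

// Small usage example: Tier1 alone is not "instrumented optimized".
void ExampleUsage()
{
    MockJitFlags flags;
    flags.Set(MOCK_FLAG_TIER1);
    assert(!IsInstrumentedOptimized(flags));
    flags.Set(MOCK_FLAG_BBINSTR);
    assert(IsInstrumentedOptimized(flags));
}
```
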

diff --git a/src/coreclr/jit/fgprofile.cpp b/src/coreclr/jit/fgprofile.cpp
@@ -383,7 +383,7 @@ void BlockCountInstrumentor::Prepare(bool preImport)
     //
     // If we see any, we need to adjust our instrumentation pattern.
     //
-    if (m_comp->opts.IsOSR() && ((m_comp->optMethodFlags & OMF_HAS_TAILCALL_SUCCESSOR) != 0))
+    if (m_comp->opts.IsInstrumentedOptimized() && ((m_comp->optMethodFlags & OMF_HAS_TAILCALL_SUCCESSOR) != 0))
     {
         JITDUMP("OSR + PGO + potential tail call --- preparing to relocate block probes\n");
 
@@ -1887,8 +1887,8 @@ PhaseStatus Compiler::fgPrepareToInstrumentMethod()
         (JitConfig.TC_PartialCompilation() > 0);
     const bool prejit               = opts.jitFlags->IsSet(JitFlags::JIT_FLAG_PREJIT);
     const bool tier0WithPatchpoints = opts.jitFlags->IsSet(JitFlags::JIT_FLAG_TIER0) && mayHavePatchpoints;
-    const bool osrMethod            = opts.IsOSR();
-    const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !osrMethod;
+    const bool instrOpt             = opts.IsInstrumentedOptimized();
+    const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !instrOpt;
 
     if (useEdgeProfiles)
     {
@@ -1899,7 +1899,7 @@ PhaseStatus Compiler::fgPrepareToInstrumentMethod()
         JITDUMP("Using block profiling, because %s\n",
                 (JitConfig.JitEdgeProfiling() == 0)
                     ? "edge profiles disabled"
-                    : prejit ? "prejitting" : osrMethod ? "OSR" : "tier0 with patchpoints");
+                    : prejit ? "prejitting" : instrOpt ? "optimized instr" : "tier0 with patchpoints");
 
         fgCountInstrumentor = new (this, CMK_Pgo) BlockCountInstrumentor(this);
     }

diff --git a/src/coreclr/jit/importer.cpp b/src/coreclr/jit/importer.cpp
@@ -9602,7 +9602,16 @@ var_types Compiler::impImportCall(OPCODE                  opcode,
         {
             return impImportJitTestLabelMark(sig->numArgs);
         }
-#endif // DEBUG
+
+        // static ulong JitHelpers_JitFlags() => 0;
+        // can be defined anywhere and will be replaced by Debug-version of RyuJIT
+        if ((mflags & CORINFO_FLG_STATIC) && (sig->numArgs == 0) && (sig->retType == CorInfoType::CORINFO_TYPE_ULONG) &&
+            (strcmp("JitHelpers_JitFlags", eeGetMethodName(methHnd, nullptr)) == 0))
+        {
+            call = gtNewLconNode((__int64)opts.jitFlags->GetRawFlags());
+            goto DONE_CALL;
+        }
+#endif
 
         // <NICE> Factor this into getCallInfo </NICE>
         bool isSpecialIntrinsic = false;
@@ -22224,7 +22233,7 @@ bool Compiler::impConsiderCallProbe(GenTreeCall* call, IL_OFFSET ilOffset)
         return false;
     }
 
-    assert(opts.OptimizationDisabled() || opts.IsOSR());
+    assert(opts.OptimizationDisabled() || opts.IsInstrumentedOptimized());
     assert(!compIsForInlining());
 
     // During importation, optionally flag this block as one that

diff --git a/src/coreclr/jit/jitee.h b/src/coreclr/jit/jitee.h
@@ -157,6 +157,11 @@ class JitFlags
         return m_jitFlags == 0;
     }
 
+    unsigned __int64 GetRawFlags() const
+    {
+        return m_jitFlags;
+    }
+
     void SetFromFlags(CORJIT_FLAGS flags)
     {
         // We don't want to have to check every one, so we assume it is exactly the same values as the JitFlag

diff --git a/src/coreclr/vm/callcounting.cpp b/src/coreclr/vm/callcounting.cpp
@@ -574,7 +574,7 @@ bool CallCountingManager::SetCodeEntryPoint(
             // For a default code version that is not tier 0, call counting will have been disabled by this time (checked
             // below). Avoid the redundant and not-insignificant expense of GetOptimizationTier() on a default code version.
             !activeCodeVersion.IsDefaultVersion() &&
-            activeCodeVersion.GetOptimizationTier() != NativeCodeVersion::OptimizationTier0
+            activeCodeVersion.IsFinalTier()
         ) ||
         !g_pConfig->TieredCompilation_CallCounting())
     {
@@ -602,7 +602,7 @@ bool CallCountingManager::SetCodeEntryPoint(
                 return true;
             }
 
-            _ASSERTE(activeCodeVersion.GetOptimizationTier() == NativeCodeVersion::OptimizationTier0);
+            _ASSERTE(!activeCodeVersion.IsFinalTier());
 
             // If the tiering delay is active, postpone further work
             if (GetAppDomain()
@@ -649,7 +649,7 @@ bool CallCountingManager::SetCodeEntryPoint(
         }
         else
         {
-            _ASSERTE(activeCodeVersion.GetOptimizationTier() == NativeCodeVersion::OptimizationTier0);
+            _ASSERTE(!activeCodeVersion.IsFinalTier());
 
             // If the tiering delay is active, postpone further work
             if (GetAppDomain()
@@ -659,7 +659,7 @@ bool CallCountingManager::SetCodeEntryPoint(
                 return true;
             }
 
-            CallCount callCountThreshold = (CallCount)g_pConfig->TieredCompilation_CallCountThreshold();
+            CallCount callCountThreshold = g_pConfig->TieredCompilation_CallCountThreshold();
             _ASSERTE(callCountThreshold != 0);
 
             NewHolder<CallCountingInfo> callCountingInfoHolder = new CallCountingInfo(activeCodeVersion, callCountThreshold);
@@ -780,7 +780,7 @@ PCODE CallCountingManager::OnCallCountThresholdReached(TransitionBlock *transiti
     // used going forward under appropriate locking to synchronize further with deletion.
     GCX_PREEMP_THREAD_EXISTS(CURRENT_THREAD);
 
-    _ASSERTE(codeVersion.GetOptimizationTier() == NativeCodeVersion::OptimizationTier0);
+    _ASSERTE(!codeVersion.IsFinalTier());
 
     codeEntryPoint = codeVersion.GetNativeCode();
     do

diff --git a/src/coreclr/vm/codeversion.cpp b/src/coreclr/vm/codeversion.cpp
@@ -151,7 +151,11 @@ NativeCodeVersion::OptimizationTier NativeCodeVersionNode::GetOptimizationTier()
 void NativeCodeVersionNode::SetOptimizationTier(NativeCodeVersion::OptimizationTier tier)
 {
     LIMITED_METHOD_CONTRACT;
-    _ASSERTE(tier >= m_optTier);
+
+    _ASSERTE(
+        tier == m_optTier ||
+        (m_optTier != NativeCodeVersion::OptimizationTier::OptimizationTier1 &&
+         m_optTier != NativeCodeVersion::OptimizationTier::OptimizationTierOptimized));
 
     m_optTier = tier;
 }
@@ -333,6 +337,13 @@ NativeCodeVersion::OptimizationTier NativeCodeVersion::GetOptimizationTier() con
     }
 }
 
+bool NativeCodeVersion::IsFinalTier() const
+{
+    LIMITED_METHOD_DAC_CONTRACT;
+    OptimizationTier tier = GetOptimizationTier();
+    return tier == OptimizationTier1 || tier == OptimizationTierOptimized;
+}
+
 #ifndef DACCESS_COMPILE
 void NativeCodeVersion::SetOptimizationTier(OptimizationTier tier)
 {
@@ -808,7 +819,7 @@ bool ILCodeVersion::HasAnyOptimizedNativeCodeVersion(NativeCodeVersion tier0Nati
     _ASSERTE(!tier0NativeCodeVersion.IsNull());
     _ASSERTE(tier0NativeCodeVersion.GetILCodeVersion() == *this);
     _ASSERTE(tier0NativeCodeVersion.GetMethodDesc()->IsEligibleForTieredCompilation());
-    _ASSERTE(tier0NativeCodeVersion.GetOptimizationTier() == NativeCodeVersion::OptimizationTier0);
+    _ASSERTE(!tier0NativeCodeVersion.IsFinalTier());
 
     NativeCodeVersionCollection nativeCodeVersions = GetNativeCodeVersions(tier0NativeCodeVersion.GetMethodDesc());
     for (auto itEnd = nativeCodeVersions.End(), it = nativeCodeVersions.Begin(); it != itEnd; ++it)
@@ -1708,9 +1719,7 @@ PCODE CodeVersionManager::PublishVersionableCodeIfNecessary(
             {
             #ifdef FEATURE_TIERED_COMPILATION
                 _ASSERTE(!config->ShouldCountCalls() || pMethodDesc->IsEligibleForTieredCompilation());
-                _ASSERTE(
-                    !config->ShouldCountCalls() ||
-                    activeVersion.GetOptimizationTier() == NativeCodeVersion::OptimizationTier0);
+                _ASSERTE(!config->ShouldCountCalls() || !activeVersion.IsFinalTier());
                 if (config->ShouldCountCalls()) // the generated code was at a tier that is call-counted
                 {
                     // This is the first call to a call-counted code version of the method

diff --git a/src/coreclr/vm/codeversion.h b/src/coreclr/vm/codeversion.h
@@ -71,15 +71,19 @@ class NativeCodeVersion
     BOOL SetNativeCodeInterlocked(PCODE pCode, PCODE pExpected = NULL);
 #endif
 
+    // NOTE: Don't change existing values to avoid breaking changes in event tracing
     enum OptimizationTier
     {
         OptimizationTier0,
         OptimizationTier1,
         OptimizationTier1OSR,
         OptimizationTierOptimized, // may do less optimizations than tier 1
+        OptimizationTierInstrumented,
+        OptimizationTierInstrumentedOptimized,
     };
 #ifdef FEATURE_TIERED_COMPILATION
     OptimizationTier GetOptimizationTier() const;
+    bool IsFinalTier() const;
 #ifndef DACCESS_COMPILE
     void SetOptimizationTier(OptimizationTier tier);
 #endif
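
One way to see why the `tier >= m_optTier` assertion had to be relaxed: the two new enum values are appended after the existing ones (per the NOTE about event tracing), so an instrumented tier now compares greater than `OptimizationTier1`, and a promotion from an instrumented tier to Tier1 would trip the old ordering check. The sketch below restates the enum and the new checks outside the runtime; `IsAllowedTierChange` is a hypothetical name for the condition inside the new `_ASSERTE`.

```cpp
#include <cassert>

// Same order as NativeCodeVersion::OptimizationTier in the diff above.
enum OptimizationTier
{
    OptimizationTier0,
    OptimizationTier1,
    OptimizationTier1OSR,
    OptimizationTierOptimized,
    OptimizationTierInstrumented,
    OptimizationTierInstrumentedOptimized,
};

// Mirrors NativeCodeVersion::IsFinalTier() from the diff.
constexpr bool IsFinalTier(OptimizationTier tier)
{
    return tier == OptimizationTier1 || tier == OptimizationTierOptimized;
}

// Hypothetical helper: the condition checked by the new assertion in
// NativeCodeVersionNode::SetOptimizationTier.
constexpr bool IsAllowedTierChange(OptimizationTier from, OptimizationTier to)
{
    return to == from || !IsFinalTier(from);
}

// The old ordering-based check would reject Instrumented -> Tier1.
static_assert(!(OptimizationTier1 >= OptimizationTierInstrumented), "enum order matters");
static_assert(IsAllowedTierChange(OptimizationTierInstrumented, OptimizationTier1),
              "instrumented code may still be promoted");

// Usage example: a final tier can only be "changed" to itself.
void ExampleUsage()
{
    assert(IsAllowedTierChange(OptimizationTier1, OptimizationTier1));
    assert(!IsAllowedTierChange(OptimizationTier1, OptimizationTier0));
}
```

Note that, as written in the diff, `OptimizationTier1OSR` is not treated as a final tier by `IsFinalTier`.
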