Skip to content

Conversation

@briansull
Copy link
Contributor

@briansull briansull commented Nov 27, 2020

Don't allow an unwrapped promoted field to be returned when we are expecting a TYP_STRUCT
(Expanded to include any mismatched type)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 27, 2020
@briansull
Copy link
Contributor Author

@dotnet/jit-contrib PTAL

There are no Asm diffs in System.Private.Corelib
and for the full frameworks the Arm64 had two methods that had small diffs when we have a multi-reg return struct. and we return one of the two values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could back this change out as it is not necessary for this fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change cause the diffs that you see with arm64 multi-reg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment:
Yes, it does cause the diffs that I saw. I am backing it out as it is unnecessary

@sandreenko
Copy link
Contributor

Nice catch!

Does the test take a long time to run, should it be Pri1 or longRunningGC test?

I have tried to run it locally and it failed with a Jit assert:

 "runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\corerun.exe" Runtime_44895.dll
               [000118] -----+------              *  LCL_VAR   ref    V15 tmp11

Assert failure(PID 51232 [0x0000c820], Thread: 4292 [0x10c4]): Assertion failed '!"Incompatible types for gtNewTempAssign"' in 'Property`1[Text][Text]:GetValue(System.Nullable`1[Text]):Text:this' during 'Morph - Global' (IL size 150)

    File: runtime\src\coreclr\src\jit\gentree.cpp Line: 15492

for this tree:

fgMorphTree BB06, STMT00027 (before)
               [000089] --C---------              *  RETURN    struct
               [000118] ------------              \--*  LCL_VAR   ref    V15 tmp11

looks like we catch this problem with our asserts but it makes it hard to understand what exactly goes wrong in release and why it leads to bad execution.

I think your test uncovers a bigger issue with FIELD struct (1 field with the same size) transformation like this:

LocalAddressVisitor visiting statement:
STMT00017 (IL   ???...  ???)
               [000059] --C---------              *  RETURN    struct
               [000149] ----G-------              \--*  FIELD     struct value
               [000147] ------------                 \--*  ADDR      byref 
               [000148] -------N----                    \--*  LCL_VAR   struct<System.Nullable`1[Text], 16>(P) V01 arg1         
                                                        \--*    bool   V01.hasValue (offs=0x00) -> V12 tmp8         
                                                        \--*    ref    V01.value (offs=0x08) -> V13 tmp9         
Replacing the field in promoted struct with local var V13
LocalAddressVisitor incrementing ref count from 2 to 3 for implict by-ref V01
LocalAddressVisitor modified statement:
STMT00017 (IL   ???...  ???)
               [000059] --C---------              *  RETURN    struct
               [000149] ------------              \--*  LCL_VAR   ref    V13 tmp9 

that can be hit not only for ref and not only for trees under return.

I have added another condition to your test so now it fails with the same assert but for TYP_LONG so your new condition does not guard against that, but maybe it can't lead to bad execution in release.

Runtime_44895.modified.txt

I think there are 2 main questions:

  1. Is it correct to have GT_OP struct -> LCL_VAR ref for GC tracking? We are lucky here because there is a limited number of trees which can have struct type and can use FIELD as a child (I can think of RETURN, ASG, STORE_*), so we can check all.
  2. Should we allow such cases in gtNewTempAssign?
  3. How big are the diffs if we forbid the optimization in fgMorphStructField for 5.0? As I said we don't many GT_OP struct -> FIELD cases, so maybe we won't see many regressions?

@briansull
Copy link
Contributor Author

briansull commented Dec 7, 2020

The test fails with an assert using Checked, With a retail build I think that there is a (Bad Codegen and or a GC hole). I will try running it with GCSTRESS.
I modified the original test that the user submitted slightly.

@briansull
Copy link
Contributor Author

We will want a conservative fix for 5.0, so that is what I will pursue as the initial fix in master as well.
As that allows the /backport command.

We wouldn't want to allow gtNewTemp assign to accept a mismatched TYP_REF type assignment.

@briansull briansull force-pushed the fix-44895 branch 2 times, most recently from 89445d5 to 3c990ff Compare December 8, 2020 05:41
@briansull
Copy link
Contributor Author

Joy, Joy:

This branch has conflicts that must be resolved
to resolve conflicts before continuing.
Conflicting files
src/coreclr/src/jit/morph.cpp

@JulieLeeMSFT
Copy link
Member

PR #44973 removed the second src folder.

@briansull
Copy link
Contributor Author

briansull commented Dec 8, 2020

@JulieLeeMSFT PR #44973 removed the second src folder.

Yes, I know and this caused all existing PR to have a merge conflict that has to be resolved by hand.
I'm sure that this also breaks the /backport command as well.

:-(

@briansull
Copy link
Contributor Author

briansull commented Dec 8, 2020

@dotnet/jit-contrib @sandreenko Please take a look

I expanded this chnage to disallow any mismatched type for a GT_RETURN of a struct,
There are no Asm Diffs in the frameworks

@sandreenko
Copy link
Contributor

@briansull do you know on which tree we had a GC hole, was it ASG(struct, ref)?
Could we have a GC hole without a merge return block (just Return struct(LCL_VAR ref))?

I was surprised to see that in Return struct(LCL_VAR ref) case we did not hit any asserts, probably if we understand where the hole was happening we could add additional checks there.

I agree with your latest change that the type doesn't matter but I am not sure that parent->OperGet(), could it be reproduced when parent is not a return instruction?

@briansull
Copy link
Contributor Author

@sandreenko
I didn't see a GC hole when I ran the test case with a release build under GCSTRESS.
So, I am not sure that there is a GC hole here, but there is bad codegen with release that is for sure.

@BruceForstall
Copy link
Contributor

#44895

@briansull
Copy link
Contributor Author

@AndyAyersMS @sandreenko
Ping -I would like to check this fix in.

@AndyAyersMS
Copy link
Member

I still don't know what the bug is, so it's hard to be confident this is a complete fix.

Also I wonder if there's some overlap here with #45557?

@sandreenko
Copy link
Contributor

sandreenko commented Dec 9, 2020

@briansull are you confident that the issue can't be repro when the parent is not a return?

@BruceForstall
Copy link
Contributor

@briansull are you confident that the issue can't be repro when the parent is not a struct?

Do you mean "when the parent is not a GT_RETURN"?

@sandreenko
Copy link
Contributor

Do you mean "when the parent is not a GT_RETURN"?

Yes, thanks, edited.

@BruceForstall
Copy link
Contributor

@briansull what does the bad IL (and codegen) look like in the test case, and what does it look like with your change?

@briansull
Copy link
Contributor Author

I am continuing to investigate. This is not an area in which I know a lot about. @sandreenko may know more about the invariants in this area than I know.

If I disallow all other parents there is a huge regression, but it looks like most of that is where the parent is a GT_ADDR node which should always be a safe parent node.

@briansull
Copy link
Contributor Author

briansull commented Dec 9, 2020

@BruceForstall The IL isn't bad, instead we hit a whole pile of asserts with a checked build that don't exist in a release build, so in release generates what I assume is bade code because the asserts are there for a good reason.

@briansull
Copy link
Contributor Author

briansull commented Dec 9, 2020

Only five methods have diffs in System.Private.Corelib.dll when I use:

if (parent->OperGet() != GT_ADDR)
{
    return;
}


Top method regressions (bytes):
          36 ( 3.84% of base) : System.Private.CoreLib.dasm - TranscodingStream:Write(ReadOnlySpan`1):this
          16 (16.67% of base) : System.Private.CoreLib.dasm - Range:Equals(Object):bool:this
Top method improvements (bytes):
         -25 (-1.64% of base) : System.Private.CoreLib.dasm - TimeZoneInfo:GetIsDaylightSavings(DateTime,AdjustmentRule,DaylightTimeStruct):bool
         -10 (-1.63% of base) : System.Private.CoreLib.dasm - TimeZoneInfo:GetIsAmbiguousTime(DateTime,AdjustmentRule,DaylightTimeStruct):bool
         -10 (-1.63% of base) : System.Private.CoreLib.dasm - TimeZoneInfo:GetIsInvalidTime(DateTime,AdjustmentRule,DaylightTimeStruct):bool

@briansull
Copy link
Contributor Author

briansull commented Dec 9, 2020

Here is an example of what the transformation does:

LocalAddressVisitor visiting statement:
STMT00011 (IL   ???...  ???)
               [000056] -AC---------              *  ASG       struct (copy)
               [000054] D------N----              +--*  LCL_VAR   struct<System.Index, 4>(P) V03 loc1         
                                                  +--*    int    V03._value (offs=0x00) -> V11 tmp7         
               [000091] ------------              \--*  FIELD     struct <End>k__BackingField
               [000089] ------------                 \--*  ADDR      byref 
               [000090] -------N----                    \--*  LCL_VAR   struct<System.Range, 8>(P) V02 loc0         
                                                        \--*    int    V02.<Start>k__BackingField (offs=0x00) -> V09 tmp5         
                                                        \--*    int    V02.<End>k__BackingField (offs=0x04) -> V10 tmp6         
Replacing the field in promoted struct with local var V10
LocalAddressVisitor modified statement:
STMT00011 (IL   ???...  ???)
               [000056] -AC---------              *  ASG       struct (copy)
               [000054] D------N----              +--*  LCL_VAR   struct<System.Index, 4>(P) V03 loc1         
                                                  +--*    int    V03._value (offs=0x00) -> V11 tmp7         
               [000091] -------N----              \--*  LCL_VAR   int    V10 tmp6         

The tree node being modified here is:
[000091] ------------ \--* FIELD struct <End>k__BackingField
it is referring to the promoted field contained in V10
\--* int V02.<End>k__BackingField (offs=0x04) -> V10 tmp6
Note that the types do not match, the FIELD is a TYP_STRUCT and the potential replacement is an int

In this case the parent node is a GT_ASG, so this is one of the five diffs that I got in System.Private.Corelib
it is method Range:Equals(Object):bool:this

@sandreenko
Copy link
Contributor

I've simplified the test so now it does not need any additional plugins or iterations, the test always passes in Debug and fails in Release (compile with VS with 5.0 target). I think it is easier to analyze,
Runtime_44895.simple.txt
it is clear what goes wrong in the asm code: after an incorrect ASG that asserts with !"Incompatible types for gtNewTempAssign" we also lose DONT_CSE on the lhs of the final tree:

               [000082] -A----------              *  ASG       ref   
               [000081] ------------              +--*  IND       ref   <- does not have DONT_CSE.
               [000080] ------------              |  \--*  ADDR      byref 
               [000079] D------N----              |     \--*  LCL_VAR   struct<InnerStruct, 8>(P) V03 tmp1         
                                                                |     \--*    ref    V03.d (offs=0x00) -> V05 tmp3         
               [000054] -----+-N----              \--*  LCL_VAR   ref    V04 tmp2 

so later on we CSE the LHS and write to a CSE temp instead of the actual value, when return still references V05 we get:

N002 (  4, 12) [000078] #---G-------        t78 = *  IND       ref    $140
                                                  /--*  t78    ref    
N004 (  4, 12) [000108] DA--G-------              *  STORE_LCL_VAR ref    V06 cse0         d:1 <- should be V05.
N001 (  1,  1) [000072] -------N----        t72 =    LCL_VAR   ref    V05 tmp3         u:2 (last use) $280
                                                  /--*  t72    ref    
N002 (  2,  2) [000073] ------------              *  RETURN    struct $205

and the actual return value is never written.
If we disable CSE then we hit asserts in liveness that we have not hit before and still generate a bad code.
Logs:
base.txt
diff.txt

Could somebody generate 3.1 JitDump? I do not have it on my current machine. I have checked that it works there but don't know what exactly exposed the issue, I suspect it was caused by mine JitDoOldStructRetyping=0 just because it had a big impact on these transformations.

It is still an open question if it can be reproduced without return and I think the answer is no but we are just lucky: without merge return block we call fgMorph on the asg tree and it always transforms incorrect

               [000082] -A----------              *  ASG       ref   
               [000081] ------------              +--*  IND       ref   // No-cse
               [000080] ------------              |  \--*  ADDR      byref 
               [000079] D------N----              |     \--*  LCL_VAR   struct<InnerStruct, 8>(P) V03 tmp1         
                                                                |     \--*    ref    V03.d (offs=0x00) -> V05 tmp3         
               [000054] -----+-N----              \--*  LCL_VAR   ref    V04 tmp2 

into correct

               [000082] -A----------              *  ASG       ref           
               [000079] D------N----             |--*   LCL_VAR   ref V05 tmp3 // has a rudiment CSE flag because it was under ADDR.
               [000054] -----+-N----              \--*  LCL_VAR   ref    V04 tmp2 

but morph is not called in fgMergeBlockReturn for primitive copies, only for blocks:

runtime/src/coreclr/jit/morph.cpp

Lines 16935 to 16942 in 76a443d

GenTree* tree = gtNewTempAssign(genReturnLocal, ret->gtGetOp1(), &pAfterStatement, offset, block);
if (tree->OperIsCopyBlkOp())
{
tree = fgMorphCopyBlock(tree);
}
if (pAfterStatement == lastStmt)
{

again, I think we are just getting lucky and the other phases are not really prepared for this FIELD transformation.

@sandreenko
Copy link
Contributor

Only five methods have diffs in System.Private.Corelib.dll when I use:

if (parent->OperGet() != GT_ADDR)
{
return;
}

it could be a safe solution, but this method looks important:

16 (16.67% of base) : System.Private.CoreLib.dasm - Range:Equals(Object):bool:this

what happens there?

@briansull
Copy link
Contributor Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is under #ifdef DEBUG it's not going to do anything useful.

Copy link
Contributor Author

@briansull briansull Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this new check and noway_assert:

    // Added this noway_assert for runtime\issue 44895, to protect against silent bad codegen
    //
    if ((dstTyp == TYP_STRUCT) && (valTyp == TYP_REF))
    {
        noway_assert(!"Incompatible types for gtNewTempAssign");
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for backport sake we should leave it out.

@briansull
Copy link
Contributor Author

briansull commented Dec 11, 2020

@briansull could you please update the test with a simpler repo or just delete the loop and confirm that it fails without the fix with complus_TieredCompilation=0?

@sandreenko Both tests fail when run with complus_TieredCompilation=0
@AndyAyersMS When I prepare the final PR with the minimal fix would you like one or two test cases?

@AndyAyersMS
Copy link
Member

When I prepare the final PR with the minimal fix would you like one or two test cases?

They're more or less equivalent so I would just keep the simpler one.

- Fix: Don't allow an unwrapped promoted field of TYP_REF to be returned when we are expecting a TYP_STRUCT

Backout change in gtGetStructHandleIfPresent for GT_RETURN as it isn't needed for this fix
Deoptimize all GT_RETURN's with mismatched types for promoted struct fields.
@briansull
Copy link
Contributor Author

@AndyAyersMS @sandreenko
I posted what I hope can be the final version of this fix
Please review and approve

@briansull
Copy link
Contributor Author

/azp run runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@briansull
Copy link
Contributor Author

briansull commented Dec 11, 2020

runtime (Libraries Test Run release coreclr windows x64 Debug) Failing after 42m

  Starting:    System.Security.Cryptography.Csp.Tests (parallel test collections = on, max threads = 2)
    System.Security.Cryptography.Dsa.Tests.DsaArrayOffsetSignatureFormatTests.EmptyHashAlgorithm [FAIL]
      System.TypeInitializationException : The type initializer for 'System.Security.Cryptography.Dsa.Tests.DsaArrayOffsetSignatureFormatTests' threw an exception.
      ---- Internal.Cryptography.CryptoThrowHelper+WindowsCryptographicException : An internal error occurred.

@AndyAyersMS
Copy link
Member

I think this is a known issue that pops up from time to time: #29683.

@briansull
Copy link
Contributor Author

briansull commented Dec 11, 2020

@AndyAyersMS - Thanks I guess I will go ahead and merge and I'm sure it isn't caused by my changes.

@briansull briansull merged commit e8bcb12 into dotnet:master Dec 11, 2020
@briansull briansull deleted the fix-44895 branch December 12, 2020 00:14
@briansull
Copy link
Contributor Author

Fixes #44895

briansull added a commit to briansull/runtime that referenced this pull request Dec 12, 2020
* Fix for Issue 44895
- Fix: Don't allow an unwrapped promoted field of TYP_REF to be returned when we are expecting a TYP_STRUCT

Backout change in gtGetStructHandleIfPresent for GT_RETURN as it isn't needed for this fix
Deoptimize all GT_RETURN's with mismatched types for promoted struct fields.

* Allow both GT_ADDR and GT_ASG as a parent node

* Add second test case Repro2_44895.cs

* Change assert about Incompatible types to be a noway_assert in gtNewTempAssign

* Only use the smaller repro case for Runtime_44895.cs

* Added noway_assert in release build for an assignment of a TYP_REF to a TYP_STRUCT

* rerun jit-format
@JulieLeeMSFT JulieLeeMSFT added Servicing-consider Issue for next servicing release review and removed Servicing-consider Issue for next servicing release review labels Dec 12, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect codegen with multiple returns of a field of a Nullable or struct type

6 participants