Merged
Changes from 1 commit
Commits
41 commits
ea2f3d5
Arm64: Add If conversion pass
a74nh Jul 15, 2022
8ee167f
Minor review fixups
a74nh Aug 10, 2022
039de48
Return a PhaseStatus
a74nh Aug 10, 2022
ddfad68
Fix formatting
a74nh Aug 11, 2022
41c3a9e
Check for side effects on NOPs
a74nh Aug 11, 2022
4c93a4c
Add function block comments for the phase
a74nh Aug 11, 2022
891510d
Remove creation of AND chains from if conversion pass
a74nh Aug 12, 2022
c1281b5
Update middleBlock flow
a74nh Aug 12, 2022
b900344
Check for order side effects
a74nh Aug 12, 2022
a3e89b9
Remove COLON_COND check
a74nh Aug 31, 2022
441a60c
Remove flag toggling
a74nh Aug 31, 2022
962c83b
Move the conditional assignment to the JTRUE block
a74nh Aug 31, 2022
90bff39
Fix formatting
a74nh Sep 2, 2022
36cf21d
Allow conditions with side effects
a74nh Sep 7, 2022
127f9a6
Fix formatting
a74nh Sep 7, 2022
d1f0862
Correct all moved SSA statements
a74nh Sep 15, 2022
852a5af
Add size costing check
a74nh Sep 15, 2022
6690566
Only move middle block ssa defs
a74nh Sep 16, 2022
58372e2
Fix formatting
a74nh Sep 16, 2022
10a8f1f
Fewer SSA assumptions
a74nh Sep 20, 2022
97b1af1
Use implicit func for value numbering
a74nh Sep 21, 2022
3f60851
Update header for gtFoldExprConditional
a74nh Oct 6, 2022
0f054fe
Cost based on speed
a74nh Oct 6, 2022
6f0abdf
Merge branch 'main'
a74nh Oct 6, 2022
393fd39
Add Stress mode for inner loops
a74nh Oct 7, 2022
09d62fe
Rework costings
a74nh Oct 7, 2022
73007b2
Check for invalid VNs
a74nh Oct 10, 2022
61af92d
Ignore compares against zero
a74nh Oct 10, 2022
4fdabea
Ensure float compares are contained
a74nh Oct 12, 2022
be60d30
Allow if conversion of test compares
a74nh Oct 12, 2022
806e061
Do not contain test compares within compare chains
a74nh Oct 13, 2022
a508e85
Add float versions of the JIT/opt/Compares tests
a74nh Oct 13, 2022
ac2568f
Fix formatting
a74nh Oct 13, 2022
bad8097
Compare chains use CmpCompares, selects use Compares
a74nh Oct 14, 2022
ed06f67
Merge main
a74nh Oct 17, 2022
935716c
Fix flow checking for empty blocks
a74nh Oct 20, 2022
12bfe0a
Merge main
a74nh Oct 20, 2022
f2f3a02
Fix to contexts setting JitStdOutFile
jakobbotsch Oct 26, 2022
2e183bd
Temporary fix for SPMI problem
jakobbotsch Oct 26, 2022
b2772ff
Fix attr and reg producing in select generation
a74nh Oct 28, 2022
4281fc0
Revert SPMI changes
jakobbotsch Oct 28, 2022
Move the conditional assignment to the JTRUE block
a74nh committed Sep 2, 2022
commit 962c83bba709e37e33b00bd06c823a3e46f3e8e9
92 changes: 62 additions & 30 deletions src/coreclr/jit/optimizer.cpp
@@ -4585,13 +4585,14 @@ PhaseStatus Compiler::optUnrollLoops()
// statement. For example:
// if (x < 7) { a = 5; }
//
// This is represented in IR by two basic blocks. The first block ends with a
// JTRUE statement which conditionally jumps to the second block. Both blocks
// then jump to the same destination. The second block just contains a single
// assign statement. (Note that the first block may contain additional statements
// prior to the JTRUE statement.)
// This is represented in IR by two basic blocks. The first block (block) ends with
// a JTRUE statement which conditionally jumps to the second block (middleBlock).
// The second block just contains a single assign statement. Both blocks then jump
// to the same destination (middleNext). Note that the first block may contain
// additional statements prior to the JTRUE statement.
//
// For example:
//
// ------------ BB03 [009..00D) -> BB05 (cond), preds={BB02} succs={BB04,BB05}
// STMT00004
// * JTRUE void $VN.Void
@@ -4605,15 +4606,20 @@ PhaseStatus Compiler::optUnrollLoops()
// +--* LCL_VAR int V00 arg0
// \--* CNS_INT int 5 $47
//
// The JTRUE statement in the first block is replaced with a SELECT node in the
// assignment in the second block. If the result of the SELECT's op1 is true,
// then op2 is passed to the assignment. Otherwise op3 is used. For example:
//
// ------------ BB03 [009..00D), preds={BB02} succs={BB04}
// This is optimised by conditionally executing the store and removing the conditional
// jumps. First the JTRUE is replaced with a NOP. The assignment is updated so that
// the source of the store is a SELECT node with the condition set to the inverse of
// the original JTRUE condition. If the condition passes, the original assign happens,
// otherwise the existing source value is used.
//
// In the example above, local var 0 is set to 5 if the LT returns true, otherwise
// the existing value of local var 0 is used:
//
// ------------ BB03 [009..00D) -> BB05 (always), preds={BB02} succs={BB05}
// STMT00004
// * NOP void
//
// ------------ BB04 [00D..010), preds={BB03} succs={BB05}
// STMT00005
// * ASG int $VN.Void
// +--* LCL_VAR int V00 arg0
@@ -4624,6 +4630,7 @@ PhaseStatus Compiler::optUnrollLoops()
// +--* CNS_INT int 5 $47
// \--* LCL_VAR int V00
//
// ------------ BB04 [00D..010), preds={} succs={BB05}
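//
// In source terms, the rewrite above is equivalent to replacing
//   if (x < 7) { a = 5; }
// with the branch-free form
//   a = (x < 7) ? 5 : a;
// which lowers to a conditional select (csel) on Arm64.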
//
bool Compiler::optIfConvert(BasicBlock* block)
{
@@ -4653,7 +4660,8 @@ bool Compiler::optIfConvert(BasicBlock* block)
return false;
}

BasicBlock* middleBlock = block->bbNext;
BasicBlock* middleBlock = nullptr;
BasicBlock* middleNext = block->bbNext;
GenTree* asgNode = nullptr;
Statement* asgStmt = nullptr;

@@ -4663,21 +4671,23 @@ bool Compiler::optIfConvert(BasicBlock* block)
bool foundMiddle = false;
while (!foundMiddle)
{
middleBlock = middleNext;
noway_assert(middleBlock != nullptr);
BasicBlock* const middleNext = middleBlock->GetUniqueSucc();

if (middleBlock->bbNext == block->bbJumpDest)
// middleBlock should have a single successor.
middleNext = middleBlock->GetUniqueSucc();
if (middleNext == nullptr)
{
return false;
}

if (middleNext == block->bbJumpDest)
{
// This is our final middle block.
foundMiddle = true;
}

// Check that we have linear flow and are still in the same EH region
//
if (middleBlock->NumSucc() != 1 || middleBlock->bbJumpKind != BBJ_NONE)
{
return false;
}

if (middleBlock->GetUniquePred(this) == nullptr)
{
@@ -4737,8 +4747,6 @@ bool Compiler::optIfConvert(BasicBlock* block)
return false;
}
}

middleBlock = middleNext;
}
if (asgNode == nullptr)
{
@@ -4749,7 +4757,7 @@ bool Compiler::optIfConvert(BasicBlock* block)
#ifdef DEBUG
if (verbose)
{
JITDUMP("Conditionally executing " FMT_BB " from " FMT_BB "\n", middleBlock->bbNum, block->bbNum);
JITDUMP("Conditionally executing " FMT_BB " inside " FMT_BB "\n", middleBlock->bbNum, block->bbNum);
for (BasicBlock* dumpBlock = block; dumpBlock != middleBlock->bbNext; dumpBlock = dumpBlock->bbNext)
{
fgDumpBlock(dumpBlock);
@@ -4770,24 +4778,48 @@ bool Compiler::optIfConvert(BasicBlock* block)
// Create a select node.
GenTreeConditional* select =
gtNewConditionalNode(GT_SELECT, cond, asgNode->gtGetOp2(), destination, asgNode->TypeGet());
gtSetEvalOrder(select);

// Use the select as the source of the assignment.
asgNode->AsOp()->gtOp2 = select;
asgNode->AsOp()->gtFlags |= (select->gtFlags & GTF_ALL_EFFECT);

// Calculate costs.
gtSetEvalOrder(select);
gtSetEvalOrder(asgNode);
fgSetStmtSeq(asgStmt);

// Remove the JTRUE statement.
last->ReplaceWith(gtNewNothingNode(), this);
fgSetStmtSeq(block->lastStmt());
gtSetEvalOrder(last);
fgSetStmtSeq(asgStmt);
fgSetStmtSeq(block->lastStmt());

// Move the Asg to the end of the original block
Statement* stmtList1 = block->firstStmt();
Statement* stmtList2 = middleBlock->firstStmt();
Statement* stmtLast1 = block->lastStmt();
Statement* stmtLast2 = middleBlock->lastStmt();
stmtLast1->SetNextStmt(stmtList2);
stmtList2->SetPrevStmt(stmtLast1);
stmtList1->SetPrevStmt(stmtLast2);
middleBlock->bbStmtList = nullptr;

// Fix up the SSA of assignment destination.
if (asgNode->AsOp()->gtOp1->IsLocal())
{
GenTreeLclVarCommon* lclVar = asgNode->AsOp()->gtOp1->AsLclVarCommon();
unsigned lclNum = lclVar->GetLclNum();
unsigned ssaNum = lclVar->GetSsaNum();

if (ssaNum != SsaConfig::RESERVED_SSA_NUM)
{
LclSsaVarDsc* ssaDef = lvaTable[lclNum].GetPerSsaData(ssaNum);
noway_assert(ssaDef->GetBlock() == middleBlock);
ssaDef->SetBlock(block);
Member:

Doesn't this mean LclSsaVarDsc::m_asg is left pointing at an assignment that has been removed?

Updating that too will mean we end up with two SSA defs for the same assignment. I wonder if that will be a problem.

Member:

I suppose for now we won't duplicate any SSA defs since this PR does not handle if-else. But, once that is handled, we will hit the situation. So for now things should be ok.

Also small nit: lvaTable[lclNum] -> lvaGetDesc(lclNum)
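
i.e., the lookup would become something like:

LclSsaVarDsc* ssaDef = lvaGetDesc(lclNum)->GetPerSsaData(ssaNum);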

Contributor Author:

asgNode is the same node as before the optimisation. In an earlier version of the patch I was duplicating it - but now I modify the assignment and then move it to the earlier block. So, I think this is correct (unless I'm missing something, the SSA mechanism isn't 100% clear in my head yet, and I couldn't find any obvious "fix up ssa" functions). This fixed a number of SSA errors I was seeing.

An issue I noticed yesterday whilst looking through code size regressions was that it's loading the same data twice. I think this is due to the uses not being correct. In the IR here post optimisation, node 278 was created by duplicating node 91:

***** BB06
STMT00022 ( 0x062[E-] ... 0x069 )
N010 ( 15, 14) [000092] -A------R--                         *  ASG       int    $VN.Void
N009 (  3,  2) [000091] D------N---                         +--*  LCL_VAR   int    V00 loc0         d:15 $VN.Void
N008 ( 11, 11) [000279] -----------                         \--*  SELECT    int   
N003 (  3,  4) [000042] J------N---                            +--*  NE        int    $149
N001 (  1,  1) [000040] -----------                            |  +--*  LCL_VAR   int    V01 loc1         u:2 (last use) $102
N002 (  1,  2) [000041] -----------                            |  \--*  CNS_INT   int    123 $4f
N006 (  5,  5) [000090] -----------                            +--*  ADD       int    $14a
N004 (  3,  2) [000088] -----------                            |  +--*  LCL_VAR   int    V00 loc0         u:7 $183
N005 (  1,  2) [000089] -----------                            |  \--*  CNS_INT   int    0x2000 $50
N007 (  3,  2) [000278] -----------                            \--*  LCL_VAR   int    V00 loc0         u:15 $VN.Void

When generating code, it's creating a load for both of the nodes:

            ldr     w1, [fp, #0x14]	// [V00 loc0]
            add     w1, w1, #2, LSL #12
            ldr     w2, [fp, #0x14]	// [V00 loc0]
            cmp     w0, #123
            csel    w0, w1, w2, ne
            str     w0, [fp, #0x14]	// [V00 loc0]

I think I just need to connect the uses somehow.
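
In source terms, the statement above is roughly the following (a simplified sketch with illustrative names); because the new node 278 is its own tree (copied from node 91) rather than something shared with the existing use at node 88, the two reads of loc0 stay independent and each becomes a separate load:

// Simplified C++ sketch of STMT00022 above (names are illustrative).
int IfConvertedStore(int loc0, int loc1)
{
    // Two distinct reads of loc0: one under the ADD, one as the fallback value.
    loc0 = (loc1 != 123) ? (loc0 + 0x2000) : loc0;
    return loc0;
}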

Contributor Author:

I think I just need to connect the uses somehow.

... which isn't easy as I'd essentially need to run an entire CSE pass across the block to be able to spot that node 88 and my new node 278 are the same.

I wonder if this should be done before the CSE phase. But then we need to add SELECT support to CSE and assert prop. Which is why the If conversion was placed after CSE.

Contributor:

Note that even if this was before CSE, CSE currently doesn't handle LCL_VAR nodes (it wouldn't be hard to add, but may have some trailing regressions).

I'd echo Andy's suggestion to look into costing: https://github.com/dotnet/runtime/pull/73472/files#r942623060. E.g. one could have something like "estimated gain" (from removing the branch) vs "estimated loss" (from evaluating more expressions).
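
For example, such a check could be shaped roughly like this (purely illustrative, not a concrete implementation proposal):

// Illustrative only: if-convert when the estimated saving from removing the
// branch outweighs the cost of unconditionally evaluating the select operands.
bool ShouldIfConvert(int branchCost, int selectCost, int alwaysEvaluatedCost)
{
    const int estimatedGain = branchCost;                       // branch (and potential mispredict) removed
    const int estimatedLoss = selectCost + alwaysEvaluatedCost; // csel plus operands now always computed
    return estimatedGain >= estimatedLoss;
}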

LCL_VAR int V00 loc0 u:15 $VN.Void

Side note: it is not right for this use to have $VN.Void.

Contributor Author:

Note that even if this was before CSE, CSE currently doesn't handle LCL_VAR nodes (it wouldn't be hard to add, but may have some trailing regressions).

So what would ensure that this block doesn't load v00 twice?

*  ASG       int    $VN.Void
+--*  LCL_VAR   int    V00 loc0
\--*  ADD (or MUL, or whatever)
   +--*  LCL_VAR   int    V00 loc0
   \--*  LCL_VAR   int    V00 loc0 

I'd echo Andy's suggestion to look into costing:

Right, because even without the above fixed, the costing will ensure we never regress in code size. I'll add something here.

Meanwhile, looking back at the queue testing: I was using the Perf_PriorityQueue tests, which use a PriorityQueue and not a Queue.

I took those tests and rewrote them to use Queue: https://github.com/dotnet/performance/blob/1c5e29da0e745809bd911070bbef933d25328eaf/src/benchmarks/micro/libraries/System.Collections/Queue/Perf_Queue.cs

Results were much more stable, not really changing from one run to the next:

| Dequeue_And_Enqueue | Job-BTPMBD | /runtime_HEAD/ |   10 |    124.2 ns |  0.17 ns |  0.16 ns |    124.2 ns |    123.9 ns |    124.4 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-OHELEL | /runtime_csel/ |   10 |    130.7 ns |  0.79 ns |  0.66 ns |    130.8 ns |    129.7 ns |    132.0 ns |  1.05 |          Slower |         - |          NA |
| Dequeue_And_Enqueue | Job-BTPMBD | /runtime_HEAD/ |  100 |  1,219.6 ns |  0.55 ns |  0.46 ns |  1,219.8 ns |  1,218.7 ns |  1,220.1 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-OHELEL | /runtime_csel/ |  100 |  1,200.3 ns |  4.17 ns |  3.69 ns |  1,199.4 ns |  1,196.5 ns |  1,206.3 ns |  0.98 |            Same |         - |          NA |
| Dequeue_And_Enqueue | Job-BTPMBD | /runtime_HEAD/ | 1000 | 11,892.0 ns |  4.18 ns |  3.49 ns | 11,891.9 ns | 11,887.0 ns | 11,899.7 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-OHELEL | /runtime_csel/ | 1000 | 11,776.6 ns | 41.06 ns | 34.29 ns | 11,758.3 ns | 11,745.8 ns | 11,835.3 ns |  0.99 |            Same |         - |          NA |

Essentially:

  • On 100 and 1000 iterations we are getting 1% to 2% improvement.
  • On 10 iterations we are getting 5% loss.
  • Why? The moveNext function is very consistent (see the sketch below). The size of the array is unchanged. So for size 10, you get 9 fails, then one hit. The branch predictor is being smart and predicts this.
  • For size 100 and 1000 the branch predictor isn't as good at going that wide, and the csel is slightly better.
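
The branch in question only fires once per pass over the queue, something like this simplified sketch (illustrative, not the actual library code):

// Simplified shape of the wrap-around advance that gets if-converted.
int Advance(int index, int capacity)
{
    int next = index + 1;
    if (next == capacity) // with capacity 10 this is taken once every 10 calls, so it predicts well
    {
        next = 0;
    }
    return next; // if-converted: next = (next == capacity) ? 0 : next, i.e. cmp + csel on every call
}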

Contributor Author:

New version:

  • SSA changes (including fixing the void)
  • Cost calculations. I'm using the size here, as that's what we're concerned about. Oddly, calling gtSetEvalOrder() changes the previously calculated values, so I run it again to make sure everything is correct. Also, there is nothing taking into account that JTRUE(CMP(x,0)) would turn into cbnz x, so I had to explicitly check for it here (sketched below).
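
A rough illustration of that special case (hypothetical helper, not the actual costing code):

// A branch on a compare against zero folds to a single cbz/cbnz, so the branchy
// form is smaller than a generic cmp + b.cond sequence would suggest.
int EstimateBranchSizeBytes(bool comparesAgainstZero)
{
    return comparesAgainstZero ? 4   // cbz/cbnz reg, target
                               : 8;  // cmp reg, #imm ; b.cond target
}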

}
}

// Update the flow from the original block.
fgRemoveAllRefPreds(block->bbNext, block);
block->bbJumpKind = BBJ_ALWAYS;

// Update the flow.
fgRemoveAllRefPreds(block->bbJumpDest, block);
block->bbJumpKind = BBJ_NONE;
block->bbJumpDest = middleBlock;

#ifdef DEBUG
if (verbose)