opt: add cost for executing geo functions in filters #49074

madelynnblue · 2020-05-14T17:44:13Z

Release note: None

cockroach-teamcity · 2020-05-14T17:44:20Z

This change is

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @rytaft)

pkg/sql/opt/xform/coster.go, line 318 at r1 (raw file):

	// The filter has to be evaluated on each input row.
	inputRowCount := sel.Input.Relational().Stats.RowCount
	cost := memo.Cost(inputRowCount) * cpuCostFactor

Should this also call computeFiltersCost? Maybe we should move this cpuCostFactor * rowCount cost inside that function.

pkg/sql/opt/xform/coster.go, line 471 at r1 (raw file):

	cost += memo.Cost(rowsProcessed) * perRowCost

	cost += c.computeFiltersCost(join.On, util.FastIntMap{}, lookupCount)

this should be rowsProcessed not lookupCount

pkg/sql/opt/xform/coster.go, line 520 at r1 (raw file):

				// some float functions, so this is a somewhat
				// data-backed guess.
				const geoFnCost = cpuCostFactor * 10

[nit] move the 10 as a constant at the top of the file, we try to put them all there

madelynnblue

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @RaduBerinde, and @rytaft)

pkg/sql/opt/xform/coster.go, line 318 at r1 (raw file):

Previously, RaduBerinde wrote…

Should this also call computeFiltersCost? Maybe we should move this cpuCostFactor * rowCount cost inside that function.

just added a second commit for this

madelynnblue

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson, @RaduBerinde, and @rytaft)

pkg/sql/opt/xform/coster.go, line 471 at r1 (raw file):

Previously, RaduBerinde wrote…

this should be rowsProcessed not lookupCount

Ok. I assume this should be changed for computeHashJoinCost too, which is using leftRowCount+rightRowCount, instead of rowsProcessed? Help me understand the intuition here. I did it like this because I thought the function would have to be executed on each incoming row from either side? But I guess that doesn't make much sense because the geo functions have two arguments, so probably one from each side. row count vs rows processed apparently is still unclear to me.

madelynnblue

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)

pkg/sql/opt/xform/coster.go, line 520 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] move the 10 as a constant at the top of the file, we try to put them all there

done

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mjibson and @rytaft)

pkg/sql/opt/xform/coster.go, line 471 at r1 (raw file):

Previously, mjibson (Matt Jibson) wrote…

Ok. I assume this should be changed for computeHashJoinCost too, which is using leftRowCount+rightRowCount, instead of rowsProcessed? Help me understand the intuition here. I did it like this because I thought the function would have to be executed on each incoming row from either side? But I guess that doesn't make much sense because the geo functions have two arguments, so probably one from each side. row count vs rows processed apparently is still unclear to me.

Yeah, rowsProcessed estimates exactly what we need - how many rows we evaluate the ON condition on (a row here being a concatenation of a left row and a right row). The output row count is the same with the number of rows that pass the ON condition (for inner join).

In the worst case (a cross join), we may need to evaluate the ON condition on leftRowCount * rightRowCount rows.

An example where this really matters: imagine we have a join on a=x AND b=y and we have a hash join that hashes both columns and a lookup join but on an index only on a. The second one would generate a lot of rows that match on a, on which we'd have to check b=y. In this example, rowsProcessed would be equal to the row count for the hash case, but it would be much higher for the lookup case.

pkg/sql/opt/xform/coster.go, line 323 at r3 (raw file):

	// The filter has to be evaluated on each input row.
	inputRowCount := sel.Input.Relational().Stats.RowCount
	cost := memo.Cost(inputRowCount) * cpuCostFactor

I think we can remove this now, since computeFiltersCost already adds a cpuCostFactor for each one (and also for consistency with the joins where we rely only on computeFiltersCost).

madelynnblue · 2020-05-14T20:02:57Z

Added a new commit that adds limit hints and rows processed.

madelynnblue

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)

pkg/sql/opt/xform/coster.go, line 323 at r3 (raw file):

Previously, RaduBerinde wrote…

I think we can remove this now, since computeFiltersCost already adds a cpuCostFactor for each one (and also for consistency with the joins where we rely only on computeFiltersCost).

computeFiltersCost currently only adds per row costs for expensive geo functions. It only adds a single cpuCostFactor per filter condition, not per row. I think this needs to stay for now.

RaduBerinde

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @rytaft)

pkg/sql/opt/xform/coster.go, line 137 at r2 (raw file):

	hugeCost memo.Cost = 1e100

	// Some benchmarks showed that these geo functions were atleast 10

[nit] remove "these" (or replace with "some")
[nit] at least

pkg/sql/opt/xform/coster.go, line 323 at r3 (raw file):

Previously, mjibson (Matt Jibson) wrote…

computeFiltersCost currently only adds per row costs for expensive geo functions. It only adds a single cpuCostFactor per filter condition, not per row. I think this needs to stay for now.

I see, thanks!

pkg/sql/opt/xform/coster.go, line 508 at r5 (raw file):

	if required.LimitHint != 0 && lookupCount > 0 {
		outputRows := join.Relational().Stats.RowCount
		unlimitedLookupCount := lookupCount

Ha, you used the updated logic even before I merged my PR, nice!

andy-kimball

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mjibson and @rytaft)

pkg/sql/opt/xform/coster.go, line 541 at r5 (raw file):

func (c *coster) computeFiltersCost(
	filters memo.FiltersExpr, eqMap util.FastIntMap, rowCount float64,

Can we just add the filters cost into the perRowCost instead? I think it's awkward and surprising to have some of the filter operators take into account rowCount and some not. I bet making the existing EqOp cost be per-row instead of a one-time cost wouldn't change any plans, but would allow this API to be cleaner.

rytaft

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, 19 of 19 files at r3, 1 of 1 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @mjibson)

pkg/sql/opt/memo/statistics_builder.go, line 2360 at r5 (raw file):

		withoutOn := e.Memo().MemoizeLookupJoin(t.Input, nil /* on */, lookupJoinPrivate)
		return withoutOn.Relational().Stats.RowCount
	case *GeoLookupJoinExpr:

[nit] add a line above this case to be consistent with the formatting in the rest of this switch

pkg/sql/opt/xform/coster.go, line 494 at r5 (raw file):

	//   ab JOIN xy ON a=x AND a=10 AND x=10
	// which can become a lookup join with left-over condition x=10 which doesn't
	// actually filter anything.

Is there a way to make this comment more specific to geo lookup join? Maybe you could simplify the test case with NYC neighborhoods?

pkg/sql/opt/xform/coster.go, line 500 at r5 (raw file):

		// to be optimized twice, the coster should never be used after
		// logPropsBuilder.clear() is called.
		panic(errors.AssertionFailedf("could not get rows processed for lookup join"))

[nit] geo lookup join

pkg/sql/opt/xform/coster.go, line 541 at r5 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Can we just add the filters cost into the perRowCost instead? I think it's awkward and surprising to have some of the filter operators take into account rowCount and some not. I bet making the existing EqOp cost be per-row instead of a one-time cost wouldn't change any plans, but would allow this API to be cleaner.

The problem is when the estimated row count is very tiny, we need a way to prefer certain plans over others, which is why we have the "setup cost".

But I agree it's kind of awkward that this function now takes a row count, but it's only applied for the case of geo functions. I think it would be good to keep this function independent of the row count, but maybe always multiply the result by rowCount+1 instead of just rowCount. Does that change any plans?

Also add per-row filter costing. This changes some other costs but didn't result in any changed plans. Fixes #48214 Release note: None

This will account for expensive geo function executions in WHERE clauses.

Release note: None

madelynnblue

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @rytaft)

pkg/sql/opt/memo/statistics_builder.go, line 2360 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] add a line above this case to be consistent with the formatting in the rest of this switch

Done.

pkg/sql/opt/xform/coster.go, line 494 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is there a way to make this comment more specific to geo lookup join? Maybe you could simplify the test case with NYC neighborhoods?

Done.

pkg/sql/opt/xform/coster.go, line 500 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

[nit] geo lookup join

Done.

pkg/sql/opt/xform/coster.go, line 541 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

The problem is when the estimated row count is very tiny, we need a way to prefer certain plans over others, which is why we have the "setup cost".

But I agree it's kind of awkward that this function now takes a row count, but it's only applied for the case of geo functions. I think it would be good to keep this function independent of the row count, but maybe always multiply the result by rowCount+1 instead of just rowCount. Does that change any plans?

I've removed the row count argument and changed all callers of computeFiltersCost to multiply its output by their row count. This does indeed increase many costs (as we would expect since now we are assuming the base cost applies to all rows instead of just a one time setup), but didn't result in any changed plans in the tests.

rytaft

Reviewed 2 of 23 files at r6, 18 of 20 files at r7, 3 of 3 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andy-kimball and @mjibson)

pkg/sql/opt/xform/coster.go, line 323 at r8 (raw file):

	// The filter has to be evaluated on each input row.
	inputRowCount := sel.Input.Relational().Stats.RowCount
	cost := memo.Cost(inputRowCount) * cpuCostFactor

You can probably get rid of this cost now, since it's subsumed by computeFiltersCost

pkg/sql/opt/xform/coster.go, line 369 at r8 (raw file):

		rowsProcessed = join.Relational().Stats.RowCount
	}
	cost += memo.Cost(rowsProcessed) * cpuCostFactor

can get rid of this

pkg/sql/opt/xform/coster.go, line 389 at r8 (raw file):

		eqMap.Set(right, left)
	}
	cost += c.computeFiltersCost(*on, eqMap) * memo.Cost(rowsProcessed)

for all of these joins, I think you want to multiply by 1+rowsProcessed to account for the "setup cost" we had before

pkg/sql/opt/xform/coster.go, line 408 at r8 (raw file):

		panic(errors.AssertionFailedf("could not get rows processed for merge join"))
	}
	cost += memo.Cost(rowsProcessed) * cpuCostFactor

can get rid of this

pkg/sql/opt/xform/coster.go, line 481 at r8 (raw file):

		c.rowScanCost(join.Table, join.Index, numLookupCols)

	cost += memo.Cost(rowsProcessed) * perRowCost

can get rid of this, if you add the result of computeFiltersCost to the perRowCost

pkg/sql/opt/xform/coster.go, line 496 at r8 (raw file):

	// left-over conditions that are not selective.
	// A contrived example, where gid is a SERIAL PK:
	//   nyc_census_blocks c JOIN nyc_neighborhoods n ON ST_Intersects(c.geom, n.geom) AND c.gid = 10

[nit] wrap to 80 chars

pkg/sql/opt/xform/coster.go, line 498 at r8 (raw file):

	//   nyc_census_blocks c JOIN nyc_neighborhoods n ON ST_Intersects(c.geom, n.geom) AND c.gid = 10
	// which can become a lookup join with left-over condition c.gid = 10 which doesn't
	// actually filter anything.

The statement "which doesn't actually filter anything" isn't correct for this example (c.gid = 10 does in fact filter the output). But the example is still valid since we would need to evaluate the filter condition on every row processed by the GeoLookupJoin, and we need to account for this in the cost.

On a somewhat unrelated note, c.gid = 10 is such a selective filter that we'd probably never choose a GeoLookupJoin over a constrained primary index scan in this case. I'd suggest changing the filter in the example to something like c.gid < n.gid (still totally contrived, but at least more likely to result in a GeoLookupJoin with an ON condition).

pkg/sql/opt/xform/testdata/coster/join, line 539 at r8 (raw file):

 └── inner-join (lookup abcde@idx_abd)
      ├── columns: w:1!null x:2!null y:3!null z:4!null i:5!null j:6!null k:7!null l:8!null m:9!null n:10!null a:12!null b:13!null c:14!null
      ├── key columns: [1 2] = [12 13]

we lost one of the equality columns here -- I think this is the reason we needed the setup cost

madelynnblue · 2020-05-19T17:05:22Z

Closing due to too many unrelated changes happening at once. Getting all the geo stuff in and will do the other joins and select in another PR.

madelynnblue requested a review from rytaft May 14, 2020 17:44

madelynnblue requested a review from a team as a code owner May 14, 2020 17:44

RaduBerinde reviewed May 14, 2020

View reviewed changes

madelynnblue commented May 14, 2020

View reviewed changes

RaduBerinde reviewed May 14, 2020

View reviewed changes

madelynnblue commented May 14, 2020

View reviewed changes

RaduBerinde approved these changes May 15, 2020

View reviewed changes

andy-kimball reviewed May 15, 2020

View reviewed changes

rytaft reviewed May 18, 2020

View reviewed changes

madelynnblue added 3 commits May 18, 2020 12:19

opt: add cost for executing geo functions in filters

ed729a1

Also add per-row filter costing. This changes some other costs but didn't result in any changed plans. Fixes #48214 Release note: None

opt: compute filters cost in select

cd40b95

This will account for expensive geo function executions in WHERE clauses.

opt: teach geolookup join costing about limit hints and processed rows

7a2b14c

Release note: None

madelynnblue commented May 18, 2020

View reviewed changes

rytaft reviewed May 18, 2020

View reviewed changes

madelynnblue mentioned this pull request May 19, 2020

opt: add cost for executing geo functions in filters #49272

Merged

madelynnblue closed this May 19, 2020

madelynnblue deleted the geolookup-cost branch May 19, 2020 17:05

opt: add cost for executing geo functions in filters #49074

opt: add cost for executing geo functions in filters #49074

Uh oh!

Conversation

madelynnblue commented May 14, 2020

Uh oh!

cockroach-teamcity commented May 14, 2020

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue commented May 14, 2020

Uh oh!

madelynnblue left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

andy-kimball left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

madelynnblue commented May 19, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants