MDEV-32371 Deadlock between buf_page_get_zip() and buf_pool_t::corrupted_evict() #2866

Merged: dr-m merged 1 commit into 10.6 from 10.6-MDEV-32371 on Nov 30, 2023

dr-m commented Nov 23, 2023

  • The Jira issue number for this PR is: MDEV-32371

Description

buf_page_get_zip(): Do not wait for the page latch while holding hash_lock. If the latch is not available, ensure that any concurrent buf_pool_t::corrupted_evict() will be able to acquire the hash_lock, and then retry the lookup. If the page was corrupted and evicted, the retried lookup will not find it, so we will finally goto must_read_page, retry the read once more, and then report an error.
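
The locking pattern described above can be illustrated with a minimal standalone sketch. This is not the InnoDB code: `Page`, `Pool`, `get()` and `demo_concurrent_evict()` are hypothetical names, `std::shared_mutex` stands in for both the page latch and hash_lock, and a `std::shared_ptr` stands in for whatever keeps the block descriptor alive while hash_lock is released. The point is only the ordering: try the page latch without blocking, and if that fails, release hash_lock before waiting, so that an evictor that holds the page latch can take hash_lock.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>

// Hypothetical stand-ins for illustration; these are not InnoDB's types.
struct Page {
  std::shared_mutex latch;  // plays the role of the page latch
  bool corrupted = false;
};

struct Pool {
  std::shared_mutex hash_lock;  // plays the role of the page hash latch
  std::unordered_map<int, std::shared_ptr<Page>> hash;

  // Evict a corrupted page. The caller holds page.latch exclusively,
  // as a read-completion thread would.
  void corrupted_evict(int id, const Page &page) {
    assert(page.corrupted);
    std::unique_lock<std::shared_mutex> h(hash_lock);
    hash.erase(id);
  }

  // Return the page with its latch held in shared mode, or nullptr if the
  // page is absent (the caller would then have to read it from disk).
  std::shared_ptr<Page> get(int id) {
    for (;;) {
      std::shared_lock<std::shared_mutex> h(hash_lock);
      auto it = hash.find(id);
      if (it == hash.end())
        return nullptr;
      std::shared_ptr<Page> page = it->second;  // keeps the object alive
      if (page->latch.try_lock_shared())
        return page;  // latch acquired without blocking under hash_lock
      h.unlock();  // never wait for the latch while holding hash_lock
      page->latch.lock_shared();  // wait until the latch holder is done
      page->latch.unlock_shared();
      // Loop and repeat the lookup: the page may have been evicted.
    }
  }
};

// An I/O thread holds the page latch, finds the page corrupted and evicts
// it, while the calling thread looks up the same page.
bool demo_concurrent_evict() {
  Pool pool;
  auto page = std::make_shared<Page>();
  pool.hash[1] = page;
  std::atomic<bool> latched{false};
  std::thread io([&] {
    page->latch.lock();
    latched = true;
    page->corrupted = true;
    pool.corrupted_evict(1, *page);  // needs hash_lock exclusively
    page->latch.unlock();
  });
  while (!latched)
    std::this_thread::yield();
  bool evicted = pool.get(1) == nullptr;  // lookup is retried after the wait
  io.join();
  return evicted;
}
```

If `get()` instead called `lock_shared()` on the page latch while still holding `hash_lock`, the two threads in `demo_concurrent_evict()` would wait for each other indefinitely, which is the shape of the deadlock this PR addresses. In the sketch a `nullptr` result corresponds to falling through to the read path.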

How can this PR be tested?

I think that this is best tested together with MDEV-31817 #2865, which is included for the purpose of testing. The workload must use ROW_FORMAT=COMPRESSED tables, and we might want to use CMAKE_BUILD_TYPE=RelWithDebInfo.

Basing the PR against the correct MariaDB version

  • This is a new feature and the PR is based against the latest MariaDB development branch.
  • This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m requested a review from Thirunarayanan November 23, 2023 10:19
@dr-m dr-m self-assigned this Nov 23, 2023

dr-m commented Nov 23, 2023

While we do have SET GLOBAL DEBUG_DBUG='+d,intermittent_read_failure'; it does not look like this fault injection has been implemented in buf_page_t::read_complete(), which we would need in order to trigger this bug.

However, if a concurrent read of the same block is in progress, the changed code path should be exercised. To test this, a tiny innodb_buffer_pool_size would be beneficial.

…ted_evict()

buf_page_get_zip(): Do not wait for the page latch while holding hash_lock.
If the latch is not available, ensure that any concurrent
buf_pool_t::corrupted_evict() will be able to acquire the hash_lock,
and then retry the lookup. If the page was corrupted, we will finally
"goto must_read_page", retry the read once more, and then report an error.

Reviewed by: Thirunarayanan Balathandayuthapani