MDEV-32371 Deadlock between buf_page_get_zip() and buf_pool_t::corrupted_evict() by dr-m · Pull Request #2866 · MariaDB/server

dr-m · 2023-11-23T10:19:58Z

The Jira issue number for this PR is: MDEV-32371

Description

buf_page_get_zip(): Do not wait for the page latch while holding hash_lock. If the latch is not available, ensure that any concurrent buf_pool_t::corrupted_evict() will be able to acquire the hash_lock, and then retry the lookup. If the page was corrupted and evicted, we will finally goto must_read_page, retry the read once more, and then report an error.

How can this PR be tested?

I think that this is best tested together with MDEV-31817 #2865, which is included for the purpose of testing. The workload must use ROW_FORMAT=COMPRESSED tables, and we might want to use CMAKE_BUILD_TYPE=RelWithDebInfo.

Basing the PR against the correct MariaDB version

This is a new feature and the PR is based against the latest MariaDB development branch.
This is a bug fix and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

CLAassistant · 2023-11-23T10:20:04Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

dr-m · 2023-11-23T10:25:36Z

While we do have SET GLOBAL DEBUG_DBUG='+d,intermittent_read_failure', it does not look like this fault injection has been implemented in buf_page_t::read_complete(), which we would need in order to trigger this bug.

However, if a concurrent read of the same block is in progress, the changed code path should be exercised. To test this, a tiny innodb_buffer_pool_size would be beneficial.

…ted_evict() buf_page_get_zip(): Do not wait for the page latch while holding hash_lock. If the latch is not available, ensure that any concurrent buf_pool_t::corrupted_evict() will be able to acquire the hash_lock, and then retry the lookup. If the page was corrupted, we will finally "goto must_read_page", retry the read once more, and then report an error. Reviewed by: Thirunarayanan Balathandayuthapani

dr-m requested a review from Thirunarayanan November 23, 2023 10:19

dr-m self-assigned this Nov 23, 2023

Thirunarayanan approved these changes Nov 28, 2023

dr-m force-pushed the 10.6-MDEV-32371 branch from 963a88b to bb511de November 30, 2023 08:36

dr-m merged commit bb511de into 10.6 Nov 30, 2023

dr-m deleted the 10.6-MDEV-32371 branch November 30, 2023 10:01

adityahase mentioned this pull request Jan 29, 2024

fix: Install patched MariaDB packages from packages.frappe.cloud frappe/press#1398

Merged
