
Conversation

@karthikps97
Member

@karthikps97 karthikps97 commented Nov 20, 2025

Describe the Problem

When dedup checks for existing chunks in the system, the query filters by system, dedup_key, and bucket. When the table is large, this index is not used by the PostgreSQL planner for that query.
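
For context, the lookup's filter has roughly the following shape. This is a paraphrased sketch based on the EXPLAIN ANALYZE output later in this thread; the variable names are illustrative, not the exact md_store code:

// Illustrative dedup lookup filter (system_id, bucket_id, and dedup_keys are placeholders):
const filter = {
    system: system_id,                // _id of the owning system
    bucket: bucket_id,                // _id of the target bucket
    dedup_key: { $in: dedup_keys },   // candidate chunk digests computed for the upload
    deleted: null,                    // only live (non-deleted) chunks participate in dedup
};

With a single-column index on dedup_key only, the planner has to combine that index with others and filter the system/bucket predicates on the heap, which is what the "Before" EXPLAIN output further down shows.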

Explain the Changes

  1. Added an upgrade script that drops the index 'idx_btree_datachunks_dedup_key'.
  2. A new composite index on (dedup_key, system, bucket) is created during bootstrapping; a sketch of the definition follows below.
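
For illustration, the updated entry in data_chunk_indexes.js is expected to look roughly like the following. The field order mirrors the description above; everything outside the fields block is carried over from the existing entry and is noted here only in comments:

{
    // The resulting PG index is idx_btree_datachunks_dedup_key_system_bucket (see the EXPLAIN output later in this thread).
    // The partialFilterExpression and non-unique setting of the old dedup_key entry are kept as-is.
    fields: {
        dedup_key: 1,
        system: 1,
        bucket: 1,
    },
},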

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  1. Upgrade noobaa from 4.20 to 4.21.
  2. After a successful upgrade, compare the time taken by uploads with dedup enabled.
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • Chores
    • Updated a database index to a composite index by including additional fields to improve query targeting and performance.
    • Added an upgrade script for v5.21.0 to remove a deprecated index during migration, ensuring smoother upgrades and index replacement.


@coderabbitai

coderabbitai bot commented Nov 20, 2025

Walkthrough

The pull request expands the dedup_key index in src/server/object_services/schemas/data_chunk_indexes.js into a composite index that includes system and bucket alongside dedup_key, and adds an upgrade script src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js that drops the old idx_btree_datachunks_dedup_key index during migration.

Changes

Index Schema Update (src/server/object_services/schemas/data_chunk_indexes.js):
Modified the second index definition: changed the indexed fields from { dedup_key: 1 } to { dedup_key: 1, system: 1, bucket: 1 }. Index name, partialFilterExpression, and non-unique setting remain unchanged.

Migration Script (src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js):
Added upgrade script exporting run and description; run drops the old index idx_btree_datachunks_dedup_key via DB pool, logs success, and propagates errors after logging.
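
Putting these pieces together, the migration script is shaped roughly as follows. This is a minimal sketch assuming the run({ dbg, db_client }) contract shown in the sequence diagram below; the actual try/catch body is quoted in a review comment further down, and the description wording here is a placeholder:

'use strict';

const indexName = 'idx_btree_datachunks_dedup_key';

async function run({ dbg, db_client }) {
    try {
        const pool = db_client.instance().get_pool();
        // DROP INDEX IF EXISTS keeps the script idempotent if the upgrade is retried.
        await pool.query(`DROP INDEX IF EXISTS ${indexName};`);
        dbg.log0('Executed upgrade script for dropping index ', indexName);
    } catch (err) {
        dbg.error('An error occurred while dropping index', indexName, err);
        throw err;
    }
}

module.exports = {
    run,
    description: 'Drop the old idx_btree_datachunks_dedup_key index', // placeholder wording
};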

Sequence Diagram(s)

sequenceDiagram
    rect rgb(245,250,255)
    participant Upgrader
    participant UpgradeScript
    participant DB as Database
    note right of UpgradeScript: Migration drops old btree index before bootstrap recreates composite index
    Upgrader->>UpgradeScript: run({ dbg, db_client })
    UpgradeScript->>DB: DROP INDEX IF EXISTS idx_btree_datachunks_dedup_key
    alt success
        DB-->>UpgradeScript: OK
        UpgradeScript-->>Upgrader: log success
    else error
        DB-->>UpgradeScript: ERROR
        UpgradeScript-->>Upgrader: dbg.error(...) and rethrow
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review points:
    • Confirm composite index field order matches query patterns and selector usage.
    • Validate partialFilterExpression and index naming consistency with bootstrap that creates the new composite index.
    • Ensure the upgrade script runs in the correct upgrade phase and has required DB privileges and error handling.

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and specifically describes the main change: introducing a composite index for datachunks to improve performance.
  • Description check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.


@karthikps97 karthikps97 linked an issue Nov 20, 2025 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js (1)

4-15: Upgrade script behavior looks good; consider fixing small log typo

The script correctly drops the old index in an idempotent way and surfaces errors via dbg.error and rethrow, which is appropriate for an upgrade step. One small polish item: the error message has a typo ("ocurred" → "occurred"), which you may want to fix for clearer logs.

A minimal diff for the message:

-    dbg.error('An error ocurred in the upgrade process:', err);
+    dbg.error('An error occurred in the upgrade process:', err);
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 418bb3b and 3d9f7ac.

📒 Files selected for processing (2)
  • src/server/object_services/schemas/data_chunk_indexes.js (1 hunks)
  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:6-22
Timestamp: 2025-08-11T06:12:12.318Z
Learning: In the noobaa-core upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js, bucket migration from the internal mongo pool to a new default pool is planned to be handled in separate future PRs with comprehensive testing, rather than being included directly in the pool removal script.
📚 Learning: 2025-08-08T13:12:46.728Z
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:9-17
Timestamp: 2025-08-08T13:12:46.728Z
Learning: In upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js for noobaa-core, rely on structural detection (e.g., pool.mongo_info, and resource_type === 'INTERNAL') with name-prefix fallback for removing legacy mongo/internal pools, instead of depending solely on config.INTERNAL_STORAGE_POOL_NAME or config.DEFAULT_POOL_NAME. Handle multi-system stores and remove all matching pools in one change.

Applied to files:

  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js
📚 Learning: 2025-08-11T06:12:12.318Z
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9182
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:6-22
Timestamp: 2025-08-11T06:12:12.318Z
Learning: In the noobaa-core upgrade script src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js, bucket migration from the internal mongo pool to a new default pool is planned to be handled in separate future PRs with comprehensive testing, rather than being included directly in the pool removal script.

Applied to files:

  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js
📚 Learning: 2025-08-26T06:29:52.095Z
Learnt from: naveenpaul1
Repo: noobaa/noobaa-core PR: 9195
File: src/upgrade/upgrade_scripts/5.20.0/remove_mongo_pool.js:25-31
Timestamp: 2025-08-26T06:29:52.095Z
Learning: In noobaa-core upgrade scripts, there is only one internal/mongo pool per system, so removing mongo_pools[0] is sufficient and correct rather than handling multiple pools.

Applied to files:

  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
  • GitHub Check: Build Noobaa Image
🔇 Additional comments (1)
src/server/object_services/schemas/data_chunk_indexes.js (1)

18-22: Verification confirms the original concern: dedup_key-only queries will lose index optimization

The ripgrep output reveals critical queries in get_aprox_dedup_keys_number() (md_store.js lines 1734–1761) that filter and sort only on dedup_key, without system or bucket:

  • Line 1740: estimatedQueryCount({ dedup_key: { $exists: true } })
  • Line 1747: Range query dedup_key: { $lt: marker, $exists: true }
  • Lines 1751, 1754: Sorting by dedup_key: 1 and dedup_key: -1

Since the new index has leading columns (system, bucket), these dedup_key-only queries will not benefit from it. The old single-column index was optimal for this access pattern. This function is called by the dedup_indexer background service (line 39 in dedup_indexer.js), so the performance impact is in production code.

Before rollout, verify either: (1) that these dedup_key-only queries are not performance-sensitive, or (2) that a separate single-column dedup_key index will be retained or created to maintain performance.
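
For reference, the dedup_key-only access patterns the comment refers to boil down to filters and sort specs like these (paraphrased from the cited md_store.js lines, not copied verbatim):

// Count of all chunks that still carry a dedup_key:
const count_filter = { dedup_key: { $exists: true } };

// Range scan walking keys downward from a marker while paging:
const range_filter = { dedup_key: { $lt: marker, $exists: true } };

// Sort specs used when paging through the keys:
const sort_asc = { dedup_key: 1 };
const sort_desc = { dedup_key: -1 };

None of these touch system or bucket, so whether they can still use the new index depends on its column order, which is exactly what this comment asks to verify before rollout.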

@coderabbitai

coderabbitai bot commented Nov 20, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #9296

coderabbitai bot added a commit that referenced this pull request Nov 20, 2025
Docstrings generation was requested by @karthikps97.

* #9293 (comment)

The following files were modified:

* `src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js`

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d9f7ac and 39465ee.

📒 Files selected for processing (1)
  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-package-lock-validation
  • GitHub Check: run-jest-unit-tests

Comment on lines +9 to +17
try {
    const pool = db_client.instance().get_pool();
    await pool.query(`DROP INDEX IF EXISTS ${indexName};`);

    dbg.log0("Executed upgrade script for dropping index ", indexName);
} catch (err) {
    dbg.error('An error ocurred in the upgrade process:', err);
    throw err;
}


⚠️ Potential issue | 🟡 Minor

Tighten error logging message (typo and missing context).

Functionally this looks good and idempotent; the only nit is the error log text. There’s a typo, and the message doesn’t mention which index failed, which can hinder debugging.

-  } catch (err) {
-    dbg.error('An error ocurred in the upgrade process:', err);
-    throw err;
-  }
+  } catch (err) {
+    dbg.error('An error occurred while dropping index', indexName, err);
+    throw err;
+  }

@liranmauda
Contributor

We need to test the upgrade from 4.20 to 4.21 with a lot of records.

@karthikps97
Member Author

karthikps97 commented Nov 25, 2025

Tested with 1 million+ records in datachunks. These are the results:

Query: explain analyze SELECT * FROM datachunks WHERE (data->>'system'='69252cdff617af002209ecd5' and data->>'bucket'='69252cdff617af002209ecdd' and (data->>'dedup_key' IN ('WmlmtwV50YV//LfuCIm+EIlmbLdhUfCs0WJj+yp0l+0=','3OfkqbNFPXrU4odaUHK6A+mu7CwSVmnc09nVdLE0M2A=','WxcA5eoNJEYMic3n/qlU/og4PTP1bYXSDtA4E3m5YJc=','d4k9vAgAlFwviIWz/wFY0q3sECypiiCi6pF6T0fxhD0=') and data ? 'dedup_key') and (data->'deleted' IS NULL OR data->'deleted' = 'null'::jsonb)) ORDER BY data->>'_id' DESC;

Before (With only dedup_key index):

Sort  (cost=184.34..184.34 rows=1 width=1418) (actual time=37.361..37.362 rows=4 loops=1)
   Sort Key: ((data ->> '_id'::text)) DESC
   Sort Method: quicksort  Memory: 30kB
   ->  Bitmap Heap Scan on datachunks  (cost=182.05..184.33 rows=1 width=1418) (actual time=37.352..37.355 rows=4 loops=1)
         Recheck Cond: (((data ->> 'dedup_key'::text) = ANY ('{WmlmtwV50YV//LfuCIm+EIlmbLdhUfCs0WJj+yp0l+0=,3OfkqbNFPXrU4odaUHK6A+mu7CwSVmnc09nVdLE0M2A=,WxcA5eoNJEYMic3n/qlU/og4PTP1bYXSDtA4E3m5YJc=,d4k9vAgAlFwviIWz/wFY0q3sECypiiCi6pF6T0fxhD0=}'::text[])) AND (data ? 'dedup_key'::text) AND (((data -> 'deleted'::text) IS NULL) OR ((data -> 'deleted'::text) = 'null'::jsonb)))
         Filter: (((data ->> 'system'::text) = '69252cdff617af002209ecd5'::text) AND ((data ->> 'bucket'::text) = '69252cdff617af002209ecdd'::text))
         Heap Blocks: exact=2
         ->  BitmapAnd  (cost=182.05..182.05 rows=2 width=0) (actual time=37.342..37.343 rows=0 loops=1)
               ->  Bitmap Index Scan on idx_btree_datachunks_dedup_key  (cost=0.00..7.66 rows=207 width=0) (actual time=0.027..0.027 rows=4 loops=1)
                     Index Cond: ((data ->> 'dedup_key'::text) = ANY ('{WmlmtwV50YV//LfuCIm+EIlmbLdhUfCs0WJj+yp0l+0=,3OfkqbNFPXrU4odaUHK6A+mu7CwSVmnc09nVdLE0M2A=,WxcA5eoNJEYMic3n/qlU/og4PTP1bYXSDtA4E3m5YJc=,d4k9vAgAlFwviIWz/wFY0q3sECypiiCi6pF6T0fxhD0=}'::text[]))
               ->  Bitmap Index Scan on idx_btree_datachunks_id_desc  (cost=0.00..174.14 rows=10323 width=0) (actual time=37.312..37.312 rows=1209940 loops=1)
 Planning Time: 0.071 ms
 Execution Time: 37.383 ms

After (dedup_key, system, bucket):

Sort  (cost=7.78..7.79 rows=1 width=1443) (actual time=0.067..0.067 rows=4 loops=1)
   Sort Key: ((data ->> '_id'::text)) DESC
   Sort Method: quicksort  Memory: 30kB
   ->  Index Scan using idx_btree_datachunks_dedup_key_system_bucket on datachunks  (cost=0.55..7.77 rows=1 width=1443) (actual time=0.023..0.053 rows=4 loops=1)
         Index Cond: (((data ->> 'dedup_key'::text) = ANY ('{WmlmtwV50YV//LfuCIm+EIlmbLdhUfCs0WJj+yp0l+0=,3OfkqbNFPXrU4odaUHK6A+mu7CwSVmnc09nVdLE0M2A=,WxcA5eoNJEYMic3n/qlU/og4PTP1bYXSDtA4E3m5YJc=,d4k9vAgAlFwviIWz/wFY0q3sECypiiCi6pF6T0fxhD0=}'::text[])) AND ((data ->> 'system'::text) = '69252cdff617af002209ecd5'::text) AND ((data ->> 'bucket'::text) = '69252cdff617af002209ecdd'::text))
         Filter: (((data -> 'deleted'::text) IS NULL) OR ((data -> 'deleted'::text) = 'null'::jsonb))
 Planning Time: 0.137 ms
 Execution Time: 0.085 ms
(8 rows)

Member

@dannyzaken dannyzaken left a comment


@karthikps97, this looks good by itself.
Can you please go over the code in src/server/bg_services/dedup_indexer.js? This is a background worker intended to control the size of the dedup index by removing the dedup_key from chunks when they reach a specific size. This is old code from when we still used MongoDB, and it hasn't been touched much since. Try looking into the queries it does and make sure your changes did not affect it for the worse.
Unrelated to that, we should probably reconsider its purpose.
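
For readers unfamiliar with that worker: conceptually it trims the dedup index by clearing the dedup_key field on chunks once a size threshold is crossed. A loose illustration of that update shape (not the actual md_store call):

// Loose illustration of the dedup_indexer's trimming step (not the real code):
const trim_update = { $unset: { dedup_key: true } };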

@karthikps97
Member Author

I went through the module. But since we are using Postgres rather than MongoDB, the stats method in postgres_client is still a stub, so there is no functionality that is affected. And yes, the indexer module is outdated since we no longer use MongoDB, but I can still update the index name there. What would you suggest, @dannyzaken?

@dannyzaken
Copy link
Member

@karthikps97, You are correct that stats is still a stub in the postgres_client, but AFAIR, its value is only used for log messages, and the rest of the code is functional (although I don't think we've reevaluated it since moving from MongoDB).
Handling the dedup indexer bg is out of scope for this PR, but we should ensure that changing the index does not affect it negatively.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/server/object_services/schemas/data_chunk_indexes.js (1)

17-22: Consider adding postgres: true flag for consistency.

The first index explicitly declares postgres: true, but this composite index does not. Given that MongoDB is no longer used (per PR comments) and this change targets PostgreSQL performance, consider adding the flag for consistency and clarity.

Apply this diff if desired:

 {
+    postgres: true,
     fields: {
         dedup_key: 1,
         system: 1,
         bucket: 1
     },
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e937e8b and a207558.

📒 Files selected for processing (2)
  • src/server/object_services/schemas/data_chunk_indexes.js (1 hunks)
  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/upgrade/upgrade_scripts/5.21.0/remove_datachunks_index.js
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-package-lock-validation
  • GitHub Check: run-jest-unit-tests

@karthikps97 karthikps97 merged commit f17a42a into noobaa:master Dec 3, 2025
29 of 32 checks passed


Development

Successfully merging this pull request may close these issues.

MDStore.find_chunks_by_dedup_key is not hitting indexes

3 participants