Skip to content

Conversation

@sid597
Copy link
Collaborator

@sid597 sid597 commented Nov 24, 2025

duplicate node alert on page

add to node create dialog, only show small list, fix lint errors, fix bug to show in all open pages

Summary by CodeRabbit

  • New Features
    • Added similarity matching display that shows related pages below the title input field
    • Results include relevance scores and are limited to 5 matches per search
    • Clicking a match opens the related page in the right sidebar for quick navigation

✏️ Tip: You can customize this high-level summary in your review settings.

https://www.loom.com/share/cb366c09942b4030a9933118ad281bc5

@linear
Copy link

linear bot commented Nov 24, 2025

@supabase
Copy link

supabase bot commented Nov 24, 2025

This pull request has been ignored for the connected project zytfjzqyijgagqxrzbmz because there are no changes detected in packages/database/supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.


Preview Branches by Supabase.
Learn more about Supabase Branching ↗︎.

Copy link
Collaborator Author

sid597 commented Nov 24, 2025

@sid597 sid597 changed the title duplicate node alert on page ENG-1074 Prod duplicate node alert on page Nov 24, 2025
@sid597 sid597 marked this pull request as ready for review November 24, 2025 10:18
@sid597
Copy link
Collaborator Author

sid597 commented Nov 24, 2025

@coderabbitai full review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 24, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 24, 2025

📝 Walkthrough

Walkthrough

This PR integrates vector similarity matching into the Roam note-taking application. It introduces new UI components and utilities to detect and display similar notes based on vector embeddings, with integration into the create-node dialog and page title observers for discourse nodes.

Changes

Cohort / File(s) Summary
New Vector Similarity Components
apps/roam/src/components/VectorDuplicateMatches.tsx
Adds collapsible panel component to display vector-matched similar nodes with loading states and result scoring; includes renderPossibleDuplicates helper for DOM injection into page header.
Node Context Hook
apps/roam/src/utils/useNodeContext.ts
New React hook that computes NodeContext for a Roam page title by locating matching DiscourseNode and extracting search text via format parsing.
Vector Search API
apps/roam/src/utils/hyde.ts
Updates SearchFunc type signature to accept simplified object structure; exports searchEmbeddings function; introduces VectorMatch type and findSimilarNodesVectorOnly function for vector-based node matching.
Create Node Dialog Integration
apps/roam/src/components/CreateNodeDialog.tsx
Adds debounced title state (500ms delay) and renders VectorDuplicateMatches component below title input with debounced title for duplicate detection.
Observer Integration
apps/roam/src/utils/initializeObserversAndListeners.ts
Integrates renderPossibleDuplicates calls for discourse nodes when page titles are processed via pageTitleObserver.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CreateNodeDialog
    participant Debounce
    participant VectorDuplicateMatches
    participant searchAPI[findSimilarNodesVectorOnly]
    participant Supabase
    
    User->>CreateNodeDialog: Type title
    CreateNodeDialog->>Debounce: Trigger 500ms debounce
    Note over Debounce: Wait 500ms for input stabilization
    Debounce->>VectorDuplicateMatches: Pass debouncedTitle
    VectorDuplicateMatches->>searchAPI: Fetch similar nodes
    searchAPI->>Supabase: Query vector embeddings
    Supabase-->>searchAPI: Return matching nodes with scores
    searchAPI-->>VectorDuplicateMatches: Return VectorMatch[] results
    VectorDuplicateMatches->>VectorDuplicateMatches: Render collapsible panel
    VectorDuplicateMatches-->>User: Display similar notes list
Loading
sequenceDiagram
    participant Observer
    participant initializeObserversAndListeners
    participant useNodeContext
    participant renderPossibleDuplicates
    participant DiscourseNode[DiscourseNode Utils]
    
    Observer->>initializeObserversAndListeners: Page title detected
    initializeObserversAndListeners->>DiscourseNode: Check if discourse node
    alt Is Discourse Node
        DiscourseNode-->>initializeObserversAndListeners: true
        initializeObserversAndListeners->>renderPossibleDuplicates: Call with H1 element
        renderPossibleDuplicates->>renderPossibleDuplicates: Mount VectorDuplicateMatches
        renderPossibleDuplicates-->>Observer: Component rendered in header
    else Not Discourse Node
        DiscourseNode-->>initializeObserversAndListeners: false
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • VectorDuplicateMatches.tsx: Complex component with state management, effect dependencies, DOM manipulation via ReactDOM.render, and vector search integration—requires verification of effect dependencies and state flow.
  • useNodeContext.ts: Hook logic involving discourse node lookup, title parsing via format expressions, and conditional context derivation—verify discourse node matching logic and context computation.
  • SearchFunc type signature change: Altering public API type from CandidateNodeWithEmbedding[] to { uid: string; text: string }[]—verify all usages are updated correctly and that the type change doesn't break existing callers.
  • Cross-file integration: Multiple integration points in CreateNodeDialog and initializeObserversAndListeners—verify proper prop passing, conditional logic for discourse nodes, and debounce behavior.

Possibly related PRs

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title 'ENG-1074 Prod duplicate node alert on page' accurately summarizes the main objective of the changeset: adding a duplicate node alert feature that displays on pages in production.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
apps/roam/src/utils/initializeObserversAndListeners.ts (1)

55-57: Discourse-node duplicate rendering integration looks correct; consider guarding against empty UIDs.

The flow of resolving the page UID via getPageUidByPageTitle and gating renderPossibleDuplicates behind isDiscourseNode(uid) is sound and matches how discourse-node UIDs are handled elsewhere. As a small polish, you could avoid calling isDiscourseNode when Roam returns an empty UID:

-      const uid = getPageUidByPageTitle(title);
-      if (isDiscourseNode(uid)) {
+      const uid = getPageUidByPageTitle(title);
+      if (uid && isDiscourseNode(uid)) {
         renderPossibleDuplicates(h1, title);
       }

Based on learnings, using getPageUidByPageTitle synchronously here is correct.

Also applies to: 89-92

apps/roam/src/utils/useNodeContext.ts (2)

13-36: Content extraction from formatted titles is well-aligned with placeholder semantics.

The logic of deriving placeholders from node.format, using getDiscourseNodeFormatExpression, and preferring a {content} placeholder (falling back to the first capture) is a good match for how format expressions are built. If you find yourself needing the same behavior outside Roam (e.g., the Obsidian util), consider lifting this into a shared extractContentFromTitle helper to avoid divergence between apps.


38-69: useNodeContext matching strategy looks solid; UID-first then title-based fallback.

Resolving the discourse node by pageUid via findDiscourseNode and then falling back to matchDiscourseNode({ ...node, title: pageTitle }) gives a sensible priority order, and returning null when nothing matches lets callers cleanly short‑circuit (as VectorDuplicateMatches does). If you expect empty or transient titles, you could optionally short‑circuit early in the effect when !pageTitle.trim() to skip getDiscourseNodes() work, but the current implementation is correct as is.

Based on learnings, using the Roam page UID with findDiscourseNode (where DiscourseNode.type is the UID field) is consistent with existing conventions.

apps/roam/src/components/VectorDuplicateMatches.tsx (2)

8-21: Avoid duplicating the vector match shape; import the shared type from hyde.ts.

VectorMatchItem mirrors the VectorMatch type exposed from hyde.ts, and the cast const vectorSearch = findSimilarNodesVectorOnly as (...) => Promise<VectorMatchItem[]>; will silently drift if the underlying type ever changes. To keep things aligned, consider reusing the exported type instead of re-declaring and casting:

-import type { Result } from "~/utils/types";
-import { findSimilarNodesVectorOnly } from "../utils/hyde";
+import type { Result } from "~/utils/types";
+import { findSimilarNodesVectorOnly, type VectorMatch } from "../utils/hyde";
@@
-type VectorMatchItem = {
-  node: Result;
-  score: number;
-};
+type VectorMatchItem = VectorMatch & { node: Result };
@@
-const vectorSearch = findSimilarNodesVectorOnly as (
-  params: VectorSearchParams,
-) => Promise<VectorMatchItem[]>;
+const vectorSearch = (params: VectorSearchParams) =>
+  findSimilarNodesVectorOnly(params);

95-157: UI rendering and DOM mounting for duplicate matches are consistent with the Roam environment.

The conditional rendering (no panel when activeContext is null, spinner vs. "No matches found" vs. list) reads cleanly, and the list items correctly open the matched node in the right sidebar. The renderPossibleDuplicates helper’s use of a dedicated container class plus data-page-title to reuse or recreate the mount point, and repositioning relative to the H1’s container, should keep the injected UI stable across title changes and re-renders in Roam’s header.

Also applies to: 160-199

apps/roam/src/utils/hyde.ts (1)

59-63: RPC signatures verified; defensive UID filtering remains a good practice.

The parameter names and JSON serialization approach are correct—p_query_embedding (string-serialized vector) and p_subset_roam_uids (text array) match the SQL function definition, and JSON.stringify() is the established pattern across the codebase for vector RPC calls. The findSimilarNodesVectorOnly function uses the same serialization strategy with match_content_embeddings.

The defensive filter suggestion still holds as a safeguard against unexpected falsy values:

-  const subsetRoamUids = indexData.map((node) => node.uid);
+  const subsetRoamUids = indexData
+    .map((node) => node.uid)
+    .filter((uid): uid is string => !!uid);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 434df94 and d574e8d.

📒 Files selected for processing (5)
  • apps/roam/src/components/CreateNodeDialog.tsx (3 hunks)
  • apps/roam/src/components/VectorDuplicateMatches.tsx (1 hunks)
  • apps/roam/src/utils/hyde.ts (3 hunks)
  • apps/roam/src/utils/initializeObserversAndListeners.ts (2 hunks)
  • apps/roam/src/utils/useNodeContext.ts (1 hunks)
🧰 Additional context used
🧠 Learnings (8)
📚 Learning: 2025-06-22T10:40:52.752Z
Learnt from: sid597
Repo: DiscourseGraphs/discourse-graph PR: 232
File: apps/roam/src/utils/getAllDiscourseNodesSince.ts:18-31
Timestamp: 2025-06-22T10:40:52.752Z
Learning: In apps/roam/src/utils/getAllDiscourseNodesSince.ts, the user confirmed that querying for `?title` with `:node/title` and mapping it to the `text` field in the DiscourseGraphContent type is the correct implementation for retrieving discourse node content from Roam Research, despite it appearing to query page titles rather than block text content.

Applied to files:

  • apps/roam/src/components/CreateNodeDialog.tsx
  • apps/roam/src/utils/useNodeContext.ts
  • apps/roam/src/utils/hyde.ts
  • apps/roam/src/utils/initializeObserversAndListeners.ts
📚 Learning: 2025-08-25T15:53:21.799Z
Learnt from: sid597
Repo: DiscourseGraphs/discourse-graph PR: 372
File: apps/roam/src/components/DiscourseNodeMenu.tsx:116-116
Timestamp: 2025-08-25T15:53:21.799Z
Learning: In apps/roam/src/components/DiscourseNodeMenu.tsx, when handling tag insertion, multiple leading hashtags (like ##foo) should be preserved as they represent user intent, not normalized to a single hashtag. The current regex /^#/ is correct as it only removes one leading # before adding one back, maintaining any additional hashtags the user intended.

Applied to files:

  • apps/roam/src/components/CreateNodeDialog.tsx
📚 Learning: 2025-06-17T23:37:45.289Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 220
File: apps/roam/src/utils/conceptConversion.ts:42-56
Timestamp: 2025-06-17T23:37:45.289Z
Learning: In the DiscourseNode interface from apps/roam/src/utils/getDiscourseNodes.ts, the field `type` serves as the unique identifier field, not a type classification field. The interface has no `uid` or `id` field, making `node.type` the correct field to use for UID-related operations.

Applied to files:

  • apps/roam/src/utils/useNodeContext.ts
📚 Learning: 2025-06-17T23:37:45.289Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 220
File: apps/roam/src/utils/conceptConversion.ts:42-56
Timestamp: 2025-06-17T23:37:45.289Z
Learning: In the DiscourseNode interface from apps/roam/src/utils/getDiscourseNodes.ts, the field `node.type` serves as the UID field rather than having a conventional `node.uid` field. This is an unusual naming convention where the type field actually contains the unique identifier.

Applied to files:

  • apps/roam/src/utils/useNodeContext.ts
📚 Learning: 2025-11-05T21:57:14.909Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 534
File: apps/roam/src/utils/createReifiedBlock.ts:40-48
Timestamp: 2025-11-05T21:57:14.909Z
Learning: In the discourse-graph repository, the function `getPageUidByPageTitle` from `roamjs-components/queries/getPageUidByPageTitle` is a synchronous function that returns a string directly (the page UID or an empty string if not found), not a Promise. It should be called without `await`.

Applied to files:

  • apps/roam/src/utils/useNodeContext.ts
  • apps/roam/src/utils/initializeObserversAndListeners.ts
📚 Learning: 2025-05-30T14:49:24.016Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 182
File: apps/website/app/utils/supabase/dbUtils.ts:22-28
Timestamp: 2025-05-30T14:49:24.016Z
Learning: In apps/website/app/utils/supabase/dbUtils.ts, expanding the KNOWN_EMBEDDINGS and DEFAULT_DIMENSIONS mappings to support additional embedding models requires corresponding database model changes (creating new embedding tables), which should be scoped as separate work from API route implementations.

Applied to files:

  • apps/roam/src/utils/hyde.ts
📚 Learning: 2025-05-20T14:04:19.632Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 165
File: packages/database/supabase/schemas/embedding.sql:66-95
Timestamp: 2025-05-20T14:04:19.632Z
Learning: In the `match_embeddings_for_subset_nodes` SQL function in packages/database/supabase/schemas/embedding.sql, the number of results is implicitly limited by the length of the input array parameter `p_subset_roam_uids` since the function filters content using `WHERE c.source_local_id = ANY(p_subset_roam_uids)`.

Applied to files:

  • apps/roam/src/utils/hyde.ts
📚 Learning: 2025-06-25T22:56:17.522Z
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 0
File: :0-0
Timestamp: 2025-06-25T22:56:17.522Z
Learning: In the Roam discourse-graph system, the existence of the configuration page (identified by DISCOURSE_CONFIG_PAGE_TITLE) and its corresponding UID is a system invariant. The code can safely assume this page will always exist, so defensive null checks are not needed when using `getPageUidByPageTitle(DISCOURSE_CONFIG_PAGE_TITLE)`.

Applied to files:

  • apps/roam/src/utils/initializeObserversAndListeners.ts
🧬 Code graph analysis (5)
apps/roam/src/components/VectorDuplicateMatches.tsx (3)
apps/roam/src/utils/types.ts (1)
  • Result (42-46)
apps/roam/src/utils/hyde.ts (1)
  • findSimilarNodesVectorOnly (539-590)
apps/roam/src/utils/useNodeContext.ts (2)
  • NodeContext (8-11)
  • useNodeContext (38-69)
apps/roam/src/components/CreateNodeDialog.tsx (1)
apps/roam/src/components/VectorDuplicateMatches.tsx (1)
  • VectorDuplicateMatches (23-158)
apps/roam/src/utils/useNodeContext.ts (2)
apps/obsidian/src/utils/extractContentFromTitle.ts (1)
  • extractContentFromTitle (3-10)
apps/obsidian/src/utils/getDiscourseNodeFormatExpression.ts (1)
  • getDiscourseNodeFormatExpression (1-9)
apps/roam/src/utils/hyde.ts (2)
apps/roam/src/utils/types.ts (1)
  • Result (42-46)
apps/roam/src/utils/supabaseContext.ts (1)
  • getLoggedInClient (95-112)
apps/roam/src/utils/initializeObserversAndListeners.ts (1)
apps/roam/src/components/VectorDuplicateMatches.tsx (1)
  • renderPossibleDuplicates (160-199)
🔇 Additional comments (3)
apps/roam/src/components/CreateNodeDialog.tsx (1)

15-15: Debounced duplicate search integration into the create dialog looks good.

The 500 ms debounced debouncedTitle and its use as both pageTitle and explicit text for VectorDuplicateMatches is a sensible way to surface duplicate suggestions without firing vector searches on every keystroke, and using limit={5} keeps the UI compact in the dialog context.

Also applies to: 37-52, 150-154

apps/roam/src/components/VectorDuplicateMatches.tsx (1)

37-82: Search triggering and cancellation logic is well-structured.

Using useNodeContext(pageTitle) combined with an override when text is provided, resetting hasSearched on activeContext.searchText changes, and guarding the fetch on isOpen && !hasSearched provides a clear, debounced search lifecycle. The isCancelled flag in the effect is a good safeguard against setting state after unmount or context changes; no changes needed here.

apps/roam/src/utils/hyde.ts (1)

534-590: Now I need to check the actual SQL definition to confirm the issue:

Based on my verification against Supabase documentation and pgvector standards, the embedding should be passed directly to the rpc() method, not stringified. Supabase documentation examples show passing query_embedding: embedding directly, and the Supabase client automatically serializes parameters as JSON. The JSON.stringify() call converts the embedding array to a string, which will cause a type mismatch with the SQL function's vector type parameter.

Remove JSON.stringify() from the embedding parameter in the RPC call

Change line 558:

const { data, error } = await supabase.rpc("match_content_embeddings", {
  query_embedding: queryEmbedding,
  match_threshold: threshold,
  match_count: limit,
});
⛔ Skipped due to learnings
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 503
File: apps/roam/src/utils/syncDgNodesToSupabase.ts:304-360
Timestamp: 2025-10-18T12:05:23.834Z
Learning: When inserting pgvector embeddings via Supabase (PostgREST) in TypeScript, use JSON.stringify(vector) to serialize the number[] array to a string, as PostgREST does not natively support the pgvector type. This is the official workaround for working with vector columns through the REST API.
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 182
File: apps/website/app/utils/supabase/dbUtils.ts:22-28
Timestamp: 2025-05-30T14:49:24.016Z
Learning: In apps/website/app/utils/supabase/dbUtils.ts, expanding the KNOWN_EMBEDDINGS and DEFAULT_DIMENSIONS mappings to support additional embedding models requires corresponding database model changes (creating new embedding tables), which should be scoped as separate work from API route implementations.
Learnt from: maparent
Repo: DiscourseGraphs/discourse-graph PR: 165
File: packages/database/supabase/schemas/embedding.sql:66-95
Timestamp: 2025-05-20T14:04:19.632Z
Learning: In the `match_embeddings_for_subset_nodes` SQL function in packages/database/supabase/schemas/embedding.sql, the number of results is implicitly limited by the length of the input array parameter `p_subset_roam_uids` since the function filters content using `WHERE c.source_local_id = ANY(p_subset_roam_uids)`.

@sid597 sid597 requested a review from mdroidian November 24, 2025 10:40
Copy link
Contributor

@mdroidian mdroidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start! But I think this would be worth re-looking at useNodeContext because there is a lot going on there that seems unnecessary, which will result in longer page load times.

let matchedNode: DiscourseNode | null = null;

if (pageUid) {
const found = findDiscourseNode(pageUid, discourseNodes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we already run isDiscourseNode on page load (which calls both getDiscourseNodes and findDiscourseNode)? We shouldn't run these again.

}

if (!matchedNode) {
matchedNode =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matchDiscourseNode is called by findDiscourseNode ... so this would be a third time that this is running.

@@ -0,0 +1,69 @@
import { useEffect, useState } from "react";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me that useNodeContext is actually needed.

Let's make sure we need this file and all of these queries. It seems like we might only need extractContentFromTitle, which should be placed in a util file because a similar version is already used in getExportTypes (getContentFromNodes).


try {
const supabase = await getLoggedInClient();
if (!supabase) return [];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this returns a false negative for those who haven't set up in our database yet. They might actually think we've searched and found nothing, but actually we haven't even searched.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EG
image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! We need to think through the database enabled features. Here we are checking if the client exists in supabase, but its wrong we need to check if the graph the embeddings are there in db.

And to check that we have only single route of suggestive mode, so should we be searching for if suggestive mode is enabled only then enable this feature, we can do this now but for future it could be that users don't want to enable suggestive mode but want duplicate node alert.

We need to decouple sync with suggestive mode setting until then we can use that.

@sid597 sid597 force-pushed the eng-1074-prod-duplicate-node-alert-using-vector-search branch from c860b45 to 49170f6 Compare December 1, 2025 11:27
@sid597 sid597 force-pushed the eng-1074-prod-duplicate-node-alert-using-vector-search branch from 49170f6 to 9c559c3 Compare December 7, 2025 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants