Skip to content

refactor(hunspell): rename package terminology to generic ref_path#21516

Open
shayush622 wants to merge 1 commit intoopensearch-project:mainfrom
shayush622:refactor/hunspell-generic-naming
Open

refactor(hunspell): rename package terminology to generic ref_path#21516
shayush622 wants to merge 1 commit intoopensearch-project:mainfrom
shayush622:refactor/hunspell-generic-naming

Conversation

@shayush622
Copy link
Copy Markdown
Contributor

Description

Renames package-specific terminology in the hunspell dictionary loading API to generic ref_path terminology, and splits the single input validator into type-specific validators with appropriate character
allowlists.

Motivation: The original API was named around "packages" (getDictionaryFromPackage, buildPackageCacheKey, validatePackageIdentifier) but the feature is a generic directory-based dictionary loading mechanism.
This rename makes the API self-documenting and prepares for follow-up work that allows nested paths (e.g., analyzers/my-dict).

Changes

HunspellService.java — renames getDictionaryFromPackage → getDictionaryFromRefPath, loadDictionaryFromPackage → loadDictionaryFromRefPath, buildPackageCacheKey → buildRefPathCacheKey; extracts
CACHE_KEY_SEPARATOR constant; updates path resolution to config/{ref_path}/hunspell/{locale}/; updates all javadoc and comments

HunspellTokenFilterFactory.java — splits validatePackageIdentifier into validateRefPath (allows / for nested paths) and validateLocale (allows dots, disallows /); adds validateAgainstPattern helper with
per-param error messages; updates call sites and comments

HunspellServiceTests.java — renames all test methods from Package to RefPath terminology; updates API calls and assertions

HunspellTokenFilterFactoryTests.java — renames validator tests; updates API calls to use validateRefPath/validateLocale

No behavioral change for existing functionality. All 43 tests pass.

Related

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here
(https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 671475a.

PathLineSeverityDescription
server/src/main/java/org/opensearch/indices/analysis/HunspellService.java211mediumPath containment boundary relaxed from config/analyzers/ to config/. Old code enforced that the resolved package directory was exactly one level under config/analyzers/ (preventing foo/../bar escapes) and that the path stayed under config/analyzers/. New code only checks the resolved ref_path stays under config/ (the broader config root), allowing dictionary loading from any config subdirectory including areas never intended for user-controlled paths.
server/src/test/java/org/opensearch/indices/analyze/HunspellServiceTests.java355lowSecurity test coverage silently removed: testSlashInPackageId verified that slashes in the identifier were explicitly rejected and threw an exception. It was renamed to testNonExistentRefPathThrowsException, which tests unrelated behavior (non-existent path). The behavioral change — slashes are now permitted in ref_path — is a deliberate security boundary shift, but is not explicitly documented as an intentional decision and the old rejection test is gone.

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@shayush622 shayush622 force-pushed the refactor/hunspell-generic-naming branch from 85e1b05 to 387c944 Compare May 6, 2026 17:14
@cwperks cwperks added the skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. label May 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Reviewer Guide 🔍

(Review updated until commit 671475a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Rename hunspell API from package to ref_path terminology

Relevant files:

  • server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java
  • server/src/main/java/org/opensearch/indices/analysis/HunspellService.java

Sub-PR theme: Update tests for ref_path terminology and add test fixtures

Relevant files:

  • server/src/test/java/org/opensearch/index/analysis/HunspellTokenFilterFactoryTests.java
  • server/src/test/java/org/opensearch/indices/analyze/HunspellServiceTests.java
  • server/src/test/resources/indices/analyze/conf_dir/analyzers/test-dict/hunspell/en_US/en_US.aff
  • server/src/test/resources/indices/analyze/conf_dir/analyzers/test-dict/hunspell/en_US/en_US.dic

⚡ Recommended focus areas for review

Regex Boundary

The SAFE_REF_PATH_PATTERN regex ^[a-zA-Z0-9][a-zA-Z0-9_/-]*[a-zA-Z0-9]$|^[a-zA-Z0-9]$ allows consecutive forward slashes (e.g., "foo//bar") and trailing/leading slashes (e.g., "/foo" is rejected but "foo/" is also rejected only because the last char must be alphanumeric). However, paths like "foo//bar" would pass the pattern and could cause unexpected behavior during path resolution. Consider adding an explicit check for consecutive slashes.

private static final Pattern SAFE_REF_PATH_PATTERN = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_/-]*[a-zA-Z0-9]$|^[a-zA-Z0-9]$");
Check Order

In validateAgainstPattern, the .. check comes after the pattern match check. However, the SAFE_REF_PATH_PATTERN does not include dots at all, so foo..bar would already be rejected by the pattern check with "Only alphanumeric characters, hyphens, underscores, and forward slashes are allowed." — the .. check would never be reached for ref_path. For locale, dots are also not in SAFE_LOCALE_PATTERN, so the .. check is dead code. This could be confusing and the .. check and its error message about "Consecutive dots" would never actually trigger.

// Additional check: reject ".." sequences even within otherwise valid characters (e.g., "foo..bar")
if (value.contains("..")) {
    throw new IllegalArgumentException(
        String.format(Locale.ROOT, "Invalid %s: [%s]. Consecutive dots ('..') are not allowed.", paramName, value)
    );
}
Path Resolution

In loadDictionaryFromRefPath, the path is resolved as env.configDir().resolve(refPath). If refPath contains a /-separated path like "analyzers/my-dict", this relies on the OS path separator behavior. On Windows, resolve("analyzers/my-dict") may not work as expected since forward slashes may not be treated as path separators. Consider using Paths.get(refPath) or splitting on / and resolving each segment individually for cross-platform safety.

Path refDir = env.configDir().resolve(refPath);

// Security check: ensure resolved path stays under config directory
Path configDirAbsolute = env.configDir().toAbsolutePath().normalize();
Path refDirAbsolute = refDir.toAbsolutePath().normalize();
if (!refDirAbsolute.startsWith(configDirAbsolute)) {
Weak Assertion

Several tests (e.g., testPathTraversalInRefPath, testNonExistentRefPathThrowsException, testPathTraversalInLocale, testBackslashInLocale) only assert assertNotNull(e) on the caught exception without verifying the exception message or type. This means the test would pass even if a completely unrelated exception is thrown, providing weak security guarantees for path traversal protection.

    Exception e = expectThrows(Exception.class, () -> hunspellService.getDictionaryFromRefPath("..", "en_US"));
    assertNotNull(e);
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

PR Code Suggestions ✨

Latest suggestions up to 671475a
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Security
Reject absolute ref_path before path resolution

The path traversal check only verifies that refDir stays under configDir, but does
not verify that the locale-resolved dicDir stays under refHunspellDir before the
later check. More importantly, the refPath is resolved directly via
env.configDir().resolve(refPath) — if refPath is an absolute path (e.g.,
"/etc/passwd"), Path.resolve() will discard the base and resolve to the absolute
path, bypassing the security check. An explicit check that refPath is not absolute
should be added.

server/src/main/java/org/opensearch/indices/analysis/HunspellService.java [203-208]

+// Reject absolute paths before resolution
+Path refPathObj = java.nio.file.Paths.get(refPath);
+if (refPathObj.isAbsolute()) {
+    throw new IllegalArgumentException(
+        String.format(Locale.ROOT, "ref_path must be a relative path. ref_path: [%s]", refPath)
+    );
+}
+
 Path refDir = env.configDir().resolve(refPath);
 
 // Security check: ensure resolved path stays under config directory
 Path configDirAbsolute = env.configDir().toAbsolutePath().normalize();
 Path refDirAbsolute = refDir.toAbsolutePath().normalize();
 if (!refDirAbsolute.startsWith(configDirAbsolute)) {
Suggestion importance[1-10]: 8

__

Why: This is a real security concern: Path.resolve() with an absolute path discards the base directory entirely, meaning an absolute refPath like "/etc/passwd" would bypass the startsWith(configDirAbsolute) check. Adding an explicit absolute-path rejection before resolution closes this bypass. While validateRefPath in the filter factory should prevent this at the API level, the defense-in-depth check in loadDictionaryFromRefPath is important since the method is also callable directly.

Medium
Prevent consecutive or misplaced slashes in ref_path

The SAFE_REF_PATH_PATTERN allows consecutive forward slashes (e.g.,
"analyzers//my-dict") and trailing/leading slashes in path segments, which could
cause unexpected path resolution behavior. Consider tightening the pattern to
disallow consecutive slashes and ensure each path segment starts and ends with an
alphanumeric character.

server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java [131]

-private static final Pattern SAFE_REF_PATH_PATTERN = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_/-]*[a-zA-Z0-9]$|^[a-zA-Z0-9]$");
+private static final Pattern SAFE_REF_PATH_PATTERN = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_-]*(?:/[a-zA-Z0-9][a-zA-Z0-9_-]*)*$");
Suggestion importance[1-10]: 6

__

Why: The current SAFE_REF_PATH_PATTERN allows inputs like "analyzers//my-dict" (consecutive slashes) or paths starting/ending with /, which could cause unexpected behavior during path resolution. The improved pattern enforces proper segment boundaries, though the security impact is partially mitigated by the downstream path traversal checks in HunspellService.

Low
General
Remove redundant double-dot check after pattern validation

The ".." check is redundant for validateLocale since the SAFE_LOCALE_PATTERN already
disallows dots entirely. More critically, for validateRefPath, the
SAFE_REF_PATH_PATTERN also already disallows dots, making the ".." check dead code
in both cases. The ".." check should be removed or the patterns should be updated to
actually permit dots so the check is meaningful.

server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java [174-183]

 if (!pattern.matcher(value).matches()) {
     throw new IllegalArgumentException(String.format(Locale.ROOT, "Invalid %s: [%s]. %s", paramName, value, allowedDesc));
 }
 
-// Additional check: reject ".." sequences even within otherwise valid characters (e.g., "foo..bar")
-if (value.contains("..")) {
-    throw new IllegalArgumentException(
-        String.format(Locale.ROOT, "Invalid %s: [%s]. Consecutive dots ('..') are not allowed.", paramName, value)
-    );
-}
-
Suggestion importance[1-10]: 5

__

Why: Both SAFE_REF_PATH_PATTERN and SAFE_LOCALE_PATTERN already disallow dots entirely (neither pattern includes . in the allowed character set), making the value.contains("..") check dead code that can never be reached. Removing it reduces confusion and code clutter.

Low

Previous suggestions

Suggestions up to commit 387c944
CategorySuggestion                                                                                                                                    Impact
Security
Prevent consecutive or trailing slashes in ref_path

The current pattern allows consecutive slashes (e.g., "analyzers//my-dict") and
trailing/leading slashes (e.g., "/analyzers/my-dict" matches the second alternative
only for single chars, but "a/" would fail — however "a/b/" would pass the first
alternative). A path like "analyzers//my-dict" could cause unexpected behavior
during path resolution. Consider tightening the pattern to disallow consecutive or
leading/trailing slashes.

server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java [131]

-private static final Pattern SAFE_REF_PATH_PATTERN = Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9_/-]*[a-zA-Z0-9]$|^[a-zA-Z0-9]$");
+private static final Pattern SAFE_REF_PATH_PATTERN = Pattern.compile("^[a-zA-Z0-9]([a-zA-Z0-9_-]*(/[a-zA-Z0-9][a-zA-Z0-9_-]*)*)?$");
Suggestion importance[1-10]: 6

__

Why: The current SAFE_REF_PATH_PATTERN does allow consecutive slashes like "analyzers//my-dict" which could cause unexpected behavior during path resolution. The improved pattern properly enforces that each path segment is separated by exactly one slash, which is a valid security improvement.

Low
Add defense-in-depth validation before path resolution

The security check verifies the resolved refDir stays under configDir, but the
refPath validation in HunspellTokenFilterFactory is not called inside
loadDictionaryFromRefPath. If getDictionaryFromRefPath is called directly (e.g.,
from tests or future code) without prior validation, a crafted refPath containing ..
segments could still pass the startsWith check on some filesystems. The
loadDictionaryFromRefPath method should also validate the refPath before resolving
it.

server/src/main/java/org/opensearch/indices/analysis/HunspellService.java [203-208]

+// Validate refPath before resolving (defense-in-depth)
+if (refPath.contains("..") || refPath.contains("\\")) {
+    throw new IllegalArgumentException(
+        String.format(Locale.ROOT, "Invalid ref_path: [%s]. Must not contain '..' or backslash sequences.", refPath)
+    );
+}
 Path refDir = env.configDir().resolve(refPath);
 
 // Security check: ensure resolved path stays under config directory
 Path configDirAbsolute = env.configDir().toAbsolutePath().normalize();
 Path refDirAbsolute = refDir.toAbsolutePath().normalize();
 if (!refDirAbsolute.startsWith(configDirAbsolute)) {
Suggestion importance[1-10]: 5

__

Why: While the suggestion adds defense-in-depth by validating refPath before resolving it in loadDictionaryFromRefPath, the validateRefPath method in HunspellTokenFilterFactory already rejects .. sequences. The getDictionaryFromRefPath method also has a startsWith check after resolution. This is a valid but somewhat redundant security layer since callers should already validate inputs.

Low
General
Reorder validation checks for clearer error messages

The ".." check is redundant for validateRefPath because SAFE_REF_PATH_PATTERN does
not allow dots at all, so ".." would already be rejected by the pattern match.
However, for validateLocale, the SAFE_LOCALE_PATTERN allows dots, so "foo..bar"
would match the pattern but should be caught by the ".." check. The order of checks
is correct, but the ".." check should be moved before the pattern match to ensure it
is always evaluated first and provides a clearer error message.

server/src/main/java/org/opensearch/index/analysis/HunspellTokenFilterFactory.java [179-188]

-if (!pattern.matcher(value).matches()) {
-    throw new IllegalArgumentException(String.format(Locale.ROOT, "Invalid %s: [%s]. %s", paramName, value, allowedDesc));
-}
-
-// Additional check: reject ".." sequences even within otherwise valid characters (e.g., "foo..bar")
 if (value.contains("..")) {
     throw new IllegalArgumentException(
         String.format(Locale.ROOT, "Invalid %s: [%s]. Consecutive dots ('..') are not allowed.", paramName, value)
     );
 }
 
+if (!pattern.matcher(value).matches()) {
+    throw new IllegalArgumentException(String.format(Locale.ROOT, "Invalid %s: [%s]. %s", paramName, value, allowedDesc));
+}
+
Suggestion importance[1-10]: 4

__

Why: Moving the ".." check before the pattern match provides a more specific error message for the validateLocale case where "foo..bar" would otherwise pass the pattern. However, for validateRefPath, dots aren't allowed at all so the order doesn't matter there. The improvement is minor but does provide clearer error messages for locale validation.

Low

@cwperks cwperks removed the skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. label May 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

❌ Gradle check result for 387c944: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

  Rename HunspellService and HunspellTokenFilterFactory APIs from
  package-specific naming to generic ref_path terminology:

  - getDictionaryFromPackage -> getDictionaryFromRefPath
  - loadDictionaryFromPackage -> loadDictionaryFromRefPath
  - buildPackageCacheKey -> buildRefPathCacheKey
  - Split validatePackageIdentifier into validateRefPath (allows /)
    and validateLocale (allows dots)
  - Extract CACHE_KEY_SEPARATOR constant
  - Update all javadoc, comments, and test names

  No behavioral change for existing functionality.

Signed-off-by: shayush622 <ayush5267@gmail.com>
@shayush622 shayush622 force-pushed the refactor/hunspell-generic-naming branch from 387c944 to 671475a Compare May 7, 2026 15:40
@shayush622 shayush622 requested a review from cwperks May 7, 2026 15:52
@cwperks cwperks added the skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. label May 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

Persistent review updated to latest commit 671475a

@cwperks
Copy link
Copy Markdown
Member

cwperks commented May 7, 2026

The changes LGTM. I had the same 2 comments as the top on #21516 (comment), but these are fairly minor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants