RFC: Extract shared input normalization into buildDocumentInitParameters #156

@l2ysho

Problem

Two functions contain nearly identical input-normalization logic with no shared implementation:

validateParameters in core.ts (lines 202–226) and getPdfMetadata.ts (lines 41–63) both contain:

  1. An identical switch (true) block converting Buffer | string | Uint8Array | URL into DocumentInitParameters.data or DocumentInitParameters.url — including the non-obvious ordering constraint where Buffer.isBuffer must precede instanceof Uint8Array because Buffer extends Uint8Array
  2. The same four pdfjs performance flags set identically: verbosity = VerbosityLevel.ERRORS, disableAutoFetch = true, disableStream = true, disableRange = true

Bug fixes or new input type support (e.g., ArrayBuffer) require two coordinated edits. The ordering hazard in the switch is silently duplicated. The four pdfjs flags are a library-level policy decision but appear scattered across two files.
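
The ordering hazard can be reproduced in isolation. The sketch below is illustrative only: classifyBufferFirst and classifyUint8ArrayFirst are hypothetical names, not code from core.ts or getPdfMetadata.ts, and they only report which branch matched.

```typescript
import { Buffer } from 'node:buffer';

type Kind = 'buffer' | 'uint8array' | 'other';

// Intended order: the more specific Buffer check runs first.
function classifyBufferFirst(input: unknown): Kind {
  switch (true) {
    case Buffer.isBuffer(input):
      return 'buffer';
    case input instanceof Uint8Array:
      return 'uint8array';
    default:
      return 'other';
  }
}

// Swapped order: Buffer extends Uint8Array, so the first case
// matches every Buffer and the Buffer branch is never reached.
function classifyUint8ArrayFirst(input: unknown): Kind {
  switch (true) {
    case input instanceof Uint8Array:
      return 'uint8array';
    case Buffer.isBuffer(input):
      return 'buffer';
    default:
      return 'other';
  }
}

const sample = Buffer.from('%PDF');
console.log(classifyBufferFirst(sample));     // 'buffer'
console.log(classifyUint8ArrayFirst(sample)); // 'uint8array'
```

Nothing fails at compile time or at the dispatch itself when the order is swapped, which is why the constraint is easy to break in one copy and not the other.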

Proposed Interface

A new file src/inputNormalizer.ts with one function and one type alias:

// src/inputNormalizer.ts

export type PdfInput = Buffer | string | Uint8Array | URL;

export async function buildDocumentInitParameters(
  input: PdfInput,
): Promise<DocumentInitParameters>
// Returns DocumentInitParameters with data/url set + the 4 standard pdfjs flags.
// Password is NOT included — each caller handles it differently by design.
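
One possible body for this function, as a minimal sketch rather than the final implementation. Several things here are assumptions: VerbosityLevel and DocumentInitParameters are local stand-ins so the block runs without pdfjs-dist (the real module would import them), the Buffer branch is assumed to copy into a plain Uint8Array, and the exact wording after "Invalid source type:" is a guess.

```typescript
import { Buffer } from 'node:buffer';
import { readFile } from 'node:fs/promises';

// Stand-ins (assumption): the real module would take these from pdfjs-dist.
const VerbosityLevel = { ERRORS: 0 } as const;
interface DocumentInitParameters {
  data?: Uint8Array;
  url?: URL;
  password?: string;
  verbosity?: number;
  disableAutoFetch?: boolean;
  disableStream?: boolean;
  disableRange?: boolean;
}

export type PdfInput = Buffer | string | Uint8Array | URL;

export async function buildDocumentInitParameters(
  input: PdfInput,
): Promise<DocumentInitParameters> {
  // The four standard flags, set once for every caller.
  const params: DocumentInitParameters = {
    verbosity: VerbosityLevel.ERRORS,
    disableAutoFetch: true,
    disableStream: true,
    disableRange: true,
  };

  switch (true) {
    case typeof input === 'string':
      // String inputs are treated as file paths.
      params.data = new Uint8Array(await readFile(input as string));
      break;
    case Buffer.isBuffer(input):
      // Must precede the Uint8Array case: Buffer extends Uint8Array.
      params.data = new Uint8Array(input as Buffer);
      break;
    case input instanceof Uint8Array:
      params.data = input as Uint8Array; // passed through as-is
      break;
    case input instanceof URL:
      params.url = input as URL;
      break;
    default:
      throw new Error(`Invalid source type: ${typeof input}`);
  }

  return params; // password deliberately left unset
}
```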

In core.ts, validateParameters becomes:

import { buildDocumentInitParameters, type PdfInput } from '#afpp/src/inputNormalizer';

const validateParameters = async (input: PdfInput, options?: AfppParseOptions) => {
  const documentInitParameters = await buildDocumentInitParameters(input);
  documentInitParameters.password = options?.password; // layered on after

  const scale = options?.scale ?? 1.0;
  // ... scale validation

  const concurrency = ...;
  // ... concurrency validation

  const encoding = options?.imageEncoding ?? 'png';
  // ... encoding validation

  return { concurrency, documentInitParameters, encoding, scale };
};

getPdfMetadata.ts — switch block deleted entirely:

import { buildDocumentInitParameters, type PdfInput } from '#afpp/src/inputNormalizer';

export async function getPdfMetadata(
  input: PdfInput,
  options?: Pick<AfppParseOptions, 'password'>,
): Promise<PdfMetadata> {
  const documentInitParameters = await buildDocumentInitParameters(input);
  // password handled via loadingTask.onPassword callback — unchanged

  let isEncrypted = false;
  const loadingTask = getDocument(documentInitParameters);
  // ... rest unchanged
}

getPdfMetadata.ts also loses its readFile, DocumentInitParameters, and VerbosityLevel imports — all three move into inputNormalizer.ts.

The PdfInput type alias replaces the repeated inline Buffer | string | Uint8Array | URL union across all six public-facing function signatures and should be re-exported from index.ts.

Dependency Strategy

In-process — pure async computation with one I/O call (readFile). No network, no external service.

inputNormalizer.ts sits at the bottom of the dependency graph:

  • Imports only: node:fs/promises, pdfjs-dist (for VerbosityLevel and DocumentInitParameters type)
  • No imports from any other afpp source file
  • No @napi-rs/canvas, no p-limit

This keeps getPdfMetadata's module graph lean — it currently imports only a type from core.ts. Placing the function in core.ts instead would transitively pull @napi-rs/canvas and p-limit into getPdfMetadata's module graph unnecessarily.

buildDocumentInitParameters is not exported from index.ts — it is an internal implementation detail. Only PdfInput crosses the public API boundary.

Testing Strategy

New boundary tests to write (in a new test/inputNormalizer.test.ts):

  • String path → data: Uint8Array (mocking readFile)
  • Buffer → data: Uint8Array (verifies Buffer.isBuffer branch, not instanceof Uint8Array)
  • Uint8Array → data: Uint8Array (passes through as-is)
  • URL → url: URL
  • Invalid input → throws Error with message matching Invalid source type: ...
  • All cases: returned object has verbosity, disableAutoFetch, disableStream, disableRange set to the expected values
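
The last two bullets apply to every case, so they could be written once as shared assertions. This is a sketch using node:assert. assertStandardFlags and assertRejectsInvalidInput are hypothetical helper names, and the literal 0 stands in for VerbosityLevel.ERRORS on the assumption that this is its value in pdfjs-dist; a real test would import the constant instead.

```typescript
import assert from 'node:assert/strict';

// Assumption: stand-in for VerbosityLevel.ERRORS from pdfjs-dist.
const VERBOSITY_ERRORS = 0;

// Only the fields the flag check needs; the real type is DocumentInitParameters.
interface FlagFields {
  verbosity?: number;
  disableAutoFetch?: boolean;
  disableStream?: boolean;
  disableRange?: boolean;
}

// Asserts the four standard flags on any returned parameters object.
function assertStandardFlags(params: FlagFields): void {
  assert.equal(params.verbosity, VERBOSITY_ERRORS);
  assert.equal(params.disableAutoFetch, true);
  assert.equal(params.disableStream, true);
  assert.equal(params.disableRange, true);
}

// Asserts that the function under test rejects an unsupported input type.
async function assertRejectsInvalidInput(
  build: (input: unknown) => Promise<unknown>,
): Promise<void> {
  await assert.rejects(() => build(42), /Invalid source type/);
}
```

Each per-input test would then call assertStandardFlags on its result, so a change to the flag policy needs one edit in the test file as well.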

Existing tests: No changes needed. The existing input-type coverage in getPdfMetadata.test.ts, pdf2string.test.ts, and pdf2image.test.ts continues to serve as integration-level verification that the refactor didn't break behavior. The new unit tests are additive.

Implementation Recommendations

What buildDocumentInitParameters should own:

  • The switch (true) input-type dispatch (including the Buffer-before-Uint8Array ordering)
  • The readFile call for string paths
  • The four pdfjs performance flags as a library invariant

What it should hide:

  • The Buffer.isBuffer / instanceof Uint8Array ordering constraint
  • The readFile async step
  • The pdfjs flag names and values

What it should expose:

  • A DocumentInitParameters object ready to pass to getDocument()
  • No password — callers set this themselves, since core.ts sets it eagerly and getPdfMetadata.ts handles it reactively via onPassword

Caller migration:

  • validateParameters in core.ts: replace the switch block + 4 flag lines with one await buildDocumentInitParameters(input) call, then add documentInitParameters.password = options?.password
  • getPdfMetadata.ts: replace the switch block + 4 flag lines with one await buildDocumentInitParameters(input) call; remove readFile, DocumentInitParameters, and VerbosityLevel imports
  • All public function signatures: replace Buffer | string | Uint8Array | URL with PdfInput
