tzkmx/indexer-tools

PDF Processing Utilities

This repository contains PowerShell scripts for processing PDF documents.

Scripts

1. Convert-PdfTo-Text.ps1

This script recursively finds PDF files in a specified directory and converts each one to Markdown (.md) or plain text (.txt).

Features:

  • Converts PDFs to individual .md or .txt files.
  • Generates a front matter header in each file with metadata (source path, creation time, size, etc.).
  • Optionally merges the content of all PDFs into a single file.
  • Creates a conversion.log file in the output directory to track successes and failures.
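
The front matter header described above could be assembled roughly as follows. This is a minimal sketch, not the script's actual implementation: the function name New-FrontMatter and the field names source_path, creation_time, and size_bytes are assumptions chosen for illustration.

```powershell
# Hypothetical helper: builds a YAML-style front matter block from file metadata.
# The field names below are illustrative, not necessarily those the script emits.
function New-FrontMatter {
    param(
        [Parameter(Mandatory)] [string]   $SourcePath,
        [Parameter(Mandatory)] [datetime] $CreationTime,
        [Parameter(Mandatory)] [long]     $SizeBytes
    )
    # Single-quoted YAML scalars keep Windows backslashes literal;
    # an embedded single quote is escaped by doubling it.
    $quotedPath = "'" + $SourcePath.Replace("'", "''") + "'"
    $lines = @(
        '---'
        "source_path: $quotedPath"
        "creation_time: $($CreationTime.ToString('s'))"   # sortable ISO 8601 format
        "size_bytes: $SizeBytes"
        '---'
        ''
    )
    return ($lines -join "`n")
}

# Example call with made-up values.
$header = New-FrontMatter -SourcePath 'C:\docs\report.pdf' `
    -CreationTime ([datetime]'2024-01-15T09:30:00') -SizeBytes 20480
$header
```

In a real run the three values would typically come from Get-Item (FullName, CreationTime, Length) for each PDF.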

Prerequisites:

  • Requires the PSWritePDF PowerShell module. Install it by running:
    Install-Module -Name PSWritePDF -Scope CurrentUser

Usage:

To get help and see examples, run:

Get-Help .\Convert-PdfTo-Text.ps1 -Full

Examples:

  • Convert all PDFs to individual Markdown files (output is written to a new output folder inside the target directory):

    .\Convert-PdfTo-Text.ps1 -Path "C:\path\to\your\documents"
  • Merge all PDFs into a single file:

    .\Convert-PdfTo-Text.ps1 -Path "C:\path\to\your\documents" -MergeFiles
  • Specify an output directory and format:

    .\Convert-PdfTo-Text.ps1 -Path "C:\path\to\your\documents" -OutputDirectory "C:\my\output" -Format txt

2. Find-TextInPdf.ps1

This script recursively searches for a specific text string within all PDF files in a given directory.

Prerequisites:

  • Requires the PSWritePDF PowerShell module.

Usage:

To get help and see examples, run:

Get-Help .\Find-TextInPdf.ps1 -Full

Example:

.\Find-TextInPdf.ps1 -Directory "C:\path\to\your\documents" -TextToSearch "your-text-here"
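
For readers curious how such a recursive search can be structured, here is a minimal sketch under stated assumptions; it is not the script's actual code. The function name Find-TextInFiles and the -Extractor parameter are hypothetical. The default extractor assumes that PSWritePDF's Convert-PDFToText cmdlet returns the text of the file passed to -FilePath.

```powershell
# Hypothetical sketch: list PDF files under a directory whose text contains a string.
function Find-TextInFiles {
    param(
        [Parameter(Mandatory)] [string] $Directory,
        [Parameter(Mandatory)] [string] $TextToSearch,
        # Assumption: Convert-PDFToText (PSWritePDF) extracts text for -FilePath.
        # The extractor is injectable so the search logic can be tried without the module.
        [scriptblock] $Extractor = { param($file) Convert-PDFToText -FilePath $file }
    )
    Get-ChildItem -LiteralPath $Directory -Recurse -File -Filter '*.pdf' | ForEach-Object {
        $file = $_
        # Extraction may yield one string per page; join them into a single string.
        $text = (& $Extractor $file.FullName) -join "`n"
        # Literal, case-insensitive match (no regex interpretation of the search text).
        if ($text.IndexOf($TextToSearch, [System.StringComparison]::OrdinalIgnoreCase) -ge 0) {
            $file.FullName
        }
    }
}

# Usage (requires PSWritePDF for the default extractor):
# Find-TextInFiles -Directory 'C:\path\to\your\documents' -TextToSearch 'your-text-here'
```

Matching with IndexOf rather than a regular expression is a deliberate choice here: search strings that contain characters such as . or ( are treated literally.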

3. Enrich-FrontMatter.ps1

This script enriches the front matter of Markdown files with metadata extracted from each document's content, using Gemini (invoked through the Gemini CLI).

Features:

  • Extracts key metadata from the document content, such as document type, dates, and parties involved.
  • Extracts specific metadata from SIGER documents (siger_numero_unico_de_documento and siger_fecha_de_inscripcion).
  • Updates the Markdown file's front matter with the new metadata.
  • Creates an enrichment.log file in the input directory to track successes and failures.
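
The front matter update step can be pictured with the sketch below. It is illustrative only and makes several assumptions: the function name Add-FrontMatterField is hypothetical, the front matter is a leading block delimited by --- lines, and new fields are appended without checking for duplicate keys. The values in the example are placeholders, not real SIGER data.

```powershell
# Hypothetical helper: appends key/value pairs to a document's front matter,
# or creates a front matter block if the document has none.
function Add-FrontMatterField {
    param(
        [Parameter(Mandatory)] [string] $Content,
        [Parameter(Mandatory)] [System.Collections.IDictionary] $Fields
    )
    # Render each field as "key: value"; ${key} avoids parsing "$key:" as a scope prefix.
    $newLines = foreach ($key in $Fields.Keys) { "${key}: $($Fields[$key])" }
    $block = $newLines -join "`n"

    # Match a non-empty front matter block at the very start of the content.
    $regex = [regex]'(?s)\A---\r?\n(.*?)\r?\n---'
    $m = $regex.Match($Content)
    if ($m.Success) {
        $existing = $m.Groups[1].Value
        $rest = $Content.Substring($m.Length)
        return "---`n$existing`n$block`n---$rest"
    }
    # No front matter found: prepend a new block.
    return "---`n$block`n---`n`n$Content"
}

# Example with placeholder values.
$doc = "---`nsource_path: 'a.pdf'`n---`n`nBody text"
$updated = Add-FrontMatterField -Content $doc -Fields ([ordered]@{
    document_type                   = 'example-type'
    siger_numero_unico_de_documento = 'EXAMPLE-0001'
})
$updated
```

An ordered dictionary is used so that fields are written in the order they were supplied.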

Prerequisites:

  • Requires the Gemini CLI to be installed and configured.

Usage:

To get help and see examples, run:

Get-Help .\Enrich-FrontMatter.ps1 -Full

Examples:

  • Enrich all Markdown files in a directory:

    .\Enrich-FrontMatter.ps1 -InputDirectory "C:\path\to\your\markdown_files"
  • Enrich files using a specific Gemini model:

    .\Enrich-FrontMatter.ps1 -InputDirectory "C:\path\to\your\markdown_files" -GeminiModel "gemini-1.5-pro-latest"
