
SAS to Databricks Converter

A production-ready Streamlit application that converts legacy SAS code to PySpark or Databricks SQL using Databricks Foundation Models (Claude Sonnet 4.5).



Features

  • Multi-format Output: Generate PySpark DataFrame API or pure Databricks SQL
  • Smart Conversion: 12 production-validated rules prevent common migration errors
  • Interactive UI: Streamlit-based interface with real-time previews
  • Foundation Models: Powered by Claude Sonnet 4.5 for optimal conversion quality
  • Healthcare Focused: Optimized for healthcare payer analytics (HEDIS, risk adjustment, claims)
  • Production Ready: Includes comprehensive LLM instructions (1,700+ lines)

Quick Start

Prerequisites

  • Databricks workspace with Unity Catalog enabled
  • Databricks CLI installed and configured (Installation Guide)
  • Python 3.10+ (for local development only)
  • Workspace permissions: Ability to create catalogs, schemas, and volumes

Two Ways to Use This Framework


🚀 Option 1: Streamlit App (Recommended)

Best for: Automated bulk conversions with a user-friendly UI

Step-by-Step Setup

1️⃣ Clone the Repository

git clone <your-repo-url>
cd SAS-work

2️⃣ Configure Databricks CLI

# If not already configured
databricks configure

# Or set your profile (optional)
export DATABRICKS_PROFILE="your_profile_name"

3️⃣ Run Full Setup (First Time Only)

This creates the demo catalogs, sample data, and a volume, and uploads the sample SAS files:

./setup_databricks_assistant.sh

Time: ~10 minutes | Creates: 2 catalogs, 29K rows of data, 6 SAS files

4️⃣ Deploy the Streamlit App

./deploy_streamlit_app.sh

Time: ~30 seconds | Deploys: Streamlit app to Databricks Apps

The script will show you a big reminder box at the end:

╔════════════════════════════════════════════════════════════════╗
║  ⚠️  ACTION REQUIRED: Grant App Permissions (One-Time)         ║
╚════════════════════════════════════════════════════════════════╝

5️⃣ Grant App Permissions (One-Time)

The deploy script outputs SQL commands. Copy them into the Databricks SQL Editor and run:

GRANT USE CATALOG ON CATALOG payer_dev TO `<auto-detected-uuid>`;
GRANT USE SCHEMA ON SCHEMA payer_dev.sas_migration TO `<auto-detected-uuid>`;
GRANT READ VOLUME ON VOLUME payer_dev.sas_migration.legacy_sas TO `<auto-detected-uuid>`;
GRANT SELECT ON CATALOG payer_dev TO `<auto-detected-uuid>`;

💡 Tip: Run ./show_grant_commands.sh anytime to regenerate these commands with your app's UUID.

6️⃣ Access Your App

Open the app URL shown in the deploy output:

https://sas-converter-<workspace-id>.azuredatabricksapps.com

✅ You're Done! Start converting SAS code to PySpark/SQL.


🔄 Quick Updates (After Initial Setup)

Already deployed? Just need to update the app or instructions?

# Update app + instructions only (no data reload)
./deploy_streamlit_app.sh  # ~30 seconds

📝 Option 2: Databricks Assistant (Manual Conversion)

Best for: Interactive conversions directly in notebooks with AI assistance

Setup

Run the same setup script as Option 1:

./setup_databricks_assistant.sh

This deploys .assistant_instructions.md to your workspace at:

/Workspace/Users/your.email/.assistant_instructions.md

Usage

  1. Open any Databricks notebook
  2. Click the Assistant icon
  3. Paste your SAS code
  4. Ask: "Convert this to PySpark"
  5. Assistant uses your custom instructions automatically

Benefits:

  • Context-aware conversions based on your patterns
  • Interactive Q&A about SAS → PySpark/SQL
  • Works across all notebooks in your workspace

Learn more about custom instructions

Local Development

# Install dependencies
pip install -r dashboard/requirements.txt

# Run locally
cd dashboard
streamlit run sas_converter_app.py

Project Structure

SAS-work/
├── dashboard/                           # Main application
│   ├── sas_converter_app.py            # Entry point (1,318 lines)
│   ├── file_handler.py                 # File I/O utilities
│   ├── utils.py                        # Validation & formatting
│   ├── app.yaml                        # Databricks app config
│   └── config/
│       └── .assistant_instructions.md  # LLM instructions (1,706 lines)
│
├── config/
│   └── .assistant_instructions.md      # Source of truth (edit here!)
│
├── notebooks/                          # Demo notebooks
│   ├── 00c_setup_with_inline_data.py  # Setup script
│   ├── 01-09_*.py                     # Example conversions
│   └── README.md
│
├── legacy_sas/                        # Sample SAS files
│   └── *.sas
│
├── databricks.yml                     # Databricks Asset Bundle config
├── setup_databricks_assistant.sh     # Setup Databricks Assistant with instructions
└── deploy_streamlit_app.sh           # Deploy Streamlit app to Databricks Apps

Which Approach Should You Use?

Use the Streamlit App when:

  • You want a self-service UI for business users
  • You need batch conversions of multiple SAS files
  • You want consistent, automated output
  • You're demoing to customers or stakeholders

Use Databricks Assistant when:

  • You're actively developing in notebooks
  • You want AI help while coding (iterative approach)
  • You need to ask questions about SAS patterns
  • You prefer manual control over each conversion step

Use both! Many teams deploy the Streamlit app for business users while developers use Databricks Assistant for hands-on work.


Key Features

1. Intelligent Code Conversion

The converter follows 12 critical rules validated over 3 days of production testing:

  • Always cast aggregations (.cast("double") / .cast("long"))
  • Always use overwriteSchema=true on table writes
  • Always import both F and Window for PySpark
  • Never use the SELECT *, col pattern (it causes COLUMN_ALREADY_EXISTS errors)
  • Always check for division by zero
  • Always use 3-level namespace (catalog.schema.table)
  • Plus 6 more...
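
A minimal sketch of what converted output following these rules can look like (the catalog and schema match the demo setup; table and column names are illustrative, not taken from the repo):

from pyspark.sql import functions as F
from pyspark.sql.window import Window  # imported per the rules, even when a given script does not use it

# Illustrative source table, referenced with the 3-level namespace
claims = spark.table("payer_dev.sas_migration.claims")

# Cast aggregations explicitly and guard against division by zero
summary = (
    claims.groupBy("member_id")
    .agg(
        F.sum("paid_amount").cast("double").alias("total_paid"),
        F.count("claim_id").cast("long").alias("claim_count"),
    )
    .withColumn(
        "avg_paid",
        F.when(F.col("claim_count") > 0,
               F.col("total_paid") / F.col("claim_count")),
    )
)

# Always set overwriteSchema on table writes
(
    summary.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("payer_dev.sas_migration.claims_summary")
)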

2. Two Output Modes

PySpark Mode:

  • Pure DataFrame API
  • Best for complex transformations
  • Auto-overwrites duplicate columns

SQL Mode:

  • Pure Databricks SQL
  • Familiar syntax for SQL users
  • Includes warnings about common pitfalls
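
For comparison, a sketch of the same aggregation as SQL-mode output (names are illustrative; it is wrapped in spark.sql here purely so the example is runnable from a notebook). Columns are listed explicitly rather than via SELECT * to avoid the duplicate-column pitfall the warnings describe:

# Hypothetical SQL-mode output, executed from a notebook for illustration
spark.sql("""
CREATE OR REPLACE TABLE payer_dev.sas_migration.claims_summary AS
SELECT
    member_id,
    CAST(SUM(paid_amount) AS DOUBLE)  AS total_paid,
    CAST(COUNT(claim_id) AS BIGINT)   AS claim_count,
    CASE WHEN COUNT(claim_id) > 0     -- division-by-zero guard
         THEN CAST(SUM(paid_amount) AS DOUBLE) / COUNT(claim_id)
    END                               AS avg_paid
FROM payer_dev.sas_migration.claims
GROUP BY member_id
""")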

3. Foundation Model Integration

Uses Claude Sonnet 4.5 configured in dashboard/app.yaml:

- databricks-claude-sonnet-4-5         # Best quality for SAS conversions
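
A minimal sketch of what a call to this serving endpoint can look like. The app's actual request code is not shown in this README; the host, token handling, prompt, and variable names below are assumptions:

import os
import requests

# Assumed environment; DATABRICKS_TOKEN is a placeholder for whatever auth the app uses
host = os.environ["DATABRICKS_HOST"]
endpoint = os.environ.get("SERVING_ENDPOINT_NAME", "databricks-claude-sonnet-4-5")
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    f"https://{host}/serving-endpoints/{endpoint}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [
            {"role": "system", "content": "Convert the following SAS code to PySpark."},
            {"role": "user", "content": "DATA work.out; SET work.in; RUN;"},
        ],
        "max_tokens": int(os.environ.get("MAX_TOKENS", "8000")),
    },
    timeout=300,
)
converted_code = response.json()["choices"][0]["message"]["content"]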

4. Databricks Assistant Integration

This framework includes comprehensive custom instructions for Databricks Assistant:

Setup:

./setup_databricks_assistant.sh

This deploys .assistant_instructions.md to your workspace folder (/Workspace/Users/your.email/.assistant_instructions.md).

Usage in Notebooks:

  1. Open any Databricks notebook
  2. Use Databricks Assistant (AI helper)
  3. Paste your SAS code and ask: "Convert this to PySpark"
  4. Assistant uses your custom instructions automatically

Benefits:

  • Context-aware conversions based on your patterns
  • Interactive Q&A about SAS → PySpark/SQL
  • Learns from your custom rules and domain knowledge
  • Works across all notebooks in your workspace

Learn more about custom instructions


Usage Example

  1. Upload SAS File: Click "Browse files" or paste code
  2. Configure: Set source/target catalogs and schemas
  3. Select Mode: Choose PySpark or SQL output
  4. Convert: Click "Convert to Databricks"
  5. Download: Save as .py notebook file
  6. Deploy: Upload to Databricks workspace
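
If the downloaded .py file follows the Databricks notebook source format (it begins with a # Databricks notebook source marker), importing it into the workspace renders it as a notebook. A trimmed, hypothetical example of the top of a converted file:

# Databricks notebook source
# Converted from legacy SAS (illustrative output, not an actual conversion from this repo)

# COMMAND ----------

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# COMMAND ----------

claims = spark.table("payer_dev.sas_migration.claims")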

File Update Workflow

Critical: Always edit the source file, then deploy!

# 1. Edit source
vim config/.assistant_instructions.md

# 2a. Deploy to Streamlit app (auto-syncs to dashboard/config/)
./deploy_streamlit_app.sh

# OR 

# 2b. Deploy to workspace for Databricks Assistant
./setup_databricks_assistant.sh

# Done! Instructions deployed

Production Deployment

Via Script (Recommended)

Choose based on your use case:

# Option 1: Deploy Streamlit App (~30 seconds)
./deploy_streamlit_app.sh
# Use for: Automated UI-based conversions

# Option 2: Setup Databricks Assistant (~10 minutes)
./setup_databricks_assistant.sh
# Use for: Manual notebook conversions with AI Assistant

⚠️ Important: Grant App Permissions (One-Time Step)

After deploying the Streamlit app, you need to grant it access to Unity Catalog volumes. The deployment script will remind you to do this.

Step 1: Get the grant commands (with your app's UUID auto-detected)

./show_grant_commands.sh

Step 2: Copy the SQL output and run in Databricks SQL Editor

Example output (your UUID will be different):

-- Grant permissions to app service principal
GRANT USE CATALOG ON CATALOG payer_dev TO `e9dfbf80-9204-43f4-9758-41b204defc1c`;
GRANT USE SCHEMA ON SCHEMA payer_dev.sas_migration TO `e9dfbf80-9204-43f4-9758-41b204defc1c`;
GRANT READ VOLUME ON VOLUME payer_dev.sas_migration.legacy_sas TO `e9dfbf80-9204-43f4-9758-41b204defc1c`;
GRANT SELECT ON CATALOG payer_dev TO `e9dfbf80-9204-43f4-9758-41b204defc1c`;

Why manual? Unity Catalog permission grants require specific admin privileges that may not be available to automation scripts. This one-time step (takes 30 seconds) ensures the app can access volumes containing SAS files.

Verify: Check Catalog Explorer → payer_dev.sas_migration.legacy_sas → Permissions tab to confirm the app has access.
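
Alternatively, a quick check from a notebook (assumes you have privileges to view grants on the volume):

# Confirm the app's service principal shows up in the volume grants
spark.sql("SHOW GRANTS ON VOLUME payer_dev.sas_migration.legacy_sas").show(truncate=False)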

Via Databricks CLI (Manual)

# For Streamlit app
databricks bundle deploy --profile DEFAULT_azure
databricks apps deploy sas-converter --profile DEFAULT_azure

# For Databricks Assistant
databricks workspace import --overwrite \
  --file config/.assistant_instructions.md \
  /Workspace/Users/your.email/.assistant_instructions.md

Configuration

Environment Variables (dashboard/app.yaml)

env:
  - name: "DATABRICKS_HOST"
    value: "your-workspace.azuredatabricks.net"
  
  - name: "SERVING_ENDPOINT_NAME"
    value: "databricks-claude-sonnet-4-5"
  
  - name: "MAX_TOKENS"
    value: "8000"

Customization

See config/.assistant_instructions.md for:

  • Conversion patterns
  • Healthcare payer-specific rules
  • SAS function mappings
  • Error handling strategies

Testing

After deployment:

  1. Test PySpark conversion (should include overwriteSchema=true)
  2. Test SQL conversion (should show warning about duplicate columns)
  3. Verify imports work (from file_handler import ...)
  4. Check notebooks run successfully

Troubleshooting

Common Issues

  1. Import errors after restructure: Verify file_handler.py and utils.py are in dashboard/ root
  2. Rate limit errors: Check Foundation Model API quotas in Databricks workspace
  3. COLUMN_ALREADY_EXISTS errors: Read SQL notebook warning, use SELECT * REPLACE
  4. DELTA_FAILED_TO_MERGE_FIELDS: Add overwriteSchema=true to table writes
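
For issue 3 on the PySpark side, one common fix is to join on the column name so only a single copy of the key survives; for issue 4, set overwriteSchema on the write. A sketch with illustrative table names:

# Joining on the column name keeps a single member_id column, avoiding
# COLUMN_ALREADY_EXISTS when the result is written out
members = spark.table("payer_dev.sas_migration.members")   # illustrative
claims = spark.table("payer_dev.sas_migration.claims")     # illustrative
joined = claims.join(members, on="member_id", how="left")

# overwriteSchema resolves DELTA_FAILED_TO_MERGE_FIELDS on schema changes
(
    joined.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("payer_dev.sas_migration.claims_with_members")
)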

Logs

# Check app logs
databricks apps logs sas-converter --profile DEFAULT_azure

# Check bundle status
databricks bundle validate

Personal Setup (Optional)

This project uses comprehensive instruction files for optimal LLM conversion. You can:

  1. Use as-is: The included .assistant_instructions.md files work out of the box
  2. Customize: Copy .assistant_instructions.md into your own environment and adapt it
  3. Extend: Add your own industry-specific rules and patterns

Note: The authors use centralized configuration management for personal development, but this is optional.


Contributing

This is production code from a 3-day development sprint with comprehensive lessons learned. Key improvements validated:

  • 12 critical conversion rules
  • Claude Sonnet 4.5 integration
  • Comprehensive error prevention
  • Healthcare payer-specific patterns

Contributions welcome! Please test thoroughly before submitting PRs.


License

MIT License - See LICENSE file for details


Support

For issues or questions:

  1. Review the comprehensive .assistant_instructions.md for conversion patterns
  2. Check troubleshooting section above for common issues
  3. Test with ./deploy_streamlit_app.sh for quick app iterations
  4. Or use ./setup_databricks_assistant.sh for notebook-based conversions
  5. Open an issue on GitHub with details and error logs

Version History

  • v2.4 (Current): Pure SQL mode with warnings, Claude Sonnet 4.5 integration
  • v2.3: Hybrid mode implementation
  • v2.2: Comprehensive validation rules
  • v2.1: Initial production release

Built with Databricks Foundation Models | Optimized for Healthcare Payers

No packages published