DecisionsAI

DecisionsAI is a voice-controlled digital assistant that combines real-time speech recognition, local language model processing, and text-to-speech synthesis to provide a seamless hands-free computing experience. Built on the Pipecat framework, it efficiently orchestrates audio and text processing with minimal memory overhead.

Built for offline-first, upgraded with powerful cloud capabilities. DecisionsAI is designed to work completely offline using local models (Whisper.cpp for speech recognition, Ollama for language processing, and Kokoro for text-to-speech), ensuring privacy and reliability. However, when you enter your API keys, the experience transforms—you gain access to cutting-edge cloud models and services that take it to the next level.

Very strong support for third-party services. DecisionsAI now includes comprehensive support for major AI providers including OpenAI, Anthropic, ElevenLabs, and OpenRouter. With OpenRouter integration, you can tap into the latest models including GPT-5.2, Gemini 3 Flash, Nano Banana, and all the newest model releases as they become available. This gives you instant access to the most advanced AI capabilities without waiting for individual provider integrations.

Remote control from anywhere. DecisionsAI includes full Telegram integration for remote access—send voice messages, text commands, or use the web-based remote control interface to navigate and control your computer screens from anywhere using your mobile device.

Native Google Workspace integration. For optimal performance with Google services, DecisionsAI includes direct integration with Google Workspace (Gmail, Calendar, Drive, Docs, Sheets). This native integration provides faster, more reliable access to your Google data compared to third-party routing. For everything else, the platform supports workflow automation through Rube/Composio, connecting to 500+ additional apps and services including Slack, GitHub, Notion, Microsoft Teams, and more.

Full file control and document processing. DecisionsAI gives you complete control over your files—transcribe audio and video files, process PDFs and Word documents for context, upload documents to Google Drive, and convert Markdown files directly into Google Docs. Create files on command during any conversation, and generate tickets for your development workflow as you go.

Universal vision support across all providers. DecisionsAI now supports image input and vision capabilities across all major LLM providers: OpenAI, Anthropic (Claude), OpenRouter, Groq, KiloCode, and Ollama. Upload images directly in chat and ask questions about screenshots, photos, diagrams, or any visual content. All providers automatically optimize images with WebP compression (typically 25-35% smaller than PNG/JPEG), saving bandwidth and API costs while maintaining quality. Vision tools include built-in screenshot analysis and image processing capabilities that work with any vision-capable model.

Beyond voice commands, DecisionsAI includes a chat interface for text-based conversations with full conversation history and an interactive Oracle/globe visual interface. The chat interface also lets you download any generated text-to-speech audio as mp3 files for later use. For automation, the built-in Actions feature lets you record keyboard and mouse input as macros, then replay them on command—perfect for automating repetitive tasks or creating complex workflows that can be triggered with a simple voice command.

[Image: DecisionsAI UI]

IDE Integration & Project Workflows

DecisionsAI includes a Visual Studio Code extension that seamlessly integrates with your development workflow. When you're working on a project, simply tell DecisionsAI about it—the extension listens for tickets and instructions you create through voice commands. As you discuss features, bugs, or tasks, DecisionsAI automatically generates structured tickets that your IDE picks up and processes.

The workflow is simple: start a conversation about your project, describe what needs to be done, and DecisionsAI creates tickets with full context. IDEs like Cursor or Visual Studio Code with the extension installed will automatically detect these tickets and can begin working on them. This hands-free approach means you can brainstorm, plan, and delegate tasks to your IDE without ever touching the keyboard—DecisionsAI and your IDE work hand in glove to turn your ideas into code.

Performance & Architecture

DecisionsAI is built on the Pipecat framework, a real-time voice AI pipeline that significantly optimizes memory usage and performance. Pipecat orchestrates the flow of audio, text, and control frames between speech recognition (STT), language models (LLM), and text-to-speech (TTS) services using efficient frame-based communication.

Key Improvements & Optimizations

  • Memory Efficiency: Pipecat's frame-based architecture eliminates redundant data copying and enables efficient streaming, reducing overall memory footprint by up to 50% compared to traditional approaches
  • Real-time Processing: Frame-based communication enables low-latency voice interactions with minimal buffering
  • Interruption Handling: Built-in interruption support allows natural conversation flow with immediate response to user input
  • Streaming Architecture: Audio and text are processed in chunks, reducing memory spikes and enabling smooth performance on lower-end hardware
  • Service Coordination: Intelligent frame routing ensures optimal resource utilization across STT, LLM, and TTS services
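
To make the frame-based idea concrete, here is a minimal sketch in plain Python. It does not use Pipecat's actual API: the `Frame` class and the three stage functions are hypothetical stand-ins, chosen only to show how chained generators pass small chunks from STT to LLM to TTS without buffering the whole stream.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Frame:
    """Hypothetical frame: a small typed chunk passed between stages."""
    kind: str      # "audio", "text", or "speech"
    payload: str


def stt_stage(frames: Iterable[Frame]) -> Iterator[Frame]:
    # Stand-in for speech recognition: turn each audio frame into text.
    for frame in frames:
        if frame.kind == "audio":
            yield Frame("text", frame.payload.lower())


def llm_stage(frames: Iterable[Frame]) -> Iterator[Frame]:
    # Stand-in for the language model: answer each text frame as it arrives.
    for frame in frames:
        yield Frame("text", f"reply to: {frame.payload}")


def tts_stage(frames: Iterable[Frame]) -> Iterator[Frame]:
    # Stand-in for speech synthesis: wrap each reply as a speech frame.
    for frame in frames:
        yield Frame("speech", frame.payload)


def run_pipeline(audio_chunks: Iterable[str]) -> Iterator[Frame]:
    # Generators are chained lazily, so each chunk flows through all three
    # stages before the next one is read; nothing is buffered in bulk.
    source = (Frame("audio", chunk) for chunk in audio_chunks)
    return tts_stage(llm_stage(stt_stage(source)))


if __name__ == "__main__":
    for out in run_pipeline(["HELLO", "OPEN TERMINAL"]):
        print(out)
```

Because each stage pulls one frame at a time, memory use stays proportional to the chunk size, not the length of the conversation. That is the property the bullets above attribute to streaming.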

Technology Stack

Offline-First Core:

  • Whisper.cpp - Efficient offline speech recognition (supports various model sizes)
  • Kokoro - High-quality offline text-to-speech with natural voice synthesis
  • Ollama - Local language model inference (supports various models including Llama, Gemma, and more)
  • Pipecat-ai - Real-time voice AI pipeline orchestration framework with frame-based streaming

Third-Party Services (Optional - Enter API Keys to Enable):

AI Model Providers:

  • OpenRouter - Unified access to the latest models including GPT-5.2, Gemini 3 Flash, Nano Banana, Claude 3.7, and all cutting-edge model releases as they become available
  • OpenAI - GPT-5.2, GPT-4 Turbo, GPT-4o, and other OpenAI models
  • Anthropic - Claude 3.7 Sonnet, Claude 3 Opus, Claude 3 Haiku, and other Claude models
  • Ollama - Local and remote Ollama instances for self-hosted models

Speech & Voice Services:

  • ElevenLabs - Cloud-based text-to-speech with high-quality voice synthesis and voice cloning
  • AssemblyAI - Advanced speech recognition and transcription services

Integration & Automation Platforms:

  • Rube/Composio - Connect to 500+ apps and services for workflow automation (Slack, GitHub, Gmail, Notion, Google Workspace, Microsoft Teams, and more)

[Image: DecisionsAI About]

System Requirements

Local/Offline Mode (Default)

When running DecisionsAI in offline mode with local models, you'll need:

  • Operating System:
    • macOS: Fully tested and supported
    • Linux/Unix: Intended support (may require additional configuration)
    • Windows: Intended support (may require additional configuration)
  • RAM: Minimum 12GB (16GB recommended for optimal performance)
  • Python: 3.8 or higher
  • System Dependencies: PortAudio and FFmpeg
  • Disk Space: Minimum 6GB free space for model downloads
  • Internet Connection: Stable connection required for initial setup (model downloads are ~5GB total)

⏱️ Initial Setup & Model Downloads (Offline Mode):

  • Total download size: ~5.0GB (Kokoro TTS models: ~100MB + Ollama llama3.1:8b: ~4.9GB)
  • Download time estimates:
    • Fast connection (100 Mbps): ~7-10 minutes
    • Medium connection (50 Mbps): ~15-20 minutes
    • Slow connection (10 Mbps): ~1+ hours
  • Progress bars will be displayed during downloads. Please be patient and ensure you have a stable internet connection.
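
The download estimates above can be sanity-checked with simple arithmetic. The sketch below assumes decimal gigabytes and an ideal link with no protocol overhead, so it gives lower bounds; the helper name `download_minutes` is illustrative and is not part of DecisionsAI.

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Ideal transfer time in minutes, ignoring protocol overhead."""
    size_megabits = size_gb * 1000 * 8   # decimal GB -> megabits
    return size_megabits / speed_mbps / 60


if __name__ == "__main__":
    for speed in (100, 50, 10):
        print(f"{speed} Mbps: ~{download_minutes(5.0, speed):.0f} min")
```

For the ~5.0GB total this gives roughly 7, 13, and 67 minutes at 100, 50, and 10 Mbps. The ranges listed above are slightly more conservative, which leaves room for real-world overhead.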

Note: In offline mode, the application uses llama3.1:8b (~4.9GB) which stays loaded in memory for optimal performance. Combined with the operating system, application overhead, and other models (Kokoro TTS, Whisper.cpp), a minimum of 12GB RAM is required. 16GB is recommended for smooth operation, especially when running other applications simultaneously. Thanks to Pipecat's optimized architecture, DecisionsAI efficiently orchestrates audio and text processing with minimal memory overhead beyond the model itself.

Online/Cloud Mode (With API Keys)

Using OpenAI, Anthropic, OpenRouter, or other cloud-based services drastically reduces the system footprint!

When using online services, you can run DecisionsAI with significantly lower requirements:

  • Operating System:
    • macOS: Fully tested and supported
    • Linux/Unix: Intended support (may require additional configuration)
    • Windows: Intended support (may require additional configuration)
  • RAM: Minimum 4GB (8GB recommended)
  • Python: 3.8 or higher
  • System Dependencies: PortAudio and FFmpeg
  • Disk Space: Minimum 200MB free space (only for Kokoro TTS models and Whisper.cpp)
  • Internet Connection: Stable, high-speed connection required for real-time AI interactions

Benefits of Online Mode:

  • No large model downloads: No need to download 4.9GB Ollama models
  • Reduced memory usage: Models run on cloud servers, not your local machine
  • Access to latest models: Get instant access to Gemini 3, GPT-4 Turbo, Claude 3.5 Sonnet, and other cutting-edge models
  • Better performance on low-end hardware: Perfect for laptops and systems with limited RAM
  • Faster setup: Only download lightweight local components (~200MB total)

Note: You can mix and match! Use local models for privacy-sensitive tasks and cloud models for complex reasoning. DecisionsAI intelligently routes requests based on your configuration. The application includes cross-platform support with platform-specific optimizations for clipboard operations, keyboard shortcuts, and system paths.
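
As an illustration of mixing local and cloud models, here is a minimal routing sketch. It is an assumption about how such routing could work, not the application's actual logic: the `choose_backend` function, the provider order, and the `privacy_sensitive` flag are all hypothetical.

```python
from typing import Mapping

# Hypothetical provider preference order; the real application may differ.
CLOUD_PROVIDERS = ("openrouter", "openai", "anthropic")


def choose_backend(api_keys: Mapping[str, str], privacy_sensitive: bool) -> str:
    """Return the backend name for a request.

    Privacy-sensitive requests stay on the local Ollama model; other requests
    use the first cloud provider that has a non-empty API key, falling back to
    Ollama when no key is configured.
    """
    if privacy_sensitive:
        return "ollama"
    for provider in CLOUD_PROVIDERS:
        if api_keys.get(provider):
            return provider
    return "ollama"
```

With no keys configured every request resolves to `"ollama"`, which matches the offline-first default described earlier.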

Installation & Usage

Quick Start (Recommended)

The easiest way to get started is to use the provided executables, which handle all setup automatically:

  1. Clone the repository:

    git clone https://github.com/tensology/decisionsai.git
    cd decisionsai
  2. Run the appropriate executable for your platform:

    macOS:

    • Double-click decisions.app in Finder, or
    • Run ./decisions in Terminal

    Windows:

    • Double-click decisions.bat in File Explorer, or
    • Run decisions.bat from Command Prompt

    Unix/Linux:

    • Run ./decisions in Terminal

    These executables will automatically:

    • Check and install system dependencies (portaudio, ffmpeg) if missing
    • Detect or create a Python virtual environment (prioritizing virtualenvwrapper if available)
    • Install all Python dependencies from requirements.txt
    • Download required AI models (if not already present) via bin/setup.py
    • Start the application via bin/start.py

    Note: On Linux/macOS, system dependency installation may require sudo/admin privileges. On Windows, the script will attempt to use winget, Chocolatey, or Scoop if available.

  3. Interact with the assistant using voice commands.

Manual Installation & Setup

If you prefer to set up the project manually or need more control over the installation process, you can use the scripts in the bin/ directory directly.

Prerequisites

  • Python: 3.8 or higher
  • System Dependencies: PortAudio and FFmpeg
    • macOS: brew install portaudio ffmpeg
    • Linux: sudo apt-get install portaudio19-dev ffmpeg (Debian/Ubuntu) or equivalent
    • Windows: Install via winget, Chocolatey, or Scoop
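
Before continuing, it can help to confirm that the prerequisites are visible to Python. The following check is a sketch that uses only the standard library; `check_prerequisites` is an illustrative helper and does not ship with the project. The PortAudio lookup is best effort, since `ctypes.util.find_library` may miss libraries installed in non-standard locations (particularly on Windows).

```python
import ctypes.util
import shutil
import sys


def check_prerequisites() -> dict:
    """Report which prerequisites from the list above appear to be present."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # FFmpeg ships a command-line binary, so look for it on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # PortAudio is a shared library; find_library returns None if missing.
        "portaudio": ctypes.util.find_library("portaudio") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```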

Step 1: Set Up Python Environment

Create and activate a virtual environment:

# Using venv (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Or using virtualenvwrapper (if installed)
mkvirtualenv decisions

Step 2: Install Python Dependencies

# For Python 3.13+, set compatibility flag for tiktoken (LlamaIndex dependency)
# For Python 3.12 or earlier, you can skip the export line
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1  # Only needed for Python 3.13+
pip install -r requirements.txt
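
If you script the installation from Python instead of a shell, the same version-dependent flag can be applied programmatically. This is a sketch under that assumption; `pip_install_env` and `pip_install_command` are illustrative names, not functions from the repository.

```python
import os
import sys


def pip_install_env(version_info=sys.version_info) -> dict:
    """Return the environment to use for `pip install -r requirements.txt`."""
    env = dict(os.environ)
    if tuple(version_info[:2]) >= (3, 13):
        # Same flag as the export line above; only needed on Python 3.13+.
        env["PYO3_USE_ABI3_FORWARD_COMPATIBILITY"] = "1"
    return env


def pip_install_command() -> list:
    """Build the pip command for the current interpreter without running it."""
    return [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"]
```

Passing both to `subprocess.run(pip_install_command(), env=pip_install_env(), check=True)` would perform the install inside the active virtual environment.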

Optional Dependencies:

  • LlamaIndex (for RAG): Better performance for document indexing and retrieval

LlamaIndex is included in requirements.txt but can be installed separately if needed. See docs/INSTALLATION.md for details.

Step 3: Download AI Models

Run the setup script to download all required AI models:

# Download required models
python bin/setup.py

# Optional: Also install optional dependencies (LlamaIndex)
python bin/setup.py --install-optional

What bin/setup.py does:

The setup script downloads and configures the following AI models:

  1. Kokoro TTS Models (Text-to-Speech):

    • Downloads kokoro-v1.0.onnx (~50MB) - The main TTS model
    • Downloads voices-v1.0.bin (~10MB) - Voice configuration file
    • Location: ./distr/agent/models/
    • Source: kokoro-onnx releases
  2. Ollama Language Model:

    • Pulls llama3.1:8b model (~4.9GB) via Ollama API
    • Better accuracy and lower hallucination rates compared to smaller models like llama3.2:3b
    • Checks if model is up-to-date (within 24 hours) before re-downloading
    • Requires Ollama to be installed and running
    • Source: Ollama

Model Download Process:

  • The script checks if models already exist before downloading
  • Progress bars are displayed for large downloads
  • Failed downloads can be resumed by running the script again
  • Total download size: ~100MB (Kokoro models) + ~4.9GB (Ollama llama3.1:8b model) = ~5.0GB total
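
The skip-if-present and 24-hour freshness checks described above could look roughly like the sketch below. It is an assumption about the approach and not the contents of bin/setup.py; `needs_download`, `is_fresh`, and `MAX_AGE_SECONDS` are hypothetical names.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour window mentioned above


def needs_download(path: Path) -> bool:
    """A file needs downloading if it is missing or empty."""
    return not path.is_file() or path.stat().st_size == 0


def is_fresh(last_checked: float, now: Optional[float] = None) -> bool:
    """True if the last update check happened within the past 24 hours."""
    now = time.time() if now is None else now
    return (now - last_checked) < MAX_AGE_SECONDS
```

Treating an empty file as missing is one simple way to let a rerun recover from an interrupted download, in line with the resume behavior noted in the list.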

Optional: Vosk Speech Recognition Model

Whisper.cpp is the default STT (Speech-to-Text) engine and is included with the main setup. However, if you prefer to use Vosk as an alternative STT option, you can install it separately:

python bin/setup_vosk.py

What bin/setup_vosk.py does:

  • Downloads vosk-model-en-us-0.22.zip (~1.8GB) - English speech recognition model
  • Extracts to ./distr/agent/models/vosk-model-en-us-0.22/
  • Source: Vosk models

Note: Vosk is optional. The application uses Whisper.cpp by default for speech recognition. You can switch between Whisper.cpp and Vosk in the application settings (Settings > AI > Transcription Model) if both are installed.

Step 4: Start the Application

Once all models are downloaded, start the application:

python bin/start.py

What bin/start.py does:

  • Adds the project root to Python's path for module imports
  • Suppresses macOS memory logging warnings (if on macOS)
  • Initializes the AppKit framework for macOS integration (if on macOS)
  • Launches the main application via distr.app.run()

The application will start and you can begin interacting with the assistant using voice commands.

Contributing

We welcome contributions to DecisionsAI! If you have suggestions or improvements, please open an issue or submit a pull request.

Development Status

This project is actively being developed. Current focus areas include:

  • Improving voice recognition accuracy
  • Enhancing offline capabilities
  • Adding support for additional AI models
  • Enhanced dictation and transcription features

Code Execution

DecisionsAI includes built-in code execution capabilities, enabling the assistant to execute Python code, perform file operations, and carry out complex tasks on your local machine directly through voice commands or chat interactions.

Voice Commands

DecisionsAI responds to a wide range of voice commands.

Here's a comprehensive list of available commands:

Navigation and Window Management

| Command | Description |
|---|---|
| Open / Focus / Focus on | Open or focus on a specific window |
| Open file menu | Open the file menu |
| Hide oracle / Hide globe | Hide the oracle/globe interface |
| Show oracle / Show globe | Show the oracle/globe interface |
| Change oracle / Change globe | Change the oracle/globe interface |
| Change previous oracle / Change previous globe | Change to the previous oracle/globe image |
| Open GPT | Open GPT (Alt+Space shortcut) |
| Open spotlight / Spotlight search | Open Spotlight search (Cmd+Space) |
| New tab | Create a new tab (Cmd+T) |
| Previous tab | Switch to the previous tab (Cmd+Alt+Left) |
| Next tab | Switch to the next tab (Cmd+Alt+Right) |
| Close | Close the current window (Cmd+W) |
| Quit | Quit the current application (Cmd+Q) |

Chat Management

| Command | Description |
|---|---|
| New Chat / Start over / New conversation | Start a new chat conversation |

Text Editing and Navigation

| Command | Description |
|---|---|
| Copy | Copy selected text (Cmd+C) |
| Paste | Paste copied text (Cmd+V) |
| Cut | Cut selected text (Cmd+X) |
| Select all | Select all text (Cmd+A) |
| Undo | Undo last action (Cmd+Z) |
| Redo | Redo last undone action (Cmd+Shift+Z) |
| Back space / Backspace | Delete character before cursor |
| Delete | Delete character after cursor |
| Clear line | Clear the current line |
| Delete line | Delete the entire line (Cmd+Shift+K) |
| Force delete | Force delete (Cmd+Backspace) |

Caret Movement

| Command | Description |
|---|---|
| Up / Down / Left / Right | Move cursor in specified direction |
| Page up / Page down | Scroll page up/down (Fn+Up/Down) |
| Home | Move cursor to beginning of line (Fn+Left) |
| End | Move cursor to end of line (Fn+Right) |

Mouse Control

| Command | Description |
|---|---|
| Mouse up / Mouse down / Mouse left / Mouse right | Move mouse in specified direction |
| Mouse slow up / Mouse slow down / Mouse slow left / Mouse slow right | Move mouse slowly in specified direction |
| Move mouse center | Move mouse to center of screen |
| Move mouse middle | Move mouse to horizontal middle of screen |
| Move mouse vertical middle | Move mouse to vertical middle of screen |
| Move mouse top | Move mouse to top of screen |
| Move mouse bottom | Move mouse to bottom of screen |
| Move mouse far left | Move mouse to left edge of screen |
| Move mouse far right | Move mouse to right edge of screen |
| Right click | Perform a right-click |
| Click | Perform a left-click |
| Double click | Perform a double left-click |
| Scroll up / Scroll down | Scroll the page up/down |

Sound Controls

| Command | Description |
|---|---|
| Refresh / Reload | Refresh the current page (Cmd+R) |
| Pause / Stop / Play | Control media playback |
| Next track / Previous track | Switch between tracks |
| Mute | Mute audio |
| Volume up / Volume down | Adjust volume |

Function Keys

| Command | Description |
|---|---|
| Press F1 through Press F12 | Press the corresponding function key |

Special Keys

| Command | Description |
|---|---|
| Space bar / Space / Spacebar | Press the space bar |
| Control | Press the Control key |
| Command | Press the Command key |
| Enter this | Press the Enter key |
| Press alt / Alt | Press the Alt key |
| Press escape / Escape / Cancel | Press the Escape key |
| Tab | Press the Tab key |

AI Assistant Interactions

| Command | Description |
|---|---|
| Dictate | Start dictation mode, which types whatever you say except ending phrases such as "Enter this" |
| Transcribe / Listen / Listen to | Start transcription mode, which stores whatever you say in the clipboard until you say "Enter this" or "stop listening" |
| Read / Speak / Recite / Announce | Read out the transcribed text; if you say "this", read out the current selection instead |
| Agent / Hey / Jarvis | Activate the AI agent for complex tasks |
| Explain / Elaborate | Explain or elaborate on the text in the clipboard |
| Rework this / Reword this | Rework/improve selected text using LLM, updates clipboard, then pastes |
| Rework from clipboard / Reword from clipboard | Rework/improve clipboard content using LLM, updates clipboard (no paste) |
| Summarize this | Summarize selected text using LLM, updates clipboard, then pastes |
| Summarize from clipboard | Summarize clipboard content using LLM, updates clipboard (no paste) |
| What's in the clipboard / Get the clipboard / Show clipboard | Display current clipboard content in the conversation |
| Save this as audio | Generate audio from selected text using TTS and save it as a WAV file to the Desktop |
| Calculate / Figure out / Analyze | Perform calculations or analysis of clipboard content |
| Translate | Translate text from source language to target language |
| Type 'text' / Type "text" | Immediately type the specified text as keyboard input (e.g., "type 'hello world'" or "type from clipboard") |

Control Commands

| Command | Description |
|---|---|
| Start listening / Listen / Listen to | Begin voice command recognition |
| Stop listening / Stop / Halt | Stop voice command recognition |
| Stop speaking / Shut up / Be quiet | Stop the AI from speaking |
| Exit | Exit the application |
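
One way to think about the command tables above is as a lookup from spoken phrases to key combinations. The sketch below illustrates that idea with a handful of entries copied from the tables; the `SHORTCUTS` and `ALIASES` dictionaries and both functions are hypothetical and do not reflect how DecisionsAI stores its commands internally.

```python
from typing import Optional, Tuple

# A few entries copied from the tables above; keys are spoken phrases,
# values are the key combinations listed in the Description column.
SHORTCUTS = {
    "copy": ("cmd", "c"),
    "paste": ("cmd", "v"),
    "cut": ("cmd", "x"),
    "select all": ("cmd", "a"),
    "undo": ("cmd", "z"),
    "redo": ("cmd", "shift", "z"),
    "new tab": ("cmd", "t"),
    "close": ("cmd", "w"),
    "quit": ("cmd", "q"),
}

# Several phrases can map to one command, as with "Back space / Backspace".
ALIASES = {"back space": "backspace", "spacebar": "space bar", "space": "space bar"}


def normalize(phrase: str) -> str:
    """Lower-case, trim, and collapse whitespace, then resolve aliases."""
    cleaned = " ".join(phrase.lower().split())
    return ALIASES.get(cleaned, cleaned)


def lookup_shortcut(phrase: str) -> Optional[Tuple[str, ...]]:
    """Return the key combination for a spoken phrase, or None if unknown."""
    return SHORTCUTS.get(normalize(phrase))
```

Normalizing before the lookup means small transcription differences (extra spaces, capitalization) still resolve to the same command.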

Telegram Integration

DecisionsAI includes comprehensive Telegram integration for remote control and communication:

  • Remote Control: Say "remote control" or "remote" in Telegram to receive a link to a web-based remote control interface. This allows you to navigate and control your computer screens through WebSockets using your Telegram chat ID as the subscription identifier. You can view screens, take screenshots, control mouse position, click, double-click, type text, and send keyboard commands directly from the web interface.

  • Voice & Text Messages: Send voice messages or text to your connected Telegram bot, and DecisionsAI will process them as commands or questions, responding with voice notes, text, and screenshots as appropriate.

  • Connection: Connect your Telegram account through Settings > Advanced > Telegram Connection. Once connected, you can interact with DecisionsAI remotely via Telegram.

  • Optimized Performance:

    • Screenshots are automatically compressed to WebP format (typically 25-35% smaller than PNG/JPEG) for faster uploads and reduced bandwidth
    • Silent connection polling - ping/pong keepalive messages are handled silently without log spam
    • Smart auto-reconnect - automatic reconnections don't send notification messages to avoid spam
    • Efficient connection status tracking - only logs meaningful connection state changes
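
Logging only meaningful state changes, as the last bullet describes, can be done with a small tracker that ignores repeated status reports. This is a sketch of the general technique; the `ConnectionStatus` class and the logger name are hypothetical and are not taken from the DecisionsAI codebase.

```python
import logging

logger = logging.getLogger("telegram.connection")


class ConnectionStatus:
    """Hypothetical tracker that logs only when the state actually changes."""

    def __init__(self) -> None:
        self.state = "disconnected"
        self.transitions = []

    def update(self, new_state: str) -> bool:
        if new_state == self.state:
            return False  # repeated status (e.g. keepalive): stay silent
        logger.info("connection %s -> %s", self.state, new_state)
        self.transitions.append((self.state, new_state))
        self.state = new_state
        return True
```

Each keepalive that reports the same state returns `False` and writes nothing, so the log contains one line per real transition.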

Google Workspace Integration

DecisionsAI includes native Google Workspace integration for direct access to Google services:

  • Gmail: Read emails, send emails, create drafts, reply to messages, and manage your inbox with natural voice commands
  • Google Calendar: Create events, check your schedule, and manage appointments
  • Google Drive: List folders, read files, upload documents, and access PDFs
  • Google Docs: Create documents directly from markdown files
  • Google Sheets: Read and interact with spreadsheet data

Setup: Connect your Google account through Settings > Connections > Google Workspace. The integration uses OAuth 2.0 for secure authentication.

Why Native Integration? DecisionsAI prioritizes its native Google Workspace integration over third-party routing (like Rube/Composio) because direct API access provides faster response times and more reliable performance when working with your Google data.

Actions (Macro Recording & Playback)

DecisionsAI includes a powerful action recording system that lets you record keyboard and mouse input, then replay those actions on command:

  • Record Actions: Say "start recording" to begin capturing your keyboard presses, mouse movements, clicks, and drags. Everything you do is recorded with precise timing.

  • Stop Recording: Say "stop recording" or click the tray icon to stop. You'll be prompted to name your action.

  • Run Actions: Say "run action [name]" or "play action [name]" to replay the recorded sequence. DecisionsAI automatically generates trigger words from your action title, so an action named "Open Terminal and SSH" can be triggered by saying "open terminal", "SSH", or the full name.

  • Stop Playback: Say "stop action" to immediately halt a running action.
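
The trigger-word example above ("Open Terminal and SSH") suggests a simple rule: split the title on the word "and" and also accept the full title. The sketch below implements that rule only as an illustration; the actual generation logic is not documented here, and `trigger_phrases` is a hypothetical name.

```python
import re
from typing import List


def trigger_phrases(title: str) -> List[str]:
    """Split an action title on the word "and" and include the full title."""
    full = " ".join(title.lower().split())
    parts = [p.strip() for p in re.split(r"\band\b", full) if p.strip()]
    phrases = []
    for phrase in parts + [full]:
        if phrase not in phrases:   # keep order, drop duplicates
            phrases.append(phrase)
    return phrases
```

The `\b` word boundaries keep titles such as "Sandbox Setup" intact, since "and" only splits when it appears as a separate word.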

Use Cases:

  • Automate repetitive tasks (form filling, file operations, application workflows)
  • Create keyboard shortcuts for complex multi-step processes
  • Build macros for applications that don't support native automation

Management: Access the Actions window from the system tray menu to view, edit, rename, or delete your recorded actions.
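
Replaying a recording with its original timing comes down to storing a timestamp per event and sleeping for the gap between consecutive events. The sketch below shows that calculation under assumed data structures; `RecordedEvent` and `replay_schedule` are hypothetical and do not describe the real storage format.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class RecordedEvent:
    timestamp: float   # seconds since recording started
    kind: str          # e.g. "key_press", "mouse_click"
    detail: str        # e.g. "cmd+t" or "x=100,y=200"


def replay_schedule(events: List[RecordedEvent]) -> Iterator[Tuple[float, RecordedEvent]]:
    """Yield (delay_before_event, event) pairs that preserve original timing."""
    previous = 0.0
    for event in sorted(events, key=lambda e: e.timestamp):
        yield event.timestamp - previous, event
        previous = event.timestamp
```

A playback loop would call `time.sleep(delay)` before dispatching each event, and checking a stop flag between events is one way a "stop action" command could halt it promptly.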

Note: Voice recognition is currently limited to English. Some features may require internet connectivity depending on your configuration.

Summary

DecisionsAI is an intelligent digital assistant designed to understand and execute various tasks on your computer. It leverages cutting-edge AI technologies to provide voice interaction, automation, and adaptive learning capabilities.

License

This project is licensed under the TENSOLOGY COMMUNITY LICENSE AGREEMENT. See the LICENSE.md file for details.
