Skip to content

binary-amigo/crunchbase-automation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

🚀 Crunchbase Automation System

A powerful, multi-client CSV processing system that automatically uploads company data to Google Sheets with intelligent duplicate prevention and client isolation.

✨ Features

🔒 Multi-Client Support

  • Client A & Client B - Separate data sheets for different companies
  • Global Duplicate Prevention - Master sheet prevents cross-client duplicates
  • Visual Separation - Color-coded rows for easy client identification

📊 Smart Data Processing

  • CSV Upload - Drag & drop or file picker interface
  • Column Mapping - Automatic CSV to Google Sheets column mapping
  • Bulk Processing - Efficient handling of large datasets (1000+ companies)
  • Real-time Progress - Live progress bar and status updates

🔐 Google Sheets Integration

  • Master Sheet - Global registry of all companies (Client, Company, Date Added)
  • Client Sheets - Full data storage for each client
  • OAuth 2.0 - Secure Google API authentication
  • Automatic Updates - Next available row detection

🎨 Modern User Interface

  • React + Vite - Fast, responsive frontend
  • Tailwind CSS - Beautiful, modern design
  • Drag & Drop - Intuitive file upload
  • Real-time Feedback - Processing status and progress tracking

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend        │    │   Google Sheets │
│   (React)       │◄──►│   (Flask)        │◄──►│   API           │
│                 │    │                  │    │                 │
│ • File Upload   │    │ • CSV Processing │    │ • Master Sheet  │
│ • Client Select │    │ • Duplicate Check│    │ • Client Sheets │
│ • Progress Bar  │    │ • Data Mapping   │    │ • Visual Colors │
└─────────────────┘    └──────────────────┘    └─────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Node.js 16+
  • Google Cloud Project with Sheets API enabled
  • Google Service Account credentials

1. Clone Repository

git clone <your-repo-url>
cd Crunchbase-AUtomation

2. Backend Setup

cd backend

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Setup environment variables
cp env.example .env
# Edit .env with your Google Sheets configuration

# Upload Google credentials
# - credentials.json (Service Account key)
# - token.json (OAuth tokens)

# Run backend
python app.py

3. Frontend Setup

cd frontend

# Install dependencies
npm install

# Setup environment variables
cp .env.example .env.development  # For local development
cp .env.example .env.production   # For production deployment

# Edit environment files with your configuration
# .env.development - Use for local development
# .env.production - Use for production deployment

# Run development server
npm run dev

4. Access Application

⚙️ Configuration

Environment Variables (.env)

# Master Sheet Configuration
MASTER_SHEET_ID=your_master_sheet_id
MASTER_SHEET_NAME=Master

# Client A Configuration
CLIENT_A_NAME=Client A
CLIENT_A_SHEET_ID=client_a_sheet_id
CLIENT_A_SHEET_NAME=Sheet1

# Client B Configuration
CLIENT_B_NAME=Client B
CLIENT_B_SHEET_ID=client_b_sheet_id
CLIENT_B_SHEET_NAME=Sheet1

# Google API Configuration
GOOGLE_SHEETS_CREDENTIALS_FILE=credentials.json
GOOGLE_SHEETS_TOKEN_FILE=token.json

# File Upload Configuration
UPLOAD_FOLDER=uploads
MAX_CONTENT_LENGTH=16777216

Frontend Configuration

The frontend uses environment variables for configuration:

# .env.development (Local Development)
VITE_API_BASE_URL=http://localhost:5000
VITE_APP_NAME=Crunchbase Automation (Dev)

# .env.production (Production)
VITE_API_BASE_URL=https://your-domain.com
VITE_APP_NAME=Crunchbase Automation

Important: Update VITE_API_BASE_URL in .env.production to match your deployed backend URL.

Column Mapping

The system automatically maps CSV columns to Google Sheets headers:

CSV Column Sheet Column Action
Organization Name Company Name Direct Map
Website Website Direct Map
LinkedIn Company Linkedin Direct Map
Last Funding Type Campaign Direct Map
Date Today's date (DD/MM/YY)
Source "Crunchbase"
POCs Leave blank
Email ID Leave blank
Reachout LinkedIn Leave blank
Reachout Email Leave blank
Response Leave blank

📁 Project Structure

Crunchbase-AUtomation/
├── backend/                    # Python Flask backend
│   ├── app.py                 # Main Flask application
│   ├── config.py              # Configuration management
│   ├── csv_processor.py       # CSV processing logic
│   ├── google_sheets_service.py # Google Sheets integration
│   ├── master_sheet_service.py # Master sheet operations
│   ├── requirements.txt       # Python dependencies
│   ├── .env                   # Environment variables
│   ├── credentials.json       # Google API credentials
│   └── uploads/               # File upload directory
├── frontend/                   # React frontend
│   ├── src/                   # Source code
│   │   ├── App.jsx            # Main application component
│   │   ├── App.css            # Application styles
│   │   └── main.jsx           # Application entry point
│   ├── package.json           # Node.js dependencies
│   └── dist/                  # Built frontend files
├── .gitignore                 # Git ignore rules
├── README.md                  # This file
├── deploy-digitalocean.sh     # DigitalOcean deployment script
└── DIGITALOCEAN_SETUP.md      # DigitalOcean setup guide

🚀 Deployment

DigitalOcean (Recommended)

# 1. Create $6/month droplet
# 2. SSH into your droplet
ssh root@your-droplet-ip

# 3. Run deployment script
curl -O https://raw.githubusercontent.com/your-repo/deploy-digitalocean.sh
chmod +x deploy-digitalocean.sh
./deploy-digitalocean.sh

Other Cloud Platforms

  • Heroku: Use Procfile and environment variables
  • Google Cloud Run: Containerized deployment
  • AWS EC2: Similar to DigitalOcean setup

📊 How It Works

1. File Upload

  • User selects CSV file and client
  • Frontend sends file to backend API

2. Data Processing

  • Backend parses CSV and maps columns
  • Checks master sheet for existing companies
  • Filters out duplicates

3. Google Sheets Update

  • Adds new companies to master sheet
  • Updates client-specific sheet with full data
  • Applies visual formatting (client colors)

4. Real-time Feedback

  • Progress bar shows processing status
  • Success/error messages displayed
  • Processing details shown to user

🔒 Security Features

  • OAuth 2.0 authentication with Google
  • Service Account for secure API access
  • Environment Variables for sensitive data
  • Input Validation and sanitization
  • File Type Restrictions (CSV only)
  • Client Isolation (no cross-contamination)

🧪 Testing

Backend Testing

cd backend
python -m pytest tests/

Frontend Testing

cd frontend
npm test

API Testing

# Health check
curl http://localhost:5000/api/health

# Get available clients
curl http://localhost:5000/api/clients

# Test master sheet connection
curl http://localhost:5000/api/test-master-connection

🚨 Troubleshooting

Common Issues

Authentication Errors

  • Verify credentials.json is correct service account key
  • Check token.json exists and is valid
  • Ensure Google Sheets API is enabled

Column Mapping Issues

  • Check CSV column names match expected format
  • Verify Google Sheet headers exist
  • Review column mapping configuration

Duplicate Detection Problems

  • Ensure master sheet is properly initialized
  • Check DUPLICATE_CHECK_FIELDS configuration
  • Verify company name field mapping

Debug Mode

# Enable debug logging
export FLASK_ENV=development
python app.py

📈 Performance

  • Processing Speed: 1000+ companies in under 2 minutes
  • Memory Usage: Efficient pandas operations
  • API Calls: Minimal Google Sheets API usage
  • Scalability: Easy to add more clients

🔄 Updates & Maintenance

Update Application

git pull origin main
cd backend && pip install -r requirements.txt
cd ../frontend && npm install && npm run build
systemctl restart crunchbase-automation  # If deployed

Backup Strategy

  • Google Sheets: Version history and revision tracking
  • Application Data: Git repository backup
  • Credentials: Secure credential storage

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

  • Issues: Create GitHub issue
  • Documentation: Check this README and setup guides
  • Deployment: Refer to DIGITALOCEAN_SETUP.md

🎯 Roadmap

  • Multi-file Upload - Process multiple CSVs simultaneously
  • Advanced Filtering - Industry, funding, location filters
  • Data Export - Export processed data to various formats
  • Analytics Dashboard - Processing statistics and insights
  • API Rate Limiting - Better Google API quota management
  • Webhook Support - Notifications on completion

Built with ❤️ for efficient business automation

Your Crunchbase data, organized and secure across multiple clients. 🚀

About

Crunchbase to Excel Automation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors