A professional ETL (Extract, Transform, Load) pipeline for collecting, processing, and analyzing character data from the Rick and Morty API. The system automates data collection from the public API, processes it, and stores it in multiple formats (JSON, CSV, PostgreSQL), providing a production-ready foundation for comprehensive analysis.
- Extraction: Secure and reliable data retrieval from Rick and Morty API
- Transformation: Data cleaning, enrichment, and preparation for analysis
- Load: Data storage in multiple formats (JSON, CSV, PostgreSQL)
- PostgreSQL Support: Full database integration with connection management
- Configurable Settings: Secure database configuration using INI files
- Connection Pooling: Robust connection handling with error recovery
- Modular design allowing easy integration of new data sources
- Multi-format storage support with database expansion capability
- Comprehensive error handling ensuring operational continuity
- Intelligent request management respecting server limitations
- Parallel data processing across multiple pages
- Request rate control preventing blocking and ensuring stability
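The pagination and rate-control behavior described above can be sketched as follows. This is an illustrative sketch, not the package's actual code: the injectable `fetch_page` callable and the `delay` parameter are assumptions made here so the loop is easy to test without network access.

```python
import time
from typing import Callable

def get_all_characters(fetch_page: Callable[[int], dict], delay: float = 0.3) -> list:
    """Walk the paginated /character endpoint, collecting every result.

    fetch_page(page) is expected to return the API's standard envelope:
    {"info": {"pages": N, ...}, "results": [...]}.
    """
    characters = []
    page = 1
    total_pages = 1
    while page <= total_pages:
        payload = fetch_page(page)
        total_pages = payload["info"]["pages"]
        characters.extend(payload["results"])
        page += 1
        if page <= total_pages:
            time.sleep(delay)  # request rate control: pause between pages
    return characters
```

With `requests`, `fetch_page` could be something like `lambda p: requests.get("https://rickandmortyapi.com/api/character", params={"page": p}, timeout=10).json()`.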
- Python 3.7+
- PostgreSQL 12+
- pip package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/rick-morty-etl.git
cd rick-morty-etl

# Install dependencies
pip install -r requirements.txt
```

- Install PostgreSQL and create a database
- Update the `database.ini` file with your credentials:

```ini
[postgresql]
host=localhost
database=postgres
user=postgres
password=[your_password]
port=[your_port]
connect_timeout=10
```

```
rick-morty-etl/
├── src/
│   ├── __init__.py
│   ├── etl_pipeline.py        # Main ETL pipeline
│   ├── database/
│   │   ├── __init__.py
│   │   ├── config.py          # Database configuration
│   │   └── connection.py      # Database connection manager
│   └── utils/
│       ├── __init__.py
│       ├── data_processor.py  # Data transformation functions
│       └── file_exporter.py   # JSON/CSV export functions
├── config/
│   └── database.ini           # Database configuration
├── requirements.txt
└── README.md
```
```python
from src.etl_pipeline import main_etl_pipeline

# Run complete ETL pipeline
main_etl_pipeline()
```

```python
from src.database.connection import db_connection
from src.etl_pipeline import get_all_characters, save_to_postgresql
from src.utils.file_exporter import save_to_json, save_to_csv

# Custom ETL process
characters = get_all_characters()

# Save to PostgreSQL with custom settings
with db_connection(section='production') as conn:
    save_to_postgresql(characters, connection=conn)

# Export to files
save_to_json(characters, 'output/characters.json')
save_to_csv(characters, 'output/characters.csv')
```

```python
from src.database.connection import db_connection
from src.database.config import test_connection

# Test database connection
test_connection()

# Execute custom queries
with db_connection() as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM characters")
        count = cursor.fetchone()[0]
        print(f"Total characters: {count}")
```

```sql
CREATE TABLE characters (
    id INTEGER PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    status VARCHAR(50),
    species VARCHAR(100),
    episode_count INTEGER,
    location VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

```sql
-- Character statistics by status
SELECT status, COUNT(*) AS count
FROM characters
GROUP BY status
ORDER BY count DESC;

-- Top characters by episode appearance
SELECT name, episode_count
FROM characters
ORDER BY episode_count DESC
LIMIT 10;

-- Species distribution
SELECT species, COUNT(*) AS count
FROM characters
GROUP BY species
ORDER BY count DESC;
```

Create `config/database.ini`:

```ini
[postgresql]
host=localhost
database=postgres
user=postgres
password=[your_password]
port=[your_port]
connect_timeout=10
```
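For reference, `src/database/config.py` might read this file with the standard-library `configparser`. This is a sketch of one plausible implementation; the `load_config` name and error-handling choices are assumptions, not the package's actual API.

```python
from configparser import ConfigParser

def load_config(filename: str = "config/database.ini",
                section: str = "postgresql") -> dict:
    """Read one section of the INI file into a dict of connection parameters."""
    parser = ConfigParser()
    if not parser.read(filename):
        raise FileNotFoundError(f"Config file not found: {filename}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] missing from {filename}")
    return dict(parser.items(section))
```

The returned dict can then be unpacked straight into the driver, e.g. `psycopg2.connect(**load_config())`.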
```bash
export DB_HOST=localhost
export DB_NAME=rick_morty_db
export DB_USER=postgres
export DB_PASSWORD=your_password
```

```python
from src.database.connection import test_connection

# Test connection to default database
test_connection()

# Test connection to specific section
test_connection(section='postgresql')
```

The system includes comprehensive error handling for:

- Network connectivity issues
- API rate limiting
- Database connection failures
- Data parsing errors
- File I/O operations
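For the transient failures in this list (network hiccups, rate limiting, dropped database connections), a small retry helper with exponential backoff is one common pattern. The `with_retries` helper below is hypothetical, not part of the package:

```python
import time

def with_retries(operation, attempts: int = 3, base_delay: float = 1.0,
                 retry_on: tuple = (Exception,)):
    """Call operation(); on a retryable error, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # 1x, 2x, 4x, ... the base delay between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```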
```python
import psycopg2

from src.database.connection import db_connection

try:
    with db_connection() as conn:
        # Database operations
        pass
except psycopg2.OperationalError as e:
    print(f"Database connection failed: {e}")
    # Implement retry logic or fallback
except Exception as e:
    print(f"Unexpected error: {e}")
```

```json
[
  {
    "id": 1,
    "name": "Rick Sanchez",
    "status": "Alive",
    "species": "Human",
    "episode_count": 51,
    "location": "Earth"
  }
]
```

```csv
id,name,status,species,episode_count,location
1,Rick Sanchez,Alive,Human,51,Earth
2,Morty Smith,Alive,Human,39,Earth
```
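The two exporters in `src/utils/file_exporter.py` that produce these files might look roughly like this. This is a sketch: the field list and the directory-creation behavior are assumptions, not the package's actual implementation.

```python
import csv
import json
from pathlib import Path

FIELDS = ["id", "name", "status", "species", "episode_count", "location"]

def save_to_json(characters: list, path: str) -> None:
    """Write the character records as pretty-printed JSON."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(characters, fh, indent=2)

def save_to_csv(characters: list, path: str) -> None:
    """Write the character records as CSV with a fixed header row."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(characters)
```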
- Create new extractor in `src/extractors/`
- Implement transformation logic in `src/utils/data_processor.py`
- Update database schema if needed
- Add tests in `tests/`
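As a concrete example of the transformation step, a per-record function in `src/utils/data_processor.py` could flatten the raw API payload into the row shape shown in the output formats above. `transform_character` is an illustrative name and the fallback defaults are assumptions:

```python
def transform_character(raw: dict) -> dict:
    """Flatten one raw API record into the row shape used by the CSV/DB outputs."""
    return {
        "id": raw["id"],
        "name": raw["name"].strip(),
        "status": raw.get("status", "unknown"),
        "species": raw.get("species", "unknown"),
        # The API returns a list of episode URLs; we only keep the count
        "episode_count": len(raw.get("episode", [])),
        # location is a nested object: {"name": ..., "url": ...}
        "location": (raw.get("location") or {}).get("name", "unknown"),
    }
```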
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Rick and Morty API for providing the data
- PostgreSQL community for excellent database support
- Python community for robust data processing libraries
If you have any questions or need help:
- Open an issue
- Check the examples directory
- Review the documentation
- Real-time data streaming
- Docker containerization
- REST API for data access
- Dashboard for visualization
- Machine learning integration
⭐ If you find this project useful, please give it a star! ⭐