prathyyyyy/readme.md

Hi 👋 I'm Prathy P

Data Systems & Machine Learning Engineer

Professional Summary

Data Systems and Machine Learning Engineer with experience designing high-throughput batch and real-time data pipelines, lakehouse architectures, and production ML platforms on AWS and Azure. Skilled in Spark, Kafka, Databricks, and vector search systems, with a strong focus on building scalable, reliable data and ML infrastructure for real-world applications.

🌍 India and open for relocation
✉️ [email protected]
🤝 Open to Data Engineer | Data Scientist | ML Engineer roles


🚀 What I Build

  • High-throughput batch & real-time data pipelines (Spark, Kafka, Kinesis, Flink)
  • Lakehouse architectures using Delta, Iceberg, Hudi, Unity Catalog
  • Streaming analytics & security detection systems
  • ML pipelines on Spark with GPU acceleration
  • Vector search & semantic retrieval systems using FAISS & embeddings
  • Multimodal RAG systems (text + image retrieval)
  • Production ML with monitoring, CI/CD, and drift detection

🧠 Core Expertise

Data Engineering

PySpark • Kafka • Kinesis • Flink • Databricks • Delta Lake • Iceberg • Hudi • Unity Catalog

Machine Learning Systems

Spark ML • XGBoost4J-Spark • RAPIDS • Evidently AI • SageMaker Pipelines

Vector & LLM Systems

FAISS • Sentence-BERT Embeddings • Multimodal RAG • LangChain

Cloud

AWS (Glue, Lambda, Athena, S3, SageMaker)
Azure (Databricks, Data Factory, Azure ML, DevOps)

Backend

.NET • PostgreSQL • Docker • Flask API


🏗️ Featured Projects

🔹 High-Throughput E-Commerce Streaming Analytics & Security Detection

  • Processed 67M+ events
  • Built batch + real-time analytics pipelines
  • Apache Hudi → 50% faster queries, 40% less storage
  • Kinesis + Flink + DynamoDB for DDoS/Bot detection

🔹 Truck Delay Prediction using Spark ML + GPU XGBoost

  • XGBoost4J-Spark + RAPIDS Accelerator
  • Production pipeline with SageMaker + Evidently AI
  • Drift monitoring, CI/CD, orchestration

🔹 Semantic Search & Relevance Platform

  • Sentence-BERT embeddings
  • FAISS vector retrieval
  • Iceberg storage + Dockerized Flask API

🔹 Multimodal RAG Food Recommendation System

  • Text + Image embeddings
  • FAISS vector indexing
  • Streamlit app deployed on AWS

🔹 Databricks Streaming ETL (Medallion Architecture)

  • Kafka + PySpark streaming joins
  • Unity Catalog governance
  • Azure DevOps CI/CD

🏅 Certification

Microsoft Certified: Azure Data Scientist Associate (DP-100)
https://learn.microsoft.com/en-us/users/prathy-0029/credentials/certification/azure-data-scientist


🛠️ Tech Stack

Python • PySpark • SQL • Spark ML • Kafka • Databricks • AWS • Azure • FAISS • Docker • PostgreSQL • PowerBI


🤝 Let’s Collaborate

I love working on:

  • Distributed data systems
  • ML at scale
  • Vector search & RAG systems
  • Streaming analytics

⚡ Fun Fact

I enjoy translating complex data problems into scalable engineering systems.

Pinned

  1. Forest-Fire-Detection (Public)

    Forest Fire Detection By Convolutional Neural Network

    Jupyter Notebook 15 4

  2. Medical-Data-Extraction (Public)

    Medical Data Extraction with Pytesseract (a Python wrapper for Google's Tesseract OCR engine) and Computer Vision

    Jupyter Notebook 17 3