This repository predicts book and beer ratings by estimating line slopes and intercepts with the closed-form linear-algebraic (matrix-inverse) least-squares solution, then verifying the fitted theta along with Mean Squared Error and mean absolute error values.
This repository contains two assignments focused on feature engineering, linear and logistic models, and a simple item–item collaborative filtering approach. The code is organized to run with the provided runner notebooks and uses clean, modular functions for reproducibility.
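As a concrete illustration of the closed-form fit described above, the minimal sketch below estimates an offset and slope for ratings against scaled review length using the normal equation theta = (X^T X)^{-1} X^T y. The toy data and variable names are illustrative, not taken from the assignment code.

```python
import numpy as np

# Toy (review length, rating) pairs; illustrative only.
lengths = np.array([120.0, 45.0, 300.0, 210.0, 80.0])
ratings = np.array([4.0, 3.0, 5.0, 4.5, 3.5])

# Scale lengths by the maximum so the feature lies in [0, 1].
scaled = lengths / lengths.max()

# Design matrix with an explicit offset (bias) column.
X = np.column_stack([np.ones_like(scaled), scaled])

# Closed-form least squares via the matrix inverse: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X.T @ X) @ X.T @ ratings

predictions = X @ theta
mse = np.mean((ratings - predictions) ** 2)
mae = np.mean(np.abs(ratings - predictions))
```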
Length-based regression: Predict ratings using scaled review length.
Time features:
Reduced one-hot encoding for weekday and month, constrained to fit dimension limits (see the sketch after this list).
Numeric encoding variant with explicit offset term.
Train/test split evaluation to compare encodings.
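A minimal sketch of the reduced one-hot encoding mentioned above, assuming zero-indexed weekday and month values; the function names are hypothetical. Dropping the first category keeps the dimension down and avoids redundancy with the explicit offset term.

```python
def reduced_one_hot(value, num_values):
    # Encode value in {0, ..., num_values - 1} as num_values - 1 indicators;
    # the first category is represented by the all-zeros vector.
    vec = [0.0] * (num_values - 1)
    if value > 0:
        vec[value - 1] = 1.0
    return vec

def time_features(weekday, month):
    # Explicit offset term, then weekday (7 -> 6 dims) and month (12 -> 11 dims).
    return [1.0] + reduced_one_hot(weekday, 7) + reduced_one_hot(month, 12)
```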
Beer sentiment classification:
Baseline feature: review text length.
Improved features: length plus available subratings.
Logistic regression with balanced class weighting.
Precision@K evaluation using predicted probabilities.
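Roughly, the classification pipeline above could look like the following sketch. The tiny arrays are placeholders; only LogisticRegression with class_weight="balanced" and predict_proba reflect the stated approach, and precision_at_k is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def precision_at_k(y_true, scores, k):
    # Fraction of true positives among the k highest-scoring examples.
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Placeholder features (review length, one subrating) and binary labels.
X_train = np.array([[120.0, 4.0], [45.0, 2.5], [300.0, 4.5], [80.0, 3.0]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

scores = model.predict_proba(X_train)[:, 1]  # positive-class probabilities
p_at_2 = precision_at_k(y_train, scores, k=2)
```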
ABV classification (beer):
Modular feature builder combining style one-hot, subratings, and scaled length.
Logistic regression with validation-based selection of regularization strength.
Ablation to quantify the contribution of each feature group.
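The validation-based selection named above might be sketched as follows, with BER computed as one minus balanced accuracy; the candidate grid and split variables are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def select_c(X_train, y_train, X_valid, y_valid,
             candidates=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Pick the inverse regularization strength C with the lowest validation BER.
    best_c, best_ber = None, float("inf")
    for c in candidates:
        model = LogisticRegression(C=c, class_weight="balanced", max_iter=1000)
        model.fit(X_train, y_train)
        ber = 1.0 - balanced_accuracy_score(y_valid, model.predict(X_valid))
        if ber < best_ber:
            best_c, best_ber = c, ber
    return best_c, best_ber
```

The same loop supports the ablation: rebuild the feature matrix with one feature group removed and compare the resulting BERs.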
Rating prediction (books):
Item–item collaborative filtering using Jaccard similarity over user sets.
Fallbacks to item and global averages for sparse cases.
Blended predictor that combines neighbor estimate with user/item baselines.
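A compact sketch of the item–item predictor with its fallbacks, assuming dictionaries that map items to their user sets and users to their (item, rating) history; the names are illustrative, and the blending step (mixing this estimate with user/item baselines) is omitted for brevity.

```python
def jaccard(s1, s2):
    # Jaccard similarity between two sets of users.
    union = len(s1 | s2)
    return len(s1 & s2) / union if union else 0.0

def predict_rating(user, item, users_per_item, ratings_by_user,
                   item_means, global_mean):
    # Similarity-weighted average of the user's other ratings; fall back to
    # the item mean, then the global mean, when no neighbors exist.
    target_users = users_per_item.get(item, set())
    num = den = 0.0
    for other_item, rating in ratings_by_user.get(user, []):
        if other_item == item:
            continue
        sim = jaccard(target_users, users_per_item.get(other_item, set()))
        num += sim * rating
        den += sim
    if den > 0:
        return num / den
    return item_means.get(item, global_mean)
```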
Place homework1.py and homework2.py in the same directory as the runner notebooks.
Open each runner (Jupyter Notebook or JupyterLab).
Restart the kernel to pick up changes.
Run all cells. Outputs include metrics (MSE, BER, precision@K) and sample results.
Books:
Text: review_text
Rating: rating (fallbacks: star_rating, overall)
Time: parsed_date (provided by runner)
Beer:
Text: review/text
Labels: review/overall (classification), beer/ABV (diagnostics)
Category: beer/style
Subratings: review/aroma, review/appearance, review/palate, review/taste, review/overall
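Given the field names above, reading a book rating with its fallbacks might look like the minimal helper below; the dictionary layout of a review record is an assumption.

```python
def get_book_rating(review):
    # Try the primary key first, then the documented fallbacks.
    for key in ("rating", "star_rating", "overall"):
        if review.get(key) is not None:
            return float(review[key])
    return None
```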
Explicit feature scaling for lengths using the training-set maximum (see the sketch after this list).
Reduced one-hot encodings to respect dimensionality constraints.
Bias (offset) term handled according to each question's expectations.
Class weighting for imbalanced classification tasks.
Deterministic, readable functions to simplify grading and reuse.
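For the training-set-maximum scaling, a minimal sketch (function and variable names hypothetical); fixing the divisor at training time keeps the test transformation consistent with training:

```python
def fit_length_scaler(train_lengths):
    # Record the training-set maximum so test data is scaled identically.
    max_len = max(train_lengths) or 1  # guard against an all-zero feature
    return lambda length: length / max_len

scale = fit_length_scaler([120, 45, 300, 210, 80])
train_feature = scale(300)  # 1.0
test_feature = scale(450)   # may exceed 1.0; still divided by the training max
```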
1. Feature engineering under dimensionality constraints.
2. Linear and logistic modeling with proper evaluation (MSE, BER).
3. Regularization selection via validation.
4. Recommender system fundamentals with robust fallbacks.
5. Clean module organization and reproducible experiments.
Python 3.x (standard library)
numpy, scikit-learn
Jupyter Notebook or JupyterLab