GitHub - Elavarasan17/stackoverflow-data-dump-analysis: Big Data Analytics Project

Installation

Follow the RUNNING.MD file in the repository root, which describes the installation procedure step by step.

Problem Statement

Stack Overflow is a question-and-answer forum for technical enthusiasts where users are free to ask and answer questions. Although the forum is full of professional experts, the average response time for a question is around 4 days, which is a long wait. Because of this turnaround time, it is sometimes hard to tap the full potential of a developer. If posts were directed to the right person at the right time, questions would be answered sooner, reducing the time spent on unresolved questions. The main objective of this project is to analyze the Stack Overflow data dump and derive actionable insights that could help reduce the response time, thereby increasing the productivity of the tech wizards. The analysis answers the following questions:

Active Users from 2015 based on the number of Q/A posted.
Top 10 expert users in each tag based on upvotes.
Average response time per Tag.
Trend analysis of Badges and Tags.
Topographical visuals of Tags and Badges.
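
As an illustration of two of the questions above (average response time per tag and top expert users per tag), here is a minimal sketch in plain Python. It is not the project's own code: the record layout (`id`, `tags`, `created`, `question_id`, `user`, `score`), the sample data, and both function names are hypothetical, and "response time" is assumed to mean the delay until the first answer.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, hand-made records; the real data dump has many more fields.
questions = [
    {"id": 1, "tags": ["python", "pandas"], "created": datetime(2020, 1, 1, 9, 0)},
    {"id": 2, "tags": ["python"], "created": datetime(2020, 1, 2, 9, 0)},
    {"id": 3, "tags": ["java"], "created": datetime(2020, 1, 3, 9, 0)},  # unanswered
]
answers = [
    {"question_id": 1, "user": "alice", "score": 5, "created": datetime(2020, 1, 1, 11, 0)},
    {"question_id": 1, "user": "bob", "score": 2, "created": datetime(2020, 1, 1, 15, 0)},
    {"question_id": 2, "user": "bob", "score": 7, "created": datetime(2020, 1, 2, 13, 0)},
]


def average_response_time_per_tag(questions, answers):
    """Mean delay between a question and its first answer, grouped by tag."""
    # Earliest answer timestamp for each question.
    first_answer = {}
    for a in answers:
        qid = a["question_id"]
        if qid not in first_answer or a["created"] < first_answer[qid]:
            first_answer[qid] = a["created"]

    # Collect one delay per (answered question, tag) pair.
    delays = defaultdict(list)
    for q in questions:
        if q["id"] in first_answer:  # unanswered questions are skipped
            delay = first_answer[q["id"]] - q["created"]
            for tag in q["tags"]:
                delays[tag].append(delay)

    return {tag: sum(d, timedelta()) / len(d) for tag, d in delays.items()}


def top_experts_per_tag(questions, answers, n=10):
    """Top n users per tag, ranked by the total score of their answers."""
    tags_by_question = {q["id"]: q["tags"] for q in questions}
    scores = defaultdict(lambda: defaultdict(int))
    for a in answers:
        for tag in tags_by_question.get(a["question_id"], []):
            scores[tag][a["user"]] += a["score"]
    # Sort by score descending, then by user name to break ties.
    return {
        tag: sorted(users.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        for tag, users in scores.items()
    }
```

Calling `average_response_time_per_tag(questions, answers)` returns a dictionary mapping each tag to a `timedelta`, and `top_experts_per_tag(questions, answers)` returns a ranked list of `(user, total_score)` pairs per tag. On the full data dump the same grouping would more likely be done with a distributed engine rather than in-memory dictionaries.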

File Structure

The file structure is shown below:

├── README.md
├── app
│   ├── Dockerfile
│   ├── answered_app.py
│   ├── app.py
│   ├── requirements.txt
│   └── users_app.py
├── src
│   ├── badges_features.py
│   ├── badges_trends.py
│   ├── explorer.py
│   ├── location.py
│   ├── ml-model-builder.py
│   ├── preprocessing
│   │   ├── Tags.py
│   │   ├── comments.py
│   │   ├── posts.py
│   │   ├── users.py
│   │   └── votes.py
│   ├── user-answers.py
│   ├── xml_parquet_badges.py
│   └── xml_parquet_postlinks.py
├── start-up
│   ├── start-up-linux.sh
│   └── start-up-mac.sh
└── output

Outcome of Analysis

The visualizations from the Stack Overflow data analysis can be found in the output folder.

Licenses and Acknowledgement

The data set was obtained from Stack Exchange. Data processing, storage, and visualization were done using Google Cloud Platform.
