- Installation
- Problem Statement
- File Structure
- Outcomes of Analysis
- Licenses and Acknowledgement
Follow the step-by-step installation instructions in the running.md file in this directory.
Stack Overflow is a question-and-answer forum for technical enthusiasts, where users are free to ask and answer questions. Although the forum is filled with professional experts, the average response time for a question is around 4 days, which is a long wait. Because of this long turnaround, it is sometimes hard to tap the full potential of a developer. If posts are directed to the right person at the right time, questions will be answered sooner and less time will be spent waiting on unresolved ones. The main objective of this project is to analyze the Stack Overflow data dump and derive actionable insights that could help reduce response time, thereby increasing the productivity of the tech wizards. The analysis addresses the following questions:
- Active users from 2015, based on the number of questions and answers posted.
- Top 10 expert users in each tag based on upvotes.
- Average response time per Tag.
- Trend analysis of Badges and Tags.
- Topographical visuals of Tags and Badges.
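As an illustration of the "average response time per tag" question above, here is a minimal sketch in plain Python. It is not the project's actual pipeline. It assumes each post is a dict with the `Id`, `PostTypeId` (1 for questions, 2 for answers), `ParentId`, `CreationDate`, and `Tags` fields found in the Stack Overflow data dump, that tags are stored as `<tag1><tag2>`, and that response time means the gap between a question and its earliest answer. The function names and `SAMPLE_POSTS` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def parse_tags(raw):
    """Split a tag string such as '<python><pandas>' into ['python', 'pandas']."""
    return raw.replace("><", " ").strip("<>").split()


def average_response_hours(posts):
    """Return {tag: mean hours between a question and its earliest answer}."""
    # Index questions by Id so answers can be matched to them.
    questions = {p["Id"]: p for p in posts if p["PostTypeId"] == 1}

    # Find the earliest answer timestamp for each question.
    first_answer = {}
    for p in posts:
        if p["PostTypeId"] != 2 or p["ParentId"] not in questions:
            continue
        created = datetime.fromisoformat(p["CreationDate"])
        previous = first_answer.get(p["ParentId"])
        if previous is None or created < previous:
            first_answer[p["ParentId"]] = created

    # Accumulate [total_hours, question_count] per tag.
    totals = defaultdict(lambda: [0.0, 0])
    for question_id, answered in first_answer.items():
        question = questions[question_id]
        asked = datetime.fromisoformat(question["CreationDate"])
        hours = (answered - asked).total_seconds() / 3600
        for tag in parse_tags(question["Tags"]):
            totals[tag][0] += hours
            totals[tag][1] += 1

    return {tag: total / count for tag, (total, count) in totals.items()}


# Hypothetical sample data, not taken from the real dump.
SAMPLE_POSTS = [
    {"Id": 1, "PostTypeId": 1, "CreationDate": "2015-01-01T00:00:00",
     "Tags": "<python><pandas>"},
    {"Id": 2, "PostTypeId": 2, "ParentId": 1,
     "CreationDate": "2015-01-01T06:00:00"},
    {"Id": 3, "PostTypeId": 2, "ParentId": 1,
     "CreationDate": "2015-01-01T02:00:00"},
    {"Id": 4, "PostTypeId": 1, "CreationDate": "2015-01-02T00:00:00",
     "Tags": "<python>"},
    {"Id": 5, "PostTypeId": 2, "ParentId": 4,
     "CreationDate": "2015-01-02T04:00:00"},
]
```

Calling `average_response_hours(SAMPLE_POSTS)` averages the first-answer delay over every answered question carrying each tag. Unanswered questions are ignored in this sketch, so a real analysis would need to decide how to treat them.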
The repository is organized as follows:
├── README.md
├── app
│ ├── Dockerfile
│ ├── answered_app.py
│ ├── app.py
│ ├── requirements.txt
│ └── users_app.py
├── src
│ ├── badges_features.py
│ ├── badges_trends.py
│ ├── explorer.py
│ ├── location.py
│ ├── ml-model-builder.py
│ ├── preprocessing
│ │ ├── Tags.py
│ │ ├── comments.py
│ │ ├── posts.py
│ │ ├── users.py
│ │ └── votes.py
│ ├── user-answers.py
│ ├── xml_parquet_badges.py
│ └── xml_parquet_postlinks.py
├── start-up
│ ├── start-up-linux.sh
│ └── start-up-mac.sh
└── output
The visualizations from the Stack Overflow data analysis can be found in the output folder.
The data set was obtained from Stack Exchange. Data processing, storage, and visualization were done on Google Cloud Platform.