This is a quick script to move the Stack Overflow data from the Stack Exchange data dump (Sept '14) to a PostgreSQL database.
Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
- Create the database `stackoverflow` in your Postgres instance: `CREATE DATABASE stackoverflow;`. You can also use a custom database name; make sure to pass it explicitly when executing the script later.
- Move the following files to the folder from which the program is executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In some old dumps, the filenames use different casing.
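If you have one of the old dumps with differently cased filenames, a small helper like the following (a sketch, not part of the loader; `canonical_name` and `normalize_dump_filenames` are hypothetical names) can rename them to the expected casing:

```python
import os

# Filenames as the loader expects them (newer dumps use this casing).
EXPECTED = ("Badges.xml", "Votes.xml", "Posts.xml", "Users.xml", "Tags.xml")

def canonical_name(filename):
    """Return the expected casing for a dump file, or None if unrelated."""
    lookup = {name.lower(): name for name in EXPECTED}
    return lookup.get(filename.lower())

def normalize_dump_filenames(directory="."):
    """Rename e.g. badges.xml -> Badges.xml so the loader finds it."""
    for entry in os.listdir(directory):
        target = canonical_name(entry)
        if target and entry != target:
            os.rename(os.path.join(directory, entry),
                      os.path.join(directory, target))
```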
- Execute in the current folder (in parallel, if desired):

  ```
  python load_into_pg.py Badges
  python load_into_pg.py Posts
  python load_into_pg.py Tags   # not present in the earliest dumps
  python load_into_pg.py Users
  python load_into_pg.py Votes
  ```
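If you prefer not to juggle shell jobs, the parallel invocations above can be sketched from Python itself (a hedged example; `run_parallel` is a hypothetical helper, and the loader is assumed to be in the current directory):

```python
import subprocess
import sys

def run_parallel(commands):
    """Start every command at once, then wait for all; return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]

# One loader invocation per table; drop "Tags" for the earliest dumps.
tables = ["Badges", "Posts", "Tags", "Users", "Votes"]
commands = [[sys.executable, "load_into_pg.py", t] for t in tables]
# run_parallel(commands)  # uncomment to launch all five loads at once
```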
- Finally, after all the initial tables have been created:

  `psql stackoverflow < ./sql/final_post.sql`

  If you used a different database name, use it in place of `stackoverflow` in this step.
- For some additional indexes and tables, you can also execute the following:

  `psql stackoverflow < ./sql/optional_post.sql`

  Again, remember to use the correct database name here if it is not `stackoverflow`.
- It prepares some indexes and views which may not be necessary for your analysis.
- The `Body` field in the `Posts` table is NOT populated.
- The `EmailHash` field in the `Users` table is NOT populated.
- Some tables (e.g. `PostHistory` and `Comments`) are missing.
- The `tags.xml` is missing from the data dump. Hence, the `PostTag` and `UserTagQA` tables will be empty after `final_post.sql`.
- The `ViewCount` in `Posts` is sometimes an empty value. It is replaced by `NULL` in those cases.
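The empty-to-`NULL` substitution can be sketched as follows (a simplified illustration, not the loader's actual code; `view_count_or_null` is a hypothetical helper operating on a row's XML attribute dict):

```python
def view_count_or_null(attrs):
    """Return ViewCount as an int, or None (NULL in Postgres) when absent or empty."""
    value = attrs.get("ViewCount", "")
    return int(value) if value else None
```

Psycopg and similar drivers translate Python's `None` to SQL `NULL` automatically, so returning `None` here is all that is needed.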