This is a quick script to move the Stack Overflow data from the StackExchange data dump (Sept '14) into a PostgreSQL database.
Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.
- Create the database `stackoverflow` on your PostgreSQL server: `CREATE DATABASE stackoverflow;`
  - You can use a custom database name as well. Make sure to pass it explicitly when executing the script later.
- Move the following files to the folder from where the program is executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
  - In some old dumps, the filenames use different letter case.
- Execute in the current folder (in parallel, if desired):
  - `python load_into_pg.py Badges`
  - `python load_into_pg.py Posts`
  - `python load_into_pg.py Tags` (not present in earliest dumps)
  - `python load_into_pg.py Users`
  - `python load_into_pg.py Votes`
- Finally, after all the initial tables have been created:
  - `psql stackoverflow < ./sql/final_post.sql`
  - If you used a different database name, make sure to use that instead of `stackoverflow` while executing this step.
  - This SQL script prepares some indexes and views which may not be necessary for your analysis.
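
The loading steps above can also be driven from a small wrapper. The following is only a sketch, not part of the project: the helper names `find_dump_files`, `build_command`, and `load_all` are hypothetical, and `extra_args` is a pass-through for whatever options your copy of `load_into_pg.py` expects (for example, a custom database name). It checks which dump files are present, ignoring filename case, and starts one loader process per table.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Table names from the steps above; Tags may be absent in the earliest dumps.
TABLES = ["Badges", "Posts", "Tags", "Users", "Votes"]


def find_dump_files(folder, tables=TABLES):
    """Map each table to its XML file in `folder`, ignoring filename case.

    Tables without a matching file are left out of the result.
    """
    by_lower = {p.name.lower(): p for p in Path(folder).iterdir() if p.is_file()}
    found = {}
    for table in tables:
        match = by_lower.get(f"{table}.xml".lower())
        if match is not None:
            found[table] = match
    return found


def build_command(table, extra_args=()):
    """Build the loader command for one table, e.g. `python load_into_pg.py Badges`."""
    return [sys.executable, "load_into_pg.py", table, *extra_args]


def load_all(folder=".", extra_args=(), max_workers=5):
    """Run one loader process per available table, in parallel.

    Returns a dict mapping each table name to its process exit code.
    """
    tables = list(find_dump_files(folder))

    def run(table):
        return subprocess.run(build_command(table, extra_args), cwd=folder).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(tables, pool.map(run, tables)))
```

Calling `load_all()` from the folder that holds the dump files and `load_into_pg.py` would start the loads. The sketch does not check whether the loader itself accepts files whose names differ in case; `find_dump_files` returns the actual paths so that such files can be renamed first.
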
- The `body` field in the `Posts` table is NOT populated.
- The `emailhash` field in the `Users` table is NOT populated.
- Some tables (e.g. `PostHistory` and `Comments`) are missing.
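
If your analysis needs the post bodies that are not loaded, they can be read straight from the dump file. The sketch below is not part of this script; it assumes that `Posts.xml` stores each post as a `<row>` element with `Id` and `Body` attributes, which you should verify against your copy of the dump. The function name `iter_post_bodies` is hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_post_bodies(xml_path):
    """Yield (post id, body) pairs from a Posts.xml-style file, one row at a time."""
    # iterparse streams the file instead of loading it all at once.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            # Assumed attribute names: Id and Body; a missing Body becomes "".
            yield int(elem.attrib["Id"]), elem.attrib.get("Body", "")
            # Release the row's data once it has been read.
            elem.clear()
```

The resulting pairs could be written to a separate table or file and joined to `Posts` by id.
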