A comprehensive, normalized, and human-readable PostgreSQL database containing all IPEDS (Integrated Postsecondary Education Data System) data for the 2022-23 academic year.
This database provides a complete, normalized structure for all IPEDS data, including:
- Institution master data (directory information, characteristics, locations)
- Enrollment data (fall enrollment, 12-month enrollment, distance education)
- Completions/degrees (awards by program, field, demographics)
- Finance data (revenues, expenses, endowments)
- Human resources (staff, faculty, salaries)
- Student aid (financial aid, net price)
- Graduation rates and outcome measures
- Admissions data
- Academic libraries data
- 1 Institution Master Table:
institution- Core directory information for all institutions - 67 Dimension/Lookup Tables:
dim_*- Standardized code lookups (states, award levels, CIP codes, etc.) - 52 Fact Tables:
fact_*- Survey data tables linked to institutions
- Normalized Structure: Lookup tables for all categorical variables to ensure data consistency
- Human-Readable Names: Tables and columns use descriptive
snake_casenames instead of cryptic codes - Comprehensive Documentation: All tables and columns have COMMENT metadata with full descriptions
- Referential Integrity Ready: Foreign key relationships designed but not enforced (to allow flexible data loading)
- Proper Data Types: Numeric fields use appropriate INTEGER/BIGINT/NUMERIC types, text fields use VARCHAR/TEXT
- PostgreSQL 16+
- Python 3.11+
- Required Python packages:
pandas,psycopg2-binary,openpyxl
# 1. Create the database
createdb ipeds2022
# 2. Apply the schema
psql ipeds2022 < schema/ipeds_schema.sql
# 3. Import data (when CSV files are available)
python3 scripts/import_data.pypsql ipeds2022The institution table is the core table containing directory information for all institutions:
SELECT unitid, instnm, stabbr, city, sector
FROM institution
LIMIT 5;Key columns:
unitid- Unique institution identifier (primary key)instnm- Institution namestabbr- State abbreviationaddr,city,zip- Location informationsector- Institutional sector/typecontrol- Public/private controlobereg- Geographic regionlocale- Urban/suburban/rural localecarnegie- Carnegie classification
Dimension tables provide standardized lookups for categorical variables:
- dim_stabbr - State codes and names
- dim_awlevel - Award/degree levels
- dim_cipcode - CIP (Classification of Instructional Programs) codes
- dim_sector - Institution sectors (4-year public, 2-year private, etc.)
- dim_locale - Urban-centric locale classifications
- And 62 more...
Example usage:
-- Get human-readable degree level names
SELECT c.awlevel, d.label
FROM fact_c2022_a c
JOIN dim_awlevel d ON c.awlevel = d.code
LIMIT 10;Fact tables contain the actual survey data, grouped by topic:
fact_ef2022*- Fall enrollment data (by demographics, distance education, age, major, residence)fact_effy2022*- 12-month unduplicated headcountfact_efia2022- 12-month instructional activity
fact_c2022_a- Completions by program and demographicsfact_c2022_b- Completions by demographicsfact_c2022_c- Completions by award level, demographics, and agefact_c2022_dep- Number of programs offered
fact_f2122_f1_a- Public institutions (GASB)fact_f2122_f2- Private not-for-profit or public using FASBfact_f2122_f3- Private for-profit institutions
fact_s2022_*- Staff data by occupation, rank, demographicsfact_sal2022_*- Salary data for instructional and non-instructional stafffact_eap2022- Employee by occupational category
fact_sfa2122_p1- Student financial aidfact_sfa2122_p2- Student financial aid and net pricefact_sfav2122- Military and veteran benefits
fact_gr2022*- Graduation rates by various cohorts and demographicsfact_gr200_22- 200% graduation rates
fact_adm2022- Admissions and test scoresfact_al2022- Academic librariesfact_ic2022*- Institutional characteristics and student chargesfact_om2022- Outcome measuresfact_flags2022- Response status for all survey components
SELECT unitid, instnm, city, sector
FROM institution
WHERE stabbr = 'CA'
ORDER BY instnm;SELECT i.instnm, e.eftotlt as total_enrollment
FROM fact_ef2022_a e
JOIN institution i ON e.unitid = i.unitid
WHERE e.efalevel = '1' -- All students
ORDER BY e.eftotlt DESC NULLS LAST
LIMIT 20;SELECT
d.label as degree_level,
c.ctotalt as total_awards
FROM fact_c2022_a c
JOIN dim_awlevel d ON c.awlevel = d.code
JOIN institution i ON c.unitid = i.unitid
WHERE i.instnm LIKE '%Harvard%'
AND c.majornum = '1' -- First major
GROUP BY d.label, c.ctotalt
ORDER BY c.ctotalt DESC;SELECT
i.instnm,
s.npist2 as avg_net_price
FROM fact_sfa2122_p2 s
JOIN institution i ON s.unitid = i.unitid
WHERE s.scugffn > 0 -- Has students in cohort
ORDER BY s.npist2
LIMIT 20;SELECT
ds.label as sector,
AVG(f.f2c01) as avg_instruction_expense
FROM fact_f2122_f2 f
JOIN institution i ON f.unitid = i.unitid
JOIN dim_sector ds ON i.sector = ds.code
WHERE f.f2c01 IS NOT NULL
GROUP BY ds.label
ORDER BY avg_instruction_expense DESC;The complete data dictionary is available in:
- sources/IPEDS202223TablesDoc.xlsx - Original Excel format
- ipeds_metadata.db - SQLite database for programmatic access
import sqlite3
conn = sqlite3.connect('ipeds_metadata.db')
# Get all variables for a table
df = pd.read_sql("""
SELECT varName, varTitle, DataType, longDescription
FROM variables
WHERE TableName = 'HD2022'
ORDER BY varOrder
""", conn)
# Get all value labels for a variable
df = pd.read_sql("""
SELECT Codevalue, valueLabel
FROM value_sets
WHERE varName = 'STABBR'
ORDER BY valueOrder
""", conn)ipeds2/
├── README.md # This file
├── schema/
│ └── ipeds_schema.sql # Complete PostgreSQL schema (13,834 lines)
├── scripts/
│ ├── examine_data_dict.py # Explore the data dictionary
│ ├── load_data_dict_to_sqlite.py # Load metadata into SQLite
│ ├── design_schema.py # Analyze and design database structure
│ ├── generate_postgres_schema.py # Generate PostgreSQL DDL
│ ├── import_data.py # Import CSV data into PostgreSQL
│ └── download_ipeds_data.py # Download data from NCES (if needed)
├── sources/
│ ├── IPEDS202223TablesDoc.xlsx # Data dictionary (Excel)
│ ├── IPEDS202223.accdb # Access database (Git LFS)
│ └── IPEDS202223_csv/ # CSV exports (Git LFS)
└── ipeds_metadata.db # SQLite metadata database
This database is derived from official IPEDS data from the National Center for Education Statistics (NCES):
- Website: https://nces.ed.gov/ipeds/
- Data year: 2022-23
- Release: Final/revised
The actual data files (CSV and Access database) are stored in Git LFS. To access them:
git lfs pullAlternatively, download directly from NCES: https://nces.ed.gov/ipeds/datacenter/login.aspx
- All tables include comments with full descriptions
- Lookup tables contain official IPEDS code values
- Data types are mapped appropriately from IPEDS specifications
- Foreign key constraints are defined but not enforced (commented out in schema) to allow flexible data loading
- Indexes are created on
unitidfor all fact tables - Additional indexes may be beneficial depending on query patterns
- Consider partitioning large tables if working with multiple years
When adding new IPEDS years or modifying the schema:
- Update the data dictionary in
sources/ - Run
load_data_dict_to_sqlite.pyto refresh metadata - Re-generate schema with
generate_postgres_schema.py - Test data import with
import_data.py - Update this README with any structural changes
This database schema and tooling is provided as-is. IPEDS data is public domain and provided by NCES/US Department of Education.
For IPEDS data questions, contact:
- IPEDS Help Desk: https://nces.ed.gov/ipeds/use-the-data/contact-us
For database schema questions, refer to:
- The generated schema comments (in PostgreSQL)
- The metadata database (
ipeds_metadata.db) - The original data dictionary (
IPEDS202223TablesDoc.xlsx)