Skip to content

ftrain/ipeds2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

IPEDS 2022-23 PostgreSQL Database

A comprehensive, normalized, and human-readable PostgreSQL database containing all IPEDS (Integrated Postsecondary Education Data System) data for the 2022-23 academic year.

Overview

This database provides a complete, normalized structure for all IPEDS data, including:

  • Institution master data (directory information, characteristics, locations)
  • Enrollment data (fall enrollment, 12-month enrollment, distance education)
  • Completions/degrees (awards by program, field, demographics)
  • Finance data (revenues, expenses, endowments)
  • Human resources (staff, faculty, salaries)
  • Student aid (financial aid, net price)
  • Graduation rates and outcome measures
  • Admissions data
  • Academic libraries data

Database Structure

Tables Summary

  • 1 Institution Master Table: institution - Core directory information for all institutions
  • 67 Dimension/Lookup Tables: dim_* - Standardized code lookups (states, award levels, CIP codes, etc.)
  • 52 Fact Tables: fact_* - Survey data tables linked to institutions

Key Design Principles

  1. Normalized Structure: Lookup tables for all categorical variables to ensure data consistency
  2. Human-Readable Names: Tables and columns use descriptive snake_case names instead of cryptic codes
  3. Comprehensive Documentation: All tables and columns have COMMENT metadata with full descriptions
  4. Referential Integrity Ready: Foreign key relationships designed but not enforced (to allow flexible data loading)
  5. Proper Data Types: Numeric fields use appropriate INTEGER/BIGINT/NUMERIC types, text fields use VARCHAR/TEXT

Quick Start

Prerequisites

  • PostgreSQL 16+
  • Python 3.11+
  • Required Python packages: pandas, psycopg2-binary, openpyxl

Creating the Database

# 1. Create the database
createdb ipeds2022

# 2. Apply the schema
psql ipeds2022 < schema/ipeds_schema.sql

# 3. Import data (when CSV files are available)
python3 scripts/import_data.py

Connecting to the Database

psql ipeds2022

Database Schema Details

Institution Master Table

The institution table is the core table containing directory information for all institutions:

SELECT unitid, instnm, stabbr, city, sector
FROM institution
LIMIT 5;

Key columns:

  • unitid - Unique institution identifier (primary key)
  • instnm - Institution name
  • stabbr - State abbreviation
  • addr, city, zip - Location information
  • sector - Institutional sector/type
  • control - Public/private control
  • obereg - Geographic region
  • locale - Urban/suburban/rural locale
  • carnegie - Carnegie classification

Dimension Tables (dim_*)

Dimension tables provide standardized lookups for categorical variables:

  • dim_stabbr - State codes and names
  • dim_awlevel - Award/degree levels
  • dim_cipcode - CIP (Classification of Instructional Programs) codes
  • dim_sector - Institution sectors (4-year public, 2-year private, etc.)
  • dim_locale - Urban-centric locale classifications
  • And 62 more...

Example usage:

-- Get human-readable degree level names
SELECT c.awlevel, d.label
FROM fact_c2022_a c
JOIN dim_awlevel d ON c.awlevel = d.code
LIMIT 10;

Fact Tables (fact_*)

Fact tables contain the actual survey data, grouped by topic:

Enrollment

  • fact_ef2022* - Fall enrollment data (by demographics, distance education, age, major, residence)
  • fact_effy2022* - 12-month unduplicated headcount
  • fact_efia2022 - 12-month instructional activity

Completions

  • fact_c2022_a - Completions by program and demographics
  • fact_c2022_b - Completions by demographics
  • fact_c2022_c - Completions by award level, demographics, and age
  • fact_c2022_dep - Number of programs offered

Finance

  • fact_f2122_f1_a - Public institutions (GASB)
  • fact_f2122_f2 - Private not-for-profit or public using FASB
  • fact_f2122_f3 - Private for-profit institutions

Human Resources

  • fact_s2022_* - Staff data by occupation, rank, demographics
  • fact_sal2022_* - Salary data for instructional and non-instructional staff
  • fact_eap2022 - Employee by occupational category

Student Aid

  • fact_sfa2122_p1 - Student financial aid
  • fact_sfa2122_p2 - Student financial aid and net price
  • fact_sfav2122 - Military and veteran benefits

Graduation Rates

  • fact_gr2022* - Graduation rates by various cohorts and demographics
  • fact_gr200_22 - 200% graduation rates

Other

  • fact_adm2022 - Admissions and test scores
  • fact_al2022 - Academic libraries
  • fact_ic2022* - Institutional characteristics and student charges
  • fact_om2022 - Outcome measures
  • fact_flags2022 - Response status for all survey components

Example Queries

Find all institutions in California

SELECT unitid, instnm, city, sector
FROM institution
WHERE stabbr = 'CA'
ORDER BY instnm;

Get fall enrollment by institution

SELECT i.instnm, e.eftotlt as total_enrollment
FROM fact_ef2022_a e
JOIN institution i ON e.unitid = i.unitid
WHERE e.efalevel = '1'  -- All students
ORDER BY e.eftotlt DESC NULLS LAST
LIMIT 20;

Get completions by award level for an institution

SELECT
    d.label as degree_level,
    c.ctotalt as total_awards
FROM fact_c2022_a c
JOIN dim_awlevel d ON c.awlevel = d.code
JOIN institution i ON c.unitid = i.unitid
WHERE i.instnm LIKE '%Harvard%'
    AND c.majornum = '1'  -- First major
GROUP BY d.label, c.ctotalt
ORDER BY c.ctotalt DESC;

Get average net price for institutions

SELECT
    i.instnm,
    s.npist2 as avg_net_price
FROM fact_sfa2122_p2 s
JOIN institution i ON s.unitid = i.unitid
WHERE s.scugffn > 0  -- Has students in cohort
ORDER BY s.npist2
LIMIT 20;

Compare instructional expenses across institution types

SELECT
    ds.label as sector,
    AVG(f.f2c01) as avg_instruction_expense
FROM fact_f2122_f2 f
JOIN institution i ON f.unitid = i.unitid
JOIN dim_sector ds ON i.sector = ds.code
WHERE f.f2c01 IS NOT NULL
GROUP BY ds.label
ORDER BY avg_instruction_expense DESC;

Data Dictionary

The complete data dictionary is available in:

  • sources/IPEDS202223TablesDoc.xlsx - Original Excel format
  • ipeds_metadata.db - SQLite database for programmatic access

Querying the Metadata

import sqlite3

conn = sqlite3.connect('ipeds_metadata.db')

# Get all variables for a table
df = pd.read_sql("""
    SELECT varName, varTitle, DataType, longDescription
    FROM variables
    WHERE TableName = 'HD2022'
    ORDER BY varOrder
""", conn)

# Get all value labels for a variable
df = pd.read_sql("""
    SELECT Codevalue, valueLabel
    FROM value_sets
    WHERE varName = 'STABBR'
    ORDER BY valueOrder
""", conn)

Project Structure

ipeds2/
├── README.md                           # This file
├── schema/
│   └── ipeds_schema.sql               # Complete PostgreSQL schema (13,834 lines)
├── scripts/
│   ├── examine_data_dict.py           # Explore the data dictionary
│   ├── load_data_dict_to_sqlite.py    # Load metadata into SQLite
│   ├── design_schema.py               # Analyze and design database structure
│   ├── generate_postgres_schema.py    # Generate PostgreSQL DDL
│   ├── import_data.py                 # Import CSV data into PostgreSQL
│   └── download_ipeds_data.py         # Download data from NCES (if needed)
├── sources/
│   ├── IPEDS202223TablesDoc.xlsx      # Data dictionary (Excel)
│   ├── IPEDS202223.accdb              # Access database (Git LFS)
│   └── IPEDS202223_csv/               # CSV exports (Git LFS)
└── ipeds_metadata.db                  # SQLite metadata database

Data Sources

This database is derived from official IPEDS data from the National Center for Education Statistics (NCES):

Important Notes

Data Availability

The actual data files (CSV and Access database) are stored in Git LFS. To access them:

git lfs pull

Alternatively, download directly from NCES: https://nces.ed.gov/ipeds/datacenter/login.aspx

Data Integrity

  • All tables include comments with full descriptions
  • Lookup tables contain official IPEDS code values
  • Data types are mapped appropriately from IPEDS specifications
  • Foreign key constraints are defined but not enforced (commented out in schema) to allow flexible data loading

Performance Considerations

  • Indexes are created on unitid for all fact tables
  • Additional indexes may be beneficial depending on query patterns
  • Consider partitioning large tables if working with multiple years

Contributing

When adding new IPEDS years or modifying the schema:

  1. Update the data dictionary in sources/
  2. Run load_data_dict_to_sqlite.py to refresh metadata
  3. Re-generate schema with generate_postgres_schema.py
  4. Test data import with import_data.py
  5. Update this README with any structural changes

License

This database schema and tooling is provided as-is. IPEDS data is public domain and provided by NCES/US Department of Education.

Support

For IPEDS data questions, contact:

For database schema questions, refer to:

  • The generated schema comments (in PostgreSQL)
  • The metadata database (ipeds_metadata.db)
  • The original data dictionary (IPEDS202223TablesDoc.xlsx)

About

IPEDS data for academic years 2022-23 and 2023-24 with CSV conversions

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors