This project contains Google Cloud Functions that stream data from Firestore to BigQuery in real time. It also includes utility functions to manage tables and table schemas in BigQuery from the command line.
- Real-time streaming of Firestore changes to buffer tables in BigQuery. Nested objects and arrays are flattened. During this process:
  - Warning messages are saved to a warning table for records whose schemas do not match the pre-defined schema.
  - Error messages are saved to an error table for records that cannot be processed successfully.
  - Other logic (data cleanup, data transformation, etc.) can be added to this process.
- At defined intervals (e.g. every 30 minutes), buffer tables are synchronized into target tables (the tables to be used for downstream analysis). The data sync is scheduled using Cloud Scheduler, but can also be triggered manually for flexibility.
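The flattening of nested objects and arrays mentioned above can be sketched roughly as follows. This is an illustrative implementation, not the project's actual code; the function name and the `_`-joined column naming are assumptions.

```javascript
// Minimal sketch: flatten a nested Firestore document into a single-level
// object whose keys can map to BigQuery columns. Nested keys are joined
// with "_" and array elements get indexed suffixes (illustrative choices).
function flattenRecord(obj, prefix = "", out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const column = prefix ? `${prefix}_${key}` : key;
    if (Array.isArray(value)) {
      // Arrays become indexed columns, e.g. tags_0, tags_1.
      value.forEach((item, i) => {
        if (item !== null && typeof item === "object") {
          flattenRecord(item, `${column}_${i}`, out);
        } else {
          out[`${column}_${i}`] = item;
        }
      });
    } else if (value !== null && typeof value === "object") {
      flattenRecord(value, column, out);
    } else {
      out[column] = value;
    }
  }
  return out;
}

// Example: a nested profile object becomes flat columns.
const row = flattenRecord({ user: { name: "a", tags: ["x", "y"] } });
// row = { user_name: "a", user_tags_0: "x", user_tags_1: "y" }
```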
`localRun.js` can be run directly, or with arguments from the command line, to invoke local utility functions for manually managing Firestore and BigQuery data.
- A Google Cloud project with Firestore and BigQuery enabled.
- Node.js and npm installed locally.
- The `gcloud` CLI installed and authenticated.
```
git clone https://github.com/episphere/stream-Firestore-to-BigQuery.git
cd stream-Firestore-to-BigQuery
npm install
```

Check the `settings.js` file for any necessary configuration, such as the target dataset name, buffer dataset name, error and warning table names, and the collection names to be tracked.
Check the `tableSchemas.js` file for the schemas of the target tables. Adjust the schemas as needed.
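For orientation, a table schema entry in `tableSchemas.js` might look like the following. The table and field names here are hypothetical, not the project's actual schemas; BigQuery schemas are arrays of field definitions with `name`, `type`, and `mode`.

```javascript
// Hypothetical schema for a "participants" target table. Each entry is a
// BigQuery field definition: name, type (STRING, TIMESTAMP, INTEGER, ...),
// and mode (REQUIRED, NULLABLE, or REPEATED).
const participantsSchema = [
  { name: "document_id", type: "STRING", mode: "REQUIRED" },
  { name: "updated_at", type: "TIMESTAMP", mode: "NULLABLE" },
  // A flattened nested field: profile.age becomes profile_age.
  { name: "profile_age", type: "INTEGER", mode: "NULLABLE" },
];

module.exports = { participantsSchema };
```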
Create buffer tables for an environment (e.g., dev, prod). The defined dataset name and table schemas are used in this step.
```
node localRun.js --entry createAllBufferTables --gcloud --env dev
```

Create target tables for an environment (e.g., dev, prod):
```
node localRun.js --entry createAllTargetTables --gcloud --env dev
```

Create error and warning tables for an environment (e.g., dev, prod):
```
node localRun.js --entry createLogTables --gcloud --env dev
```

Deploy the streaming function:

```
gcloud functions deploy stream-firestore-updates \
  --source=. \
  --gen2 \
  --runtime=nodejs22 \
  --entry-point=streamFirestoreUpdates \
  --ingress-settings=internal-only \
  --region=us-central1 \
  --trigger-location=nam5 \
  --trigger-event-filters=type=google.cloud.firestore.document.v1.written \
  --trigger-event-filters=database='(default)' \
  --memory=1Gi \
  --cpu=1 \
  --timeout=300s \
  --concurrency=80
```

The `stream-firestore-updates` function is triggered by Firestore write events. It streams the changes to the buffer dataset (default name `firestore_stream_buffer`) in BigQuery.
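Inside a `google.cloud.firestore.document.v1.written` event, document fields arrive in Firestore's typed-value encoding (each value wrapped in a marker such as `stringValue`, `integerValue`, `mapValue`, or `arrayValue`). A sketch of decoding that encoding is shown below; the deployed function's actual parsing may differ, and the helper names are illustrative.

```javascript
// Decode one Firestore typed value into a plain JavaScript value.
function decodeValue(v) {
  if ("nullValue" in v) return null;
  if ("stringValue" in v) return v.stringValue;
  if ("booleanValue" in v) return v.booleanValue;
  if ("integerValue" in v) return Number(v.integerValue); // delivered as a string
  if ("doubleValue" in v) return v.doubleValue;
  if ("timestampValue" in v) return v.timestampValue;
  if ("mapValue" in v) return decodeFields(v.mapValue.fields || {});
  if ("arrayValue" in v) return (v.arrayValue.values || []).map(decodeValue);
  return undefined; // unhandled types (bytes, geo points, references, ...)
}

// Decode a whole "fields" map from the event payload.
function decodeFields(fields) {
  const out = {};
  for (const [k, v] of Object.entries(fields)) out[k] = decodeValue(v);
  return out;
}

// Example payload fragment as it might arrive in a written event:
const doc = decodeFields({
  name: { stringValue: "alice" },
  scores: { arrayValue: { values: [{ integerValue: "7" }] } },
});
// doc = { name: "alice", scores: [7] }
```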
```
gcloud functions deploy sync-batched-updates-to-tables \
  --gen2 \
  --trigger-http \
  --region=us-central1 \
  --runtime=nodejs22 \
  --source=. \
  --entry-point=syncBatchedUpdatesToTables \
  --ingress-settings=internal-only
```

The `sync-batched-updates-to-tables` function is responsible for merging the buffered data into the target tables in the target dataset (default name `firestore_stream`).
HTTP requests to this function can be scheduled using Cloud Scheduler or triggered manually.
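The buffer-to-target merge could be expressed as a BigQuery `MERGE` statement like the one built below. This is a rough sketch under assumptions: the dataset, table, and key column names (`document_id`, `updated_at`) are placeholders, not the project's actual SQL.

```javascript
// Build an illustrative MERGE that folds the latest buffered version of
// each document into the target table, keyed on a hypothetical
// document_id column.
function buildMergeSql(project, bufferDataset, targetDataset, table) {
  return `
MERGE \`${project}.${targetDataset}.${table}\` T
USING (
  -- Keep only the most recent buffered row per document.
  SELECT * EXCEPT (rn) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY document_id ORDER BY updated_at DESC
    ) AS rn
    FROM \`${project}.${bufferDataset}.${table}\`
  ) WHERE rn = 1
) S
ON T.document_id = S.document_id
WHEN MATCHED THEN UPDATE SET T.updated_at = S.updated_at
WHEN NOT MATCHED THEN INSERT ROW`.trim();
}

const sql = buildMergeSql(
  "my-project", "firestore_stream_buffer", "firestore_stream", "participants"
);
```

A statement like this can then be submitted from the sync function via the BigQuery client's query API, after which the matching buffer rows can be deleted.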