Tags: timeoutdigital/to-data-library
DSS-3346 Rewrite s3_to_gs and gs_to_bq to use new GCS file structure (#… …33)

See the [JIRA ticket](https://timeoutgroup.atlassian.net/jira/software/c/projects/DSS/boards/72?selectedIssue=DSS-3346) for more detail on what is required. This updates the to-data-library to provide methods s3_to_gs and gs_to_bq which enforce the use of our new GCS file structure. I have also updated the bq method load_table_from_dataframe to handle the case where a file has multiple partition dates and needs to be separated into individual loads (as is the case with Ozone, which has 7 days' data per file). I have tested this as part of [DSS-3169](https://timeoutgroup.atlassian.net/browse/DSS-3169?focusedCommentId=265783). Unit tests still need to be added and will be done as part of a future PR before the JIRA ticket is closed. The PR is quite large, so I'm happy to discuss it with the reviewer on a call if preferred.

[DSS-3169]: https://timeoutgroup.atlassian.net/browse/DSS-3169?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Here's a screenshot of the tests passing: <img width="1908" height="64" alt="image" src="https://github.com/user-attachments/assets/3652a8f3-b3ae-49f5-928a-aa0cdae42199" />

Note: we could do with more tests, especially on error handling, but I'm reluctant to add any at the moment as the PR is already big.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
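The multi-partition-date handling described above can be sketched roughly as follows. This is not the library's actual implementation; the helper name, field name and data are illustrative assumptions, but the core idea is the same: group rows by their partition date so each group can be loaded into its own date partition in a separate load job.

```python
from collections import defaultdict

def split_by_partition_date(rows, date_field="partition_date"):
    # Group rows by partition date; each resulting group would be
    # loaded into its own BigQuery date partition as a separate load.
    groups = defaultdict(list)
    for row in rows:
        groups[row[date_field]].append(row)
    return dict(groups)

# A single file covering several days, as with the Ozone feed:
rows = [
    {"partition_date": "2025-01-01", "value": 1},
    {"partition_date": "2025-01-02", "value": 2},
    {"partition_date": "2025-01-01", "value": 3},
]
batches = split_by_partition_date(rows)
# batches has one entry per distinct date; each entry is loaded separately.
```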
DSS-3231 Add success return to gs_to_bq transfer method (#31) Without this, the method can fail silently: there is a try-except (without a 'raise') within the load_table_from_uris method, so if the BigQuery load job errors, the try-except makes it appear to succeed, and we can't then ascertain whether the job failed and raise an error accordingly. See the JIRA ticket comment [here](https://timeoutgroup.atlassian.net/browse/DSS-3231?focusedCommentId=263012) for local testing of this method (using a copy of the transfer module).
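The failure mode and the fix can be illustrated with a minimal sketch (not the library's actual code; the callable parameter is an assumption standing in for the real load-job call): a broad except without a re-raise swallows the error, so returning an explicit success flag gives the caller something to check and raise on.

```python
def gs_to_bq(run_load_job):
    # Before the fix: exceptions from the load job were caught and logged
    # without re-raising, so a failed job looked identical to a success.
    # After the fix: return an explicit success flag the caller can act on.
    try:
        run_load_job()
    except Exception as exc:  # broad except mirrors the existing try-except
        print(f"BigQuery load job failed: {exc}")
        return False
    return True

def failing_job():
    raise RuntimeError("load job error")

ok = gs_to_bq(lambda: None)   # successful load -> True
failed = gs_to_bq(failing_job)  # swallowed error -> False, no longer silent
```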
DSS-3231 Add schema_update_options to gs_to_bq transfer method (#30) This lets us automatically add columns when new columns appear in the data. For example, if we update a schema file to include a new field, the column can now be added automatically (as NULLABLE) without having to add it manually in dev, staging and prod separately. See tests/checks here:
1. This uses the da-cms-data-ingestion pipeline to test the new transfer config option: [LINK](https://timeoutgroup.atlassian.net/browse/DSS-3231?focusedCommentId=262940)
2. I've added the new option to the gs_to_bq test and checked that the tests still pass: <img width="971" height="439" alt="image" src="https://github.com/user-attachments/assets/2391f5fd-1d32-4b16-a091-ba720d6fd271" /> <img width="1914" height="912" alt="image" src="https://github.com/user-attachments/assets/6fdbe2a0-d4b3-4573-ab05-8d5c6c560883" />
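For reference, this behaviour maps onto the `schema_update_options` field of `LoadJobConfig` in the official google-cloud-bigquery client. A minimal config fragment (not the library's actual code; the source format and write disposition here are assumptions about a typical append load):

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    # Schema updates are only allowed on appends (or partition truncates).
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # New columns present in the data are added to the table as NULLABLE.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
```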
DSS-3163 Updated convert_json_array_to_ndjson method (#28) I have updated the convert_json_array_to_ndjson method to handle different JSON formats, e.g. NDJSON, and have added tests too. Here are the logs from a successful run using this to-data-library in the CMS ingestion pipeline: <img width="1176" height="719" alt="image" src="https://github.com/user-attachments/assets/d7890ec7-d54f-4f26-8e95-19ddb7b596e7" />
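A minimal sketch of the kind of format handling described (an assumption about the approach, not the library's actual implementation): accept either a JSON array or input that is already newline-delimited, and emit NDJSON either way.

```python
import json

def convert_json_array_to_ndjson(text):
    # Accept either a JSON array ("[...]") or input that is already
    # newline-delimited JSON, and return NDJSON: one object per line.
    stripped = text.strip()
    if stripped.startswith("["):
        records = json.loads(stripped)
    else:
        # Already NDJSON: parse each line to validate, then re-emit.
        records = [json.loads(line) for line in stripped.splitlines() if line.strip()]
    return "\n".join(json.dumps(record) for record in records)

array_input = '[{"id": 1}, {"id": 2}]'
ndjson_input = '{"id": 1}\n{"id": 2}'
# Both inputs normalise to the same NDJSON output.
```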
DSS-3165 (#25) For this ticket: https://timeoutgroup.atlassian.net/browse/DSS-3165?atlOrigin=eyJpIjoiNzc1OGI1OWJiODljNDA1YzlkMGVkMjlkMDRlMGQyMzgiLCJwIjoiaiJ9 I looked into updating to-data-library to include a process for uploading files from S3 to BigQuery; however, I found the method already exists, as well as a method to transfer files from S3 to GCS. I tested these methods in an existing splash da and found that the way partitioning was defined was not compatible with our code, so I have made changes to correct this and keep the code similar to the gs_to_bq method (which I have used before). I have also added some schema validation and tests. I would be interested to know whether you think we need more schema validation, and whether some of the tests are too repetitive. Also, do you think the schema structure is sufficient? I have tried to look at the data for the ads reporting, but I'm not sure I'm looking in the right place.
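On defining partitioning for these loads: BigQuery addresses a single day of a date-partitioned table with a `table$YYYYMMDD` partition decorator, so a load targeting that decorator writes only into that partition. A small illustrative helper (the function and table names are hypothetical, not the library's API):

```python
from datetime import date

def partition_decorator(table_id, partition_date):
    # Build a BigQuery partition decorator, e.g. "dataset.table$20250301".
    # Loading with this as the destination writes only that day's partition.
    return f"{table_id}${partition_date:%Y%m%d}"

target = partition_decorator("my_dataset.ads_reporting", date(2025, 3, 1))
```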