Tags: timeoutdigital/to-data-library

v2.0.0

This commit was created on GitHub.com and signed with GitHub’s verified signature.
DSS-3346 Rewrite s3_to_gs and gs_to_bq to use new GCS file structure (#33)

See the [JIRA
ticket](https://timeoutgroup.atlassian.net/jira/software/c/projects/DSS/boards/72?selectedIssue=DSS-3346)
for more information on what is required.

This updates the to-data-library to provide the methods s3_to_gs and
gs_to_bq, which enforce the use of our new GCS file structure. I have
also updated the bq method load_table_from_dataframe to handle the case
where a file has multiple partition dates and needs to be separated into
individual loads (as is the case with Ozone, which has 7 days' data per
file).
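The multi-date split can be sketched as follows — a minimal illustration only, assuming each row carries a `partition_date` field; the helper name and row shape are hypothetical, not the library's real signatures:

```python
from collections import defaultdict

def split_by_partition_date(rows, date_key="partition_date"):
    # Group rows from a single file into one batch per partition date,
    # so each batch can become its own BigQuery load job.
    batches = defaultdict(list)
    for row in rows:
        batches[row[date_key]].append(row)
    # One load per distinct date, in date order.
    return sorted(batches.items())

rows = [
    {"partition_date": "2024-01-01", "value": 1},
    {"partition_date": "2024-01-02", "value": 2},
    {"partition_date": "2024-01-01", "value": 3},
]
batches = split_by_partition_date(rows)
print(len(batches))  # 2: one load per distinct date
```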

I have tested this as part of
[DSS-3169](https://timeoutgroup.atlassian.net/browse/DSS-3169?focusedCommentId=265783).

Unit tests still need to be added and will be done as part of a future
PR before the JIRA ticket is closed.

The PR is quite large, so I'm happy to discuss with the reviewer on a
call if preferred.



Here's a screenshot of the tests passing:
<img width="1908" height="64" alt="image"
src="https://github.com/user-attachments/assets/3652a8f3-b3ae-49f5-928a-aa0cdae42199"
/>

Note: we could do with more tests, especially around error handling, but
I'm reluctant to add any at the moment as the PR is already large.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

v1.0.20

DSS-3232: Add multifile functionality to s3_to_bq (#29)

Co-authored-by: JenHolmes608 <jennifer.holmes@timeout.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

v1.0.19

DSS-3231 Add success return to gs_to_bq transfer method (#31)

Without a success return, failures can be silent: there is a
try-except (without a `raise`) inside the load_table_from_uris method,
so if the BigQuery load job errors, the try-except makes the call
appear to succeed. We then can't ascertain whether the job failed and
raise an error accordingly.

See the JIRA ticket comment
[here](https://timeoutgroup.atlassian.net/browse/DSS-3231?focusedCommentId=263012)
for local testing of this method (using a copy of the transfer module).
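The failure mode can be illustrated with a small sketch — the names below are hypothetical, not the library's actual code:

```python
def run_with_status(load_job):
    # Before the fix, the except branch had no `raise` and no return
    # value, so a failed load job looked identical to a successful one.
    try:
        load_job()
        return True   # explicit success indicator for the caller
    except Exception as exc:
        print(f"load job failed: {exc}")
        return False  # the caller can now detect the failure

def failing_job():
    raise RuntimeError("BigQuery load error")

print(run_with_status(lambda: None))  # True
print(run_with_status(failing_job))   # False, instead of a silent success
```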

v1.0.18

DSS-3231 Add schema_update_options to gs_to_bq transfer method (#30)

This allows us to automatically add columns when new columns appear in
the data. For example, if we update a schema file to include a new
field, the column can now be added automatically (as NULLABLE) without
having to add columns manually in dev, staging and prod separately.
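In google-cloud-bigquery this behaviour corresponds to `SchemaUpdateOption.ALLOW_FIELD_ADDITION` on the load job config; what it does can be sketched in plain Python — the dict-based schema shape below is illustrative only, not BigQuery's real schema type:

```python
def add_new_fields_as_nullable(table_schema, incoming_schema):
    # Mimics ALLOW_FIELD_ADDITION: fields present in the incoming data
    # but missing from the table are appended as NULLABLE columns,
    # instead of failing the load job.
    known = {field["name"] for field in table_schema}
    merged = list(table_schema)
    for field in incoming_schema:
        if field["name"] not in known:
            merged.append({**field, "mode": "NULLABLE"})
    return merged

table = [{"name": "id", "type": "STRING", "mode": "REQUIRED"}]
incoming = table + [{"name": "created_at", "type": "TIMESTAMP"}]
print(add_new_fields_as_nullable(table, incoming))
```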

See tests/checks here:

1. This uses the da-cms-data-ingestion pipeline to test the new
transfer config option:
[link](https://timeoutgroup.atlassian.net/browse/DSS-3231?focusedCommentId=262940)

2. I've added the new option to the gs_to_bq test and checked that the
tests still pass:
<img width="971" height="439" alt="image"
src="https://github.com/user-attachments/assets/2391f5fd-1d32-4b16-a091-ba720d6fd271"
/>

<img width="1914" height="912" alt="image"
src="https://github.com/user-attachments/assets/6fdbe2a0-d4b3-4573-ab05-8d5c6c560883"
/>

v1.0.17

DSS-3163 Updated convert_json_array_to_ndjson method (#28)

I have updated the convert_json_array_to_ndjson method to handle
different JSON formats, i.e. NDJSON as well as JSON arrays.

I have added tests too.
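A minimal sketch of handling both input formats — this mirrors the idea, not the library's actual implementation:

```python
import json

def to_ndjson(text):
    # Accept either a JSON array of objects or text that is already
    # newline-delimited JSON, and return NDJSON in both cases.
    stripped = text.lstrip()
    if stripped.startswith("["):
        records = json.loads(stripped)
    else:
        records = [json.loads(line)
                   for line in stripped.splitlines() if line.strip()]
    return "\n".join(json.dumps(r) for r in records)

array_input = '[{"a": 1}, {"a": 2}]'
ndjson_input = '{"a": 1}\n{"a": 2}'
print(to_ndjson(array_input) == to_ndjson(ndjson_input))  # True
```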

Here are the logs from a successful run using this to-data-library in
the cms ingestion pipeline:
<img width="1176" height="719" alt="image"
src="https://github.com/user-attachments/assets/d7890ec7-d54f-4f26-8e95-19ddb7b596e7"
/>

v1.0.16

DSS-3163 (#27)

I have updated transfer.py and gs.py to handle the new default naming
of GCS buckets, which no longer begins with gs://.

I have also updated the tests.
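The change can be sketched with a small normalising helper — a hypothetical name for illustration; the real code lives in transfer.py and gs.py:

```python
def bucket_uri(bucket_name):
    # Accept a bucket name with or without the gs:// prefix, since the
    # default naming no longer includes it, and build a full URI.
    name = bucket_name.removeprefix("gs://")
    return f"gs://{name}"

print(bucket_uri("my-bucket"))       # gs://my-bucket
print(bucket_uri("gs://my-bucket"))  # gs://my-bucket
```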

v1.0.15

DSS-3165 include gcs step (#26)

I have added a step so that files are picked up from S3, loaded into
GCS and then loaded into BigQuery.

v1.0.14

DSS-3165 (#25)

For this ticket:
https://timeoutgroup.atlassian.net/browse/DSS-3165?atlOrigin=eyJpIjoiNzc1OGI1OWJiODljNDA1YzlkMGVkMjlkMDRlMGQyMzgiLCJwIjoiaiJ9
I have looked into updating to-data-library to include a process for
uploading files from S3 to BigQuery; however, when looking into the
library I found that the method already exists, as does a method to
transfer files from S3 to GCS.

I tested the methods in an existing splash da and found that the way
partitioning was defined was not compatible with our code, so I have
made changes to correct this and to keep the code similar to the
gs_to_bq method (which I have used before).

I have also added some schema validation and tests. I would be
interested to know whether you think we need more schema validation,
and whether some of the tests are too repetitive.
Also, do you think the schema structure is sufficient? I have tried to
look at the data for ads reporting but I'm not sure I'm looking in the
right place.

v1.0.13

DSS-2515 stream convert_json_array_to_ndjson (#21)

v1.0.12

DSS-2512 add impersonation (#18)

Add impersonation to the gs and transfer Clients.
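With google-auth, client impersonation typically looks like the following — a sketch under stated assumptions, not the library's exact wiring; the service account email and scope are placeholders:

```python
import google.auth
from google.auth import impersonated_credentials
from google.cloud import storage

# Obtain the caller's own credentials, then exchange them for
# short-lived credentials that act as the target service account.
source_credentials, project = google.auth.default()
target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal="pipeline-sa@example-project.iam.gserviceaccount.com",
    target_scopes=["https://www.googleapis.com/auth/devstorage.read_write"],
)

# The gs/transfer Clients can then be built on top of these credentials.
client = storage.Client(project=project, credentials=target_credentials)
```

This fragment needs real GCP credentials to run, so it is shown as configuration only.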