Import news articles from Content Publisher (WHIT-2440) #10635
Draft · ChrisBAshton wants to merge 15 commits into main from import-news-articles · +1,114 −8
Conversation
This class will be called by a worker in a later commit, for doing the actual import. This first commit just sets up the basics: slug, title, body, state, timestamps etc. In subsequent commits we'll tackle the trickier aspects that need importing.

Note that I originally started using Timecop to overwrite the system time at the point of import, so that documents were imported 'at the time' they were created. But there was a major problem with this: Timecop pollutes time globally and is only intended to be used in test suites. Consequences:

1. Running the bulk import (batching lots of Sidekiq workers) didn't work as expected, as several time overrides would happen in parallel and documents would thus be given unpredictable created_at dates.
2. During the bulk import, with all the time overwriting going on, there was every chance that some unrelated live production activity could be affected and its timestamps effectively corrupted.

Instead, I now import the docs as-is and then update the created_at values wherever needed, after the fact.
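For illustration, the after-the-fact fix can be as simple as writing the timestamps straight to the database once the import has run (a sketch; `export` and the importer's interface are assumptions, not the real code):

```
# Sketch only: import as normal, then backdate. update_columns writes
# directly to the database, so Rails won't overwrite the timestamps.
edition = DocumentImporter.new(export).import!
edition.update_columns(
  created_at: export["created_at"],
  updated_at: export["updated_at"],
)
```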
The `government_id` exported from Content Publisher refers to its 'content_id', but we need to map this to the 'id' in Whitehall. Also, not every exported document has a `government_id`, so we need to allow nil.
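Something along these lines (a sketch, assuming the shape of the export hash; Whitehall's `Government` model does store a `content_id`):

```
# Map the export's government_id (a content_id) to Whitehall's own id,
# allowing nil for documents that have no government.
government_id =
  if export["government_id"]
    Government.find_by!(content_id: export["government_id"]).id
  end
```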
Every 'withdrawn' document must have an associated Unpublishing.
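A minimal sketch of satisfying that invariant during import (attribute and key names are assumptions; Whitehall's real `Unpublishing` model may want more than this):

```
# Sketch: every withdrawn edition gets an Unpublishing recording
# the withdrawal and its public explanation.
if export["state"] == "withdrawn"
  Unpublishing.create!(
    edition: edition,
    unpublishing_reason_id: UnpublishingReason::Withdrawn.id,
    explanation: export["withdrawal_explanation"], # assumed key
  )
end
```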
Content Publisher referred to Contacts via their Content ID. Whitehall (currently) refers to Contacts via their Contact ID. So let's do a dynamic find and replace here. NB, I did come across at least one exported article where there was an embedded contact that doesn't exist (`[Contact: 313a7279-1fab-4aba-9808-9219d4165828]`) but the contact is already silently dropped from the live page (<https://www.gov.uk/government/news/antibiotic-resistant-strain-of-gonorrhoea-detected-in-london>) so let's just drop the embed code altogether.
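The find-and-replace could look something like this (a sketch; the embed format is taken from the example above, the rest is assumed):

```
# Replace [Contact: <content_id>] embeds with [Contact: <whitehall_id>],
# dropping any embed whose contact no longer exists.
body = body.gsub(/\[Contact:\s*([0-9a-f\-]+)\]/) do
  contact = Contact.find_by(content_id: Regexp.last_match(1))
  contact ? "[Contact: #{contact.id}]" : ""
end
```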
In the Content Publisher export, we have the full array of public changenotes, but we can only associate an edition with one changenote. We've decided not to recreate the full edition history of imported documents, instead importing 'flattened' published editions. This is the thorniest bit of that 'flattening': we need to consolidate all of the changenote history into one sentence.
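A sketch of that consolidation, assuming the export carries an array of changenote hashes with `note` and `public_timestamp` keys:

```
# Flatten the full changenote history into a single note, oldest first.
def consolidated_change_note(change_history)
  change_history
    .sort_by { |cn| cn["public_timestamp"] }
    .map { |cn| "#{Date.parse(cn['public_timestamp']).strftime('%-d %B %Y')}: #{cn['note']}" }
    .join("; ")
end
```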
- Squashes all of a document's 'internal' history into one editorial remark on the edition.
- Adds support for `<br>` on the editorial remark component, to allow us to visually break up the single editorial remark more clearly.
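For illustration, the squashing might look like this (a sketch; `internal_history` and `robot_user` are assumed names, though `EditorialRemark` is Whitehall's model for remarks):

```
# Sketch: one editorial remark containing the whole internal history,
# with <br> separators so each entry renders on its own line.
body = internal_history
  .map { |entry| "#{entry['created_at']}: #{entry['body']}" }
  .join("<br>")
EditorialRemark.create!(edition: edition, author: robot_user, body: body)
```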
Content Publisher only produces two variants of images: 960 and 300 sized. Whitehall expects (and enforces) several other variants: 712, 630, 465 and 216. If said variants are missing, the edition is considered invalid, and the images screen in the UI is stuck showing a 'processing' message. These behaviours are largely triggered by the various `all_asset_variants_uploaded?` method definitions sprinkled throughout Whitehall.

My first instinct was to soften the check to class only the s960 and s300 variants as 'required': #10646. But this would be very specific to just this use case, and in theory I'd want to be dynamically detecting which variants are required depending on the context.

I then looked at dropping the check altogether: the vast majority of the time, if the 'original' asset has been uploaded, so have its variants, so it seemed a low-risk change worth making to simplify the code: #10647. But on reflection, these other variant sizes are potentially needed. For example, when featuring a news article on an org homepage, the s465 variant seems to be needed. So, in the unlikely event that a publisher wants to feature a >2 year old news article on their homepage, they wouldn't really be able to unless that variant existed.

That left us with two remaining options (see the sketch of option 2 below):

1. Reupload the original image through Whitehall's Carrierwave pipeline, to have it generate all variants afresh.
2. Reuse a larger variant for a smaller one (e.g. the s300 Content Publisher variant can be used in place of the s216 variant, given browsers typically take whatever image they're given and scale it down as appropriate).

Option 1 is more complicated to implement, and raises several other questions:

- How do we publish the new images, given we're importing documents 'already published' (rather than as drafts which we later publish)?
- What do we do with the old assets? Delete? Redirect? Leave alone?

Option 2 therefore feels the pragmatic way forward, given we already have a working import solution for file attachments that we can apply to images too.
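A sketch of option 2's mapping, picking the smallest Content Publisher variant that is at least as large as each required Whitehall variant (the widths come from the commit message; the helper itself is hypothetical):

```
REQUIRED_WIDTHS  = [960, 712, 630, 465, 300, 216].freeze
AVAILABLE_WIDTHS = [960, 300].freeze # all Content Publisher produces

# Reuse the smallest variant that's big enough; browsers scale down fine.
def source_width_for(required_width)
  AVAILABLE_WIDTHS.select { |w| w >= required_width }.min || AVAILABLE_WIDTHS.max
end

REQUIRED_WIDTHS.to_h { |w| [w, source_width_for(w)] }
# => {960=>960, 712=>960, 630=>960, 465=>960, 300=>300, 216=>300}
```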
Otherwise we get a validation error where a supporting_organisation duplicates a lead_organisation, as in `around-900-000-could-receive-increased-housing-benefit-from-april.json`.
See previous commit - we'd get a "method `-` doesn't exist" error if supporting organisations is nil, so we need to account for that.
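The fix amounts to a nil guard before the subtraction (a sketch with assumed key and variable names):

```
# Deduplicate against lead orgs, tolerating a missing key in the export.
supporting_org_ids = (export["supporting_organisations"] || []) - lead_org_ids
```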
This will be called by rake tasks written in the next commit(s). It calls our `DocumentImporter` class and also handles the comms with Publishing API.
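The shape is roughly this (a sketch; only `DocumentImporter` is named in this PR - the worker's name and the exact Publishing API calls are assumptions):

```
# Sketch of the worker: parse one exported JSON file, import it,
# then tell Publishing API about the result.
class NewsArticleImportWorker
  include Sidekiq::Job

  def perform(json_file_path)
    export = JSON.parse(File.read(json_file_path))
    DocumentImporter.new(export).import!
    # ...followed by the Publishing API comms mentioned above
  end
end
```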
This imports a single news article, synchronously, via a pre-uploaded JSON file. Intended use is as follows:

```
# Returns the list of pods, including whitehall-admin-c4c7c957c-9q966
$ kubectl get pods

# Copy the file onto the pod
$ kubectl cp ~/Downloads/sample-article.json whitehall-admin-c4c7c957c-9q966:/tmp

# Run the rake task
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
```
We anticipate running this as follows:

```
# Returns the list of Whitehall application pods (such as
# whitehall-admin-c4c7c957c-9q966) and Whitehall worker pods
# (such as whitehall-admin-worker-777ff7f7b7-lhfkb)
$ kubectl get pods

# Recursively copy the local directory into the worker pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-777ff7f7b7-lhfkb:/tmp/news-articles
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-another-one-lhfkb:/tmp/news-articles

# Now do the same for one of the application pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-c4c7c957c-9q966:/tmp/news-articles

# Finally, run the rake task on the application pod you uploaded to.
# (Best done directly on the pod, in case of time-outs.)
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_articles_in_directory["/tmp/news-articles"]
```
Define `press_release.json`. Only 'news story' and 'press release' content types exist in Content Publisher, so we need to support both of these document types in the import. They're effectively identical content types, so I've copied and pasted the same config file contents as `news_story.json`, with the relevant bits tweaked. This will almost certainly need rebasing later!
Merging PR blocked on:
Running the import blocked on:
Testing
I've successfully tested importing ALL articles locally and on integration, including route overwriting ✅
Example doc on Content Publisher: https://content-publisher.integration.publishing.service.gov.uk/documents/36d03d5e-eac6-4c18-9d29-f02f6bbf6cc1:en
Differences:
Changenote history now works (as of WHIT-2498):
As does publishing new major versions:
What
Sister PR to alphagov/content-publisher#3311.
This PR takes the output from the above (the Content Publisher export) and imports those news stories and press releases into Whitehall. It carries over associations, images and attachments, as well as a condensed document history, but we haven't bothered to try to replicate each individual edition - only the published editions.
Example rake task to import a single news article:
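```
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
```

(Reproduced from the commit message above; the pod name will differ.)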
How to run
Importing a single news article via its JSON: see the `import:news_article` steps in the commit message above.
Importing all of the news articles: see the `import:news_articles_in_directory` steps in the commit message above.
Explanation:

- `Dir.glob("#{args[:dir_path]}/**/*.json").each` is what powers the enqueuing - so we need to ensure all the files are there to be globbed!
- Each worker reads its article from `/tmp/news-articles/<file>.json`, so the file needs to exist locally on the pod.

Sense check that it all imported correctly:
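The exact queries aren't shown here, but a sense check along these lines would do (illustrative only; deriving slugs from filenames is an assumption):

```
# Illustrative sense check: compare what we exported with what imported.
exported_slugs = Dir.glob("/tmp/news-articles/**/*.json").map { |f| File.basename(f, ".json") }
imported_count = Edition.where(configurable_document_type: %w[news_article press_release]).count
puts "#{imported_count} imported of #{exported_slugs.count} exported"

# Stragglers: exported files with no matching Whitehall document
exported_slugs - Document.where(slug: exported_slugs).pluck(:slug)
```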
^ This last one should be an empty array. If for whatever reason there are some stragglers, they can be imported by running the bulk task again, or importing individually (`rake import:news_article["/tmp/news-articles/non-executive-director-appointed-to-gov-facility-services-limited-1.json"]`).

Finally, we can destroy all StandardEdition news articles (`Edition.where(configurable_document_type: ["news_article", "press_release"]).map(&:document).map(&:destroy)`) if we want to re-import everything for testing (e.g. after making a tweak).

Post merge
Post merge, we should bulk import the news articles as above, and then revert this PR (none of the changes are required to remain once the articles have been migrated).
Why
Migrating content out of Content Publisher will allow us to retire that long-deprecated app 🎉
It will also be a good first test case for our new config-driven news articles format.
In terms of the comprehensiveness of the transfer, we're striking a balance: replicating just the published news articles, as best we can, without worrying about replicating the entire document history. We're also not transferring drafts or deleted/'gone' content. There is no draft newer than early 2024, and the nature of news articles is such that it's extremely unlikely an update will need to be applied; publishers will be able to create new drafts on the migrated content once it's in Whitehall. Similarly, a 'gone' news article is extremely unlikely to need to be brought back, and publishers can always recreate it manually if needed.
JIRA: https://gov-uk.atlassian.net/browse/WHIT-2440