Import news articles from Content Publisher (WHIT-2440) #10635
Draft · ChrisBAshton wants to merge 15 commits into main from import-news-articles · +1,114 −8
Conversation
This class will be called by a worker in a later commit, for doing the actual import. This first commit just sets up the basics: slug, title, body, state, timestamps etc. In subsequent commits we'll tackle the trickier aspects that need importing.

Note that I originally started using Timecop to overwrite the system time at the point of import, so that documents were imported 'at the time' they were created. But there was a major problem with this: Timecop pollutes time globally and is only intended to be used in test suites. Consequences:

1. Running the bulk import (batching lots of Sidekiq workers) didn't work as expected, as several time overrides would happen in parallel and documents would thus be given unpredictable created_at dates.
2. During the bulk import, with all the time overwriting going on, there was every chance that some unrelated live production activity could be affected and its timestamps effectively corrupted.

Instead, I now import the docs as-is and then update the created_at values wherever needed, after the fact.
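For illustration, the after-the-fact fix can be as simple as writing the timestamps straight to the database once the import has run (a sketch; `export` and the importer's interface are assumptions, not the real code):

```
# Sketch only: import as normal, then backdate. update_columns writes
# directly to the database, so Rails won't overwrite the timestamps.
edition = DocumentImporter.new(export).import!
edition.update_columns(
  created_at: export["created_at"],
  updated_at: export["updated_at"],
)
```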
The `government_id` exported from Content Publisher refers to its 'content_id', but we need to map this to the 'id' in Whitehall. Also, not every exported document has a `government_id`, so we need to allow nil.
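Something along these lines (a sketch, assuming the shape of the export hash; Whitehall's `Government` model does store a `content_id`):

```
# Map the export's government_id (a content_id) to Whitehall's own id,
# allowing nil for documents that have no government.
government_id =
  if export["government_id"]
    Government.find_by!(content_id: export["government_id"]).id
  end
```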
Every 'withdrawn' document must have an associated Unpublishing.
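A minimal sketch of satisfying that invariant during import (attribute and key names are assumptions; Whitehall's real `Unpublishing` model may want more than this):

```
# Sketch: every withdrawn edition gets an Unpublishing recording
# the withdrawal and its public explanation.
if export["state"] == "withdrawn"
  Unpublishing.create!(
    edition: edition,
    unpublishing_reason_id: UnpublishingReason::Withdrawn.id,
    explanation: export["withdrawal_explanation"], # assumed key
  )
end
```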
Content Publisher referred to Contacts via their Content ID. Whitehall (currently) refers to Contacts via their Contact ID. So let's do a dynamic find and replace here. NB, I did come across at least one exported article where there was an embedded contact that doesn't exist (`[Contact: 313a7279-1fab-4aba-9808-9219d4165828]`) but the contact is already silently dropped from the live page (<https://www.gov.uk/government/news/antibiotic-resistant-strain-of-gonorrhoea-detected-in-london>) so let's just drop the embed code altogether.
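The find-and-replace could look something like this (a sketch; the embed format is taken from the example above, the rest is assumed):

```
# Replace [Contact: <content_id>] embeds with [Contact: <whitehall_id>],
# dropping any embed whose contact no longer exists.
body = body.gsub(/\[Contact:\s*([0-9a-f\-]+)\]/) do
  contact = Contact.find_by(content_id: Regexp.last_match(1))
  contact ? "[Contact: #{contact.id}]" : ""
end
```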
In the Content Publisher export, we have the full array of public changenotes, but we can only associate an edition with one changenote. We've decided not to recreate the full edition history of imported documents, instead importing 'flattened' published editions. This is the thorniest bit of that 'flattening': we need to consolidate all of the changenote history into one sentence.
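A sketch of that consolidation, assuming the export carries an array of changenote hashes with `note` and `public_timestamp` keys:

```
# Flatten the full changenote history into a single note, oldest first.
def consolidated_change_note(change_history)
  change_history
    .sort_by { |cn| cn["public_timestamp"] }
    .map { |cn| "#{Date.parse(cn['public_timestamp']).strftime('%-d %B %Y')}: #{cn['note']}" }
    .join("; ")
end
```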
- Squashes all of a document's 'internal' history into one editorial remark on the edition.
- Adds support for `<br>` on the editorial remark component, to allow us to visually break up the single editorial remark more clearly.
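For illustration, the squashing might look like this (a sketch; `internal_history` and `robot_user` are assumed names, though `EditorialRemark` is Whitehall's model for remarks):

```
# Sketch: one editorial remark containing the whole internal history,
# with <br> separators so each entry renders on its own line.
body = internal_history
  .map { |entry| "#{entry['created_at']}: #{entry['body']}" }
  .join("<br>")
EditorialRemark.create!(edition: edition, author: robot_user, body: body)
```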
Content Publisher only produces two variants of images: 960 and 300 sized. Whitehall expects (and enforces) several other variants: 712, 630, 465 and 216. If said variants are missing, the edition is considered invalid, and the images screen in the UI is stuck showing a 'processing' message. These behaviours are largely triggered by the various `all_asset_variants_uploaded?` method definitions sprinkled throughout Whitehall.

My first instinct was to soften the check to class only the s960 and s300 variants as 'required': #10646. But this would be very specific to just this use case, and in theory I'd want to be dynamically detecting which variants are required depending on the context.

I then looked at dropping the check altogether: the vast majority of the time, if the 'original' asset has been uploaded, so have its variants, so it seemed a low-risk change worth making to simplify the code: #10647. But on reflection, these other variant sizes are potentially needed. For example, when featuring a news article on an org homepage, the s465 variant seems to be needed. So, in the unlikely event that a publisher wants to feature a >2 year old news article on their homepage, they wouldn't really be able to unless that variant existed.

That left us with two remaining options (see the sketch of option 2 below):

1. Reupload the original image through Whitehall's Carrierwave pipeline, to have it generate all variants afresh.
2. Reuse a larger variant for a smaller one (e.g. the s300 Content Publisher variant can be used in place of the s216 variant, given browsers typically take whatever image they're given and scale it down as appropriate).

Option 1 is more complicated to implement, and raises several other questions:

- How do we publish the new images, given we're importing documents 'already published' (rather than as drafts which we later publish)?
- What do we do with the old assets? Delete? Redirect? Leave alone?

Option 2 therefore feels the pragmatic way forward, given we already have a working import solution for file attachments that we can apply to images too.
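A sketch of option 2's mapping, picking the smallest Content Publisher variant that is at least as large as each required Whitehall variant (the widths come from the commit message; the helper itself is hypothetical):

```
REQUIRED_WIDTHS  = [960, 712, 630, 465, 300, 216].freeze
AVAILABLE_WIDTHS = [960, 300].freeze # all Content Publisher produces

# Reuse the smallest variant that's big enough; browsers scale down fine.
def source_width_for(required_width)
  AVAILABLE_WIDTHS.select { |w| w >= required_width }.min || AVAILABLE_WIDTHS.max
end

REQUIRED_WIDTHS.to_h { |w| [w, source_width_for(w)] }
# => {960=>960, 712=>960, 630=>960, 465=>960, 300=>300, 216=>300}
```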
Otherwise we get a validation error where a supporting_organisation duplicates a lead_organisation, as in `around-900-000-could-receive-increased-housing-benefit-from-april.json`.
See previous commit - we'd get a "method `-` doesn't exist" error if supporting organisations is nil, so we need to account for that.
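The fix amounts to a nil guard before the subtraction (a sketch with assumed key and variable names):

```
# Deduplicate against lead orgs, tolerating a missing key in the export.
supporting_org_ids = (export["supporting_organisations"] || []) - lead_org_ids
```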
This will be called by rake tasks written in the next commit(s). It calls our `DocumentImporter` class and also handles the comms with Publishing API.
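The shape is roughly this (a sketch; only `DocumentImporter` is named in this PR - the worker's name and the exact Publishing API calls are assumptions):

```
# Sketch of the worker: parse one exported JSON file, import it,
# then tell Publishing API about the result.
class NewsArticleImportWorker
  include Sidekiq::Job

  def perform(json_file_path)
    export = JSON.parse(File.read(json_file_path))
    DocumentImporter.new(export).import!
    # ...followed by the Publishing API comms mentioned above
  end
end
```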
This imports a single news article, synchronously, via a pre-uploaded JSON file. Intended use is as follows:

```
# Returns the list of pods, including whitehall-admin-c4c7c957c-9q966
$ kubectl get pods

# Copy the file onto the pod
$ kubectl cp ~/Downloads/sample-article.json whitehall-admin-c4c7c957c-9q966:/tmp

# Run the rake task
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
```
We anticipate running this as follows:

```
# Returns the list of Whitehall application pods (such as
# whitehall-admin-c4c7c957c-9q966) and Whitehall worker pods
# (such as whitehall-admin-worker-777ff7f7b7-lhfkb)
$ kubectl get pods

# Recursively copy the local directory into the worker pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-777ff7f7b7-lhfkb:/tmp/news-articles
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-another-one-lhfkb:/tmp/news-articles

# Now do the same for one of the application pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-c4c7c957c-9q966:/tmp/news-articles

# Finally, run the rake task on the application pod you uploaded to.
# (Best done directly on the pod, in case of time-outs.)
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_articles_in_directory["/tmp/news-articles"]
```
Define `press_release.json`. Only 'news story' and 'press release' content types exist in Content Publisher, so we need to support both of these document types in the import. They're effectively identical content types, so I've copied and pasted the same config file contents as `news_story.json`, with the relevant bits tweaked. This will almost certainly need rebasing later!
Merging PR blocked on:
Running the import blocked on:
Testing
I've successfully tested importing ALL articles locally and on integration, including route overwriting ✅
Example doc on Content Publisher: https://content-publisher.integration.publishing.service.gov.uk/documents/36d03d5e-eac6-4c18-9d29-f02f6bbf6cc1:en
Differences:
Changenote history now works (as of WHIT-2498):
As does publishing new major versions:
What
Sister PR to alphagov/content-publisher#3311.
This PR takes the output from the above (the Content Publisher export) and imports those news stories and press releases into Whitehall. It carries over associations, images and attachments, as well as a condensed document history, but we haven't bothered to try to replicate each individual edition - only the published editions.
Example rake task to import a single news article:
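```
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
```

(Reproduced from the commit message above; the pod name will differ.)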
How to run
Importing a single news article via its JSON: see the `import:news_article` steps in the commit message above.
Importing all of the news articles: see the `import:news_articles_in_directory` steps in the commit message above.
Explanation:

- `Dir.glob("#{args[:dir_path]}/**/*.json").each` is what powers the enqueuing - so we need to ensure all the files are there to be globbed!
- Each worker reads its article from `/tmp/news-articles/<file>.json`, so the file needs to exist locally on the pod.

Sense check that it all imported correctly:
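The exact queries aren't shown here, but a sense check along these lines would do (illustrative only; deriving slugs from filenames is an assumption):

```
# Illustrative sense check: compare what we exported with what imported.
exported_slugs = Dir.glob("/tmp/news-articles/**/*.json").map { |f| File.basename(f, ".json") }
imported_count = Edition.where(configurable_document_type: %w[news_article press_release]).count
puts "#{imported_count} imported of #{exported_slugs.count} exported"

# Stragglers: exported files with no matching Whitehall document
exported_slugs - Document.where(slug: exported_slugs).pluck(:slug)
```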
^ This last one should be an empty array. If for whatever reason there are some stragglers, they can be imported by running the bulk task again, or importing individually (`rake import:news_article["/tmp/news-articles/non-executive-director-appointed-to-gov-facility-services-limited-1.json"]`).

Finally, we can destroy all StandardEdition news articles (`Edition.where(configurable_document_type: ["news_article", "press_release"]).map(&:document).map(&:destroy)`) if we want to re-import everything for testing (e.g. after making a tweak).

Post merge
Post merge, we should bulk import the news articles as above, and then revert this PR (none of the changes are required to remain once the articles have been migrated).
Why
Migrating content out of Content Publisher will allow us to retire that long-deprecated app 🎉
It will also be a good first test case for our new config-driven news articles format.
In terms of the comprehensiveness of the transfer, we're striking a balance: replicating just the published news articles, as best we can, without worrying about replicating the entire document history. We're also not transferring drafts or deleted/'gone' content. There is no draft newer than early 2024, and the nature of news articles is such that it's extremely unlikely an update will need to be applied; publishers will be able to create new drafts on the migrated content once it's in Whitehall. Similarly, a 'gone' news article is extremely unlikely to need to be brought back, and publishers can always recreate it manually if needed.
JIRA: https://gov-uk.atlassian.net/browse/WHIT-2440