Skip to content

Conversation

ChrisBAshton
Copy link
Contributor

@ChrisBAshton ChrisBAshton commented Sep 4, 2025

Merging PR blocked on:

  • Not sending 'headers' property when publishing news articles to Publishing API
  • WHIT-2498 Sending extra 'details' (change history, political, etc)
  • WHIT-2386 Associations (mostly organisations) being added to StandardEdition
  • WHIT-2384 Deciding how we override lead image
  • WHIT-2484 Press Releases being a valid StandardEdition format

Running the import blocked on:

  • WHIT-2519 Putting out comms before retiring Content Publisher

Testing

I've successfully tested importing ALL articles locally and on integration, including route overwriting ✅

Example doc on Content Publisher: https://content-publisher.integration.publishing.service.gov.uk/documents/36d03d5e-eac6-4c18-9d29-f02f6bbf6cc1:en

Content Publisher Whitehall

Differences:

  • No organisations yet (pending WHIT-2386)
  • No visible history mode yet (pending WHIT-2498)
  • 'Flattened' document history
Live article (via Content Publisher) Integration article (via Whitehall)

Differences:

  • No organisations yet (pending WHIT-2386)
  • History mode SHOULD now be working (since WHIT-2498)
  • No lead image (pending WHIT-2384)

Changenote history now works (as of WHIT-2498):

As does publishing new major versions:

What

Sister PR to alphagov/content-publisher#3311.

This PR takes the output from the above (the Content Publisher export) and imports those news stories and press releases into Whitehall. It carries over associations, images and attachments, as well as a condensed document history, but we haven't bothered to try to replicate each individual edition - only the published editions.

Example rake task to import a single news article:

rake import:news_article["../content-publisher/tmp/secretary-of-state-for-work-and-pensions-attends-g7-labour-and-employment-ministerial-meeting-in-germany.json"]

How to run

Importing a single news article via its JSON:

$ kubectl get pods
# returns list of pods, including
# whitehall-admin-c4c7c957c-9q966

$ kubectl cp ~/Downloads/sample-article.json whitehall-admin-c4c7c957c-9q966:/tmp
# copies the file

$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
# runs the rake task

Importing all of the news articles:

$ kubectl get pods
# returns list of Whitehall application pods (such as whitehall-admin-c4c7c957c-9q966)
# and Whitehall worker pods (such as whitehall-admin-worker-777ff7f7b7-lhfkb)

# Recursively copy the local directory into the worker pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-777ff7f7b7-lhfkb:/tmp/news-articles
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-another-one-lhfkb:/tmp/news-articles

# Now do the same for one of the application pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-c4c7c957c-9q966:/tmp/news-articles

# Finally, run the rake task on the application pod you uploaded to.
# (Best do this directly on the pod, in case of time-outs)
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_articles_in_directory["/tmp/news-articles"]
# runs the rake task

Explanation:

  1. We need to upload the JSON files to the application pod so that the workers get enqueued in the first place. (The results of Dir.glob("#{args[:dir_path]}/**/*.json").each is what powers the enqueuing - so we need to ensure all the files are there to be globbed!)
  2. We need to upload the JSON files to ALL the worker pods so that every worker has access to the given JSON file. (Workers will just look locally for /tmp/news-articles/<file>.json, so it needs to exist locally).
  3. Note that you will need to re-upload all of the JSON files to all of the pods as above, every time there is a deployment to Whitehall (which tears down the pods and rebuilds them)

Sense check that it all imported correctly:

Document.where(document_type: "StandardEdition").map(&:live_edition).select { |ed| ed && ["press_release", "news_story"].include?(ed.configurable_document_type)}.count
=> 2979
^ there should be 2979, the same as the number of JSON files in the local export

expected_slugs = Dir.glob("/tmp/news-articles/**/*.json").map { |filepath| filepath.split("/").last.sub(".json", "") }
=> 
["1-3-billion-investment-to-deliver-homes-infrastructure-and-jobs",
...
actual_slugs = Document.where(document_type: "StandardEdition").pluck(:slug)
=> 
["10-downing-street",
...

expected_slugs - actual_slugs
=> ["helicopter-services-deal-raises-competition-concerns-1", "non-executive-director-appointed-to-gov-facility-services-limited-1"]

^ This last one should be an empty array. If for whatever reason there are some stragglers, they can be imported by running the bulk task again, or importing individually (rake import:news_article["/tmp/news-articles/non-executive-director-appointed-to-gov-facility-services-limited-1.json"]).

Finally, we can destroy all StandardEdition news articles (Edition.where(configurable_document_type: ["news_article", "press_release"]).map(&:document).map(&:destroy)) if we want to test re-import everything (e.g. after making a tweak).

Post merge

Post merge, we should bulk import the news articles as above, and then revert this PR (none of the changes are required to remain once the articles have been migrated).

Why

Migrating content out of Content Publisher will allow us to retire that long-deprecated app 🎉
It will also be a good first test case for our new config-driven news articles format.

In terms of the comprehensiveness of transfer of content, we're striking a balance of replicating just the published news articles, as best as we can, without worrying about replicating the entire document history. We're also not bothering to transfer drafts nor deleted/'gone' content. There is no draft newer than early 2024 and the nature of news articles is such that it's extremely unlikely an update will need to be applied. Publishers will be able to create new drafts on the migrated content once it's in Whitehall. Similarly, a 'gone' news article is extremely unlikely to need to be brought back, and publishers can always replicate the 'gone' news article manually if needed.

JIRA: https://gov-uk.atlassian.net/browse/WHIT-2440


⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

This application is owned by the Whitehall Experience team. Please let us know in #govuk-whitehall-experience-tech when you raise any PRs.

Follow these steps if you are doing a Rails upgrade.

@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 12 times, most recently from 2ce2775 to 05b647c Compare September 9, 2025 12:58
@ChrisBAshton ChrisBAshton changed the base branch from main to simplify-asset-variants-checkl September 9, 2025 13:00
@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 4 times, most recently from e20256e to 0140587 Compare September 9, 2025 16:47
@ChrisBAshton ChrisBAshton force-pushed the simplify-asset-variants-checkl branch from 222241b to 8f62880 Compare September 9, 2025 16:49
@ChrisBAshton ChrisBAshton force-pushed the simplify-asset-variants-checkl branch from 8f62880 to 56e8537 Compare September 10, 2025 05:26
@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 2 times, most recently from 4c2b07e to 5d67594 Compare September 10, 2025 08:00
@ChrisBAshton ChrisBAshton changed the base branch from simplify-asset-variants-checkl to main September 10, 2025 17:04
@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 4 times, most recently from a6e2441 to abe1162 Compare September 12, 2025 13:15
@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 7 times, most recently from 7fb3ee9 to 4d0355c Compare September 16, 2025 15:55
@ChrisBAshton ChrisBAshton marked this pull request as ready for review September 16, 2025 15:55
@ChrisBAshton ChrisBAshton force-pushed the import-news-articles branch 2 times, most recently from 409dbaf to 053ee54 Compare September 16, 2025 16:20
@ChrisBAshton ChrisBAshton changed the base branch from main to no-headers-news-articles September 16, 2025 16:24
Base automatically changed from no-headers-news-articles to main September 17, 2025 05:54
This class will be called by a worker in a later commit, for
doing the actual import. This first commit just sets up the
basics: slug, title, body, state, timestamps etc. In subsequent
commits we'll tackle the trickier aspects that need importing.

Note that I originally started using Timecop to overwrite the
system time at the point of import, so that documents were being
imported 'at the time' they were created. But there was a major
problem with this: Timecop pollutes time globally and is only
intended to be used in test suites. Consequences:

1. Running the bulk import (batching lots of Sidekiq workers)
   didn't work as expected, as several time overrides would happen
   in parallel and thus documents would be given unpredictable
   created_at dates.
2. During the bulk import with all the time overwriting going on,
   there was every chance that some unrelated live production
   activity could be affected and its timestamps effectively
   corrupted.

Instead, I now import the docs as-is and then updated the created_at
values wherever needed, after the fact.
The `government_id` exported from Content Publisher refers to
its 'content_id', but we need to map this to the 'id' in
Whitehall.

Also not every exported document has a government_id, so we
need to allow nil.
Every 'withdrawn' document must have an associated Unpublishing.
Content Publisher referred to Contacts via their Content ID.
Whitehall (currently) refers to Contacts via their Contact ID.
So let's do a dynamic find and replace here.

NB, I did come across at least one exported article where there
was an embedded contact that doesn't exist
(`[Contact: 313a7279-1fab-4aba-9808-9219d4165828]`) but the
contact is already silently dropped from the live page
(<https://www.gov.uk/government/news/antibiotic-resistant-strain-of-gonorrhoea-detected-in-london>)
so let's just drop the embed code altogether.
In the Content Publisher export, we have the full array of public
changenotes, but we can only associate an edition with one
changenote. We've decided not to recreate the full edition history
of imported documents, instead importing 'flattened' published
editions. This is the thorniest bit of that 'flattening': we need
to consolidate all of the changenote history into one sentence.
- Squashes all of a document's 'internal' history into one
  editorial remark on the edition.
- Adds support for `<br>` on the editorial remark component to
  allow us to visually break up the single editorial remark more
  clearly.
Content Publisher only produces two variants of images: 960 and
300 sized. Whitehall expects (and enforces) several other variants:
712, 630, 465 and 216. If said variants are missing, then the
edition is considered invalid, and also the images screen in the
UI is stuck showing a 'processing' message. These behaviours are
largely triggered by the various `all_asset_variants_uploaded?`
method definitions sprinkled throughout Whitehall.

My first instinct was to soften the check to only class s960 and
s300 variants as 'required': #10646
But this would be very specific to just this use case, and in theory
I'd want to be dynamically detecting which variants are required
depending on the context.

I then looked at dropping the check altogether: the vast majority
of the time, if the 'original' asset has been uploaded, so has its
variants, so it seemed a low risk change worth making to simplify
the code: #10647

But on reflection, these other variant sizes are potentially needed.
For example, when featuring a news article on an org homepage,
the s465 variant seems to be needed. So, in the unlikely event
that a publisher wants to feature a >2 year old news article on
their homepage, they wouldn't really be able to unless that
variant existed.

That left us with two remaining options:

1. Reupload the original image through Whitehall's Carrierwave
   pipeline, to have it generate all variants afresh
2. Reuse a larger variant for a smaller one (e.g. the s300 Content
   Publisher variant can be used in place of the s216 sized variant,
   given browsers typically take whatever image and just scale it
   down as appropriate.

Option 1 is more complicated to implement, and raises several
other questions:

- How do we publish the new images, given we're importing documents
  'already published' (rather than as drafts which we later publish)?
- What do we do with the old assets? Delete? Redirect? Leave alone?

Option 2 therefore feels the pragmatic way forward, given we
already have a working import solution for file attachments that
we can apply to images too.
Otherwise we get a validation error where a supporting_organisation
duplicates a lead_organisation, as in
around-900-000-could-receive-increased-housing-benefit-from-april.json
See previous commit - we'd get a "method `-` doesn't exist" if supporting
organisations is nil. So need to account for that.
This will be called by rake tasks written in the next commit(s).
It calls our `DocumentImporter` class and also handles the
comms with Publishing API.
This imports a single news article, synchronously, via a pre
uploaded JSON file. Intended use is as follows:

```
$ kubectl get pods
returns list of pods, including
whitehall-admin-c4c7c957c-9q966

$ kubectl cp ~/Downloads/sample-article.json whitehall-admin-c4c7c957c-9q966:/tmp
copies the file

$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_article["/tmp/sample-article.json"]
runs the rake task
```
We anticipate running this as follows:

```
$ kubectl get pods
returns list of Whitehall application pods (such as whitehall-admin-c4c7c957c-9q966)
and Whitehall worker pods (such as whitehall-admin-worker-777ff7f7b7-lhfkb)

Recursively copy the local directory into the worker pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-777ff7f7b7-lhfkb:/tmp/news-articles
$ kubectl cp /path/to/your/news-articles whitehall-admin-worker-another-one-lhfkb:/tmp/news-articles

Now do the same for one of the application pods
$ kubectl cp /path/to/your/news-articles whitehall-admin-c4c7c957c-9q966:/tmp/news-articles

Finally, run the rake task on the application pod you uploaded to.
(Best do this directly on the pod, in case of time-outs)
$ kubectl exec whitehall-admin-c4c7c957c-9q966 -- rake import:news_articles_in_directory["/tmp/news-articles"]
runs the rake task
```
Define press_release.json. Only 'news story' and 'press release'
content types exist in Content Publisher - so we need to support
both of these document types in the import. They're effectively
identical content types, so I've copied and pasted the same
config file contents as news_story.json, with the relevant bits
tweaked.

This will almost certainly need rebasing later!
@ChrisBAshton ChrisBAshton marked this pull request as draft September 18, 2025 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant