Concourse as an Orchestrator of Orchestrator (TFC) with high concurrent job count #8963

rorychatterton · 2024-05-25T06:26:17Z

rorychatterton
May 25, 2024

Context:
I'm exploring Concourse as an Orchestrator of Orchestrators for a set of workflows executed out of Terraform Cloud.

I'm working with a client that has a small number of job types (< 15) and a large number of instances (> 1000), with a dependency tree that pushes inputs/outputs through those workspaces (e.g. the templates might have a graph like the attached photo).

Issue:
One of the issues I've run into with other CD tools in this space (such as GitHub Actions) is that if you want each execution to be presented as its own job, you end up with a lot of idle agents very quickly. In a model where you pay per minute of execution time, you kind of end up DDoS'ing your wallet XD

Question:
Has anybody had much luck using Concourse in a fashion like this where you're effectively orchestrating many long running jobs across a third party system, waiting on data to come back in?

My logic was that since Concourse tasks are just containers without limits (I think?), you could bin-pack those jobs much more effectively, so that the resource overhead is significantly lower than with your typical SaaS orchestrator, while still keeping a first-class execution graph UI.

Is there a better way to do this?

Similarly, is there any guidance on the maximum number of jobs that should be presented in a graph? Would we end up killing the system if we had a graph with a thousand nodes in it?

Answered by marco-m-pix4d

marco-m-pix4d · 2024-05-30T10:11:33Z

marco-m-pix4d
May 30, 2024
Collaborator

Hello,

> Has anybody had much luck using Concourse in a fashion like this where you're effectively orchestrating many long running jobs across a third party system, waiting on data to come back in?

I don't have direct experience with this. What I can say is that a task (a Concourse job contains one or more tasks) has no notion of waiting for I/O. It looks to me like you would end up with N tasks, each busy-waiting, polling for this "data to come back in". This might work, but consider that each task is mapped to a container and, in my extensive experience, a Concourse worker becomes unstable above roughly 220 containers (at least with my workload).
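As a rough back-of-the-envelope check of that concern, the sketch below estimates how many workers would be needed if every waiting job held one container. The 220-container ceiling is only the figure observed for one workload, the 1000 instances come from the original question, and `workers_needed` and its `headroom` parameter are made up for illustration.

```python
import math


def workers_needed(concurrent_tasks: int,
                   containers_per_worker: int = 220,
                   headroom: float = 0.0) -> int:
    """Estimate the worker count when each task occupies one container.

    headroom is the fraction of each worker's capacity kept free
    (a hypothetical knob, e.g. to leave room for resource checks).
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable = math.floor(containers_per_worker * (1.0 - headroom))
    if usable < 1:
        raise ValueError("no usable capacity left per worker")
    return math.ceil(concurrent_tasks / usable)


# 1000 busy-waiting tasks against the observed ~220 container ceiling.
print(workers_needed(1000))
# Same load, keeping a quarter of each worker free.
print(workers_needed(1000, headroom=0.25))
```

This only counts containers; it says nothing about CPU, memory, or disk pressure on the workers, which may become the limit earlier.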

Another way to "wait for data to come in" is to write a Concourse resource. This might work, but again a Concourse resource is a container, which by default polls by being restarted on each poll interval, so it is even more expensive than busy waiting within a task.
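To make the cost of that model concrete, here is a minimal sketch of the logic a custom resource's `check` step might run on every poll. It assumes the usual resource contract (a JSON request with `source` and an optional `version`, and a JSON list of versions as the reply); `fetch_run_id` and the `run_id` version key are hypothetical stand-ins for whatever call asks the third-party system for its latest finished run.

```python
import json
from typing import Callable, Optional


def check(request: dict,
          fetch_run_id: Callable[[dict], Optional[str]]) -> list:
    """Return the versions to report for one `check` invocation.

    request mirrors the JSON a resource receives on stdin:
    {"source": {...}, "version": {...}} (version absent on the first check).
    fetch_run_id is a hypothetical callable returning the id of the latest
    finished run in the third-party system, or None if nothing has finished.
    """
    previous = request.get("version") or {}
    latest = fetch_run_id(request.get("source", {}))
    if latest is None:
        # Nothing new: re-emit the version we already know, if any.
        return [previous] if previous else []
    if previous.get("run_id") == latest:
        return [previous]
    # Known version first, then the newer one, in chronological order.
    return ([previous] if previous else []) + [{"run_id": latest}]


# A real script would read the request from stdin and write to stdout;
# a literal request is used here so the sketch runs on its own.
request = {"source": {"workspace": "ws-a"}, "version": {"run_id": "run-1"}}
print(json.dumps(check(request, lambda source: "run-2")))
```

Every poll pays for a container start just to run these few lines, which is the overhead described above.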

From my understanding of what you want to achieve, I would consider another approach, something like https://temporal.io/ (although I like the idea, I do not have first-hand experience).

> Similarly, is there any guidance as to the maximum number of jobs that should be presented in a graph. Would we end up killing the system if we had a graph with a thousand nodes in it?

To me, before asking whether it would kill the system, the real question is whether a human could navigate the graph. For this, it is possible to use job groups, see https://concourse-ci.org/pipelines.html#schema.group_config. An example of job groups in a pipeline is the pipeline of Concourse itself: https://ci.concourse-ci.org/teams/main/pipelines/concourse
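With a thousand jobs, the `groups` section would probably be generated rather than written by hand. The sketch below assumes each job has already been given a group label by some means; the job names, labels, and the `build_groups` helper are all hypothetical, and only the `name` and `jobs` keys of a group entry are used.

```python
import json
from collections import defaultdict


def build_groups(job_labels: dict) -> list:
    """Turn {job_name: group_label} into a list of group entries."""
    by_label = defaultdict(list)
    for job, label in job_labels.items():
        by_label[label].append(job)
    # Sort for stable output so regenerated pipelines diff cleanly.
    return [{"name": label, "jobs": sorted(jobs)}
            for label, jobs in sorted(by_label.items())]


# Hypothetical jobs and labels, for illustration only.
job_labels = {
    "apply-customer-b": "batch-1",
    "apply-customer-a": "batch-1",
    "apply-customer-c": "batch-2",
}
# JSON output, which YAML parsers generally accept as-is.
print(json.dumps({"groups": build_groups(job_labels)}, indent=2))
```

How jobs are assigned to labels is the real design decision; the generation step itself is trivial.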

2 replies

rorychatterton Jun 4, 2024
Author

Thanks Marco, I really appreciate the feedback!

> Another way to "wait for data to come in" is to write a Concourse resource. This might work, but again a Concourse resource is a container, which by default polls by being restarted on each poll interval, so it is even more expensive than busy waiting within a task.

OK, that makes sense. I was looking through the documentation to see if there was an alternative resource model that would respond to webhooks, events, or some similar mechanism, and came to a similar conclusion.

> From my understanding of what you want to achieve, I would consider another approach, something like https://temporal.io/ (although I like the idea, I do not have first-hand experience).

Good shout. Temporal and Airflow (via Cloud Composer, because the customer is a Google shop) were two alternatives that I had considered for similar reasons.

What attracted me to Concourse was the ability for operations to be able to quickly grok the impact of a dependency failure across a wide range of customers - something which the Concourse UI can help to visualise and provide a mechanism to dive in.

> To me, before the killing or not, the real question is whether a human could navigate the graph or not

100%. In reality, we would have used Groups as a way to be able to split up releases into visible release rings, based upon a yet-to-be-defined scope.

I had the meta-model mixed up in my mind. I was imagining that the dependency graph itself was the larger model, which would need to be calculated in the background, while the actually rendered UI/group could be a subset. My question was much more about the internal dependency calculation than about the actual screen.

Thank you for your detailed feedback!

marco-m-pix4d Jun 4, 2024
Collaborator

> Thank you for your detailed feedback!

You are welcome :-)

> I was looking through the documents to see if there was an alternative resource model that would respond to web-hooks,

That is possible; I had forgotten about it. You can make a resource poll less often (a longer check_every interval) and rely on webhooks instead, see https://concourse-ci.org/resources.html#schema.resource.webhook_token and click on the "down arrow" to expand the documentation.
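For illustration, this is roughly how the URL that an external system would call could be assembled. The path pattern is written from memory of the linked documentation and should be checked against it; the host, team, pipeline, resource, and token values are placeholders, and `webhook_check_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def webhook_check_url(base_url: str, team: str, pipeline: str,
                      resource: str, token: str) -> str:
    """Build the per-resource webhook URL that triggers a check."""
    path = "/api/v1/teams/{}/pipelines/{}/resources/{}/check/webhook".format(
        quote(team, safe=""),
        quote(pipeline, safe=""),
        quote(resource, safe=""),
    )
    # The token must match the resource's webhook_token in the pipeline.
    return base_url.rstrip("/") + path + "?" + urlencode({"webhook_token": token})


print(webhook_check_url("https://ci.example.com", "main", "tfc",
                        "workspace-a", "s3cret"))
```

Each resource gets its own URL, so the sender needs to know which one to call for a given event.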

A potential problem with webhooks is demultiplexing multiple incoming webhooks to the correct resource. You might or might not need something like that.
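One possible shape for such a demultiplexer is a small lookup from something in the incoming payload to a resource name. The sketch assumes the payload carries a `workspace_name` field, which is an assumption about the sender's format; the routing table, resource names, and `route_webhook` are hypothetical.

```python
from typing import Optional


def route_webhook(payload: dict, routes: dict) -> Optional[str]:
    """Return the resource name for a payload, or None if it is unknown.

    The "workspace_name" key is an assumed field of the incoming payload.
    """
    key = payload.get("workspace_name")
    if not isinstance(key, str):
        return None
    return routes.get(key)


# Hypothetical mapping from workspace to the resource that tracks it.
routes = {
    "customer-a-network": "tfc-customer-a-network",
    "customer-b-network": "tfc-customer-b-network",
}
print(route_webhook({"workspace_name": "customer-a-network"}, routes))
print(route_webhook({"workspace_name": "unknown"}, routes))
```

A small service in front of the CI system would use the returned name to decide which resource's webhook to call, and drop or log payloads that map to nothing.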
