I'd be curious how they handle false positives, e.g. tasks that appear stuck (due to GC pauses, I/O stalls, etc.) vs. truly dead ones. I have seen overzealous cleanup do more damage than letting a zombie linger. That said, there is obviously an upper limit on how long zombies can be left to linger.
I’ve been wanting to implement a more “overzealous” approach to cleaning up orphaned pods from analytical workflows (Prefect) that hang on to expensive compute resources. Sometimes it feels frustratingly out of control. It’s really difficult to pull a good signal out of the noise on whether a pod is actually orphaned (due to the things you’ve mentioned), and killing a workload that isn’t actually orphaned can be very costly due to re-runs. Commenting out of solidarity here, but also curious to see others chime in with their approach.
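For what it's worth, here is a minimal sketch of the tradeoff being discussed: rather than a single timeout, use two thresholds so that a task that goes quiet is only flagged as suspect first, and is only treated as dead after a much longer silence. Everything here is hypothetical (`TaskRecord`, `classify`, and the threshold values are made up for illustration); it is not how Prefect or Kubernetes does it, just one way to frame the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"  # quiet, but could be a GC pause or I/O stall
    DEAD = "dead"        # quiet long enough that cleanup is justified


@dataclass
class TaskRecord:
    # Hypothetical record; the timestamp is in seconds on the same clock as `now`.
    last_heartbeat: float


def classify(
    task: TaskRecord,
    now: float,
    suspect_after: float = 30.0,   # assumed value, tune per workload
    dead_after: float = 300.0,     # assumed value, tune per workload
) -> Status:
    """Classify a task by how long it has been silent."""
    silence = now - task.last_heartbeat
    if silence >= dead_after:
        return Status.DEAD
    if silence >= suspect_after:
        return Status.SUSPECT
    return Status.HEALTHY
```

The gap between `suspect_after` and `dead_after` is the grace period: the wider it is, the fewer healthy-but-slow workloads get killed, at the cost of zombies holding resources longer.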
Really good article. We had been debating moving from a monolith to services in a distributed environment a while back, and I recommended real baby steps: let's not do full-blown services, but first break up some of the components so everything isn't deployed together. Guess what? Zombie tasks. Not that many, but tracking them down is a bear.