Topic: More-efficient recovery from failures during large-ML-model training