Is it possible to reboot a machine after `system failure` in a CI test?

Yangfisher1 · February 24, 2022, 6:45am

I’m trying to set up a CI pipeline for a special project with GitLab. Its unit tests build a kernel module and load it with insmod. However, bugs in the code can sometimes cause a kernel crash. I think I can use a GitLab runner with the ssh executor to run the tests. After a kernel crash, I’d like the CI pipeline to detect the failure and run some commands to reboot the remote machine. Is that possible?

snim2 · February 24, 2022, 2:11pm

In the runner configuration you can set a post_build_script which runs after the script section of a pipeline job, but before the after_script section. This is where you might reboot your machine.

However, if you can, it might be easier to use a Docker runner. Then, if the container for your build job crashes, you won’t need to reboot the physical machine.

There are a few other forum topics and blog posts about building kernels, so you might find some useful inspiration there.

Yangfisher1 · February 25, 2022, 2:13am

Sure, using a Docker runner is easier. However, our unit tests need an RDMA NIC, which makes the Docker setup more complex.

OK, I’ll try post_build_script. I already tried after_script, simulating a crash by using iDRAC to reboot the remote test machine. It seems the broken ssh connection is treated as a serious failure, and the runner machine can’t do anything to reboot the machine after detecting it. It just throws the error `ERROR: Job failed (system failure): wait: remote command exited without exit status or exit signal` and doesn’t execute the after_script. Also, it seems that using post_build_script means the machine gets rebooted every time a job finishes, am I right?
What I want is to detect the failure and remotely reboot the machine using the racadm command. From what I’ve learned about GitLab CI so far, that looks impossible.

Yangfisher1 · February 25, 2022, 2:39am

I tried post_build_script, and it didn’t work. The runner machine didn’t execute the commands in post_build_script when the crash occurred.
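
One possible way around this, as an untested sketch: since nothing on the crashed machine can run, the recovery would have to happen on some other machine. The helper below checks whether the target still answers over ssh and, if not, power-cycles it through iDRAC with racadm. The function names, the `IDRAC_USER` / `IDRAC_PASS` variables, the overridable `PROBE_CMD` / `RACADM_CMD` hooks and the choice of `serveraction powercycle` are all assumptions for illustration; none of them come from GitLab.

```shell
#!/bin/sh
# Hypothetical recovery helper. Run it on a machine OTHER than the test target.
# IDRAC_USER and IDRAC_PASS are assumed to be set in the environment.
# Note: a password on the command line is visible in the process list.

# Overridable so the logic can be dry-run without touching real hardware.
PROBE_CMD=${PROBE_CMD:-"ssh -o BatchMode=yes -o ConnectTimeout=10"}
RACADM_CMD=${RACADM_CMD:-racadm}

# Return 0 if the target runs a trivial command over ssh.
target_is_up() {
    $PROBE_CMD "$1" true >/dev/null 2>&1
}

# Power-cycle the target through its iDRAC only if it is unreachable.
# $1 = test machine hostname, $2 = iDRAC address
recover_target() {
    target=$1
    idrac=$2
    if target_is_up "$target"; then
        echo "target up, nothing to do"
        return 0
    fi
    echo "target unreachable, power-cycling via iDRAC"
    $RACADM_CMD -r "$idrac" -u "$IDRAC_USER" -p "$IDRAC_PASS" serveraction powercycle
}
```

It could be called as `recover_target <test-host> <idrac-address>` from cron, or from a follow-up job on a different runner. Whether a follow-up job actually gets triggered after a `system failure` would need checking.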

Yangfisher1 · February 25, 2022, 8:59am

By the way, is it possible for a runner whose executor is ssh to execute some command after the ssh connection? For example, if I run a test and find a kernel BUG with dmesg, I want to reboot the remote machine, but since the runner is still connected, script and after_script won’t work. Letting the runner know that there is a problem also seems difficult.

snim2 · February 25, 2022, 9:34am

The pre_clone_script is probably the closest you’ll get to executing something right after the ssh connection has been made.
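
For the detection side, here is a minimal sketch of a check that could run at the end of the script section, so that a kernel BUG in dmesg shows up to the runner as an ordinary failed job. The function names, the pattern list and the exit code 42 are assumptions for illustration, and the patterns are almost certainly incomplete.

```shell
#!/bin/sh
# Read kernel log text on stdin; return 0 if it contains a crash marker.
# The pattern list is an assumption and should be adapted to the module under test.
log_has_kernel_bug() {
    grep -Eq 'kernel BUG at|BUG: |Oops: |Kernel panic'
}

# Hypothetical wrapper: return a distinct non-zero code when dmesg looks bad.
# Reading dmesg may require root, depending on the system's settings.
check_kernel_log() {
    if dmesg | log_has_kernel_bug; then
        echo "kernel BUG detected in dmesg" >&2
        return 42
    fi
    return 0
}
```

Ending the job's script with something like `check_kernel_log || exit $?` would make the job fail while the ssh connection is still alive. The reboot itself would then have to be issued from somewhere other than the machine being rebooted.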

Yangfisher1 · February 28, 2022, 1:11am

Thanks!