Parallel tests#2190
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the tmt test runner to support parallel execution of test plans using std::thread. It introduces a RunPlanResult struct and extracts the core test logic into a run_plan function. The review feedback identifies several opportunities for improvement, including addressing interleaved console output from concurrent threads, optimizing the thread-joining logic to avoid waiting for entire batches, handling potential panics when querying system parallelism, and removing redundant clones of owned objects.
122ad1a to
93f4abf
Compare
cgwalters
left a comment
There was a problem hiding this comment.
Without any kind of deeper review, I think what we really want to do is upstream into https://github.com/teemtee/tmt support for bcvk.
I believe it already has support for concurrency (I mean I'd hope) and such - and that would make it a lot more sustainable for other projects to use tmt+bcvk (and we have many in the ecosystem that would make sense to do so)
93f4abf to
f2bbb18
Compare
|
One or two test failures, but not as good of a speedup as I'd had locally. GA seems to have 4cpu |
|
Yes, we can bump to larger runners though. I was experimenting with that previously in bootc-dev/ci-sandbox#1 but it's been a while |
5230a9f to
a98f50a
Compare
|
Alright, so composefs integration tests now take ~30-40 mins instead of ~1hr - 1hr30min Tests are still quite inconsistent tough. Especially the GC one randomly takes over 1000s to complete, dunno why. Needs some investigation |
|
The ostree one takes around ~1hr instead of ~2h15min Some tests are randomly taking a bit too long though I'll try to debug this |
a98f50a to
cedbd85
Compare
9e8d647 to
a02028c
Compare
`bootc status` for UKIs takes upto 250MB of memory as we load the entire UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190 tests for UKI get OOM killed Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
`bootc status` for UKIs takes upto 250MB of memory as we load the entire UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190 tests for UKI get OOM killed Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
a02028c to
2110c6c
Compare
`bootc status` for UKIs takes upto 250MB of memory as we load the entire UKI into memory just to extract the cmdline. In bootc-dev/bootc#2190 tests for UKI get OOM killed Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
ff859ce to
e437f70
Compare
00e2ca9 to
d514d82
Compare
|
Interesting I'll look into this one |
5f10779 to
b0d220c
Compare
b0d220c to
8ebd2af
Compare
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Signed-off-by: Johan-Liebert1 <pragyanpoudyal41999@gmail.com>
So we don't spawn VMs for tests like "composefs-gc" just to do nothing and exit Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Sort tests in descending order of time taken for completion so longer tests get scheduled together. Also, update to use mpsc channels for communication between threads Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
Use a Mutex guard for VM creation as CI sometimes is failing with ``` Error: 0: Failed to create libvirt domain 1: Failed to start libvirt domain: error: Failed to start domain 'bootc-..' error: failed to create directory '/run/user/1001/libvirt/qemu/run/swtpm': File exists ``` Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
8ebd2af to
2452901
Compare
To reduce CI failures due to flakes, rerun failed tests Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
2452901 to
bdb74ec
Compare
|
I think this one's ready for review now |
Signed-off-by: Pragyan Poudyal <pragyanpoudyal41999@gmail.com>
6a5c236 to
4d064c6
Compare
| // Log disk usage after each test run to help diagnose "no space left on device" failures | ||
| println!("Disk usage after plan {}:", plan); | ||
| let _ = cmd!(sh, "df -h").run(); | ||
| let _ = cmd!(sh, "du -h /var/* -d 1").run(); |
There was a problem hiding this comment.
df is a cheap operation, this one is not; I don't think we should do it by default.
There was a problem hiding this comment.
Only added this for testing. I'll remove it
| } else { | ||
| println!("Run ID not available - cannot generate detailed report"); | ||
| } | ||
| let rerun_suffix = format!("{}-rerun", random_suffix); |
There was a problem hiding this comment.
Shouldn't this be part of tmt?
I am really uncertain about doing this retry by default too at this level.
Ideally, we basically have retries around all network operations. Everything else - we really do want to drive flakes to zero.
There was a problem hiding this comment.
Sometimes I can see tests in CI failing because the VM never starts after rebooting. This should hopefully catch that. I'm not sure about how tmt manages VMs, but in the said case, we would have to create a new VM
| // Launch VM with bcvk | ||
| let firmware_args_slice = firmware_args.as_slice(); | ||
|
|
||
| let guard = match libvirt_lock.lock() { |
There was a problem hiding this comment.
error: failed to create directory '/run/user/1001/libvirt/qemu/run/swtpm': File exists
That's a bug in libvirt that we're just working around, right?
If we have to do a workaround, I think it'd be better in bcvk.
There was a problem hiding this comment.
I'm not sure if this would be considered a bug. It only doesn't work sometimes so I believed it was due to a race condition
| def main [] { | ||
| tap begin "bootc-image-builder qcow2 build test" | ||
|
|
||
| print ">>>>>>>>>>>>>>>>>> the current working directory <<<<<<<<<<<<<<<<<" ($env.PWD) |
| const DISTRO_CENTOS_9: &str = "centos-9"; | ||
|
|
||
| // Tests sorted by time taken (descending) | ||
| const TESTS_SORTED_BY_TIME: [&str; 23] = [ |
There was a problem hiding this comment.
Not a fan of duplicating all of our test names, this will get out of date quickly.
There was a problem hiding this comment.
yeah... would you prefer adding a key extra-time-taken or something similar so we can do this sort using that key?
|
|
||
| /// For tests that should only run for composefs systems | ||
| /// Ex. composefs-gc | ||
| const FIELD_SKIP_IF_OSTREE: &str = "skip_if_ostree"; |
| struct RunPlanResult { | ||
| plan_name: String, | ||
| passed: bool, | ||
| time_taken: Option<Duration>, |
There was a problem hiding this comment.
I guess but isn't this part of tmt too?
|
|
||
| let mut handles: Vec<JoinHandle<RunPlanResult>> = vec![]; | ||
|
|
||
| let num_cpu = std::thread::available_parallelism() |
There was a problem hiding this comment.
Doesn't tmt support parallelism? I think eventually the right thing to do here is to move the bcvk stuff as a first-class op in tmt.
But I don't object at all to doing it here in the interim.
No description provided.