test(bindings): Fix unsupported FS check in cufile#2233
Conversation
leofang
left a comment
There was a problem hiding this comment.
A few items — main concern is fixture execution ordering (inline below): usefixtures order is not an execution-order guarantee, so driver can set up before the skip resolves and we'd hit cufile.driver_open() on tests we're about to skip (or error in driver setup on an unsupported FS instead of cleanly skipping). Docstring inversion and # noqa colon are quick fixes. Two minor nits.
Doesn't address #1307 itself — fine, that's a separate root-cause question we can keep open.
-- Leo's bot
|
|
||
| @pytest.mark.skipif(not isSupportedFilesystem(), reason="cuFile handle_register requires ext4 or xfs filesystem") | ||
| @pytest.mark.usefixtures("driver") | ||
| @pytest.mark.usefixtures("driver", "skipIfUnsupportedFilesystem") |
There was a problem hiding this comment.
usefixtures string order is not an execution-order guarantee — pytest orders fixtures by scope and dependency graph, not by position in the marker. With ("driver", "skipIfUnsupportedFilesystem") driver can (and likely will) set up first, so we'll run cufile.driver_open() on every test we then skip, and if driver setup itself touches the FS on an unsupported mount it'll error instead of cleanly skipping.
Suggest making the skip a dependency of driver (and stats) so it always resolves first:
@pytest.fixture
def driver(ctx, skipIfUnsupportedFilesystem):
...
@pytest.fixture
def stats(..., skipIfUnsupportedFilesystem):
...Then skipIfUnsupportedFilesystem can be dropped from each test's usefixtures list — the dependency carries it transitively.
There was a problem hiding this comment.
This seems like the bot is not thinking it through?
driver is used in basically all tests, this just skips some. If it depended on it, sure we skip some setup work but we would stop running tests that can run just fine?!
There was a problem hiding this comment.
It's been a while since I last read this file. My take is that for the tests that do need the driver fixture, we want to ensure the execution order to be "check if the test needs to be skipped" -> "driver setup" -> "test" -> "driver teardown", and the bot was cautiously calling out that this order may or may not be preserved by usefixtures with multiple args?
| logging.info(f"Current filesystem type (findmnt): {fs_type}") | ||
| return fs_type in ("ext4", "xfs") | ||
| if fs_type not in ("ext4", "xfs"): | ||
| pytest.skip("cuFile handle_register requires ext4 or xfs filesystem") |
There was a problem hiding this comment.
Nit: worth a sentence in the PR body or docstring that on dev machines where the pytest basetemp lands on tmpfs (a common default for /tmp), the cuFile suite will now silently skip — previously it would attempt to run from CWD. Correct behavior, but a noticeable change for local devs that's worth flagging so nobody is puzzled by 100% skipped tests.
There was a problem hiding this comment.
Hmmmm, but this is a point because the tests did pass, but it might be that they are now indeed skipped (sometimes/always?).
I.e. things are working (maybe in cufile compat mode) here and skipping them would be just because of the FS check not being very good.
EDIT: Sorry, let me just run and see what is the actual situation and then rethink...
There was a problem hiding this comment.
Hmmmm, OK, maybe mostly for posterity. The old path that was using the current working directory directly failed in this way: https://github.com/NVIDIA/cuda-python/actions/runs/28160946473/job/83403807062?pr=2233#step:31:1364
25-06-2026 09:55:05:66 [pid=4215 tid=4215] ERROR cufio-udev:67 udev property not found: ID_FS_USAGE vda1
25-06-2026 09:55:05:66 [pid=4215 tid=4215] ERROR cufio-fs:970 error getting volume attributes error for device: dev_no: 253:1
25-06-2026 09:55:05:66 [pid=4215 tid=4215] DEBUG cufio_core:1350 cuFile DIO status for file descriptor 31 DirectIO not supported
25-06-2026 09:55:05:66 [pid=4215 tid=4215] DEBUG cufio:311 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled
** Note: Now trying the compat mode: **
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-udev:131 numanode string -1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-udev:131 numanode string -1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-udev:148 device pci path string : 0000:09:00.0->0000:00:02.7
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-udev:94 sysfs attribute found integrity/device_is_integrity_capable vda1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-fs:374 block device vda1 drive integrity check capability not present. Ok
25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO cufio-fs:449 Block dev: /dev/vda1 numa node: -1 pci bridge: 0000:00:02.7 bdf: 0000:09:00.0
25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO cufio-udev:99 sysfs attribute not found device/transport vda1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO cufio-udev:99 sysfs attribute not found wwid vda1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG cufio-udev:94 sysfs attribute found queue/logical_block_size vda1
** NOTE: And this seems to be the problem (But locally I think `ID_FS_USAGE` doesn't exist either, but not 100% sure. **
25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR cufio-udev:67 udev property not found: ID_FS_USAGE vda1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR cufio-fs:970 error getting volume attributes error for device: dev_no: 253:1
25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR cufio-obj:251 unable to get volume attributes for fd 31
25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR cufio:333 cuFileHandleRegister error, failed to allocate file object
25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR cufio:363 cuFileHandleRegister error: file descriptor is not registered
25-06-2026 09:55:05:104 [pid=4215 tid=4215] TRACE cufio:895 cuFileDriver closing
So what that gives? I suspect the FS based skipped was always shady because tmpfs, etc. work just fine in compat mode and the whole FS based logic makes sense without compat mode?
However, compat mode is tried here and fails in cwd (but not in the temp folder which actually reports as overlayfs(!?), although I would have thought it is just /tmp/<something> but I didn't check yet.)
There was a problem hiding this comment.
I guess ping @sourabgupta3, maybe you have some relatively quick insights? TBH, I wonder if we shouldn't just delete the check entirely and assume that the CI filesystem failing even with compat mode allowed is a weird niche thing that is hard to check for...
This is a follow-up from making cufile tests use temporary directories as noted by Leo I don't think QA is a problem, because the previous xfail was guarded by having `CI` in the environment variables. However, the `isSupportedFilesystem` was using the wrong directory now as we are now running the test in the temporary directory. My suspicion is that there is some additional check that would be strictly needed (e.g. to check that it isn't just ext4 but also directly mounted on a local nvme device) but I have not figured out a check for that.
Co-authored-by: Leo Fang <leo80042@gmail.com> Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
3a34c26 to
2451985
Compare
|
Also see if cufile logging can give us more info... Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
5a30c36 to
dd647ae
Compare
This is a follow-up from making cufile tests use temporary directories as noted by Leo
I don't think QA is a problem, because the previous xfail was guarded by having
CIin the environment variables.However, the
isSupportedFilesystemwas using the wrong directory now as we are now running the test in the temporary directory.My suspicion is that there is some additional check that would be strictly needed (e.g. to check that it isn't just ext4 but also directly mounted on a local nvme device) but I have not figured out a check for that.
(I don't love
pytest.skip()use, but no good way around as we need the tempdir.)