Skip to content

test(bindings): Fix unsupported FS check in cufile#2233

Open
seberg wants to merge 3 commits into
NVIDIA:mainfrom
seberg:supported-fs-fixture
Open

test(bindings): Fix unsupported FS check in cufile#2233
seberg wants to merge 3 commits into
NVIDIA:mainfrom
seberg:supported-fs-fixture

Conversation

@seberg

@seberg seberg commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

This is a follow-up from making cufile tests use temporary directories as noted by Leo

I don't think QA is a problem, because the previous xfail was guarded by having CI in the environment variables.

However, the isSupportedFilesystem was using the wrong directory now as we are now running the test in the temporary directory.

My suspicion is that there is some additional check that would be strictly needed (e.g. to check that it isn't just ext4 but also directly mounted on a local nvme device) but I have not figured out a check for that.

(I don't love pytest.skip() use, but no good way around as we need the tempdir.)

@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.bindings Everything related to the cuda.bindings module label Jun 17, 2026
@seberg seberg added test Improvements or additions to tests bug Something isn't working labels Jun 17, 2026
@seberg seberg self-assigned this Jun 17, 2026
@seberg seberg added this to the cuda.core v1.1.0 milestone Jun 17, 2026

@leofang leofang left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few items — main concern is fixture execution ordering (inline below): usefixtures order is not an execution-order guarantee, so driver can set up before the skip resolves and we'd hit cufile.driver_open() on tests we're about to skip (or error in driver setup on an unsupported FS instead of cleanly skipping). Docstring inversion and # noqa colon are quick fixes. Two minor nits.

Doesn't address #1307 itself — fine, that's a separate root-cause question we can keep open.

-- Leo's bot

Comment thread cuda_bindings/tests/test_cufile.py Outdated
Comment thread cuda_bindings/tests/test_cufile.py Outdated

@pytest.mark.skipif(not isSupportedFilesystem(), reason="cuFile handle_register requires ext4 or xfs filesystem")
@pytest.mark.usefixtures("driver")
@pytest.mark.usefixtures("driver", "skipIfUnsupportedFilesystem")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usefixtures string order is not an execution-order guarantee — pytest orders fixtures by scope and dependency graph, not by position in the marker. With ("driver", "skipIfUnsupportedFilesystem") driver can (and likely will) set up first, so we'll run cufile.driver_open() on every test we then skip, and if driver setup itself touches the FS on an unsupported mount it'll error instead of cleanly skipping.

Suggest making the skip a dependency of driver (and stats) so it always resolves first:

@pytest.fixture
def driver(ctx, skipIfUnsupportedFilesystem):
    ...

@pytest.fixture
def stats(..., skipIfUnsupportedFilesystem):
    ...

Then skipIfUnsupportedFilesystem can be dropped from each test's usefixtures list — the dependency carries it transitively.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the bot is not thinking it through?

driver is used in basically all tests, this just skips some. If it depended on it, sure we skip some setup work but we would stop running tests that can run just fine?!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while since I last read this file. My take is that for the tests that do need the driver fixture, we want to ensure the execution order to be "check if the test needs to be skipped" -> "driver setup" -> "test" -> "driver teardown", and the bot was cautiously calling out that this order may or may not be preserved by usefixtures with multiple args?

Comment thread cuda_bindings/tests/test_cufile.py
Comment thread cuda_bindings/tests/test_cufile.py Outdated
logging.info(f"Current filesystem type (findmnt): {fs_type}")
return fs_type in ("ext4", "xfs")
if fs_type not in ("ext4", "xfs"):
pytest.skip("cuFile handle_register requires ext4 or xfs filesystem")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: worth a sentence in the PR body or docstring that on dev machines where the pytest basetemp lands on tmpfs (a common default for /tmp), the cuFile suite will now silently skip — previously it would attempt to run from CWD. Correct behavior, but a noticeable change for local devs that's worth flagging so nobody is puzzled by 100% skipped tests.

@seberg seberg Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, but this is a point because the tests did pass, but it might be that they are now indeed skipped (sometimes/always?).

I.e. things are working (maybe in cufile compat mode) here and skipping them would be just because of the FS check not being very good.

EDIT: Sorry, let me just run and see what is the actual situation and then rethink...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, OK, maybe mostly for posterity. The old path that was using the current working directory directly failed in this way: https://github.com/NVIDIA/cuda-python/actions/runs/28160946473/job/83403807062?pr=2233#step:31:1364

 25-06-2026 09:55:05:66 [pid=4215 tid=4215] ERROR  cufio-udev:67 udev property not found: ID_FS_USAGE vda1
 25-06-2026 09:55:05:66 [pid=4215 tid=4215] ERROR  cufio-fs:970 error getting volume attributes error for device: dev_no: 253:1
 25-06-2026 09:55:05:66 [pid=4215 tid=4215] DEBUG  cufio_core:1350 cuFile DIO status for file descriptor 31 DirectIO not supported
 25-06-2026 09:55:05:66 [pid=4215 tid=4215] DEBUG  cufio:311 cuFileHandleRegister GDS not supported or disabled by config, using cuFile posix read/write with compat mode enabled

** Note: Now trying the compat mode: **

 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-udev:131 numanode string -1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-udev:131 numanode string -1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-udev:148 device pci path string : 0000:09:00.0->0000:00:02.7
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-udev:94 sysfs attribute found integrity/device_is_integrity_capable vda1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-fs:374 block device vda1 drive integrity check capability not present. Ok
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO   cufio-fs:449 Block dev: /dev/vda1 numa node: -1 pci bridge: 0000:00:02.7 bdf: 0000:09:00.0
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO   cufio-udev:99 sysfs attribute not found device/transport vda1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] INFO   cufio-udev:99 sysfs attribute not found wwid vda1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] DEBUG  cufio-udev:94 sysfs attribute found queue/logical_block_size vda1

** NOTE: And this seems to be the problem (But locally I think `ID_FS_USAGE` doesn't exist either, but not 100% sure. **

 25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR  cufio-udev:67 udev property not found: ID_FS_USAGE vda1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR  cufio-fs:970 error getting volume attributes error for device: dev_no: 253:1
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR  cufio-obj:251 unable to get volume attributes for fd 31
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR  cufio:333 cuFileHandleRegister error, failed to allocate file object
 25-06-2026 09:55:05:67 [pid=4215 tid=4215] ERROR  cufio:363 cuFileHandleRegister error: file descriptor is not registered
 25-06-2026 09:55:05:104 [pid=4215 tid=4215] TRACE  cufio:895 cuFileDriver closing

So what that gives? I suspect the FS based skipped was always shady because tmpfs, etc. work just fine in compat mode and the whole FS based logic makes sense without compat mode?
However, compat mode is tried here and fails in cwd (but not in the temp folder which actually reports as overlayfs(!?), although I would have thought it is just /tmp/<something> but I didn't check yet.)

@seberg seberg Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess ping @sourabgupta3, maybe you have some relatively quick insights? TBH, I wonder if we shouldn't just delete the check entirely and assume that the CI filesystem failing even with compat mode allowed is a weird niche thing that is hard to check for...

seberg and others added 2 commits June 25, 2026 09:04
This is a follow-up from making cufile tests use temporary directories
as noted by Leo

I don't think QA is a problem, because the previous xfail was guarded
by having `CI` in the environment variables.

However, the `isSupportedFilesystem` was using the wrong directory now
as we are now running the test in the temporary directory.

My suspicion is that there is some additional check that would be
strictly needed (e.g. to check that it isn't just ext4 but also
directly mounted on a local nvme device) but I have not figured out
a check for that.
Co-authored-by: Leo Fang <leo80042@gmail.com>
Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
@seberg seberg force-pushed the supported-fs-fixture branch from 3a34c26 to 2451985 Compare June 25, 2026 07:04
@github-actions

Copy link
Copy Markdown

Also see if cufile logging can give us more info...

Signed-off-by: Sebastian Berg <sebastianb@nvidia.com>
@seberg seberg force-pushed the supported-fs-fixture branch from 5a30c36 to dd647ae Compare June 25, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module test Improvements or additions to tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants