Add MSSQL bulk load via BCP#66

Merged
martinv13 merged 13 commits into
cre-dev:mainfrom
martinv13:claude/mssql-bcp
Jun 20, 2026

Conversation

@martinv13

Collaborator

Summary

  • Adds a bulk_insert() override to MSSQLDialect that uses the bcp
    utility for large batches (≥ 1000 rows), falling back to
    fast_executemany for smaller batches or when bcp is not on PATH
  • Auto-detects bcp at DataModel creation time via shutil.which();
    no configuration needed when available
  • Pass use_bcp=False to DataModel to opt out entirely
  • Threads use_bcp through DataModel.__init__() -> get_dialect() ->
    dialect constructor via **kwargs, so other dialects are unaffected
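
The dispatch rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the names `BCP_MIN_ROWS`, `detect_bcp` and `choose_insert_path` are hypothetical, and only the 1000-row threshold, the `shutil.which()` lookup and the `use_bcp` opt-out come from the summary.

```python
import shutil
from typing import Optional

# Threshold stated in the summary; the constant name is hypothetical.
BCP_MIN_ROWS = 1000


def detect_bcp(use_bcp: bool = True) -> Optional[str]:
    """Return the path to the bcp binary, or None when disabled or not on PATH."""
    if not use_bcp:
        return None
    return shutil.which("bcp")


def choose_insert_path(row_count: int, bcp_path: Optional[str]) -> str:
    """Pick the load strategy for one batch of rows."""
    if bcp_path is not None and row_count >= BCP_MIN_ROWS:
        return "bcp"
    # Small batches, or no usable bcp binary: stay on the pyodbc path.
    return "fast_executemany"
```

Resolving the binary once and passing the result around keeps the per-batch decision a cheap comparison, and forcing `bcp_path` to `None` exercises the fallback without touching PATH.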

Value formatting (BCP character mode, tab-separated)

Python type   Written as          Notes
None          empty field         -k flag maps to NULL
bool          0 / 1               BIT columns
bytes         {hex}               VARBINARY hex literal
datetime      str(v)              ISO format SQL Server accepts
str           str(v), tab→space   tab is the field delimiter
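
A minimal sketch of the formatting rules in the table, assuming one row is written per line. The function names `format_bcp_value` and `format_bcp_row` are illustrative and may not match the PR's helper, and the sketch does not handle newlines embedded in string values.

```python
from datetime import datetime


def format_bcp_value(v) -> str:
    """Render one Python value as a BCP character-mode field."""
    if v is None:
        return ""  # empty field; with -k it is loaded as NULL
    if isinstance(v, bool):  # check bool first, since bool is a subclass of int
        return "1" if v else "0"
    if isinstance(v, (bytes, bytearray)):
        return bytes(v).hex()  # hex digits without a 0x prefix
    if isinstance(v, str):
        return v.replace("\t", " ")  # tab is the field delimiter
    return str(v)  # numbers, datetime, and anything else


def format_bcp_row(row) -> str:
    """Join the formatted fields of one row with tabs and end the line."""
    return "\t".join(format_bcp_value(v) for v in row) + "\n"


line = format_bcp_row([1, None, True, b"\x01\xab", "a\tb", datetime(2026, 6, 20, 21, 34)])
```

Checking `bool` before any numeric handling matters because `isinstance(True, int)` is true in Python, so a generic `str(v)` branch would emit `True` instead of `1`.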

Notes

  • BCP does not participate in the caller's SQLAlchemy transaction —
    acceptable since insertions target ephemeral staging tables that are
    dropped after each load
  • MSSQL CI workflow updated to install mssql-tools18 (provides bcp)
    and expose it on PATH

Test plan

  • tests/test_bulk_insert_mssql.py — 10 tests: 4 covering the
    fast_executemany path (< 1000 rows: basic, numeric, boolean, scalar
    default) and 5 covering the BCP path (basic, numeric, boolean, binary,
    scalar default), plus 1 fallback test that forces bcp_path = None to
    verify graceful degradation
  • BCP tests skip automatically when bcp is not on PATH
  • All non-DB tests continue to pass (pytest -m "not dbtest")

🤖 Generated with Claude Code

claude added 13 commits June 19, 2026 20:26
Overrides bulk_insert() in PostgreSQLDialect to stream an in-memory CSV
payload to PostgreSQL using the native COPY protocol instead of SQLAlchemy
executemany. Supports both psycopg2 (copy_expert) and psycopg3 (cursor.copy),
falling back to the base-class implementation for other drivers.

Adds integration tests parametrised over both drivers, and updates the
PostgreSQL CI workflow to a matrix that runs once per driver.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
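
The COPY approach in this commit message can be sketched as below. This is a sketch under assumptions rather than the committed code: the helper names are hypothetical, identifiers are not quoted, and the driver is detected by duck typing on the cursor, where the real override falls back to the base class for other drivers.

```python
import csv
import io


def build_copy_payload(rows) -> io.StringIO:
    """Serialise rows into an in-memory CSV buffer suitable for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)  # None is written as an empty, unquoted field
    buf.seek(0)
    return buf


def copy_sql(table: str, columns) -> str:
    """Build the COPY statement (identifiers are not quoted in this sketch)."""
    return f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"


def copy_rows(cursor, table, columns, rows) -> None:
    """Stream the payload through whichever COPY API the cursor exposes."""
    payload = build_copy_payload(rows)
    sql = copy_sql(table, columns)
    if hasattr(cursor, "copy_expert"):  # psycopg2
        cursor.copy_expert(sql, payload)
    elif hasattr(cursor, "copy"):  # psycopg3
        with cursor.copy(sql) as copy:
            copy.write(payload.getvalue())
    else:
        raise NotImplementedError("no COPY support on this cursor")
```

In CSV format PostgreSQL reads an unquoted empty field as NULL by default, so this sketch does not distinguish `None` from an empty string in multi-column rows.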
Adds a BCP-based bulk_insert() to MSSQLDialect. When the bcp binary is
available on PATH and the connection uses SQL authentication, batches of
1000+ rows are loaded via a subprocess call to bcp with a temp file
(tab-separated, UTF-8), falling back to fast_executemany for smaller
batches or when bcp is unavailable. Pass use_bcp=False to DataModel to
disable BCP entirely.

Also threads use_bcp through DataModel.__init__() -> get_dialect() ->
dialect constructor, adding **kwargs forwarding to DatabaseDialect so
unknown options are silently ignored by other dialects.

Updates the MSSQL CI workflow to install mssql-tools18 (provides bcp)
and adds it to PATH for the test step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
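
The option threading described in this commit message might look like the following. The classes are simplified stand-ins that reuse the names from the message; their real signatures in the repository are not shown here, so treat every parameter as an assumption.

```python
class DatabaseDialect:
    def __init__(self, **kwargs):
        # Options a dialect does not understand are accepted and ignored.
        pass


class MSSQLDialect(DatabaseDialect):
    def __init__(self, use_bcp=True, **kwargs):
        super().__init__(**kwargs)
        self.use_bcp = use_bcp


class PostgreSQLDialect(DatabaseDialect):
    pass


# Hypothetical registry; the real lookup may work differently.
_DIALECTS = {"mssql": MSSQLDialect, "postgresql": PostgreSQLDialect}


def get_dialect(name, **kwargs):
    """Instantiate the dialect for `name`, forwarding all options."""
    return _DIALECTS.get(name, DatabaseDialect)(**kwargs)


class DataModel:
    def __init__(self, dialect_name, use_bcp=True):
        self.dialect = get_dialect(dialect_name, use_bcp=use_bcp)


mssql_model = DataModel("mssql", use_bcp=False)
pg_model = DataModel("postgresql")
```

Swallowing unknown keyword arguments in the base class keeps the call site uniform, at the cost of hiding typos in option names.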
The _insert_and_read helper orders by table.c.id; the numeric test table
was missing that column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Add -No flag to BCP command when TrustServerCertificate=yes is present
in the connection URL query, allowing BCP to connect to servers with
self-signed certificates (same trust level as pyodbc).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Use -u flag (trust server certificate) instead of -No which is not
supported in mssql-tools18 BCP. Also remove duplicate workflow_dispatch
trigger in python-package.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
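
As an illustration of the certificate handling in the two commits above, the sketch below derives the extra bcp argument from a connection URL. It parses the URL with the standard library and matches the query key exactly; the PR presumably reads the query from the SQLAlchemy URL object instead, and `trust_flags` is a hypothetical name.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def trust_flags(url: str) -> List[str]:
    """Return ["-u"] when the URL asks to trust the server certificate."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("TrustServerCertificate", [])
    if any(v.lower() == "yes" for v in values):
        return ["-u"]  # mssql-tools18 bcp: trust server certificate
    return []


url = (
    "mssql+pyodbc://sa:secret@localhost:1433/testdb"
    "?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes"
)
flags = trust_flags(url)
```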
BCP character mode cannot load hex-encoded data into varbinary columns
(produces "Text column data incomplete" error). Detect LargeBinary
columns and skip BCP for those tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Instead of falling back to fast_executemany for binary columns, generate
a non-XML BCP format file that specifies SQLVARBINARY with a 4-byte
length prefix for LargeBinary columns and SQLCHAR for everything else.
The data file is written in binary mode: raw bytes with prefix for binary
fields, UTF-8 text for character fields. NULL binary is encoded as
prefix value -1 (0xFFFFFFFF).

This allows BCP to handle the record hash column (VARBINARY) present in
almost all xml2db tables while keeping the fast character-mode path for
all other column types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
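
The length-prefixed binary encoding described in this commit message can be sketched as follows. The little-endian byte order is an assumption (it matches x86 hosts), and `encode_varbinary_field` is a hypothetical name, not the PR's function.

```python
import struct
from typing import Optional


def encode_varbinary_field(value: Optional[bytes]) -> bytes:
    """Encode one binary field as a 4-byte signed length prefix plus raw bytes."""
    if value is None:
        # NULL is signalled by a prefix of -1, i.e. the bytes FF FF FF FF.
        return struct.pack("<i", -1)
    return struct.pack("<i", len(value)) + bytes(value)


null_field = encode_varbinary_field(None)
data_field = encode_varbinary_field(b"\x01\x02")
```

Because the prefix carries the length, the payload may contain tab or newline bytes without being mistaken for a delimiter, which hex text achieves only by doubling the field size.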
BCP character mode accepts hex-encoded binary without the 0x prefix.
Simplify _format_bcp_value to emit v.hex() for bytes values and revert
to plain character mode (no format file needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
cursor.rowcount is -1 for SELECT statements (PEP 249 lets drivers report
-1 when the count is unknown); use fetchall() instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
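
The rowcount pitfall fixed in this commit can be reproduced without a SQL Server instance. The sketch below uses the standard-library `sqlite3` driver as a stand-in, which also reports -1 for SELECT; the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

cur = conn.execute("SELECT id FROM t")
unreliable = cur.rowcount  # -1: the driver does not count SELECT results
rows = cur.fetchall()
row_count = len(rows)  # 3: count the fetched rows instead
conn.close()
```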
Restores push+pull_request triggers on main for python-package,
postgres, mysql, and duckdb workflows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
When Trusted_Connection=yes is in the URL query, BCP is invoked with
-T (trusted connection) instead of -U/-P, mirroring how the ODBC driver
handles Kerberos auth on Linux. Falls back to fast_executemany only when
both SQL credentials and Trusted_Connection are absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
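
The authentication choice described in this commit message can be sketched as a small function that returns the bcp arguments, or `None` when the caller should fall back to fast_executemany. The name `bcp_auth_args` and the standard-library URL parsing are assumptions made for illustration.

```python
from typing import List, Optional
from urllib.parse import parse_qs, unquote, urlsplit


def bcp_auth_args(url: str) -> Optional[List[str]]:
    """Return bcp auth arguments, or None when bcp cannot authenticate."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    trusted = any(v.lower() == "yes" for v in query.get("Trusted_Connection", []))
    if trusted:
        return ["-T"]  # trusted connection, no credentials on the command line
    if parts.username and parts.password:
        # urlsplit leaves percent-escapes in place, so decode them for bcp.
        return ["-U", unquote(parts.username), "-P", unquote(parts.password)]
    return None  # neither mode available: use fast_executemany instead
```

Returning `None` rather than an empty list keeps "no authentication possible" distinct from "no extra arguments needed", so the caller cannot launch bcp without credentials by accident.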
@martinv13 martinv13 merged commit 28dec43 into cre-dev:main Jun 20, 2026
10 checks passed
@martinv13 martinv13 deleted the claude/mssql-bcp branch June 20, 2026 21:34