Add MSSQL bulk load via BCP #66
Merged
Overrides bulk_insert() in PostgreSQLDialect to stream an in-memory CSV payload to PostgreSQL using the native COPY protocol instead of SQLAlchemy executemany. Supports both psycopg2 (copy_expert) and psycopg3 (cursor.copy), falling back to the base-class implementation for other drivers. Adds integration tests parametrised over both drivers, and updates the PostgreSQL CI workflow to a matrix that runs once per driver. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
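A minimal sketch of the COPY path this commit describes, not the code from the PR. The helper names `rows_to_csv` and `copy_rows` are hypothetical, and detecting the driver by checking which method the raw cursor exposes is an assumption made for illustration; `copy_expert` (psycopg2) and `cursor.copy` (psycopg3) are the driver calls named in the commit message.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialise dict rows into an in-memory CSV buffer, rewound for reading."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # csv.writer emits None as an empty unquoted field, which COPY's csv
        # format reads back as NULL by default. Empty strings are written the
        # same way, so this sketch does not distinguish "" from NULL.
        writer.writerow([row.get(col) for col in columns])
    buf.seek(0)
    return buf


def copy_rows(cursor, table, columns, rows):
    """Stream rows through COPY using whichever API the raw cursor exposes."""
    payload = rows_to_csv(rows, columns)
    # Identifiers are interpolated unquoted here; real code should quote them.
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    if hasattr(cursor, "copy_expert"):
        # psycopg2: copy_expert reads from a file-like object.
        cursor.copy_expert(sql, payload)
    elif hasattr(cursor, "copy"):
        # psycopg3: cursor.copy() returns a context manager with write().
        with cursor.copy(sql) as copy:
            copy.write(payload.read())
    else:
        # Any other driver: the caller falls back to the base implementation.
        raise NotImplementedError("driver has no COPY support")
```

Building the payload in a separate function keeps the serialisation testable without a database connection.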
Adds a BCP-based bulk_insert() to MSSQLDialect. When the bcp binary is available on PATH and the connection uses SQL authentication, batches of 1000+ rows are loaded via a subprocess call to bcp with a temp file (tab-separated, UTF-8), falling back to fast_executemany for smaller batches or when bcp is unavailable. Pass use_bcp=False to DataModel to disable BCP entirely. Also threads use_bcp through DataModel.__init__() -> get_dialect() -> dialect constructor, adding **kwargs forwarding to DatabaseDialect so unknown options are silently ignored by other dialects. Updates the MSSQL CI workflow to install mssql-tools18 (provides bcp) and adds it to PATH for the test step. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
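The option threading and fallback rule described in this commit could look roughly like the sketch below. The class and function names mirror the ones in the commit message, but the bodies are assumptions rather than the project's code; `choose_loader` and `BCP_MIN_ROWS` are hypothetical names introduced only for illustration.

```python
import shutil

BCP_MIN_ROWS = 1000  # threshold taken from the commit message


class DatabaseDialect:
    def __init__(self, **kwargs):
        # Unknown options (such as use_bcp) are accepted and silently ignored.
        pass


class MSSQLDialect(DatabaseDialect):
    def __init__(self, use_bcp=True, **kwargs):
        super().__init__(**kwargs)
        # Look for the bcp binary on PATH once, unless BCP was disabled.
        self.bcp_path = shutil.which("bcp") if use_bcp else None

    def choose_loader(self, n_rows, has_credentials):
        """Return which insert path a batch of n_rows would take."""
        if self.bcp_path and has_credentials and n_rows >= BCP_MIN_ROWS:
            return "bcp"
        return "fast_executemany"


def get_dialect(name, **kwargs):
    """Forward all options to the dialect; only MSSQL consumes use_bcp."""
    dialect_cls = {"mssql": MSSQLDialect}.get(name, DatabaseDialect)
    return dialect_cls(**kwargs)
```

With this shape, `get_dialect("mssql", use_bcp=False)` always takes the `fast_executemany` path, and passing `use_bcp` to any other dialect is a no-op.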
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
The _insert_and_read helper orders by table.c.id; the numeric test table was missing that column. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Add -No flag to BCP command when TrustServerCertificate=yes is present in the connection URL query, allowing BCP to connect to servers with self-signed certificates (same trust level as pyodbc). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Use -u flag (trust server certificate) instead of -No which is not supported in mssql-tools18 BCP. Also remove duplicate workflow_dispatch trigger in python-package.yml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
BCP character mode cannot load hex-encoded data into varbinary columns (produces "Text column data incomplete" error). Detect LargeBinary columns and skip BCP for those tables. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
Instead of falling back to fast_executemany for binary columns, generate a non-XML BCP format file that specifies SQLVARBINARY with a 4-byte length prefix for LargeBinary columns and SQLCHAR for everything else. The data file is written in binary mode: raw bytes with prefix for binary fields, UTF-8 text for character fields. NULL binary is encoded as prefix value -1 (0xFFFFFFFF). This allows BCP to handle the record hash column (VARBINARY) present in almost all xml2db tables while keeping the fast character-mode path for all other column types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
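For illustration, the length-prefixed binary field described in this commit might be encoded as in the sketch below. The function name `encode_varbinary_field` is hypothetical, and the little-endian byte order of the prefix is an assumption; the commit message only states the 4-byte prefix and the -1 encoding for NULL.

```python
import struct


def encode_varbinary_field(value):
    """Encode one binary field as a 4-byte signed length prefix plus raw bytes."""
    if value is None:
        # NULL is signalled by a prefix of -1, i.e. the bytes FF FF FF FF.
        return struct.pack("<i", -1)
    # Assumption: the prefix is little-endian.
    return struct.pack("<i", len(value)) + bytes(value)
```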
BCP character mode accepts hex-encoded binary without the 0x prefix. Simplify _format_bcp_value to emit v.hex() for bytes values and revert to plain character mode (no format file needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
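A sketch of what a character-mode formatter along these lines could look like, following the value rules listed in the PR summary. This is not the project's `_format_bcp_value`; the names `format_bcp_value` and `format_bcp_row` are used here only for illustration, and newline handling inside strings is left out because the PR does not describe it.

```python
import datetime


def format_bcp_value(v):
    """Render one Python value as a field for a tab-separated BCP data file."""
    if v is None:
        return ""  # empty field; loaded as NULL when bcp runs with -k
    if isinstance(v, bool):
        # bool must be checked before any numeric handling: True -> "1".
        return "1" if v else "0"
    if isinstance(v, (bytes, bytearray)):
        return bytes(v).hex()  # hex digits without a 0x prefix
    if isinstance(v, str):
        return v.replace("\t", " ")  # a tab would split the field
    return str(v)  # datetime, int, float, Decimal, ...


def format_bcp_row(values):
    """Join formatted fields with tabs and terminate the row with a newline."""
    return "\t".join(format_bcp_value(v) for v in values) + "\n"
```

For example, `format_bcp_row([None, True, b"\x01\xff", "a\tb"])` yields an empty first field followed by `1`, `01ff` and `a b`.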
rowcount is not reliable for SELECT statements in DBAPI (drivers commonly report -1); use fetchall() instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
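The behaviour behind this fix can be reproduced with the standard library's `sqlite3` module standing in for the MSSQL driver, which is an assumption made so the sketch runs without a server. The helper name `count_selected_rows` is hypothetical.

```python
import sqlite3


def count_selected_rows(conn, sql):
    """Count rows returned by a SELECT by fetching them, not via rowcount."""
    cur = conn.execute(sql)
    # cur.rowcount is -1 at this point with sqlite3, so it cannot be used
    # to assert how many rows a SELECT produced.
    return len(cur.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
n_rows = count_selected_rows(conn, "SELECT id FROM t")
```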
Restores push+pull_request triggers on main for python-package, postgres, mysql, and duckdb workflows. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
When Trusted_Connection=yes is in the URL query, BCP is invoked with -T (trusted connection) instead of -U/-P, mirroring how the ODBC driver handles Kerberos auth on Linux. Falls back to fast_executemany only when both SQL credentials and Trusted_Connection are absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QvgSwLLGgKaDbZaEYYp7BJ
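Putting the authentication and certificate rules from these commits together, command construction might look like the following sketch. The function name `build_bcp_command` and the URL parsing with `urllib.parse` are assumptions, not the PR's code. The flags used are `-S`, `-d`, `-c`, `-k`, `-U`/`-P`, `-T` and `-u`; the last three come from the commit messages.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def build_bcp_command(bcp_path, db_url, table, data_file):
    """Return the bcp argument list, or None when no usable auth is found."""
    parts = urlsplit(db_url)
    # Compare query options case-insensitively.
    query = {k.lower(): v.lower() for k, v in parse_qsl(parts.query)}

    server = parts.hostname or ""
    if parts.port:
        server = f"{server},{parts.port}"  # SQL Server uses host,port

    cmd = [bcp_path, table, "in", data_file,
           "-S", server, "-d", parts.path.lstrip("/"),
           "-c",   # character mode
           "-k"]   # keep empty fields as NULL

    if query.get("trusted_connection") == "yes":
        cmd.append("-T")  # trusted connection instead of -U/-P
    elif parts.username and parts.password:
        cmd += ["-U", unquote(parts.username), "-P", unquote(parts.password)]
    else:
        return None  # caller falls back to fast_executemany

    if query.get("trustservercertificate") == "yes":
        cmd.append("-u")  # trust the server certificate
    return cmd
```

Returning `None` rather than raising keeps the fallback decision with the caller. Passing the password with `-P` exposes it in the process list, which is a trade-off of this sketch.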
Summary
- Adds a `bulk_insert()` override to `MSSQLDialect` that uses the `bcp` utility for large batches (≥ 1000 rows), falling back to `fast_executemany` for smaller batches or when `bcp` is not on PATH
- Detects `bcp` at `DataModel` creation time via `shutil.which()`; no configuration needed when available
- Pass `use_bcp=False` to `DataModel` to opt out entirely
- Threads `use_bcp` through `DataModel.__init__()` → `get_dialect()` → dialect constructor via `**kwargs`, so other dialects are unaffected

Value formatting (BCP character mode, tab-separated)

| Python type | BCP value |
| --- | --- |
| `None` | empty field; `-k` flag maps to NULL |
| `bool` | `0` / `1` |
| `bytes` | `{hex}` |
| `datetime` | `str(v)` |
| `str` | `str(v)`, tab→space |

Notes
- […] acceptable since insertions target ephemeral staging tables that are dropped after each load
- Updates the MSSQL CI workflow to install `mssql-tools18` (provides `bcp`) and expose it on PATH
Test plan
- `tests/test_bulk_insert_mssql.py`: 10 tests, with 4 covering the `fast_executemany` path (< 1000 rows: basic, numeric, boolean, scalar default) and 5 covering the BCP path (basic, numeric, boolean, binary, scalar default), plus 1 fallback test that forces `bcp_path = None` to verify graceful degradation
- […] `bcp` is not on PATH
- […] (`pytest -m "not dbtest"`)

🤖 Generated with Claude Code