diff --git a/README.md b/README.md index d344f0a..a18025e 100644 --- a/README.md +++ b/README.md @@ -83,12 +83,17 @@ LogLens also tracks parser coverage telemetry for unsupported or malformed lines - `parsed_lines` - `unparsed_lines` - `parse_success_rate` +- `failure_categories` - `top_unknown_patterns` Common unsupported-pattern buckets include `sshd_connection_closed_preauth`, `sshd_timeout_or_disconnection`, `sshd_negotiation_failure`, `pam_faillock_account_locked`, and `pam_unix_session_closed`. These buckets keep non-finding evidence reviewable without counting it as detector evidence. +Failure categories group unsupported lines into reviewer-facing parser boundary +classes: `unknown_timestamp`, `unknown_program`, +`known_program_unknown_message`, `malformed_source_ip`, and +`unsupported_pam_variant`. For rule-by-rule semantics and signal boundaries, see [`docs/rule-catalog.md`](./docs/rule-catalog.md). For a forensic-style evidence walkthrough, see [`docs/case-study-linux-auth-bruteforce.md`](./docs/case-study-linux-auth-bruteforce.md). For the parser behavior contract, supported modes, and fixture map, see [`docs/parser-contract.md`](./docs/parser-contract.md). For the deliberately noisy parser-coverage sample, see [`docs/parser-coverage-notes.md`](./docs/parser-coverage-notes.md). @@ -142,7 +147,7 @@ When you add `--csv`, LogLens also writes: The CSV schema is intentionally small and stable: - `findings.csv`: `rule`, `subject_kind`, `subject`, `event_count`, `window_start`, `window_end`, `usernames`, `summary` -- `warnings.csv`: `kind`, `line_number`, `message` +- `warnings.csv`: `kind`, `line_number`, `category`, `message` Without `--csv`, LogLens does not create, overwrite, or delete any existing CSV files in the output directory. 
diff --git a/docs/case-study-linux-auth-bruteforce.md b/docs/case-study-linux-auth-bruteforce.md index e53e24d..3daa8e8 100644 --- a/docs/case-study-linux-auth-bruteforce.md +++ b/docs/case-study-linux-auth-bruteforce.md @@ -110,10 +110,10 @@ The sudo finding is adjacent but separate. It is not joined to the SSH failure c The parser warnings are: -| Line | Unknown-pattern bucket | Evidence interpretation | -| ---: | --- | --- | -| 15 | `sshd_connection_closed_preauth` | preauth connection-close noise was observed but not promoted to a typed event | -| 16 | `sshd_timeout_or_disconnection` | timeout/disconnection noise was observed but not promoted to a typed event | +| Line | Failure category | Unknown-pattern bucket | Evidence interpretation | +| ---: | --- | --- | --- | +| 15 | `known_program_unknown_message` | `sshd_connection_closed_preauth` | preauth connection-close noise was observed but not promoted to a typed event | +| 16 | `known_program_unknown_message` | `sshd_timeout_or_disconnection` | timeout/disconnection noise was observed but not promoted to a typed event | These warnings are useful because they prevent silent overconfidence. A reviewer can see both the finding-producing evidence and the unsupported surrounding records. diff --git a/docs/parser-conformance-matrix.md b/docs/parser-conformance-matrix.md index ec32ecd..a827330 100644 --- a/docs/parser-conformance-matrix.md +++ b/docs/parser-conformance-matrix.md @@ -8,7 +8,7 @@ corpus. The parser contract is intentionally conservative: - recognized evidence emits a normalized `Event` -- unsupported evidence emits a parser warning and an unknown-pattern bucket +- unsupported evidence emits a parser warning, a failure category, and an unknown-pattern bucket - unsupported evidence does not become detector input ## Input Format Matrix @@ -57,23 +57,23 @@ event type in both formats. Unsupported buckets are warning labels, not normalized events. The expected normalized event is always `none`. 
-| Unsupported evidence | Input formats | Expected unsupported line bucket | Expected normalized event | -| --- | --- | --- | --- | -| `sshd` preauth connection closed or reset, including `Connection closed by ... [preauth]`, `Connection closed by authenticating user ... [preauth]`, and `Connection reset by ... [preauth]` | `syslog_legacy`, `journalctl_short_full` | `sshd_connection_closed_preauth` | none | -| `sshd` timeout, disconnection, or disconnect notice, including `Timeout, client not responding`, `Disconnected from ...`, and `Received disconnect ...` | `syslog_legacy`, `journalctl_short_full` | `sshd_timeout_or_disconnection` | none | -| `sshd` negotiation failure such as `Unable to negotiate with ...` | `syslog_legacy`, `journalctl_short_full` | `sshd_negotiation_failure` | none | -| Other well-formed but unsupported `sshd` messages | `syslog_legacy`, `journalctl_short_full` | `sshd_other` | none | -| `pam_unix(...:session)` session closed | `syslog_legacy`, `journalctl_short_full` | `pam_unix_session_closed` | none | -| Other unsupported `pam_unix(...)` messages | `syslog_legacy`, `journalctl_short_full` | `pam_unix_other` | none | -| `pam_faillock(...:auth)` account temporarily locked | `syslog_legacy`, `journalctl_short_full` | `pam_faillock_account_locked` | none | -| `pam_faillock(...:auth)` successful authentication telemetry | `syslog_legacy`, `journalctl_short_full` | `pam_faillock_authsucc` | none | -| Other unsupported `pam_faillock(...)` messages | `syslog_legacy`, `journalctl_short_full` | `pam_faillock_other` | none | -| `pam_sss(...:auth)` user not known to underlying authentication module | `syslog_legacy`, `journalctl_short_full` | `pam_sss_unknown_user` | none | -| `pam_sss(...:auth)` authentication service cannot retrieve authentication info | `syslog_legacy`, `journalctl_short_full` | `pam_sss_authinfo_unavail` | none | -| Other unsupported `pam_sss(...)` messages | `syslog_legacy`, `journalctl_short_full` | `pam_sss_other` | none | -| 
Well-formed `sudo` line that is not command, incorrect-password, or policy-denial evidence | `syslog_legacy`, `journalctl_short_full` | `sudo_other` | none | -| Well-formed `su` line that is not recognized as success or failure audit evidence | `syslog_legacy`, `journalctl_short_full` | `su_other` | none | -| Well-formed unsupported program tag | `syslog_legacy`, `journalctl_short_full` | `program_` | none | +| Unsupported evidence | Input formats | Failure category | Expected unsupported line bucket | Expected normalized event | +| --- | --- | --- | --- | --- | +| `sshd` preauth connection closed or reset, including `Connection closed by ... [preauth]`, `Connection closed by authenticating user ... [preauth]`, and `Connection reset by ... [preauth]` | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `sshd_connection_closed_preauth` | none | +| `sshd` timeout, disconnection, or disconnect notice, including `Timeout, client not responding`, `Disconnected from ...`, and `Received disconnect ...` | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `sshd_timeout_or_disconnection` | none | +| `sshd` negotiation failure such as `Unable to negotiate with ...` | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `sshd_negotiation_failure` | none | +| Other well-formed but unsupported `sshd` messages | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `sshd_other` | none | +| `pam_unix(...:session)` session closed | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_unix_session_closed` | none | +| Other unsupported `pam_unix(...)` messages | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_unix_other` | none | +| `pam_faillock(...:auth)` account temporarily locked | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_faillock_account_locked` | none | +| `pam_faillock(...:auth)` successful 
authentication telemetry | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_faillock_authsucc` | none | +| Other unsupported `pam_faillock(...)` messages | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_faillock_other` | none | +| `pam_sss(...:auth)` user not known to underlying authentication module | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_sss_unknown_user` | none | +| `pam_sss(...:auth)` authentication service cannot retrieve authentication info | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_sss_authinfo_unavail` | none | +| Other unsupported `pam_sss(...)` messages | `syslog_legacy`, `journalctl_short_full` | `unsupported_pam_variant` | `pam_sss_other` | none | +| Well-formed `sudo` line that is not command, incorrect-password, or policy-denial evidence | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `sudo_other` | none | +| Well-formed `su` line that is not recognized as success or failure audit evidence | `syslog_legacy`, `journalctl_short_full` | `known_program_unknown_message` | `su_other` | none | +| Well-formed unsupported program tag | `syslog_legacy`, `journalctl_short_full` | `unknown_program` | `program_` | none | ## Header And Structural Warning Matrix @@ -81,18 +81,19 @@ Structural failures do not reach the authentication message classifier. They still produce parser warnings and unknown-pattern buckets through the same coverage telemetry path. 
-| Failure class | Input formats | Expected bucket | Expected normalized event | -| --- | --- | --- | --- | -| Missing syslog assumed year | `syslog_legacy` | `syslog_legacy_mode_requires_assume_year` | none | -| Missing syslog header fields | `syslog_legacy` | `missing_syslog_header_fields` | none | -| Invalid syslog month token | `syslog_legacy` | `invalid_month_token` | none | -| Invalid syslog day token | `syslog_legacy` | `invalid_day_token` | none | -| Invalid time token | `syslog_legacy`, `journalctl_short_full` | `invalid_time_token` | none | -| Invalid calendar date | `syslog_legacy`, `journalctl_short_full` | `invalid_calendar_date` | none | -| Missing journalctl short-full header fields | `journalctl_short_full` | `missing_journalctl_short_full_header_fields` | none | -| Invalid journalctl date token | `journalctl_short_full` | `invalid_journalctl_date_token` | none | -| Invalid journalctl timezone token | `journalctl_short_full` | `invalid_timezone_token` | none | -| Missing program/message delimiter | `syslog_legacy`, `journalctl_short_full` | `missing_program_message_delimiter` | none | +| Failure class | Input formats | Failure category | Expected bucket | Expected normalized event | +| --- | --- | --- | --- | --- | +| Missing syslog assumed year | `syslog_legacy` | `unknown_timestamp` | `syslog_legacy_mode_requires_assume_year` | none | +| Missing syslog header fields | `syslog_legacy` | `unknown_timestamp` | `missing_syslog_header_fields` | none | +| Invalid syslog month token | `syslog_legacy` | `unknown_timestamp` | `invalid_month_token` | none | +| Invalid syslog day token | `syslog_legacy` | `unknown_timestamp` | `invalid_day_token` | none | +| Invalid time token | `syslog_legacy`, `journalctl_short_full` | `unknown_timestamp` | `invalid_time_token` | none | +| Invalid calendar date | `syslog_legacy`, `journalctl_short_full` | `unknown_timestamp` | `invalid_calendar_date` | none | +| Missing journalctl short-full header fields | 
`journalctl_short_full` | `unknown_timestamp` | `missing_journalctl_short_full_header_fields` | none | +| Invalid journalctl date token | `journalctl_short_full` | `unknown_timestamp` | `invalid_journalctl_date_token` | none | +| Invalid journalctl timezone token | `journalctl_short_full` | `unknown_timestamp` | `invalid_timezone_token` | none | +| Missing program/message delimiter | `syslog_legacy`, `journalctl_short_full` | `unknown_program` | `missing_program_message_delimiter` | none | +| Malformed source IP token | `syslog_legacy`, `journalctl_short_full` | `malformed_source_ip` | `malformed_source_ip` | none | ## Fixture Anchors @@ -113,4 +114,5 @@ these places: - normalized event expectation in `tests/test_parser.cpp` - supported fixture line under `assets/` - unsupported warning bucket expectation +- parser failure category expectation - report-contract fixture if the visible report shape changes diff --git a/docs/parser-contract.md b/docs/parser-contract.md index 2069a8a..0299d99 100644 --- a/docs/parser-contract.md +++ b/docs/parser-contract.md @@ -36,11 +36,20 @@ Recognized success or audit families include accepted password, accepted publick | --- | --- | --- | | Recognized auth line | Emits a typed `Event` with timestamp, hostname, program, optional pid, message, source IP, username, event type, and line number | Can contribute to summaries, reports, and configured detection signals | | Blank line | Skips the line and increments `skipped_blank_lines` | Does not become a warning or parsed event | -| Malformed header | Emits a parser warning with the original line number and structural reason | Counts toward `unparsed_lines` and `top_unknown_patterns` | -| Well-formed but unsupported auth pattern | Emits a parser warning with an unknown-pattern bucket | Stays visible as telemetry instead of being silently ignored | +| Malformed header | Emits a parser warning with the original line number, structural reason, and `unknown_timestamp` category | Counts 
toward `unparsed_lines`, `failure_categories`, and `top_unknown_patterns` | +| Well-formed but unsupported auth pattern | Emits a parser warning with a failure category and unknown-pattern bucket | Stays visible as telemetry instead of being silently ignored | This is the main trust boundary: unsupported input should remain inspectable, even when it does not produce a finding. +Parser failure categories are intentionally coarser than unknown-pattern +buckets: + +- `unknown_timestamp` +- `unknown_program` +- `known_program_unknown_message` +- `malformed_source_ip` +- `unsupported_pam_variant` + Stable unsupported-pattern buckets currently exercised by the fixture corpus include `sshd_connection_closed_preauth`, `sshd_timeout_or_disconnection`, `sshd_negotiation_failure`, `pam_faillock_account_locked`, and diff --git a/docs/parser-coverage-notes.md b/docs/parser-coverage-notes.md index 677ef29..6b4549b 100644 --- a/docs/parser-coverage-notes.md +++ b/docs/parser-coverage-notes.md @@ -20,10 +20,11 @@ The locked expected coverage summary lives in [`tests/fixtures/parser_matrix/noi - `parsed_lines`: 8 - `unparsed_lines`: 16 - `parse_success_rate`: 0.3333333333 +- `failure_categories`: coarse parser boundary categories for unsupported lines - `top_unknown_patterns`: the five most common unsupported-pattern buckets ## Reading the numbers -A low parse success rate is not automatically a bug for this fixture. The sample is deliberately noisy, and the useful property is that unsupported evidence remains explainable through `warnings` and `top_unknown_patterns`. +A low parse success rate is not automatically a bug for this fixture. The sample is deliberately noisy, and the useful property is that unsupported evidence remains explainable through `warnings`, `failure_categories`, and `top_unknown_patterns`. The matrix should stay defensive and public-safe: use documentation IP ranges, synthetic hostnames, and synthetic usernames only. 
diff --git a/docs/report-artifacts.md b/docs/report-artifacts.md index a9dd017..e2fbc4b 100644 --- a/docs/report-artifacts.md +++ b/docs/report-artifacts.md @@ -28,6 +28,7 @@ The JSON report keeps parser observability visible next to findings: - `parser_quality.parsed_lines` - `parser_quality.unparsed_lines` - `parser_quality.parse_success_rate` +- `parser_quality.failure_categories` - `parser_quality.top_unknown_patterns` - `parsed_event_count` - `warning_count` @@ -41,14 +42,21 @@ Finding objects contain `rule_id`, `rule`, `subject_kind`, `subject`, `grouping_ `evidence_event_ids` are deterministic local event identifiers derived from the source line number, formatted as `line:`. They let reviewers trace a finding back to the normalized input events that satisfied the rule window without implying global event identity. -Warning objects contain the original `line_number` and the parser `reason`. +Warning objects contain the original `line_number`, parser `category`, and parser `reason`. + +Parser failure categories are stable reviewer-facing buckets for unsupported +lines: `unknown_timestamp`, `unknown_program`, +`known_program_unknown_message`, `malformed_source_ip`, and +`unsupported_pam_variant`. They complement `top_unknown_patterns`: categories +explain the parser boundary class, while unknown-pattern buckets preserve the +more specific unsupported message shape. ## CSV Contract The optional CSV exports intentionally stay small: - `findings.csv`: `rule`, `subject_kind`, `subject`, `event_count`, `window_start`, `window_end`, `usernames`, `summary` -- `warnings.csv`: `kind`, `line_number`, `message` +- `warnings.csv`: `kind`, `line_number`, `category`, `message` Formula-like CSV text fields are neutralized with a leading single quote so spreadsheet tools treat them as text. 
diff --git a/docs/reviewer-path.md b/docs/reviewer-path.md index 65bfa74..1a15441 100644 --- a/docs/reviewer-path.md +++ b/docs/reviewer-path.md @@ -60,6 +60,7 @@ Look for parser coverage fields: - `parsed_lines` - `unparsed_lines` - `parse_success_rate` +- `failure_categories` - `top_unknown_patterns` Good stopping point: the reviewer can explain what LogLens parses, how rules count supported evidence, what the reports contain, and how unsupported lines remain visible without becoming findings. diff --git a/docs/rule-catalog.md b/docs/rule-catalog.md index de8dd30..2b42883 100644 --- a/docs/rule-catalog.md +++ b/docs/rule-catalog.md @@ -111,7 +111,7 @@ The finding is a triage signal. It is not a compromise verdict, attribution clai ### Why unsupported evidence is not counted -Unsupported lines are parser warnings, not `AuthSignal` records. They may appear in `top_unknown_patterns`, but they do not carry the `counts_as_terminal_auth_failure` flag required by this rule. +Unsupported lines are parser warnings, not `AuthSignal` records. They may appear in `failure_categories` and `top_unknown_patterns`, but they do not carry the `counts_as_terminal_auth_failure` flag required by this rule. This prevents unsupported preauth noise, malformed lines, and unmodeled auth-family messages from silently increasing brute-force counts. 
diff --git a/src/parser.cpp b/src/parser.cpp index 0eaeb95..6e7385f 100644 --- a/src/parser.cpp +++ b/src/parser.cpp @@ -21,6 +21,18 @@ struct ClockTime { int second = 0; }; +void set_failure(std::string* error, + ParserFailureCategory* category, + std::string reason, + ParserFailureCategory failure_category) { + if (error != nullptr) { + *error = std::move(reason); + } + if (category != nullptr) { + *category = failure_category; + } +} + std::string_view trim_left(std::string_view value) { while (!value.empty() && std::isspace(static_cast<unsigned char>(value.front())) != 0) { value.remove_prefix(1); @@ -61,6 +73,55 @@ bool parse_int(std::string_view token, int& value) { return result.ec == std::errc{} && result.ptr == end; } +bool is_valid_ipv4_token(std::string_view token) { + int parts = 0; + while (!token.empty()) { + const auto dot = token.find('.'); + const auto part = dot == std::string_view::npos ? token : token.substr(0, dot); + if (part.empty()) { + return false; + } + + int value = 0; + if (!parse_int(part, value) || value < 0 || value > 255) { + return false; + } + + ++parts; + if (dot == std::string_view::npos) { + token = {}; + } else { + token.remove_prefix(dot + 1); + } + } + + return parts == 4; +} + +bool is_valid_ipv6_like_token(std::string_view token) { + if (token.find(':') == std::string_view::npos) { + return false; + } + + bool saw_hex = false; + for (const char character : token) { + if (std::isxdigit(static_cast<unsigned char>(character)) != 0) { + saw_hex = true; + continue; + } + if (character == ':' || character == '.') { + continue; + } + return false; + } + + return saw_hex; +} + +bool is_valid_source_ip_token(std::string_view token) { + return is_valid_ipv4_token(token) || is_valid_ipv6_like_token(token); +} + bool parse_month(std::string_view token, unsigned& month_index) { static constexpr std::array months = { "Jan", "Feb", "Mar", "Apr", "May", "Jun", @@ -260,6 +321,65 @@ std::string extract_kv_value(std::string_view input, std::string_view key) { return {}; 
} +std::string extract_source_ip_after_from(std::string_view message) { + const auto marker_position = message.find(" from "); + if (marker_position == std::string_view::npos) { + return {}; + } + + auto remaining = message.substr(marker_position + std::string_view{" from "}.size()); + const auto first = consume_token(remaining); + if (first.empty()) { + return {}; + } + + if (first == "authenticating") { + const auto second = consume_token(remaining); + if (second == "user") { + static_cast<void>(consume_token(remaining)); + return std::string(consume_token(remaining)); + } + } + + if (first == "invalid" || first == "illegal") { + const auto second = consume_token(remaining); + if (second == "user") { + static_cast<void>(consume_token(remaining)); + return std::string(consume_token(remaining)); + } + } + + if (first == "user") { + static_cast<void>(consume_token(remaining)); + return std::string(consume_token(remaining)); + } + + return std::string(first); +} + +std::string extract_source_ip_candidate(const Event& event) { + auto candidate = extract_source_ip_after_from(event.message); + if (!candidate.empty()) { + return candidate; + } + + candidate = extract_kv_value(event.message, "rhost="); + if (!candidate.empty()) { + return candidate; + } + + if (event.program == "sshd" && event.message.starts_with("Unable to negotiate with ")) { + candidate = extract_token_after(event.message, " with "); + } + + return candidate; +} + +bool has_malformed_source_ip(const Event& event) { + const auto candidate = extract_source_ip_candidate(event); + return !candidate.empty() && !is_valid_source_ip_token(candidate); +} + std::string sanitize_pattern_label(std::string_view value) { std::string normalized; normalized.reserve(value.size()); @@ -772,6 +892,29 @@ std::string classify_unknown_auth_pattern(const Event& event) { return "program_" + sanitize_pattern_label(event.program); } +bool is_pam_program(std::string_view program) { + return program.starts_with("pam_unix(") + || 
program.starts_with("pam_faillock(") + || program.starts_with("pam_sss("); +} + +bool is_known_auth_program(std::string_view program) { + return program == "sshd" + || program == "sudo" + || program == "su" + || is_pam_program(program); +} + +ParserFailureCategory failure_category_for_unrecognized_event(const Event& event) { + if (is_pam_program(event.program)) { + return ParserFailureCategory::UnsupportedPamVariant; + } + if (is_known_auth_program(event.program)) { + return ParserFailureCategory::KnownProgramUnknownMessage; + } + return ParserFailureCategory::UnknownProgram; +} + bool classify_event(Event& event) { const auto message = std::string_view{event.message}; if (event.program == "sshd") { @@ -855,11 +998,14 @@ std::string extract_unknown_pattern_key(std::string_view error) { std::optional<Event> parse_syslog_legacy_line(const ParserConfig& config, std::string_view line, std::size_t line_number, - std::string* error) { + std::string* error, + ParserFailureCategory* category) { if (!config.assumed_year.has_value()) { - if (error != nullptr) { - *error = "syslog_legacy mode requires assume_year"; - } + set_failure( + error, + category, + "syslog_legacy mode requires assume_year", + ParserFailureCategory::UnknownTimestamp); return std::nullopt; } @@ -870,9 +1016,7 @@ std::optional<Event> parse_syslog_legacy_line(const ParserConfig& config, const auto hostname_token = consume_token(remaining); if (month_token.empty() || day_token.empty() || time_token.empty() || hostname_token.empty()) { - if (error != nullptr) { - *error = "missing syslog header fields"; - } + set_failure(error, category, "missing syslog header fields", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } @@ -881,31 +1025,23 @@ std::optional<Event> parse_syslog_legacy_line(const ParserConfig& config, ClockTime time; if (!parse_month(month_token, month_index)) { - if (error != nullptr) { - *error = "invalid month token"; - } + set_failure(error, category, "invalid month token", 
ParserFailureCategory::UnknownTimestamp); return std::nullopt; } if (!parse_int(day_token, day_value)) { - if (error != nullptr) { - *error = "invalid day token"; - } + set_failure(error, category, "invalid day token", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } if (!parse_clock_token(time_token, time)) { - if (error != nullptr) { - *error = "invalid time token"; - } + set_failure(error, category, "invalid time token", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } const auto timestamp = build_timestamp(*config.assumed_year, month_index, day_value, time); if (!timestamp.has_value()) { - if (error != nullptr) { - *error = "invalid calendar date"; - } + set_failure(error, category, "invalid calendar date", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } @@ -915,13 +1051,23 @@ std::optional<Event> parse_syslog_legacy_line(const ParserConfig& config, event.line_number = line_number; if (!parse_program_and_message(remaining, event, error)) { + if (category != nullptr) { + *category = ParserFailureCategory::UnknownProgram; + } + return std::nullopt; + } + + if (has_malformed_source_ip(event)) { + set_failure(error, category, "malformed source IP", ParserFailureCategory::MalformedSourceIp); return std::nullopt; } if (!classify_event(event)) { - if (error != nullptr) { - *error = "unrecognized auth pattern: " + classify_unknown_auth_pattern(event); - } + set_failure( + error, + category, + "unrecognized auth pattern: " + classify_unknown_auth_pattern(event), + failure_category_for_unrecognized_event(event)); return std::nullopt; } @@ -930,7 +1076,8 @@ std::optional<Event> parse_syslog_legacy_line(const ParserConfig& config, std::optional<Event> parse_journalctl_short_full_line(std::string_view line, std::size_t line_number, - std::string* error) { + std::string* error, + ParserFailureCategory* category) { auto remaining = line; const auto weekday_token = consume_token(remaining); const auto date_token = consume_token(remaining); @@ -940,9 
+1087,11 @@ std::optional<Event> parse_journalctl_short_full_line(std::string_view line, if (weekday_token.empty() || date_token.empty() || time_token.empty() || timezone_token.empty() || hostname_token.empty()) { - if (error != nullptr) { - *error = "missing journalctl short-full header fields"; - } + set_failure( + error, + category, + "missing journalctl short-full header fields", + ParserFailureCategory::UnknownTimestamp); return std::nullopt; } @@ -953,31 +1102,23 @@ std::optional<Event> parse_journalctl_short_full_line(std::string_view line, std::chrono::minutes timezone_offset{0}; if (!parse_calendar_date_parts(date_token, year_value, month_index, day_value)) { - if (error != nullptr) { - *error = "invalid journalctl date token"; - } + set_failure(error, category, "invalid journalctl date token", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } if (!parse_clock_token(time_token, time)) { - if (error != nullptr) { - *error = "invalid time token"; - } + set_failure(error, category, "invalid time token", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } if (!parse_timezone_token(timezone_token, timezone_offset)) { - if (error != nullptr) { - *error = "invalid timezone token"; - } + set_failure(error, category, "invalid timezone token", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } const auto timestamp = build_timestamp(year_value, month_index, day_value, time, timezone_offset); if (!timestamp.has_value()) { - if (error != nullptr) { - *error = "invalid calendar date"; - } + set_failure(error, category, "invalid calendar date", ParserFailureCategory::UnknownTimestamp); return std::nullopt; } @@ -987,13 +1128,23 @@ std::optional<Event> parse_journalctl_short_full_line(std::string_view line, event.line_number = line_number; if (!parse_program_and_message(remaining, event, error)) { + if (category != nullptr) { + *category = ParserFailureCategory::UnknownProgram; + } + return std::nullopt; + } + + if (has_malformed_source_ip(event)) { + 
set_failure(error, category, "malformed source IP", ParserFailureCategory::MalformedSourceIp); return std::nullopt; } if (!classify_event(event)) { - if (error != nullptr) { - *error = "unrecognized auth pattern: " + classify_unknown_auth_pattern(event); - } + set_failure( + error, + category, + "unrecognized auth pattern: " + classify_unknown_auth_pattern(event), + failure_category_for_unrecognized_event(event)); return std::nullopt; } @@ -1026,25 +1177,43 @@ std::optional<InputMode> parse_input_mode(std::string_view value) { return std::nullopt; } +std::string to_string(ParserFailureCategory category) { + switch (category) { + case ParserFailureCategory::UnknownTimestamp: + return "unknown_timestamp"; + case ParserFailureCategory::UnknownProgram: + return "unknown_program"; + case ParserFailureCategory::KnownProgramUnknownMessage: + return "known_program_unknown_message"; + case ParserFailureCategory::MalformedSourceIp: + return "malformed_source_ip"; + case ParserFailureCategory::UnsupportedPamVariant: + default: + return "unsupported_pam_variant"; + } +} + AuthLogParser::AuthLogParser(ParserConfig config) : config_(config) {} std::optional<Event> AuthLogParser::parse_line(std::string_view line, std::size_t line_number, - std::string* error) const { + std::string* error, + ParserFailureCategory* category) const { if (error != nullptr) { error->clear(); } + if (category != nullptr) { + *category = ParserFailureCategory::KnownProgramUnknownMessage; + } switch (config_.input_mode) { case InputMode::SyslogLegacy: - return parse_syslog_legacy_line(config_, line, line_number, error); + return parse_syslog_legacy_line(config_, line, line_number, error, category); case InputMode::JournalctlShortFull: - return parse_journalctl_short_full_line(line, line_number, error); + return parse_journalctl_short_full_line(line, line_number, error, category); default: - if (error != nullptr) { - *error = "unsupported input mode"; - } + set_failure(error, category, "unsupported input mode", 
ParserFailureCategory::UnknownProgram); return std::nullopt; } } @@ -1057,6 +1226,7 @@ ParseReport AuthLogParser::parse_stream(std::istream& input) const { result.metadata.assume_year = config_.assumed_year; } std::unordered_map<std::string, std::size_t> unknown_pattern_counts; + std::unordered_map<std::string, std::pair<ParserFailureCategory, std::size_t>> failure_category_counts; std::string line; std::size_t line_number = 0; @@ -1071,16 +1241,21 @@ ParseReport AuthLogParser::parse_stream(std::istream& input) const { ++result.quality.total_lines; std::string error; - auto event = parse_line(line, line_number, &error); + ParserFailureCategory category = ParserFailureCategory::KnownProgramUnknownMessage; + auto event = parse_line(line, line_number, &error, &category); if (event.has_value()) { result.events.push_back(std::move(*event)); ++result.quality.parsed_lines; continue; } - result.warnings.push_back(ParseWarning{line_number, error.empty() ? "unrecognized line" : error}); + const auto reason = error.empty() ? "unrecognized line" : error; + result.warnings.push_back(ParseWarning{line_number, reason, category}); ++result.quality.unparsed_lines; - ++unknown_pattern_counts[extract_unknown_pattern_key(error.empty() ? 
"unrecognized line" : error)]; + ++unknown_pattern_counts[extract_unknown_pattern_key(reason)]; + auto& category_count = failure_category_counts[to_string(category)]; + category_count.first = category; + ++category_count.second; } if (result.quality.total_lines != 0) { @@ -1105,6 +1280,20 @@ ParseReport AuthLogParser::parse_stream(std::istream& input) const { result.quality.top_unknown_patterns.resize(5); } + result.quality.failure_categories.reserve(failure_category_counts.size()); + for (const auto& [_, entry] : failure_category_counts) { + result.quality.failure_categories.push_back(ParserFailureCategoryCount{entry.first, entry.second}); + } + + std::sort(result.quality.failure_categories.begin(), + result.quality.failure_categories.end(), + [](const ParserFailureCategoryCount& left, const ParserFailureCategoryCount& right) { + if (left.count != right.count) { + return left.count > right.count; + } + return to_string(left.category) < to_string(right.category); + }); + return result; } diff --git a/src/parser.hpp b/src/parser.hpp index 0302040..e545f9e 100644 --- a/src/parser.hpp +++ b/src/parser.hpp @@ -19,6 +19,16 @@ enum class InputMode { std::string to_string(InputMode mode); std::optional<InputMode> parse_input_mode(std::string_view value); +enum class ParserFailureCategory { + UnknownTimestamp, + UnknownProgram, + KnownProgramUnknownMessage, + MalformedSourceIp, + UnsupportedPamVariant +}; + +std::string to_string(ParserFailureCategory category); + struct ParserConfig { InputMode input_mode = InputMode::SyslogLegacy; std::optional assumed_year; @@ -27,6 +37,7 @@ struct ParserConfig { struct ParseWarning { std::size_t line_number = 0; std::string reason; + ParserFailureCategory category = ParserFailureCategory::KnownProgramUnknownMessage; }; struct ParseMetadata { @@ -40,12 +51,18 @@ struct UnknownPatternCount { std::size_t count = 0; }; +struct ParserFailureCategoryCount { + ParserFailureCategory category = ParserFailureCategory::KnownProgramUnknownMessage; + 
std::size_t count = 0; +}; + struct ParserQualityMetrics { std::size_t total_lines = 0; std::size_t skipped_blank_lines = 0; std::size_t parsed_lines = 0; std::size_t unparsed_lines = 0; double parse_success_rate = 0.0; + std::vector<ParserFailureCategoryCount> failure_categories; std::vector<UnknownPatternCount> top_unknown_patterns; }; @@ -62,7 +79,8 @@ class AuthLogParser { std::optional<Event> parse_line(std::string_view line, std::size_t line_number, - std::string* error = nullptr) const; + std::string* error = nullptr, + ParserFailureCategory* category = nullptr) const; ParseReport parse_stream(std::istream& input) const; ParseReport parse_file(const std::filesystem::path& path) const; diff --git a/src/report.cpp b/src/report.cpp index 73415a7..5d38cbb 100644 --- a/src/report.cpp +++ b/src/report.cpp @@ -600,14 +600,27 @@ std::string render_markdown_report(const ReportData& data) { output << '\n'; } + if (data.parser_quality.failure_categories.empty()) { + output << "No parser failure categories were recorded.\n\n"; + } else { + output << "| Failure Category | Count |\n"; + output << "| --- | ---: |\n"; + for (const auto& entry : data.parser_quality.failure_categories) { + output << "| " << escape_markdown_table_cell(to_string(entry.category)) + << " | " << entry.count << " |\n"; + } + output << '\n'; + } + output << "## Parser Warnings\n\n"; if (warnings.empty()) { output << "No malformed lines were skipped.\n"; } else { - output << "| Line | Reason |\n"; - output << "| ---: | --- |\n"; + output << "| Line | Category | Reason |\n"; + output << "| ---: | --- | --- |\n"; for (const auto& warning : warnings) { output << "| " << warning.line_number << " | " + << escape_markdown_table_cell(to_string(warning.category)) << " | " << escape_markdown_table_cell(warning.reason) << " |\n"; } } @@ -643,6 +656,13 @@ std::string render_json_report(const ReportData& data) { output << " {\"pattern\": \"" << escape_json(entry.pattern) << "\", \"count\": " << entry.count << "}"; output << (index + 1 == 
data.parser_quality.top_unknown_patterns.size() ? "\n" : ",\n"); } + output << " ],\n"; + output << " \"failure_categories\": [\n"; + for (std::size_t index = 0; index < data.parser_quality.failure_categories.size(); ++index) { + const auto& entry = data.parser_quality.failure_categories[index]; + output << " {\"category\": \"" << to_string(entry.category) << "\", \"count\": " << entry.count << "}"; + output << (index + 1 == data.parser_quality.failure_categories.size() ? "\n" : ",\n"); + } output << " ]\n"; output << " },\n"; output << " \"parsed_event_count\": " << data.events.size() << ",\n"; @@ -708,6 +728,7 @@ std::string render_json_report(const ReportData& data) { for (std::size_t index = 0; index < warnings.size(); ++index) { const auto& warning = warnings[index]; output << " {\"line_number\": " << warning.line_number + << ", \"category\": \"" << to_string(warning.category) << "\"" << ", \"reason\": \"" << escape_json(warning.reason) << "\"}"; output << (index + 1 == warnings.size() ? 
"\n" : ",\n"); } @@ -739,10 +760,11 @@ std::string render_warnings_csv(const ReportData& data) { std::ostringstream output; const auto warnings = sorted_warnings(data.warnings); - output << "kind,line_number,message\n"; + output << "kind,line_number,category,message\n"; for (const auto& warning : warnings) { output << "parse_warning," << warning.line_number << ',' + << escape_csv(to_string(warning.category)) << ',' << escape_csv(warning.reason) << '\n'; } diff --git a/tests/fixtures/parser_matrix/noisy_auth_expected.json b/tests/fixtures/parser_matrix/noisy_auth_expected.json index b08344c..ec40044 100644 --- a/tests/fixtures/parser_matrix/noisy_auth_expected.json +++ b/tests/fixtures/parser_matrix/noisy_auth_expected.json @@ -17,22 +17,28 @@ {"pattern": "sshd_negotiation_failure", "count": 2}, {"pattern": "sshd_timeout_or_disconnection", "count": 2} ], + "failure_categories": [ + {"category": "known_program_unknown_message", "count": 7}, + {"category": "unsupported_pam_variant", "count": 5}, + {"category": "unknown_timestamp", "count": 3}, + {"category": "unknown_program", "count": 1} + ], "warnings": [ - {"line_number": 1, "reason": "invalid time token"}, - {"line_number": 2, "reason": "invalid calendar date"}, - {"line_number": 13, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 14, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 15, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, - {"line_number": 16, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, - {"line_number": 17, "reason": "unrecognized auth pattern: sshd_negotiation_failure"}, - {"line_number": 18, "reason": "unrecognized auth pattern: sshd_negotiation_failure"}, - {"line_number": 19, "reason": "unrecognized auth pattern: pam_unix_session_closed"}, - {"line_number": 20, "reason": "unrecognized auth pattern: pam_unix_session_closed"}, - {"line_number": 21, "reason": 
"unrecognized auth pattern: pam_faillock_account_locked"}, - {"line_number": 22, "reason": "unrecognized auth pattern: pam_faillock_account_locked"}, - {"line_number": 23, "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, - {"line_number": 24, "reason": "unrecognized auth pattern: sudo_other"}, - {"line_number": 26, "reason": "missing syslog header fields"}, - {"line_number": 27, "reason": "unrecognized auth pattern: program_cron"} + {"line_number": 1, "category": "unknown_timestamp", "reason": "invalid time token"}, + {"line_number": 2, "category": "unknown_timestamp", "reason": "invalid calendar date"}, + {"line_number": 13, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 14, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 15, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, + {"line_number": 16, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, + {"line_number": 17, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_negotiation_failure"}, + {"line_number": 18, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_negotiation_failure"}, + {"line_number": 19, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_unix_session_closed"}, + {"line_number": 20, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_unix_session_closed"}, + {"line_number": 21, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_faillock_account_locked"}, + {"line_number": 22, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_faillock_account_locked"}, + {"line_number": 23, "category": 
"unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, + {"line_number": 24, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sudo_other"}, + {"line_number": 26, "category": "unknown_timestamp", "reason": "missing syslog header fields"}, + {"line_number": 27, "category": "unknown_program", "reason": "unrecognized auth pattern: program_cron"} ] } diff --git a/tests/fixtures/report_contracts/journalctl_short_full/report.json b/tests/fixtures/report_contracts/journalctl_short_full/report.json index 696a2d2..43a3d75 100644 --- a/tests/fixtures/report_contracts/journalctl_short_full/report.json +++ b/tests/fixtures/report_contracts/journalctl_short_full/report.json @@ -13,6 +13,9 @@ "top_unknown_patterns": [ {"pattern": "sshd_connection_closed_preauth", "count": 1}, {"pattern": "sshd_timeout_or_disconnection", "count": 1} + ], + "failure_categories": [ + {"category": "known_program_unknown_message", "count": 2} ] }, "parsed_event_count": 14, @@ -75,7 +78,7 @@ } ], "warnings": [ - {"line_number": 15, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 16, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"} + {"line_number": 15, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 16, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"} ] } diff --git a/tests/fixtures/report_contracts/journalctl_short_full/report.md b/tests/fixtures/report_contracts/journalctl_short_full/report.md index 2c94fc5..e424428 100644 --- a/tests/fixtures/report_contracts/journalctl_short_full/report.md +++ b/tests/fixtures/report_contracts/journalctl_short_full/report.md @@ -42,9 +42,13 @@ | sshd_connection_closed_preauth | 1 | | sshd_timeout_or_disconnection | 1 | +| Failure Category | Count | +| --- | ---: | +| 
known_program_unknown_message | 2 | + ## Parser Warnings -| Line | Reason | -| ---: | --- | -| 15 | unrecognized auth pattern: sshd_connection_closed_preauth | -| 16 | unrecognized auth pattern: sshd_timeout_or_disconnection | +| Line | Category | Reason | +| ---: | --- | --- | +| 15 | known_program_unknown_message | unrecognized auth pattern: sshd_connection_closed_preauth | +| 16 | known_program_unknown_message | unrecognized auth pattern: sshd_timeout_or_disconnection | diff --git a/tests/fixtures/report_contracts/journalctl_short_full/warnings.csv b/tests/fixtures/report_contracts/journalctl_short_full/warnings.csv index 1459da3..89909b0 100644 --- a/tests/fixtures/report_contracts/journalctl_short_full/warnings.csv +++ b/tests/fixtures/report_contracts/journalctl_short_full/warnings.csv @@ -1,3 +1,3 @@ -kind,line_number,message -parse_warning,15,unrecognized auth pattern: sshd_connection_closed_preauth -parse_warning,16,unrecognized auth pattern: sshd_timeout_or_disconnection +kind,line_number,category,message +parse_warning,15,known_program_unknown_message,unrecognized auth pattern: sshd_connection_closed_preauth +parse_warning,16,known_program_unknown_message,unrecognized auth pattern: sshd_timeout_or_disconnection diff --git a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json index 890a43d..c9581f0 100644 --- a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json +++ b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.json @@ -16,6 +16,10 @@ {"pattern": "sshd_connection_closed_preauth", "count": 1}, {"pattern": "sshd_negotiation_failure", "count": 1}, {"pattern": "sshd_timeout_or_disconnection", "count": 1} + ], + "failure_categories": [ + {"category": "known_program_unknown_message", "count": 3}, + {"category": "unsupported_pam_variant", "count": 2} ] }, "parsed_event_count": 12, @@ -102,10 +106,10 @@ } ], 
"warnings": [ - {"line_number": 12, "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, - {"line_number": 14, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 15, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, - {"line_number": 16, "reason": "unrecognized auth pattern: pam_unix_session_closed"}, - {"line_number": 17, "reason": "unrecognized auth pattern: sshd_negotiation_failure"} + {"line_number": 12, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, + {"line_number": 14, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 15, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, + {"line_number": 16, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_unix_session_closed"}, + {"line_number": 17, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_negotiation_failure"} ] } diff --git a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.md b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.md index 675720b..79fbdfb 100644 --- a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.md +++ b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/report.md @@ -51,12 +51,17 @@ | sshd_negotiation_failure | 1 | | sshd_timeout_or_disconnection | 1 | +| Failure Category | Count | +| --- | ---: | +| known_program_unknown_message | 3 | +| unsupported_pam_variant | 2 | + ## Parser Warnings -| Line | Reason | -| ---: | --- | -| 12 | unrecognized auth pattern: pam_sss_unknown_user | -| 14 | unrecognized auth pattern: sshd_connection_closed_preauth | -| 15 | unrecognized auth pattern: sshd_timeout_or_disconnection | -| 16 | unrecognized auth pattern: pam_unix_session_closed | -| 
17 | unrecognized auth pattern: sshd_negotiation_failure | +| Line | Category | Reason | +| ---: | --- | --- | +| 12 | unsupported_pam_variant | unrecognized auth pattern: pam_sss_unknown_user | +| 14 | known_program_unknown_message | unrecognized auth pattern: sshd_connection_closed_preauth | +| 15 | known_program_unknown_message | unrecognized auth pattern: sshd_timeout_or_disconnection | +| 16 | unsupported_pam_variant | unrecognized auth pattern: pam_unix_session_closed | +| 17 | known_program_unknown_message | unrecognized auth pattern: sshd_negotiation_failure | diff --git a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/warnings.csv b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/warnings.csv index b2bdc44..08c750a 100644 --- a/tests/fixtures/report_contracts/multi_host_journalctl_short_full/warnings.csv +++ b/tests/fixtures/report_contracts/multi_host_journalctl_short_full/warnings.csv @@ -1,6 +1,6 @@ -kind,line_number,message -parse_warning,12,unrecognized auth pattern: pam_sss_unknown_user -parse_warning,14,unrecognized auth pattern: sshd_connection_closed_preauth -parse_warning,15,unrecognized auth pattern: sshd_timeout_or_disconnection -parse_warning,16,unrecognized auth pattern: pam_unix_session_closed -parse_warning,17,unrecognized auth pattern: sshd_negotiation_failure +kind,line_number,category,message +parse_warning,12,unsupported_pam_variant,unrecognized auth pattern: pam_sss_unknown_user +parse_warning,14,known_program_unknown_message,unrecognized auth pattern: sshd_connection_closed_preauth +parse_warning,15,known_program_unknown_message,unrecognized auth pattern: sshd_timeout_or_disconnection +parse_warning,16,unsupported_pam_variant,unrecognized auth pattern: pam_unix_session_closed +parse_warning,17,known_program_unknown_message,unrecognized auth pattern: sshd_negotiation_failure diff --git a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json 
b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json index 91c7ec4..13cc567 100644 --- a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json +++ b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.json @@ -17,6 +17,10 @@ {"pattern": "sshd_connection_closed_preauth", "count": 1}, {"pattern": "sshd_negotiation_failure", "count": 1}, {"pattern": "sshd_timeout_or_disconnection", "count": 1} + ], + "failure_categories": [ + {"category": "known_program_unknown_message", "count": 3}, + {"category": "unsupported_pam_variant", "count": 2} ] }, "parsed_event_count": 12, @@ -103,10 +107,10 @@ } ], "warnings": [ - {"line_number": 12, "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, - {"line_number": 14, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 15, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, - {"line_number": 16, "reason": "unrecognized auth pattern: pam_unix_session_closed"}, - {"line_number": 17, "reason": "unrecognized auth pattern: sshd_negotiation_failure"} + {"line_number": 12, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_sss_unknown_user"}, + {"line_number": 14, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 15, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"}, + {"line_number": 16, "category": "unsupported_pam_variant", "reason": "unrecognized auth pattern: pam_unix_session_closed"}, + {"line_number": 17, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_negotiation_failure"} ] } diff --git a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.md b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.md index 9a9569b..f0e645b 100644 --- 
a/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.md +++ b/tests/fixtures/report_contracts/multi_host_syslog_legacy/report.md @@ -52,12 +52,17 @@ | sshd_negotiation_failure | 1 | | sshd_timeout_or_disconnection | 1 | +| Failure Category | Count | +| --- | ---: | +| known_program_unknown_message | 3 | +| unsupported_pam_variant | 2 | + ## Parser Warnings -| Line | Reason | -| ---: | --- | -| 12 | unrecognized auth pattern: pam_sss_unknown_user | -| 14 | unrecognized auth pattern: sshd_connection_closed_preauth | -| 15 | unrecognized auth pattern: sshd_timeout_or_disconnection | -| 16 | unrecognized auth pattern: pam_unix_session_closed | -| 17 | unrecognized auth pattern: sshd_negotiation_failure | +| Line | Category | Reason | +| ---: | --- | --- | +| 12 | unsupported_pam_variant | unrecognized auth pattern: pam_sss_unknown_user | +| 14 | known_program_unknown_message | unrecognized auth pattern: sshd_connection_closed_preauth | +| 15 | known_program_unknown_message | unrecognized auth pattern: sshd_timeout_or_disconnection | +| 16 | unsupported_pam_variant | unrecognized auth pattern: pam_unix_session_closed | +| 17 | known_program_unknown_message | unrecognized auth pattern: sshd_negotiation_failure | diff --git a/tests/fixtures/report_contracts/multi_host_syslog_legacy/warnings.csv b/tests/fixtures/report_contracts/multi_host_syslog_legacy/warnings.csv index b2bdc44..08c750a 100644 --- a/tests/fixtures/report_contracts/multi_host_syslog_legacy/warnings.csv +++ b/tests/fixtures/report_contracts/multi_host_syslog_legacy/warnings.csv @@ -1,6 +1,6 @@ -kind,line_number,message -parse_warning,12,unrecognized auth pattern: pam_sss_unknown_user -parse_warning,14,unrecognized auth pattern: sshd_connection_closed_preauth -parse_warning,15,unrecognized auth pattern: sshd_timeout_or_disconnection -parse_warning,16,unrecognized auth pattern: pam_unix_session_closed -parse_warning,17,unrecognized auth pattern: sshd_negotiation_failure 
+kind,line_number,category,message +parse_warning,12,unsupported_pam_variant,unrecognized auth pattern: pam_sss_unknown_user +parse_warning,14,known_program_unknown_message,unrecognized auth pattern: sshd_connection_closed_preauth +parse_warning,15,known_program_unknown_message,unrecognized auth pattern: sshd_timeout_or_disconnection +parse_warning,16,unsupported_pam_variant,unrecognized auth pattern: pam_unix_session_closed +parse_warning,17,known_program_unknown_message,unrecognized auth pattern: sshd_negotiation_failure diff --git a/tests/fixtures/report_contracts/syslog_legacy/report.json b/tests/fixtures/report_contracts/syslog_legacy/report.json index 5377ef6..c041831 100644 --- a/tests/fixtures/report_contracts/syslog_legacy/report.json +++ b/tests/fixtures/report_contracts/syslog_legacy/report.json @@ -14,6 +14,9 @@ "top_unknown_patterns": [ {"pattern": "sshd_connection_closed_preauth", "count": 1}, {"pattern": "sshd_timeout_or_disconnection", "count": 1} + ], + "failure_categories": [ + {"category": "known_program_unknown_message", "count": 2} ] }, "parsed_event_count": 14, @@ -76,7 +79,7 @@ } ], "warnings": [ - {"line_number": 15, "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, - {"line_number": 16, "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"} + {"line_number": 15, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_connection_closed_preauth"}, + {"line_number": 16, "category": "known_program_unknown_message", "reason": "unrecognized auth pattern: sshd_timeout_or_disconnection"} ] } diff --git a/tests/fixtures/report_contracts/syslog_legacy/report.md b/tests/fixtures/report_contracts/syslog_legacy/report.md index e6b7410..b8b2bf9 100644 --- a/tests/fixtures/report_contracts/syslog_legacy/report.md +++ b/tests/fixtures/report_contracts/syslog_legacy/report.md @@ -43,9 +43,13 @@ | sshd_connection_closed_preauth | 1 | | sshd_timeout_or_disconnection | 1 | +| Failure Category 
| Count | +| --- | ---: | +| known_program_unknown_message | 2 | + ## Parser Warnings -| Line | Reason | -| ---: | --- | -| 15 | unrecognized auth pattern: sshd_connection_closed_preauth | -| 16 | unrecognized auth pattern: sshd_timeout_or_disconnection | +| Line | Category | Reason | +| ---: | --- | --- | +| 15 | known_program_unknown_message | unrecognized auth pattern: sshd_connection_closed_preauth | +| 16 | known_program_unknown_message | unrecognized auth pattern: sshd_timeout_or_disconnection | diff --git a/tests/fixtures/report_contracts/syslog_legacy/warnings.csv b/tests/fixtures/report_contracts/syslog_legacy/warnings.csv index 1459da3..89909b0 100644 --- a/tests/fixtures/report_contracts/syslog_legacy/warnings.csv +++ b/tests/fixtures/report_contracts/syslog_legacy/warnings.csv @@ -1,3 +1,3 @@ -kind,line_number,message -parse_warning,15,unrecognized auth pattern: sshd_connection_closed_preauth -parse_warning,16,unrecognized auth pattern: sshd_timeout_or_disconnection +kind,line_number,category,message +parse_warning,15,known_program_unknown_message,unrecognized auth pattern: sshd_connection_closed_preauth +parse_warning,16,known_program_unknown_message,unrecognized auth pattern: sshd_timeout_or_disconnection diff --git a/tests/test_cli.cpp b/tests/test_cli.cpp index 2921d95..6fc4292 100644 --- a/tests/test_cli.cpp +++ b/tests/test_cli.cpp @@ -189,8 +189,8 @@ int main(int argc, char* argv[]) { expect(findings_csv.find("brute_force,source_ip,203.0.113.10,5,2026-03-10 08:11:22,2026-03-10 08:18:05,,5 failed SSH attempts from 203.0.113.10 within 10 minutes.") != std::string::npos, "expected brute-force findings csv row"); - expect(warnings_csv.find("kind,line_number,message") == 0, "expected warnings csv header"); - expect(warnings_csv.find("parse_warning,15,unrecognized auth pattern: sshd_connection_closed_preauth") + expect(warnings_csv.find("kind,line_number,category,message") == 0, "expected warnings csv header"); + 
expect(warnings_csv.find("parse_warning,15,known_program_unknown_message,unrecognized auth pattern: sshd_connection_closed_preauth") != std::string::npos, "expected warning csv row"); diff --git a/tests/test_parser.cpp b/tests/test_parser.cpp index dc111a2..376b90b 100644 --- a/tests/test_parser.cpp +++ b/tests/test_parser.cpp @@ -100,12 +100,23 @@ std::string noisy_auth_coverage_json(const loglens::ParseReport& result) { output << (index + 1 == result.quality.top_unknown_patterns.size() ? "\n" : ",\n"); } + output << " ],\n" + << " \"failure_categories\": [\n"; + + for (std::size_t index = 0; index < result.quality.failure_categories.size(); ++index) { + const auto& entry = result.quality.failure_categories[index]; + output << " {\"category\": \"" << loglens::to_string(entry.category) + << "\", \"count\": " << entry.count << "}"; + output << (index + 1 == result.quality.failure_categories.size() ? "\n" : ",\n"); + } + output << " ],\n" << " \"warnings\": [\n"; for (std::size_t index = 0; index < result.warnings.size(); ++index) { const auto& warning = result.warnings[index]; output << " {\"line_number\": " << warning.line_number + << ", \"category\": \"" << loglens::to_string(warning.category) << "\"" << ", \"reason\": \"" << warning.reason << "\"}"; output << (index + 1 == result.warnings.size() ? 
"\n" : ",\n"); } @@ -749,10 +760,39 @@ void test_journalctl_auth_family_fixture_file() { void test_malformed_line() { const auto parser = make_syslog_parser(); std::string error; - const auto event = parser.parse_line("malformed log line without syslog header", 9, &error); + loglens::ParserFailureCategory category = loglens::ParserFailureCategory::KnownProgramUnknownMessage; + const auto event = parser.parse_line("malformed log line without syslog header", 9, &error, &category); expect(!event.has_value(), "expected malformed line to fail"); expect(!error.empty(), "expected parse error for malformed line"); + expect(category == loglens::ParserFailureCategory::UnknownTimestamp, + "expected malformed header to be categorized as unknown timestamp"); +} + +void test_parser_failure_taxonomy() { + const auto parser = make_syslog_parser(); + std::istringstream input( + "rotated\n" + "Mar 10 08:00:00 example-host CRON[2001]: (root) CMD (/usr/bin/true)\n" + "Mar 10 08:00:10 example-host sshd[1001]: Connection closed by authenticating user root 203.0.113.10 port 50100 [preauth]\n" + "Mar 10 08:00:20 example-host sshd[1002]: Failed password for root from not_an_ip port 50101 ssh2\n" + "Mar 10 08:00:30 example-host pam_faillock(sshd:auth): Account temporarily locked for user root\n"); + + const auto result = parser.parse_stream(input); + + expect(result.events.empty(), "expected taxonomy fixture to produce warnings only"); + expect(result.warnings.size() == 5, "expected five taxonomy warnings"); + expect(result.quality.failure_categories.size() == 5, "expected five parser failure categories"); + expect(loglens::to_string(result.warnings[0].category) == "unknown_timestamp", + "expected first warning category"); + expect(loglens::to_string(result.warnings[1].category) == "unknown_program", + "expected second warning category"); + expect(loglens::to_string(result.warnings[2].category) == "known_program_unknown_message", + "expected third warning category"); + 
expect(loglens::to_string(result.warnings[3].category) == "malformed_source_ip", + "expected fourth warning category"); + expect(loglens::to_string(result.warnings[4].category) == "unsupported_pam_variant", + "expected fifth warning category"); } void test_unknown_auth_patterns_are_warnings_only() { @@ -800,6 +840,9 @@ void test_stream_warnings_and_metadata() { expect(result.quality.top_unknown_patterns.size() == 1, "expected one unknown pattern"); expect(result.quality.top_unknown_patterns.front().pattern == "missing_syslog_header_fields", "expected normalized structural parse failure pattern"); + expect(result.quality.failure_categories.size() == 1, "expected one parser failure category"); + expect(result.quality.failure_categories.front().category == loglens::ParserFailureCategory::UnknownTimestamp, + "expected missing header to be categorized as unknown timestamp"); } void test_stream_tracks_skipped_blank_lines() { @@ -841,6 +884,9 @@ void test_journalctl_metadata() { expect(result.quality.top_unknown_patterns.size() == 1, "expected one journalctl unknown pattern"); expect(result.quality.top_unknown_patterns.front().pattern == "missing_journalctl_short_full_header_fields", "expected normalized journalctl failure pattern"); + expect(result.quality.failure_categories.size() == 1, "expected one journalctl parser failure category"); + expect(result.quality.failure_categories.front().category == loglens::ParserFailureCategory::UnknownTimestamp, + "expected journalctl missing header to be categorized as unknown timestamp"); } void test_journalctl_rejects_empty_fractional_seconds() { @@ -1102,6 +1148,7 @@ int main() { test_syslog_auth_family_fixture_file(); test_journalctl_auth_family_fixture_file(); test_malformed_line(); + test_parser_failure_taxonomy(); test_unknown_auth_patterns_are_warnings_only(); test_stream_warnings_and_metadata(); test_stream_tracks_skipped_blank_lines(); diff --git a/tests/test_report.cpp b/tests/test_report.cpp index 69c07ef..77b7ab2 100644 
--- a/tests/test_report.cpp +++ b/tests/test_report.cpp @@ -100,10 +100,10 @@ void test_noisy_auth_report_json_keeps_unsupported_lines_visible() { "expected noisy report json stable sshd preauth bucket"); expect(json.find("\"pattern\": \"pam_faillock_account_locked\", \"count\": 2") != std::string::npos, "expected noisy report json stable pam_faillock account-lock bucket"); - expect(json.find("\"line_number\": 13, \"reason\": \"unrecognized auth pattern: sshd_connection_closed_preauth\"") + expect(json.find("\"line_number\": 13, \"category\": \"known_program_unknown_message\", \"reason\": \"unrecognized auth pattern: sshd_connection_closed_preauth\"") != std::string::npos, "expected noisy report json to keep unsupported sshd warning visible"); - expect(json.find("\"line_number\": 24, \"reason\": \"unrecognized auth pattern: sudo_other\"") + expect(json.find("\"line_number\": 24, \"category\": \"known_program_unknown_message\", \"reason\": \"unrecognized auth pattern: sudo_other\"") != std::string::npos, "expected noisy report json to keep unsupported partial sudo warning visible"); } @@ -130,7 +130,8 @@ void test_markdown_table_cells_escape_user_controlled_values() { expect(markdown.find("summary \\| <raw> & more Usernames: ali\\|ce, bob<root>") != std::string::npos, "expected markdown finding notes to escape table and html-sensitive characters"); - expect(markdown.find("| 1 | bad \\| value
next <tag> & more |") != std::string::npos, + expect(markdown.find("| 1 | known_program_unknown_message | bad \\| value
next <tag> & more |") + != std::string::npos, "expected markdown warning reason to escape table pipes and newlines"); } @@ -218,7 +219,7 @@ void test_csv_neutralizes_formula_like_fields() { "expected formula-like finding subject to be neutralized"); expect(findings_csv.find(",'+bob;-carol;@dave,' @summary") != std::string::npos, "expected formula-like usernames and summary to be neutralized"); - expect(warnings_csv.find("parse_warning,2,'=warning") != std::string::npos, + expect(warnings_csv.find("parse_warning,2,known_program_unknown_message,'=warning") != std::string::npos, "expected formula-like warning reason to be neutralized"); } diff --git a/tests/test_report_contracts.cpp b/tests/test_report_contracts.cpp index 481bc97..5413f23 100644 --- a/tests/test_report_contracts.cpp +++ b/tests/test_report_contracts.cpp @@ -148,12 +148,14 @@ std::vector<std::string> extract_json_contract_lines(const std::string& json) { || starts_with(line, "\"parsed_lines\": ") || starts_with(line, "\"unparsed_lines\": ") || starts_with(line, "\"parse_success_rate\": ") + || starts_with(line, "\"failure_categories\": ") || starts_with(line, "\"parsed_event_count\": ") || starts_with(line, "\"warning_count\": ") || starts_with(line, "\"finding_count\": ") || starts_with(line, "\"host_summaries\": ") || starts_with(line, "\"hostname\": ") || starts_with(line, "{\"pattern\": ") + || starts_with(line, "{\"category\": ") || starts_with(line, "{\"event_type\": ") || starts_with(line, "\"rule_id\": ") || starts_with(line, "\"rule\": ")