March 2026

Event-Driven Fraud Detection

Fraud detection system based on an event-driven architecture with Spring Boot microservices and Kafka messaging, instrumented with metrics, distributed tracing, and Grafana dashboards.

Java · Spring Boot · Kafka · PostgreSQL · Grafana · Prometheus · Loki · Tempo

TL;DR

In this project, I set out to build an event-driven anti-fraud platform that not only works, but is also operable when things get complicated.

The pipeline uses stage-level intermediate persistence: Outbox in transaction-service, Inbox + Outbox in fraud-detection-service, and Inbox + deduplication in alert-service. With that design, every critical hop (persistence, evaluation, publishing, and alerting) keeps recoverable state without blocking the request path.

The local runtime runs six replicas per microservice by default behind NGINX gateways, Kafka runs as a 3-broker cluster, and the scripts/start-autoscale.sh starter computes partitions, listener concurrency, and connection pools so the stack scales coherently.

This is not just a functional demo: it is a case study with explicit resilience, end-to-end traceability, and reproducible validation (unit + integration + e2e + load with k6).


Context and motivation

When fraud analysis and transactional writes live in the same synchronous flow, they compete for latency and availability.

The practical goal was to decouple anti-fraud analysis without losing operational control, using concrete mechanisms that make the system easier to operate under real failures:

  • consistency between DB and event publishing (Outbox),
  • decoupling consumption/processing with Inbox to absorb spikes,
  • at-least-once semantics with idempotent consumers,
  • DLQ as an operational mechanism (not just a “parking lot”),
  • business-oriented observability, not only infrastructure observability.

Diagram of the decoupled event-driven flow with Inbox/Outbox and main stages


Technical objective

The technical focus is for the pipeline to work under pressure without losing control:

  1. Keep high transaction write throughput.
  2. Close the DB/Kafka consistency gap in transaction-service and fraud-detection-service.
  3. Scale processing with real partitioning, replicas, and Inbox workers.
  4. Improve triage with traces, structured logs, and stage-level metrics.
  5. Validate behavior reproducibly with mixed load and failure scenarios.

Implemented architecture

The architecture is organized into three functional domains with a load-balanced entry layer and operational persistence per stage:

  • transaction-gateway (NGINX :8080) -> transaction-service (default x6).
  • fraud-gateway (NGINX :8081) -> fraud-detection-service (default x6).
  • alert-gateway (NGINX :8082) -> alert-service (default x6).

Services by responsibility:

  • transaction-service

    • exposes REST API + webhook,
    • persists the transaction,
    • enqueues TransactionCreatedEvent in transaction_outbox within the same DB transaction,
    • an asynchronous relay publishes to Kafka and marks published/failed states.
  • fraud-detection-service

    • consumes transactions.created,
    • ingests into fraud_inbox with idempotent insert by eventId,
    • processes the inbox with workers, evaluates heuristic rules, and persists history,
    • enqueues FraudDetectedEvent in fraud_outbox when riskScore exceeds threshold,
    • publishes fraud.detected through an asynchronous outbox relay.
  • alert-service

    • consumes fraud.detected,
    • ingests into alert_inbox and processes with workers,
    • deduplicates by eventId (inbox + processed_events) and persists alerts,
    • notifies by channels (log always enabled + optional email with MailHog locally),
    • exposes alert queries for verification and triage.

Each service keeps its own dedicated PostgreSQL database and stage-specific operational tables (transaction_outbox, fraud_inbox, fraud_outbox, alert_inbox, processed_events) to decouple stages and sustain recovery.


Kafka topology and scalability

The local stack starts a 3-broker Kafka cluster (kafka, kafka-2, kafka-3) with settings oriented toward local fault tolerance:

  • min.insync.replicas=2
  • default replication factor 3
  • configurable topic partitions (APP_KAFKA_PARTITIONS, default 18)

Main topics:

  • transactions.created
  • fraud.detected
  • transactions.created.dlq
  • fraud.detected.dlq

This combination makes it possible to test real partition-level parallelism and better observe lag/backpressure behavior under sustained load.

In addition, the scripts/start-autoscale.sh starter computes a coherent scaling configuration:

  • partitions = max(fraud_instances * fraud_concurrency, alert_instances * alert_concurrency),
  • SPRING_KAFKA_LISTENER_CONCURRENCY per instance in fraud-detection-service and alert-service,
  • effective consumer group concurrency as min(partitions, instances * concurrency),
  • Hikari connection budgets/pools per service,
  • .env.scaling file to run docker compose without editing the base compose.
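The sizing rules above boil down to two formulas. A minimal sketch of that arithmetic (class and method names are mine; the real logic lives in the shell script):

```java
// Sketch of the sizing arithmetic applied by scripts/start-autoscale.sh.
// Names are illustrative; the script reads these values from environment
// variables and writes the result to .env.scaling.
public class ScalingPlan {

    // Partitions must cover the largest consumer group's total parallelism.
    static int partitions(int fraudInstances, int fraudConcurrency,
                          int alertInstances, int alertConcurrency) {
        return Math.max(fraudInstances * fraudConcurrency,
                        alertInstances * alertConcurrency);
    }

    // A consumer group can never use more parallelism than there are partitions.
    static int effectiveConcurrency(int partitions, int instances, int concurrency) {
        return Math.min(partitions, instances * concurrency);
    }

    public static void main(String[] args) {
        int p = partitions(6, 3, 6, 2);           // max(18, 12) = 18
        int eff = effectiveConcurrency(p, 6, 3);  // min(18, 18) = 18
        System.out.println(p + " " + eff);
    }
}
```

With the default x6 replicas and a listener concurrency of 3 in fraud-detection-service, this yields the 18 partitions used in the local stack; any extra consumer threads beyond the partition count would simply sit idle.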

Screenshot of partition metrics and consumer lag per group under sustained load

Metrics after sustained load of 2500 RPS with 18 partitions


End-to-end operational flow

  1. Client sends a transaction by REST or webhook to the gateway.
  2. transaction-service validates payload, persists to PostgreSQL, and enqueues event in transaction_outbox (JSONB).
  3. TransactionOutboxRelayService takes pending batches with FOR UPDATE SKIP LOCKED, publishes to Kafka, and marks state (PUBLISHED or retry with nextAttemptAt).
  4. fraud-detection-service consumes transactions.created and enqueues into fraud_inbox with insert ... on conflict do nothing.
  5. FraudInboxProcessingService drains inbox with workers, evaluates rules, and persists user transaction history.
  6. If the result is fraud, it enqueues FraudDetectedEvent in fraud_outbox; its relay then publishes to fraud.detected.
  7. alert-service consumes fraud.detected, enqueues into alert_inbox, and processes with workers.
  8. AlertProcessingService deduplicates (inbox + processed_events), persists alert, and executes channel notifications.
  9. If consumption fails, fixed-backoff retries are applied and the message is then routed to the DLQ; DLQ consumers attempt reprocessing and record success/failure metrics.

The separation between synchronous writes and asynchronous processing/publishing per stage is key to keeping stable latency and operational consistency.
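The batch claim in step 3 can be sketched as a native query. This is a hypothetical Spring Data JPA repository: the transaction_outbox table and the FOR UPDATE SKIP LOCKED strategy come from the project, while the column names, the PENDING status, and the method signature are assumptions for illustration:

```java
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

// Hypothetical repository for the outbox relay's batch claim. SKIP LOCKED lets
// several relay instances poll concurrently without blocking on rows another
// instance has already locked; claimed rows are simply skipped.
public interface TransactionOutboxRepository extends JpaRepository<TransactionOutboxEntry, Long> {

    @Query(value = """
            SELECT * FROM transaction_outbox
            WHERE status = 'PENDING' AND next_attempt_at <= now()
            ORDER BY created_at
            LIMIT :batchSize
            FOR UPDATE SKIP LOCKED
            """, nativeQuery = true)
    List<TransactionOutboxEntry> claimPendingBatch(@Param("batchSize") int batchSize);
}
```

The relay would call this inside its own transaction, publish each row to Kafka, and then mark PUBLISHED or schedule a retry via nextAttemptAt, as described in the flow above.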

End-to-end sequence diagram from API/webhook to alert or DLQ


Resilience and consistency

Inbox + Outbox per stage

The platform uses intermediate persistence at each critical hop:

  • transaction-service: transaction_outbox to decouple transactional writes and transactions.created publishing,
  • fraud-detection-service: fraud_inbox for durable consumption + fraud_outbox for durable fraud.detected publishing,
  • alert-service: alert_inbox to decouple Kafka consumption and alert/notification processing.

Common technical points:

  • batching + non-blocking pessimistic lock (SKIP LOCKED),
  • retries with configurable delay,
  • periodic cleanup of processed/published events,
  • traceparent and baggage propagation to preserve trace continuity.

Concurrent idempotency

Idempotency is applied per layer to sustain at-least-once without duplicate side effects:

  • fraud-detection-service: deduplication by unique eventId key in fraud_inbox + explicit RECEIVED -> PROCESSED state transition,
  • alert-service: deduplication in alert_inbox and an additional barrier in processed_events with atomic insert (saveAndFlush) and DataIntegrityViolationException handling.
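The processed_events barrier relies on the database's unique key for atomicity. The same first-write-wins semantics can be illustrated in plain Java, where putIfAbsent plays the role of the atomic insert that raises DataIntegrityViolationException on a duplicate (this is an in-memory analogy, not the service's actual code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory analogy of the processed_events barrier. In the real service the
// database enforces uniqueness on eventId and a duplicate insert fails with
// DataIntegrityViolationException; here putIfAbsent models that atomic check.
public class ProcessedEventsBarrier {

    private final ConcurrentMap<String, Boolean> processed = new ConcurrentHashMap<>();

    /** Returns true only for the first delivery of a given eventId. */
    public boolean tryMarkProcessed(String eventId) {
        return processed.putIfAbsent(eventId, Boolean.TRUE) == null;
    }
}
```

With at-least-once delivery, every side effect (persisting the alert, sending a notification) runs only when this barrier returns true; redeliveries and concurrent consumers hit the duplicate branch and are dropped safely.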

Retries, DLQ, and operational reprocessing

The consumption strategy uses fixed backoff (1s, 3 attempts). If recovery does not happen, the message is routed to <topic>.dlq.

The DLQ is not left passive:

  • dedicated consumers exist per DLQ,
  • received/reprocessed/failed metrics are recorded,
  • scripts are available to force and verify failure/reprocessing (scripts/test-dlq.sh, scripts/test-dlq-reprocess.sh).
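A retry-then-DLQ policy like this can be wired with Spring Kafka's DefaultErrorHandler. This is a minimal configuration sketch: only the backoff values (1s delay, 3 attempts) and the <topic>.dlq naming convention come from the project; the bean wiring is an assumption about how such a policy is typically configured:

```java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

// Sketch of the consumption error policy: fixed backoff, then dead-letter.
@Configuration
public class KafkaErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Route exhausted records to "<original-topic>.dlq", same partition,
        // matching the naming convention used by the project's topics.
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                template,
                (record, ex) -> new TopicPartition(record.topic() + ".dlq", record.partition()));

        // FixedBackOff(interval, maxAttempts): wait 1s between attempts, then
        // hand the record to the recoverer once attempts are exhausted.
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
    }
}
```

Keeping the DLQ routing in the error handler (rather than in each listener) is what makes the dedicated DLQ consumers and their received/reprocessed/failed metrics possible as a uniform, cross-topic mechanism.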

Rules engine and event contracts

Active fraud rules

An accumulative heuristic engine is used, with the score capped at 100 and a configurable threshold (fraud-score-threshold, default 70).

Active rules:

  • HIGH_AMOUNT (+45)
  • HIGH_VELOCITY (+35)
  • COUNTRY_CHANGE_IN_SHORT_WINDOW (+30)
  • HIGH_RISK_MERCHANT (+25)

The FraudDetectedEvent includes ruleVersion and transactionOccurredAt, which makes decision traceability and real end-to-end pipeline latency measurement easier.
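The accumulative scoring described above can be sketched as follows. The rule names, weights, cap, and threshold come from the project; the transaction shape and each rule's predicate (e.g. the HIGH_AMOUNT cutoff) are placeholders I invented for illustration:

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the accumulative heuristic engine: matching rules add fixed
// weights, the total is capped at 100, and the decision compares the score
// against the configurable threshold (default 70).
public class FraudScoring {

    record Tx(double amount, boolean highVelocity, boolean countryChange, boolean riskyMerchant) {}
    record Rule(String name, int weight, Predicate<Tx> matches) {}

    static final List<Rule> RULES = List.of(
            new Rule("HIGH_AMOUNT", 45, tx -> tx.amount() > 10_000),  // cutoff is an assumption
            new Rule("HIGH_VELOCITY", 35, Tx::highVelocity),
            new Rule("COUNTRY_CHANGE_IN_SHORT_WINDOW", 30, Tx::countryChange),
            new Rule("HIGH_RISK_MERCHANT", 25, Tx::riskyMerchant));

    static int riskScore(Tx tx) {
        int score = RULES.stream()
                .filter(rule -> rule.matches().test(tx))
                .mapToInt(Rule::weight)
                .sum();
        return Math.min(score, 100); // cap at 100
    }

    static boolean isFraud(Tx tx, int threshold) {
        return riskScore(tx) > threshold; // FraudDetectedEvent enqueued only past threshold
    }
}
```

For example, a high-amount transaction that also trips HIGH_VELOCITY scores 45 + 35 = 80 and crosses the default threshold of 70, while a transaction matching all four rules is capped at 100 rather than 135.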


Operational observability

The observability stack covers application metrics, infrastructure metrics, and distributed traceability:

  • Prometheus (services + exporters)
  • Loki (structured logs)
  • Alloy (Docker log collection + OTLP reception)
  • Tempo (distributed traces)
  • Grafana (dashboards and exploration)
  • Kafka Exporter + Postgres Exporters + cAdvisor (capacity and infrastructure)
  • operational JSON log contract (event + outcome) to filter critical transitions by service.

Active dashboards:

  • fraud-observability
  • fraud-alerting-live
  • fraud-tracing
  • fraud-alert-triage-db
  • fraud-kafka-operations
  • fraud-throughput-live
  • fraud-capacity-backpressure

Metrics that delivered the most value

  • transaction_events_enqueued_total{outcome}
  • transaction_events_published_total{outcome}
  • fraud_events_consumed_total
  • fraud_events_published_total{outcome}
  • fraud_decisions_total{decision}
  • fraud_alerts_total
  • fraud_alert_notifications_total{channel,outcome}
  • fraud_pipeline_e2e_latency_seconds{path}
  • fraud_inbox_backlog
  • alert_inbox_backlog
  • kafka_dlq_events_received_total
  • kafka_dlq_events_reprocessed_total
  • kafka_dlq_events_failed_total

Business and reliability alerts

Business and reliability rules were included (including coverage and conversion SLOs):

  • FraudPipelineCoverageSLOViolation
  • FraudToAlertConversionSLOViolation
  • NotificationFailureRateHigh
  • FraudEvaluationLatencyHigh
  • AlertNotificationLatencyHigh
  • KafkaDlqTrafficDetected
  • KafkaDlqReprocessFailed

This set reduces detection time and avoids depending only on generic technical alerts.

Grafana screenshot with a fired SLO alert and supporting panels for triage


Behavior validation

Reproducible load with k6

The load runner is prepared for daily use and comparable tests:

  • modes: stress, spike, soak, smoke,
  • profiles: capacity-baseline, balanced, mostly-normal, fraud-focus, validation, chaos-5xx, custom,
  • interactive and non-interactive support,
  • strong input validation before execution,
  • fallback to grafana/k6 container when no local binary is available.

This makes it possible to test not only throughput, but also controlled degradation, queue pressure, error behavior, and fraud decision consistency.

Screenshot of k6 results for the executed profile (RPS, p95, error rate)

Automated tests

The strategy combines:

  • unit tests,
  • per-service integration tests with Testcontainers (Kafka + PostgreSQL),
  • full-pipeline e2e with 6 scenarios (including mixed load with error rate, P95, and alert materialization assertions).

There are also separate CI workflows for build/unit, integration, and e2e, with retries in e2e and diagnostic artifacts on failure.

Outcome signals that are actually validated

Beyond “it compiles,” there are explicit conditions the project validates automatically:

  • in mixedLoad_shouldKeepApiHealthyAndGenerateExpectedAlerts (e2e), 120 concurrent requests are executed (REST + webhook),
  • the test requires errorRate <= 2% and p95 < 2.5s on the entry path,
  • and it also confirms that the expected fraud volume ends up materialized as persisted alerts.

This does not replace a production benchmark, but it does guarantee a reproducible baseline to compare architecture and tuning changes without subjective “feelings.”

e2e-tests/src/test/java/com/fraud/e2e/FullPipelineE2ETest.java
long failedRequests = results.stream().filter(result -> result.statusCode() != 201).count();
double errorRate = failedRequests / (double) totalRequests;
assertTrue(errorRate <= 0.02, "API error rate should stay below 2% under load");

List<Long> latencies = results.stream()
        .map(LoadResult::latencyMs)
        .sorted(Comparator.naturalOrder())
        .toList();
int p95Index = (int) Math.ceil(latencies.size() * 0.95) - 1;
long p95LatencyMs = latencies.get(Math.max(0, p95Index));
assertTrue(p95LatencyMs < 2500, "P95 latency should stay below 2.5s under load");

await()
        .atMost(Duration.ofSeconds(90))
        .pollInterval(Duration.ofSeconds(3))
        .untilAsserted(() -> {
            int currentAlertCount =
                    given()
                            .when()
                            .get(alertServiceUrl + "/api/v1/alerts")
                            .then()
                            .statusCode(200)
                            .extract()
                            .path("size()");
            assertTrue(currentAlertCount >= initialAlertCount + fraudulentRequests,
                    "Fraud alerts should be generated for load scenario");
        });

Result of the mixed E2E scenario with key assertions (error rate, p95, and materialized alerts) in PASS state


Daily operations and triage

Recommended operational flow:

  1. Confirm pipeline health on business dashboards.
  2. Verify whether the bottleneck is inbox backlog, outbox publish, evaluation, or notification.
  3. Correlate by traceId in JSON logs.
  4. Jump to distributed trace for the exact sequence.

Useful scripts from the repository:

  • scripts/start-autoscale.sh
  • scripts/single-fraud-scenario.sh
  • scripts/run-k6-stress.sh
  • scripts/test-dlq.sh
  • scripts/test-dlq-reprocess.sh

Triage dashboard with traceId correlation across logs, traces, and DLQ metrics


Implemented guarantees

The platform includes concrete end-to-end guarantees:

  • lower risk of event loss during intermediate failures through stage-level persistence (inbox/outbox),
  • isolation between synchronous path (API) and asynchronous evaluation/publishing segments,
  • explicit deduplication to neutralize redeliveries and races in consumers,
  • operational traceability per event (state, attempts, errors, latency, and backlog).

Assumed trade-offs

There are also deliberate concessions that are part of the design:

  • consistency is eventual (not immediate): detection/alert arrives milliseconds or seconds after the initial write,
  • operational complexity increases (more state tables, workers, retention policies, and queue metrics),
  • observability cost goes up, but it is offset by faster diagnosis during real degradation.

Current technical debt

Area | Current state | Risk | Next step
---- | ---- | ---- | ----
Event contracts | JSON without formal governance | Silent incompatible changes | Formal versioning + contract tests in CI
DB schema | ddl-auto: update still in use | Drift between environments | Versioned migrations (Flyway/Liquibase)
API security | Endpoints without authN/authZ | Operational exposure | JWT/OAuth2 + rate limiting
Deployment platform | Focus on local Docker Compose | Gap compared to production | Production profile (orchestration + autoscaling)

Fine-grained standardization of metric naming is also still pending, which would simplify long-term dashboard maintenance.


Key learnings

  1. Moving to event-driven without intermediate persistence (Inbox/Outbox) leaves a consistency gap that is too costly in production.
  2. Consumer idempotency is not optional when you have at-least-once and real parallelism.
  3. DLQ without reprocessing and dedicated metrics is operational debt in disguise.
  4. Observability that actually accelerates triage combines business metrics + structured logs + traces.
  5. Loading the system with mixed profiles (not only happy path) changes the quality of architecture decisions.

Personal closing

This project left me with a practical conclusion: in event-driven architectures, building a functional flow is only half the work; the other half is making it observable, recoverable, and operable under pressure.

That changed my design criteria: now I prioritize guarantees, operational signals, and recovery strategies from the beginning.