Engineering • Data Architecture

Building IP Intelligence Data Pipelines for Analytics and SIEM

By Sarah Chen, Data Engineering Lead · 10 min read

Raw logs contain IP addresses. What they don't contain is context — where the traffic originated, what network it came from, whether it used a proxy or VPN, and what risk level it carries. Turning IP addresses into actionable intelligence requires a pipeline architecture that handles batch enrichment, real-time stream processing, and downstream analytics. This post shows how to build one.

The Enrichment Gap in Most Data Pipelines

20-30% of records enrichable (typical log data with IP fields)
10M+ IPs processed daily (production batch throughput)
25+ data fields per lookup (country, ASN, ISP, risk, timezone)
99.9% accuracy rate (country-level geolocation)

Why Raw IP Addresses Are Not Enough

Security teams, analytics teams, and fraud operations all face the same problem: their data stores contain millions of IP addresses that have never been translated into location, network, or risk context. An IP address in a web server log tells you nothing about whether the request came from a corporate office in London or a residential proxy in the same city.

Three downstream use cases depend on IP enrichment:

SIEM and Security Analytics

Enrich firewall logs, authentication events, and network flows with geolocation, ASN, and threat data. Correlate login attempts by country or flag traffic from known hosting providers.

Business Intelligence

Add country, region, timezone, ISP, and connection type to user analytics. Segment traffic by geography without relying on cookies or user-reported location. Power dashboards that show regional distribution, carrier mix, and VPN traffic percentage.

Fraud Operations

Backfill risk scores on historical transactions. Build training datasets for ML models with geolocation, proxy flags, ASN classification, and connection type. Identify patterns in historical fraud that raw IP data would never reveal.

Pipeline Architecture: Three Patterns

Most IP enrichment pipelines fall into one of three architectures. Choose based on your latency requirements, data volume, and downstream system.

Pattern 1: Batch Enrichment (Data Warehouse Backfill)

Best for historical data — nightly ETL jobs that enrich tables in Snowflake, BigQuery, Redshift, or PostgreSQL. Process millions of records with a batch API endpoint.

Source DB → Extract IPs → Batch API Call → Merge Results → Target DB
# Python: batch enrichment with chunked processing
import requests

def enrich_ip_batch(ip_list, api_key, chunk_size=1000):
    """Enrich a list of IPs using batch API calls."""
    results = {}
    for i in range(0, len(ip_list), chunk_size):
        chunk = ip_list[i:i + chunk_size]
        response = requests.post(
            "https://ip-info.app/api/module-app/"
            "v1-bulk-ip-details",
            headers={
                "x-api-key": api_key,
                "Content-Type": "application/json"
            },
            json={"ips": chunk},
            timeout=30,
        )
        response.raise_for_status()
        for record in response.json().get("results", []):
            results[record["ip"]] = {
                "country": record.get("country"),
                "city": record.get("city"),
                "isp": record.get("isp"),
                "asn": record.get("asn"),
                "is_vpn": record.get("isVpn", False),
                "risk_score": record.get("riskScore"),
            }
    return results

Pattern 2: Real-Time Stream Enrichment

Best for live dashboards and alerting — enrich events as they flow through Kafka, Kinesis, or a similar event stream. Each event gets a synchronous API lookup before landing in the downstream system.

Event Source → Stream Processor → IP Lookup (per event) → Enriched Event
// Node.js / TypeScript: Stream enrichment middleware
import { moduleAppClient } from "@/sdks/module-app-client";

async function enrichEvent(event: Record<string, unknown>) {
  const ip = event.clientIp as string;
  if (!ip) return event;

  try {
    const { data } = await moduleAppClient
      .v1GetIpDetails
      .v1GetIpDetailsAction({ ip });

    return {
      ...event,
      geo: {
        country: data.country,
        city: data.city,
        timezone: data.timezone,
      },
      network: {
        isp: data.isp,
        asn: data.asn,
        connectionType: data.connectionType,
      },
      threat: {
        isVpn: data.isVpn,
        isProxy: data.isProxy,
        isTor: data.isTor,
        riskScore: data.riskScore,
      },
    };
  } catch {
    // Return original event on API failure
    return { ...event, geo: null, network: null, threat: null };
  }
}

Pattern 3: Query-Time Enrichment (SIEM / Search)

Best for investigative workflows — enrich IPs on demand when an analyst runs a search. Integrate with Splunk, Elasticsearch, or Datadog via lookup tables or custom search commands.

Analyst Query → Lookup Table → Enriched Results
# Elasticsearch: Enrich a query with IP geolocation data
# Assumes a pre-built enrichment index called "ip_intel"

GET /security-events-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "event_type": "login_failed" } }
      ],
      "should": [
        {
          "term": {
            "ip_intel.threat.is_vpn": true
          }
        }
      ]
    }
  },
  "aggs": {
    "by_country": {
      "terms": { "field": "ip_intel.geo.country" }
    },
    "vpn_traffic_pct": {
      "filters": {
        "filters": {
          "vpn": { "term": { "ip_intel.threat.is_vpn": true } },
          "total": { "match_all": {} }
        }
      }
    }
  }
}

Normalized Enrichment Schema

Regardless of your downstream platform, use a consistent enrichment schema. This makes pipelines portable between Splunk, Elasticsearch, Snowflake, and other systems.

Field Group | Fields | Primary Use Case
geo | country, region, city, postal, lat, lon, timezone | Analytics dashboards, compliance filtering
network | isp, asn, org, connectionType | Carrier analysis, hosting detection, business segmentation
threat | isVpn, isProxy, isTor, isHosting, isBot, riskScore | Fraud detection, alert rules, risk scoring
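One way to pin the schema down is a set of dataclasses. This is a sketch, not a library type: the snake_case field names are a naming choice here, and the API's camelCase fields (isVpn, riskScore) would be mapped in during enrichment.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Geo:
    country: Optional[str] = None
    region: Optional[str] = None
    city: Optional[str] = None
    postal: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
    timezone: Optional[str] = None

@dataclass
class Network:
    isp: Optional[str] = None
    asn: Optional[int] = None
    org: Optional[str] = None
    connection_type: Optional[str] = None

@dataclass
class Threat:
    # Default to False so a missing threat signal reads as "no flag", not null
    is_vpn: bool = False
    is_proxy: bool = False
    is_tor: bool = False
    is_hosting: bool = False
    is_bot: bool = False
    risk_score: Optional[float] = None

@dataclass
class IpIntel:
    ip: str
    geo: Geo = field(default_factory=Geo)
    network: Network = field(default_factory=Network)
    threat: Threat = field(default_factory=Threat)
```

Keeping the three groups as nested objects makes the schema map cleanly onto Elasticsearch nested fields, Snowflake VARIANT columns, or flattened dot-notation columns, whichever the target system prefers.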

Operational Decisions That Affect Pipeline Cost

Deduplicate Before You Enrich

Log data is repetitive. The same IP can appear thousands of times in a single day's web server logs. Deduplicating IPs before calling the enrichment API can reduce your API spend by 80-95% on batch workloads. Cache results locally and rehydrate the full dataset after enrichment.
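The dedupe-then-rehydrate step can be sketched as below. The record shape (a `client_ip` key) and the enrichment callable passed as `enrich_fn` are assumptions for illustration; in practice `enrich_fn` would be the batch function from Pattern 1.

```python
def enrich_records(records, api_key, enrich_fn):
    """Deduplicate IPs, enrich each unique IP once, then rehydrate all records."""
    # Collect unique, non-empty IPs across the whole batch
    unique_ips = list({r["client_ip"] for r in records if r.get("client_ip")})
    # One enrichment lookup per unique IP instead of per record
    intel = enrich_fn(unique_ips, api_key)
    # Rehydrate: attach the shared result back onto every record
    for r in records:
        r["ip_intel"] = intel.get(r.get("client_ip"))
    return records
```

With a typical 80-95% duplication rate, the `enrich_fn` call here touches a small fraction of the row count the naive per-record approach would.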

Cache TTL Strategy

IP-to-geolocation mappings change slowly for residential IPs but frequently for VPN and hosting providers. Use a 24-72 hour cache for batch enrichment and a 5-15 minute cache for real-time stream enrichment. For fraud-sensitive streams, cache only the geolocation fields and always fetch fresh threat signals.
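A minimal in-memory TTL cache sketch, assuming a `lookup(ip)` callable that performs the API request. Production pipelines would more likely use Redis or a bounded LRU with eviction, but the expiry logic is the same.

```python
import time

class TtlCache:
    """In-memory IP cache with per-entry expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # ip -> (expires_at, value)

    def get(self, ip):
        entry = self._store.get(ip)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # missing or expired

    def put(self, ip, value):
        self._store[ip] = (time.monotonic() + self.ttl, value)

def cached_lookup(ip, cache, lookup):
    """Return a cached result if fresh, otherwise call the API and cache it."""
    hit = cache.get(ip)
    if hit is not None:
        return hit
    value = lookup(ip)
    cache.put(ip, value)
    return value
```

Instantiate with `TtlCache(ttl_seconds=86400)` for batch jobs or `TtlCache(ttl_seconds=600)` for streams; for fraud-sensitive streams, cache only the geo portion of the response and route threat fields around the cache entirely.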

Handle Private and Reserved IPs

Internal IPs (10.x.x.x, 172.16-31.x.x, 192.168.x.x, IPv6 link-local) should never be sent to the enrichment API. Filter them out at the extraction stage and tag them as internal in your downstream data. Sending private IPs to external APIs wastes credits and can trigger rate limits.
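Python's standard-library ipaddress module covers all of these ranges (plus loopback, multicast, and reserved space) without hand-rolled regexes, so the extraction-stage filter can be a few lines:

```python
import ipaddress

def is_enrichable(ip_str):
    """Return True only for public, routable IPs worth sending to the API."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return False  # malformed value: tag it and skip enrichment
    # is_private covers RFC 1918, loopback, link-local, and IPv6 unique-local
    return not (
        ip.is_private
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

Records that fail the filter should still flow downstream, just tagged as internal (or malformed) rather than enriched, so joins and counts stay complete.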

Rate Limiting and Backpressure

Production batch pipelines should implement token-bucket rate limiting and exponential backoff. Most IP intelligence APIs have rate limits that vary by plan tier. Design your pipeline to degrade gracefully — queue unenriched records and retry rather than dropping data.
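A retry sketch with exponential backoff and jitter; the `RateLimitError` exception here is a hypothetical wrapper for an HTTP 429 response, and the delay constants are illustrative rather than tuned to any particular provider's limits.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical wrapper raised when the API signals a rate limit (HTTP 429)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the queue layer
            # Delays grow 1s, 2s, 4s, 8s...; jitter avoids thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

On final failure the caller should push the chunk back onto a retry queue rather than drop it, which is the graceful-degradation behavior described above.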

Downstream Integration Patterns

Once your enrichment pipeline is running, the enriched data feeds into downstream systems. Here are the most common integration targets:

Elasticsearch / OpenSearch

Build an enrichment index keyed by IP address. Use Elasticsearch's terms lookup or scripted fields to join log events with geolocation and threat data at query time.

Splunk

Import enrichment data as a KV store lookup. Use the iplocation command with your custom lookup to add country, city, and ISP fields to search results.

Snowflake / BigQuery / Redshift

Create an ip_intel dimension table and JOIN against your fact tables. Materialized views with country and risk_score enable fast dashboard queries without re-enriching.

Datadog / Grafana

Use the Log Pipeline enrichment processor or a custom metric processor to add geo and threat fields to incoming log events. Build dashboards that segment traffic by country, VPN percentage, and risk score distribution.

Frequently Asked Questions

How much does IP enrichment cost at scale?
With deduplication, most teams enrich 100K-500K unique IPs per day. At volume pricing tiers, this translates to roughly $27-$143/day in API costs. The deduplication factor varies by traffic pattern but typically reduces the unique IP count by 80-95% compared to total events.

Should I enrich IPs in real time or in batch?
Use real-time enrichment for fraud detection, access control, and live dashboards where latency matters. Use batch enrichment for historical analysis, model training, and compliance reporting. Many teams run both — a real-time path for enforcement and a nightly batch for analytics.

How do I handle VPN and proxy traffic in analytics pipelines?
Tag enriched records with the VPN, proxy, and Tor flags from the API response. Create separate analytics dimensions for "organic traffic," "VPN traffic," and "proxy traffic." Filter or segment on these flags in your dashboards rather than excluding them — the distribution of masked traffic itself is a useful signal.

How long does it take to enrich a million records?
With batch API calls (100-1000 IPs per request), deduplication, and parallel processing, one million unique IPs typically enrich in 15-45 minutes depending on your API tier's rate limits and your geographic proximity to the API endpoint.

What if the API returns no data for an IP?
Some IPs (particularly in IPv6 space or certain regional ISPs) may return partial data. Design your pipeline to handle null fields gracefully — use "unknown" as a fallback dimension rather than dropping the record. For fraud scoring, a missing threat signal should default to a neutral stance rather than blocking the traffic.

Start Building Your Enrichment Pipeline

Test batch enrichment and real-time lookup patterns with 100 free API calls. Verify response fields, latency, and data depth against your production data.