Building IP Intelligence Data Pipelines for Analytics and SIEM
Raw logs contain IP addresses. What they don't contain is context — where the traffic originated, what network it came from, whether it used a proxy or VPN, and what risk level it carries. Turning IP addresses into actionable intelligence requires a pipeline architecture that handles batch enrichment, real-time stream processing, and downstream analytics. This post shows how to build one.
The Enrichment Gap in Most Data Pipelines
Why Raw IP Addresses Are Not Enough
Security teams, analytics teams, and fraud operations all face the same problem: their data stores contain millions of IP addresses that have never been translated into location, network, or risk context. An IP address in a web server log tells you nothing about whether the request came from a corporate office in London or a residential proxy in the same city.
Three downstream use cases depend on IP enrichment:
SIEM and Security Analytics
Enrich firewall logs, authentication events, and network flows with geolocation, ASN, and threat data. Correlate login attempts by country or flag traffic from known hosting providers.
Business Intelligence
Add country, region, timezone, ISP, and connection type to user analytics. Segment traffic by geography without relying on cookies or user-reported location. Power dashboards that show regional distribution, carrier mix, and VPN traffic percentage.
Fraud Operations
Backfill risk scores on historical transactions. Build training datasets for ML models with geolocation, proxy flags, ASN classification, and connection type. Identify patterns in historical fraud that raw IP data would never reveal.
Pipeline Architecture: Three Patterns
Most IP enrichment pipelines fall into one of three architectures. Choose based on your latency requirements, data volume, and downstream system.
Pattern 1: Batch Enrichment (Data Warehouse Backfill)
Best for historical data — nightly ETL jobs that enrich tables in Snowflake, BigQuery, Redshift, or PostgreSQL. Process millions of records with a batch API endpoint.
# Python: Batch enrichment with chunked processing
import requests

def enrich_ip_batch(ip_list, api_key, chunk_size=1000):
    """Enrich a list of IPs using batch API calls."""
    results = {}
    for i in range(0, len(ip_list), chunk_size):
        chunk = ip_list[i:i + chunk_size]
        response = requests.post(
            "https://ip-info.app/api/module-app/"
            "v1-bulk-ip-details",
            headers={
                "x-api-key": api_key,
                "Content-Type": "application/json",
            },
            json={"ips": chunk},
        )
        response.raise_for_status()
        for record in response.json().get("results", []):
            results[record["ip"]] = {
                "country": record.get("country"),
                "city": record.get("city"),
                "isp": record.get("isp"),
                "asn": record.get("asn"),
                "is_vpn": record.get("isVpn", False),
                "risk_score": record.get("riskScore"),
            }
    return results
Pattern 2: Real-Time Stream Enrichment
Best for live dashboards and alerting — enrich events as they flow through Kafka, Kinesis, or a similar event stream. Each event gets a synchronous API lookup before landing in the downstream system.
// Node.js / TypeScript: Stream enrichment middleware
import { moduleAppClient } from "@/sdks/module-app-client";

async function enrichEvent(event: Record<string, unknown>) {
  const ip = event.clientIp as string;
  if (!ip) return event;
  try {
    const { data } = await moduleAppClient
      .v1GetIpDetails
      .v1GetIpDetailsAction({ ip });
    return {
      ...event,
      geo: {
        country: data.country,
        city: data.city,
        timezone: data.timezone,
      },
      network: {
        isp: data.isp,
        asn: data.asn,
        connectionType: data.connectionType,
      },
      threat: {
        isVpn: data.isVpn,
        isProxy: data.isProxy,
        isTor: data.isTor,
        riskScore: data.riskScore,
      },
    };
  } catch {
    // Return original event on API failure
    return { ...event, geo: null, network: null, threat: null };
  }
}
Pattern 3: Query-Time Enrichment (SIEM / Search)
Best for investigative workflows — enrich IPs on demand when an analyst runs a search. Integrate with Splunk, Elasticsearch, or Datadog via lookup tables or custom search commands.
# Elasticsearch: Enrich a query with IP geolocation data
# Assumes a pre-built enrichment index called "ip_intel"
GET /security-events-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "event_type": "login_failed" } }
      ],
      "should": [
        { "term": { "ip_intel.threat.is_vpn": true } }
      ]
    }
  },
  "aggs": {
    "by_country": {
      "terms": { "field": "ip_intel.geo.country" }
    },
    "vpn_traffic_pct": {
      "filters": {
        "filters": {
          "vpn": { "term": { "ip_intel.threat.is_vpn": true } },
          "total": { "match_all": {} }
        }
      }
    }
  }
}
Normalized Enrichment Schema
Regardless of your downstream platform, use a consistent enrichment schema. This makes pipelines portable between Splunk, Elasticsearch, Snowflake, and other systems.
| Field Group | Fields | Primary Use Case |
|---|---|---|
| geo | country, region, city, postal, lat, lon, timezone | Analytics dashboards, compliance filtering |
| network | isp, asn, org, connectionType | Carrier analysis, hosting detection, business segmentation |
| threat | isVpn, isProxy, isTor, isHosting, isBot, riskScore | Fraud detection, alert rules, risk scoring |
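In Python pipelines, the schema above can be pinned down as a typed record so every downstream system receives the same shape. A minimal sketch, with field names taken from the table; the class name IpIntel and the defaults are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IpIntel:
    """Normalized enrichment record: geo, network, and threat field groups."""
    # geo
    country: Optional[str] = None
    region: Optional[str] = None
    city: Optional[str] = None
    postal: Optional[str] = None
    lat: Optional[float] = None
    lon: Optional[float] = None
    timezone: Optional[str] = None
    # network
    isp: Optional[str] = None
    asn: Optional[int] = None
    org: Optional[str] = None
    connection_type: Optional[str] = None
    # threat (booleans default to False so missing data reads as "no signal")
    is_vpn: bool = False
    is_proxy: bool = False
    is_tor: bool = False
    is_hosting: bool = False
    is_bot: bool = False
    risk_score: Optional[float] = None
```

Defaulting every field keeps partially enriched records loadable, which matters when the API returns no data for an IP.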
Operational Decisions That Affect Pipeline Cost
Deduplicate Before You Enrich
Log data is repetitive. The same IP can appear thousands of times in a single day's web server logs. Deduplicating IPs before calling the enrichment API can reduce your API spend by 80-95% on batch workloads. Cache results locally and rehydrate the full dataset after enrichment.
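The dedup-then-rehydrate step can be sketched in a few lines of Python. Here enrich_fn stands in for whatever lookup you use (the batch function above, or a cached client), and client_ip is a placeholder for your log schema's IP field:

```python
def enrich_with_dedup(records, enrich_fn, ip_field="client_ip"):
    """Enrich each unique IP once, then attach results to every record."""
    # Collapse thousands of repeated log lines down to the unique IP set
    unique_ips = {r[ip_field] for r in records if r.get(ip_field)}
    # One lookup per unique IP instead of one per log line
    intel = {ip: enrich_fn(ip) for ip in unique_ips}
    # Rehydrate: copy the shared result back onto each original record
    return [{**r, "ip_intel": intel.get(r.get(ip_field))} for r in records]
```

On a day of web server logs where each IP appears hundreds of times, this is where the 80-95% API-spend reduction comes from.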
Cache TTL Strategy
IP-to-geolocation mappings change slowly for residential IPs but frequently for VPN and hosting providers. Use a 24-72 hour cache for batch enrichment and a 5-15 minute cache for real-time stream enrichment. For fraud-sensitive streams, cache only the geolocation fields and always fetch fresh threat signals.
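One way to implement the split TTLs is an in-process cache with separate stores for geolocation and threat fields. A minimal sketch, assuming a single-worker pipeline; the class and method names are illustrative, and a production deployment would typically back this with Redis or similar:

```python
import time

class TtlCache:
    """Cache with separate TTLs for geo and threat field groups."""

    def __init__(self, geo_ttl=24 * 3600, threat_ttl=300):
        self.geo_ttl = geo_ttl        # geolocation changes slowly
        self.threat_ttl = threat_ttl  # threat signals go stale fast
        self._geo = {}     # ip -> (stored_at, data)
        self._threat = {}

    def _lookup(self, store, ttl, ip):
        entry = store.get(ip)
        if entry and time.time() - entry[0] < ttl:
            return entry[1]
        return None  # miss or expired: caller fetches fresh

    def get_geo(self, ip):
        return self._lookup(self._geo, self.geo_ttl, ip)

    def put_geo(self, ip, data):
        self._geo[ip] = (time.time(), data)

    def get_threat(self, ip):
        return self._lookup(self._threat, self.threat_ttl, ip)

    def put_threat(self, ip, data):
        self._threat[ip] = (time.time(), data)
```

For fraud-sensitive streams, set threat_ttl to 0 (or skip the threat store entirely) so threat signals are always fetched fresh.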
Handle Private and Reserved IPs
Internal IPs (10.x.x.x, 172.16-31.x.x, 192.168.x.x, IPv6 link-local) should never be sent to the enrichment API. Filter them out at the extraction stage and tag them as internal in your downstream data. Sending private IPs to external APIs wastes credits and can trigger rate limits.
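Python's standard ipaddress module handles this filter without hand-rolled CIDR math: is_global excludes the RFC 1918 ranges, loopback, link-local, and other reserved space for both IPv4 and IPv6. A sketch of the extraction-stage filter:

```python
import ipaddress

def is_enrichable(ip_str):
    """Return True only for public, globally routable addresses."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return False  # malformed address: tag and skip, don't send upstream
    # is_global is False for private, loopback, link-local, and reserved space
    return ip.is_global

def partition_ips(ip_list):
    """Split into (public, internal_or_invalid) before the API stage."""
    public, internal = [], []
    for ip in ip_list:
        (public if is_enrichable(ip) else internal).append(ip)
    return public, internal
```

Records in the internal bucket get tagged as internal in the downstream data and never touch the API.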
Rate Limiting and Backpressure
Production batch pipelines should implement token-bucket rate limiting and exponential backoff. Most IP intelligence APIs have rate limits that vary by plan tier. Design your pipeline to degrade gracefully — queue unenriched records and retry rather than dropping data.
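Both mechanisms can be sketched in a few lines of Python. This is an illustrative single-process version; distributed pipelines would coordinate limits across workers, and the exception handling here assumes any failure is retryable:

```python
import random
import time

class TokenBucket:
    """Blocking token-bucket limiter: roughly `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait for a token
            self.tokens = 0
        else:
            self.tokens -= 1

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted: let the caller queue the record for later
            # Double the delay each attempt; jitter avoids thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The final raise is the graceful-degradation hook: the caller catches it and queues the unenriched record for a later pass instead of dropping it.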
Downstream Integration Patterns
Once your enrichment pipeline is running, the enriched data feeds into downstream systems. Here are the most common integration targets:
Elasticsearch / OpenSearch
Build an enrichment index keyed by IP address. Use Elasticsearch's terms lookup or scripted fields to join log events with geolocation and threat data at query time.
Splunk
Import enrichment data as a KV store lookup. Use the iplocation command with your custom lookup to add country, city, and ISP fields to search results.
Snowflake / BigQuery / Redshift
Create an ip_intel dimension table and JOIN against your fact tables. Materialized views with country and risk_score enable fast dashboard queries without re-enriching.
Datadog / Grafana
Use the Log Pipeline enrichment processor or a custom metric processor to add geo and threat fields to incoming log events. Build dashboards that segment traffic by country, VPN percentage, and risk score distribution.
Frequently Asked Questions
How much does IP enrichment cost at scale?
Should I enrich IPs in real time or in batch?
How do I handle VPN and proxy traffic in analytics pipelines?
How long does it take to enrich a million records?
What if the API returns no data for an IP?
Start Building Your Enrichment Pipeline
Test batch enrichment and real-time lookup patterns with 100 free API calls. Verify response fields, latency, and data depth against your production data.