IP Data Enrichment for Analytics: Warehouse, Batch, and Stream Patterns
Your data warehouse stores millions of IP addresses in logs, transaction records, and session tables. Without enrichment, those IPs are opaque strings. This post covers three architecture patterns for turning raw IP addresses into geographic, network, and risk intelligence — batch loading into Snowflake or BigQuery, streaming enrichment for real-time dashboards, and SQL-based query-time lookups that avoid moving data entirely.
Why IP Enrichment Belongs in Your Warehouse, Not Just Your Application
Most teams enrich IP addresses at the application layer — checking geolocation during login, scoring risk at checkout, or appending country headers at the edge. This handles real-time decisions well, but it leaves your historical data un-enriched. When your analytics team wants to answer questions like "what share of fraud attempts came from VPN traffic last quarter" or "how does conversion rate vary by city," they hit a wall: the warehouse has raw IPs, not location or risk data.
Warehouse-side enrichment solves this by joining IP intelligence into your fact tables. Once enriched, every downstream analyst, dashboard, and ML model can filter and group by country, ISP, VPN status, ASN, threat level, and timezone — without calling an API at query time.
Three Enrichment Architecture Patterns
There is no single right pattern. The best choice depends on your data volume, freshness requirements, query patterns, and warehouse platform. Here are the three main approaches, from simplest to most sophisticated:
1. Batch Enrichment: Load IP Intelligence into Your Warehouse
Export unique IP addresses from your warehouse, call the bulk geolocation API, and load the enriched results back as a dimension table. This pattern runs on a schedule (daily, hourly) and gives every downstream query access to location and risk data without API calls.
-- Step 1: Extract unique IPs from your warehouse (Snowflake example)
COPY INTO @staging/ips_to_enrich.csv
FROM (
  SELECT DISTINCT ip_address
  FROM raw_events
  WHERE event_date >= DATEADD(day, -1, CURRENT_DATE())
    AND ip_address IS NOT NULL
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',')

# Step 2: Call the bulk geolocation API
# POST the CSV file and receive enriched results
curl -X POST "https://ip-info.app/api/v1/geolocate/bulk" \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@ips_to_enrich.csv" \
  -o enriched_results.jsonl

-- Step 3: Load enriched data back into Snowflake
CREATE OR REPLACE TABLE ip_intelligence_dim (
  ip_address STRING PRIMARY KEY,
  country_code STRING,
  city_name STRING,
  latitude FLOAT,
  longitude FLOAT,
  timezone STRING,
  asn INTEGER,
  aso STRING,
  is_vpn BOOLEAN,
  is_proxy BOOLEAN,
  is_tor BOOLEAN,
  threat_level STRING,
  accuracy_radius INTEGER,
  enriched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);
When to use: Daily or hourly enrichment of historical data. Works well when your warehouse is the single source of truth and analysts need to run ad-hoc queries against enriched IP data. Supports Snowflake, BigQuery, Redshift, Databricks, and any platform that can run SQL and import files.
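Bulk endpoints typically cap request size, so it pays to deduplicate and chunk the extracted IPs before uploading. A minimal sketch of that preprocessing step — the 10,000-per-chunk limit is an illustrative assumption, not a documented constraint of the API:

```javascript
// Deduplicate a list of IPs and split it into bulk-upload chunks.
// CHUNK_SIZE is an assumed per-request limit; check your API's docs.
const CHUNK_SIZE = 10_000;

function chunkUniqueIps(ips, chunkSize = CHUNK_SIZE) {
  // Drop null/empty values, then dedupe while preserving order
  const unique = [...new Set(ips.filter(Boolean))];
  const chunks = [];
  for (let i = 0; i < unique.length; i += chunkSize) {
    chunks.push(unique.slice(i, i + chunkSize));
  }
  return chunks;
}
```

Deduplicating first is what makes the batch pattern cheap: a day of events may contain millions of rows but only tens of thousands of distinct IPs.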
2. Stream Enrichment: Enrich in Real Time Before Warehouse Insert
Add an enrichment step in your streaming pipeline — Kafka, Kinesis, or Flink — that calls the IP intelligence API on each event before it lands in the warehouse. Events arrive already enriched, so no batch job is needed.
// Kafka Streams / Node.js: Enrich events before warehouse insert
const { Kafka } = require('kafkajs');
const axios = require('axios');
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'ip-enricher' });
const producer = kafka.producer();
const API_KEY = process.env.IP_INFO_API_KEY;
const API_BASE = 'https://ip-info.app/api/v1/geolocate';
const cache = new Map(); // In-memory cache for dedup
async function enrichIp(ip) {
if (cache.has(ip)) return cache.get(ip);
const { data } = await axios.get(`${API_BASE}/${ip}`, {
headers: { 'x-api-key': API_KEY },
timeout: 5000,
});
const enriched = {
countryCode: data.countryCode,
cityName: data.city?.name,
asn: data.asn,
aso: data.aso,
isVpn: data.isVPN,
isProxy: data.isProxy,
threatLevel: data.threatLevel,
timezone: data.city?.timeZone,
};
cache.set(ip, enriched);
if (cache.size > 100_000) {
// Evict oldest entries to bound memory
const firstKey = cache.keys().next().value;
cache.delete(firstKey);
}
return enriched;
}
async function run() {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'raw-events' });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      if (event.ipAddress) {
        try {
          event.ipIntel = await enrichIp(event.ipAddress);
        } catch {
          event.ipIntel = { error: 'enrichment_failed' };
        }
      }
      await producer.send({
        topic: 'enriched-events',
        messages: [{ value: JSON.stringify(event) }],
      });
    },
  });
}
run().catch(console.error);

When to use: Real-time dashboards, fraud operations that need current data, or when your warehouse serves operational queries that should not wait for a batch cycle. It adds latency to the ingestion pipeline (one API call per unique IP), so cache aggressively.
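The "cache aggressively" advice deserves one refinement: geolocation and VPN status change over time, so entries should also age out. A sketch of a cache bounded by both size and TTL — the one-hour TTL and 100,000-entry cap are illustrative choices, not requirements:

```javascript
// Size- and age-bounded cache for enrichment results.
// maxEntries and ttlMs are illustrative defaults; tune for your traffic.
class EnrichmentCache {
  constructor({ maxEntries = 100_000, ttlMs = 60 * 60 * 1000 } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.map = new Map(); // Map preserves insertion order: oldest first
  }

  get(ip) {
    const entry = this.map.get(ip);
    if (!entry) return undefined;
    if (Date.now() - entry.at > this.ttlMs) {
      this.map.delete(ip); // expired: force a fresh API lookup
      return undefined;
    }
    return entry.value;
  }

  set(ip, value) {
    if (this.map.size >= this.maxEntries && !this.map.has(ip)) {
      // Evict the oldest insertion to bound memory
      this.map.delete(this.map.keys().next().value);
    }
    this.map.set(ip, { value, at: Date.now() });
  }
}
```

Swapping this in for the bare `Map` in the consumer above keeps memory bounded while guaranteeing that stale VPN flags are refreshed at least hourly.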
3. Query-Time Enrichment: Call the API from SQL (No Data Movement)
Some platforms support external functions or UDFs that call HTTP APIs during query execution. Snowflake External Functions, BigQuery Remote Functions, and Redshift Lambda UDFs all support this pattern. The enrichment happens at query time — no ETL, no batch schedule, no data movement.
-- Snowflake: External Function for IP geolocation
-- (Requires API Gateway or Azure Function as proxy)
CREATE OR REPLACE EXTERNAL FUNCTION enrich_ip(ip_address VARCHAR)
RETURNS VARIANT
API_INTEGRATION = ip_info_api
AS 'https://your-gateway.example.com/ip-enrich';

-- Usage: enrich IPs at query time
SELECT
  e.event_id,
  e.ip_address,
  enrich_ip(e.ip_address):countryCode AS country,
  enrich_ip(e.ip_address):cityName AS city,
  enrich_ip(e.ip_address):asn AS asn,
  enrich_ip(e.ip_address):aso AS isp,
  enrich_ip(e.ip_address):isVpn AS is_vpn,
  enrich_ip(e.ip_address):threatLevel AS threat
FROM analytics.events e
WHERE e.event_date >= DATEADD(day, -7, CURRENT_DATE())
  AND enrich_ip(e.ip_address):isVpn::BOOLEAN = TRUE;

-- For production: cache results in a temp table
-- to avoid repeated API calls on the same IP
CREATE TEMPORARY TABLE ip_cache AS
SELECT DISTINCT
  ip_address,
  enrich_ip(ip_address) AS intel
FROM analytics.events
WHERE event_date >= DATEADD(day, -7, CURRENT_DATE());
When to use: Ad-hoc analysis where you do not want to maintain an ETL pipeline, or when you need to enrich a small subset of IPs on demand. Not suitable for large-scale recurring queries — the per-row API call cost and latency make batch or stream patterns more efficient at scale.
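The gateway behind the external function has to speak Snowflake's batching contract: Snowflake POSTs a body of the form `{"data": [[rowIndex, ip], ...]}` and expects `{"data": [[rowIndex, result], ...]}` back, with row indexes preserved. A sketch of that translation layer, where `lookupIp` is a hypothetical stand-in for the real enrichment call so the logic is testable offline:

```javascript
// Translate a Snowflake external-function request body into a response body.
// `lookupIp` is a placeholder for the actual API call, injected as a
// parameter so the batching logic can be exercised without a network.
async function handleSnowflakeBatch(requestBody, lookupIp) {
  const rows = requestBody.data; // [[rowIndex, ipAddress], ...]
  const out = [];
  for (const [rowIndex, ip] of rows) {
    let intel = null;
    try {
      intel = ip ? await lookupIp(ip) : null;
    } catch {
      intel = { error: 'enrichment_failed' };
    }
    out.push([rowIndex, intel]); // row indexes must survive the round trip
  }
  return { data: out };
}
```

Per-row errors are returned as values rather than thrown, so one bad IP does not fail the whole query batch.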
What You Can Build With Enriched IP Data
Once IP intelligence lives in your warehouse, it powers queries across fraud, analytics, compliance, and operations. Here are the highest-value use cases teams actually ship:
Fraud analysis by VPN and proxy status
Group transactions, login attempts, and account changes by VPN flag, proxy status, and threat level. Identify which regions, ISPs, and ASN ranges produce the most suspicious activity.
Geographic conversion and revenue analysis
Segment conversion rates, average order value, and churn by country, city, and timezone. Enriched IP data gives you geographic analytics without client-side tracking or cookies.
Network and ISP intelligence
Understand which ISPs and carriers your users connect through. Identify patterns like high churn from mobile carriers in specific regions or fraud clusters from particular ASN ranges.
Compliance and regulatory reporting
Filter transactions by country for PSD3 reporting, GDPR data residency checks, or sanctions screening. Build automated reports that map IP-based geographic data to regulatory requirements.
Suspicious traffic clustering
Use ASN, ISP, and threat-level fields to detect coordinated attacks. A sudden spike from a single ASN range, or multiple accounts sharing the same VPN exit node, signals organized fraud.
Content localization analytics
Measure how users in different regions interact with localized content. Compare engagement metrics across countries and timezones to optimize content strategy and regional campaigns.
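The suspicious-traffic clustering above reduces to a simple aggregation once events carry ASN fields. A minimal sketch that flags any ASN whose share of total events exceeds a threshold — the 20% default is an illustrative choice, not a recommended production value:

```javascript
// Flag ASNs that account for a suspiciously large share of events.
// threshold is illustrative; calibrate it against your baseline traffic.
function flagAsnSpikes(events, threshold = 0.2) {
  const counts = new Map();
  for (const e of events) {
    if (e.asn == null) continue;
    counts.set(e.asn, (counts.get(e.asn) || 0) + 1);
  }
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  return [...counts.entries()]
    .filter(([, n]) => n / total > threshold)
    .map(([asn, n]) => ({ asn, share: n / total }));
}
```

In practice you would run this per time window and compare against a historical baseline per ASN, rather than a flat share threshold.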
dbt Transformation: Turning Raw Enrichment Into Analytics-Ready Models
Raw API response data needs cleaning before analysts can use it reliably. Here is a practical dbt model that transforms enriched IP data into a clean dimension table:
-- models/ip_intelligence_clean.sql
-- Cleans and deduplicates the raw enrichment data
WITH raw AS (
SELECT * FROM {{ source('ip_intelligence', 'raw_enrichment') }}
),
deduplicated AS (
SELECT
ip_address,
-- Country: prefer non-null, standardize codes
COALESCE(NULLIF(country_code, ''), 'UNKNOWN') AS country_code,
-- City: lowercase for consistency
LOWER(NULLIF(city_name, '')) AS city_name,
-- Coordinates: validate ranges
CASE
WHEN latitude BETWEEN -90 AND 90
AND longitude BETWEEN -180 AND 180
THEN latitude
ELSE NULL
END AS latitude,
CASE
WHEN latitude BETWEEN -90 AND 90
AND longitude BETWEEN -180 AND 180
THEN longitude
ELSE NULL
END AS longitude,
-- Network intelligence
asn,
aso AS isp_name,
-- Threat classification
COALESCE(is_vpn, FALSE) AS is_vpn,
COALESCE(is_proxy, FALSE) AS is_proxy,
COALESCE(is_tor, FALSE) AS is_tor,
COALESCE(NULLIF(threat_level, ''), 'low') AS threat_level,
-- Metadata
timezone,
accuracy_radius,
-- Keep the most recent enrichment
ROW_NUMBER() OVER (
PARTITION BY ip_address
ORDER BY enriched_at DESC
) AS rn
FROM raw
)
SELECT
ip_address,
country_code,
city_name,
latitude,
longitude,
asn,
isp_name,
is_vpn,
is_proxy,
is_tor,
threat_level,
timezone,
accuracy_radius
FROM deduplicated
WHERE rn = 1Choosing the Right Pattern
Each pattern has trade-offs. Use this comparison to pick the approach that fits your volume, latency, and cost constraints:
| Criterion | Batch | Stream | Query-Time |
|---|---|---|---|
| Data freshness | Hourly to daily | Real-time (seconds) | Always current |
| Cost efficiency | Best — deduplicates naturally | Good — needs in-memory cache | Worst — repeated API calls |
| Infrastructure complexity | Low — SQL + cron | High — streaming stack | Medium — external function setup |
| Best for | Historical analysis, reporting | Real-time dashboards, fraud ops | Ad-hoc investigation |
| Scale limit | Millions of IPs per cycle | Thousands per second (cached) | Hundreds per query |
Start Enriching Your Warehouse With IP Intelligence
Test the API with a sample of IPs from your production data. The live demo returns country, city, ASN, ISP, VPN status, and threat level — the same fields you would load into your warehouse.