Data Engineering & Analytics

IP Data Enrichment for Analytics: Warehouse, Batch, and Stream Patterns

By Sarah Kim, Analytics Engineering Lead · 10 min read

Your data warehouse stores millions of IP addresses in logs, transaction records, and session tables. Without enrichment, those IPs are opaque strings. This post covers three architecture patterns for turning raw IP addresses into geographic, network, and risk intelligence — batch loading into Snowflake or BigQuery, streaming enrichment for real-time dashboards, and SQL-based query-time lookups that avoid moving data entirely.

IP Enrichment for Analytics: Key Patterns

8K/min: Batch Throughput (IPs per minute via bulk API)
7 Fields: Enriched Per IP (country, city, ASN, ISP, VPN, threat, timezone)
3 Patterns: Architecture Options (batch, stream, and query-time)
Zero: Schema Changes Needed (enrichment is additive)

Why IP Enrichment Belongs in Your Warehouse, Not Just Your Application

Most teams enrich IP addresses at the application layer — checking geolocation during login, scoring risk at checkout, or appending country headers at the edge. This handles real-time decisions well, but it leaves your historical data un-enriched. When your analytics team wants to answer questions like "what share of fraud attempts came from VPN traffic last quarter" or "how does conversion rate vary by city," they hit a wall: the warehouse has raw IPs, not location or risk data.

Warehouse-side enrichment solves this by joining IP intelligence into your fact tables. Once enriched, every downstream analyst, dashboard, and ML model can filter and group by country, ISP, VPN status, ASN, threat level, and timezone — without calling an API at query time.

Three Enrichment Architecture Patterns

There is no single right pattern. The best choice depends on your data volume, freshness requirements, query patterns, and warehouse platform. Here are the three main approaches, from simplest to most sophisticated:

1. Batch Enrichment: Load IP Intelligence into Your Warehouse

Export unique IP addresses from your warehouse, call the bulk geolocation API, and load the enriched results back as a dimension table. This pattern runs on a schedule (daily, hourly) and gives every downstream query access to location and risk data without API calls.

-- Step 1: Extract unique IPs from your warehouse (Snowflake example)
COPY INTO @staging/ips_to_enrich.csv
FROM (
  SELECT DISTINCT ip_address
  FROM raw_events
  WHERE event_date >= DATEADD(day, -1, CURRENT_DATE())
    AND ip_address IS NOT NULL
)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',')
SINGLE = TRUE;  -- write one file so it can be uploaded as-is
-- Download locally with SnowSQL: GET @staging/ips_to_enrich.csv file:///tmp/;

# Step 2: Call the bulk geolocation API
# POST the CSV file and receive enriched results
curl -X POST "https://ip-info.app/api/v1/geolocate/bulk" \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@ips_to_enrich.csv" \
  -o enriched_results.jsonl

-- Step 3: Create the dimension table and load the enriched results
CREATE OR REPLACE TABLE ip_intelligence_dim (
  ip_address STRING PRIMARY KEY,
  country_code STRING,
  city_name STRING,
  latitude FLOAT,
  longitude FLOAT,
  timezone STRING,
  asn INTEGER,
  aso STRING,
  is_vpn BOOLEAN,
  is_proxy BOOLEAN,
  is_tor BOOLEAN,
  threat_level STRING,
  accuracy_radius INTEGER,
  enriched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- Stage the results (PUT file://enriched_results.jsonl @staging), then load.
-- If the API returns camelCase keys, load into a single-VARIANT staging
-- table first and SELECT into this snake_case schema.
COPY INTO ip_intelligence_dim
FROM @staging/enriched_results.jsonl
FILE_FORMAT = (TYPE = 'JSON')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

When to use: Daily or hourly enrichment of historical data. Works well when your warehouse is the single source of truth and analysts need to run ad-hoc queries against enriched IP data. Supports Snowflake, BigQuery, Redshift, Databricks, and any platform that can run SQL and import files.
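For IP lists too large for a single upload, the extract can be split into chunks and submitted sequentially. A minimal Node sketch under stated assumptions: the endpoint and header follow the curl example above, but whether the API accepts a raw CSV body (rather than a multipart file) and the chunk size and retry count are all illustrative, not documented behavior:

```javascript
// Split a list of IPs into fixed-size chunks for bulk submission.
function chunkIps(ips, size = 10_000) {
  const chunks = [];
  for (let i = 0; i < ips.length; i += size) {
    chunks.push(ips.slice(i, i + size));
  }
  return chunks;
}

// Submit each chunk to the bulk endpoint with a simple retry loop.
// Endpoint and API-key header mirror the curl example above.
async function submitAll(ips, apiKey, retries = 3) {
  const results = [];
  for (const chunk of chunkIps(ips)) {
    for (let attempt = 1; attempt <= retries; attempt++) {
      const res = await fetch('https://ip-info.app/api/v1/geolocate/bulk', {
        method: 'POST',
        headers: { 'x-api-key': apiKey, 'content-type': 'text/csv' },
        body: chunk.join('\n'),
      });
      if (res.ok) {
        results.push(await res.text()); // JSONL payload per chunk
        break;
      }
      if (attempt === retries) throw new Error(`chunk failed: ${res.status}`);
    }
  }
  return results;
}
```

Chunking also keeps any single failure small: one bad chunk can be retried or skipped without re-running the whole day's extract.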

2. Stream Enrichment: Enrich in Real Time Before Warehouse Insert

Add an enrichment step in your streaming pipeline — Kafka, Kinesis, or Flink — that calls the IP intelligence API on each event before it lands in the warehouse. Events arrive already enriched, so no batch job is needed.

// Kafka Streams / Node.js: Enrich events before warehouse insert
const { Kafka } = require('kafkajs');
const axios = require('axios');

const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'ip-enricher' });
const producer = kafka.producer();

const API_KEY = process.env.IP_INFO_API_KEY;
const API_BASE = 'https://ip-info.app/api/v1/geolocate';
const cache = new Map(); // In-memory cache for dedup

async function enrichIp(ip) {
  if (cache.has(ip)) return cache.get(ip);
  const { data } = await axios.get(`${API_BASE}/${ip}`, {
    headers: { 'x-api-key': API_KEY },
    timeout: 5000,
  });
  const enriched = {
    countryCode: data.countryCode,
    cityName: data.city?.name,
    asn: data.asn,
    aso: data.aso,
    isVpn: data.isVPN,
    isProxy: data.isProxy,
    threatLevel: data.threatLevel,
    timezone: data.city?.timeZone,
  };
  cache.set(ip, enriched);
  if (cache.size > 100_000) {
    // Evict oldest entries to bound memory
    const firstKey = cache.keys().next().value;
    cache.delete(firstKey);
  }
  return enriched;
}

async function run() {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'raw-events' });

  // kafkajs delivers messages via the eachMessage handler,
  // not async iteration over the consumer.
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value.toString());
      if (event.ipAddress) {
        try {
          event.ipIntel = await enrichIp(event.ipAddress);
        } catch {
          event.ipIntel = { error: 'enrichment_failed' };
        }
      }
      await producer.send({
        topic: 'enriched-events',
        messages: [{ value: JSON.stringify(event) }],
      });
    },
  });
}

run().catch(console.error);

When to use: Real-time dashboards, fraud operations that need current data, or when your warehouse serves operational queries that should not wait for a batch cycle. Adds latency to the ingestion pipeline (one API call per unique IP), so cache aggressively.
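The plain Map cache above evicts only by insertion order and never expires entries, so a long-running enricher can keep serving stale intelligence. A sketch of a small LRU cache with a TTL (the capacity and 24-hour TTL defaults are illustrative, not API requirements):

```javascript
// LRU cache with TTL. A Map's iteration order doubles as recency order,
// because deleting and re-inserting a key moves it to the end.
class TtlLruCache {
  constructor(maxSize = 100_000, ttlMs = 24 * 60 * 60 * 1000) {
    this.maxSize = maxSize;
    this.ttlMs = ttlMs;
    this.map = new Map();
  }

  get(key) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() - entry.storedAt > this.ttlMs) {
      this.map.delete(key); // expired: force a fresh API call
      return undefined;
    }
    // Refresh recency by re-inserting at the end.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, { value, storedAt: Date.now() });
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first in iteration order).
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

To swap this into enrichIp, replace the `cache.has(ip)` check with a `cache.get(ip)` that treats undefined as a miss; memory and staleness are then both bounded.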

3. Query-Time Enrichment: Call the API from SQL (No Data Movement)

Some platforms support external functions or UDFs that call HTTP APIs during query execution. Snowflake External Functions, BigQuery Remote Functions, and Redshift Lambda UDFs all support this pattern. The enrichment happens at query time — no ETL, no batch schedule, no data movement.

-- Snowflake: External Function for IP geolocation
-- (Requires API Gateway or Azure Function as proxy)
CREATE OR REPLACE EXTERNAL FUNCTION
  enrich_ip(ip_address VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = ip_info_api
  AS 'https://your-gateway.example.com/ip-enrich';

-- Usage: enrich IPs at query time
SELECT
  e.event_id,
  e.ip_address,
  enrich_ip(e.ip_address):countryCode::STRING AS country,
  enrich_ip(e.ip_address):cityName::STRING AS city,
  enrich_ip(e.ip_address):asn::INTEGER AS asn,
  enrich_ip(e.ip_address):aso::STRING AS isp,
  enrich_ip(e.ip_address):isVpn::BOOLEAN AS is_vpn,
  enrich_ip(e.ip_address):threatLevel::STRING AS threat
FROM analytics.events e
WHERE e.event_date >= DATEADD(day, -7, CURRENT_DATE())
  AND enrich_ip(e.ip_address):isVpn::BOOLEAN = TRUE;

-- For production: cache results in a temp table
-- to avoid repeated API calls on the same IP
CREATE TEMPORARY TABLE ip_cache AS
SELECT DISTINCT ip_address, enrich_ip(ip_address) AS intel
FROM analytics.events
WHERE event_date >= DATEADD(day, -7, CURRENT_DATE());

When to use: Ad-hoc analysis where you do not want to maintain an ETL pipeline, or when you need to enrich a small subset of IPs on demand. Not suitable for large-scale recurring queries — the per-row API call cost and latency make batch or stream patterns more efficient at scale.

What You Can Build With Enriched IP Data

Once IP intelligence lives in your warehouse, it powers queries across fraud, analytics, compliance, and operations. Here are the highest-value use cases teams actually ship:

Fraud analysis by VPN and proxy status

Group transactions, login attempts, and account changes by VPN flag, proxy status, and threat level. Identify which regions, ISPs, and ASN ranges produce the most suspicious activity.

Geographic conversion and revenue analysis

Segment conversion rates, average order value, and churn by country, city, and timezone. Enriched IP data gives you geographic analytics without client-side tracking or cookies.

Network and ISP intelligence

Understand which ISPs and carriers your users connect through. Identify patterns like high churn from mobile carriers in specific regions or fraud clusters from particular ASN ranges.

Compliance and regulatory reporting

Filter transactions by country for PSD3 reporting, GDPR data residency checks, or sanctions screening. Build automated reports that map IP-based geographic data to regulatory requirements.

Suspicious traffic clustering

Use ASN, ISP, and threat-level fields to detect coordinated attacks. A sudden spike from a single ASN range, or multiple accounts sharing the same VPN exit node, signals organized fraud.
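The clustering described above can start as a simple aggregation over enriched events. A sketch in Node, assuming each event carries the ipIntel.asn field produced by the stream enricher; the share threshold is an arbitrary illustration, not a recommended cutoff:

```javascript
// Flag ASNs that account for an outsized share of recent events,
// a crude but useful first signal for coordinated traffic.
function findAsnSpikes(events, shareThreshold = 0.2) {
  const counts = new Map();
  for (const e of events) {
    const asn = e.ipIntel?.asn;
    if (asn == null) continue; // skip unenriched events
    counts.set(asn, (counts.get(asn) ?? 0) + 1);
  }
  const total = events.length;
  const spikes = [];
  for (const [asn, count] of counts) {
    const share = count / total;
    if (share >= shareThreshold) spikes.push({ asn, count, share });
  }
  // Largest clusters first.
  return spikes.sort((a, b) => b.count - a.count);
}
```

In production this would compare each ASN's share against its own historical baseline rather than a fixed threshold, but the aggregation shape is the same.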

Content localization analytics

Measure how users in different regions interact with localized content. Compare engagement metrics across countries and timezones to optimize content strategy and regional campaigns.

dbt Transformation: Turning Raw Enrichment Into Analytics-Ready Models

Raw API response data needs cleaning before analysts can use it reliably. Here is a practical dbt model that transforms enriched IP data into a clean dimension table:

-- models/ip_intelligence_clean.sql
-- Cleans and deduplicates the raw enrichment data
WITH raw AS (
  SELECT * FROM {{ source('ip_intelligence', 'raw_enrichment') }}
),

deduplicated AS (
  SELECT
    ip_address,
    -- Country: prefer non-null, standardize codes
    COALESCE(NULLIF(country_code, ''), 'UNKNOWN') AS country_code,
    -- City: lowercase for consistency
    LOWER(NULLIF(city_name, '')) AS city_name,
    -- Coordinates: validate ranges
    CASE
      WHEN latitude BETWEEN -90 AND 90
        AND longitude BETWEEN -180 AND 180
      THEN latitude
      ELSE NULL
    END AS latitude,
    CASE
      WHEN latitude BETWEEN -90 AND 90
        AND longitude BETWEEN -180 AND 180
      THEN longitude
      ELSE NULL
    END AS longitude,
    -- Network intelligence
    asn,
    aso AS isp_name,
    -- Threat classification
    COALESCE(is_vpn, FALSE) AS is_vpn,
    COALESCE(is_proxy, FALSE) AS is_proxy,
    COALESCE(is_tor, FALSE) AS is_tor,
    COALESCE(NULLIF(threat_level, ''), 'low') AS threat_level,
    -- Metadata
    timezone,
    accuracy_radius,
    -- Keep the most recent enrichment
    ROW_NUMBER() OVER (
      PARTITION BY ip_address
      ORDER BY enriched_at DESC
    ) AS rn
  FROM raw
)

SELECT
  ip_address,
  country_code,
  city_name,
  latitude,
  longitude,
  asn,
  isp_name,
  is_vpn,
  is_proxy,
  is_tor,
  threat_level,
  timezone,
  accuracy_radius
FROM deduplicated
WHERE rn = 1
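The model above pairs naturally with dbt schema tests, so bad enrichment data fails the build instead of silently skewing dashboards. A sketch of the accompanying schema.yml; the model and column names match the SQL above, but the accepted threat_level values are an assumption about the API's vocabulary:

```yaml
version: 2

models:
  - name: ip_intelligence_clean
    description: Deduplicated, validated IP intelligence dimension
    columns:
      - name: ip_address
        tests:
          - unique
          - not_null
      - name: country_code
        tests:
          - not_null
      - name: threat_level
        tests:
          - accepted_values:
              values: ['low', 'medium', 'high']  # assumed vocabulary
```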

Choosing the Right Pattern

Each pattern has trade-offs. Use this comparison to pick the approach that fits your volume, latency, and cost constraints:

Criterion | Batch | Stream | Query-Time
Data freshness | Hourly to daily | Real-time (seconds) | Always current
Cost efficiency | Best (deduplicates naturally) | Good (needs in-memory cache) | Worst (repeated API calls)
Infrastructure complexity | Low (SQL + cron) | High (streaming stack) | Medium (external function setup)
Best for | Historical analysis, reporting | Real-time dashboards, fraud ops | Ad-hoc investigation
Scale limit | Millions of IPs per cycle | Thousands per second (cached) | Hundreds per query

Start Enriching Your Warehouse With IP Intelligence

Test the API with a sample of IPs from your production data. The live demo returns country, city, ASN, ISP, VPN status, and threat level — the same fields you would load into your warehouse.