Databricks
Customers on an Enterprise or Growth plan can access Data Pipeline as an add-on package. See our pricing page for more details.
Export your Mixpanel data to Databricks using Unity Catalog Managed Volumes. This integration supports all Databricks clouds (AWS, GCP, and Azure).
Design
Mixpanel exports data to customer’s Databricks workspace using Unity Catalog Managed Volumes. We first load the data into a single-column raw (VARIANT type) data table, then create a view to expose all properties as typed columns.
Supported Features
- Cross-cloud Support: Works with Databricks on AWS, GCP, or Azure
- Date Partitioning: Raw tables are clustered by
event_datefor efficient date-based queries - Static IP Addresses: Supports IP allowlisting for secure connections
IP Allowlist
Mixpanel Data Pipelines supports static IP addresses for Databricks connections when IP restrictions are configured on your Databricks workspace. If you are using network policies to restrict access to your instance, you might need to add the following IP addresses to the allowed list:
US
34.31.112.201
35.184.21.33
35.225.176.74EU
34.147.68.192
35.204.164.122
35.204.177.251Prerequisites
Before setting up the integration, ensure you have:
- A Databricks workspace with Unity Catalog enabled
- Databricks Runtime 15.3+ (for VARIANT type support)
- A SQL Warehouse (Serverless recommended for best performance and cost)
- Admin permissions in your Databricks workspace to create Service Principals
Set Export Permissions
Step 1: Create a Service Principal
A Service Principal is a Databricks identity that Mixpanel will use to access your workspace.
- In your Databricks workspace, navigate to Settings → Identity and access → Service principals
- Click Add service principal
- Enter a name (e.g.,
mixpanel-export) - Click Add
- Note the Application ID (Client ID) - you’ll need this later
Step 2: Generate OAuth Secret
- Click on the Service Principal you just created
- Navigate to the Secrets tab
- Click Generate secret
- Copy the Secret value immediately - it won’t be shown again
- Store it securely - you’ll need it for Mixpanel configuration
Step 3: Create Catalog and Schema
Create a dedicated catalog and schema for Mixpanel data (or use existing ones).
CREATE CATALOG IF NOT EXISTS mixpanel_export
COMMENT 'Mixpanel data exports';
-- Create schema
CREATE SCHEMA IF NOT EXISTS mixpanel_export.json_pipelines;Step 4: Grant Permissions to Service Principal
Grant the Service Principal required permissions to operate within the catalog.
GRANT USE CATALOG ON CATALOG mixpanel_export
TO `<service-principal-application-id>`;
GRANT USE SCHEMA ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;
GRANT CREATE TABLE ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;
GRANT CREATE VOLUME ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;For Metastore Privilege Version 1.1+ only, additionally grant:
GRANT CREATE VIEW ON SCHEMA mixpanel_export.json_pipelines
TO `<service-principal-application-id>`;To check your metastore privilege version:
DESCRIBE METASTORE;Look for PrivilegeModelVersion:
- Version 1.0:
CREATE TABLEcovers both tables and views - Version 1.1+: Separate
CREATE VIEWprivilege required
Why These Permissions?
USE CATALOG: Required to access the catalogUSE SCHEMA: Required to access objects in the schemaCREATE TABLE: Create raw tables to store event dataCREATE VIEW: Create views with typed columns (metastore v1.1+ only)CREATE VOLUME: Create temporary volumes for uploading files
Note: The Service Principal automatically becomes the owner of all objects it creates (tables, views, volumes), which grants it full permissions (SELECT, MODIFY, READ FILES, WRITE FILES, etc.) on those objects. No additional grants needed.
Step 5: Grant SQL Warehouse Access
The Service Principal needs permission to use the SQL Warehouse to execute queries.
- In your Databricks workspace, navigate to SQL Warehouses
- Click on your SQL Warehouse (or create one if needed)
- Go to the Permissions tab
- Click Add or Grant permissions
- Search for your Service Principal by Application ID (from Step 1)
- Select permission level: Can use (minimum required)
- Click Save
Why this is needed: SQL Warehouses have their own access control separate from data permissions. Even with all catalog/schema/table permissions, the Service Principal cannot execute queries without warehouse access.
Step 6: Get SQL Warehouse Connection Details
- In the same SQL Warehouse, go to the Connection details tab
- Note the following values:
- Server hostname: e.g.,
abc123.cloud.databricks.com - HTTP Path: e.g.,
/sql/1.0/warehouses/xyz789
- Server hostname: e.g.,
Recommendation: Use a Serverless SQL Warehouse for:
- Fast startup (~3 seconds)
- Auto-scaling
- Pay-per-use pricing
- No idle cluster costs
Step 7: Configure Mixpanel Integration
Refer to Step 2: Creating the Pipeline to create data pipeline via UI. You’ll need to provide:
- Server Hostname (from Step 6)
- HTTP Path (from Step 6)
- Catalog (from Step 3, e.g.,
mixpanel_export) - Schema (from Step 3, e.g.,
json_pipelines) - Service Principal ID (Application ID from Step 1)
- Service Principal Secret (from Step 2)
Partitioning
Raw tables are clustered by the event_date column, which is computed in your project’s timezone during data load. This clustering significantly improves query performance when filtering by date.
Data Schema
Mixpanel creates a raw table and a view with typed columns:
Raw Table Columns (mp_master_event_raw):
DATA(VARIANT) - Contains the complete event JSONevent_date(DATE) - Computed from event time in your project’s timezone
View Columns (mp_master_event):
user_id(STRING)time(TIMESTAMP)properties(VARIANT) - All event properties as semi-structured datainsert_id(STRING)event_name(STRING)distinct_id(STRING)device_id(STRING)event_date(DATE)
Queries
Databricks supports a VARIANT type that can store semi-structured JSON data. Use the :: syntax to extract and cast properties from VARIANT columns.
Basic event query
SELECT count(*)
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'sign up';Query nested properties
SELECT
event_name,
time,
distinct_id,
properties:button_name::STRING AS button_name,
properties:product:category::STRING AS product_category,
properties:price::DOUBLE AS price
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_name = 'Purchase'
AND properties:product:category::STRING = 'Electronics';Efficient date filtering
Use the event_date column for best performance:
SELECT
event_name,
time,
distinct_id,
properties:button_name::STRING AS button_name
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
AND event_name = 'Purchase'
ORDER BY time DESC;Getting the number of events per day
SELECT
event_date,
count(*) as event_count
FROM mixpanel_export.json_pipelines.mp_master_event
WHERE event_date >= '2024-01-01' AND event_date <= '2024-01-31'
GROUP BY event_date
ORDER BY event_date;Costs
- Delta tables: Billed by your cloud provider (AWS S3, GCP GCS, or Azure ADLS) via Databricks
- Managed volumes: Temporary storage cleaned up after each export
- Compute: SQL Warehouse usage during COPY INTO operations
Was this page useful?