Overview
Pipeviz uses a simple JSON configuration to define your data lineage. Only pipelines are required: clusters and datasources are auto-created when referenced, but you can define them explicitly to add richer descriptions and better documentation.
Root Structure
{
  "clusters": [ ... ],     // Optional: cluster definitions
  "pipelines": [ ... ],    // Required: pipeline definitions
  "datasources": [ ... ]   // Optional: rich data source definitions
}
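As a minimal sketch of the smallest configuration this root structure allows, the Python snippet below builds a config containing only the required pipelines array and serialises it to JSON. The variable names are illustrative only. The // comments in the examples on this page are annotations; standard JSON parsers reject comments, so it is safest to leave them out of a real file.

```python
import json

# Smallest config the root structure allows: only "pipelines" is present,
# and each pipeline carries just its "name".
minimal_config = {
    "pipelines": [
        {"name": "user-enrichment"},
    ]
}

# Serialise to strict JSON (no // comments, which json.loads would reject).
config_text = json.dumps(minimal_config, indent=2)
print(config_text)

# Round-trip to confirm the generated text parses as valid JSON.
assert json.loads(config_text) == minimal_config
```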
Clusters (Optional)
Clusters can be declared upfront or referenced on the fly. They support nested hierarchies through the parent field.
{
  "name": "real-time",
  "description": "Real-time processing cluster", // Optional
  "parent": "order-management"                   // Optional: creates nested cluster
}
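To illustrate how parent relationships form a hierarchy, here is a small Python sketch that walks the parent field to recover a cluster's full nesting path. The helper cluster_path is hypothetical and not part of Pipeviz. It assumes that a cluster referenced but never declared simply has no parent.

```python
def cluster_path(name, clusters):
    """Return cluster names from the outermost ancestor down to `name`."""
    parents = {c["name"]: c.get("parent") for c in clusters}
    path, seen = [], set()
    current = name
    while current is not None:
        if current in seen:
            # Guard against a parent chain that loops back on itself.
            raise ValueError(f"cycle in cluster parents at {current!r}")
        seen.add(current)
        path.append(current)
        # Undeclared clusters have no known parent, so the walk stops here.
        current = parents.get(current)
    return list(reversed(path))


clusters = [
    {"name": "order-management"},
    {"name": "real-time", "parent": "order-management"},
]
print(cluster_path("real-time", clusters))  # ['order-management', 'real-time']
```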
Pipelines
{
  "name": "user-enrichment",
  "description": "Enriches user data...",     // Optional
  "input_sources": ["raw_users", "events"],   // Optional
  "output_sources": ["enriched_users"],       // Optional
  "schedule": "Every 2 hours",                // Optional
  "tags": ["user-data", "ml"],                // Optional
  "cluster": "user-processing",               // Optional: single cluster
  "upstream_pipelines": ["data-ingestion"],   // Optional
  "links": {                                  // Optional
    "airflow": "https://...",
    "monitoring": "https://..."
  }
}
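One way to read the lineage fields above is as directed edges: each input source points to the pipeline, the pipeline points to each output source, and each upstream pipeline points to the pipeline. The Python sketch below derives those edges under that assumption. The function lineage_edges is hypothetical, and how Pipeviz itself draws the graph is not specified here.

```python
def lineage_edges(pipelines):
    """Return (source, target) pairs implied by the pipeline definitions."""
    edges = []
    for p in pipelines:
        name = p["name"]
        # Data flows from each input source into the pipeline.
        for src in p.get("input_sources", []):
            edges.append((src, name))
        # Data flows from the pipeline into each output source.
        for dst in p.get("output_sources", []):
            edges.append((name, dst))
        # An upstream pipeline must run before this one.
        for upstream in p.get("upstream_pipelines", []):
            edges.append((upstream, name))
    return edges


pipelines = [
    {
        "name": "user-enrichment",
        "input_sources": ["raw_users", "events"],
        "output_sources": ["enriched_users"],
        "upstream_pipelines": ["data-ingestion"],
    }
]
for source, target in lineage_edges(pipelines):
    print(f"{source} -> {target}")
```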
Data Sources (Optional)
Data sources are auto-created when referenced in pipelines. Define them explicitly to add rich metadata, ownership, and documentation.
{
  "name": "raw_users",
  "description": "Raw user data...",    // Optional
  "type": "snowflake",                  // Optional
  "owner": "data-team@company.com",     // Optional
  "tags": ["pii", "users"],             // Optional
  "cluster": "user-processing",         // Optional: single cluster
  "metadata": {                         // Optional
    "size": "2.1TB",
    "record_count": "45M"
  },
  "links": {                            // Optional
    "snowflake": "https://...",
    "docs": "https://..."
  }
}
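The auto-creation behaviour described above can be pictured as a merge: every name referenced by a pipeline gets a data source entry, and an explicit definition, when present, supplies the metadata. The Python sketch below models that idea. resolve_datasources is a hypothetical helper, and treating an auto-created source as a bare {"name": ...} record is an assumption, not documented Pipeviz behaviour.

```python
def resolve_datasources(config):
    """Map each data source name to its explicit or auto-created definition."""
    explicit = {d["name"]: d for d in config.get("datasources", [])}
    resolved = {}
    for p in config.get("pipelines", []):
        referenced = p.get("input_sources", []) + p.get("output_sources", [])
        for name in referenced:
            # Prefer the explicit definition; otherwise create a bare entry.
            resolved.setdefault(name, explicit.get(name, {"name": name}))
    # Keep explicit definitions that no pipeline references.
    for name, definition in explicit.items():
        resolved.setdefault(name, definition)
    return resolved


config = {
    "pipelines": [
        {
            "name": "user-enrichment",
            "input_sources": ["raw_users", "events"],
            "output_sources": ["enriched_users"],
        }
    ],
    "datasources": [{"name": "raw_users", "type": "snowflake"}],
}
resolved = resolve_datasources(config)
print(sorted(resolved))
```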
💡 Pro Tips:
- Only the pipelines array, with a name field for each pipeline, is required
- Clusters and datasources are auto-created when referenced in pipelines
- Define datasources explicitly to add rich metadata, ownership, and links
- Use cluster (singular) to assign each node to exactly one cluster
- Create nested clusters using the parent field in cluster definitions
- Use upstream_pipelines to show pipeline dependencies
- Links can point to any external tools (Airflow, monitoring, docs, etc.)
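The first and fourth tips can be checked mechanically before loading a file. The following Python sketch is a hypothetical pre-flight check, not a Pipeviz feature. It verifies that pipelines is an array, that every pipeline has a name, and that cluster is a single string on pipelines and datasources. The error messages are invented for illustration.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    pipelines = config.get("pipelines")
    if not isinstance(pipelines, list):
        errors.append("'pipelines' must be an array")
        pipelines = []
    for i, p in enumerate(pipelines):
        if not isinstance(p, dict) or not p.get("name"):
            errors.append(f"pipelines[{i}] is missing 'name'")
    # "cluster" is singular: each node belongs to at most one cluster.
    for section in ("pipelines", "datasources"):
        nodes = config.get(section)
        if not isinstance(nodes, list):
            continue
        for i, node in enumerate(nodes):
            if not isinstance(node, dict) or "cluster" not in node:
                continue
            if not isinstance(node["cluster"], str):
                errors.append(f"{section}[{i}].cluster must be a single string")
    return errors


print(validate_config({"pipelines": [{"name": "user-enrichment"}]}))  # []
print(validate_config({"pipelines": [{"description": "no name"}]}))
```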