JDBC
Important Capabilities
Capability | Status | Notes |
---|---|---|
Asset Containers | ✅ | Enabled by default |
Detect Deleted Entities | ✅ | Optionally enabled via stateful_ingestion.remove_stale_metadata |
Platform Instance | ✅ | Enabled by default |
Table-Level Lineage | ✅ | Enabled by default |
The JDBC source plugin provides comprehensive metadata extraction capabilities for JDBC-compliant databases. It supports:
- Extraction of database structure (tables, views, columns)
- Schema metadata including data types and constraints
- View definitions and dependencies
- Stored procedures (optional)
- SSL connections with certificate management
- Maven-based driver management
- Flexible pattern matching for schema/table filtering
The plugin uses Java Database Connectivity (JDBC) APIs through JPype and JayDeBeApi, allowing it to support any database with a JDBC driver. It handles connection pooling, retries, and proper resource cleanup to ensure reliable metadata extraction.
Setup
This integration pulls metadata from databases via JDBC connections. It supports various database systems through their respective JDBC drivers.
You'll need:
- A running database instance that you want to connect to
- The appropriate JDBC driver for your database
- Valid credentials with permissions to read metadata
Steps to Get Started
JDBC Driver Setup:
- Option 1: Download the JDBC driver JAR file for your database
- Option 2: Use Maven coordinates to automatically download the driver
Permissions Required:
- READ access to system catalogs/metadata views
- Ability to execute metadata queries
- Access to relevant schemas and tables
Connection Information:
- JDBC connection URL
- Username and password (if using basic authentication)
- SSL configuration (if required)
- Any additional JDBC properties needed for your specific database
CLI based Ingestion
Install the Plugin
The jdbc
source works out of the box with acryl-datahub
.
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
type: jdbc
config:
# JDBC Driver Configuration
driver:
driver_class: org.postgresql.Driver # Replace with your database's driver class
# Either specify maven_coordinates or driver_path
maven_coordinates: org.postgresql:postgresql:42.7.1
# driver_path: "/path/to/driver.jar"
# Connection Configuration
connection:
uri: "jdbc:postgresql://localhost:5432//mydb" # Replace with your database URI
username: "user"
password: "pass"
# Optional SSL Configuration
ssl_config:
cert_path: "/path/to/cert"
# cert_type: "pem" # pem, jks, or p12
# cert_password: ""
# Additional JDBC Properties
properties:
applicationName: "datahub_jdbc_ingestion"
# Additional JVM Arguments
jvm_args:
- "-Xmx1g"
- "-Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts"
# Optional: SQL dialect for query parsing
sqlglot_dialect: "postgres" # Replace with your database's dialect
# Optional Filters
schema_pattern:
allow:
- "schema1"
- "schema2"
table_pattern:
allow:
- "schema1.table1"
deny:
- "schema1.temp_.*"
# Feature flags
include_tables: true
include_views: true
include_stored_procedures: false
sink:
# sink configs
Config Details
- Options
- Schema
Note that a .
is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
platform ✅ string | Name of platform being ingested. |
connection ✅ JDBCConnectionConfig | Database connection configuration |
connection.uri ❓ string | JDBC URI (jdbc:protocol://host:port/database) |
connection.password string | Database password |
connection.properties map(str,string) | |
connection.username string | Database username |
connection.ssl_config SSLConfig | SSL configuration |
connection.ssl_config.cert_content string | Base64 encoded certificate content |
connection.ssl_config.cert_password string | Certificate password if required |
connection.ssl_config.cert_path string | Path to SSL certificate file |
connection.ssl_config.cert_type string | Certificate type (pem, jks, p12) Default: pem |
driver ✅ JDBCDriverConfig | JDBC driver configuration |
driver.driver_class ❓ string | Fully qualified JDBC driver class name |
driver.driver_path string | Path to JDBC driver JAR |
driver.maven_coordinates string | Maven coordinates (groupId:artifactId:version) |
include_stored_procedures boolean | Include stored procedures in extraction Default: False |
include_tables boolean | Include tables in extraction Default: True |
include_views boolean | Include views in extraction Default: True |
platform_instance string | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details. |
profile_table_row_counts boolean | Include table row counts Default: True |
profile_table_size boolean | Include table size metrics Default: False |
sqlglot_dialect string | sqlglot dialect to use for SQL transpiling |
env string | The environment that all assets produced by this connector belong to Default: PROD |
jvm_args array | JVM arguments for JDBC driver Default: [] |
jvm_args.string string | |
schema_pattern AllowDenyPattern | Regex patterns for schemas Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
schema_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
schema_pattern.allow.string string | |
schema_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
schema_pattern.deny.string string | |
table_pattern AllowDenyPattern | Regex patterns for tables Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
table_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
table_pattern.allow.string string | |
table_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
table_pattern.deny.string string | |
usage BaseUsageConfig | Usage statistics configuration Default: {'bucket_duration': 'DAY', 'end_time': '2025-01-02... |
usage.bucket_duration Enum | Size of the time window to aggregate usage stats. Default: DAY |
usage.end_time string(date-time) | Latest date of lineage/usage to consider. Default: Current time in UTC |
usage.format_sql_queries boolean | Whether to format sql queries Default: False |
usage.include_operational_stats boolean | Whether to display operational stats. Default: True |
usage.include_read_operational_stats boolean | Whether to report read operational stats. Experimental. Default: False |
usage.include_top_n_queries boolean | Whether to ingest the top_n_queries. Default: True |
usage.start_time string(date-time) | Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration ). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'. |
usage.top_n_queries integer | Number of top queries to save to each table. Default: 10 |
usage.user_email_pattern AllowDenyPattern | regex patterns for user emails to filter in usage. Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
usage.user_email_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
usage.user_email_pattern.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
usage.user_email_pattern.allow.string string | |
usage.user_email_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
usage.user_email_pattern.deny.string string | |
view_pattern AllowDenyPattern | Regex patterns for views Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
view_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
view_pattern.allow.string string | |
view_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] |
view_pattern.deny.string string | |
stateful_ingestion StatefulStaleMetadataRemovalConfig | Base specialized config for Stateful Ingestion with stale metadata removal capability. |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"title": "JDBCSourceConfig",
"description": "Configuration for JDBC metadata extraction",
"type": "object",
"properties": {
"platform_instance": {
"title": "Platform Instance",
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://datahubproject.io/docs/platform-instances/ for more details.",
"type": "string"
},
"env": {
"title": "Env",
"description": "The environment that all assets produced by this connector belong to",
"default": "PROD",
"type": "string"
},
"stateful_ingestion": {
"$ref": "#/definitions/StatefulStaleMetadataRemovalConfig"
},
"driver": {
"title": "Driver",
"description": "JDBC driver configuration",
"allOf": [
{
"$ref": "#/definitions/JDBCDriverConfig"
}
]
},
"connection": {
"title": "Connection",
"description": "Database connection configuration",
"allOf": [
{
"$ref": "#/definitions/JDBCConnectionConfig"
}
]
},
"platform": {
"title": "Platform",
"description": "Name of platform being ingested.",
"type": "string"
},
"include_tables": {
"title": "Include Tables",
"description": "Include tables in extraction",
"default": true,
"type": "boolean"
},
"include_views": {
"title": "Include Views",
"description": "Include views in extraction",
"default": true,
"type": "boolean"
},
"include_stored_procedures": {
"title": "Include Stored Procedures",
"description": "Include stored procedures in extraction",
"default": false,
"type": "boolean"
},
"sqlglot_dialect": {
"title": "Sqlglot Dialect",
"description": "sqlglot dialect to use for SQL transpiling",
"type": "string"
},
"jvm_args": {
"title": "Jvm Args",
"description": "JVM arguments for JDBC driver",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"schema_pattern": {
"title": "Schema Pattern",
"description": "Regex patterns for schemas",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"table_pattern": {
"title": "Table Pattern",
"description": "Regex patterns for tables",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"view_pattern": {
"title": "View Pattern",
"description": "Regex patterns for views",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"profile_table_size": {
"title": "Profile Table Size",
"description": "Include table size metrics",
"default": false,
"type": "boolean"
},
"profile_table_row_counts": {
"title": "Profile Table Row Counts",
"description": "Include table row counts",
"default": true,
"type": "boolean"
},
"usage": {
"title": "Usage",
"description": "Usage statistics configuration",
"default": {
"bucket_duration": "DAY",
"end_time": "2025-01-02T14:05:49.832947+00:00",
"start_time": "2025-01-01T00:00:00+00:00",
"queries_character_limit": 24000,
"top_n_queries": 10,
"user_email_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"include_operational_stats": true,
"include_read_operational_stats": false,
"format_sql_queries": false,
"include_top_n_queries": true
},
"allOf": [
{
"$ref": "#/definitions/BaseUsageConfig"
}
]
}
},
"required": [
"driver",
"connection",
"platform"
],
"additionalProperties": false,
"definitions": {
"DynamicTypedStateProviderConfig": {
"title": "DynamicTypedStateProviderConfig",
"type": "object",
"properties": {
"type": {
"title": "Type",
"description": "The type of the state provider to use. For DataHub use `datahub`",
"type": "string"
},
"config": {
"title": "Config",
"description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).",
"default": {},
"type": "object"
}
},
"required": [
"type"
],
"additionalProperties": false
},
"StatefulStaleMetadataRemovalConfig": {
"title": "StatefulStaleMetadataRemovalConfig",
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"type": "object",
"properties": {
"enabled": {
"title": "Enabled",
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"default": false,
"type": "boolean"
},
"remove_stale_metadata": {
"title": "Remove Stale Metadata",
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
},
"JDBCDriverConfig": {
"title": "JDBCDriverConfig",
"description": "JDBC Driver configuration",
"type": "object",
"properties": {
"driver_class": {
"title": "Driver Class",
"description": "Fully qualified JDBC driver class name",
"type": "string"
},
"driver_path": {
"title": "Driver Path",
"description": "Path to JDBC driver JAR",
"type": "string"
},
"maven_coordinates": {
"title": "Maven Coordinates",
"description": "Maven coordinates (groupId:artifactId:version)",
"type": "string"
}
},
"required": [
"driver_class"
],
"additionalProperties": false
},
"SSLConfig": {
"title": "SSLConfig",
"description": "SSL Certificate configuration",
"type": "object",
"properties": {
"cert_path": {
"title": "Cert Path",
"description": "Path to SSL certificate file",
"type": "string"
},
"cert_content": {
"title": "Cert Content",
"description": "Base64 encoded certificate content",
"type": "string"
},
"cert_type": {
"title": "Cert Type",
"description": "Certificate type (pem, jks, p12)",
"default": "pem",
"type": "string"
},
"cert_password": {
"title": "Cert Password",
"description": "Certificate password if required",
"type": "string"
}
},
"additionalProperties": false
},
"JDBCConnectionConfig": {
"title": "JDBCConnectionConfig",
"description": "JDBC Connection configuration",
"type": "object",
"properties": {
"uri": {
"title": "Uri",
"description": "JDBC URI (jdbc:protocol://host:port/database)",
"type": "string"
},
"username": {
"title": "Username",
"description": "Database username",
"type": "string"
},
"password": {
"title": "Password",
"description": "Database password",
"type": "string"
},
"properties": {
"title": "Properties",
"description": "Additional JDBC properties",
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"ssl_config": {
"title": "Ssl Config",
"description": "SSL configuration",
"allOf": [
{
"$ref": "#/definitions/SSLConfig"
}
]
}
},
"required": [
"uri"
],
"additionalProperties": false
},
"AllowDenyPattern": {
"title": "AllowDenyPattern",
"description": "A class to store allow deny regexes",
"type": "object",
"properties": {
"allow": {
"title": "Allow",
"description": "List of regex patterns to include in ingestion",
"default": [
".*"
],
"type": "array",
"items": {
"type": "string"
}
},
"deny": {
"title": "Deny",
"description": "List of regex patterns to exclude from ingestion.",
"default": [],
"type": "array",
"items": {
"type": "string"
}
},
"ignoreCase": {
"title": "Ignorecase",
"description": "Whether to ignore case sensitivity during pattern matching.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
},
"BucketDuration": {
"title": "BucketDuration",
"description": "An enumeration.",
"enum": [
"DAY",
"HOUR"
],
"type": "string"
},
"BaseUsageConfig": {
"title": "BaseUsageConfig",
"type": "object",
"properties": {
"bucket_duration": {
"description": "Size of the time window to aggregate usage stats.",
"default": "DAY",
"allOf": [
{
"$ref": "#/definitions/BucketDuration"
}
]
},
"end_time": {
"title": "End Time",
"description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
"type": "string",
"format": "date-time"
},
"start_time": {
"title": "Start Time",
"description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
"type": "string",
"format": "date-time"
},
"top_n_queries": {
"title": "Top N Queries",
"description": "Number of top queries to save to each table.",
"default": 10,
"exclusiveMinimum": 0,
"type": "integer"
},
"user_email_pattern": {
"title": "User Email Pattern",
"description": "regex patterns for user emails to filter in usage.",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"allOf": [
{
"$ref": "#/definitions/AllowDenyPattern"
}
]
},
"include_operational_stats": {
"title": "Include Operational Stats",
"description": "Whether to display operational stats.",
"default": true,
"type": "boolean"
},
"include_read_operational_stats": {
"title": "Include Read Operational Stats",
"description": "Whether to report read operational stats. Experimental.",
"default": false,
"type": "boolean"
},
"format_sql_queries": {
"title": "Format Sql Queries",
"description": "Whether to format sql queries",
"default": false,
"type": "boolean"
},
"include_top_n_queries": {
"title": "Include Top N Queries",
"description": "Whether to ingest the top_n_queries.",
"default": true,
"type": "boolean"
}
},
"additionalProperties": false
}
}
}
Starter Recipe
source:
type: jdbc
config:
# JDBC Driver Configuration
driver:
driver_class: org.postgresql.Driver # Replace with your database's driver class
# Either specify maven_coordinates or driver_path
maven_coordinates: org.postgresql:postgresql:42.7.1
# driver_path: "/path/to/driver.jar"
# Connection Configuration
connection:
uri: "jdbc:postgresql://localhost:5432//mydb" # Replace with your database URI
username: "user"
password: "pass"
# Optional SSL Configuration
ssl_config:
cert_path: "/path/to/cert"
# cert_type: "pem" # pem, jks, or p12
# cert_password: ""
# Additional JDBC Properties
properties:
applicationName: "datahub_jdbc_ingestion"
# Additional JVM Arguments
jvm_args:
- "-Xmx1g"
- "-Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts"
# Optional: SQL dialect for query parsing
sqlglot_dialect: "postgres" # Replace with your database's dialect
# Optional Filters
schema_pattern:
allow:
- "schema1"
- "schema2"
table_pattern:
allow:
- "schema1.table1"
deny:
- "schema1.temp_.*"
# Feature flags
include_tables: true
include_views: true
include_stored_procedures: false
sink:
# sink configs
Code Coordinates
- Class Name:
datahub.ingestion.source.jdbc.source.JDBCSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for JDBC, feel free to ping us on our Slack.