# Datasets
A Spicepod can contain one or more datasets referenced by relative path or defined inline.
Inline example:
spicepod.yaml

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb / sqlite / postgres
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
spicepod.yaml

```yaml
datasets:
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
Relative path example:
spicepod.yaml

```yaml
datasets:
  - ref: datasets/taxi_trips
```

datasets/taxi_trips/dataset.yaml

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```
## from

The `from` field is a string that represents the Uniform Resource Identifier (URI) for the dataset. This URI is composed of three parts: a prefix indicating the Data Connector to use to connect to the dataset, a delimiter, and the path to the dataset within the source.
The syntax for the `from` field is as follows:

```yaml
from: <data_connector>:<path>
# OR
from: <data_connector>/<path>
# OR
from: <data_connector>://<path>
```
Where:

- `<data_connector>`: The Data Connector to use to connect to the dataset. Currently supported data connectors: `spiceai`, `dremio`, `spark`, `databricks`, `s3`, `postgres`, `mysql`, `flightsql`, `snowflake`, `ftp`, `sftp`, `http`, `https`, `clickhouse`, `graphql`. If the Data Connector is not explicitly specified, it defaults to `spiceai`.
- `<delimiter>`: The delimiter between the Data Connector and the path. Currently supported delimiters are `:`, `/`, and `://`. Some connectors place additional restrictions on the allowed delimiters to better conform to the expected syntax of the underlying data source, e.g. `s3://` is the only supported delimiter for the `s3` connector.
- `<path>`: The path to the dataset within the source.
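To make the delimiters concrete, here are the three forms side by side, using the example URIs that appear elsewhere in this document:

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips # '/' delimiter
from: databricks:spiceai.datasets.specific_table # ':' delimiter
from: s3://my_bucket/my_dataset/ # '://' delimiter, the only form supported by the s3 connector
```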
## ref

An alternative to adding the dataset definition inline in the spicepod.yaml file. `ref` can be used to point to a directory with a dataset defined in a `dataset.yaml` file. For example, a dataset configured in a `dataset.yaml` file in the "datasets/sample" directory can be referenced with the following:
dataset.yaml

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```

ref used in spicepod.yaml

```yaml
version: v1
kind: Spicepod
name: duckdb
datasets:
  - ref: datasets/sample
```
## name
The name of the dataset. Used to reference the dataset in the pod manifest, as well as in external data sources.
## description
The description of the dataset. Used as part of the Semantic Data Model.
## time_column
Optional. The name of the column that represents the temporal (time) ordering of the dataset.
Required to enable a retention policy on the dataset.
## time_format

Optional. The format of the time_column. The following values are supported:

- `timestamp` - Default. Timestamp without a timezone, e.g. `2016-06-22 19:10:25`, with data type `timestamp`.
- `timestamptz` - Timestamp with a timezone, e.g. `2016-06-22 19:10:25-07`, with data type `timestamptz`.
- `unix_seconds` - Unix timestamp in seconds, e.g. `1718756687`.
- `unix_millis` - Unix timestamp in milliseconds, e.g. `1718756687000`.
- `ISO8601` - ISO 8601 format.
- `date` - Date in `YYYY-MM-DD` format, e.g. `2024-01-01`.

Spice emits a warning if the time_column from the data source is incompatible with the time_format config. String-based columns are assumed to be in ISO 8601 format.
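A minimal sketch of how time_column and time_format fit together (the source path and column name here are hypothetical):

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # hypothetical column holding the event time
    time_format: unix_seconds # values like 1718756687
```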
## time_partition_column

Optional. The column that represents the physical partitioning of the dataset when using append-based acceleration. When the defined time_column is a fine-grained timestamp and the dataset is physically partitioned by a coarser granularity (for example, by date), setting time_partition_column to the partition column (e.g. date_col) improves partition pruning, excludes irrelevant partitions during refreshes, and optimizes scan efficiency.
## time_partition_format

Optional. The format of the time_partition_column. For instance, if the physical partitions follow a date format (`YYYY-MM-DD`), set this value to `date`. The same format options as time_format are supported.
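A sketch of a partitioned, append-accelerated dataset, assuming a hypothetical table with a fine-grained created_at timestamp that is physically partitioned by a date_col column:

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # fine-grained timestamp
    time_partition_column: date_col # coarser physical partition column
    time_partition_format: date # partitions follow YYYY-MM-DD
    acceleration:
      enabled: true
      refresh_mode: append
```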
## unsupported_type_action

Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered. The following values are supported:

- `error` - Default. Return an error when an unsupported data type is encountered.
- `warn` - Log a warning and ignore the column containing the unsupported data type.
- `ignore` - Log nothing and ignore the column containing the unsupported data type.
- `string` - Attempt to convert the unsupported data type to a string. Currently only supports converting the PostgreSQL JSONB type.
Not all connectors support specifying an unsupported_type_action. When specified on a connector that does not support the option, the connector will fail to register. The following connectors support unsupported_type_action:
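As a sketch, assuming a Postgres table containing a JSONB column (the schema and table names here are hypothetical), the `string` action would be configured as:

```yaml
datasets:
  - from: postgres:my_schema.my_table # hypothetical table with a JSONB column
    name: my_table
    unsupported_type_action: string # convert JSONB values to strings instead of erroring
```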
## ready_state

Supports one of two values:

- `on_registration` - Mark the dataset as ready immediately; queries on this table will fall back to the underlying source directly until the initial acceleration is complete.
- `on_load` - Mark the dataset as ready only after the initial acceleration. Queries against the dataset will return an error before the load has been completed.
```yaml
datasets:
  - from: s3://my_bucket/my_dataset/
    name: my_dataset
    ready_state: on_registration # or on_load
    params: ...
    acceleration:
      enabled: true
```
## acceleration

Optional. Accelerate queries to the dataset by caching data locally.
### acceleration.enabled

Enable or disable acceleration. Defaults to `true`.
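For example, acceleration can be disabled so queries are served by the underlying source directly (a minimal sketch using the quickstart dataset):

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: false # queries go to the remote source
```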
### acceleration.engine

The acceleration engine to use, defaults to `arrow`. The following engines are supported:

- `arrow` - Accelerated in-memory backed by Apache Arrow DataTables.
- `duckdb` - Accelerated by an embedded DuckDB database.
- `postgres` - Accelerated by a Postgres database.
- `sqlite` - Accelerated by an embedded SQLite database.
### acceleration.mode

Optional. The mode of acceleration. The following values are supported:

- `memory` - Store acceleration data in-memory.
- `file` - Store acceleration data in a file. Only supported for the `duckdb` and `sqlite` acceleration engines.

mode is currently only supported for the `duckdb` and `sqlite` engines.
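A sketch combining the two settings to persist the acceleration in an embedded DuckDB database file:

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      engine: duckdb # embedded DuckDB database
      mode: file # persist acceleration data to a file instead of memory
```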
### acceleration.refresh_mode

Optional. How to refresh the dataset. The following values are supported:

- `full` - Refresh the entire dataset.
- `append` - Append new data to the dataset. When `time_column` is specified, new records are fetched from the latest timestamp in the accelerated data at the `acceleration.refresh_check_interval`.
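For example, an append-based refresh that checks for new rows every hour, assuming a hypothetical created_at time column:

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # new records are fetched from the latest accelerated timestamp
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 1h
```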