# Datasets
A Spicepod can contain one or more datasets referenced by relative path or defined inline.
Inline example:
spicepod.yaml

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb / sqlite / postgres
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
spicepod.yaml

```yaml
datasets:
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_check_interval: 1h
      refresh_mode: full # / append
```
Relative path example:
spicepod.yaml

```yaml
datasets:
  - ref: datasets/taxi_trips
```

datasets/taxi_trips/dataset.yaml

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```
## from

The `from` field is a string that represents the Uniform Resource Identifier (URI) for the dataset. This URI is composed of three parts: a prefix indicating the Data Connector to use to connect to the dataset, a delimiter, and the path to the dataset within the source.
The syntax for the `from` field is as follows:

```yaml
from: <data_connector>:<path>
# OR
from: <data_connector>/<path>
# OR
from: <data_connector>://<path>
```
Where:

- `<data_connector>`: The Data Connector to use to connect to the dataset. Currently supported data connectors: `spiceai`, `dremio`, `spark`, `databricks`, `s3`, `postgres`, `mysql`, `flightsql`, `snowflake`, `ftp`, `sftp`, `http`, `https`, `clickhouse`, `graphql`. If the Data Connector is not explicitly specified, it defaults to `spiceai`.
- `<delimiter>`: The delimiter between the Data Connector and the path. Currently supported delimiters are `:`, `/`, and `://`. Some connectors place additional restrictions on the allowed delimiters to better conform to the expected syntax of the underlying data source, e.g. `s3://` is the only supported delimiter for the `s3` connector.
- `<path>`: The path to the dataset within the source.
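To make the delimiters concrete, here are the three forms side by side, using the example URIs that appear elsewhere in this document:

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips # '/' delimiter
from: databricks:spiceai.datasets.specific_table # ':' delimiter
from: s3://my_bucket/my_dataset/ # '://' delimiter, the only form supported by the s3 connector
```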
## ref

An alternative to adding the dataset definition inline in the spicepod.yaml file. `ref` can be used to point to a directory with a dataset defined in a `dataset.yaml` file. For example, a dataset configured in a `dataset.yaml` file in the "datasets/sample" directory can be referenced with the following:
dataset.yaml

```yaml
from: spice.ai/spiceai/quickstart/datasets/taxi_trips
name: taxi_trips
type: overwrite
acceleration:
  enabled: true
  refresh: 1h
```

ref used in spicepod.yaml

```yaml
version: v1
kind: Spicepod
name: duckdb
datasets:
  - ref: datasets/sample
```
## name
The name of the dataset. Used to reference the dataset in the pod manifest, as well as in external data sources.
## description
The description of the dataset. Used as part of the Semantic Data Model.
## time_column
Optional. The name of the column that represents the temporal (time) ordering of the dataset.
Required to enable a retention policy on the dataset.
## time_format

Optional. The format of the time_column. The following values are supported:

- `timestamp` - Default. Timestamp without a timezone, e.g. `2016-06-22 19:10:25`, with data type `timestamp`.
- `timestamptz` - Timestamp with a timezone, e.g. `2016-06-22 19:10:25-07`, with data type `timestamptz`.
- `unix_seconds` - Unix timestamp in seconds, e.g. `1718756687`.
- `unix_millis` - Unix timestamp in milliseconds, e.g. `1718756687000`.
- `ISO8601` - ISO 8601 format.
- `date` - Date in `YYYY-MM-DD` format, e.g. `2024-01-01`.

Spice emits a warning if the time_column from the data source is incompatible with the time_format config. String-based columns are assumed to be in ISO 8601 format.
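A minimal sketch of how time_column and time_format fit together (the source path and column name here are hypothetical):

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # hypothetical column holding the event time
    time_format: unix_seconds # values like 1718756687
```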
## time_partition_column

Optional. The column that represents the physical partitioning of the dataset when using append-based acceleration. When the defined time_column is a fine-grained timestamp and the dataset is physically partitioned by a coarser granularity (for example, by date), setting time_partition_column to the partition column (e.g. date_col) improves partition pruning, excludes irrelevant partitions during refreshes, and optimizes scan efficiency.
## time_partition_format

Optional. The format of the time_partition_column. For instance, if the physical partitions follow a date format (`YYYY-MM-DD`), set this value to `date`. The same format options as time_format are supported.
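A sketch of a partitioned, append-accelerated dataset, assuming a hypothetical table with a fine-grained created_at timestamp that is physically partitioned by a date_col column:

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # fine-grained timestamp
    time_partition_column: date_col # coarser physical partition column
    time_partition_format: date # partitions follow YYYY-MM-DD
    acceleration:
      enabled: true
      refresh_mode: append
```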
## unsupported_type_action

Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered. The following values are supported:

- `error` - Default. Return an error when an unsupported data type is encountered.
- `warn` - Log a warning and ignore the column containing the unsupported data type.
- `ignore` - Log nothing and ignore the column containing the unsupported data type.
- `string` - Attempt to convert the unsupported data type to a string. Currently only supports converting the PostgreSQL JSONB type.
Not all connectors support specifying an unsupported_type_action. When specified on a connector that does not support the option, the connector will fail to register. The following connectors support unsupported_type_action:
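As a sketch, assuming a Postgres table containing a JSONB column (the schema and table names here are hypothetical), the `string` action would be configured as:

```yaml
datasets:
  - from: postgres:my_schema.my_table # hypothetical table with a JSONB column
    name: my_table
    unsupported_type_action: string # convert JSONB values to strings instead of erroring
```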
## ready_state

Supports one of two values:

- `on_registration` - Mark the dataset as ready immediately; queries on this table will fall back to the underlying source directly until the initial acceleration is complete.
- `on_load` - Mark the dataset as ready only after the initial acceleration. Queries against the dataset will return an error before the load has been completed.
```yaml
datasets:
  - from: s3://my_bucket/my_dataset/
    name: my_dataset
    ready_state: on_registration # or on_load
    params: ...
    acceleration:
      enabled: true
```
## acceleration

Optional. Accelerate queries to the dataset by caching data locally.
### acceleration.enabled

Enable or disable acceleration. Defaults to `true`.
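For example, acceleration can be disabled so queries are served by the underlying source directly (a minimal sketch using the quickstart dataset):

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: false # queries go to the remote source
```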
### acceleration.engine

The acceleration engine to use, defaults to `arrow`. The following engines are supported:

- `arrow` - Accelerated in-memory backed by Apache Arrow DataTables.
- `duckdb` - Accelerated by an embedded DuckDB database.
- `postgres` - Accelerated by a Postgres database.
- `sqlite` - Accelerated by an embedded SQLite database.
### acceleration.mode

Optional. The mode of acceleration. The following values are supported:

- `memory` - Store acceleration data in-memory.
- `file` - Store acceleration data in a file. Only supported for the `duckdb` and `sqlite` acceleration engines.

mode is currently only supported for the `duckdb` and `sqlite` engines.
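A sketch combining the two settings to persist the acceleration in an embedded DuckDB database file:

```yaml
datasets:
  - from: spice.ai/spiceai/quickstart/datasets/taxi_trips
    name: taxi_trips
    acceleration:
      enabled: true
      engine: duckdb # embedded DuckDB database
      mode: file # persist acceleration data to a file instead of memory
```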
### acceleration.refresh_mode

Optional. How to refresh the dataset. The following values are supported:

- `full` - Refresh the entire dataset.
- `append` - Append new data to the dataset. When `time_column` is specified, new records are fetched from the latest timestamp in the accelerated data at the `acceleration.refresh_check_interval`.
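For example, an append-based refresh that checks for new rows every hour, assuming a hypothetical created_at time column:

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source
    name: events
    time_column: created_at # new records are fetched from the latest accelerated timestamp
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 1h
```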