GeoParquet
The features are stored using one or multiple (Geo)Parquet files.
Limitations
The following limitations are known:
- Only 2D geometries are supported.
- The option `linearizeCurves` is not supported. As described in the GeoParquet specification, all geometries must be encoded with WKB or according to the geometry types "point", "linestring", "polygon", "multipoint", "multilinestring", "multipolygon" from the GeoArrow specification.
- The CQL2 functions `DIAMETER2D()` and `DIAMETER3D()` are not supported.
- CRUD operations are not supported.
- Columns with `struct` content are not yet supported.
- Populating a table with the content of multiple specific (Geo)Parquet files is not yet supported. However, it is possible to select multiple files using the `*` and `?` wildcard operators.
- The behavior on dataset changes has not been tested.
- The configuration is not checked for errors. This applies especially to any S3-related configuration.
- When working with large S3 datasets, the data might not be available immediately after application start, especially if no cache has been built beforehand. Requesting the data before it is ready may return HTTP error 503. The data should become available after a few minutes; the exact time depends on the size of the files.
- S3 access may fail if S3 credentials are provided although the bucket is public.
Configuration
Options
| Name | Default | Description | Type | Since |
|---|---|---|---|---|
| connectionInfo | | See Connection Info. | object | v2.0 |
Connection Info
The connection info object for GeoParquet has the following properties:
| Name | Default | Description | Type | Since |
|---|---|---|---|---|
| database | | Only relevant for local (Geo)Parquet files: the relative path, starting from `/resources/features`, to the directory containing the (Geo)Parquet files or subdirectories with (Geo)Parquet files. | string | v2.0 |
| host | | Only relevant for S3: the URL of the bucket containing the (Geo)Parquet files. A subdirectory can be appended to the URL to limit access, e.g. `s3://bucket-eu-central-1/subdirectory`. | string | v2.0 |
| user | | Only relevant for S3: the access key required to access the bucket. Must not be set for public buckets. | string | v2.0 |
| password | | Only relevant for S3: the secret access key required to access the bucket. Must not be set for public buckets. | string | v2.0 |
| driverOptions | | This mapping is used to assign table names to (Geo)Parquet files and to configure S3. See Table Mapping and Configuration of S3 for details. | object | v2.0 |
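Putting the properties together, a minimal connection info for local files might look like the following sketch. The directory name `data` and the file name `foo.parquet` are placeholders, not names from this documentation:

```yaml
connectionInfo:
  database: "data"            # relative to /resources/features
  driverOptions:
    table.FOO: "foo.parquet"  # populates table FOO from data/foo.parquet
```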
Table Mapping
To work with (Geo)Parquet files, you must configure which tables exist and from which (Geo)Parquet files they are populated. This requires a mapping of table names to the (Geo)Parquet files containing the corresponding data. The mapping is set in `driverOptions`. Note the following:
- The table names must be in the `table` namespace by preceding them with `table.`, e.g. `table.FOO`.
- For S3, the path must be relative to the URL in `host`. For local files, it must be relative to the path set in `database`.
- Currently it is not possible to populate a table with the content of multiple specific files. However, the wildcard operators `*` and `?` can be used to select all files matching a specific pattern.
Examples:
```yaml
connectionInfo:
  driverOptions:
    table.FOO: "subdirectory/foo.parquet" # Match one specific file
    table.BAR: "subdirectory/subdirectory_2/*.parquet" # All Parquet files inside subdirectory/subdirectory_2/
```
Finally, the table names can be referenced in `sourcePath`:
```yaml
types:
  example:
    sourcePath: /FOO
```
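While the example above uses the `*` wildcard, the `?` wildcard matches exactly one character, which can be useful for numbered file sets. A hypothetical sketch (the table name `BAZ` and file names are placeholders):

```yaml
connectionInfo:
  driverOptions:
    # Matches part-0.parquet, part-1.parquet, ... but not part-10.parquet
    table.BAZ: "subdirectory/part-?.parquet"
```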
Configuration of S3
The possible parameters and their default values can be found in the S3 documentation of DuckDB. The following must be noted:
- Every value must be a string. This also applies to boolean parameters (in these cases, use the strings "true" and "false").
- The provided keys must be in lower case, e.g. "endpoint" for the parameter `ENDPOINT`.
- The parameters `KEY_ID` and `SECRET` must be provided using `user` and `password` instead of `driverOptions`.
- The platform-specific secret type can be provided using the internal parameter "type". Valid options are "s3", "r2" and "gcs". For other providers, use "s3" and set a custom endpoint instead.
- Instead of using the parameter `SCOPE` (which is not supported), set your sub-path as part of the bucket URL in `host`.
Example for MinIO:
```yaml
connectionInfo:
  host: "s3://geoparquet/subdirectory/"
  user: "KEY-ID"
  password: "SECRET KEY-ID"
  driverOptions:
    endpoint: "s3.minio-provider.net"
    use_ssl: "true"
    url_style: "path"
    table.TEST: "file.parquet"
```
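Based on the "type" parameter described above, a configuration for Cloudflare R2 might look like the following sketch. The bucket name, keys and file name are placeholders, and the `account_id` parameter is an assumption based on DuckDB's R2 secret type; check the DuckDB documentation for the parameters your provider requires:

```yaml
connectionInfo:
  host: "s3://my-bucket/"
  user: "KEY-ID"
  password: "SECRET"
  driverOptions:
    type: "r2"                # platform-specific secret type
    account_id: "ACCOUNT-ID"  # assumption: R2 identifies the account this way
    table.TEST: "file.parquet"
```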