GeoParquet


The features are stored in one or more (Geo)Parquet files.

Limitations

The following limitations are known:

  • Only 2D geometries are supported.
  • The option linearizeCurves is not supported. As described in the GeoParquet specification, all geometries must be encoded as WKB or using one of the geometry types "point", "linestring", "polygon", "multipoint", "multilinestring" and "multipolygon" from the GeoArrow specification.
  • The CQL2 functions DIAMETER2D() and DIAMETER3D() are not supported.
  • CRUD operations are not supported.
  • Columns with struct content are not yet supported.
  • Populating a table with the content of multiple specific (Geo)Parquet files is not yet supported. However, it is possible to select multiple files using the * and ? wildcard operators.
  • The behavior on dataset changes has not been tested.
  • The configuration is not checked for errors. This is especially the case for any S3-related configurations.
  • When working with large S3 datasets, the data might not be available immediately after application start, especially if no cache has been built beforehand. Requests made before the data is ready may return HTTP error 503. The data should become available after a few minutes; the exact time depends on the size of the files.
  • Access to S3 may fail if credentials are provided even though the bucket is public.

Configuration

Options

| Name | Default | Description | Type | Since |
| --- | --- | --- | --- | --- |
| connectionInfo | | See Connection Info. | object | v2.0 |

Connection Info

The connection info object for GeoParquet has the following properties:

| Name | Default | Description | Type | Since |
| --- | --- | --- | --- | --- |
| database | | Only relevant for local (Geo)Parquet files: the relative path, starting from /resources/features, to the directory containing the (Geo)Parquet files or subdirectories with (Geo)Parquet files. | string | v2.0 |
| host | | Only relevant for S3: the URL of the bucket containing the (Geo)Parquet files. A subdirectory can be appended to the URL to limit access, e.g. s3://bucket-eu-central-1/subdirectory. | string | v2.0 |
| user | | Only relevant for S3: the access key required to access the bucket. Must not be set for public buckets. | string | v2.0 |
| password | | Only relevant for S3: the secret access key required to access the bucket. Must not be set for public buckets. | string | v2.0 |
| driverOptions | | Assigns table names to (Geo)Parquet files and configures S3. See Table Mapping and Configuration of S3 for details. | object | v2.0 |
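As a sketch, a connection info object for local files might look like this (the directory name and file name are hypothetical, and the table mapping in driverOptions is explained in the next section):

```yaml
connectionInfo:
  # Hypothetical: (Geo)Parquet files located in /resources/features/data
  database: "data"
  driverOptions:
    # Assign the contents of foo.parquet to the table FOO
    table.FOO: "foo.parquet"
```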

Table Mapping

To work with (Geo)Parquet files, you must configure which tables exist and from which (Geo)Parquet files they are populated. This requires a mapping from table names to the (Geo)Parquet files that contain the corresponding data. The mapping is set in driverOptions. Note the following:

  • Table names must be placed in the table namespace by prefixing them with table., e.g. table.FOO.
  • For S3, the path must be relative to the URL in host. For local files, it must be relative to the path set in database.
  • Currently it is not possible to populate a table with the content of multiple specific files. However, the wildcard operators * and ? can be used to select all files matching a specific pattern.

Examples:


connectionInfo:
  driverOptions:
    table.FOO: "subdirectory/foo.parquet" # Match one specific file
    table.BAR: "subdirectory/subdirectory_2/*.parquet" # All Parquet files inside subdirectory/subdirectory_2/

Finally, the table names can be referenced in sourcePath:


types:
  example:
    sourcePath: /FOO

Configuration of S3

The possible parameters and their default values can be found in the S3 documentation of DuckDB. Note the following:

  • Every value must be a string. This also applies to boolean parameters (in these cases use the strings "true" and "false").
  • The provided keys must be lower-case, e.g. "endpoint" for the parameter ENDPOINT.
  • The parameters KEY_ID and SECRET must be provided via user and password instead of driverOptions.
  • The platform-specific secret type can be set using the internal parameter "type". Valid options are "s3", "r2" and "gcs". For other providers, use "s3" and set a custom endpoint instead.
  • Instead of using the parameter SCOPE (which is not supported), set your sub-path as part of the bucket URL in host.

Example for MinIO:


connectionInfo:
  host: "s3://geoparquet/subdirectory/"
  user: "KEY-ID"
  password: "SECRET KEY-ID"
  driverOptions:
    endpoint: "s3.minio-provider.net"
    use_ssl: "true"
    url_style: "path"
    table.TEST: "file.parquet"
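Since user and password must not be set for public buckets, a configuration for a public bucket can be sketched as follows (the bucket name, endpoint and file pattern are hypothetical):

```yaml
connectionInfo:
  # Hypothetical public bucket; user and password are omitted
  host: "s3://public-bucket-eu-central-1/subdirectory"
  driverOptions:
    endpoint: "s3.eu-central-1.amazonaws.com"
    table.BAR: "*.parquet" # All Parquet files in the subdirectory
```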