-
Notifications
You must be signed in to change notification settings - Fork 223
Docs add rustac quickstart #319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
0b40c70
e53a76b
84bcb8a
9b74227
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,257 @@ | ||||||
| { | ||||||
| "cells": [ | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "# Querying Planetary Computer STAC items with rustac", | ||||||
| "", | ||||||
| "This notebook queries the [Sentinel-2 L2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) STAC-GeoParquet snapshot with [rustac](https://github.com/stac-utils/rustac-py), a Rust-powered STAC toolkit that reads GeoParquet through a bundled DuckDB engine. Key benefits:", | ||||||
| "", | ||||||
| "1. **Filter pushdown**: spatial, temporal, and property predicates go into the Parquet read, so only matching items cross the network.", | ||||||
| "2. **No download step**: query the snapshot in place instead of pulling the whole file into pandas first.", | ||||||
| "3. **Bundled engine**: the DuckDB reader ships with rustac, so there's no separate database to install.", | ||||||
| "4. **Arrow-native**: results come back as an Arrow table you can write to Parquet or hand to GeoPandas without a Python-dictionary round-trip.", | ||||||
| "5. **Bulk scale**: built for scanning across many items at once, where the STAC API would need many paged requests.", | ||||||
| "", | ||||||
| "We'll pull a handful of low-cloud Sentinel-2 scenes over Portland, Oregon from the historical snapshot, then filter them, export to Parquet, and bridge them into GeoPandas.", | ||||||
| "", | ||||||
| "The companion [rustac tutorial](../overview/rustac.md) has the full narrative.", | ||||||
| "", | ||||||
| "> **Note on coverage:** the Sentinel-2 GeoParquet snapshot is a point-in-time export covering 2015-07 through 2018-10, not the live archive. Use it for bulk historical analysis. For current acquisitions, query the live STAC API with `pystac-client`." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Install" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "%pip install --quiet 'rustac[arrow]' pystac-client planetary-computer geopandas pyarrow" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Find the GeoParquet snapshot", | ||||||
| "", | ||||||
| "Every Planetary Computer collection carries a collection-level `geoparquet-items` asset that points at its STAC-items snapshot. The href resolves to a directory of monthly `part-NNNN` Parquet files on the `pcstacitems` storage account.", | ||||||
| "", | ||||||
| "**Expected result:** `abfs://items/sentinel-2-l2a.parquet`." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "import pystac_client\n", | ||||||
| "\n", | ||||||
| "catalog = pystac_client.Client.open(\"https://planetarycomputer.microsoft.com/api/stac/v1\")\n", | ||||||
| "asset = catalog.get_collection(\"sentinel-2-l2a\").assets[\"geoparquet-items\"]\n", | ||||||
| "asset.href" | ||||||
|
Comment on lines
+57
to
+61
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this is a rustac tutorial, we can use rustac to search as well. |
||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Authenticate the DuckDB engine", | ||||||
| "", | ||||||
| "The account name looks public, but reads still need a SAS token. The high-level `rustac.search()` coroutine cannot pass Azure credentials, so use `rustac.DuckdbClient` instead and configure its connection once: fetch a container SAS from the Planetary Computer token API, register an Azure secret, and switch the Azure transport to curl.", | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know what "looks public" means.
Suggested change
|
||||||
| "", | ||||||
| "The `azure_transport_option_type = 'curl'` line is not optional. Without it, DuckDB's default Azure transport fails with an opaque SSL CA-certificate error. The SAS expires after about an hour, so long-running jobs re-fetch it.", | ||||||
| "", | ||||||
| "**Expected result:** a configured `client`, no output printed." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "import json\n", | ||||||
| "import urllib.request\n", | ||||||
| "import rustac\n", | ||||||
| "\n", | ||||||
| "sas = json.load(urllib.request.urlopen(\n", | ||||||
| " \"https://planetarycomputer.microsoft.com/api/sas/v1/token/pcstacitems/items\"\n", | ||||||
| "))[\"token\"]\n", | ||||||
| "\n", | ||||||
| "client = rustac.DuckdbClient()\n", | ||||||
| "client.execute(\"INSTALL azure; LOAD azure; SET azure_transport_option_type = 'curl';\")\n", | ||||||
| "client.execute(\n", | ||||||
| " \"CREATE SECRET pc (TYPE azure, PROVIDER config, ACCOUNT_NAME 'pcstacitems', \"\n", | ||||||
| " f\"CONNECTION_STRING 'BlobEndpoint=https://pcstacitems.blob.core.windows.net;SharedAccessSignature={sas}')\"\n", | ||||||
| ")" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Query the snapshot", | ||||||
| "", | ||||||
| "Point `search()` at the snapshot glob and pass the filters you want pushed into the read, so only matching rows cross the network. The glob (`/*.parquet`) spans the monthly part files.", | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it calls them "part" a couple other places as well, should be partition.
Suggested change
|
||||||
| "", | ||||||
| "`DuckdbClient.search` is synchronous (no `await`); the bundled engine scans the part files for you, which takes a little time on the first call.", | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can get a count of the number of parquet files from the obstore list that I suggest above.
Suggested change
|
||||||
| "", | ||||||
| "**Expected result:** a handful of Portland scenes (5)." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "SNAPSHOT = \"az://items/sentinel-2-l2a.parquet/*.parquet\"\n", | ||||||
| "\n", | ||||||
| "items = client.search(\n", | ||||||
| " SNAPSHOT,\n", | ||||||
| " collections=[\"sentinel-2-l2a\"],\n", | ||||||
| " bbox=[-122.7, 45.5, -122.6, 45.6],\n", | ||||||
| " datetime=\"2017-07-01/2017-08-01\",\n", | ||||||
| ")\n", | ||||||
| "len(items)" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Filter on space, time, and properties", | ||||||
| "", | ||||||
| "`search()` accepts the STAC query parameters you would expect, including CQL2-JSON property filters. Here we add a cloud-cover threshold.", | ||||||
| "", | ||||||
| "**Expected result:** a list of low-cloud scenes over the wider summer window." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "items = client.search(\n", | ||||||
| " SNAPSHOT,\n", | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since we're narrowing the datetime, we could show how you can speed up the query by narrowing the glob to something like |
||||||
| " collections=[\"sentinel-2-l2a\"],\n", | ||||||
| " bbox=[-122.7, 45.5, -122.6, 45.6],\n", | ||||||
| " datetime=\"2017-06-01/2017-09-01\",\n", | ||||||
| " filter={\"op\": \"<\", \"args\": [{\"property\": \"eo:cloud_cover\"}, 20]},\n", | ||||||
| ")\n", | ||||||
| "len(items)" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Write results without materializing Python objects", | ||||||
| "", | ||||||
| "For bulk work, skip the round-trip through Python dictionaries. `search_to_arrow` takes the same arguments as `search` and returns an Arrow table. It is backed by `arro3` (rustac's Arrow runtime), so adopt it into PyArrow with `pa.table(...)`, a zero-copy hand-off, before writing it to a new Parquet file.", | ||||||
| "", | ||||||
| "**Expected result:** a `portland-2017.parquet` file written locally; `table.num_rows` is 5." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "import pyarrow as pa\n", | ||||||
| "import pyarrow.parquet as pq\n", | ||||||
| "\n", | ||||||
| "table = pa.table(client.search_to_arrow(\n", | ||||||
| " SNAPSHOT,\n", | ||||||
| " collections=[\"sentinel-2-l2a\"],\n", | ||||||
| " bbox=[-122.7, 45.5, -122.6, 45.6],\n", | ||||||
| " datetime=\"2017-07-01/2017-08-01\",\n", | ||||||
| "))\n", | ||||||
| "pq.write_table(table, \"portland-2017.parquet\")\n", | ||||||
| "table.num_rows" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Bridge to GeoPandas", | ||||||
| "", | ||||||
| "To analyze results as a GeoDataFrame, convert the Arrow table. The snapshot geometries are lon/lat, but the CRS is not carried on the Arrow table, so set it explicitly.", | ||||||
| "", | ||||||
| "**Expected result:** a 5-row GeoDataFrame with `Polygon` geometry." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "import geopandas\n", | ||||||
| "\n", | ||||||
| "gdf = geopandas.GeoDataFrame.from_arrow(table).set_crs(4326)\n", | ||||||
| "gdf[[\"id\", \"datetime\", \"eo:cloud_cover\"]].head()" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## From metadata to pixels", | ||||||
| "", | ||||||
| "A search returns item metadata, not imagery. To read the actual rasters, pull asset hrefs from the results and sign them, then hand them to a reader such as [async-geotiff](../overview/async-geotiff.md) for windowed reads.", | ||||||
| "", | ||||||
| "**Expected result:** a signed HTTPS URL for the B04 (red) band of the first scene." | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": null, | ||||||
| "metadata": {}, | ||||||
| "outputs": [], | ||||||
| "source": [ | ||||||
| "import planetary_computer\n", | ||||||
| "\n", | ||||||
| "href = planetary_computer.sign(items[0][\"assets\"][\"B04\"][\"href\"])\n", | ||||||
| "href" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## You're done", | ||||||
| "", | ||||||
| "If every cell above ran, you pushed a spatial, temporal, and cloud-cover filter into a remote GeoParquet snapshot, read back only the matching Sentinel-2 scenes, wrote them to Parquet, and bridged them into GeoPandas: STAC-GeoParquet \u2192 filtered Arrow table \u2192 GeoDataFrame, with no full download.", | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| "", | ||||||
| "Swap in your own bbox, datetime window, or collection and the same pattern applies. For small or current-data queries, the live STAC API through `pystac-client` is the better tool. To ask analytical questions or join the catalog against other datasets, see the [DuckDB tutorial](../overview/duckdb.md), which uses the same authentication. For the imagery behind the metadata, hand a signed asset href to the [async-geotiff tutorial](../overview/async-geotiff.md)." | ||||||
| ] | ||||||
| } | ||||||
| ], | ||||||
| "metadata": { | ||||||
| "kernelspec": { | ||||||
| "display_name": "Python 3", | ||||||
| "language": "python", | ||||||
| "name": "python3" | ||||||
| }, | ||||||
| "language_info": { | ||||||
| "name": "python", | ||||||
| "version": "3.12.12" | ||||||
| } | ||||||
| }, | ||||||
| "nbformat": 4, | ||||||
| "nbformat_minor": 5 | ||||||
| } | ||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to list the partition files, e.g. with obstore.