Managing Large Spatial Datasets in Memory

Attach an on-disk GeoPackage to an in-memory SQLite connection, route every query through the table’s R-tree spatial index, and materialize only bounded, fixed-size result sets — never load an entire feature table into Python memory at once.

Why This Matters

Field GIS technicians and automation engineers routinely hit OS memory ceilings when parsing multi-gigabyte GeoPackages on edge devices, mobile workers, and CI runners. The bottleneck is not disk I/O; it is unbounded Python object allocation. Loading a full feature table forces the interpreter to hold every geometry blob, attribute, and index page in RAM simultaneously. Connection Pooling & Lifecycle Management patterns already guard against connection leaks; this page extends that discipline to the data plane, ensuring that only the rows your current processing batch actually needs ever occupy heap space. The result is that multi-million-feature GeoPackages become tractable on hardware with 1–2 GB total RAM — the reality for many field survey devices.

Prerequisites

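Python 3.9+ with the built-in sqlite3 module, compiled with loadable-extension support (Connection.enable_load_extension must exist)
The mod_spatialite loadable module from libspatialite 5.x, installed and on the dynamic-linker path (LD_LIBRARY_PATH on Linux, DYLD_LIBRARY_PATH on macOS)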
A GeoPackage with an enabled RTree spatial index extension (verify with SELECT * FROM gpkg_extensions WHERE extension_name = 'gpkg_rtree_index')
Shapely 2.0+ if you plan to deserialize WKB bytes into geometry objects
Familiarity with Python Integration & Database Workflows fundamentals and basic SQL window concepts

Primary Method

The core idea is a generator that attaches the on-disk GeoPackage to an in-memory SQLite connection, filters rows through the GeoPackage R-tree virtual table, and yields chunks of raw (rowid, wkb_bytes) pairs. The caller drives iteration; the finally block guarantees deterministic teardown.

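GeoPackages store geometries as GeoPackage Binary (GPB) blobs: a short OGC-specified header (magic bytes, version, flags, SRS id, and an optional envelope) prepended to a standard WKB payload. ST_AsBinary() (available when mod_spatialite is loaded) strips the header and returns clean WKB that Shapely 2.0’s from_wkb() can parse without extra preprocessing. If ST_AsBinary() returns NULL for GPB input in your SpatiaLite build, convert the blob explicitly with ST_AsBinary(GeomFromGPB(geom)).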

python

# -- GeoPackage / SpatiaLite context --
import sqlite3
import logging
from typing import Generator, Tuple

logger = logging.getLogger(__name__)

def stream_spatial_chunks(
    gpkg_path: str,
    table_name: str,
    geom_column: str,
    bbox: Tuple[float, float, float, float],
    chunk_size: int = 5_000,
) -> Generator[list, None, None]:
    """
    Stream features from a GeoPackage in memory-bounded chunks.

    Attaches the on-disk file as read-only, filters by bounding box through
    the GeoPackage R-tree index, and yields lists of (rowid, wkb_bytes)
    tuples.  Consume or copy each chunk before advancing the iterator —
    the in-memory connection is closed in the finally block once iteration
    is exhausted or abandoned.

    Args:
        gpkg_path:   Absolute path to the source .gpkg file.
        table_name:  Feature table name (must exist in gpkg_contents).
        geom_column: Geometry column name (typically 'geom' or 'geometry').
        bbox:        (minx, miny, maxx, maxy) in the layer's native CRS.
        chunk_size:  Maximum rows per yielded chunk (default 5 000).
    """
    mem_conn = sqlite3.connect(":memory:", uri=True)  # uri=True lets ATTACH parse file: URIs
    try:
        # Load mod_spatialite so ST_AsBinary() is available for GPB → WKB
        mem_conn.enable_load_extension(True)
        mem_conn.load_extension("mod_spatialite")
        mem_conn.execute("SELECT InitSpatialMetaData(1)")

        # Attach the on-disk GeoPackage read-only. immutable=1 tells SQLite the
        # file cannot change, so it skips locking and change detection; use it
        # only when no other process writes the file during iteration.
        mem_conn.execute(
            "ATTACH DATABASE 'file:{}?mode=ro&immutable=1' AS src".format(gpkg_path)
        )

        minx, miny, maxx, maxy = bbox

        # GeoPackage R-trees follow the naming convention rtree_<table>_<geom>.
        # The virtual table exposes: id (= feature rowid), minx, maxx, miny, maxy.
        # The bbox predicate uses standard overlap (intersects) semantics:
        #   row's bbox overlaps query bbox  ↔  row.minx ≤ query.maxx
        #                                        AND row.maxx ≥ query.minx
        #                                        AND row.miny ≤ query.maxy
        #                                        AND row.maxy ≥ query.miny
        rtree = "src.rtree_{}_{}".format(table_name, geom_column)
        sql = """
            SELECT t.rowid, ST_AsBinary(t.{geom})
            FROM src.{table} AS t
            WHERE t.rowid IN (
                SELECT id FROM {rtree}
                WHERE minx <= ? AND maxx >= ?
                  AND miny <= ? AND maxy >= ?
            )
        """.format(geom=geom_column, table=table_name, rtree=rtree)

        cursor = mem_conn.execute(sql, (maxx, minx, maxy, miny))

        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            # Yield raw bytes; defer shapely.from_wkb() to the caller so that
            # GEOS objects are only allocated for the rows you actually need.
            yield [(rowid, wkb) for rowid, wkb in rows]

    except Exception:
        logger.exception("Spatial chunk streaming failed for %s::%s", gpkg_path, table_name)
        raise
    finally:
        # DETACH is best-effort tidiness: close() below releases every attached
        # database anyway, and DETACH can fail if a statement is still active.
        try:
            mem_conn.execute("DETACH DATABASE src")
        except Exception:
            pass
        mem_conn.close()

The R-tree index eliminates non-intersecting rows before any geometry is read. Only the current fetchmany batch occupies heap space at any moment; the connection and all its page cache are reclaimed in the finally block.

Step-by-step Walkthrough

1. Open the in-memory connection and load mod_spatialite

python

# -- SpatiaLite context --
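mem_conn = sqlite3.connect(":memory:", uri=True)
mem_conn.enable_load_extension(True)
mem_conn.load_extension("mod_spatialite")
mem_conn.execute("SELECT InitSpatialMetaData(1)")

Loading mod_spatialite registers the spatial SQL functions, including ST_AsBinary(), on this connection. InitSpatialMetaData(1) then creates the SpatiaLite metadata tables (spatial_ref_sys, geometry_columns, and related tables) in the in-memory database; the 1 argument wraps that work in a single transaction, which makes it faster. The metadata tables are not required for calling ST_AsBinary(), so on a tight memory budget you can skip this call. Passing uri=True to connect() is what allows the ATTACH in the next step to interpret a file: URI.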

2. Attach the on-disk GeoPackage as read-only

python

# -- GeoPackage context --
mem_conn.execute(
    "ATTACH DATABASE 'file:{}?mode=ro&immutable=1' AS src".format(gpkg_path)
)

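mode=ro prevents accidental writes. immutable=1 tells SQLite that the file cannot change while it is attached, so SQLite skips file locking and change detection entirely. That makes reads cheaper and works on read-only media, but it is only safe when no other process writes the file during iteration: if the file does change, queries can return incorrect results. Omit immutable=1 if the file may be modified concurrently; SQLite then takes shared locks and detects changed pages as usual. Note that URI filenames in ATTACH are only interpreted when the connection was opened with URI handling enabled (uri=True in sqlite3.connect()).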
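
The attach step can be sketched as follows. attach_read_only and open_memory_connection are hypothetical helper names introduced here for illustration; they are not part of sqlite3 or SpatiaLite. The sketch builds the URI with pathlib so that spaces and other special characters are percent-encoded, and binds it as a parameter rather than formatting it into the SQL text.

```python
import sqlite3
from pathlib import Path


def open_memory_connection() -> sqlite3.Connection:
    # uri=True makes SQLite interpret "file:" names, including inside ATTACH.
    return sqlite3.connect(":memory:", uri=True)


def attach_read_only(conn: sqlite3.Connection, db_path: str,
                     alias: str = "src", immutable: bool = False) -> str:
    """Attach db_path to conn read-only and return the URI that was used."""
    # Path.as_uri() percent-encodes spaces and other special characters.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    if immutable:
        # Only safe when nothing modifies the file while it is attached.
        uri += "&immutable=1"
    # The file name is bound as a parameter; the alias is a fixed identifier
    # and must come from trusted code, not user input.
    conn.execute(f"ATTACH DATABASE ? AS {alias}", (uri,))
    return uri
```

Leaving immutable at its default of False keeps SQLite's normal shared locking, which is the safer choice when you cannot guarantee the file is static.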

3. Query through the R-tree index

The GeoPackage RTree spatial index extension names the index for a feature table rtree_<table>_<geom_column>. The extension is optional, so the virtual table exists only where it has been registered. It exposes five columns: id (= the feature rowid), minx, maxx, miny, maxy (the geometry’s axis-aligned bounding envelope). A bbox overlap test against those columns lets SQLite skip the geometry BLOB entirely for non-intersecting rows:

sql

-- GeoPackage R-tree overlap predicate
SELECT id FROM rtree_roads_geom
WHERE minx <= 180.0 AND maxx >= 170.0
  AND miny <= -34.0 AND maxy >= -36.0;

Use this as a subquery against the feature table to fetch the matching geometry blobs:

sql

-- GeoPackage context: GPB → WKB via ST_AsBinary
SELECT t.rowid, ST_AsBinary(t.geom)
FROM src.roads AS t
WHERE t.rowid IN (
    SELECT id FROM src.rtree_roads_geom
    WHERE minx <= 180.0 AND maxx >= 170.0
      AND miny <= -34.0 AND maxy >= -36.0
);
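
The overlap predicate can be tried without a GeoPackage. The sketch below assumes that the SQLite library bundled with your Python includes the R*Tree module (most builds do). It uses a hypothetical table named rtree_demo_geom with the same column layout as a GeoPackage R-tree, and ids_overlapping is an illustrative helper, not a library function.

```python
import sqlite3


def ids_overlapping(conn, rtree_table, bbox):
    """Return ids whose envelope overlaps bbox = (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    sql = (
        f"SELECT id FROM {rtree_table} "
        "WHERE minx <= ? AND maxx >= ? AND miny <= ? AND maxy >= ? "
        "ORDER BY id"
    )
    # Parameter order mirrors the predicate: query max against row min, etc.
    return [row[0] for row in conn.execute(sql, (maxx, minx, maxy, miny))]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE rtree_demo_geom USING rtree(id, minx, maxx, miny, maxy)"
)
conn.executemany(
    "INSERT INTO rtree_demo_geom VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0.0, 1.0, 0.0, 1.0),   # small box near the origin
        (2, 5.0, 6.0, 5.0, 6.0),   # far from the query window
        (3, 0.5, 5.5, 0.5, 5.5),   # large box spanning both
    ],
)
hits = ids_overlapping(conn, "rtree_demo_geom", (0.75, 0.75, 2.0, 2.0))
```

Rows 1 and 3 overlap the query window and row 2 does not. Because the test compares envelopes only, a hit means the geometry may intersect the window, not that it does.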

4. Stream with fetchmany

python

cursor = mem_conn.execute(sql, (maxx, minx, maxy, miny))
while True:
    rows = cursor.fetchmany(5_000)
    if not rows:
        break
    process(rows)

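fetchmany(n) holds at most n rows in Python at once. Never use fetchall() on large spatial tables: it materialises the entire result set before your code can inspect a single row. Iterating with for row in cursor: is also lazy, but it hands you one row at a time rather than a bounded batch, so fetchmany() is the better fit for chunked processing.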
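
The chunking loop can be factored into a small reusable generator. This is a minimal sketch against a throwaway in-memory table; iter_chunks and the features table are names chosen for illustration.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_chunks(cursor: sqlite3.Cursor, chunk_size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most chunk_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            return
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id INTEGER PRIMARY KEY, payload BLOB)")
conn.executemany(
    "INSERT INTO features (payload) VALUES (?)", [(b"x" * 16,)] * 12
)
cursor = conn.execute("SELECT id, payload FROM features ORDER BY id")
# Twelve rows in chunks of five: two full chunks and one partial chunk.
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, 5)]
```

Only the final chunk may be shorter than chunk_size, so downstream code should not assume a fixed batch length.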

5. Defer geometry deserialisation

Keep data as raw bytes while streaming. Instantiating Shapely objects eagerly can multiply memory overhead by roughly 3–5x, because each geometry object wraps a separately allocated GEOS C structure on top of the Python object itself. Deserialise only when you perform an actual spatial operation:

python

# -- Python / Shapely 2.0 context --
from shapely import from_wkb

for chunk in stream_spatial_chunks(path, "roads", "geom", bbox):
    for rowid, wkb_bytes in chunk:
        if wkb_bytes is None:
            continue          # geometry column was NULL
        geom = from_wkb(wkb_bytes)   # allocate GEOS object only here
        # ... spatial computation ...
        del geom              # immediately eligible for GC
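
One way to defer even further is to inspect the WKB header before handing the bytes to Shapely. The sketch below is an illustration rather than part of the streaming function: wkb_type_code is a hypothetical helper that reads the byte-order flag and the 32-bit geometry type from the first five bytes, so rows of an unwanted type can be skipped without allocating a GEOS object.

```python
import struct
from typing import Optional


def wkb_type_code(wkb: Optional[bytes]) -> Optional[int]:
    """Return the integer geometry type from a WKB header, or None if absent."""
    if wkb is None or len(wkb) < 5:
        return None
    # Byte 0 is the byte-order flag: 1 = little endian, 0 = big endian.
    fmt = "<I" if wkb[0] == 1 else ">I"
    return struct.unpack_from(fmt, wkb, 1)[0]


# Standard 2D type codes: 1 Point, 2 LineString, 3 Polygon, 6 MultiPolygon.
POLYGONAL = {3, 6}


def is_polygonal(wkb: Optional[bytes]) -> bool:
    return wkb_type_code(wkb) in POLYGONAL
```

This handles plain 2D type codes only; geometries with Z or M coordinates use different codes and would need extra cases.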

6. Deterministic teardown

python

finally:
    try:
        mem_conn.execute("DETACH DATABASE src")
    except Exception:
        pass
    mem_conn.close()

The explicit DETACH DATABASE is a best-effort step, which is why it sits in its own try/except: it fails with a “database is locked” error if the generator is abandoned while a statement is still active. close() is the call that matters, because closing the connection releases every attached database along with any file handles and locks it held. See Connection Pooling & Lifecycle Management for the broader pattern on deterministic resource cleanup.
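
A finally block inside a generator runs when the generator is exhausted, explicitly closed, or garbage-collected. If a caller may stop early, wrapping the generator in contextlib.closing() makes the teardown happen at a known point. The sketch below uses a stand-in generator (stream_rows, a hypothetical name) over an in-memory table to show the effect.

```python
import sqlite3
from contextlib import closing

state = {"closed": False}


def stream_rows(chunk_size=2):
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x)")
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
        cursor = conn.execute("SELECT x FROM t ORDER BY x")
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            yield rows
    finally:
        conn.close()
        state["closed"] = True   # stand-in for real teardown bookkeeping


with closing(stream_rows()) as chunks:
    first = next(chunks)   # consume one chunk, then abandon the rest
# Leaving the with block called chunks.close(), which raised GeneratorExit
# at the yield and ran the finally block immediately.
```

The same wrapper works for stream_spatial_chunks: the connection is closed as soon as the with block exits, even when the loop breaks early.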

Verification

After a streaming run, confirm that no memory-mapped file handles are still open and that the WKB output is valid:

python

# -- Python verification context --
import gc
from shapely import from_wkb, is_valid, is_geometry

collected = 0
for chunk in stream_spatial_chunks("survey.gpkg", "parcels", "geom", (-180, -90, 180, 90)):
    for rowid, wkb in chunk:
        if wkb:
            geom = from_wkb(wkb)
            assert is_geometry(geom), f"rowid {rowid}: not a geometry"
            assert is_valid(geom),    f"rowid {rowid}: invalid geometry"
        collected += 1
    gc.collect()   # optional: sweep reference cycles; plain geometries are freed by refcounting

print(f"Streamed {collected} features; no memory ceiling hit.")

On Linux, cross-check with lsof | grep survey.gpkg after the loop — no open file handles should remain once the generator is exhausted and the finally block has run.

Alternative Approaches

Attribute-only streaming (no mod_spatialite required)

If you only need scalar attributes and bounding-box coordinates (no geometry deserialisation), you can skip loading mod_spatialite entirely and read the raw GPB blob as bytes, or query the R-tree index alone:

python

# -- GeoPackage context: attribute-only, no mod_spatialite --
mem_conn = sqlite3.connect(":memory:", uri=True)  # uri=True lets ATTACH parse the file: URI
mem_conn.execute(
    "ATTACH DATABASE 'file:{}?mode=ro&immutable=1' AS src".format(gpkg_path)
)
cursor = mem_conn.execute(
    """
    SELECT t.rowid, t.name, t.category
    FROM src.parcels AS t
    WHERE t.rowid IN (
        SELECT id FROM src.rtree_parcels_geom
        WHERE minx <= ? AND maxx >= ? AND miny <= ? AND maxy >= ?
    )
    """,
    (maxx, minx, maxy, miny),
)
for chunk in iter(lambda: cursor.fetchmany(10_000), []):
    handle_attributes(chunk)
mem_conn.close()
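
When the raw GPB blob is read without mod_spatialite, the header can be stripped in pure Python. The sketch below follows the GeoPackage Binary layout (2-byte magic "GP", version byte, flags byte, 4-byte SRS id, optional envelope, then WKB); split_gpb is a hypothetical helper name, and the sketch does not interpret the empty-geometry or extended-type flags.

```python
import struct
from typing import Tuple

# Envelope size in bytes for each envelope indicator code (flags bits 1-3).
_ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpb(blob: bytes) -> Tuple[int, bytes]:
    """Return (srs_id, wkb) from a GeoPackage Binary blob."""
    if len(blob) < 8 or blob[:2] != b"GP":
        raise ValueError("not a GeoPackage Binary blob")
    flags = blob[3]
    byte_order = "<" if flags & 0x01 else ">"   # bit 0: header byte order
    envelope_code = (flags >> 1) & 0x07         # bits 1-3: envelope layout
    if envelope_code not in _ENVELOPE_BYTES:
        raise ValueError(f"invalid envelope indicator {envelope_code}")
    (srs_id,) = struct.unpack_from(byte_order + "i", blob, 4)
    # The WKB payload starts after the 8-byte fixed header and the envelope.
    wkb_offset = 8 + _ENVELOPE_BYTES[envelope_code]
    return srs_id, bytes(blob[wkb_offset:])
```

The returned bytes can be passed to shapely.from_wkb() in the same way as the output of ST_AsBinary().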

GDAL/OGR layer-iteration approach

For pipelines that use Fiona & OGR Driver Configuration, the equivalent pattern uses fiona.open() with a bounding-box filter and iterates the layer in blocks:

python

# -- Fiona / OGR context --
import itertools

import fiona

bbox = (170.0, -36.0, 180.0, -34.0)
with fiona.open("survey.gpkg", layer="roads") as src:
    # Fiona applies a bbox pre-filter via OGR's SetSpatialFilterRect
    filtered = src.filter(bbox=bbox)
    while True:
        block = list(itertools.islice(filtered, 5_000))
        if not block:
            break
        process_features(block)

The Fiona path is simpler but allocates Python feature dicts for every row immediately; the raw sqlite3 path above gives finer control over when geometry objects are created. For async database queries in Python GIS workloads, run the entire chunk loop inside a single asyncio.to_thread() call so that it stays off the event loop. Do not advance the generator from different threads: by default a sqlite3 connection may only be used from the thread that created it.
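
A minimal sketch of the asyncio hand-off, using an in-memory table as a stand-in for the GeoPackage. count_rows_blocking is a hypothetical function; the point it illustrates is that the connection is created, read, and closed inside one blocking call, so the whole loop runs on a single worker thread.

```python
import asyncio
import sqlite3


def count_rows_blocking(chunk_size: int = 4) -> int:
    """Blocking chunk loop; the connection never leaves this thread."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (x)")
        conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
        cursor = conn.execute("SELECT x FROM t")
        total = 0
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                return total
            total += len(rows)
    finally:
        conn.close()


async def main() -> int:
    # One to_thread() call wraps the whole loop, so connect, fetch, and close
    # all happen on the same worker thread.
    return await asyncio.to_thread(count_rows_blocking)


total = asyncio.run(main())
```

Calling to_thread() once per chunk instead would risk resuming the generator on a different worker thread, which sqlite3 rejects unless check_same_thread is disabled.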

Troubleshooting

`OperationalError: no such table: rtree_parcels_geom`

Cause. The RTree index extension was not registered for this feature table, or the GeoPackage was written by a tool that skips optional extensions.

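Fix. Register the index before querying. gpkgAddSpatialIndex() is provided by mod_spatialite and must be run on a writable connection opened directly on the GeoPackage, not on the read-only attachment used for streaming: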

sql

-- GeoPackage context: create RTree index
SELECT gpkgAddSpatialIndex('parcels', 'geom');

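Verify the registration with the query below. Also confirm that SELECT count(*) FROM rtree_parcels_geom returns a non-zero value, because gpkgAddSpatialIndex() may create the index without populating it for features that already exist: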

sql

SELECT * FROM gpkg_extensions
WHERE table_name = 'parcels' AND extension_name = 'gpkg_rtree_index';
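
The existence check can also be done from Python before any streaming query is issued. has_rtree_index is a hypothetical helper; it only tests that a table with the conventional name exists in the given schema, not that the index is populated or registered in gpkg_extensions.

```python
import sqlite3


def has_rtree_index(conn: sqlite3.Connection, table: str, geom_column: str,
                    schema: str = "main") -> bool:
    """True if rtree_<table>_<geom_column> exists in the given schema."""
    name = f"rtree_{table}_{geom_column}"
    # schema is interpolated as an identifier and must come from trusted code.
    row = conn.execute(
        f"SELECT 1 FROM {schema}.sqlite_master WHERE name = ?", (name,)
    ).fetchone()
    return row is not None
```

With the attachment used on this page, the call would be has_rtree_index(mem_conn, "parcels", "geom", schema="src"), letting the caller fall back to a full scan or raise a clearer error.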

`OperationalError: /usr/lib/libspatialite.so: cannot open shared object`

Cause. The mod_spatialite loadable module, or the libspatialite library it depends on, is not on the linker search path for the Python process. sqlite3 reports the failure from load_extension() as an OperationalError.

Fix. On Debian/Ubuntu: sudo apt install libsqlite3-mod-spatialite. On macOS with Homebrew: brew install libspatialite. On Windows, place mod_spatialite.dll and its dependent DLLs in the same directory as your script and call load_extension() with the explicit path to the DLL. After installing, confirm with:

python

import sqlite3
conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension("mod_spatialite")
print(conn.execute("SELECT spatialite_version()").fetchone())
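
For scripts that run on several platforms, a guarded loader avoids a hard crash when the module is missing. This is a sketch under assumptions: try_load_extension is a hypothetical helper, and the candidate names you pass (for example "mod_spatialite" or a full path) depend on your installation.

```python
import sqlite3
from typing import Iterable, Optional


def try_load_extension(conn: sqlite3.Connection,
                       candidates: Iterable[str]) -> Optional[str]:
    """Load the first candidate that works; return its name, or None."""
    try:
        conn.enable_load_extension(True)
    except AttributeError:
        # This Python build was compiled without loadable-extension support.
        return None
    try:
        for name in candidates:
            try:
                conn.load_extension(name)
                return name
            except sqlite3.OperationalError:
                continue   # not found or not loadable; try the next name
        return None
    finally:
        # Switch extension loading back off once the attempt is finished.
        conn.enable_load_extension(False)
```

A return value of None tells the caller to fall back to the attribute-only path or to report a clear installation error.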

Memory still grows across chunks

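Cause. References to earlier chunks or their Shapely geometries are still alive, typically because results are being accumulated in a list, dict, or closure that outlives the loop iteration. Geometry objects are normally freed by reference counting as soon as the last reference disappears; the cyclic garbage collector only matters when the objects are caught in a reference cycle.

Fix. Keep each chunk local to the loop body so that its reference count drops to zero naturally, and call gc.collect() between chunks only as a backstop against reference cycles:

python

import gc

for chunk in stream_spatial_chunks(path, "roads", "geom", bbox):
    process_chunk(chunk)   # keep no references to chunk after this call
    gc.collect()           # backstop: sweep any reference cycles

If memory still grows, profile with tracemalloc. It traces Python-level allocations only, so growth that shows up there points to your accumulation logic, while process memory that grows with a flat tracemalloc curve points to native allocations such as the GEOS layer.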
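
A small tracemalloc harness makes the difference between accumulating and chunked processing visible. The sketch below uses byte strings as stand-ins for rows; peak_bytes, accumulate_all, and process_in_chunks are illustrative names, and the measurement covers Python-level allocations only.

```python
import tracemalloc
from typing import Callable


def peak_bytes(fn: Callable[[], object]) -> int:
    """Run fn and return the peak Python heap usage traced during the call."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
        return peak
    finally:
        tracemalloc.stop()


def accumulate_all() -> int:
    kept = [bytes(1_000) for _ in range(2_000)]   # keeps every "row" alive
    return len(kept)


def process_in_chunks() -> int:
    total = 0
    for _ in range(20):
        chunk = [bytes(1_000) for _ in range(100)]  # replaced each iteration
        total += len(chunk)
    return total


peak_accumulating = peak_bytes(accumulate_all)
peak_chunked = peak_bytes(process_in_chunks)
```

Both functions touch the same number of rows, but the chunked version should show a much lower peak because only one or two chunks are alive at any time.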

Related

Connection Pooling & Lifecycle Management — parent guide covering pool design, thread safety, and deterministic teardown for SpatiaLite and GeoPackage connections
Async Database Queries in Python GIS — wrapping the sqlite3 chunking pattern in asyncio for non-blocking GIS pipelines
GeoPackage Specification Deep Dive — OGC mandatory tables, geometry column constraints, and R-tree index registration rules
Fiona & OGR Driver Configuration — alternative layer-iteration patterns using GDAL/OGR drivers
Python Integration & Database Workflows — section overview of all Python patterns for SpatiaLite and GeoPackage automation
