Implementing Connection Retries for Offline Apps

Use Write-Ahead Logging, strict error filtering, and jittered exponential backoff to survive file-lock contention and storage I/O interruptions when Python automation writes to SpatiaLite or GeoPackage databases in disconnected field environments.

Why This Matters

Field GIS workflows differ from server-backed stacks in one fundamental way: there is no TCP socket to reconnect. A GeoPackage file is a single SQLite database on a flash card, an external drive, or a network share that may temporarily disappear when a device suspends or a mount point drops. When multiple sync threads or background workers compete for the same file, SQLite coordinates access through file locks and raises OperationalError: database is locked instead of a network timeout. Blind retries can leak file descriptors when each attempt opens a new connection, keep the file under constant contention so WAL checkpoints are starved, and leave a multi-statement geometry write half applied when single statements are retried outside a transaction. A disciplined retry layer, one that understands which errors are transient and which are not, is the difference between a reliable offline cache and a field data loss incident.

This page is a focused implementation guide under Transaction Scoping & Rollback Strategies, which covers the broader rules for explicit BEGIN/COMMIT scoping and savepoint rollbacks.

Prerequisites

Python 3.9+ with the standard library sqlite3 module (no third-party dependencies required for this pattern).
SQLite 3.35+ linked into your Python build. Check with import sqlite3; print(sqlite3.sqlite_version).
A SpatiaLite 5.0+ or GeoPackage 1.3-compliant database file. If you are loading spatial functions at runtime, ensure mod_spatialite is installed — see Native sqlite3 Spatial Extensions.
Familiarity with Python’s sqlite3 isolation-level semantics: the module defaults to auto-commit for DDL and implicit transactions for DML. The retry wrapper below uses isolation_level=None (autocommit mode) so that BEGIN/COMMIT/ROLLBACK are driven explicitly.
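To make the autocommit point concrete, here is a minimal sketch of what isolation_level=None means in practice. It uses an in-memory database and a hypothetical table named t, neither of which is part of the retry wrapper; the only goal is to show that no transaction exists until the code issues BEGIN itself.

```python
import sqlite3


def explicit_transaction_demo() -> tuple:
    # isolation_level=None: the module never issues BEGIN on our behalf
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY);")
    before = conn.in_transaction      # nothing is open yet
    conn.execute("BEGIN;")
    conn.execute("INSERT INTO t (id) VALUES (1);")
    during = conn.in_transaction      # our explicit BEGIN is open
    conn.execute("ROLLBACK;")         # discard the insert
    rows = conn.execute("SELECT COUNT(*) FROM t;").fetchone()[0]
    conn.close()
    return before, during, rows


print(explicit_transaction_demo())  # (False, True, 0)
```

Because conn.in_transaction tracks the real SQLite state in this mode, the wrapper can rely on it to decide whether a statement is safe to retry on its own.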

Primary Method

The class below is a zero-dependency retry wrapper. It enforces WAL mode on connection, distinguishes retryable OS-level errors from hard schema errors, and applies jittered exponential backoff to avoid thundering-herd collisions when multiple field devices reconnect after a sync gap.

python

# SpatiaLite / GeoPackage connection retry — Python 3.9+, sqlite3 stdlib only
import sqlite3
import time
import random
import logging
from typing import Tuple

logger = logging.getLogger(__name__)

# Substrings present in SQLite error messages that indicate a transient OS-level
# condition: SQLITE_BUSY (5), SQLITE_CANTOPEN (14), SQLITE_IOERR (10).
_RETRYABLE_TOKENS = frozenset(
    ["locked", "busy", "unable to open", "disk i/o error", "io error"]
)


class SpatialDBRetry:
    """Retry wrapper for SpatiaLite/GeoPackage with exponential backoff and WAL enforcement."""

    def __init__(
        self,
        db_path: str,
        max_retries: int = 5,
        base_delay: float = 0.5,
        max_delay: float = 12.0,
    ) -> None:
        self.db_path = db_path
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay

    def _is_retryable(self, exc: Exception) -> bool:
        if not isinstance(exc, sqlite3.OperationalError):
            return False
        msg = str(exc).lower()
        return any(token in msg for token in _RETRYABLE_TOKENS)

    def _backoff(self, attempt: int) -> float:
        """Full-jitter exponential backoff, capped at max_delay."""
        cap = min(self.base_delay * (2 ** attempt), self.max_delay)
        return random.uniform(0, cap)

    def connect(self) -> sqlite3.Connection:
        """Open the database and enforce WAL mode, retrying on transient I/O failures."""
        for attempt in range(self.max_retries + 1):
            conn = None
            try:
                # isolation_level=None → manual BEGIN/COMMIT/ROLLBACK in callers
                conn = sqlite3.connect(
                    self.db_path,
                    timeout=10.0,
                    check_same_thread=False,
                    isolation_level=None,
                )
                conn.execute("PRAGMA journal_mode=WAL;")
                conn.execute("PRAGMA synchronous=NORMAL;")
                conn.execute("PRAGMA foreign_keys=ON;")
                return conn
            except sqlite3.OperationalError as exc:
                if conn is not None:
                    conn.close()  # a PRAGMA failed; do not leak the open handle
                if self._is_retryable(exc) and attempt < self.max_retries:
                    delay = self._backoff(attempt)
                    logger.warning(
                        "connect attempt %d/%d failed (%s); retrying in %.2fs",
                        attempt + 1,
                        self.max_retries + 1,
                        exc,
                        delay,
                    )
                    time.sleep(delay)
                else:
                    raise

    def execute(
        self, conn: sqlite3.Connection, sql: str, params: Tuple = ()
    ) -> sqlite3.Cursor:
        """Execute one statement, retrying only when no transaction is open.

        If the caller already issued BEGIN, a mid-transaction lock error is
        re-raised immediately so the *entire transaction* can roll back and
        retry as a unit — never just the failing statement in isolation.
        """
        in_tx = conn.in_transaction
        for attempt in range(self.max_retries + 1):
            try:
                return conn.execute(sql, params)
            except sqlite3.OperationalError as exc:
                if self._is_retryable(exc) and not in_tx and attempt < self.max_retries:
                    delay = self._backoff(attempt)
                    logger.warning(
                        "execute attempt %d/%d failed (%s); retrying in %.2fs",
                        attempt + 1,
                        self.max_retries,
                        exc,
                        delay,
                    )
                    time.sleep(delay)
                else:
                    raise
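
To see what the backoff schedule looks like with the class defaults (base_delay=0.5, max_delay=12.0, max_retries=5), the sketch below recomputes the per-attempt ceiling as standalone functions. The names backoff_cap and full_jitter are illustrative and not part of the class.

```python
import random


def backoff_cap(attempt: int, base_delay: float = 0.5, max_delay: float = 12.0) -> float:
    """Upper bound of the sleep window for a zero-based attempt number."""
    return min(base_delay * (2 ** attempt), max_delay)


def full_jitter(attempt: int, base_delay: float = 0.5, max_delay: float = 12.0) -> float:
    """Draw the actual sleep uniformly from [0, cap]."""
    return random.uniform(0, backoff_cap(attempt, base_delay, max_delay))


# With max_retries=5 the wrapper sleeps after attempts 0 through 4.
caps = [backoff_cap(n) for n in range(5)]
print(caps)       # [0.5, 1.0, 2.0, 4.0, 8.0]
print(sum(caps))  # 15.5 seconds of sleep in the worst case
```

With these defaults the 12.0 second ceiling only starts to bind from attempt 5 onward, so it matters mainly when max_retries is raised. Because the delay is drawn uniformly, the expected total sleep is about half the worst case.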

Step-by-step Walkthrough

1. Open the connection with WAL mode

python

client = SpatialDBRetry("/data/field_survey.gpkg", max_retries=5, base_delay=0.5)
conn = client.connect()

PRAGMA journal_mode=WAL decouples readers from writers: readers no longer block the writer and the writer no longer blocks readers, so background sync threads can query the file while a write transaction is in progress. There is still only one writer at a time. This is the single most important setting for field devices that run concurrent data-collection and sync processes. The Connection Pooling & Lifecycle Management guide explains how to share one WAL-mode connection safely across threads.
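
The reader/writer decoupling can be observed directly. The following sketch assumes a local temporary directory that supports WAL and uses a hypothetical scratch table pts; it keeps a read transaction open while a second connection commits a write, which a rollback-journal database would refuse with a lock error.

```python
import os
import sqlite3
import tempfile


def wal_snapshot_demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical scratch file
        writer = sqlite3.connect(path, isolation_level=None, timeout=0)
        mode = writer.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
        writer.execute("CREATE TABLE pts (id INTEGER PRIMARY KEY);")

        reader = sqlite3.connect(path, isolation_level=None, timeout=0)
        reader.execute("BEGIN;")
        snapshot = reader.execute("SELECT COUNT(*) FROM pts;").fetchone()[0]

        # The writer commits while the reader's transaction is still open.
        writer.execute("BEGIN IMMEDIATE;")
        writer.execute("INSERT INTO pts (id) VALUES (1);")
        writer.execute("COMMIT;")

        # Inside its transaction the reader keeps seeing its snapshot.
        still_snapshot = reader.execute("SELECT COUNT(*) FROM pts;").fetchone()[0]
        reader.execute("COMMIT;")
        fresh = reader.execute("SELECT COUNT(*) FROM pts;").fetchone()[0]

        reader.close()
        writer.close()
        return mode, snapshot, still_snapshot, fresh


print(wal_snapshot_demo())  # ('wal', 0, 0, 1)
```

The reader only sees the new row after ending its own transaction, which is worth remembering when a sync thread holds a long-lived read.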

2. Scope every write inside an explicit transaction

python

try:
    conn.execute("BEGIN IMMEDIATE;")
    client.execute(
        conn,
        "INSERT INTO survey_points (id, geom, status) VALUES (?, ?, ?);",
        (101, b"\x00\x01...", "pending"),
    )
    client.execute(
        conn,
        "UPDATE sync_queue SET status = 'synced' WHERE id = ?;",
        (42,),
    )
    conn.execute("COMMIT;")
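except Exception as exc:
    # BEGIN IMMEDIATE itself may have failed, in which case there is
    # nothing to roll back and ROLLBACK would raise a second error.
    if conn.in_transaction:
        conn.execute("ROLLBACK;")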
    logger.error("transaction aborted: %s", exc)
    raise
finally:
    conn.close()

BEGIN IMMEDIATE acquires the write (RESERVED) lock at the start of the transaction. Contention therefore surfaces at BEGIN, before anything has been written, instead of as a spurious SQLITE_BUSY on a later statement when a deferred transaction tries to upgrade to a write lock. The matching ROLLBACK on any error ensures orphaned geometry records never persist.
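
As a small illustration of where the lock error shows up, the sketch below opens two connections to a hypothetical scratch database with timeout=0 so nothing waits. The second BEGIN IMMEDIATE fails at once while the first writer holds the lock.

```python
import os
import sqlite3
import tempfile


def second_writer_error() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical scratch file
        first = sqlite3.connect(path, isolation_level=None, timeout=0)
        first.execute("CREATE TABLE pts (id INTEGER PRIMARY KEY);")
        second = sqlite3.connect(path, isolation_level=None, timeout=0)

        first.execute("BEGIN IMMEDIATE;")  # first writer takes the write lock
        try:
            second.execute("BEGIN IMMEDIATE;")  # timeout=0: fails immediately
            second.execute("ROLLBACK;")
            message = "no error"
        except sqlite3.OperationalError as exc:
            message = str(exc)
        finally:
            first.execute("ROLLBACK;")
            first.close()
            second.close()
        return message


print(second_writer_error())  # database is locked
```

Since the failure happens before the second connection has written anything, backing off and retrying the whole unit is safe.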

3. Retry the entire transaction unit on lock errors

When execute re-raises inside an open transaction (because conn.in_transaction is True), the caller’s except block catches it, rolls back, and can retry the entire unit:

python

def sync_with_retry(client: SpatialDBRetry, record: dict, max_tx_retries: int = 3) -> None:
    for tx_attempt in range(max_tx_retries):
        conn = client.connect()
        try:
            conn.execute("BEGIN IMMEDIATE;")
            client.execute(
                conn,
                "INSERT OR REPLACE INTO features (fid, geom, label) VALUES (?, ?, ?);",
                (record["fid"], record["geom"], record["label"]),
            )
            conn.execute("COMMIT;")
            return
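        except sqlite3.OperationalError as exc:
            # BEGIN IMMEDIATE itself may have failed; only roll back an open tx
            if conn.in_transaction:
                conn.execute("ROLLBACK;")
            # Retry transient errors only; schema errors propagate at once
            if client._is_retryable(exc) and tx_attempt < max_tx_retries - 1:
                delay = client._backoff(tx_attempt)
                logger.warning("tx attempt %d failed; retrying in %.2fs: %s", tx_attempt + 1, delay, exc)
                time.sleep(delay)
            else:
                raise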
        finally:
            conn.close()

4. Verify WAL mode is active

After the first successful connect, confirm WAL mode is applied before trusting any write:

python

mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
assert mode == "wal", f"Expected WAL, got {mode!r}"

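If the journal mode cannot be changed, for example on storage that lacks the shared-memory support WAL needs (network shares and some Android storage mounts), the PRAGMA does not raise. It returns the mode still in effect, typically delete, and this assertion surfaces the problem immediately.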
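
Python strips assert statements when run with the -O flag, so a production build may prefer an explicit check. The following is a minimal sketch under that assumption; ensure_wal is a hypothetical helper name, and the in-memory database is used only because it can never enter WAL mode, which makes it a convenient failure case.

```python
import sqlite3


def ensure_wal(conn: sqlite3.Connection) -> str:
    """Request WAL and raise if SQLite reports any other journal mode."""
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"WAL not available, journal_mode is {mode!r}")
    return mode


# In-memory databases cannot use WAL, so this call takes the error path.
conn = sqlite3.connect(":memory:")
try:
    ensure_wal(conn)
    outcome = "wal active"
except RuntimeError as exc:
    outcome = str(exc)
finally:
    conn.close()

print(outcome)
```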

Only transient errors (lock contention, a database file that cannot be opened, disk I/O errors) enter the backoff loop; schema errors and constraint failures propagate immediately. Jitter spreads reconnecting field devices across time to avoid I/O spikes.
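
The filtering rule can be exercised without a database by constructing the exceptions directly. This sketch repeats the token match as a standalone function so it runs on its own; the sample messages are the standard SQLite texts plus a hypothetical table name.

```python
import sqlite3

RETRYABLE_TOKENS = ("locked", "busy", "unable to open", "disk i/o error")


def is_retryable(exc: Exception) -> bool:
    """True only for OperationalError messages that look transient."""
    if not isinstance(exc, sqlite3.OperationalError):
        return False
    msg = str(exc).lower()
    return any(token in msg for token in RETRYABLE_TOKENS)


samples = [
    sqlite3.OperationalError("database is locked"),
    sqlite3.OperationalError("unable to open database file"),
    sqlite3.OperationalError("no such table: survey_points"),
    sqlite3.IntegrityError("UNIQUE constraint failed: survey_points.id"),
]
verdicts = [is_retryable(exc) for exc in samples]
print(verdicts)  # [True, True, False, False]
```

String matching is used here because the guide targets Python 3.9+. On Python 3.11 and later, exceptions also carry a sqlite_errorcode attribute, which would allow matching on the numeric result code instead.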

Verification

After a successful write cycle, run these checks to confirm the database is healthy:

python

# Confirm WAL is active
mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
assert mode == "wal", f"Expected WAL, got {mode!r}"

# Confirm the table is not empty (to prove an increase, compare against a
# count taken before the write)
count = conn.execute("SELECT COUNT(*) FROM survey_points;").fetchone()[0]
assert count > 0, "No rows written — transaction may have silently rolled back"

# Check for outstanding WAL frames that need checkpointing
wal_info = conn.execute("PRAGMA wal_checkpoint(PASSIVE);").fetchone()
# Returns (busy, log, checkpointed); log==checkpointed means WAL is fully flushed
print(f"WAL checkpoint: busy={wal_info[0]}, log={wal_info[1]}, checkpointed={wal_info[2]}")

For GeoPackage files, also verify spatial index integrity after any bulk-write session:

python

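# R-tree check for every registered geometry column (requires
# gpkg_geometry_columns populated). CheckSpatialIndex is a SpatiaLite SQL
# function, so mod_spatialite must already be loaded on this connection.
rows = conn.execute(
    "SELECT table_name, column_name FROM gpkg_geometry_columns;"
).fetchall()
for table, col in rows:
    # Bind the names as parameters instead of formatting them into the SQL
    result = conn.execute(
        "SELECT CheckSpatialIndex(?, ?);", (table, col)
    ).fetchone()
    print(f"R-tree {table}.{col}: {result[0]}")  # 1 = valid, 0 = needs rebuild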
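
When mod_spatialite is not available, a weaker check is still possible with plain sqlite3. The sketch below relies on the GeoPackage R-tree extension's naming convention, rtree_<table>_<column>, and only reports geometry columns whose index table is missing; it does not validate the index contents. The function name missing_rtree_tables is hypothetical.

```python
import sqlite3


def missing_rtree_tables(conn: sqlite3.Connection) -> list:
    """Return (table, column) pairs that have no rtree_<table>_<column> table."""
    pairs = conn.execute(
        "SELECT table_name, column_name FROM gpkg_geometry_columns "
        "ORDER BY table_name, column_name;"
    ).fetchall()
    missing = []
    for table, col in pairs:
        # Virtual tables are listed in sqlite_master with type 'table'
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?;",
            (f"rtree_{table}_{col}",),
        ).fetchone()
        if found is None:
            missing.append((table, col))
    return missing
```

An empty list means every registered geometry column has an index table present, which is a quick sanity check to run before the fuller consistency test.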

Alternative Approaches and Edge Cases

SpatiaLite vs GeoPackage path differences

When using SpatiaLite (a .sqlite or .db file with mod_spatialite loaded), the connection setup adds one step:

python

conn.enable_load_extension(True)
conn.load_extension("mod_spatialite")
conn.execute("SELECT InitSpatialMetaDataFull(1);")  # only on first creation

The retry logic in SpatialDBRetry is identical for both formats because both are SQLite files and raise the same OperationalError messages. The index rebuild call differs: SpatiaLite uses RecoverSpatialIndex('table_name', 'geom_column'), whereas GeoPackage files use the gpkgAddSpatialIndex approach.
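
Not every Python build can load extensions, and mod_spatialite may not be installed on a given device. A defensive loader might look like the sketch below; try_load_spatialite is a hypothetical helper, and returning False instead of raising is an assumption about how the caller wants to degrade.

```python
import sqlite3


def try_load_spatialite(conn: sqlite3.Connection) -> bool:
    """Return True if mod_spatialite loaded, False if it could not be loaded."""
    try:
        conn.enable_load_extension(True)
    except (AttributeError, sqlite3.NotSupportedError):
        return False  # this Python build lacks loadable-extension support
    try:
        conn.load_extension("mod_spatialite")
        return True
    except sqlite3.OperationalError:
        return False  # library not installed or not on the loader path
    finally:
        conn.enable_load_extension(False)  # re-disable loading once done


conn = sqlite3.connect(":memory:")
print("spatialite available:", try_load_spatialite(conn))
conn.close()
```

Disabling extension loading again after the attempt keeps later SQL on the connection from loading arbitrary libraries.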

Android and iOS background throttling

Mobile operating systems suspend background processes aggressively. On Android (API 23+), Doze mode can pause your sync thread for several minutes. On iOS, background execution windows are short, typically around 30 seconds. Configure max_delay, together with max_retries and base_delay, so the worst-case retry sequence stays below your platform’s background task time limit:

python

import sys

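# Shorter ceiling on mobile runtimes where background execution is time-capped.
# sys.platform reports "android" / "ios" only on Python 3.13+; older Android
# builds report "linux", so adapt this check to your runtime.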
max_delay = 6.0 if sys.platform in ("android", "ios") else 12.0
client = SpatialDBRetry("/data/field.gpkg", max_retries=4, base_delay=0.4, max_delay=max_delay)

If the process is killed mid-transaction, for example because the OS reclaims a suspended app, earlier commits are not affected. The uncommitted frames in the WAL are ignored the next time the file is opened, so the incomplete transaction is discarded cleanly while previously committed data stays intact.
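
A real process kill is awkward to reproduce in a snippet, so the sketch below uses a stand-in: it closes the connection without ever issuing COMMIT and then reopens the file. This is an approximation of the crash case, not a crash test, and the scratch file and pts table are hypothetical.

```python
import os
import sqlite3
import tempfile


def rows_after_abandoned_write() -> int:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical scratch file
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("PRAGMA journal_mode=WAL;")
        conn.execute("CREATE TABLE pts (id INTEGER PRIMARY KEY);")
        conn.execute("BEGIN IMMEDIATE;")
        conn.execute("INSERT INTO pts (id) VALUES (1);")
        conn.close()  # no COMMIT: stands in for the process going away

        reopened = sqlite3.connect(path)
        count = reopened.execute("SELECT COUNT(*) FROM pts;").fetchone()[0]
        reopened.close()
        return count


print(rows_after_abandoned_write())  # 0
```

The table created before BEGIN survives, while the uncommitted insert does not.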

Async context (asyncio + run_in_executor)

The standard library sqlite3 module is synchronous. In an asyncio application, wrap blocking calls in loop.run_in_executor:

python

import asyncio

async def async_sync(client: SpatialDBRetry, record: dict) -> None:
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, lambda: sync_with_retry(client, record))
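
On Python 3.9+, asyncio.to_thread offers a shorter spelling of the same idea. The sketch below is self-contained and uses a hypothetical blocking function with an in-memory table; the connection is created and closed inside the worker thread, so it never crosses threads.

```python
import asyncio
import sqlite3


def count_rows_blocking() -> int:
    # The connection lives entirely inside the worker thread.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE pts (id INTEGER PRIMARY KEY);")
        conn.executemany("INSERT INTO pts (id) VALUES (?);", [(1,), (2,), (3,)])
        return conn.execute("SELECT COUNT(*) FROM pts;").fetchone()[0]
    finally:
        conn.close()


async def count_rows() -> int:
    # Runs the blocking call in the default thread pool without blocking the loop
    return await asyncio.to_thread(count_rows_blocking)


result = asyncio.run(count_rows())
print(result)  # 3
```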

For async-native spatial queries, see Async Database Queries in Python GIS.

Troubleshooting

`OperationalError: database is locked` does not clear after retries

Cause: A writer elsewhere is holding the write lock and not committing. WAL mode allows concurrent readers but only one writer at a time. Because the journal mode is stored in the database file, every process that opens it uses WAL; the usual culprits are a long-running or abandoned write transaction, or a tool that opened the file with PRAGMA locking_mode=EXCLUSIVE.

Fix: Identify the competing process with lsof /path/to/file.gpkg. Make sure every writer keeps its transactions short and commits or rolls back promptly. If you cannot control other openers, use a file-level mutex (e.g., fcntl.flock on a sidecar lock file) as a cross-process write gate.
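
A cross-process write gate along those lines could be sketched as follows. It is POSIX-only (the fcntl module does not exist on Windows), the lock is advisory so it only works between processes that all use it, and the sidecar lock path is a hypothetical convention, not something SQLite knows about.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def write_gate(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A caller would wrap the whole transaction, for example with write_gate("/data/field_survey.gpkg.lock") around the BEGIN IMMEDIATE to COMMIT sequence. Locking a sidecar file instead of the database itself avoids interfering with SQLite's own file locks.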

`OperationalError: unable to open database file`

Cause: The storage mount is unavailable. Common on Android external storage when the device locks the screen, or on Linux when an NFS/SMB share drops.

Fix: The retry loop will catch this (it matches unable to open). If retries are exhausted, the mount has not recovered — check os.path.exists(db_path) before attempting connection and surface a user-facing “storage unavailable” alert rather than logging a raw exception.
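
One way to turn the missing mount into a clear, user-facing condition is a small pre-check that raises a dedicated exception. The names StorageUnavailableError and open_if_mounted are hypothetical; the check is not atomic with the connect call, so it complements the retry loop and does not replace it.

```python
import os
import sqlite3


class StorageUnavailableError(RuntimeError):
    """Raised when the database file's storage is not currently reachable."""


def open_if_mounted(db_path: str) -> sqlite3.Connection:
    # Fail early with a message the UI layer can show to the user
    if not os.path.exists(db_path):
        raise StorageUnavailableError(f"storage unavailable: {db_path}")
    return sqlite3.connect(db_path, timeout=10.0, isolation_level=None)
```

The UI layer can catch StorageUnavailableError and show an alert, while any remaining OperationalError still goes through the normal retry path.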

R-tree reports corruption after a failed write

Cause: A geometry row was written but the corresponding R-tree index row was not, typically from a crash between the two internal SQLite steps. This cannot happen inside a properly scoped transaction — it only occurs when writes bypass the transaction boundary.

Fix: Rebuild the index. For GeoPackage:

python

conn.execute("SELECT gpkgAddSpatialIndex('survey_points', 'geom');")

For SpatiaLite:

python

conn.execute("SELECT RecoverSpatialIndex('survey_points', 'geom');")

Then re-run the CheckSpatialIndex verification above. For background on why R-tree indexes desynchronize, see the GeoPackage Specification Deep Dive.

Related

Transaction Scoping & Rollback Strategies — parent guide covering explicit BEGIN/COMMIT, savepoints, and spatial validation rollbacks
Connection Pooling & Lifecycle Management — safe connection sharing across threads with WAL mode
Async Database Queries in Python GIS — running spatial queries inside asyncio event loops
Native sqlite3 Spatial Extensions — loading mod_spatialite and enabling spatial functions at runtime
Securing GeoPackage Files for Field Use — file-permission hardening and encryption for offline deployments
