Technical

GeraLens Architecture Sketch: From Pixel to Action

Published 21 April 2026 · 10 min read


Quick answer. The pipeline has five stages: on-device visual embedding (which never leaves the phone by default), privacy-preserving category lookup against a public index, intent disambiguation, service matching via GeraNexus, and a consent-scoped commit. Only the hash bucket and the matched category leave the device. The raw camera frame never does unless the user explicitly opts in.
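To make the hand-offs concrete, here is a minimal Kotlin sketch of the data each stage produces. Every type and field name below is illustrative, not the shipping GeraLens schema.

```kotlin
// Illustrative stage outputs; names and fields are a sketch, not the real schema.
data class Embedding(val values: FloatArray)                        // stage 1: on-device only
data class HashBucket(val id: Int)                                  // stage 2: the only recognition data sent
data class CategoryMatch(val label: String, val confidence: Double) // stage 2: index response
data class LensIntent(val action: String, val confidence: Double)   // stage 3: disambiguated intent
data class Quote(val vertical: String, val summary: String, val price: Double?) // stage 4: GeraNexus
data class CommitReceipt(val quoteId: String, val consentTokenId: String)       // stage 5: consent-scoped commit
```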

Goals

  • Raw camera frames never leave the device by default.
  • Category recognition must work offline for a useful subset.
  • The action surface is GeraNexus — same commit shape across every Gera vertical.
  • The user can always inspect what the lens thinks it saw.

Stage 1: on-device embedding

When the lens is active, the camera frame is processed locally by a small on-device vision encoder. This produces a compact embedding vector — not a raw image. The embedding stays on the device.

Model choice matters. We are targeting a 30-50 MB on-device model quantised to int8, which runs comfortably on mid-tier Android and iPhone silicon. Candidate architectures include MobileCLIP and quantised ViT variants. The model ships with the app and is updated with app releases.
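As a sketch of the stage-1 contract: a frame goes in, a normalised embedding comes out, and nothing else is retained. The VisionEncoder interface and LensFrameProcessor class are hypothetical wrappers, not our actual code; in practice the encoder would sit behind TFLite or Core ML.

```kotlin
// Hypothetical wrapper around whichever quantised on-device model ships.
interface VisionEncoder {
    fun embed(rgbPixels: FloatArray): FloatArray   // e.g. a 512-dim vector
}

class LensFrameProcessor(private val encoder: VisionEncoder) {
    // Returns only the L2-normalised embedding; the raw pixel buffer is never
    // stored, logged, or transmitted.
    fun process(rgbPixels: FloatArray): FloatArray {
        val embedding = encoder.embed(rgbPixels)
        val norm = kotlin.math.sqrt(embedding.fold(0.0) { acc, v -> acc + v * v }).toFloat()
        return FloatArray(embedding.size) { i -> if (norm > 0f) embedding[i] / norm else 0f }
    }
}
```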

Stage 2: privacy-preserving category lookup

The device hashes the embedding (locality-sensitive hashing plus server-side obfuscation) and looks the hash up against a public category index. The server sees a hash bucket, not an image or a precise embedding. The response is a list of candidate categories with confidence scores.
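A minimal sketch of the locality-sensitive step using random-hyperplane (sign) hashing. The bit count, the fixed seed, and the omission of the server-side obfuscation layer are all simplifications for illustration.

```kotlin
import kotlin.random.Random

// Random-hyperplane LSH: each bit records which side of a random hyperplane the
// embedding falls on, so nearby embeddings tend to land in the same bucket.
// 16 bits and the fixed seed are illustrative, not GeraLens parameters.
class SignHasher(dim: Int, private val bits: Int = 16, seed: Long = 42L) {
    private val rng = Random(seed)
    private val hyperplanes = Array(bits) { FloatArray(dim) { rng.nextFloat() * 2f - 1f } }

    fun bucket(embedding: FloatArray): Int {
        var bucket = 0
        for (bit in 0 until bits) {
            var dot = 0f
            for (i in embedding.indices) dot += embedding[i] * hyperplanes[bit][i]
            if (dot >= 0f) bucket = bucket or (1 shl bit)
        }
        return bucket   // only this integer goes into the category-index request
    }
}
```

Because many distinct embeddings share each bucket, the bucket id cannot be inverted back to a precise embedding, which is the property the lookup relies on.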

For truly offline contexts, a compact local category cache handles the top-200 object categories without a network round trip.
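A sketch of that offline path, assuming the cache stores one centroid per category and the embedding is already L2-normalised; the similarity threshold is a placeholder, not a tuned value.

```kotlin
// Offline fallback: cosine similarity against cached category centroids.
class LocalCategoryCache(private val centroids: Map<String, FloatArray>) {
    fun match(embedding: FloatArray, threshold: Float = 0.35f): String? {
        var best: String? = null
        var bestScore = threshold
        for ((label, centroid) in centroids) {
            var dot = 0f
            for (i in embedding.indices) dot += embedding[i] * centroid[i]
            if (dot > bestScore) {       // dot product == cosine for normalised vectors
                bestScore = dot
                best = label
            }
        }
        return best                      // null: no confident offline match, surface no action
    }
}
```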

Stage 3: intent disambiguation

Pointing at a restaurant could mean book-a-table, order-delivery, get-directions or read-reviews. The lens surfaces the most likely intent based on context (time of day and, with explicit consent, the user's history from GeraMind) and offers two or three alternatives.
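One way the re-ranking could look, with placeholder weights and signal names; the real ranking model and the consent plumbing through GeraMind are out of scope here.

```kotlin
// Intent re-ranking sketch. Context signals nudge a base prior per candidate;
// the weights and example signals are placeholders, not tuned GeraLens values.
data class CandidateIntent(val action: String, val basePrior: Double)

fun rankIntents(
    candidates: List<CandidateIntent>,
    hourOfDay: Int,
    recentActions: List<String>      // empty unless the user has opted in via GeraMind
): List<Pair<String, Double>> =
    candidates.map { c ->
        var score = c.basePrior
        // Example nudges: meal hours favour ordering, recent repetition favours habit.
        if (c.action == "order-delivery" && hourOfDay in 11..14) score += 0.2
        if (c.action in recentActions) score += 0.1
        c.action to score
    }.sortedByDescending { it.second }

// The top-scoring intent becomes the primary action; the next two are offered as alternatives.
```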

Stage 4: service matching via GeraNexus

The intent maps to a GeraNexus capability. The lens queries the relevant Gera vertical via the GeraNexus negotiate call: a reservation on GeraEats, a quote on GeraHome, a consultation on GeraClinic. The user sees the quote before anything is committed.
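A sketch of the shapes involved. Only the existence of a negotiate call comes from the architecture above; the field names and the NexusClient interface are assumptions for illustration.

```kotlin
// Stage-4 exchange, sketched with assumed field names.
data class NegotiateRequest(
    val capability: String,   // e.g. "reservation", "quote", "consultation"
    val vertical: String,     // e.g. "GeraEats", "GeraHome", "GeraClinic"
    val category: String,     // matched category from stage 2
    val intent: String        // disambiguated intent from stage 3
)

data class NegotiateResponse(
    val quoteId: String,
    val summary: String,
    val price: Double?
)

interface NexusClient {
    fun negotiate(request: NegotiateRequest): NegotiateResponse
}

// The lens renders summary and price to the user; nothing is committed until stage 5.
```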

Stage 5: consent-scoped commit

One tap commits the transaction. The consent token is bound to the specific recognition event (image hash, timestamp, purpose) so the commit cannot be replayed for a different purpose later.
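A minimal sketch of the binding check, with illustrative field names and an assumed two-minute freshness window; neither is a confirmed GeraLens parameter.

```kotlin
// Consent-binding sketch: the token carries the recognition event it was issued for,
// and the commit path rejects any mismatch or stale event.
data class RecognitionEvent(val imageHash: Long, val timestampMs: Long, val purpose: String)
data class ConsentToken(val boundTo: RecognitionEvent)

fun canCommit(
    token: ConsentToken,
    event: RecognitionEvent,
    nowMs: Long,
    ttlMs: Long = 120_000L                     // assumed 2-minute window
): Boolean =
    token.boundTo == event &&                  // same image hash, timestamp and purpose
        nowMs - event.timestampMs <= ttlMs     // and the recognition is still fresh

// A token minted for one frame and one purpose cannot be replayed later
// for a different frame or a different purpose.
```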

What never leaves the device

  • Raw camera frames (unless explicitly uploaded by the user).
  • The precise embedding.
  • Faces. The model is trained to refuse face recognition outputs; face-shaped regions are blurred in the embedding.

What does leave the device

  • The hash bucket (privacy-preserving, one-way).
  • Category labels once matched.
  • User-explicit context (location, if granted).

Failure modes

The lens misidentifies a category. Design defence: show the user what we think we saw, never commit silently, and always let the user correct it.

The network is unavailable. Fall back to cached categories; surface no action if nothing matches. The user should understand they are offline.

Sensitive content is in frame. The encoder is trained to detect and degrade recognition on flagged content (weapons, sensitive body parts, identity documents); the lens surfaces no action.

Where we are in the build

  • On-device encoder selection: in progress.
  • Category index schema: drafting.
  • Reference mobile app: scaffolded.
  • AR compatibility layer: paused pending consumer hardware direction.

How to contribute

Research drafts live at /research. If you are a vision researcher working on on-device encoders or privacy-preserving recognition, we want to compare notes. The waitlist is open for pilot integrators.

Help us design ambient discovery.

Join the waitlist