Snail: Private Computer Vision

Imagine a delivery robot that needs to navigate a private facility — a warehouse, a hospital, a secure office building. The robot has a camera. The facility has a detailed 3D map of its interior. To figure out where it is, the robot needs to match what its camera sees against that map — a task called visual localization.

Here’s the tension: neither side wants to share. The facility’s map is confidential — it reveals the layout of a secure building, the placement of assets, the routes people take. The robot’s camera images are sensitive too — they might capture a patient’s face, classified documents on a desk, or the contents of a restricted area. And the computed position itself is a secret: knowing where the robot is tells you something about what it’s doing.

Both sides need each other to compute the answer, but neither side can afford to show their cards.

What if they didn’t have to?

This post is about a system called SNaIL — Secure, Non-linear and Iterative Localization — the first privacy-preserving localization method with formal cryptographic security guarantees and a working real-world demonstration. SNaIL allows two parties — one holding camera images, the other holding a 3D map — to jointly compute a camera’s position and orientation without either party learning the other’s private input. The demo is a Raspberry Pi robot shaped like a snail. It moves at 0.01 meters per second. It is very slow, and very private.


What Visual Localization Actually Does

Visual localization answers a deceptively simple question: given what a camera sees, where is it?

More precisely, it solves the Perspective-n-Point (PnP) problem. You have a set of 2D feature points detected in a camera image — say, corners of objects — and you know where those objects are in 3D space from a pre-built map. The task is to find the camera’s pose: its position and orientation in 3D (six numbers total — three for translation, three for rotation).

The standard approach is iterative optimization. Start with a rough guess of the pose. Project the 3D map points into the image plane using that guess. Compare the projected positions to where the features actually appear. Compute how to adjust the pose to shrink that gap. Repeat until the error is small enough. This is gradient descent applied to reprojection error, typically using the Levenberg-Marquardt or Gauss-Newton algorithms.

Each step of this optimization requires inverting a matrix, which is done via Singular Value Decomposition (SVD) — itself an iterative algorithm. So localization is an iterative algorithm with another iterative algorithm nested inside it. This matters a lot for what comes next.
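To make that nested structure concrete, here is a minimal NumPy sketch of the loop: a damped Gauss-Newton (Levenberg-Marquardt style) step on reprojection error, with the 6×6 solve done via SVD. The camera model, data, and helper names are illustrative, not the paper's actual code; a real pipeline would use analytic Jacobians and OpenCV's solvers.

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle rotation vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def residuals(pose, pts3d, pts2d):
    """Reprojection error under a pinhole camera with identity intrinsics."""
    R, t = rodrigues(pose[:3]), pose[3:]
    cam = pts3d @ R.T + t            # map points into the camera frame
    proj = cam[:, :2] / cam[:, 2:]   # perspective projection
    return (proj - pts2d).ravel()

def lm_step(pose, pts3d, pts2d, lam=1e-3, eps=1e-6):
    """One damped Gauss-Newton step; the 6x6 solve uses SVD."""
    r = residuals(pose, pts3d, pts2d)
    J = np.empty((r.size, 6))
    for i in range(6):               # numeric Jacobian, one pose parameter at a time
        d = np.zeros(6)
        d[i] = eps
        J[:, i] = (residuals(pose + d, pts3d, pts2d) - r) / eps
    A = J.T @ J + lam * np.eye(6)    # damped normal equations
    U, s, Vt = np.linalg.svd(A)      # the nested iterative algorithm
    delta = Vt.T @ ((U.T @ (-J.T @ r)) / s)
    return pose + delta

# Synthetic problem: known 3D points, observations rendered from a true pose.
rng = np.random.default_rng(0)
pts3d = rng.uniform(-1, 1, (8, 3)) + np.array([0, 0, 5.0])
true_pose = np.array([0.10, -0.05, 0.02, 0.20, -0.10, 0.30])
pts2d = residuals(true_pose, pts3d, np.zeros((8, 2))).reshape(-1, 2)

pose = true_pose + 0.05 * rng.standard_normal(6)   # rough initial guess
for _ in range(20):
    pose = lm_step(pose, pts3d, pts2d)
```

Run on exact synthetic data from a nearby initial guess, the estimate settles onto the true pose within a handful of steps; every one of those steps contains a full SVD, which is the nesting that drives the cost analysis below.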


Why Privacy Matters Here

Privacy in localization isn’t hypothetical. Camera images can contain faces, license plates, confidential documents, or sensitive infrastructure. The map itself may be proprietary — think of a commercial mapping service that doesn’t want to give away its data. And the computed pose reveals the device’s location, which may be sensitive for military drones, personal devices, or robots in secure facilities.

Two real scenarios motivate this work:

Computation offload. A lightweight device (robot, phone, drone) has both the image and the map but lacks the power to run localization itself. It offloads to two or more cloud servers, neither of which should learn anything.

Split inputs. The device has the camera image, but a third party owns the map. The device shouldn’t learn the full map; the map owner shouldn’t learn the device’s image or location. They cooperate to compute the pose without sharing their secrets.


The Graveyard of Broken Obfuscation

Researchers have tried to make localization private before. The approaches were fast but ultimately failed.

Line clouds (Speciale et al., 2019) transformed 3D point clouds into line clouds — replacing points with lines passing through them — to prevent image reconstruction from map features. It was efficient and clever. Then Chelani et al. showed the original image could be fully reconstructed from a single localization query. The defense was completely broken.

Adversarial affine subspace embeddings (Dusmanu et al., 2021) tried to protect image features using learned transformations. This too was broken — the input image could be recovered from the supposedly protected features.

Neither approach made formal security statements. Both were eventually defeated by attacks that exploited knowledge of the obfuscation technique itself. The lesson is a familiar one in security: if your defense relies on the attacker not being clever enough, it will eventually fail. You need something provable.


Why Not Just Use Encryption?

The natural instinct is to reach for heavyweight cryptographic tools. Homomorphic encryption (HE) lets you compute on encrypted data without decrypting it. Problem solved?

Not for localization. One iteration of Levenberg-Marquardt requires over 1,000 ciphertext divisions and 7,000 multiplications in a deep computation graph. Using a state-of-the-art boolean FHE library (TFHE-rs), even a rough estimate puts one iteration at over 400 seconds — for just the multiplications. Localization needs many iterations. HE is a dead end here.

Garbled circuits, a form of secure multi-party computation (MPC), are a better fit. In a garbled circuit protocol, computation is structured as a boolean circuit where each wire carries an encryption key instead of a 0 or 1. One party (the generator) creates the encrypted circuit; another (the evaluator) decrypts it gate-by-gate using keys received via oblivious transfer. Neither party learns the other’s inputs, and the computation is provably secure.
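The core garbling idea fits in a few lines. Below is a toy Python sketch of a single garbled AND gate, using SHA-256 as the encryption function and a zero tag to make the correct row recognizable. This is only the principle: it omits oblivious transfer entirely, and real frameworks such as EMP use optimized constructions (point-and-permute, free-XOR, half-gates) rather than trial decryption.

```python
import hashlib
import secrets

def H(a, b):
    return hashlib.sha256(a + b).digest()          # 32-byte one-time pad

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Generator: a random 16-byte label for each possible value of each wire.
labels = {w: (secrets.token_bytes(16), secrets.token_bytes(16)) for w in "abc"}

# Garble an AND gate (c = a AND b): for every input combination, encrypt the
# matching output label (padded with a zero tag so a successful decryption is
# recognizable) under the hash of the two input labels. Shuffle the rows so
# their order leaks nothing about the truth table.
table = [xor(H(labels["a"][x], labels["b"][y]),
             labels["c"][x & y] + bytes(16))
         for x in (0, 1) for y in (0, 1)]
secrets.SystemRandom().shuffle(table)

# Evaluator: holds exactly one label per input wire (delivered via oblivious
# transfer) without knowing which bit either label encodes.
def evaluate(la, lb):
    pad = H(la, lb)
    for row in table:
        cand = xor(row, pad)
        if cand[16:] == bytes(16):   # zero tag: this is the row that decrypted
            return cand[:16]

# The evaluator obtains the output label for a=1, b=0 but cannot tell
# whether it encodes 0 or 1.
out = evaluate(labels["a"][1], labels["b"][0])
```

A full protocol chains gates like this into a circuit, feeding output labels forward as input labels of the next gate; the evaluator decrypts gate by gate without ever seeing a plaintext bit.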

Garbled circuits are communication-intensive — the amount of data exchanged is proportional to the size of the computation, not just the inputs. But they can handle the operations localization needs (floating-point arithmetic, division, branching) without the crippling overhead of HE.

There’s just one problem.


The Iteration Problem

Garbled circuits can evaluate any boolean circuit. But which circuit? In plaintext, localization runs gradient descent until convergence — the number of iterations depends on the data. Under MPC, the circuit must be fixed before execution begins. You can’t branch on secret data because that would reveal information about the data.

The standard fix is to run a worst-case upper bound of iterations regardless of when convergence actually happens. OpenCV’s Levenberg-Marquardt defaults to a maximum of 20 gradient descent steps. The SVD inside each step defaults to a maximum of 30 iterations. A naive, data-oblivious MPC implementation would therefore bake the full worst case into one enormous circuit (20 gradient steps, each containing an SVD of up to 30 iterations: 600 SVD iterations in all), even though most real inputs converge much faster.

This is expensive — but it’s also the safe thing to do. If you let the number of iterations vary, you leak information. More iterations generally means the initial pose estimate was far from the truth. Over time, watching iteration counts tells the servers which areas of the environment are harder to localize in, effectively revealing location information. This is exactly the kind of subtle leakage that the formal security definition in this work is designed to capture and prevent.
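The data-oblivious transformation itself is simple to illustrate. In the Python sketch below (plain arithmetic standing in for circuit gates), there is no early exit: every input runs the same fixed number of steps, and "convergence" merely freezes the value via a multiplexer instead of branching. The function name and toy objective are mine, not the paper's.

```python
def oblivious_minimize(x, grad, steps=20, lr=0.4, tol=1e-8):
    """Gradient descent with a FIXED loop shape: no data-dependent exit.

    Inside a circuit, 'done' would be a secret comparison bit. Rather than
    branching on it, both candidate values are computed and one is selected,
    so the sequence of operations is identical for every input.
    """
    for _ in range(steps):
        g = grad(x)
        done = 1 if g * g < tol else 0      # a comparison gate, not a branch
        # Oblivious select (multiplexer): freeze x once converged.
        x = done * x + (1 - done) * (x - lr * g)
    return x

# Toy objective (x - 3)^2: converges in roughly 7 steps, but all 20 run.
out = oblivious_minimize(0.0, lambda x: 2 * (x - 3))
```

Every invocation performs exactly `steps` gradient evaluations and `steps` selects, which is exactly why the naive approach is safe and exactly why it is expensive.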


Two Observations That Change Everything

SNaIL is built on two insights that dramatically reduce the cost of running localization under MPC.

The SVD always takes 12 iterations

The matrix being inverted during localization always has exactly six singular values — one for each physical degree of freedom of the camera (three rotation, three translation). The standard SVD algorithm (Demmel-Kahan) uses QR sweeps to eliminate off-diagonal entries, and a well-known rule of thumb is two sweeps per singular value. That gives 12 sweeps total.

But is this just a rule of thumb, or does it hold reliably for the matrices that actually arise in localization? The authors tested extensively on real images and the ETH3D benchmark dataset. In every case, the diagonal entries of the bidiagonalized matrix were far from the superdiagonal — the condition that would require extra sweeps. Twelve iterations always sufficed.

This means the inner iterative loop can be fixed at 12 instead of 30 — a 60% reduction — without any loss of accuracy and without leaking information, because the count doesn’t depend on secret inputs.
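The "exactly six singular values" claim follows from shape alone, and a quick NumPy check (illustrative, not the paper's code) makes it visible: the reprojection-error Jacobian has one column per pose parameter, so the normal matrix that LM inverts is 6×6 no matter how many point correspondences feed into it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 256                                   # number of 2D-3D correspondences
J = rng.standard_normal((2 * n, 6))       # two residual rows per point;
                                          # six columns: 3 rotation + 3 translation
A = J.T @ J                               # the matrix LM actually inverts
s = np.linalg.svd(A, compute_uv=False)
assert A.shape == (6, 6) and s.size == 6  # six singular values, always
```

With the two-sweeps-per-singular-value rule of thumb, six singular values gives the fixed budget of twelve sweeps.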

Run each gradient descent step independently

This is the more radical insight. Instead of encoding the entire multi-step optimization as one giant garbled circuit, SNaIL executes one gradient descent step at a time as a separate MPC invocation. The client receives the intermediate result, checks locally whether convergence has been reached, and either re-invokes with the same inputs (to refine further) or moves on to a new image.
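The resulting client-side driver looks roughly like the hypothetical sketch below. `secure_lm_step` is a stand-in name for one SNaIL invocation: in the real system the inputs are secret-shared, two servers run one garbled-circuit LM step, and only the client can decrypt the refined estimate. Here a toy plaintext update plays that role.

```python
MAX_STEPS = 20                      # OpenCV's default LM iteration cap

def secure_lm_step(pose, observation):
    """Stand-in for one SNaIL MPC invocation (hypothetical name).

    Toy plaintext update simulating one refinement step toward the
    pose implied by the current observation.
    """
    return pose + 0.5 * (observation - pose)

def localize(initial_pose, observation, tol=1e-4):
    pose = initial_pose
    for _ in range(MAX_STEPS):
        new_pose = secure_lm_step(pose, observation)   # one MPC call
        step = abs(new_pose - pose)
        pose = new_pose
        if step < tol:              # convergence is checked LOCALLY;
            break                   # the servers never see this branch
    return pose

# Continuous operation: each frame warm-starts from the previous pose,
# so later frames converge in far fewer invocations than the first.
pose = 0.0
for frame in [1.0, 1.1, 1.2]:       # toy per-frame "ground truth"
    pose = localize(pose, frame)
```

The servers observe only a stream of identical-looking `secure_lm_step` calls; the loop structure, the convergence test, and the mapping of calls to frames all live on the client.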

At first glance this seems to reintroduce the leakage problem — the servers can count how many times they’re called. But consider how localization works in practice: a robot doesn’t localize once. It localizes on every camera frame, continuously, as it moves through an environment. The servers see a stream of SNaIL invocations but cannot tell which invocations belong to which images. Are these 100 calls serving 100 images (one step each) or 5 images (20 steps each)? The MPC protocol guarantees each call looks identical to the servers regardless of its inputs.

The paper proves that if SNaIL is invoked at least g times — where g is the publicly known maximum iteration count (20, by OpenCV’s default) — the adversary’s ability to guess the number of input images is no better than random chance. For a robot continuously localizing on camera frames, accumulating 20 invocations takes seconds, so the security requirement is trivially met.

This leads to a striking property: privacy gets stronger with more use. The longer the robot runs, the more invocations accumulate, and the harder it becomes for the servers to partition them into individual localization runs. This is the exact opposite of prior obfuscation-based approaches, where privacy degraded with repeated use.


Engineering Choices

Getting from insight to implementation required several careful decisions.

MPC framework: EMP over ABY. Both are garbled circuit frameworks, and in theory they should perform similarly. In practice, ABY pre-generates the entire circuit in memory before executing it — consuming over 100 GB of RAM for a 6-point localization and crashing with an out-of-memory error. EMP generates and evaluates gates on the fly, using under 1 GB. Even comparing only the online execution phase (ignoring ABY’s setup costs), EMP is two orders of magnitude faster because ABY’s heavy use of dynamic memory allocation dominates runtime. A cautionary tale: similar theory does not guarantee similar practice.

Data representation: 32-bit float. The authors tested both floating-point and fixed-point representations. 32-bit fixed point overflows during the computation. 64-bit fixed point avoids overflow but is 3× slower than 32-bit float (because multiplication — the dominant operation — scales quadratically with bit width) and produces 90% more convergence failures. Floating point wins decisively.
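The 32-bit fixed-point failure mode is easy to demonstrate. In a Q15.16 format (1 sign, 15 integer, 16 fractional bits) the representable range tops out near 32768, so an intermediate LM quantity on the order of a squared pixel residual silently wraps around. The sketch below simulates the 32-bit truncating cast with NumPy; the format choice is illustrative, not taken from the paper.

```python
import numpy as np

FRAC = 16                                  # Q15.16: 16 fractional bits

def to_fix(x):
    return np.int32(round(x * (1 << FRAC)))

def fix_mul(a, b):
    wide = (np.int64(a) * np.int64(b)) >> FRAC   # full product needs 64 bits
    return wide.astype(np.int32)                 # truncating cast wraps,
                                                 # like 32-bit hardware

def to_float(x):
    return x / (1 << FRAC)

ok = fix_mul(to_fix(1.5), to_fix(2.0))     # 3.0: small values work fine
r = to_fix(300.0)                          # a pixel-scale residual, representable
sq = fix_mul(r, r)                         # 300*300 = 90000 > 32767: wraps,
                                           # to_float(sq) is nowhere near 90000
```

Widening to 64-bit fixed point fixes the range but doubles the bit width of every multiplier in the circuit, which is where the reported 3× slowdown comes from.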

Bandwidth optimization for offloading. In the standard garbled circuit protocol, the client would need to receive all wire label pairs from the generator and forward the correct labels to the evaluator. By letting the client choose the random seed used to generate wire labels (exploiting a feature of EMP’s implementation), the client can generate the evaluator’s labels directly. This reduces client-side bandwidth from 369 Kb to 123 Kb for six input point correspondences.


Performance

SNaIL with Levenberg-Marquardt localization on six input point correspondences takes approximately 11 seconds per localization — roughly 100 to 1,000 times faster than the naive data-oblivious baseline, which must run all 20 × 30 worst-case iterations.

For larger inputs (256 points, a realistic count for feature-rich images), one LM iteration takes about 20 seconds. Assuming 8 iterations to converge (typical for the ETH3D dataset), that’s about 160 seconds and 64 GB of network traffic between the two servers.

This is clearly not fast enough for augmented reality or VR, where localization must complete in milliseconds to avoid motion sickness. But for robotics applications — navigation, delivery, surveillance — where a few seconds of latency is acceptable, it’s the first viable option that doesn’t compromise on security.

For comparison, Trusted Execution Environments (TEEs) like Intel SGX could run this computation near native speed. But TEEs require specific hardware, are vulnerable to side-channel attacks, and Intel recently discontinued SGX on most consumer processors. The “sluggish” performance of MPC is the cost of eliminating hardware trust assumptions and side channels.


Meet Turbo the Snail

To prove this all works beyond benchmarks, the authors built a robot. It’s a Raspberry Pi 3B+ with an RGB camera, WiFi, and a 3D-printed snail shell. Its name is Turbo.

Turbo detects printed AprilTag markers in its environment using its camera, matches the detected marker corners (2D) to known positions in a 3D map, and offloads the localization computation to two servers over WiFi. The servers compute the pose using SNaIL without learning the image features, map points, or result. Turbo receives only the encrypted pose, which it decrypts locally, then moves toward a target position using visual servoing.

At roughly 10 seconds per localization on 8 input points, Turbo moves 0.1 meters between pose updates, traveling at 0.01 m/s. For context, native plaintext localization takes about 10 milliseconds — the time it takes a cheetah to cover the same distance. Hence: a snail.

But Turbo has a trick. After the first pose is solved, the result serves as a warm start for the next localization. The initial estimate is already close to the truth, and convergence frequently happens in a single SNaIL call. In practice, Turbo moves faster than the worst case suggests.

Offloading also saves power. The Raspberry Pi draws 3.7 W when computing localization on-device versus 2.8 W when offloading — a 24% reduction, meaningful for a battery-powered robot.

Turbo is the first robot to offload localization without revealing its camera images, its map, its position, or its orientation to the offload servers. It is slow, but it is provably private.


What This Means

SNaIL demonstrates that formal cryptographic security and practical privacy-preserving localization are not mutually exclusive. The two key techniques — fixing the SVD iteration count via structural analysis and decomposing gradient descent into independent MPC calls — are not specific to localization. Any iterative optimization run under MPC faces similar challenges, and the SNaIL approach offers a template: exploit domain knowledge to eliminate data-dependent iteration counts, and restructure the outer loop to hide convergence behavior across a stream of inputs.

The gap between “provably secure” and “real-time” remains large. SNaIL is roughly 1,000× slower than plaintext localization. Closing that gap will likely require a combination of better MPC protocols, hardware acceleration (Intel AMX for matrix operations, new floating-point formats), and application-specific co-design that maintains formal guarantees while exploiting problem structure — more cheetah, less snail, but still provably private.

The source code is available under the MIT license at github.com/secret-snail/localization-server.