Why I Spent My First Day Writing Zero Lines of Code
Building a Multi-Rate AUV Perception Pipeline with GStreamer and ONNX
This is one of my earliest projects in optimisation and underwater vehicles. I originally set out with a straightforward motivation: I wanted to ingest and process data from a camera and a sonar for an Autonomous Underwater Vehicle (AUV), and I wanted to do it fast.
In my head, it was a classic computer vision task for a Producer-Consumer pipeline. Grab the frames, run the inference, and output the data. But the moment I started mapping out the architecture, the reality of sensor fusion hit me. A high-speed camera array and a forward-scanning sonar do not operate at the same pace. Not even close.
I realized quickly that if I didn’t synchronize these asynchronous streams at the foundational level, my inference node was going to choke on stale data or lock up the system entirely. Here is the autopsy of how I built the pipeline, the low-level systems decisions I had to make, and why I spent my entire first day writing exactly zero lines of code.
Day 1: The cv2.read() Trap and the Push-Based Paradigm
If you are used to building standard CV-Ops pipelines, your muscle memory tells you to write a cv2.VideoCapture().read() loop. But for an AUV handling high-volume, multi-rate data, a pull-based model is a bottleneck: each read() call blocks the calling thread until a frame arrives, so the slowest sensor sets the pace for everything else in the loop. That is not what a real-time pipeline needs.
I realized I needed a push-based model, which led me straight to GStreamer. I spent my entire first session (8 PM to 11 PM) just doing top-down learning. I refused to write any code until I understood the underlying mechanics. I dug into:
- Caps Negotiation: How the pipeline dynamically agrees on data formats.
- Sinks and Pads: The actual routing logic of the GStreamer graph.
- Separation of Concerns: Ensuring the ingestion layer had no idea what the inference layer was doing.
It was all study, but you can’t write robust C-level or system-level code if you treat the middleware like a black box.
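To make the ideas in that list a little more concrete, here is a minimal sketch of how a camera branch could be described as a GStreamer pipeline string. It is not the project's actual pipeline: the helper name build_pipeline_description, the videotestsrc stand-in for a real camera, and the resolution and frame rate are all assumptions chosen for illustration. The caps string is where negotiation gets pinned down, and each " ! " is a link between one element's source pad and the next element's sink pad.

```python
def build_pipeline_description(width=640, height=480, fps=30):
    """Return a gst-launch style description for a camera-like branch.

    Hypothetical helper: the element choices and numbers are illustrative
    assumptions, not the configuration used in the original project.
    """
    elements = [
        # Stand-in source; a real AUV would use its camera source element here.
        "videotestsrc is-live=true",
        # Converts whatever the source produces into the format requested below.
        "videoconvert",
        # Caps filter: constrains what caps negotiation is allowed to settle on.
        f"video/x-raw,format=BGR,width={width},height={height},framerate={fps}/1",
        # Leaky queue: drops old buffers instead of stalling upstream elements.
        "queue leaky=downstream max-size-buffers=1",
        # appsink hands buffers to application code without blocking the graph.
        "appsink name=sink emit-signals=true max-buffers=1 drop=true sync=false",
    ]
    return " ! ".join(elements)


if __name__ == "__main__":
    print(build_pipeline_description())
```

A string like this could then be handed to Gst.parse_launch from PyGObject. The ingestion side only knows it is producing BGR frames at a fixed size, which is the separation of concerns the list describes.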
Bypassing the Python GIL: The Synchronization Engine
The core roadblock in this project was the Producer-Consumer problem. The camera is a fast producer; the YOLOv8 ONNX inference node is a relatively slow consumer. If the YOLO node falls behind, memory fills up, or worse, you process old frames while the AUV is already three meters ahead.
To fix this, I built frame_buffer.py. I needed a thread-safe way to handle this data without the Python Global Interpreter Lock (GIL) turning my concurrency into a traffic jam.
Instead of a standard queue.Queue, I opted for collections.deque(maxlen=1). This helped a ton because queue.Queue takes a lock on every put and get, and a full bounded queue blocks the producer by default. A bounded deque is implemented in C, and its appends and pops are documented as thread-safe in CPython, so this single-slot handoff needs no explicit locking of my own. If a new frame arrives before YOLO finishes the last one, the deque simply drops the old frame. No explicit locks, no unbounded memory growth, just the freshest data available.
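Here is a minimal sketch of that single-slot handoff, under a few assumptions: the class name LatestFrameBuffer, its method names, and the sleep durations are hypothetical and not taken from frame_buffer.py. Integers stand in for frames, a fast producer thread pushes them, and a deliberately slow consumer only ever sees the newest one.

```python
import threading
import time
from collections import deque


class LatestFrameBuffer:
    """Hold only the most recent frame; older frames are dropped."""

    def __init__(self):
        self._frames = deque(maxlen=1)

    def push(self, frame):
        # Appending to a full bounded deque discards the item already in it.
        self._frames.append(frame)

    def pop_latest(self):
        # Return None instead of blocking when nothing new has arrived.
        try:
            return self._frames.pop()
        except IndexError:
            return None


def run_demo(n_frames=50):
    """Run a fast producer against a slow consumer; return the frames seen."""
    buffer = LatestFrameBuffer()
    producer_done = threading.Event()
    seen = []

    def producer():
        for frame_id in range(n_frames):
            buffer.push(frame_id)
            time.sleep(0.001)  # stand-in for a fast camera
        producer_done.set()

    def consumer():
        while True:
            # Read the flag before popping so the final frame is never missed.
            finished = producer_done.is_set()
            frame = buffer.pop_latest()
            if frame is not None:
                seen.append(frame)
                time.sleep(0.01)  # stand-in for slow inference
            elif finished:
                break
            else:
                time.sleep(0.0005)  # nothing new yet; avoid a hot spin

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    frames = run_demo()
    print(f"consumed {len(frames)} of 50 frames, last = {frames[-1]}")
```

The consumer processes fewer frames than were produced, but never an older frame after a newer one, and it always ends on the last frame pushed. That trade-off suits perception, where a stale frame is worse than a skipped one.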
I paired this with an appsink using leaky queues, and used np.frombuffer to map the memory directly. Zero-copy operations are non-negotiable when you are handling continuous arrays.
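As a rough illustration of the np.frombuffer step, the sketch below wraps a raw byte buffer as an image array without copying the pixel data. The function name frame_from_bytes and the BGR, 8-bit layout are assumptions for the example. In the real pipeline the bytes would come from mapping the buffer that appsink delivers, and whether that mapping itself avoids a copy depends on the bindings in use, so it is worth verifying on your setup.

```python
import numpy as np


def frame_from_bytes(data, width, height, channels=3):
    """Wrap a raw byte buffer as a (height, width, channels) uint8 array.

    Hypothetical helper: assumes tightly packed 8-bit pixels with no row padding.
    """
    expected = width * height * channels
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    # frombuffer creates a view over the existing memory; reshape of a
    # contiguous 1-D array is also a view, so no pixel data is copied.
    return np.frombuffer(data, dtype=np.uint8).reshape(height, width, channels)


if __name__ == "__main__":
    raw = bytearray(4 * 2 * 3)       # fake 4x2 BGR frame, all zeros
    frame = frame_from_bytes(raw, width=4, height=2)
    raw[0] = 255                     # mutate the underlying buffer
    print(frame.shape, frame[0, 0, 0])
```

Because the array is only a view, the underlying buffer has to stay alive and unmodified for as long as the frame is in use. If the frame must outlive the buffer, an explicit .copy() is the safe option.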
Bridging C and Python for Hardware Simulation
By Day 3, I was writing the orchestration logic (main.py, camera_source.py). To make Python talk to GStreamer’s underlying C libraries, I relied on PyGObject and GObject Introspection, using .typelib files as the middleware translation layer.
Then came Day 4: simulating the acoustic environment. Monocular vision is essentially useless in highly turbid water, which makes sonar the primary source of truth. I modeled sonar_simulator.py after a 100-beam forward-scanning sonar (like a Tritech Gemini). Functionally, the physics rely on time-of-flight, the same principle as LiDAR, just with sound waves instead of light.
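The time-of-flight idea boils down to range = speed of sound × round-trip time / 2. The sketch below is one way a mocked ping could be generated. It is not the contents of sonar_simulator.py: the function names, the nominal 1500 m/s sound speed, the 10 m maximum range, and the uniform random targets are all assumptions for illustration.

```python
import random

# Nominal speed of sound in seawater; the real value varies with
# temperature, salinity, and depth. Assumed here for illustration.
SOUND_SPEED_WATER_M_S = 1500.0


def range_from_echo(round_trip_s, sound_speed=SOUND_SPEED_WATER_M_S):
    """Convert a round-trip echo time in seconds into a one-way range in meters."""
    # The pulse travels out and back, hence the division by two.
    return sound_speed * round_trip_s / 2.0


def simulate_ping(num_beams=100, max_range_m=10.0, seed=None):
    """Return one mocked range per beam for a single ping.

    Hypothetical simulator: each beam gets a random target range, which is
    turned into an echo time and converted back through range_from_echo.
    """
    rng = random.Random(seed)
    ranges = []
    for _ in range(num_beams):
        true_range = rng.uniform(0.1, max_range_m)
        round_trip_s = 2.0 * true_range / SOUND_SPEED_WATER_M_S
        ranges.append(range_from_echo(round_trip_s))
    return ranges


if __name__ == "__main__":
    ping = simulate_ping(seed=42)
    print(len(ping), "beams, nearest return:", round(min(ping), 2), "m")
```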
The Fused Output
With the streams synchronized, I fired up the pipeline (after fixing many, many bugs). The high-speed camera data and the mocked acoustic distance data funneled cleanly into the ONNX inference node.
The terminal output validated the entire architecture:
Fused Object -> Class: 5 | Conf: 0.90 | Distance: 0.35m
Fused Object -> Class: 2 | Conf: 0.85 | Distance: 0.16m
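For context, here is one hypothetical way a line in that format could be produced. It is a sketch, not the project's fusion logic: it assumes the camera's horizontal field of view lines up exactly with the sonar swath, so a detection's horizontal position maps linearly onto a beam index. The Detection type and the fuse function are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output used only for this sketch."""
    class_id: int
    confidence: float
    x_center: float  # horizontal box center, normalized to [0, 1]


def fuse(detection, beam_ranges_m):
    """Attach the range of the beam that lines up with the detection.

    Assumes the image width and the sonar swath cover the same angle,
    which a real system would need to calibrate rather than assume.
    """
    n = len(beam_ranges_m)
    # Map the normalized x position to a beam index and clamp to valid range.
    beam = max(0, min(int(detection.x_center * n), n - 1))
    return (
        f"Fused Object -> Class: {detection.class_id} | "
        f"Conf: {detection.confidence:.2f} | "
        f"Distance: {beam_ranges_m[beam]:.2f}m"
    )


if __name__ == "__main__":
    ranges = [1.0] * 50 + [0.35] * 50  # mocked 100-beam ping
    print(fuse(Detection(class_id=5, confidence=0.9, x_center=0.5), ranges))
```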
I implemented a clean KeyboardInterrupt shutdown to flush the buffers and gracefully release the hardware, ensuring no zombie processes were left hanging and ruining my next run.
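The shutdown pattern can be sketched roughly as follows. The function name run_until_interrupted and the step and cleanup callables are hypothetical placeholders. In the actual project the cleanup step would be whatever flushes the buffers and releases the pipeline.

```python
import threading


def run_until_interrupted(step, cleanup, stop_event=None):
    """Call step() repeatedly until Ctrl+C, then always run cleanup().

    Hypothetical helper: `step` stands in for one iteration of the main
    loop and `cleanup` for releasing buffers and hardware.
    """
    if stop_event is None:
        stop_event = threading.Event()
    try:
        while not stop_event.is_set():
            step()
    except KeyboardInterrupt:
        # Signal any worker threads watching the event to wind down too.
        stop_event.set()
    finally:
        # Runs on Ctrl+C and on any other exit path from the loop.
        cleanup()
    return stop_event


if __name__ == "__main__":
    import time

    run_until_interrupted(
        step=lambda: time.sleep(0.1),
        cleanup=lambda: print("buffers flushed, pipeline released"),
    )
```

With GStreamer, the cleanup callable is typically where the pipeline would be set to Gst.State.NULL so that devices are released before the process exits.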
What’s Next: The Path to the Edge
Simulating this on a standard machine was a great stress test for memory management and concurrency, but the end goal of any AUV perception pipeline is bare-metal execution.
Running this architecture on edge robotics means moving away from standard GPU compute. The next evolution of this code would involve stripping down the Python overhead entirely, migrating the critical paths to C++, and compiling the YOLOv8 weights to run on a dedicated Neural Processing Unit (NPU). By utilizing ARM NEON SIMD instructions for the image preprocessing before it even hits the accelerator, this pipeline could hit the latency requirements needed for real-time underwater docking.
This was a fun project that got me into the basics of GStreamer and fast data processing. As my skills improve, I plan on increasing the project’s scope as mentioned above and implementing it on actual hardware.
Thanks for reading! Check out the source code here: https://github.com/PlexiTAURAD/kelp-and-kernels