Software engineers have used debuggers and profilers for decades to find hidden issues in their code. But in manufacturing, when a process breaks or slows down, we are often left with nothing but a stopwatch and a clipboard.
In the latest episode of Humans in the Loop, Tulip Co-founder Rony Kubat and AI Product Lead Mark Watabe sit down with CMO Madi Castillo to explore how this is changing.
Drawing on backgrounds that span architecture, cinema, and software engineering, the team discusses the emergence of Physical Observability: using Vision Language Models (VLMs) to "profile" the physical world of the factory the way engineers profile code, turning video from a passive storage medium into an active tool for continuous improvement.
Watch the full episode below:
Data without context is noise. To explain why, Mark and Rony turn to a universal analogy: Cooking.
If you are cooking a meal for your grandmother’s 100th birthday, the process is defined by care, tradition, and perhaps a slower pace. If you are cooking that exact same meal for a live TV competition, the process is defined by speed, performance, and stress.
A simple temperature sensor on the stove would read 350°F in both scenarios. It cannot tell the difference. But the reality of the operation is entirely different.
"The architect deals with the problematic scale, which is the human scale... It's the same thing with LLMs. We want infinite context, but it doesn't exist. You have to design referential context so the AI knows where to go." — Mark Watabe
In manufacturing, we often suffer from this "Context Gap." We have the sensor data (the machine is running), but we lack the human context (who is running it, why they are struggling, and what the environment is doing). Bridging this gap requires a new type of sensor, one that can see the whole picture.
Rony introduces the concept of the Software Profiler — a tool that watches a program run and highlights exactly where the code is lagging or breaking.
For the factory floor, Physical Observability is the equivalent. It isn't just about recording video; it's about using AI to understand the action within the video.
With the rise of VLMs, manufacturers can now "query" their video feeds. Instead of a human scrubbing through 8 hours of footage to find a jam, they can ask an agent: "Show me every time the forklift blocked the aisle in Shift A."
This allows operations leaders to "debug" their physical processes with the same precision that software engineers bring to debugging code.
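To make that idea concrete, here is a minimal sketch of what "querying" a video feed could look like. This is not Tulip's implementation: it assumes a hypothetical shift recording (`shift_a.mp4`), samples frames with OpenCV, and sends each frame to an off-the-shelf vision-capable chat model (`gpt-4o` here, standing in for any VLM) with the forklift question above. A production system would reason over events and longer clips rather than isolated frames, but the pattern of natural-language question in, structured answer out is the same.

```python
import base64
import cv2  # OpenCV, used here to pull frames out of the recording
from openai import OpenAI  # any vision-capable chat API would work similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "Does this frame show a forklift blocking the aisle? "
    "Answer YES or NO, then give a one-sentence justification."
)

def sample_frames(video_path, every_n_seconds=30):
    """Yield (timestamp_seconds, JPEG bytes) sampled from the shift recording."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            ok_jpg, buf = cv2.imencode(".jpg", frame)
            if ok_jpg:
                yield idx / fps, buf.tobytes()
        idx += 1
    cap.release()

def query_frame(jpeg_bytes):
    """Ask the VLM a natural-language question about a single frame."""
    b64 = base64.b64encode(jpeg_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any vision-language model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    for ts, jpeg in sample_frames("shift_a.mp4"):
        answer = query_frame(jpeg)
        if answer.strip().upper().startswith("YES"):
            print(f"[{ts / 60:.1f} min] possible aisle blockage: {answer}")
```

Even this toy version shows the shift in workflow: instead of scrubbing a timeline, you describe the event you care about and let the model flag candidate timestamps for a human to confirm.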
A common objection to cameras on the shop floor is the fear of "Big Brother" — that video will be used to punish workers. Mark flips this narrative with a personal metaphor: Learning a handstand.
If you want to learn to do a handstand, the most valuable tool you can have is a camera. You film yourself not to be judged, but to see what you cannot feel: Is your back arched? Are your elbows bent?
This is the core of Human-Centric AI. When Physical Observability is used correctly, it adapts the station to the worker, rather than forcing the worker to adapt to the station. It identifies where ergonomics are poor, where tools are out of reach, and where the process is fighting the human—enabling mass customization of the workflow for the individual.
Ready to start profiling your operations? Learn more about how Tulip is building the future of Computer Vision and Frontline AI.