AI that can see and understand what is happening in a video, particularly a live feed, is understandably an attractive product for many enterprises and organizations. Beyond acting as a security "watchdog" over sites and facilities, such a model could also clip the most exciting moments from marketing videos and repurpose them for social media, spot inconsistencies and gaffes in footage and flag them for removal, and assess the body language and movements of participants in controlled studies or of candidates applying for new roles.
While a handful of AI models offer this kind of functionality today, it is far from a mainstream capability. The two-year-old startup Perceptron Inc. is seeking to change that. Today, the company announced the release of its flagship proprietary video-analysis reasoning model, Mk1 (short for "Mark One"), at a price of $0.15 per million input tokens and $1.50 per million output tokens via its application programming interface (API), roughly 80-90% less than leading proprietary rivals, namely Anthropic's Claude Sonnet 4.5, OpenAI's GPT-5, and Google's Gemini 3.1 Pro.
Led by co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, the company spent 16 months developing a "multi-modal recipe" from the ground up to handle the complexities of the physical world.
The launch signals a new era in which models are expected to understand cause and effect, object dynamics, and the laws of physics with the same fluency they once applied to grammar.
Users and prospective enterprise customers can try the model for themselves on a public demo site from Perceptron here.
Performance across spatial and video benchmarks
The model's performance is backed by a set of industry-standard benchmarks focused on grounded understanding.
In spatial reasoning (ER benchmarks), Mk1 achieved a score of 85.1 on EmbSpatialBench, surpassing Google's Robotics-ER 1.5 (78.4) and Alibaba's Q3.5-27B (approximately 84.5).
On the specialized RefSpatialBench, Mk1's score of 72.4 represents an enormous leap over rivals such as GPT-5m (9.0) and Sonnet 4.5 (2.2), highlighting a significant advantage in referring-expression comprehension.
Video benchmarks show similar strength; on the EgoSchema "Hard Subset," where first-and-last-frame inference is insufficient, Mk1 scored 41.4, matching Alibaba's Q3.5-27B and significantly beating Gemini 3.1 Flash-Lite (25.0).
On VSI-Bench, Mk1 reached 88.5, the highest recorded score among the compared models, further validating its ability to handle precise temporal reasoning tasks.
Market positioning and the efficiency frontier
Perceptron has explicitly targeted the "Efficiency Frontier," a chart that plots mean scores across video and embodied-reasoning benchmarks against blended cost per million tokens.
Benchmarking data shows that Mk1 occupies a distinctive position: it matches or exceeds the performance of "frontier" models such as GPT-5 and Gemini 3.1 Pro while maintaining a cost profile closer to "Lite" or "Flash" variants.
Specifically, Perceptron Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. By comparison, the "Efficiency Frontier" chart shows GPT-5 at a significantly higher blended cost (near $2.00) and Gemini 3.1 Pro at roughly $3.00, while Mk1 sits at the $0.30 blended-cost mark with superior reasoning scores.
This aggressive pricing strategy is intended to make high-end physical AI accessible for large-scale industrial use rather than just experimental research.
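To see how a single blended figure emerges from separate input and output prices, here is a minimal sketch of the arithmetic. Note that the article does not state the input/output weighting behind the chart; the 90/10 split below is an assumption that happens to land near the quoted $0.30 mark for Mk1.

```python
# Hedged sketch: blended cost per million tokens under an ASSUMED 90/10
# input/output token mix (the actual weighting is not given in the article).
def blended_cost(input_price, output_price, input_share=0.9):
    """Weighted average price per million tokens."""
    return input_price * input_share + output_price * (1 - input_share)

mk1 = blended_cost(0.15, 1.50)   # approx. 0.285 under the assumed 90/10 mix
```

Under the same assumed mix, a model priced several times higher on both sides scales its blended cost proportionally, which is why the chart's gap between Mk1 and the "frontier" models is so wide.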
Architecture and temporal continuity
The technical core of Perceptron Mk1 is its ability to process native video at up to 2 frames per second (FPS) across a sizable 32K-token context window.
Unlike traditional vision-language models (VLMs), which often treat video as a disjointed sequence of still images, Mk1 is designed for temporal continuity.
This architecture lets the model "watch" extended streams and maintain object identity even through occlusions, a critical requirement for robotics and surveillance applications.
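The two stated figures (2 FPS and a 32K-token window) imply a finite video budget per request. The back-of-the-envelope sketch below makes that relationship concrete; the tokens-per-frame value is NOT given in the article and is purely an assumed number for illustration.

```python
# Hedged sketch of the video budget implied by the stated figures (2 FPS,
# 32K-token context). tokens_per_frame=256 is an ASSUMPTION, not a spec.
def max_video_seconds(context_tokens=32_768, fps=2, tokens_per_frame=256):
    """Seconds of video that fit before the context window is exhausted."""
    return context_tokens / (fps * tokens_per_frame)

budget = max_video_seconds()   # 64.0 seconds under these assumed numbers
```

Denser per-frame encodings shrink the budget linearly, which is why long-stream use cases lean on querying specific moments rather than ingesting whole recordings.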
Developers can query the model for specific moments in a long stream and receive structured timecodes in return, streamlining video clipping and event detection.
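The clipping workflow this enables can be sketched as follows. This is a hypothetical illustration only: the article describes structured timecodes but not Perceptron's actual response schema, so the JSON shape below is assumed.

```python
# Hypothetical sketch: parse clip boundaries out of a structured timecode
# reply. The response schema shown here is assumed, not documented.
import json

def extract_clips(response_text):
    """Return (label, start_sec, end_sec) tuples from a JSON timecode reply."""
    return [(e["event"], e["start"], e["end"])
            for e in json.loads(response_text)]

reply = '[{"event": "buzzer beater", "start": 112.4, "end": 118.9}]'
clips = extract_clips(reply)   # [("buzzer beater", 112.4, 118.9)]
```

The parsed boundaries can then be handed directly to a cutting tool such as ffmpeg, which is what makes machine-readable timecodes more useful than free-text answers.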
Reasoning with the laws of physics
A key differentiator for Mk1 is its "physical reasoning" capability. Perceptron defines this as high-precision spatial awareness that allows the model to understand object dynamics and physical interactions in real-world settings.
For example, the model can analyze a scene to determine whether a basketball shot was taken before or after the buzzer by jointly reasoning over the ball's position in the air and the readout on the shot clock.
This requires more than pattern recognition; it requires an understanding of how objects move through space and time.
The model is capable of "pixel-precise" pointing and of counting into the hundreds within dense, complex scenes. It can also read analog gauges and clocks, which have historically been difficult for purely digital vision systems to interpret reliably.
It also appears to have strong general world and historical knowledge. In my brief test, I uploaded a vintage public-domain film of skyscraper construction in New York City, dated 1906, from the U.S. Library of Congress. Mk1 not only correctly described the contents of the footage, including such odd, atypical sights as workers suspended by ropes, but did so rapidly and even correctly identified the approximate date (early 1900s) from the look of the footage alone.
A developer platform for physical AI
Accompanying the model launch is an expanded developer platform designed to turn these high-level perception capabilities into functional applications with minimal code.
The Perceptron SDK, available via Python, introduces several specialized capabilities such as "Focus," "Counting," and "In-Context Learning."
The Focus feature lets users zoom and crop into specific areas of a frame automatically based on a natural-language prompt, such as detecting and localizing personal protective equipment (PPE) on a construction site. The Counting function is optimized for dense scenes, such as identifying and pointing to every pet in a group or individual items of produce.
Additionally, the platform supports in-context learning, allowing developers to adapt Mk1 to specific tasks by providing just a few examples, such as showing an image of an apple and instructing the model to label every instance of "Class 1" in a new scene.
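The few-shot pattern described above can be sketched generically. The chat-style message schema here is an assumption for illustration; it is not Perceptron's documented SDK surface, and the image values are placeholder file names.

```python
# Hedged sketch of assembling an in-context-learning request from a few
# (image, label) examples. The message schema is ASSUMED, not Perceptron's.
def build_fewshot_request(examples, query_image):
    """Interleave example pairs as user/assistant turns, then ask the query."""
    messages = []
    for image, label in examples:
        messages.append({"role": "user", "image": image,
                         "text": "Label every instance of this class."})
        messages.append({"role": "assistant", "text": label})
    messages.append({"role": "user", "image": query_image,
                     "text": "Label every instance of Class 1 in this scene."})
    return messages

request = build_fewshot_request([("apple.jpg", "Class 1")], "orchard.jpg")
```

The point of the pattern is that adaptation happens entirely in the prompt: no fine-tuning run is needed to teach the model a new label.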
Licensing strategy and the Isaac series
Perceptron is pursuing a dual-track strategy for its model weights and licensing. The flagship Perceptron Mk1 is a closed-source model accessed via API, designed for enterprise-grade performance and security.
However, the company is also maintaining its "Isaac" series, which kicked off with the launch of Isaac 0.1 in September 2025, as an open-weights alternative. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion-parameter vision-language model with reasoning capabilities that is available for edge and low-latency deployments.
While the weights for the Isaac models are open on the popular AI code-sharing community Hugging Face, Perceptron offers commercial licenses for companies that require maximum control or on-premises deployment of the weights.
This approach allows the company to support both the open-source community and specialized commercial partners who need proprietary flexibility. The documentation notes that Isaac 0.2 models are specifically optimized for sub-200ms time-to-first-token, making them ideal for real-time edge devices.
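Time-to-first-token is straightforward to measure against any streaming endpoint. The sketch below shows the general technique with a stand-in stream; the sub-200ms budget is the figure cited for Isaac 0.2, and the fake stream is purely a placeholder for a real model client.

```python
# Generic sketch of measuring time-to-first-token (TTFT) over any streaming
# iterator. fake_stream is a stand-in; a real endpoint client would replace it.
import time

def time_to_first_token(stream):
    """Return (seconds until the first token arrives, the token itself)."""
    start = time.perf_counter()
    first = next(stream)           # blocks until the stream yields its first token
    return time.perf_counter() - start, first

def fake_stream(delay=0.05):
    time.sleep(delay)              # stand-in for network + model decode latency
    yield "hello"

ttft, token = time_to_first_token(fake_stream())
```

For an edge deployment, this measurement would be run against the on-device model and compared to the 200 ms budget.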
Background on Perceptron's founding and focus
Perceptron AI is a Bellevue, Washington-based physical AI startup founded by Aghajanyan and Akshat Shrivastava, both former research scientists at Meta's Facebook AI Research (FAIR) lab.
The company's public materials date its founding to November 2024, while a Washington corporate filing for Perceptron.ai Inc. shows an earlier foreign-registration filing on October 9, 2024, listing Shrivastava and Aghajanyan as governors.
In founder launch posts from late 2024, Aghajanyan said he had left Meta after nearly six years and "joined forces" with Shrivastava to build AI for the physical world, while Shrivastava said the company grew out of his work on efficiency, multimodality, and new model architectures.
The founding appears to have followed directly from the pair's work on multimodal foundation models at Meta. In May 2024, Meta researchers published Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, work that Perceptron later described as part of the lineage behind its own models.
A July 2024 follow-on paper, MoMa, explored more efficient early-fusion training for mixed-modal models and listed both Shrivastava and Aghajanyan among its authors. Perceptron's stated thesis extends that research path into "physical AI": models that can process real-world video and other sensory streams for use cases such as robotics, manufacturing, geospatial analysis, security, and content moderation.
Partner ecosystem and future outlook
The real-world impact of Mk1 is already being demonstrated through Perceptron's partner network. Early adopters are using the model for applications such as auto-clipping highlights from live sports, leveraging its temporal understanding to identify key plays without human intervention.
In the robotics sector, partners are curating teleoperation episodes into training data, effectively automating the labeling and cleaning of data for robot arms and mobile units.
Other use cases include multimodal quality-control agents on manufacturing lines, which can detect defects and verify assembly steps in real time, and wearable assistants on smart glasses that provide context-aware help to users.
Aghajanyan stated that these releases are the culmination of research intended to make AI work best in the physical world, moving toward a future where "physical AI" is as ubiquitous as digital AI.

