Looking into the Future of Agentic AI with Kit Merker

TAGS

  • Deeply Technical
  • Agents
  • AI
  • Technology

At the bleeding edge of computer vision is Plainsight Technologies, a company that’s modernizing infrastructure to handle future agentic AI workloads. Join us as we speak with CEO Kit Merker on Plainsight’s vision for the future, technological goals, and the leading case studies for computer vision.

Hear from host Julián Duque and Kit Merker in this new, insightful episode of the Code[ish] podcast.


Show Notes

Narrator
Hello and welcome to Code[ish], an exploration of the lives of modern developers. Join us as we dive into topics like languages and frameworks, data and event-driven architectures, artificial intelligence, and individual and team productivity. Tailored to developers and engineering leaders, this episode is part of our deeply technical series.

Julián
And welcome to Code[ish], the Heroku podcast. Today, we have the opportunity to talk with Kit Merker. He’s the CEO of Plainsight. Hello, Kit. How are you doing?

Kit
Hey, Julian. Great to see you, err, talk to you.

Julián
Oh, yes, we definitely are going to work on a video podcast in the future, so we can now see each other. So far, we are going to continue with the audio format. Kit, how are you doing? Tell me a little bit more about you and Plainsight.

Kit
Yeah, Plainsight is focused on vision infrastructure. And I’ve been with the company for, I think, just over 18 months now. In the past it had been focused on building computer vision solutions, and we’ve really pivoted the company in the last little bit here toward software infrastructure, building something that feels a bit like what I kind of jokingly call Visionetes, a sort of vision Kubernetes.

Julián
Visionetes?

Kit
Visionetes. Yeah, that’s my joking name for it. Yeah. For people who are in the software infrastructure space, it might mean something. For other people, it might not. But I think maybe another way of describing it is… you think about all the video infrastructure in the world, most of it’s designed for human consumption. So you’re, you know, watching movies or you’re on a video call.

Kit
And we humans, we like to have a lot of redundancy in our video. We want to see very smooth motion and bright colors and contrast and all this to see what’s really going on. And if you have an AI system watching that video, they’re not really experiencing that the same way as a human consumer of video.

Kit
But most of our video infrastructure for video streaming across the internet, which is an incredible system, most of it is designed for humans. So what we’re trying to do is build… I jokingly say Netflix for robots. It’s really trying to build video streaming for what we call vision workloads: this idea that we can take all of these different video inputs from cameras and eventually turn them into spreadsheets for businesses. Cameras to spreadsheets using, you know, Netflix for robots is, I guess, how I would describe what Plainsight is building in layman’s terms. But we’re really thinking about these different units of vision applications. We refer to those as filters, which is a runtime framework we built for taking video, image data, cameras, etc., and processing it in these modular, composable applications.

Kit
And then you can stream those together for, you know, edge and cloud and all this. And then the idea is that you have… data comes in one side via a camera, gets processed through this pipeline of filters in this orchestrated cloud computing environment. Like I was saying, like Visionetes-style, right? Runs in Docker and Kubernetes.

Kit
And then the output you get is structured data in the form of, you know, maybe JSON data, or MQTT messages if you’re in the IoT world, where you might just have sensors and use the MQTT format, for example, so you can kind of think about a camera as another sensor. Or you might put that data into an ERP system or a CRM system, right?
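As a concrete illustration of that camera-to-structured-data output, here is a minimal sketch of publishing a detection event as JSON over MQTT with the paho-mqtt client. The broker address, topic name, and payload fields are hypothetical, not Plainsight’s actual schema.

```python
# Hypothetical sketch: a filter's structured output published as JSON over MQTT.
# Broker, topic, and payload fields are illustrative only.
import json
import time

import paho.mqtt.client as mqtt  # pip install paho-mqtt (2.x)

detection_event = {
    "camera_id": "dock-cam-01",   # assumed camera identifier
    "timestamp": time.time(),
    "label": "cow",
    "confidence": 0.97,
    "count": 42,
}

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("broker.example.com", 1883)   # placeholder broker address
client.publish("vision/filters/livestock-counter", json.dumps(detection_event))
client.disconnect()
```

The same JSON could just as easily be posted to an ERP or CRM system’s API instead of an MQTT topic.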

Kit
And so in the future, what we really want to do is we want to take the new agent models that are appearing, these agents that basically use LLMs and different workflows to do work, we want to give them sight. And so Plainsight today is more focused, I would say, on quote unquote traditional vision applications where you have cameras, maybe watching a flock of sheep or a manufacturing plant or security systems.

Kit
In the future, we want to bring the capability of sight to agents so that they can autonomously look around in the world and see things in different businesses, facilities, or operational centers. You know, everything from watching, maybe, a drive-through at a quick-serve restaurant to checking the inventory for manufacturing or a warehouse, distribution center, etc.

Kit
And do that on demand in a way that’s cost effective, because unfortunately, if we just took all the cameras in the world and we pumped them into all the GPUs in the world, that would be a very expensive proposition. And going back to the first point, a lot of that information is incredibly redundant. So you’d spend a lot of time moving a lot of data around that doesn’t have very much information in it, and that’s not going to scale.

Kit
So this is really, I think, Plainsight’s big mission in the world, right? We’re watching this transformation happen right now, with AI as the main consumer of content. And that’s going to happen increasingly with the big push for agents. And so our idea is to basically build the software infrastructure that makes video streaming optimized for this AI agent-first world, which is in direct contrast to the current way that most video streaming from cameras is done.

Julián
This is fascinating. And how much do I need to know about AI or machine learning to start using your services? Is this a bring your own model type of service, or do you have your own models to process all of these videos or vision information?

Kit
So it’s a great question because, you know, one thing is, I think the world is obsessed with models and model quality. And in the LLM world, you know, the ChatGPTs and Claudes of the world, those models are very big, and they spend a lot of time training them and making sure that they’re operating correctly across many, many different domains.

Kit
In this vision world, what we tend to do is look at how we can bring smaller models that are much more focused and bespoke. So, in some cases you have general-purpose things like OCR—Optical Character Recognition—which is used to take images of text and read the text. And, you know, that’s a fairly universal AI task.
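For readers who haven’t used OCR, here is a minimal sketch with EasyOCR, an off-the-shelf library Kit mentions later in the conversation; the image path is a placeholder.

```python
# Minimal OCR sketch using the off-the-shelf EasyOCR library.
# The input image path is hypothetical.
import easyocr  # pip install easyocr

reader = easyocr.Reader(["en"])                  # downloads a pretrained model on first use
results = reader.readtext("shipping_label.jpg")  # placeholder image of text

for bbox, text, confidence in results:
    print(f"{text!r} (confidence {confidence:.2f})")
```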

Kit
On the other hand, you might have a very specific task. Like, for example, maybe you’re manufacturing, I don’t know, maybe you bake, you know, a million donuts a year, and you want to make sure that those donuts meet very specific specifications. And so you train some highly bespoke AI models yourself based on real data, real video, real images from your factory.

Kit
And no one else in the world has access to your proprietary, you know, donut-frying process. So, therefore, you know, there is no model out in the world. There’s no way that ChatGPT knows about that at that level. Right? So, there’s a spectrum of availability of different AI models and so the real key to this is what I referred to as the data supply chain.

Kit
Similar in some ways to the software supply chain, which people talk about for security from open-source software and vulnerabilities. The data supply chain has to do with how you manage raw data—images, cameras, etc.—the data feeds into models, the annotation process, and then how you improve those models over time. So where it differs for Plainsight is that we’re enabling all these different modes of consuming these AI models, but we do it via a new abstraction.

Kit
And this is what I was referring to earlier as the filter. The filter abstraction is different than a model. It actually brings together model plus code in a specific runtime context. So if you think about the model, right, it could be, let’s say, an image classification model. It tells you what type of object is in this image, and maybe it’s trained on classes like cats and dogs, right? That’s what the model is. But in order to actually use the model, you have to invoke that inference. And so then the question becomes, well, how am I going to bring that model to life in a specific application context: what video am I accepting, what output data am I producing, where am I writing that, how many instances am I running, and so on?

Kit
Well, now this starts to feel very much like a software workload. And so we take that model plus code, we wrap it into this filter runtime that gives you the basic services to host and run any vision app, which could be, by the way, as simple as if-then statements if you think about it. The model is a more sophisticated version of that. Now we give you a universal interface to define, compose, and deploy these computer vision apps, regardless of what model is inside them. In fact, this enables a new model lifecycle that’s independent of the software lifecycle. Software, you know, you may have a bug in the application code, you might have a security vulnerability you need to patch, etc.
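To make the “model plus code” idea concrete, here is an illustrative sketch of a filter-like wrapper. This is not Plainsight’s actual SDK; the class and function names are invented, and the “model” shown is the simple if-then case Kit describes, though a real classifier could be dropped in behind the same interface.

```python
# Illustrative sketch of "filter = model + code"; names are invented, not Plainsight's SDK.
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class Filter:
    """Wraps a model (or even a simple rule) with the code that runs it per frame."""
    name: str
    predict: Callable[[np.ndarray], dict]   # inference: frame in -> structured result out

    def process(self, frame: np.ndarray) -> dict:
        result = self.predict(frame)
        return {"filter": self.name, **result}

# "Model" #1: a trivial if-then rule -- is the frame mostly dark?
def darkness_rule(frame: np.ndarray) -> dict:
    return {"dark": bool(frame.mean() < 40)}

# "Model" #2 could be a real image classifier (e.g. a PyTorch cats-vs-dogs model)
# wrapped behind the same callable, so the deployment code never changes.

dark_filter = Filter(name="darkness-check", predict=darkness_rule)
fake_frame = np.zeros((480, 640, 3), dtype=np.uint8)
print(dark_filter.process(fake_frame))   # {'filter': 'darkness-check', 'dark': True}
```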

Kit
So these life cycles need to work together. And this mode of operating, in a lot of ways, is like moving from a data scientist experimenting with a notebook to a CI system for these vision apps. And so this gives you a full continuous delivery process by which you make changes to your model—which are really, if you think about it, changes to your data sets, usually, and sometimes some changes to your hyperparameters.

Kit
The training process is outputting models based on the data coming in, some configuration, and some hyperparameters. And that produces a model that has, you know, some quality level. You don’t know what it is until you test it. So it goes through testing and benchmarking. And if that’s good enough, then you say, “Yep, I want to promote this to production and make it part of my new inference pipeline.”
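That promotion step can be as simple as a gate in the delivery pipeline. A hedged sketch, with made-up metric names and thresholds:

```python
# Hypothetical promote-if-better gate; metric names and thresholds are assumptions.
def should_promote(candidate_accuracy: float,
                   production_accuracy: float,
                   min_accuracy: float = 0.90) -> bool:
    """Promote only if the candidate clears an absolute bar and beats production."""
    return candidate_accuracy >= min_accuracy and candidate_accuracy > production_accuracy

if should_promote(candidate_accuracy=0.94, production_accuracy=0.91):
    print("Yep, promote this model into the inference pipeline")
else:
    print("Keep iterating on the data set and hyperparameters")
```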

Kit
So that life cycle is what Plainsight is enabling. And it may be as simple as using an off-the-shelf model like you would in the case of an OCR, or you know, maybe something from Hugging Face or some other PyTorch model you’re using. But it might also mean you’re building your own models. And we at Plainsight, we don’t create or maintain models on behalf of our customers.

Kit
We let them treat models as… we think of them as user data. They’re part of the content of the computer vision applications. So that’s something we’re enabling for customers. But we are not in the business of, you know, model training or licensing models. We provide the tools for training, for sure, and for data collection and preparation, but really the key is that the model needs to live with the code. And once you create that combination and you can deploy code and model together, that’s really what we think will power this vision internet, right? This kind of scalable vision infrastructure for agents and AI to consume video data from businesses.

Julián
What is the interface that I need to use to interact with these filters? Is it in the form of an SDK, is it an API? When I’m building this application, this solution, how do I interface with the filter?

Kit
Yeah, it’s probably more like an SDK. I mean, the simplest mode is you basically take a repo, a GitHub repo, and fork it, and you’ve got a filter runtime base image. So you have your Dockerfile that starts from the Plainsight filter runtime. And then you add your code in your repo, run the Docker build, and the output is a Docker container. If you know how to deploy a Docker container, you can pretty much deploy a vision app, so you don’t really have to know anything, frankly, about how vision works. If you’re a DevOps engineer and you’ve deployed a Docker container—like a third-party container that you got from someone else—then you can probably at least get started.

Kit
And if you’re an application developer, you get a starting point where you don’t have to worry about, for example, how I get the video in—oh, an RTSP feed, that’s the Real-Time Streaming Protocol—you don’t have to know how any of that stuff works. You just configure it to point to an RTSP URL, and you get a little frame API, and you can talk to that and process it the way you want to. Whatever your logic is.
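For contrast, here is roughly what consuming an RTSP feed looks like without an SDK, using OpenCV directly; the URL is a placeholder. A filter runtime like the one described would hide this loop behind a per-frame API.

```python
# Reading frames from an RTSP stream with OpenCV; the URL is a placeholder.
import cv2  # pip install opencv-python

cap = cv2.VideoCapture("rtsp://camera.example.com/stream1")

while cap.isOpened():
    ok, frame = cap.read()      # frame is a NumPy array of BGR pixels
    if not ok:
        break                   # stream ended or dropped
    # ... per-frame logic goes here: run a model, a simple rule, etc. ...

cap.release()
```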

Kit
And you compose these Docker containers together. So it’s actually pretty quick to get started. The thing that takes the most time is if you are starting from a model that has, you know, low quality and needs to be fine-tuned or improved, or if you’re starting from scratch—you just have a bunch of images, or you aren’t sure about the model—then that does take some clock time.

Kit
But what we’ve done with this system is made that more iterative. So you’re really not having to wait months and months, you start with, okay, I got the first, you know, 500 frames or 1200 frames. Got it annotated by some annotation service. If you’re familiar with annotation, it’s like… you do a CAPTCHA, you know, have you ever done a CAPTCHA where you have to click on, you know, stoplights or buses or whatever. You’re actually training AI models when you do that.

Julián
Staircases.

Kit
Yeah. Yeah. Exactly. Yeah, you’re training AI when you do that. That’s what that is. It’s crowdsourcing of the annotations. And in the AI business, there’s like a $5 billion, like, cottage industry that’s come up around just literally annotating images. And, you know, people do everything from these common objects… that’s actually much less prevalent today. More often, it will be these proprietary or highly specialized images.

Kit
So, for example, like medical imaging and things like that. But if you have a set of images, you can hire humans to go and, you know, annotate all that data, and then that data is used for both training and testing of AI models. That part can be laborious depending on the task at hand. But let’s assume for a second you have a model that’s ready to go. You’re using EasyOCR or something like that for this type of use case, or you’re using one of the face detection data sets that are out there, or car dents and scratches is one I was just looking at on Kaggle.

Kit
Kaggle has got all kinds of these different image data sets, for example. You can use them primarily for noncommercial uses, which is fine. And when you go commercial, then you generally have to actually pay for data. But if that stuff’s not a challenge—you’re not starting from a completely proprietary type of model—then the application development is quite quick, and you don’t really have to know that much about vision, especially in these easier use cases. You quickly learn things that will screw you up, like, you know, motion blur can be a problem. Or if you try to detect distances using a single camera—those things can screw you up.

Kit
And there’s some trigonometry you might have to go look at to figure some of these things out. But for the most part, the simple ones, like you know, detecting if an object is there or looking for motion or, you know, those kinds of things, that’s built into the SDK. Very simple code to write. And you can get a system up and running pretty quickly, and then you can start layering on additional, you know, application business logic and the rest of it.

Kit
But the part that Plainsight offers is really just the engine, right? It’s the vision Docker containers, the filters as we call them. And then you can extend those and customize them with your own code and business logic, and of course with your own models. And so that then gives you the space, and then of course, you know, vision never lives on its own.

Kit
And so we partner with our developer ecosystem to build and extend into, you know, reporting and sign-up experiences and all the other parts of building a business application. And many times, what customers really want is to get their data into their database or their ERP system of some sort. So maybe they have inventory tracking or they’ve got a customer support system or something like that, and they’re trying to get their data from a camera. Counting is a common one, right, simple to understand: you’ve got a bunch of cows you want to count—how many cows are coming off of a truck at a facility, for example, is a very common one. We actually do a lot in livestock, interestingly, and not just cows; it’s sheep and pigs and chickens and turkeys and all this stuff. In the normal world, they pay people to count those because that’s money.

Kit
There’s cattle changing hands. That’s money changing hands. That’s been part of society since the dawn of, I think, animal husbandry and agriculture. Well, now we can use AI to do that counting for us and to help us be more accurate and more fair in dealing with these things. So, you know, imagine you’ve got animals walking off of the back of a truck.

Kit
They’re walking sort of single file; stick a camera over them and have it count the number of animals that are coming down the track. That’s a very clear case. Well, that’s an ERP problem, right? Now, if I’m running a large livestock operation, I want to put that into my ERP system, and I can see exactly the flow of goods.

Kit
This is a, you know, food supply chain in this scenario. And so that can go into an ERP system and be tracked as part of inventory tracking and supply chain, understanding where things are coming from and making sure that the numbers match. And any other data—you know, even if you have ear tags or you’ve got chips—all these things can be used together to get a more complete picture, because each of those systems has its own weaknesses.

Kit
When you use multiple systems, that’s where you get the highest accuracy. And that’s what the vision is adding to these types of operations.
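The counting use case Kit describes can be pictured as a line-crossing counter over tracked centroids. A minimal sketch, with an assumed counting line and toy track data standing in for a real detector and tracker:

```python
# Hedged sketch: count tracked animals whose centroid crosses a virtual line.
# The line position and the toy track data are assumptions; a real system
# would feed this from a detector/tracker running inside a filter.
LINE_Y = 300   # y-coordinate of the counting line, in pixels (assumed)

def count_crossings(tracks: dict[int, list[tuple[int, int]]]) -> int:
    """tracks maps a track id to its centroid history [(x, y), ...]."""
    count = 0
    for history in tracks.values():
        ys = [y for _, y in history]
        # crossed if the centroid moved from above the line to on-or-below it
        if any(a < LINE_Y <= b for a, b in zip(ys, ys[1:])):
            count += 1
    return count

tracks = {
    1: [(100, 250), (102, 280), (101, 310)],   # crosses the line
    2: [(140, 260), (139, 305), (141, 330)],   # crosses the line
    3: [(200, 250), (198, 270), (199, 260)],   # wanders, never crosses
}
print(count_crossings(tracks))   # -> 2
```

The resulting count is exactly the kind of single number that slots into an ERP record.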

Julián
Nice. One use case that I saw in one of your videos is the wildfire detection.

Kit
Yes.

Julián
Can you talk about that a little bit more, about that project and how it was… how it was done and how it was trained?

Kit
Yeah, so wildfirewatch.org is a website we launched last year, and it’s… I’m based in Seattle, Washington, actually Kirkland, Washington of Costco fame. So in Washington state, we have a pretty bad wildfire problem. Not as bad as some other states around us, but it’s become pretty pronounced. And I found out that the state was spending a whole bunch of money on some very bespoke fire detection solutions.

Kit
And I had some friends that were on the fire prevention side, especially for the land management here—tons of, you know, forests and things like that in the state. But we also have a pretty comprehensive traffic network, and all the video cameras for traffic are all publicly available URLs. So I thought, well, you know, we’ll just build it ourselves, right?

Kit
We’ll build our own wildfire tracking. And so what we did is we took all the data from the cameras—there are 1,500-something cameras—and we said we’d just track them every five minutes and run them through a filter with a model that we trained for fire detection, giving each a score from 0 to 1 for the likelihood of fire.
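A rough sketch of that polling loop, with placeholder camera URLs and a stubbed-out scoring function standing in for the trained fire-detection filter:

```python
# Hypothetical Wildfire Watch-style loop: poll camera snapshots on an interval
# and score each frame from 0 to 1. URLs, the model stub, and the alert
# threshold are all placeholders.
import time

import requests  # pip install requests

CAMERA_URLS = [
    "https://traffic-cams.example.gov/cam-0001.jpg",
    "https://traffic-cams.example.gov/cam-0002.jpg",
]
POLL_SECONDS = 300        # every five minutes, as described in the episode
FIRE_THRESHOLD = 0.8      # assumed alerting threshold

def fire_score(image_bytes: bytes) -> float:
    """Stand-in for the trained fire-detection filter; returns a score in [0, 1]."""
    return 0.0  # replace with real model inference

while True:
    for url in CAMERA_URLS:
        snapshot = requests.get(url, timeout=10)
        score = fire_score(snapshot.content)
        if score >= FIRE_THRESHOLD:
            print(f"possible fire at {url} (score {score:.2f})")
    time.sleep(POLL_SECONDS)
```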

Kit
Now, what’s happened since we launched is one, there haven’t been that many wildfires. So, we haven’t had really good data to test, I mean, it’s very fortunate, actually, that there haven’t been…

Julián
Of course.

Kit
that many wildfires, of course. But, you know, purely from a data perspective, it would be nice to have some examples to train on, but we looked at some of the false positives. And so our accuracy on there, I’ll just be honest, is not very good at this point. It’s a side project that the team’s running; if you’re familiar with Google’s 20% projects, it’s kind of like that.

Kit
It’s something we do to test out our software, but it’s not something that’s being, like, actively developed and maintained. It’s never going to make money. It’s just really a testbed for us to try out different things when we see it. So we still run into a lot of problems with things like headlights at night and sunsets and sometimes smoke.

Kit
These all trigger our fire detection algorithms. And, you know, we’ve looked at strategies to improve them. You know, frankly, it hasn’t gotten as much attention as it probably should have over the last year since we launched it. But that’s something we’re doing, and my hope is that this will become a showcase for what the technology can do, but at the same time, doing something good for the public.

Kit
So anyway, that’s the idea. And if anyone out there wants to help contribute to wildfirewatch.org, we would like to open-source it and get more people contributing to it. It’s just been on the back burner to kind of finalize the steps.

Julián
That would be amazing, especially after what happened recently in Los Angeles with all these fires. If there is a way to have, like, early prevention through technology like this, that, as you say, will definitely help communities and prevent disasters.

Kit
That’s right. Yeah, that’s the dream. And, I mean, it’s very unfortunate—the LA fires were just a total tragedy. The wildfires in Washington, you know, they often happen in very remote locations, and it’s very expensive to go out and check in person. So having this infrastructure should help them both do early detection and, in some cases, manage controlled burns and things like that. But yeah, I agree with you. The idea of using AI as an early prevention system, in concert with human vigilance and these other preventative systems, I think is where the future needs to go.

Julián
You mentioned that this was trained on public camera feeds, like traffic cameras and such. Can this also rely on satellite video or imaging, or is that not as easy or accurate?

Kit
Well, in principle, any video feed, including infrared cameras, etc., could be used for vision and vision as, you know, as a technology, right, is used for, you know, microscopic and other things that are not what we just see with our eyes. So in principle, you could use any data source for it. I would say in the case of satellite images, it’s not something we’ve specifically explored for this project.

Kit
I think it could be a cool add-on. I haven’t personally looked at any satellite images of fires, and I also don’t know what the licensing implications would be. But yeah, in principle, there’s no reason why it couldn’t be done, it’s just out of scope of this, you know, this very small starter project. I mean, for us, we actually had the first version of Wildfire Watch working in a matter of days because the camera feeds, as I said, are just all available on this website.

Kit
So we were able to go get those and start scraping them and start processing very, very quickly, which is why this idea came to life so fast. I think if we had gone down the path of satellite imaging, we probably would have had a delay in our time to market. But this brings up a point about businesses trying to adopt vision: I often tell them to kind of start with the cameras they have and see if we can get some sort of value out of it.

Kit
And oftentimes they do need to, you know, add new cameras, and they need to improve their lighting and other things to get the level of information. But it’s really an iterative process. I think that’s the key to all these types of AI projects is people will think, okay, I got an idea for what I think I can do.

Kit
And then reality gets in the way. And so we have to experiment and try different things. But I think over time it’s going to get easier and easier to, you know, literally just write a prompt, get some data together, and start down the path of having a real vision-powered solution. Especially, again, as agents become more and more prevalent, this is going to be the way we do things. Right now there’s a lot of figuring out, but in the future that’s going to become know-how—and hopefully know-how that’s embedded into our software and our AI systems themselves, so they can help us reinforce that learning and make these systems quick and inexpensive and effective, which is really what we’re trying to do.

Julián
How do you envision that agentic feature of… you mentioned something like giving an LLM vision on video. Because right now you can have like multimodal applications that can, sure, read an image or a couple of images and get a description out of the image. But I have not seen that on video.

Kit
Yeah.

Julián
I’ve seen on video, just let’s say, getting a transcription or the audio out of a video as part of the context of an agent application, but I have not seen what you mentioned, like real vision to a model. How do you envision this is going to be implemented in the near future?

Kit
Yeah, it’s a great question. And it’s a tough one because like you said, we’ve seen all this multimodal stuff and text has become the main mode of operation for LLM-type systems. And we do see a lot of generative video. And I think we’ve all experienced, you know, the quality, it’s amazing what we’re seeing now with generative video.

Kit
But the question of processing real video for accuracy—it can be done with LLMs. It’s just incredibly expensive today. This is really the big challenge. If you took, I think it was a full day’s worth of 4K video processed by the OpenAI API, it’s something like $18,000 a day, something like that. Yeah, it’s very, very expensive.

Kit
And so the first challenge, and this goes back to what I said at the beginning, the first challenge is like, if you did that, a lot of that data is like incredibly redundant. So there’s a semantic compression and information compression part of the problem because we don’t want to just dump all that video into an LLM. We want to find the meaning out of it.

Kit
But wouldn’t it be better to give the LLM text as the context instead, or JSON? Or, at a minimum, get it down to frames. Like, I’ll give you an example. Think about a video that’s watching a shelf, right? And there’s a bottle of your favorite beverage, you know, Liquid Death or something, right on the shelf.

Kit
And somebody reaches in and grabs the bottle or the can and takes it out of the frame. That video, maybe you had it running 24 by 7, and of that 24 by 7, you only care about the 3 or 4 seconds that the inventory item was changing. And of those 3 or 4 seconds, you really only care about two frames.

Kit
You care about the before frame and the after frame, right? And so if you think about the difference between a 24/7 streaming video versus two frames, it’s an incredible amount of compression just there. So the way that we think about this is that the filters can have various levels of accuracy and also various levels of algorithmic complexity.

Kit
So lower algorithmic complexity… these are trade-offs, right? The lower algorithmic complexity means I can run it on cheaper hardware and closer to the edge, closer to the camera, without having to invest in significant hardware. And this is a real problem, because GPUs are the new catalytic converters—that’s my joke. Because you have these GPUs—I have this very expensive piece of equipment that’s hard to maintain, sitting in, you know, a random coffee shop. It’s now a target for theft. And it can also be abused for, you know, Bitcoin mining, etc. There are all kinds of problems with just deploying lots and lots of GPUs without good oversight and management. So okay, what if we could do it just with CPUs?

Kit
So then the question would be, well, what’s the maximum algorithmic complexity? It’s not going to be much. But we can look at pixel data, we can look at color data, etc. And so we want to architect our system in such a way that we have a lower algorithmic complexity, and the advantage is lower cost. The disadvantage is that you’re going to be constrained on what that can detect, right?

Kit
It’s not going to be able to have AI-level detection. It will give you—a shorter context window is maybe a way of thinking about this—but if I have a scene, I can look at changes in the color values of that grid. I mean, that’s all video and images really are: a matrix of RGB values.

Kit
There’s actually no motion, there’s actually no objects per se. It’s like it’s all derived from these RGB values at the base layer. So okay, so once you have this infrastructure where you can now look at a scene and have a low computational complexity, then you can start to trigger events.

Kit
And this is, in this scenario I described earlier, like this inventory tracking. The first thing you could do is have a filter that all it knows how to do is watch for color value changes. And then, once the color value changes for a region of interest, have that trigger an event that looks at the full frame and looks for the beginning and end frames, etc.
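A minimal sketch of that low-complexity trigger: compare the mean color of a region of interest between frames on the CPU, and only wake the heavier filters (or an LLM call) when it shifts. The ROI coordinates and threshold are assumed values.

```python
# Cheap CPU trigger: watch the mean color of a region of interest (ROI)
# and fire an event when it changes enough. ROI and threshold are assumptions.
import numpy as np

ROI = (slice(100, 200), slice(300, 400))   # rows, cols of the shelf region (assumed)
THRESHOLD = 25.0                           # per-channel mean change that counts as an event

def roi_changed(prev_frame: np.ndarray, frame: np.ndarray) -> bool:
    prev_mean = prev_frame[ROI].reshape(-1, 3).mean(axis=0)
    curr_mean = frame[ROI].reshape(-1, 3).mean(axis=0)
    return bool(np.abs(curr_mean - prev_mean).max() > THRESHOLD)

before = np.zeros((480, 640, 3), dtype=np.float32)
after = before.copy()
after[ROI] = 200.0                         # simulate the can being removed
print(roi_changed(before, after))          # -> True
```

Only when this returns True would the before and after frames get handed to a heavier, GPU- or LLM-backed stage.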

Kit
So those are all just different parts of this application. So what we’re envisioning—and I know this is a little science fiction-y, so just go with me here, I’ll try to explain at the highest level possible—is, instead of asking the LLM what’s going on in this video or what’s going on in this frame, what we want to do is say, “Hey LLM, write me a program that will generate another program that runs algorithmically less expensively, and deploy that to this edge, close to the camera, in such a way that it’s aware of the cost constraints and will filter out images that are not worth being processed by the LLM.” Which I know is a complicated way to think about this, but simply, it’s like: I don’t want to process this frame; I want you to give me a program that will watch the frame for me. And that watch concept is what we sometimes refer to as a pixel prompt.

Kit
But the idea is that you’re basically creating these prompts on the fly that call back to the LLM when the event that they care about happens, right? And you’re pushing that on the fly as code, as a program that gets deployed into that edge—in this case, the edge filter. So now I can say, okay, for this short period of time, set up a watch on this region of interest for motion, and when it happens, send me the frame that results from that motion. I can then even do some additional pre-processing before it goes into an LLM, using a series of other filters—perhaps running in the cloud, perhaps using GPUs, perhaps with AI models that are preventing the wrong data from getting into the LLM. So it’s a cost-conscious architecture.

Kit
It’s a sophisticated architecture, frankly. But the result is that you have a general-purpose vision-event-based system that allows you to deploy across multiple configurations and architectures. And the result is that you can take a variety of live video streams, and you can produce these very robust applications. But you have to remember that a lot of it becomes ephemeral and gets thrown away because, as the scenes are changing, you have to constantly be rewriting these different programs that are cost-conscious.

Kit
That’s really, I think, the key idea. Now, if we can do that—and again, I’m speaking very abstractly because this is a very infrastructure-y problem—we can enable these really cool, prompt-based scenarios to come to life, where you can say to an agent, you know, hey, your job is to watch this shelf, and if anybody picks up a can, then decrement this table, and if there are any exceptions, then do this other workflow, and blah, blah, blah, right? You can kind of start to build up a prompt description, and that removes a ton of application development. And if you change your mind, you can fix it; you can have exceptions and guardrails and all the other things that go into agent development—but with sight, right?

Kit
Not just waiting on text to come in. And this is still a work in progress. Agents are still, I would say, in their infancy, although we are seeing so much good tech coming out for them. The part that we want to play in this is, with our very robust vision infrastructure concept, tying it to these agents and then also allowing business stakeholders to create on-the-fly, vision-based applications merely by writing prompts into a system like, you know, Agentforce or other agent platforms—not just with text, but with vision data as well, and without being a vision engineer.

Julián
You say it is science fiction-ish, but this is a very fast-paced ecosystem. It’s constantly evolving. So whatever is science fiction today is going to be a reality tomorrow.

Kit
I agree. I agree. It’s moving incredibly fast. I mean, there are definitely limits to what people will want to do. I think there are more economic limits than technological limits, if you know what I mean. It’s like, at some point it’s never going to be worth it to have these GPUs running to do certain tasks.

Kit
But for the most part, you’re right. It’s things that we thought were not possible, or we thought were a toy, and now are becoming production-grade systems, and it’s ubiquitous. It’s absolutely ubiquitous, the LLM movement. Almost to the point where now people don’t even really think about it as AI, it’s just another tool that they use, and it’s just normal. How quickly we move from oddity to luxury to necessity, right?

Julián
It’s definitely becoming a commodity right now.

Kit
That’s right.

Julián
Let’s talk about the future. What are you looking ahead to in 2026? Is there any emerging technology being tested today that you are looking forward to trying out or implementing? And how is this ecosystem of computer vision evolving?

Kit
Yeah, there are a few areas that I’m particularly interested in here. One, I’m less of a machine learning… you know, that’s less of my expertise. I know there’s a lot of cool stuff emerging there, but I kind of just see this steady drumbeat. One of the cool things for Plainsight is whenever this technology becomes available, we just add more filters, and we wrap that up with a consistent API.

Kit
It’s almost like a black box for vision, to be honest, which really makes the cutting-edge machine learning technology incredibly consumable. One thing I’ve been looking at, because I pay a lot of attention to infrastructure, is Wasm—that’s WebAssembly—and WebAssembly being used in particular for cloud computing.

Kit
So you know, wasmCloud. And, you know, today we package our filters as Docker containers. And one of the things I’ve been thinking about is what would happen if we could package them as WebAssembly components. And I think that’s one area of evolution that will be interesting for us to look at. I also have been paying a lot of attention to edge hardware. And edge hardware today, I think there was a belief that a lot of GPUs would be deployed at the edge and these sort of edge appliances.

Kit
As I was kind of mentioning before, I think that’s, you know, not realistic. And so what we’re seeing is new form factors for inference workloads, in particular edge workloads. And I think we’re also going to see—we’re already starting to see—distributed near-edge cloud environments that have GPUs enabled and will be optimized for inference, which I think is going to be hugely valuable infrastructure. Because one of the big problems, obviously, with vision is the data transport and the turnaround time.

Kit
So a lot of times what people will do is they’ll scope out these hardware solutions and try to put everything into a, you know, into a box. And then now you have another box, right? This is kind of a problem, right? It’s like you’re trying to build some general-purpose thing. So, as we’re seeing more options for where to deploy things, this, to me, is what’s going to make this whole thing ubiquitous.

Kit
We’re testing some new cameras that have the ability to run Docker containers and GPUs on the box, which is…

Julián
Oh really?

Kit
…on the camera. Yeah. Which is a very cool solution. And so this is going to open up some really cool possibilities where you’ll be able to effectively take the camera, have it be all-in-one. In fact, there are some models that even have wireless connectivity built in.

Kit
So you’ll really have an all-in-one camera. And we’ll be able to do software-defined AI on that camera on the fly, which is just like awesome. And then being able to hook that seamlessly into a cloud system. So you say, okay, this camera is now the front end of a cloud system. And then be able to configure and manage that.

Kit
That’s a really cool use case. The hardware part… with this vision, you really have to think end-to-end on the technology side. I think the other part of this for us is, I would say, a business model change, or opportunity we’re seeing in the market. So I was mentioning before, kind of the dev ecosystem of taking our vision and integrating it.

Kit
And so we’re investing a lot in our vision services network—effectively taking both services and product companies that want to add vision to their suite of services or to their products, but don’t know where to start, or are using, you know, open source and kind of hacking something together. We’re giving them the ability to take this vision and embed it into their products and take it to market in a very scalable way.

Kit
And so I’ve been investing a lot in, like, the training materials and education, and, you know, we’re building this thing we’re affectionately calling the Vision Academy. We haven’t launched it yet, so, you know, it’s coming soon, but we’re working on a bunch of ways to not only help on the technology side, but also to educate and engage with people who want to be in this space.

Kit
I think there’s a role for a vision engineer as a job description. It brings together, actually, quite a few disparate roles. Where in the past there was, you know, application developers and there were these machine learning people, this vision expertise is really its own thing. And if that role evolves, then I think those people who, you know, decide to go down this path can have a very exciting and lucrative career building vision solutions.

Kit
And so we want to help build that community. So those are some things I’m looking forward to more. I would say on the technology front, I just assume that the LLMs are going to get better, that there’s gonna be more GPUs, that NVIDIA will keep making them. You know, it’s like there’s a lot of these things that have sort of become now stable features of the AI ecosystem.

Kit
And I think the question now is, like, how do we pivot from the hype to productivity? And that’s where I’m really focused. I want to make this stuff just high quality, easy to adopt—you know, braindead ROI, just obvious solutions—versus today, where it still is very much an innovation project or, you know, something that’s an experiment, or people are doing AI because they think they have to, as opposed to having a real, true, driving business need to adopt it.

Kit
And we’re seeing… at least I’m seeing a lot more people who are trying to solve real, kind of sticky business problems. You know, counting is a big one, but quality control is another. Just to show another use case to think about, right? One of the things that’s happened is there’s a huge rise in mobile ordering of food.

Kit
I don’t know if you’re familiar with this trend, but ever since COVID, it’s become, seems like this, you know, oh, just like, send the food to my house. You know, that’s just the way it is. And unfortunately, this creates a little bit of a customer service challenge if somebody makes a mistake. And mistakes happen, right? So, you know, a couple of things that can happen.

Kit
One is like, a high-value item could be left off the order. You know, it’s like, hey, I wanted avocado on this salad. You know what I mean? This kind of thing. Or it could be that there was a condiment that was put on that maybe is offensive. Like, I guess some people don’t like mustard, it turns out, so.

Kit
There are certain, you know, mistakes that are more important than others. Like, you know, hey, I forgot one popcorn shrimp. It’s not that big of a deal. But like, if I put mustard on the burger and I specifically asked to not have it, I’m not gonna be happy. Or like, if I forgot to put the protein on the salad.

Kit
So we’re seeing a trend of people who are trying to solve this problem, both at the level of the platform companies that are offering food delivery, as well as the end restaurants that are, you know, scaled and, you know, quick-serve restaurants, for example, that are trying to prevent these mistakes. And so that’s one I see as a big growth opportunity for vision because we can kind of build a helper for the people to help them improve accuracy without a huge change to their workflow.

Julián
Yeah, how many times have I had, like, the wrong order…

Kit
Yeah.

Julián
…coming to my door. It’s great to see that things that were just research are now becoming businesses. So, it’s amazing to see all the evolution that Plainsight is having, and what’s coming next, in your task of building the Netflix for robots.

Kit
Thank you very much. Yeah, I’m excited.

Julián
Kit, thank you very much for joining us here at Code[ish], the Heroku podcast, and for sharing your story and this amazing technology. Is there anything that you want to say to the audience before we say goodbye?

Kit
Well, I will say if you’re going to be at the Embedded Vision Summit on May 20th, we will be there, Booth 518, and I’m speaking, and it’s gonna be a lot of fun. It’s in Santa Clara, California. But otherwise, yeah, check us out on plainsight.ai, and thank you very much for having me.

Julián
Kit, thank you very much. See you on the next one.

Narrator
Thanks for joining us for this episode of the Code[ish] podcast. Code[ish] is produced by Heroku, the easiest way to deploy, manage, and scale your applications in the cloud. If you’d like to learn more about Code[ish] or any of Heroku’s podcasts, please visit heroku.com/podcasts.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Subscribe to Code[ish]

Hosted By:
Julián Duque
Principal Developer Advocate, Heroku
@julian_duque
with Guest:
Kit Merker
CEO, Plainsight