OpenAI launched GPT Realtime 1.5, and it's broadening our minds on what's possible and the things you can do on a website.
—
TIMESTAMPS:
0:00 This is GPT Realtime update!
00:13 Demo
01:26 How does this compete with using higher-level agents?
02:21 Realtime responses in your app
03:16 This needs further testing...
04:20 Edgar Allan website
—
Unlock the full potential of your online presence with Kabarza and Samuel—experts in web design and development (respectively), powered by cutting-edge AI solutions. We blend creative design with advanced tech to deliver smart, high-impact websites that stand out. Ready to elevate your business? Contact us today and see what AI-driven innovation can do for you!
LINKS & RESOURCES:
Website: https://cmdaishow.com
Check out Kabarza's amazing work: https://kabarza.com
Visit Samuel's website for more: https://samuelgregory.co.uk
📷 Follow on Instagram: https://www.instagram.com/cmdaishow
—
HASHTAGS:
#ai #podcast #aidesign #aidevelopment #vibecoding #webdesign #webdevelopment #ainews #webnews #designnews #devnews
Transcript
This is GPT Realtime, which is a voice model, and they've made an update which allows it to dynamically control your apps. Command AI.

I want to show you a few ways to build with our Realtime voice API so that your users can interact with the application more naturally with voice. This includes things like changing your settings, filling out forms, or even playing a game of chess. Let's say I run a website for a company and I want to improve the accessibility by adding some kind of voice widget. The pattern here is that, using the Realtime API with function calling, I can control my application state. The model can talk to me using notifications, and it can also update the interface directly by calling tools that I've defined. This is not computer use or screen scraping. The application owns the state, and the model calls these functions that change that state. In this demo, I've even added a cursor just to show the user what's happening on the screen.

Let's start with a practical example. Here I've defined a simple application with a wake word and the ability to go from light mode to dark mode. Hey chat, can you go to dark mode? Can you go back to light mode? So it's that simple. Now let's go into the form demo. Here I can fill out a form using just my voice. My name is Jason Lou. My birthday is March 27, 1994. And can you make sure that I accept the terms? Great. Can you submit that form?

But my understanding is you could build this into your website or app and it will just do things while you ask it. However, whilst I was watching this, I was thinking, well, surely this now exists on the layer, you know, above the app, like Clicky, which is the thing we demonstrated at the beginning of the show, or like Comet browser or Dia browser or whatever, or computer use with Claude Code. Surely this exists a lot higher, whereas this is saying you can build it into the app experience to call certain functions.
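The pattern described in the demo (the application owns the state, and the model only calls functions the app has defined) could be sketched roughly like this. This is a minimal sketch, not the code from the demo: the tool names (`set_theme`, `fill_field`, `accept_terms`, `submit_form`), the `appState` shape and the `dispatchToolCall` helper are all made up for illustration, and it assumes the model's tool call reaches the app as a name plus a JSON string of arguments.

```typescript
// Hypothetical app state for the light/dark and form demos; all names are illustrative.
type Theme = "light" | "dark";
type FormField = "name" | "birthday";

interface AppState {
  theme: Theme;
  form: { name: string; birthday: string; acceptedTerms: boolean };
  submitted: boolean;
}

// The application owns this state. The model never touches it directly.
const appState: AppState = {
  theme: "light",
  form: { name: "", birthday: "", acceptedTerms: false },
  submitted: false,
};

function isTheme(value: unknown): value is Theme {
  return value === "light" || value === "dark";
}

function isFormField(value: unknown): value is FormField {
  return value === "name" || value === "birthday";
}

type ToolHandler = (args: Record<string, unknown>) => string;

// One handler per action the voice model is allowed to trigger.
const toolHandlers: Record<string, ToolHandler> = {
  set_theme: (args) => {
    const theme = args.theme;
    if (!isTheme(theme)) return "error: theme must be 'light' or 'dark'";
    appState.theme = theme;
    return `theme set to ${theme}`;
  },
  fill_field: (args) => {
    const field = args.field;
    const value = args.value;
    if (!isFormField(field) || typeof value !== "string") {
      return "error: expected a known field and a string value";
    }
    appState.form[field] = value;
    return `${field} filled`;
  },
  accept_terms: () => {
    appState.form.acceptedTerms = true;
    return "terms accepted";
  },
  submit_form: () => {
    if (!appState.form.acceptedTerms) return "error: terms not accepted";
    appState.submitted = true;
    return "form submitted";
  },
};

// Route a tool call (name plus JSON-encoded arguments) to its handler.
// The returned string is what the app would report back to the model.
function dispatchToolCall(name: string, rawArguments: string): string {
  const handler = Object.prototype.hasOwnProperty.call(toolHandlers, name)
    ? toolHandlers[name]
    : undefined;
  if (!handler) return `error: unknown tool ${name}`;
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return "error: arguments were not valid JSON";
  }
  if (typeof parsed !== "object" || parsed === null) {
    return "error: arguments must be a JSON object";
  }
  return handler(parsed as Record<string, unknown>);
}
```

In this sketch, "can you go to dark mode?" would end up as something like `dispatchToolCall("set_theme", '{"theme":"dark"}')`, and anything the app has not defined a handler for is simply rejected.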
But does that person have to build a function to be able to write the name in, or build a function to write their birthday into that field, or is it some sort of screenshot thing? I don't know. So I'm not too sure.

Now, every time I move a piece, I want you to respond automatically. Can you do that for me?

So now it's just real time. So he uses the application, from what I understand, and the application responds back all by itself.

Now, let's reset the board.

How much of this do you need to build yourself? If that was a website, why wouldn't you just use Claude Code with it? But I mean, it's cool, don't get me wrong. The real time, I think, is the point. So it's kind of like an always-on thing in a way. Does that make sense? That's my understanding.

"Realtime voice component: React browser voice controls for tool-constrained UIs, built on OpenAI Realtime." Okay, so it's built on top of Realtime. Is it 1.5? You said it was 1.5, but this looks like a whole new thing though. GPT Realtime 1.5. "So users can control app state more naturally with voice." It's interesting. I honestly think this is one of those things where it'll be cool to play around with and see what you can actually do with it. Is it like, can you have an application that builds its own UI based on what the user needs in real time?

"Your app defines the exact actions the assistant can take." So you need to build all of that stuff in. "Tools, state: app-owned and narrow. The UI remains responsible for the visual state change. You want a React-friendly controller and an optional launcher widget." So I think you do need to build the responses in, and this is just a nice way to, basically, it handles the understanding and turns that into essentially tool calls that bind to, here we go, "reusable controller with binding." So it binds to certain actions.
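To make "your app defines the exact actions the assistant can take" a bit more concrete, here is a rough sketch of binding tool definitions to app actions. It is not the API of the React component being read out above: `VoiceToolRegistry`, `bind`, `invoke` and the `navigate` tool are hypothetical names, and the definition shape (a function tool with a name, a description and JSON Schema parameters) is an assumption about how tools would be described to the model.

```typescript
// Assumed shape of a function tool definition (name, description, JSON Schema parameters).
interface ToolDefinition {
  type: "function";
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: "string"; enum?: string[] }>;
    required: string[];
  };
}

type ToolAction = (args: Record<string, string>) => void;

// Hypothetical registry: binds each definition to the one app action it may trigger.
class VoiceToolRegistry {
  private bindings = new Map<string, { definition: ToolDefinition; action: ToolAction }>();

  bind(definition: ToolDefinition, action: ToolAction): void {
    this.bindings.set(definition.name, { definition, action });
  }

  // The list the app would advertise to the model when configuring a session.
  definitions(): ToolDefinition[] {
    return Array.from(this.bindings.values()).map((b) => b.definition);
  }

  // Runs the bound action and returns true only if the call matches a bound tool and its schema.
  invoke(name: string, args: Record<string, unknown>): boolean {
    const binding = this.bindings.get(name);
    if (!binding) return false;
    const { properties, required } = binding.definition.parameters;
    if (required.some((key) => !(key in args))) return false;
    const checked: Record<string, string> = {};
    for (const key of Object.keys(args)) {
      if (!Object.prototype.hasOwnProperty.call(properties, key)) return false;
      const spec = properties[key];
      const value = args[key];
      if (!spec || typeof value !== "string") return false;
      if (spec.enum && spec.enum.indexOf(value) === -1) return false;
      checked[key] = value;
    }
    binding.action(checked);
    return true;
  }
}

// Example binding: voice navigation limited to two known pages.
let currentPage = "home";
const registry = new VoiceToolRegistry();
registry.bind(
  {
    type: "function",
    name: "navigate",
    description: "Go to one of the site's pages.",
    parameters: {
      type: "object",
      properties: { page: { type: "string", enum: ["home", "contact"] } },
      required: ["page"],
    },
  },
  (args) => {
    currentPage = args.page ?? currentPage;
  },
);
```

The point of the sketch is the narrowness: the model can ask for `navigate` with `home` or `contact`, and anything else (an unknown tool, a page outside the enum, a missing argument) is refused before any app code runs.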
So you have to build that bridge, but it does the heavy lifting of interpreting what the user says. It's like MCP in some way, I think.

Can you open a new tab and go to Edgar Allan, please?

They have this agentic mode where you ask something and then it sends the response to an AI, and then you see how you get, like, these pieces of the UI. All of these are predefined. So from my understanding, there's a set of predefined ones and the AI responds with what it best understands. And yeah, you could work it into this website, I think, but it wouldn't be the same sort of functionality. I think it would literally be, you know, on their form, let's say "Get in touch", you could navigate this form with your voice.

Okay. So, I'm not sure if I'm going to be able to explain this correctly, but with this, the real time happens in the voice, and it would call a defined set of functions or tools or something like that, right? So the real time happens in the voice, whereas with what Edgar Allan have right now, the real time happens after you've typed it in. So the content itself is dynamic. So whilst you could probably do that using GPT Realtime, it's not really what it's intended to do, because this is probably RAG storage of some description, where it's understanding what we're saying. So we could type a bunch of typos in the text box, but I don't think it will always, I don't know, actually. It might not always write the same bit of text. You can correct me, but I think the responses are pretty much predefined. I've just said, yeah. So there is an element of, you know, "Why did the scarecrow win an award? Because it was outstanding in his field." You know, there is still a little bit of AI-generated stuff. Yeah. There's a middle ground here where I think GPT Realtime would help, but it wouldn't be getting the most out of it.
There are probably simpler ways to give this voice interactivity just by using voice to text, you know, just a voice-to-text model. And again, I'm not saying that Edgar Allan couldn't use GPT Realtime 1.5. I'm just saying it wouldn't be the best use case, because you're seemingly able to bind to specific functions and actions, as opposed to what Edgar Allan do, where it's got free rein a little bit to do whatever it needs to do, whereas you're locking voice actions down to specific functions with GPT Realtime, which has its own set of benefits and whatever. But all that being said, there's a middle ground here where, yeah, you could bring voice to this, and we should send this video to Mason. Mason, if you are watching, we should have Mason on the show, and you can tell us how you made the website, and that would be really cool, and a bunch more about AI.

This is part of a larger conversation on my show Command AI, which we stream live every single week. We discuss the news and all things related to AI in the world of design and web. Catch us next week and join in the banter. Command AI.