Show HN: Multimodal perception system for real-time conversation

(raven.tavuslabs.org)

53 points | by mert_gerdan 2 days ago

14 comments

  • arctic-true a day ago

    This is super interesting. But I have to wonder how much it costs on the back end - it sounds like it’s essentially just running a boatload of specialized agents, constantly, throughout the whole interaction (and with super-token-rich input for each). Neat for a demo, but what would it cost to run this for a 30 minute job interview? Or a 7 hour deposition?

    Another concern I’d have is bias. If I am prone to speaking loudly, is it going to say I’m shrill? If my camera is not aligned well, is it going to say I’m not making eye contact?

    • mert_gerdan a day ago

      So the conversational agent already runs on a provisioned chunk of compute, but that chunk isn't utilized at 100% of its capacity. For this perception system we're taking advantage of the spare compute left over from what's provisioned for the top-level agent, so turning this on costs nothing "extra".

      Bias is a concern for sure, though it adapts to your speech patterns and behaviors over the course of a single conversation. So if it flags you as not making eye contact because, say, your camera is on a different monitor, it'll make that mistake once and not refer to it again.
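
      A purely hypothetical sketch (not Tavus's actual implementation; every name here is made up) of the "mistake once, then don't refer to it again" behavior described above: per-conversation state that suppresses a signal once it has been explained away as a setup artifact rather than a behavioral cue.

```python
# Hypothetical illustration of per-session signal suppression.
# Names and structure are assumptions, not the real system.

class ConversationPerceptionState:
    """Tracks which perception signals have been dismissed this session."""

    def __init__(self):
        self.dismissed = set()  # signals attributed to setup (e.g. camera offset)

    def should_report(self, signal: str) -> bool:
        """Report a signal only if it hasn't already been dismissed."""
        return signal not in self.dismissed

    def dismiss(self, signal: str) -> None:
        """Mark a signal as a setup artifact, not a behavioral cue."""
        self.dismissed.add(signal)


state = ConversationPerceptionState()
assert state.should_report("no_eye_contact")      # first occurrence: flagged
state.dismiss("no_eye_contact")                   # camera is on another monitor
assert not state.should_report("no_eye_contact")  # not flagged again this session
```

      The real adaptation is presumably far subtler; this only illustrates the "flag once per conversation" idea.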

  • ycombiredd a day ago

    Hmm.. My first thought is: great, now not only will e.g. HR/screening/hiring hand off the reading/discerning tasks to an ML model, they'll outsource the things that require any sort of emotional understanding (compassion, stress, anxiety, social awkwardness, etc.) to a model too.

    One part of me has a tendency to think "good, take some subjectivity away from a human with poor social skills", but another part of me is repulsed by the concept, because we've seen how otherwise capable humans will defer to the perceived "expertise" of an LLM, whether out of genuine belief in the machine or laziness (see the recent kerfuffles in the legal field over hallucinated citations, etc.).

    Objective classification in CV is one thing, but subjective identification (psychology, pseudoscientific forensic sociology, etc) via a multi-modal model triggers a sort of danger warning in me as initial reaction.

    Neat work, though, from a technical standpoint.

    • mert_gerdan a day ago

      Appreciate the feedback, truly. It's an interesting concept to explore: deferring human "expertise" to technology has been happening for years (and has most definitely accelerated recently), and we've found ways to adapt to and abstract over the work being deferred. But the growing pains are probably most acute when that deferment happens rapidly, as in the case of AI.

      Don't want this to turn into a Matt Damon in Elysium type of situation for sure with that scene with the parole officer hahah (which would stem from a poor integration of such subjective signals into existing workflows, more so than the availability of those signals)

      As for emotional intelligence, I personally see it as a prerequisite for any voice / language model that interacts with humans: just as an autonomous car has to be able to identify a pothole, a voice / video agent has to be able to navigate a pothole in a conversation.

      • ycombiredd 20 hours ago

        You prompt an additional thought on the topic: as much as I expressed a sense of dread at the inevitable use of this sort of tech in hiring pipelines (not by agents, necessarily; my initial envisioned use case was a sort of HUD overlay on a video call between humans), I suppose that just as the AI interviewer bots I have thus far refused to engage with will inevitably be unavoidable for anyone on the job hunt, so will the use of this sort of multi-modal sentiment analysis. (Same with the justice system use case you referenced in your metaphor, and probably therapists and such will follow.)

        As such, I wish you the best of luck with this project - earnestly so - because if, as I suggest, it is inevitable... we want such a system to be as good as possible.

        An aside: another inevitable use case just came to mind: cheap, shoddily implemented, poorly tested (along with insecure, surveillance-adjacent) kids' toys with embedded AI, and the sardonically humorous privacy mishaps and unintended actions that will come from such low-quality implementations being sold (see: the LLM-enabled kids' toys currently popping up routinely at retailers). Ha! Sorry I keep taking your cool demo to dystopian extremes. :)

        Oh, one more thing... Upon re-reading my previous comment, I recognize that describing my visceral reaction as one of being "repulsed by the thought" could literally be read as me calling your system "repulsive", which was not my intent. I think your tech is cool, and I was just trying to convey two conflicting feelings that occurred within me when thinking about the future commercial use cases. If, as I suggest, this tech is inevitable (as the last few years of "LLMs everywhere!" have forced us all to adapt; accept or reject it, it still requires new effort), we should hope for a good and working system, so I hope you succeed in making one.

        Lastly, to your self-driving/potholes analogy... I do think that fits more in line with my "objective CV classification" category; a closer fit to what you're building would be a self-driving car having to handle the Trolley Problem, with all the nuances of human value judgements: does the car swerve into two adults vs. one child? And so on. Pothole classification is objective, while deciding whether to drive into it or swerve around it, classifying pedestrians, and choosing one to possibly collide with are subjective and more complicated (as is your system and the functions it can perform).

        Best of luck!

    • rl3 a day ago

      HR: 1187 at Hunterwasser.

      Candidate: That's the hotel.

      HR: What?

      Candidate: Where I live.

      HR: Nice place?

      Candidate: Yeah, sure. I guess. Is that part of the test?

      HR: No. Just warming you up, that's all.

      • ycombiredd 20 hours ago

        "It's a test - designed to provoke an emotional response. "

        I was going to follow this with something like "except the role of analyzing the emotional response is reversed", and then I wanted to expound with an "ooh but.. wait, there's another metaphor here since ..." but thought I've already potentially approached "spoiler alert" territory so I'll just stop there. Those who know the reference I am replying to will know; those who don't, well, don't google any of this or its parent cuz spoiler alert

  • edbaskerville a day ago

    Old Macs in the background. Electronic soundtrack reminiscent of Chariots of Fire, which played during the Mac intro.

    • mert_gerdan a day ago

      went hard on retro futuristic for sure

  • jesserowe 2 days ago

    the demo is wild... kudos

    • mert_gerdan a day ago

      appreciate it! let me know if any cool use cases pop into your head :)

  • ashishheda a day ago

    Wonder how it works?

    • mert_gerdan a day ago

      High level: a rolling buffer that uses the spare compute we're allocated for a conversation to achieve <80ms p50 results, with a small language model aligned on signals labeled from raw conversation data to produce these natural-language descriptions.
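
      A hypothetical sketch of what such a rolling buffer might look like; the buffer length, frame rate, and all names are assumptions, and the language-model call is stubbed out:

```python
# Illustrative only: a fixed-size rolling window of perception frames,
# periodically summarized. None of this reflects the actual system.
from collections import deque

BUFFER_SECONDS = 5   # assumed rolling-window length (not from the source)
FPS = 10             # assumed rate of incoming perception frames

class RollingPerception:
    def __init__(self):
        # Oldest frames fall off automatically once the window is full.
        self.buffer = deque(maxlen=BUFFER_SECONDS * FPS)

    def ingest(self, frame):
        """Append the newest multimodal frame to the rolling window."""
        self.buffer.append(frame)

    def describe(self):
        """Stand-in for the small language model that turns buffered
        signals into a natural-language description (<80ms p50 in the
        real system, per the comment above)."""
        if not self.buffer:
            return "no signal"
        return f"summary of last {len(self.buffer)} frames"

p = RollingPerception()
for i in range(100):
    p.ingest({"t": i})
print(p.describe())  # prints "summary of last 50 frames"
```

      The deque's maxlen gives the rolling-window behavior for free: new frames push out old ones, so summarization cost stays bounded regardless of conversation length.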

  • Johnny_Bonk a day ago

    Holy