Part 2: Communicate with a Character

In Part 1 of this series, I wrote out fourteen research questions towards building a creative agent that is more capable of communicating with humans. Here, we start filling in approaches to answering these questions by addressing questions 3, 11, 12, and 13. In my next post, I’ll talk about question 6 and Backchanneling.

Question 3

Problem: I want to lead, not just respond. The current LLM paradigm is a <prompt, response> cycle, and it’s quite hard to shake the agent out of that pattern and have it prompt us instead. At a meta level, it works to tell the agent that it’s talking to another person and to be engaging; the agent will often make its own queries then. But that just means we’ve shifted the responsibility of the <prompt, response> to the human (or engineering) wizard running the show.
Why do we lead conversations? It’s when we have a goal that we’re trying to fulfill in the conversation. If we didn’t have a goal, then we would just respond to what others ask of us. Goals can be stated simply or with complexity. Here are some examples:
  1. I am curious about another’s emotional state.
  1. I am curious about another’s historic experience.
  1. I want to check into this hotel.
  1. I am an explorer of the world and want to minimize my uncertainty.
The language-learning app Quazel is an interesting case study. Every lesson is a conversation. At the beginning of the lesson, they give the human learning the language three tasks to accomplish, e.g. “Ask if [the AI character has] any pets” or “Inquire what activities [the AI character’s resort has] for teenagers.” As the human progresses through the lesson, a model, presumably GPT3.5, reads through new messages to assess whether they satisfy one of the tasks. It does this with sufficiently high accuracy. Because this is required before the app allows the lesson to be completed, it forces the user to be goal-directed.
We want exactly this from the agent, but we want it to be built in without having to pull such levers behind the scenes. However, that’s not a default ability for agents - they just don’t have goals naturally.
What we can do today is give them goals via their prompt that they have to accomplish and are reminded to accomplish through further prompting. This requires another agent to adjudicate whether the goal has been accomplished. Here’s an example using an interviewing agent, with a minimal code sketch of the loop after the list. Notice how similar it is to what Quazel does to humans.
  1. Prompt: “Your goal is to find out what Bob, the user, did during the Vietnam War. Be careful though because Bob may have bad memories of that time.”
  1. Conversation ensues.
  1. At each step of the conversation:
    1. Have another agent attest to whether, or to what degree, the goal has been accomplished.
    2. Accordingly, include a system prompt reminding the agent of its goal.
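Here is a minimal sketch of that loop, assuming an OpenAI-style chat completions client; the model name, goal wording, and YES/NO adjudication format are placeholders rather than a prescription.

```python
from openai import OpenAI

# A minimal sketch of the prompt / adjudicate / remind loop.
client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

GOAL = ("Find out what Bob, the user, did during the Vietnam War. "
        "Be careful: Bob may have bad memories of that time.")

def chat(messages):
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

def goal_satisfied(transcript: str) -> bool:
    """A second agent adjudicates whether the interviewing goal has been met."""
    verdict = chat([
        {"role": "system", "content": "Answer YES or NO only."},
        {"role": "user", "content": f"Goal: {GOAL}\n\nTranscript:\n{transcript}\n\n"
                                    "Has the goal been accomplished?"},
    ])
    return verdict.strip().upper().startswith("YES")

messages = [{"role": "system", "content": f"You are an empathetic interviewer. Your goal: {GOAL}"}]
while True:
    agent_turn = chat(messages)
    messages.append({"role": "assistant", "content": agent_turn})
    print(agent_turn)

    messages.append({"role": "user", "content": input("> ")})

    transcript = "\n".join(m["content"] for m in messages if m["role"] != "system")
    if goal_satisfied(transcript):
        break
    # Remind the interviewing agent of its goal so it keeps leading the conversation.
    messages.append({"role": "system", "content": f"Reminder: your goal is still: {GOAL}"})
```

Note how the goal lives entirely in the wizarding layer: the prompt, the adjudicator, and the reminder all sit outside the agent itself.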
What we’d actually like to achieve, though, is an agent that needs neither the helping prompt nor the adjudicating agent nor the reminder, but still leads the conversation rather than being a response engine. A “goal-oriented” agent that satisfies this would reliably be supplying its own, possibly implicit, goals in order to accomplish its tasks. A “surprise-oriented” agent, whose goal is to reduce its surprise, could also accomplish this because it would take the lead in order to further reduce its surprise about the world. It would plausibly be stupefying to humans though, because there are so many directions in which one could make the world less surprising and it’s hard to know which of those it would take. We could get a BFS agent just as much as a DFS agent and everything in between.

Experiment

In the context of an interviewing agent, an experiment worth trying is to see if we can imbue an agent with a singular intention of being an empathetic interviewer. The North Star would be to give a medium-length paragraph detailing the user’s past, say “James Doe was born in 1940 in Charleston, South Carolina. He grew up there until he went to college in Chapel Hill, …” and then expect the agent to perform the interview immediately and completely. Most of the time, this means asking incisive questions while remaining carefully attuned to the interviewee’s emotional state in order to understand whether to go deeper, switch to another area, or take a different tack such as adding more back history. We know that GPT* can’t do this out of the box because it’s unable to be its own orchestrator (what prompts to use, when to chime in, when something is answered enough, …). Perhaps it could be a meta orchestrator for another agent, but that still means there is a need to stitch the two together.
The experiment here asks whether we can obviate the need for the meta orchestrator by fine-tuning on what interviewers say, then running reward modeling towards being more of an interviewer personality. This will have worked when we can just give some background on the user and the agent then knows how to carry the rest of the conversation. The goal direction comes largely from the reward modeling, and almost all of the complexity is hidden there. Is RLHF enough? Maybe. Unclear until the experiment is run.
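For concreteness, the reward-modeling half usually comes down to a pairwise preference loss over responses that labelers judged more or less “interviewer-like.” A minimal sketch; the scores and helper are illustrative and not tied to any particular library:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) preference loss commonly used for RLHF reward models.

    r_chosen / r_rejected are the scalar scores the reward model assigns to the
    response labelers preferred (more "interviewer-like") vs. the one they rejected.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: scores for a batch of four preference pairs.
chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
rejected = torch.tensor([0.1, 0.5, -0.4, 1.1])
print(reward_model_loss(chosen, rejected))
```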

Question 11

Problem: I am talking with this party. How can we home in on who is speaking to the agent so that there aren’t problems in crowded areas?
The scenario here is that we have a group talking to the agent or talking in the vicinity of an agent. We expect it to know how to isolate who it’s talking to so that it ignores other inputs and addresses the right person. Humans do this through a combination of detecting that the voice is different, detecting that the voice is coming from elsewhere, and seeing faces moving. The first one requires only audio, the second multi-channel audio, and the third video. And that ordering is suggestive of how we should solve it as well.
The easiest approach, given ubiquitous product constraints, is the single-stream audio solution. Here’s a survey paper on this direction. If I were to try to solve this problem, I’d first look into off-the-shelf techniques, as there’s likely something that works well. It might not work well enough, though, and it might not work fast enough on a small amount of streaming data.
This open source model for speaker signatures from Microsoft is used in research evaluations, as seen in works like SoundStream. It seems pretty strong but, like all such models, it will likely need some finetuning depending on inference conditions. That data is not usually a problem to gather because either this is being inserted into an already existing app or we have some idea of what those conditions will be and can gather data from the wild. Once we have it, we can use existing tools (like speaker diarization, etc.) to parse the audio and train / finetune the models.
There’s definitely some engineering to get right here because there are a bunch of edge cases. For example, what if more than one person speaks at a time? Or what if one person speaks and then another interrupts? Detecting these smoothly is a challenge and will require a lot of testing, but it seems quite doable today.

Experiment

In the context of an interviewing agent, the experiment would be to run WavLM or a similar open-sourced speaker signature model in parallel with the main thread running the interviewing agent. In the first couple of turns, it would assess the signature of the speaker based on cues like response time as well as coherence of the response with the question. Once gathered, it would then use that signature to suppress other voices. Background noise unrelated to speaking is another matter, but there are other algorithms for that which are out of scope for this question.
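A minimal sketch of the suppression step, assuming the Hugging Face transformers WavLM x-vector checkpoint (microsoft/wavlm-base-plus-sv) and 16 kHz mono audio; the similarity threshold is illustrative and would need tuning to the inference conditions mentioned above.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

# Assumes 16 kHz mono audio chunks as 1-D float arrays.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

def speaker_embedding(audio_chunk):
    inputs = extractor(audio_chunk, sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).embeddings[0]

def same_speaker(reference_emb, candidate_chunk, threshold=0.86):
    """Keep only turns whose speaker embedding is close to the enrolled interviewee."""
    similarity = torch.nn.functional.cosine_similarity(
        reference_emb, speaker_embedding(candidate_chunk), dim=0
    )
    return similarity.item() >= threshold

# Enroll the interviewee on their first couple of turns, then filter later audio:
# reference = speaker_embedding(first_turn_audio)
# if same_speaker(reference, new_chunk): pass_to_interviewing_agent(new_chunk)
```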

Question 12

Problem: I feel a certain way and that feeling comes out in what I say. How do we add intonation that is properly expressive? In other words, how do I add the emphasis to connote excitement, curiosity, anger, etc.? This is important because it allows for a lot more understanding to be put through the same channel.
One way to do this is to add SSML tags, as seen in Amazon Polly’s offering. I’ve tried doing that with GPT3.5, instructing it to add this notation in order to intonate properly; it wasn’t easy to control, although the notation was included correctly. I suspect that there isn’t that much training data out there for this, and what there is isn’t enough to train the LLM given its lack of grounding in how intonation affects meaning.
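For reference, here is roughly what the SSML route looks like with Polly via boto3. This is a minimal sketch; tag support varies by voice and engine, and the voice, prosody values, and file name are placeholders.

```python
import boto3

polly = boto3.client("polly")

# A few of the SSML tags Polly understands for controlling intonation and pacing.
ssml = """
<speak>
  <prosody rate="95%" pitch="+5%">I can't believe you actually finished it!</prosody>
  <break time="400ms"/>
  <prosody rate="85%" pitch="-10%">So... what happens now?</prosody>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    OutputFormat="mp3",
    VoiceId="Joanna",  # placeholder voice
)
with open("line.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

The hard part isn’t emitting tags like these; it’s getting the LLM to choose them in a way that is grounded in how the line should feel.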
In the short term, the answer is to use the LLM to generate the text, then a text-to-speech (TTS) model with proven understanding of how intonation affects meaning to generate the speech. Models in this domain like Soundstorm are getting very strong. Companies like ElevenLabs are focusing on this as well.
However, two problems come up with this approach:
  1. Latency: If we have to wait for the LLM to finish before starting the speech generation, that’s a challenging bottleneck that would be better squashed. There are engineering solutions, like from Vocode, that constantly generate LLM responses and the ensuing speech (see the streaming sketch after this list). This means results are ready and waiting all the time; however, those results are both a) costly wrt computation and b) potentially volatile wrt intonation because of how information is distributed throughout the sentence, e.g. an exclamation mark will only appear at the end.
  1. What to intonate: The LLM’s output doesn’t have any intonation, so what does the TTS apply? It either has to be wizarded in or a ton of prior context has to be provided. Even if we can add intonation (and we can now), this approach punts the decision of what feeling to add to the larger system.
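One generic version of that pipelining, sketched with hypothetical token_stream, tts, and play stand-ins rather than any particular library’s API: stream LLM tokens, cut at sentence boundaries, and start synthesizing each sentence as soon as it closes.

```python
import re

def speak_streaming(token_stream, tts, play):
    """Cut the LLM token stream at sentence boundaries and hand each finished
    sentence to the TTS model immediately, instead of waiting for the full reply.
    token_stream, tts, and play are hypothetical stand-ins for your LLM streaming
    API, TTS model, and audio sink."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever we see sentence-ending punctuation followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            play(tts(sentence))
    if buffer.strip():
        play(tts(buffer))
```

This trades the volatility problem described above for lower latency: each sentence is intonated without knowing what comes after it.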
The ideal version is that we do this end to end in one model, and it frankly feels like we’re pretty close to achieving this given the generative dialogue results that we see in models like Soundstorm. There (and I encourage listening to the examples), some audio prompting is given as input and the model generates a full dialogue. Is this sufficient? There are a few issues remaining.
The first is that it’s likely a weak LLM. I say likely because it hasn’t been tested anywhere near as much as other LLMs and it’s quite small (~350m parameters). In order for this to be capable in many more scenarios, we would want to use a bigger model, which would in turn be able to eat a lot more data.
The second is where do we get that data? The LLMs we see are trained on a tremendous amount of text data. Just converting that text into audio isn’t enough because then the prosody is arbitrary, and prosody is half of the input to the underlying generative model in Soundstorm.
The third is how do we control it to respond as another agent with just audio input? Our LLMs get this for free because of how instructive we can make the text, but it’s not clear that that would happen with audio.

Experiment

An approach for solving all of this could be to take their technique but power it up with a fat LLM in the middle that’s been trained to output neural audio codec tokens (of which Soundstream tokens are an instance) conditioned on either text or neural audio codec token input. How would that work? Here’s a recipe, followed by a code sketch of the adapter setup:
  1. Gather a dataset of captioned audio.
  1. Convert that into (text, audio, AudioLM tokens) tuples.
  1. Finetune your favorite trained LLM, e.g. Llama2, like so:
    1. Half the time run the normal training procedure with the text dataset it was trained/rlhfed on, e.g. CCC.
    2. The other half the time train the LLM by inputting f(neural audio codec token) embeddings instead of word embeddings and outputting g(final layer activations) instead of just word embeddings.
    3. The function f: neural audio codec → word embedding is a learned function that is meant to be a translation between audio codec space and text space. Note that it has to preserve more information than just the text in order to get the tone right on the output.
    4. The function g: word embedding → neural audio codec is a learned function that is meant to translate the last layer back to neural audio codec tokens rather than to word embeddings.
  1. Optimize this against either the expected neural audio codec tokens (from the Soundstream model) or the decodings of those tokens and the original audio. The former is likely fine.
  1. Note that we may in fact not even need to finetune the full LLM, but instead just train f and g. That is worth doing first because it is faster to test, easier to train, and would preserve the learnings of the LLM from its text. This would also let us remove 3a.
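Here is a minimal sketch of the f / g adapters around a frozen LLM, in the adapters-only variant from the last step. It assumes a Hugging Face-style base model that accepts inputs_embeds; the vocabulary size, hidden size, and class names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioCodecAdapter(nn.Module):
    """Wraps a frozen text LLM so it can consume and emit neural audio codec tokens.
    f maps codec tokens into the LLM's word-embedding space; g maps the final
    hidden states back to codec-token logits. Sizes and names are illustrative."""

    def __init__(self, llm, codec_vocab_size=1024, d_model=4096):
        super().__init__()
        self.llm = llm  # frozen, pretrained text LLM (e.g. a Llama2 base model)
        for p in self.llm.parameters():
            p.requires_grad = False
        self.f = nn.Embedding(codec_vocab_size, d_model)  # codec tokens -> "word" embeddings
        self.g = nn.Linear(d_model, codec_vocab_size)     # final hidden states -> codec logits

    def forward(self, codec_tokens):  # codec_tokens: (batch, seq) of codec token ids
        inputs_embeds = self.f(codec_tokens)
        hidden = self.llm(inputs_embeds=inputs_embeds).last_hidden_state
        return self.g(hidden)         # (batch, seq, codec_vocab_size)

def adapter_loss(model, codec_tokens):
    """Next-codec-token prediction against the Soundstream tokens of the real audio."""
    logits = model(codec_tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), codec_tokens[:, 1:].reshape(-1)
    )
```

Unfreezing the LLM and mixing in the original text data (the 50/50 split above) is a straightforward extension of the same setup.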
The purpose of this is to build an e2e LLM that inputs and outputs audio. Eventually, we want this to happen in real time so that we can interrupt or backchannel without having to go to another model, but our solution does not do that (and it’s out of scope for this question). The reason our solution will fail at that is that the computational cost will be too high; there needs to be some summary understanding as we go along.

Question 13

Problem: I feel a certain way and that feeling comes out in how my face acts. Conditioned on having a face, how do we respond to and interact with the speaker? This includes movements like raising eyebrows, shifting eyes, smiling, frowning, etc.
There are some ok product versions of this already, e.g. what CallAnnie or D-ID offers. Their algorithms produce somewhat realistic mouth and jaw movements (likely through generative viseme manipulations), as well as pseudo-randomly mimicking facial features like eye blinks and head movements. The result is a little uncanny, especially when the conversation requires more empathy. It also fails for new characters that aren’t on clear backgrounds. Finally, it becomes particularly noticeable with characters that have hair or beards, where the fine-grained facial movements don’t render those features well.
While this ability does not appear to be necessary for a lot of experiences, given Replika’s success in engendering empathic results, there is no question that better facial movements will improve the experience tremendously, as human contact can attest.
So how do we do this properly? The solution depends on whether we require that this work with arbitrary faces or a single given one, and on whether we are using cartoon or human faces.
  1. Cartoon with one face. Cartoon faces are easier to manipulate. Here, we can take a viseme approach where we build a library of between 10 and 30 visemes for the singular character, a detector that conditions on the generated audio to recognize the phoneme sequence, and then output the associated viseme as we traverse the sequence (a toy sketch of this mapping follows the list). This is done to great effect in Adobe Character Animator and works because cartoons do not have the level of expressivity of human faces with all their lines and interacting parts. We can make this quite expressive, but as mentioned above, we’d lose some aspects of the real experience like how hair folds over the face as we move from viseme to viseme. This would have to either be ignored (as is done in most cartoons) or built in with careful engineering.
  1. Cartoon with arbitrary face. Compared to the above, the major difference is that we are conditioning on an arbitrary front-facing view input by users. TODO
  1. Human with one face. What if we want human faces instead of cartoon faces? For each increase in realism using the above approaches, we will lose quality because the viseme approach is discrete by nature and will find it challenging to express the micro-expressions that high-quality realism demands. We see the same thing in video games using techniques like JALI. This can be done though, as seen in products like what Synthesia offers. However, it becomes uncanny when we seek more expressive offerings. Look anywhere except the mouth and jaw area for too long, and the characters take on a dead look.
  1. Human with arbitrary face. This is quite difficult, and the best version today is D-ID’s offering, which requires a white background, isn’t real-time, and has the same issues as the one-face case.
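Here is the toy version of that phoneme-to-viseme lookup for the single-character cartoon case; the viseme names and table are illustrative, and a production system would drive timing from a forced aligner.

```python
# Toy phoneme -> viseme lookup for the single-character cartoon case. A real
# system would use a fuller viseme set; this table is illustrative only.
PHONEME_TO_VISEME = {
    "AA": "open",   "AE": "open",   "AH": "open",
    "B": "closed",  "M": "closed",  "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "OW": "rounded", "UW": "rounded",
    "S": "narrow",  "Z": "narrow",  "T": "narrow", "D": "narrow",
}
REST = "rest"

def visemes_for(phoneme_sequence):
    """Map a timed phoneme sequence [(phoneme, start, end), ...] from the
    TTS/aligner into the viseme frames the animator should show."""
    return [(PHONEME_TO_VISEME.get(ph, REST), start, end)
            for ph, start, end in phoneme_sequence]

print(visemes_for([("HH", 0.00, 0.05), ("AH", 0.05, 0.18), ("B", 0.18, 0.25)]))
```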

Experiment

Another idea, though, is to do this all in embedding space. I expect that this will win out at the end of the day as it’s both a) the most general and b) scales with neural nets, which is a great bet. Here’s an experiment:
  1. Build a dataset of videos of people talking. We can do this by:
    1. Crawling through video sources like youtube and twitch, focusing on interviews, telecasts, and streams.
    2. Filtering using an off-the-shelf face detector like YoloV5 to keep only tubelets with front-facing people talking. We can detect the front-facing part by counting keypoints. We want front-facing in at least one part of the tubelet because we want to present front-facing faces as the final experience and it’s hard to grok most of the features of the face without that in some shot.
    3. Further filter away tubelets where it is too difficult to grok what the face is saying. We can try to do this with standard techniques in audio processing, like estimating the signal-to-noise ratio of the recording. Another option is to use an ASR model (like Whisper) to get the transcript, then run a language model over it to attest whether it’s sensible. A lot of the time it won’t be sensible when there is noise because the noise confuses the model. Some dirty data here isn’t the end of the world though.
  1. At this point, we have tubelets of people talking and likely moving their heads, with at least one frame in the shot having a front-facing face. We also have the matching audio. Note that we don’t really care about the transcript past using it to filter.
  1. We now have all the data to condition on a single face and a sequence of audio, then output a sequence of faces that would align convincingly with that audio.
  1. How to do that with the dataset we built above (a modeling sketch follows the list):
    1. Gather a short video clip sequence.
    2. Choose a random front-facing shot image I in that sequence of speaker S and crop I as tightly as possible → C_i. We are going to condition on that C_i. Note that we could allow for non-front-facing shots as well, which would be great for letting people use weird angles as the starting point. I suspect that will be much harder from a modeling perspective, and we should only do that after this approach works.
    3. Then take T = min(5 seconds, duration of that clip) and choose a random starting point k \in [0, duration - T] in the video.
    4. Grab the audio A and associated cropped talking faces from [k, j=k+T]. We are going to condition on this audio to predict a function of these faces.
    5. What we want to learn is a function f(C_i, A_kj) → Faces_kj that we can run at inference time, where Faces_kj are high-quality representations of the sequence of faces that S traverses while talking from k → j.
    6. During training, we first run C_i through an invertible (frozen) image model M to get an embedding E_{C_i}; this can be done with something like CLIP or similar.
    7. Then we do the same with the audio to get an embedding E_{A_kj} from a frozen model like SoundStorm, which generates 30 seconds of audio in 0.5 seconds.
    8. Train a model f to predict f(E_{C_i}, E_{A_kj}) → E’_{Faces_kj}, a sequence of embeddings of the Faces_kj, and optimize that against the actual embeddings M(Faces_kj). For loss, we could use L2 or contrastive loss. Both are worth trying, although I have a bias towards contrastive loss.
    9. At inference time, invert the f(…) embeddings to get our images. This may be something that we have to build, and that’s very doable given that we have a pretty clear dataset here of inputs and outputs.
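Here is a minimal modeling sketch of f and the two candidate losses from steps 5 through 8. The transformer backbone, dimensions, and the InfoNCE-style contrastive formulation are illustrative choices, not the only way to realize this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FacePredictor(nn.Module):
    """Predicts f(E_{C_i}, E_{A_kj}) -> E'_{Faces_kj}: a sequence of face embeddings
    conditioned on one cropped-face embedding and the audio embeddings. A small
    transformer encoder stands in for f; dimensions are illustrative."""

    def __init__(self, d_face=512, d_audio=512, d_model=512, n_layers=4):
        super().__init__()
        self.face_proj = nn.Linear(d_face, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, d_face)

    def forward(self, face_emb, audio_embs):
        # face_emb: (B, d_face); audio_embs: (B, T, d_audio)
        cond = self.face_proj(face_emb).unsqueeze(1)   # broadcast the identity over time
        x = self.audio_proj(audio_embs) + cond
        return self.out(self.encoder(x))               # (B, T, d_face)

def training_loss(pred, target, use_contrastive=True, temperature=0.07):
    """Either plain L2 or an InfoNCE-style contrastive loss over the batch."""
    if not use_contrastive:
        return F.mse_loss(pred, target)
    p = F.normalize(pred.flatten(0, 1), dim=-1)        # predicted face embeddings
    t = F.normalize(target.flatten(0, 1), dim=-1)      # M(Faces_kj) embeddings
    logits = p @ t.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

The inversion in the last step is a separate decoder trained from the (embedding, frame) pairs this dataset already gives us.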