o If the robot showed its own stress or mirrored the stress of the participant, the
team (robot + human) was more successful in its task.
▪ So, the team performed better when the stress was shared.
Vall-E
- Remember DALL-E (Dalí)?
- Vall-E is a model, made by Microsoft (OpenAI's partner), that can speak in any voice
(including its emotion) if given 3 s of example speech.
- Demo: https://valle-demo.github.io/
o Look at this for the exam to get an idea of Vall-E
- Ground truth
o A human speaking the target text (this is what the output should sound like in the demo)
- Baseline
o Simple text-to-speech
- Prompt
o 3 s example of the speaker
WaveNet is from Google and Vall-E is from Microsoft (OpenAI's partner)
Is this enough?
- Do actors simply copy a whole emotion? To be or not to be + doubt/despair?
- Are focus/tail (etc.), i.e., common ground, predictable?
- Can emotion be generated/set?
- What about body language?
- What about other non-verbal speech cues?
Lecture 6: Chunking
Previously…
Information structure
- Information structure manifests itself in 2 ways:
o Important information is distinguished from unimportant information
(accents, prominence, emphasis, ...) [lecture 3]
o Sentences “that belong together” constitute discourse units
(chunking, phrasing, boundary marking, ...) [today]
- Today's focus is on how nonverbal features and variations in voice can mark the
boundaries of discourse units
Programme
- Boundary marking
- Turn-taking
Boundary marking
- It is a general finding that speakers mark the end of information units (for example, a
sentence, a phrase, or a turn).
- Compare to lay-out of texts (visualizes the structure of a text):
o Punctuation (full stop, comma)
o Indentation, line breaks, page breaks
o Capitalized words at the beginning of a sentence
➔ Without cues like these, a text is more difficult to read, because the visual
cues help us to read it more easily.
- These visual cues facilitate both the reading process and the writing process. How about
speech?
We need a volunteer: describe from left to right (just say the names of the
animals)
➔ Leeuw, ezel, reiger, kikker (lion, donkey, heron, frog)
➔ Leeuw, kikker, reiger, ezel (lion, frog, heron, donkey)
The teacher didn’t instruct the volunteer how to say it, but you automatically pronounce
the list in a certain way.
➔ When you say ‘Leeuw’, listeners can hear that you are not yet done naming all the animals.
o This is explained below.
Difference between local and global cues
- Local cues (boundary tones)
o These cues go up and down: the pitch goes up at the end of each non-final word, and it
goes down on the last word.
▪ So, on the first three words, you can hear that the speaker is not done
yet. On the last word, the boundary tone goes down, so you can hear
that the speaker is done.
▪ At the word “a bird”, you can hear that the speaker is almost done
with speaking.
- Global cues (overall pattern)
o A global cue works like a predictor
o It is stretched over the whole utterance that is being
produced
o The pitch keeps going down over the course of the utterance
Difference between local and global cues
- Local cues are encoded at the very edge of a speech unit.
o So at the boundary of a unit.
- Global cues are stretched over a whole unit.
o Because of this, global cues allow listeners to predict an upcoming boundary.
- Compare with turn-taking (predictive capacity of prosody): Turn-switches often
proceed very smoothly (without much overlap, without much delay)
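The difference between the two kinds of cues can be sketched in a few lines of Python. This is only a toy illustration: the pitch values, the function names `fit_line` and `boundary_tone`, and the rule "rise = more to come, fall = done" are assumptions made for the example, not measurements or methods from the lecture. The global cue is estimated as the slope of a straight line fitted through the pitch peaks of the words; the local cue is read off the pitch movement at the edge of each word.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def boundary_tone(movement_hz):
    """Local cue (assumed rule): a rise signals 'more to come', a fall signals 'done'."""
    return "continuation" if movement_hz > 0 else "final"


# Hypothetical values for a four-word list such as "lion, donkey, heron, frog".
peak_hz = [220.0, 205.0, 190.0, 175.0]          # pitch peak of each word
edge_movement_hz = [20.0, 18.0, 15.0, -30.0]    # pitch change at the end of each word

# Global cue: a negative slope means the pitch drifts down over the utterance.
slope, intercept = fit_line(list(range(len(peak_hz))), peak_hz)

# Local cue: one label per word, based only on its own edge.
tones = [boundary_tone(m) for m in edge_movement_hz]

print(f"global trend: {slope:.1f} Hz per word")
print("local tones:", tones)
```

The slope is available from the first few words onward, which is why a global cue can help a listener anticipate the end, whereas each local label only becomes available at the edge of its own word.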
Auditory cues for boundary marking
- Intonation
o Boundary tones: your pitch goes up or down at the edge of a unit.
o Declination: the global downward trend in pitch, stretched over the whole unit
- Pitch reset
o So, when you say “lion, horse, bird, frog” you reset your pitch after each
word.
- Durational lengthening (final word)
o For example:
▪ A dog, a horse, a lion, and a frooog
▪ The last word is lengthened because it is the final word.
- Pauses
o Silent or filled pauses.
- Voice quality (creaky voice)
o Your voice becomes a bit noisier at the end of a unit.
o This is common among Scandinavian speakers (for example, Norwegians and
Swedes).
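As a rough illustration of how two of these cues (final lengthening and pauses) could be spotted in data, here is a minimal Python sketch. All numbers are invented, and the function name `boundary_cues` and both thresholds (a word counts as lengthened when it is 1.3 times the median duration, a silence counts as a pause from 0.2 s) are assumptions chosen for the example, not values from the lecture.

```python
from statistics import median


def boundary_cues(durations_s, pauses_s, lengthening_ratio=1.3, min_pause_s=0.2):
    """Return, per word, which boundary cues it shows (thresholds are assumptions)."""
    baseline = median(durations_s)  # typical word duration in this utterance
    cues = []
    for duration, pause in zip(durations_s, pauses_s):
        found = []
        if duration > lengthening_ratio * baseline:
            found.append("lengthening")
        if pause >= min_pause_s:
            found.append("pause")
        cues.append(found)
    return cues


# Hypothetical measurements for "lion, horse, bird, frooog".
durations_s = [0.40, 0.42, 0.41, 0.65]   # duration of each word
pauses_s = [0.05, 0.05, 0.06, 0.50]      # silence after each word

cues = boundary_cues(durations_s, pauses_s)
for word, found in zip(["lion", "horse", "bird", "frog"], cues):
    print(word, found)
```

With these made-up values only the last word shows both cues, which matches the idea that cues tend to pile up at the end of a unit.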
Prosodic chunking can disambiguate
- Mathematical formula:
o 2 + (3 x 5) vs. (2 + 3) x 5
▪ When you speak, your voice signals how the formula is structured. So, you can
hear that (3 x 5) is a unit in the first example and that (2 + 3) is a unit in the second.
- Differently phrased utterances
o “The man said: the girl is ill” vs. “The man, said the girl, is ill”
▪ Here, you can also hear the units.
• In the first example: “The man said” is a unit
• In the second example: “The man” is a unit
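The formula example can be worked out to show that the position of the boundary really changes the meaning. In this small Python sketch the function name `evaluate_chunked` and the convention of numbering the boundary by the count of tokens before it are assumptions made for the illustration.

```python
import operator

OPS = {"+": operator.add, "x": operator.mul}


def evaluate_chunked(tokens, boundary):
    """Evaluate 'a op1 b op2 c', grouping by the position of a prosodic boundary.

    boundary=2: break after "a op1"   -> a op1 (b op2 c)
    boundary=3: break after "a op1 b" -> (a op1 b) op2 c
    """
    a, op1, b, op2, c = tokens
    if boundary == 2:
        return OPS[op1](a, OPS[op2](b, c))
    if boundary == 3:
        return OPS[op2](OPS[op1](a, b), c)
    raise ValueError("boundary must be 2 or 3")


tokens = [2, "+", 3, "x", 5]

late_unit = evaluate_chunked(tokens, boundary=2)    # "two plus | three times five"
early_unit = evaluate_chunked(tokens, boundary=3)   # "two plus three | times five"

print(late_unit)   # 2 + (3 x 5) = 17
print(early_unit)  # (2 + 3) x 5 = 25
```

The same five spoken words give 17 or 25 depending only on where the boundary falls.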
Developmental aspects of prosodic chunking
- Taken from PEPS-C
o PEPS-C is a program with materials to teach children how to produce
or interpret prosody, such as chunking.
▪ Children with certain disorders, like autism, can have problems with
producing prosody. So, they can practice this by using the
program PEPS-C.
- Take a phrase like: “Chicken fingers and fries”
o This can be two items, “chicken fingers” and “fries”, but it can also
be three items: “chicken”, “fingers”, and “fries”.
▪ Children need to learn when to use which phrasing. So, they need to
understand the boundaries between the speech units.
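The two readings can be written down as the same word sequence with different boundary positions. The Python sketch below is only an illustration; the function name `chunk` and the way boundaries are given (as the indices of the words they follow) are assumptions made for the example.

```python
def chunk(words, boundaries_after):
    """Split words into chunks; a boundary follows each index in boundaries_after."""
    chunks, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i in boundaries_after:
            chunks.append(" ".join(current))
            current = []
    if current:  # whatever is left after the last boundary forms the final chunk
        chunks.append(" ".join(current))
    return chunks


words = ["chicken", "fingers", "and", "fries"]

two_items = chunk(words, {1})       # boundary only after "fingers"
three_items = chunk(words, {0, 1})  # boundaries after "chicken" and after "fingers"

print(two_items)
print(three_items)
```

The words are identical in both cases; only the set of boundaries differs, which is exactly what a child has to learn to produce and to hear.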