Coach Ana - AI-powered speech training
A sound-to-sound AI engine
During my time at Giving Tech Labs, a social impact tech incubator, I had the honor of working closely with Dr. Ying Li, a world-renowned data scientist recognized for her contributions to the practice of data mining. SuvoTek[1] is a privacy-first voice technology she designed that works without speech-to-text. It helps people communicate, especially by identifying the tone of their speech. Our initial research showed that SuvoTek stood out to teachers and parents. This user group is most interested in understanding their tone when communicating with kids, and cares most about privacy.
I was excited to build a quick prototype with one of our engineers. The goal was to experiment and test how well the technology could solve their real problems.
Speech tone visualization
The first question: how can we effectively visualize the emotion and tone of speech?
Plutchik's Wheel of Emotions
Surfacing specific emotions during a live training session would overload users' cognitive capacity. The idea was to synthesize the emotions into a granularity that is easy to grasp at a glance. Blue --> cold --> calm and yellow --> warm --> high energy is a widely understood color association across diverse user groups.
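To make the blue-to-yellow idea concrete, here is a minimal Python sketch of how a tone reading could be turned into a color. It assumes the engine exposes a single "energy" score normalized to [0, 1]; that score, the function name `energy_to_color`, and the RGB endpoint values are hypothetical placeholders, not SuvoTek's actual output or the app's real palette.

```python
from typing import Tuple

RGB = Tuple[int, int, int]

# Placeholder endpoint colors (assumptions, not the app's actual palette).
CALM_BLUE: RGB = (52, 120, 246)
ENERGETIC_YELLOW: RGB = (255, 200, 0)


def energy_to_color(energy: float) -> RGB:
    """Map a hypothetical energy score in [0, 1] to a blue-to-yellow color.

    0.0 renders as calm blue, 1.0 as high-energy yellow.
    """
    # Clamp so out-of-range scores still produce a valid color.
    t = min(max(energy, 0.0), 1.0)
    # Linear interpolation, channel by channel, between the two endpoints.
    return tuple(
        round(calm + (warm - calm) * t)
        for calm, warm in zip(CALM_BLUE, ENERGETIC_YELLOW)
    )
```

A straight linear blend keeps the scale continuous, so users read "more yellow" as "more energy" without having to learn discrete categories.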

Color code to indicate tone
One thing to be careful about: both joy and anger can signal high energy. In the visual language, I wanted to stay objective and avoid labeling any tone as positive or negative, so I chose a more neutral dimension: "energy".


Summary screen after a training session


The summary dashboard gives an overview of how emotion flows across the entire speech. It's designed to be glanceable.
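One way to keep a whole-session view glanceable is to average many fine-grained tone readings into a small, fixed number of segments. The Python sketch below illustrates that idea; the per-reading energy values, the default of 10 segments, and the function name `summarize_energy` are assumptions for illustration, not how the actual dashboard was built.

```python
from statistics import mean
from typing import List, Sequence


def summarize_energy(readings: Sequence[float], segments: int = 10) -> List[float]:
    """Average a session's energy readings into at most `segments` values.

    Readings are assumed to arrive in time order; each output value covers
    one roughly equal slice of the session.
    """
    if segments <= 0:
        raise ValueError("segments must be positive")
    if not readings:
        return []
    n = len(readings)
    summary = []
    for i in range(segments):
        # Integer slice bounds spread the readings as evenly as possible.
        start = i * n // segments
        end = (i + 1) * n // segments
        chunk = readings[start:end]
        if chunk:  # short sessions can produce empty slices; skip them
            summary.append(mean(chunk))
    return summary
```

Each averaged segment could then be drawn as one colored block, trading moment-to-moment detail for an at-a-glance shape of the whole speech.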
Real-time speed monitoring
Speed is another key indicator of a speaker's cognitive clarity, and different contexts call for different paces. A controlled, deliberate pace suggests that the speaker is in command of the material, and it also helps listeners process information. This is especially critical in high-stress situations, such as for crisis line operators.

Underlying logic for setting the target speed

The suggested target speed depends on three parameters: audience, intention, and setting. I translated these into an onboarding experience that gives users a set of options to start with.
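As a rough illustration of how three onboarding answers could combine into one suggested pace, here is a minimal Python sketch using an additive baseline-plus-adjustments rule. The case study does not specify units or values, so words per minute, the 150 wpm baseline, every adjustment, the option names, and the names `SessionProfile` and `target_speed_wpm` are all hypothetical placeholders.

```python
from dataclasses import dataclass

# All numbers and option names are placeholder assumptions for illustration;
# they are not the values used in Coach Ana.
BASE_WPM = 150
AUDIENCE_ADJUST = {"children": -20, "adults": 0}
INTENTION_ADJUST = {"comfort": -15, "inform": 0, "motivate": 10}
SETTING_ADJUST = {"phone_call": -10, "one_on_one": -5, "classroom": 0}


@dataclass(frozen=True)
class SessionProfile:
    """Answers collected during onboarding (hypothetical option names)."""
    audience: str
    intention: str
    setting: str


def target_speed_wpm(profile: SessionProfile) -> int:
    """Suggest a target pace by adjusting a baseline for each answer.

    Unknown answers fall back to no adjustment.
    """
    return (
        BASE_WPM
        + AUDIENCE_ADJUST.get(profile.audience, 0)
        + INTENTION_ADJUST.get(profile.intention, 0)
        + SETTING_ADJUST.get(profile.setting, 0)
    )


# Usage: a parent calming a child over the phone gets a slower target.
print(target_speed_wpm(SessionProfile("children", "comfort", "phone_call")))
```

An additive rule keeps each onboarding choice independently explainable ("talking to kids slows your target"), which suits a flow meant to give users confidence in what the app suggests.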

The next challenge was finding a way to show real-time speed data on top of the speech tone.

Common correlations with speed

Speed information should be glanceable: I wanted to minimize the cognitive load users need to get feedback.

Speed wheel inspired by a car speedometer

I landed on a speedometer. It mimics the real-life experience of driving while checking the speed limit to stay at a target speed, a mental model familiar to users. The goal was to keep the interface clean and minimal, with the main actions, "Pause" and "Stop", easily accessible.
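The speedometer metaphor boils down to two small calculations: where the needle points, and whether the current pace is below, near, or above the target. The Python sketch below shows both under stated assumptions; the 240-degree dial sweep, the 300 wpm full-scale reading, the 10% tolerance band, and the names `needle_angle` and `speed_zone` are hypothetical, and how the live speed reading itself is measured is outside this sketch.

```python
def needle_angle(speed: float, max_speed: float = 300.0,
                 sweep_degrees: float = 240.0) -> float:
    """Map a speed reading to a needle angle, measured from the dial's left end.

    Readings below 0 or above max_speed pin the needle at either end.
    """
    fraction = min(max(speed / max_speed, 0.0), 1.0)
    return fraction * sweep_degrees


def speed_zone(speed: float, target: float, tolerance: float = 0.1) -> str:
    """Classify a reading as 'slow', 'on_target', or 'fast'.

    'on_target' means within +/- tolerance (as a fraction) of the target.
    """
    if speed < target * (1 - tolerance):
        return "slow"
    if speed > target * (1 + tolerance):
        return "fast"
    return "on_target"


# Usage: with a 150 wpm target, a 200 wpm reading is flagged as too fast.
print(needle_angle(200.0), speed_zone(200.0, target=150.0))
```

A tolerance band, rather than a single exact number, mirrors how drivers treat a speed limit and avoids the needle flickering between states on small fluctuations.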
Onboarding
I tested the flow with 9 users to check its clarity. 4 out of 9 were not fully sure what everything meant, even when they guessed correctly. To give users full confidence before they commit to the app, we rolled out a short onboarding.

Reflection
We saw positive signals about the idea of the app. However, showing speed and tone alone did not seem sufficient to win users' trust and make the app truly useful. More work is needed on the nuances of what makes a high-quality speech in different contexts. At the time, the voice technology wasn't mature enough to do so; I hope one day it will be. The mission of helping people with special needs improve their quality of life while protecting their privacy is something I stand behind, and I feel lucky to have had the chance to explore it.
Acknowledgments
Special thanks to Kyle Coburn and Dr. Ying Li for collaborating on this.
[1] If you're a nerd and want to read more about it, here's the research paper: https://dl.acm.org/doi/epdf/10.1145/3394486.3403326