Will Audio Let Gaming Down?

September 14th, 2008


This post was published 3 years 8 months 4 days ago and as such probably does not reflect my current opinions, knowledge or ability.
Cantante by JulianRod


There’s been a lot of Linux discussion here lately, so I thought it was about time I got back to talking about games. Something I’ve been thinking about lately is the way two separate components of a game, graphics and audio, have evolved separately. In particular, I’m going to focus on speech and voice acting.
 

Look at it this way: which is more realistic, the current 3D models in games, or a character reading a piece of dialogue? Even if the acting isn’t great, you’d probably say the latter. The voice sounds human, for the obvious reason that it was read by a human. I want to consider whether achieving this realism by using pre-recorded phrases is really helping games, or whether it is preventing audio from developing further.

 

The Evolution of Graphics

Game & Watch : Donkey Kong JR. by Frankeib


Game graphics have followed a fairly clear path to reach their current state: from simple monochrome shapes on the screen to 8-bit sprites, and from a few polygons to the 3D models we see today. Each hardware improvement brings more realistic models, better textures, and more human animations.
 
Of course, that probably seems obvious. We can’t simply film someone and import their precise shape and actions into a game. At best we can base a model on them, but at the end of the day you still need to build the model from a collection of polygons. The reason I’m pointing it out is that audio doesn’t work this way.
 

The Evolution of Audio

Audio had a pretty similar beginning to graphics. Initially, all that was possible were the two beeps heard in Pong. Again, as hardware improved, more could be achieved. MIDI soundtracks could be played. Basic synthesizers could be used to get a sound that almost resembled a word or grunt. Then storage space and audio hardware improved to the point where pre-recorded audio tracks could be played back, first on PC and later on consoles. Obviously, this was a huge leap in the realism of dialogue and speech. Now a real human voice could be played back in cutscenes and throughout the game, instead of relying on text or a synthesizer that could mimic a few words.
 
Audio has of course continued to improve. More can be stored, with games like The Elder Scrolls IV: Oblivion featuring hours of dialogue. The quality of recordings (although not necessarily of voice acting) has also continued to develop.
 

The Problem

Mic on Boom Arm by RalphBijker


The issue with this method of delivering speech lies in its inflexibility. You have a limit to how many phrases you can store, and they can only ever be played back as recorded. In fact, graphics have a parallel: animations are also stored and played back as part of the game.
 
However, the parallel only goes so far. If, hypothetically, in the future we were able to create a program which would allow us to procedurally animate characters realistically, our current system of defining bones and vertices would support it. (This doesn’t seem an unlikely situation. There are already programs which can look at a creature’s muscle and bone structure and find the most efficient way for it to move; the calculations just take hours rather than the fraction of a second that games require.)
 
With audio, though, we can’t create a brand new phrase out of those we already have stored. To even start to compose a system like that, we’d have to store thousands of individual words which could be arranged into sentences, and that is before we even think about expressing different emotions or tones.
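To make that limitation concrete, here is a minimal sketch of the word-by-word approach. Everything in it is an assumption for illustration: the word bank, the file names and the `stitch_sentence` helper are made up, and a real game would have far more to worry about (timing, tone, emotion).

```python
# Hypothetical bank of pre-recorded word clips (file names are made up).
WORD_CLIPS = {
    "the": "the.wav",
    "guard": "guard.wav",
    "is": "is.wav",
    "asleep": "asleep.wav",
}


def stitch_sentence(sentence, clips=WORD_CLIPS):
    """Return the clips to play, in order, for a sentence.

    Raises KeyError if any word in the sentence was never recorded.
    """
    playlist = []
    for word in sentence.lower().split():
        word = word.strip(".,!?")  # ignore simple punctuation
        if word not in clips:
            raise KeyError(f"no recording for word: {word!r}")
        playlist.append(clips[word])
    return playlist
```

Calling `stitch_sentence("The guard is asleep.")` works because all four words are in the bank, but `stitch_sentence("The guard is awake")` fails on the last word. Every new word means another trip to the recording studio.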
 
The depth of this issue becomes clear when you think about your actions throughout an average day. You probably perform a lot of the same motions over and over: sitting down, walking, typing, eating. When it comes to talking, though, almost every sentence you say will be unique.
 

The Solution

Eventually, I believe the gaming industry will need to look to text-to-speech applications as a solution. Whilst we may suffer a temporary setback in terms of realism, this method will offer several benefits in the long run:
1. It will only be necessary to save a range of sounds, instead of a series of phrases, no matter how much speech is required from a character.
2. Localisation becomes easier. For a new language you simply need a new set of sounds, and to translate a set of text phrases, rather than re-recording all of the dialogue.
3. It could become possible for low-budget games to purchase sound boards with all the audio required for the text-to-speech programs, rather than having to use low-quality voice acting or text.
 
Initially, I’d imagine it would only be possible to play back phrases written in text (perhaps using phonetics to improve the accuracy of what is read back). However, this system has a lot of room for evolution: as both hardware capabilities and AI improve, it could be possible to generate more and more sentences procedurally, giving more meaningful and realistic interactions between characters.
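As a rough illustration of the idea, here is a sketch of what playing back text from a small set of sounds might look like. It is only a toy under stated assumptions: the tiny lexicon, the ARPAbet-style sound symbols, the file layout and the helper names (`make_voice`, `to_units`, `synthesize`) are all hypothetical. A real text-to-speech system would also have to smooth the joins between sounds and handle intonation, which this ignores completely.

```python
# Illustrative pronunciation lexicon: word -> sound units.
# The symbols are ARPAbet-style; the word list is made up for this example.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "car": ["K", "AA", "R"],
    "tack": ["T", "AE", "K"],
}


def make_voice(name, units):
    """Build a hypothetical sound bank: one short clip per sound unit."""
    return {unit: f"{name}/{unit}.wav" for unit in units}


def to_units(text, lexicon=LEXICON):
    """Convert written text into a flat list of sound units."""
    units = []
    for word in text.lower().split():
        units.extend(lexicon[word.strip(".,!?")])
    return units


def synthesize(text, voice):
    """Return the clips to play, in order, for the given text and voice."""
    return [voice[unit] for unit in to_units(text)]


# One voice needs only a handful of clips, however many words it says.
guard = make_voice("guard", ["K", "AE", "AA", "T", "R"])
```

With this, `synthesize("cat", guard)` and `synthesize("tack", guard)` reuse the same three clips in a different order, so a new word costs a new line of text rather than a new recording. Swapping in a different voice, or a different language, would mean a new sound bank and lexicon while the game code stays the same.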
 
The issue is… to develop a great system we need to start with one which is less convincing than what we have now.

Over to you…

What are your thoughts on dialogue in games?
Do you think a change is necessary?



Comments

  1. Cherez says:

    I personally value voice acting over graphics in recent games, as I find it easier to be immersed with mediocre graphics and voice than great graphics and just silence or music. Past that, I would say that the voice matters more than the words. For instance, anyone who has seen shows or movies subtitled from a language they don’t speak will find it more immersive with sound than muted, because voice still relays a lot about a character’s emotions.

    I don’t think developing text to speech is presently a strong responsibility of the game industry. There are many companies interested for other reasons that are fueling TTS development. Unless TTS development reaches a flatline, just hire voice actors and focus on improving things that aren’t being researched already.

    On the indie scene, there are definite advantages to making good TTS. I personally have some amount of acting skill and have several friends with the same, but I don’t think I would be satisfied with anything we could produce using affordable equipment and amateur voice editing skills. I would like to see garage developers make some brilliant TTS system, but I’m doubtful that anything very convincing could be made unless the game is aiming for robotic voices as part of the atmosphere.

  2. Hazel says:

    Hi,

    I agree that voice acting can be important – but it’s often done so wrong. I certainly don’t think current text to speech programs are up to scratch, because I agree that the phrases said not only need to pronounce syllables correctly but to convey various emotions convincingly. In future though, I think text-to-speech technologies will progress to the stage where they can initially be used for background characters – for dialogues such as the pointless and flat conversations between characters in Oblivion, and perhaps eventually even during emotional cut-scenes.
    I’m not saying this will come soon, but eventually I think it will carry a number of benefits.

    You’re right that it needn’t be the game industry which develops this technology – there are plenty of companies focussing on it already. However, that doesn’t mean it will be effortless for a game to integrate it; there is a big difference between the aims of a program reading websites for the blind and the needs of a game developer creating a voice for a character.

    I don’t think this system could be used any time soon, unless as you say, we are making a game full of robots, but eventually I think the technology will develop to an extent that it will be suitable for use in games.

  3. Mike says:

    If memory serves, there are 36 common sounds in everyday language, meaning that by combining them in different orders you can make any word in a particular voice, and around 50 for all words. However, these sounds don’t correspond to letters easily: the letter “A” in ‘car’ is different to that in ‘cat’. Surely working out the difference is less processor intensive than converting any speech pattern into any word? Your choice of voice is limited by your collection of 50 sounds, which I imagine isn’t all that much in terms of DVD capacity. Surely this is a halfway mark between a recorded vocabulary and generating true speech?
