There’s been a lot of Linux discussion here lately, so I thought it was about time I got back to talking about games. Something I’ve been thinking about lately is the way two separate components of a game, graphics and audio, have evolved separately. In particular I’m going to focus on speech and voice acting.
Look at it this way: which is more realistic, the current 3D models in games, or a character reading a piece of dialogue? Even if the acting isn’t great, you’d probably say the latter. The voice sounds human, for the obvious reason that it was read by a human. I want to consider whether achieving this realism by using pre-recorded phrases is really helping games, or whether it is preventing audio from developing further.
The Evolution of Graphics
Game graphics have followed a fairly clear path to reach their current state: from simple monochrome shapes on the screen to 8-bit sprites, from a few polygons to the 3D models we see today. Each hardware improvement brings more realistic models, better textures, and more human animations.
Of course, that probably seems obvious. We can’t simply film someone and import their precise shape and actions into a game. At best we can base a model on them, but at the end of the day you still need to make the model up from a collection of polygons. The reason I’m pointing it out is that audio doesn’t work this way.
The Evolution of Audio
Audio had a pretty similar beginning to graphics. Initially, all that was possible were the two beeps heard in Pong. Again, as hardware improved, more could be achieved. MIDI soundtracks could be played. Basic synthesizers could be used to get a sound that almost resembled a word or grunt. Then storage space and audio hardware improved to the extent that pre-recorded audio tracks could be played back – first on PC and then later on consoles. Obviously, this was a huge leap in the realism of dialogue and speech. Now a real human voice could be played back in cutscenes and throughout the game, instead of relying on text or mimicking a few words.
Audio has of course continued to improve. More can be stored, with games like The Elder Scrolls IV: Oblivion featuring hours of dialogue. The quality of recordings (although not necessarily of voice acting) has also continued to develop.
The Problem
The issue with this method of delivering speech lies in its inflexibility. You have a limit to how many phrases you can store, and they can only ever be played back as recorded. In fact, this parallels graphics: animations are stored and played back as part of the game.
However, the parallel only goes so far. If, hypothetically, in the future we were able to create a program which would allow us to procedurally animate characters realistically, our current system of defining bones and vertices would support it. (This doesn’t seem an unlikely situation: there are already programs which can look at a creature’s muscle and bone structure and find the most efficient way for it to move. The calculations just take hours rather than the fraction of a second that games require.)
With audio though, we can’t create a brand new phrase out of those we already have stored. To even start to compose a system like that we’d have to store thousands of words which could be arranged into sentences, without even thinking about expressing different emotions or tones.
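To make the word-bank idea a bit more concrete, here is a rough Python sketch of what assembling a sentence from stored words might look like. Everything in it is made up for illustration: the `WORD_CLIPS` table, the `assemble_sentence` and `possible_sentences` helpers, and the tiny lists of numbers standing in for real recordings.

```python
# Hypothetical word bank: each word maps to one pre-recorded clip.
# Short lists of numbers stand in for real audio samples.
WORD_CLIPS = {
    "the": [0.1, 0.2],
    "door": [0.3, 0.1, 0.0],
    "is": [0.2],
    "locked": [0.4, 0.3, 0.1],
    "open": [0.5, 0.2],
}


def assemble_sentence(text, clips=WORD_CLIPS, gap=(0.0,)):
    """Join stored word clips in order, with a short silence between words."""
    samples = []
    for i, word in enumerate(text.lower().split()):
        if word not in clips:
            # A word that was never recorded cannot be spoken at all.
            raise KeyError(f"no recording for {word!r}")
        if i > 0:
            samples.extend(gap)
        samples.extend(clips[word])
    return samples


def possible_sentences(vocabulary_size, length):
    """Count every word sequence of a given length (most will be nonsense)."""
    return vocabulary_size ** length


if __name__ == "__main__":
    print(len(assemble_sentence("The door is locked")))
    print(possible_sentences(len(WORD_CLIPS), 4))
```

Even in this toy version the limits show up quickly: any word outside the bank raises an error, and each word only exists in the single tone it was recorded in, so emotion would need a separate clip per word per mood.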
The depth of this issue is shown when you think about your actions throughout an average day. You probably perform a lot of the same motions over and over: sitting down, walking, typing, eating. As for talking though, almost every sentence you say will be unique.
The Solution
Eventually, I believe the gaming industry will need to look to text-to-speech applications as a solution. Whilst we may suffer a temporary setback in terms of realism, this method will offer several benefits in the long run:
1. It will only be necessary to save a range of sounds, instead of a series of phrases, no matter how much speech is required from a character.
2. Localisation becomes easier. For a new language you simply need a new set of sounds, and to translate a set of text phrases, rather than re-recording all of the dialogue.
3. It could become possible for low-budget games to purchase sound boards with all the audio required for the text-to-speech programs, rather than having to use low-quality voice acting or text.
Initially, I’d imagine it would only be possible to play back phrases written in text (perhaps using phonetics to improve the accuracy of what is read back). However, this system has a lot of space for evolution: as both hardware capabilities and AI improve, it could be possible to generate more and more sentences procedurally, giving more meaningful and realistic interactions between characters.
The issue is… to develop a great system we need to start with one which is less convincing than what we have now.
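As a very rough sketch of the "range of sounds" idea, the Python below spells each word phonetically and then stitches together one stored clip per sound. It is only a toy under my own assumptions: the `PRONUNCIATIONS` table, the `SOUND_SET` clips, the sound labels, and the `to_units` and `synthesize` helpers are all hypothetical, and real text-to-speech involves far more than joining clips end to end.

```python
# Hypothetical phonetic spellings, written by hand for this example.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical sound set: one short placeholder clip per sound.
# A different voice or language would ship its own table like this one.
SOUND_SET = {
    unit: [round(0.1 * (i + 1), 1)] * 2
    for i, unit in enumerate(["HH", "AH", "L", "OW", "W", "ER", "D"])
}


def to_units(text, pronunciations=PRONUNCIATIONS):
    """Turn written text into a flat list of sound labels."""
    units = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word not in pronunciations:
            raise KeyError(f"no phonetic spelling for {word!r}")
        units.extend(pronunciations[word])
    return units


def synthesize(text, pronunciations=PRONUNCIATIONS, sound_set=SOUND_SET):
    """Concatenate the stored clip for each sound in the text."""
    samples = []
    for unit in to_units(text, pronunciations):
        samples.extend(sound_set[unit])
    return samples


if __name__ == "__main__":
    print(to_units("Hello, world!"))
    print(len(synthesize("Hello, world!")))
```

The point of splitting it this way is that the stored audio stays the same size however much dialogue is written: new lines only add text, and a new language would mean swapping in a different sound set and pronunciation table.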
Over to you…
What are your thoughts on dialogue in games?
Do you think a change is necessary?


