Voice synthesis on ISR

By ARTRAG

Enlighted (6976)

17-05-2016, 20:14

With WYZ I've been fiddling with low-quality voice synthesis for a while.
The idea is to extract the strongest frequencies in each voice segment and try to match them with the channels available on the PSG or SCC.

Attached is a "double standard" ROM.

If an SCC is detected in slot 2, it will play using 5 channels; without one, it will use the PSG with 3 channels.
All using the very same 4KB of data ;-)

Plug in this ROM and type from BASIC
? USR(n)
with n in 0-4 to play the voices
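
The analysis side described above (pick the strongest frequencies in each ISR frame and map them to tone channels) can be sketched offline roughly as follows. This is not ARTRAG's actual encoder; Python/numpy, the sample rate and all the constants here are assumptions for illustration:

    import numpy as np

    FRAME_RATE = 50           # one update per VDP interrupt (50/60 Hz)
    SAMPLE_RATE = 11025       # assumed input sample rate
    PSG_CLOCK = 1789772       # MSX PSG master clock, ~1.79 MHz

    def encode(samples, channels=3):
        # Slice the input into ISR-length frames and keep, per frame,
        # the `channels` strongest FFT bins (3 for PSG, 5 for SCC).
        frame_len = SAMPLE_RATE // FRAME_RATE
        freqs = np.fft.rfftfreq(frame_len, 1.0 / SAMPLE_RATE)
        frames = []
        for start in range(0, len(samples) - frame_len, frame_len):
            chunk = samples[start:start + frame_len] * np.hanning(frame_len)
            spectrum = np.abs(np.fft.rfft(chunk))
            top = np.argsort(spectrum[1:])[-channels:] + 1   # skip the DC bin
            # AY tone period register: f = clock / (16 * period)
            periods = [int(round(PSG_CLOCK / (16 * freqs[i]))) for i in top]
            volumes = [spectrum[i] for i in top]  # quantize to 4-bit levels later
            frames.append((periods, volumes))
        return frames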

By Manuel

Ascended (19677)

17-05-2016, 21:39

Wow, fantastic! Even on PSG I could really hear it quite clearly, especially number 0 :)

By yzi

Champion (444)

17-05-2016, 21:54

Tried it in OpenMSX, but I couldn't make any sense of any of the words. Maybe you should add subtitles? ;)

Do you try to do "error diffusion", switching between frequencies rapidly to give more than 3 or 5 frequencies at least some attention? Like, dithering sort of thing.

By ARTRAG

Enlighted (6976)

17-05-2016, 22:24

The player is tied to the ISR resolution (50/60 Hz), so short of using line splits, there is no option for temporal dithering.

By yzi

Champion (444)

17-05-2016, 22:35

Well, temporal "dithering" works to some extent with music. If you arpeggiate a single voice rapidly, it will work as a chord, sort of, at least to some extent. Why not with speech? I don't know, I never tried it.
How about noise? Adding noise to fill the gaps can improve the intelligibility of speech. I mean, if you have short blanked-out segments in your speech audio, making it hard to understand, it can be made more intelligible by adding noise to the silent segments. The brain somehow interpolates and makes up stuff if there's noise, but not if there's total silence.
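
What yzi describes is the phonemic restoration effect. A quick offline experiment (Python/numpy assumed; the threshold and noise level are arbitrary) would be to overwrite near-silent stretches with low-level noise and compare intelligibility:

    import numpy as np

    def fill_gaps_with_noise(samples, rate, threshold=0.02, noise_level=0.05):
        # Replace near-silent stretches with low-level white noise; the ear
        # "repairs" speech interrupted by noise better than speech
        # interrupted by silence.
        out = samples.copy()
        win = max(1, rate // 100)                 # ~10 ms energy window
        env = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")
        quiet = env < threshold
        out[quiet] = np.random.uniform(-noise_level, noise_level, int(quiet.sum()))
        return out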

By yzi

Champion (444)

17-05-2016, 22:39

And I meant just normal 50/60 Hz interrupt speed, nothing faster.

By ARTRAG

Enlighted (6976)

17-05-2016, 22:42

It is already changing frequencies at 50/60 Hz, selecting those that best fit the current chunk of signal to be approximated.

By yzi

Champion (444)

17-05-2016, 23:02

I know it is changing frequencies. But I meant, for example: if the "target" signal has four constant frequencies, 1, 2, 3 and 4, with respective amplitudes 10, 8, 5 and 4, then instead of playing only frequencies 1, 2 and 3 and leaving frequency 4 out entirely, you could dither between frequencies 3 and 4 with the third oscillator, so that both of them get at least some "air time".

Any frequencies that exist in the original (target) signal but that you do _not_ play during a frame should be added to your "error", and you try to diffuse that error over time, just like in error-diffusion dithering with graphics. Of course you should use some kind of temporal windowing, maybe something like 3-4 frames, and only try to make up for at most the 10 loudest frequencies. Or something. And since you have not just 1 but 3 or 5 oscillators to "paint" with at any given time, you'd try to minimize the error using all of them together. For each frame, the "encoder" should try to find the combination of 3 (or 5) frequencies that minimizes the cumulative error. And use coefficients when calculating the error diffusion, Floyd-Steinberg style.
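
As a sketch of that bookkeeping (this would run in the offline encoder, not on the Z80; the names and the decay constant are made up, with decay standing in for the 3-4 frame window):

    import numpy as np

    def choose_with_debt(target_amps, n_osc=3, decay=0.5):
        # target_amps: frames x candidate-frequency bins, desired amplitudes.
        # Returns, per frame, the bins actually granted an oscillator.
        debt = np.zeros(target_amps.shape[1])
        picks = []
        for amps in target_amps:
            score = amps + debt              # what the frame wants now + what we owe
            chosen = np.argsort(score)[-n_osc:]
            picks.append(chosen)
            debt = decay * (debt + amps)     # unplayed energy carries over, fading out
            debt[chosen] = 0.0               # played frequencies are paid off
        return picks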

Another way to look at the error is "frequency debt". You have very limited resources you try to balance. Or a frequency hospital. You have patients who are badly wounded, and try to let as few of them die as possible.

Just ideas that came to mind in a minute.

By ARTRAG

Enlighted (6976)

17-05-2016, 23:12

Yzi, you have to play the strongest frequencies at each frame; that is the best solution.
There is no time for dithering if the frame rate is 60 Hz.
Dithering would make sense only if the frame rate were higher than the frequencies you want to simulate.
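
The two positions are easy to compare by ear. Alternating one oscillator between two tones every frame at 60 Hz gates each tone on and off at 30 Hz, which smears each carrier into sidebands at roughly f ± 30 Hz, f ± 90 Hz and so on; whether that reads as "both frequencies" or just as roughness is exactly the disagreement. A throwaway render script (Python assumed, test tones arbitrary):

    import numpy as np
    import wave

    RATE = 22050
    FRAME = RATE // 60                      # samples per 60 Hz ISR frame

    def dithered_pair(f_a, f_b, seconds=2.0):
        # One oscillator switching between two tones on every frame,
        # as if the tone register were rewritten in each interrupt.
        n = int(RATE * seconds)
        t = np.arange(n) / RATE
        use_a = (np.arange(n) // FRAME) % 2 == 0
        return np.where(use_a, np.sin(2 * np.pi * f_a * t),
                               np.sin(2 * np.pi * f_b * t))

    sig = 0.8 * dithered_pair(440.0, 660.0)
    with wave.open("dither.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes((sig * 32767).astype(np.int16).tobytes())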

By yzi

Champion (444)

17-05-2016, 23:37

Like I said, I don't know what it would sound like, since I haven't tried it, but vsync-rate dithering works with chord arpeggios in music. So why not with speech? Noise can make up for missing signal segments, at least in some cases. And if you play it back through loudspeakers in a room (rather than with headphones), the pitch segments will keep sounding (echoing) for a short while because of room reflections, speaker resonance and so on, overlapping with each other even when using only one voice. You could have many more than three or five "virtual oscillators"?

By ARTRAG

Enlighted (6976)

17-05-2016, 23:45

Try and see for yourself. Nyquist said something about that.
