I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble-minded person would mistake that for a human. Am I not hearing the same thing everyone else is hearing? It sounds straight up corrupted to me. Tested in different browsers, no difference.
I agree with the comment above. I have not logged into Hacker News in _years_ but did so today just to weigh in here. If people are saying that the audio sounds great, then there is definitely something going on with a subset of users where we are only hearing garbled words with a LOT of distortion. This does not sound like natural speech to me at all. It sounds more like a warped cassette tape. And I do not mean to slight your work at all. I am actually incredibly puzzled as to why my perception of this is so radically different from others'!
Also keep in mind the processing time. The ^ article above used an NVIDIA L4 with 24 GB of VRAM. Sopro claims a 7.5-second processing time on CPU for 30 seconds of audio!
If you want really good quality TTS, you should check out elevenlabs.io
Yes, if this selected piece is the best that was available to use as a showcase, it's immediately off-putting in its distortion and mangled pronunciation.