Generate speech from text

What is Spik.AI?

Spik.AI allows you to generate realistic-sounding audio from text. We use a mix of machine learning algorithms to bring you the best voice generation technology.

Spik.AI is a free app produced by Oveit, a company focused on bringing cutting-edge technology to closed-loop payments.

As an unregistered user, you can generate audio files from text of up to 300 characters. Log in to generate longer audio files, from text of up to 1000 characters.

Coming soon: Voice transcript

Easily generate transcripts from audio recordings.

Just upload the audio file, press a button, and get a full transcript of the voices in it.

How to use SSML to generate great voice clips

It doesn't matter if you are developing a voice chatbot or if you are using a cool text-to-speech app like Speak.ai. It's crucial that the final result does not sound like just words thrown together. Voice and tone are more important than words. Or, to put it this way, the tone, pauses, and speech tempo will help your words make an impact.

And if we agree that not just what you say matters, but also how you say it, it's obvious why SSML (Speech Synthesis Markup Language) has become a thing. Here are four markup elements that will help you give a human touch to your computer-generated voice and better connect with the client, friend, partner, or web surfer who interacts with your work.

The power of a simple pause

We all know a great storyteller. A person who has the power to use words that simply lift us from the chair and put us in the middle of the action. A person who, right before the peak of the story, makes a pause that makes you want to shout "and then what happened?" Because you know that something important is about to happen.

Yes, used right, speech pauses have the power to let you know that something important is about to be mentioned. The technique is very common among great public speakers and is one of the most efficient ways of communicating the importance of what is going to be said next.

SSML allows us to use this technique in computer-generated speech through the <break> element, which has time and strength attributes.

Here’s an example:
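
A minimal sketch, assuming a speech engine that supports both attributes of <break>; the sentences and pause values are illustrative, not taken from a particular product. The time attribute sets an exact duration, while strength asks for a relative pause (none, x-weak, weak, medium, strong, or x-strong):

```xml
<speak>
  He opened the door, looked inside, and then
  <break time="2s"/>
  he saw it.
  <break strength="strong"/>
  Nobody said a word.
</speak>
```

Some engines also expect version and namespace attributes on <speak>, so check your engine's documentation if the snippet is rejected.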

See more: https://www.w3.org/TR/speech-synthesis/#S3.2.3

Make it tuneful

We can use technology to generate the voice, but the last thing we want is an impersonal result. A monotone voice will make audiences lose interest (or fall asleep) and will make no impact whatsoever. This is why we, as humans, use tone, pitch, and speed to add more meaning to our words.

For example, have you noticed how we use our voice to add a question mark? We raise the pitch toward the end of the sentence.

SSML has the <prosody> element, which allows you to change the pitch, rate, and volume of the speech. Use its attributes to adjust the speed of speech, the weight of critical words, and the tone of the voice. It adds emphasis.

Here’s an example:
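
A minimal sketch, assuming an engine that accepts the named values defined by the SSML specification (slow, low, high, loud, and so on); the sentences are illustrative. The first <prosody> block slows down and lowers an important statement, and the second raises the pitch on the last words of a question:

```xml
<speak>
  <prosody rate="slow" pitch="low" volume="loud">
    I believe in the right of the people to rule.
  </prosody>
  Do you
  <prosody pitch="high">believe it too?</prosody>
</speak>
```

Engines differ in how strongly they apply these values, so it is worth listening to the result and adjusting.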

See more: https://www.w3.org/TR/speech-synthesis11/#S3.2.4

P.S.: A simpler way is to use the <emphasis> element, with its 4 levels (strong, moderate, none, and reduced):

      
        <speak>
          <emphasis level="reduced">
            I believe in the right of the people to rule.
          </emphasis>
        </speak>
      
    
      
        <speak>
          <emphasis level="moderate">
            I believe in the right of the people to rule.
          </emphasis>
        </speak>
      
    
      
        <speak>
          <emphasis level="strong">
            I believe in the right of the people to rule.
          </emphasis>
        </speak>
      
    

See more: https://www.w3.org/TR/speech-synthesis/#S3.2.2

Say it as it sounds

If I had to choose one SSML element to take to a remote island, it would surely be <say-as>. Why? Because it has the interpret-as attribute (no, that's not cheating, the attribute is part of the element) that tells the voice generator how to interpret your input. So you can enter a number and tell the generator whether you want it spoken as a cardinal, an ordinal, or even a telephone number. It works for dates and times as well. Even for fractions. The exact set of supported values depends on the voice generator you use. I tell you, you will love the <say-as> element. And it's not difficult to use at all.

      
        <speak>
          <say-as interpret-as="date" format="dmy" detail="2">
            10-9-1901
          </say-as>
        </speak>
      
    
      
        <speak>
          <say-as interpret-as="fraction">3+1/2</say-as>
          <say-as interpret-as="fraction">15+1/3</say-as>
        </speak>
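
The number cases mentioned above can be sketched the same way. The values cardinal, ordinal, and telephone are assumptions here: the SSML specification leaves the list of interpret-as values to each speech engine, so check which ones yours supports. The numbers themselves are illustrative:

```xml
<speak>
  <say-as interpret-as="cardinal">12345</say-as>
  <say-as interpret-as="ordinal">3</say-as>
  <say-as interpret-as="telephone">555-0100</say-as>
</speak>
```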
      
    

See more: https://www.w3.org/TR/speech-synthesis/#S3.1.9

We hope this helps you recognize the power of SSML. We live in a world where machines are able to engage and talk to humans, but also in a world that has not yet lost its feelings. With the examples above, you can use a text-to-speech app or develop a chatbot and still keep the passion alive. Because in the end, this is what keeps us going.