Speech-to-Speech Model Comparison

Welcome to the Speech-to-Speech (S2S) Model Evaluation!

In this evaluation, you will assess the performance of four S2S models: ChatGPT-4o, FunAudioLLM, SpeechGPT, and Mini-Omni. The goal is to evaluate how well these models handle various speech tasks across different domains.

Once you select a specific domain and task (e.g., Educational Tutoring and Rhythm Control), you will proceed to the evaluation stage. In each round, you will be presented with an audio input. For example:

Audio Sample:

The corresponding text is: "Say the following sentence at my speed first, then say it again very slowly: 'Artificial intelligence is changing the world in many ways.'" (Note: the audio plays at 1.5x the normal speed.)

The responses of the S2S models will be provided, and your task is to choose the response that best follows the instructions. The example below shows all four models; during the actual evaluation, each round presents responses from only the two models whose comparison is most informative.

ChatGPT-4o:

Performance: Speech: Partially followed the instruction on speed. Semantics: Accurately followed the instruction, with no semantic deviation or missing information.

FunAudioLLM:

Performance: Speech: Partially followed the instruction on speed. Semantics: Accurately followed the instruction, with no semantic deviation or missing information.

SpeechGPT:

Performance: Speech: Did not follow the instruction on speed. Semantics: Partially followed the instruction, with minor semantic deviation and missing information.

Mini-Omni:

Performance: Speech: Did not follow the instruction on speed. Semantics: Did not follow the instruction, with significant semantic deviation and missing information.

After making your choice, you'll proceed to the next round.

Please enter your username and start the evaluation!

Task description:
Audio:
Audio text:

Question: Which of the following two models gives the better response?

Model A:
Model B:

Your choice: