Speech-to-text

Enable this model configuration to convert speech to text automatically with the highest accuracy.

Overview

Marsview Automatic Speech Recognition (ASR) technology accurately converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premises. It provides speaker separation, punctuation, casing, word-level time markers, and more.

Model Features

Features

| Feature | Description |
| --- | --- |
| Speech-to-text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts recognition of domain-specific terminology, proper nouns, and abbreviations from a simple list or taxonomy of words and phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file, so that each word in the transcript can be associated with its speaker. |
| Sentence-level Keywords | The most relevant topics, concepts, and discussion points from the conversation, generated based on the overall scope of the discussion. |

modelTypeConfiguration

| Keys | Value |
| --- | --- |
| modelType | speech_to_text |
| modelConfig | Model Configuration object for speech_to_text |

modelConfig Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| custom_vocabulary | A list of custom vocabulary terms. | [ ] |
| speaker_separation.num_speakers | The number of speakers in the conversation. Set it to -1 to determine the number of speakers automatically. | -1 |
| multi_channel | The object used to configure the model for multi-channel audio, if applicable. | {} |
| multi_channel.enable | Boolean to enable or disable multi-channel support. | False |
| multi_channel.channel_ids | A dictionary that maps the channel IDs of the input audio. | {} |
| enableTopics | Boolean to enable or disable sentence-level topics. | False |
| enableKeywords | Boolean to enable or disable keywords. | True |
| aggressiveness | Determines how aggressive the speaker separation should be. There is a trade-off between accurate speaker separation and preserving the context of the transcript for a given speaker. Because conversations in a meeting tend to have many interruptions and much cross-talk, splitting the transcript at every speaker transition results in poor context management and many breaks in the transcript. Values: 1 = better context management, 2 = balanced, 3 = accurate speaker separation. | 1 |
| topics.threshold | The threshold that determines whether the topics identified by the models are kept. It can take any value between 0 and 1. | 0.5 |
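As a rough illustration of how the parameters above fit together, the following Python sketch assembles a modelConfig dictionary from the documented defaults and checks the documented value ranges. The helper name `build_speech_to_text_config` is hypothetical and not part of the Marsview API, and the client-side validation is an assumption made for illustration; the service may apply its own rules.

```python
from typing import Any, Dict, List, Optional


def build_speech_to_text_config(
    custom_vocabulary: Optional[List[str]] = None,
    num_speakers: int = -1,
    enable_topics: bool = False,
    enable_keywords: bool = True,
    aggressiveness: int = 1,
    topics_threshold: float = 0.5,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a modelConfig dict using the documented defaults."""
    # Ranges below come from the parameter descriptions; checking them
    # on the client side is an assumption, not documented API behavior.
    if aggressiveness not in (1, 2, 3):
        raise ValueError("aggressiveness must be 1, 2, or 3")
    if not 0 <= topics_threshold <= 1:
        raise ValueError("topics.threshold must be between 0 and 1")
    if num_speakers != -1 and num_speakers < 1:
        raise ValueError("num_speakers must be -1 (automatic) or a positive integer")

    return {
        "custom_vocabulary": list(custom_vocabulary or []),
        # Key name taken from the parameter list above.
        "speaker_separation": {"num_speakers": num_speakers},
        "multi_channel": {"enable": False, "channel_ids": {}},
        "enableTopics": enable_topics,
        "enableKeywords": enable_keywords,
        "aggressiveness": aggressiveness,
        "topics": {"threshold": topics_threshold},
    }
```

Calling `build_speech_to_text_config()` with no arguments yields the defaults listed above; passing `aggressiveness=3` favors accurate speaker separation over context.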

Example Request

curl --location --request POST 'https://api.marsview.ai/cb/v1/conversation/compute' \
--header 'Content-Type: application/json' \
--header "Authorization:{{Insert Auth Token With Type}}" \
--data-raw '{
    "txnId": "{{Insert txn ID}}",
    "enableModels": [
        {
            "modelType": "speech_to_text",
            "modelConfig": {
                "automatic_punctuation": true,
                "custom_vocabulary": ["Marsview", "Communication"],
                "speaker_seperation": {
                    "num_speakers": 2
                },
                "multi_channel": {
                    "enable": true,
                    "channel_ids": {
                        "0": "0",
                        "1": "1"
                    }
                },
                "enableKeywords": true,
                "enableTopics": true,
                "aggressiveness": 3,
                "topics": {
                    "threshold": 0.5
                }
            }
        }
    ]
}'
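
For readers calling the endpoint from code rather than the command line, the request above could be prepared in Python roughly as follows. This is a minimal sketch using only the standard library; `build_compute_request` is a hypothetical helper name, the modelConfig is trimmed to a few of the fields shown above, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_URL = "https://api.marsview.ai/cb/v1/conversation/compute"


def build_compute_request(auth_token: str, txn_id: str) -> urllib.request.Request:
    """Hypothetical helper: prepare (but do not send) the POST request shown above."""
    payload = {
        "txnId": txn_id,
        "enableModels": [
            {
                "modelType": "speech_to_text",
                "modelConfig": {
                    "custom_vocabulary": ["Marsview", "Communication"],
                    "enableKeywords": True,
                    "enableTopics": True,
                    "aggressiveness": 3,
                    "topics": {"threshold": 0.5},
                },
            }
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The token string must include its type, as in the curl example.
            "Authorization": auth_token,
        },
        method="POST",
    )


# Placeholder values; substitute a real token and transaction ID.
request = build_compute_request("<auth token with type>", "<txn ID>")
# To send it: urllib.request.urlopen(request)
```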

Example Metadata Response

"data": {
    "transcript": [
        {
            "sentence": "Good evening teresa.",
            "channelId": 0,
            "startTime": 1390,
            "endTime": 2690,
            "speakers": [
                "1"
            ],
            "keywords": [
                {
                    "keyword": "good evening teresa",
                    "metadata": [],
                    "type": "DNN"
                }
            ],
            "keySentence": "Good evening teresa.",
            "topics": [
                    {
                        "tiers": [
                            {
                                "tierName": "Education",
                                "type": 1
                            }
                        ],
                        "name": "Secondary Education"
                    },
                    {
                        "tiers": [
                            {
                                "tierName": "Education",
                                "type": 1
                            }
                        ],
                        "name": "College Education"
                    }
                ],
            "suggestedIntents": [
                    "foolish power school board",
                    "stressful situation",
                    "determination",
                    "good good job"
                ]
        }
    ]
}

Response Object

| Field | Description |
| --- | --- |
| transcript | A list of sentences and their attributes identified by the speech-to-text model, split based on the start and end time of each sentence in the input video/audio. |
| sentence | A sentence identified by the model in the given time frame. |
| startTime | Start time of the sentence in the input video/audio, in milliseconds. |
| endTime | End time of the sentence in the input video/audio, in milliseconds. |
| speakers | A list of speaker IDs whose voices are identified in a given time frame. Normally this list contains a single speaker ID. |
| keywords | A list of keywords identified in the sentence. |
| metadata | A list of possible contexts for a given keyword. |
| keySentence | The key sentence identified in the given time frame. |
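
To show how these fields might be consumed, the sketch below walks the transcript list and prints each sentence with its speaker and its time range converted from milliseconds to seconds. The sample data is trimmed from the example metadata response; `format_transcript` is a hypothetical helper, not part of the API.

```python
from typing import Any, Dict, List

# Trimmed sample in the shape of the example metadata response.
response_data: Dict[str, Any] = {
    "transcript": [
        {
            "sentence": "Good evening teresa.",
            "channelId": 0,
            "startTime": 1390,
            "endTime": 2690,
            "speakers": ["1"],
            "keywords": [
                {"keyword": "good evening teresa", "metadata": [], "type": "DNN"}
            ],
            "keySentence": "Good evening teresa.",
        }
    ]
}


def format_transcript(data: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: render each sentence as '[start-end s] speaker N: text'."""
    lines = []
    for item in data.get("transcript", []):
        start_s = item["startTime"] / 1000  # milliseconds -> seconds
        end_s = item["endTime"] / 1000
        speakers = ", ".join(item.get("speakers", [])) or "unknown"
        lines.append(
            f"[{start_s:.2f}-{end_s:.2f}s] speaker {speakers}: {item['sentence']}"
        )
    return lines


for line in format_transcript(response_data):
    print(line)  # [1.39-2.69s] speaker 1: Good evening teresa.
```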

Sentence-Level Keyword Types

| type | Description |
| --- | --- |
| DNN | Keywords generated by AI based on the key concepts spoken and on topic modeling. |
| Techphrase | Technology terms extracted from the conversation. |
| NER | Entities extracted from the conversation, such as custom, location, person, date, number, organization, date-time, and date range (entity labels include PERSON, GPE, PRODUCT, ORG, EVENT). |
| Finance_phrase | Financial terms extracted from the conversation. |
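
Because each keyword carries a `type` value, a client may want to group keywords by it. The sketch below is one possible way to do that; `keywords_by_type` is a hypothetical helper, and the second sample entry is made up for illustration (only the field names follow the response format).

```python
from collections import defaultdict
from typing import Any, Dict, List


def keywords_by_type(transcript: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Hypothetical helper: collect keyword strings under their `type` value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item in transcript:
        for kw in item.get("keywords", []):
            grouped[kw["type"]].append(kw["keyword"])
    return dict(grouped)


# Illustrative sample; the NER entry is invented for this example.
sample = [
    {"keywords": [{"keyword": "good evening teresa", "metadata": [], "type": "DNN"}]},
    {"keywords": [{"keyword": "teresa", "metadata": [], "type": "NER"}]},
]
print(keywords_by_type(sample))
```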
