Speech-to-text

Enable this model configuration to automatically convert speech to text.

Overview

Marsview Automatic Speech Recognition (ASR) technology converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premise, and provides accurate transcription with speaker separation, punctuation, casing, word-level time markers, and more.

Model Features

| Feature | Description |
| --- | --- |
| Speech-to-text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts domain-specific terminology, proper nouns, and abbreviations by adding a simple list/taxonomy of words or phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file, so that each word in the transcription text can be associated with its speaker. |
| Sentence-level Keywords | The most relevant topics, concepts, and discussion points from the conversation, generated based on the overall scope of the discussion. |

modelTypeConfiguration

| Key | Value |
| --- | --- |
| modelType | speech_to_text |
| modelConfig | Model configuration object for speech_to_text |
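For orientation, here is a minimal sketch of where these two keys sit in the request body. The variable name `speech_to_text_model` is illustrative; a complete request appears in the Example Request section below.

# A single enableModels entry for this model (illustrative sketch;
# see the Example Request below for a complete request body).
speech_to_text_model = {
    "modelType": "speech_to_text",
    "modelConfig": {
        # parameters documented in the next table
    }
}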

modelConfig Parameters

| modelConfig | Description | Default |
| --- | --- | --- |
| custom_vocabulary | A list of custom vocabulary terms. | [ ] |
| speaker_separation.num_speakers | The number of speakers in the conversation. Set it to -1 to determine the number of speakers automatically. | -1 |
| multi_channel | The object used to configure the model to support multi-channel audio, if applicable. | {} |
| multi_channel.enable | Boolean to enable or disable multi-channel support. | False |
| multi_channel.channel_ids | A dictionary that maps channel IDs of the input audio. | {} |
| enableTopics | Boolean to enable or disable sentence-level topics. | False |
| enableKeywords | Boolean to enable or disable keywords. | True |
| aggressiveness | Determines how aggressive speaker separation should be. There is a trade-off between accurate speaker separation and preserving good context in the transcript for a given speaker: since conversations in a meeting tend to have many interruptions and cross-talk, splitting the transcript at every speaker transition results in poor context management and many breaks in the transcript. 1 = better context management, 2 = balanced, 3 = accurate speaker separation. | 1 |
| topics.threshold | The threshold that determines whether topics identified by the model should be kept. Can be any value between 0 and 1. | 0.5 |
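Assembled from the defaults above, a minimal modelConfig sketch looks like this (the values mirror the table; the Example Request below additionally sets automatic_punctuation):

# Illustrative modelConfig populated with the documented defaults.
speech_to_text_config = {
    "custom_vocabulary": [],          # e.g. ["Marsview", "Communication"]
    "speaker_separation": {
        "num_speakers": -1            # -1 = detect the speaker count automatically
    },
    "multi_channel": {
        "enable": False,
        "channel_ids": {}             # e.g. {"0": "0", "1": "1"}
    },
    "enableTopics": False,
    "enableKeywords": True,
    "aggressiveness": 1,              # 1 = context, 2 = balanced, 3 = aggressive separation
    "topics": {
        "threshold": 0.5              # keep topics scored above this value (0 to 1)
    }
}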

Example Request

CURL
Python
curl --location --request POST 'https://api.marsview.ai/cb/v1/conversation/compute' \
--header 'Content-Type: application/json' \
--header "Authorization: {{Insert Auth Token With Type}}" \
--data-raw '{
    "txnId": "{{Insert txn ID}}",
    "enableModels": [
        {
            "modelType": "speech_to_text",
            "modelConfig": {
                "automatic_punctuation": true,
                "custom_vocabulary": ["Marsview", "Communication"],
                "speaker_separation": {
                    "num_speakers": 2
                },
                "multi_channel": {
                    "enable": true,
                    "channel_ids": {
                        "0": "0",
                        "1": "1"
                    }
                },
                "enableKeywords": true,
                "enableTopics": true,
                "aggressiveness": 3,
                "topics": {
                    "threshold": 0.5
                }
            }
        }
    ]
}'
import requests

auth_token = "Replace this with your auth token"
txn_id = "Replace this with your transaction ID"

# Note: the speech-to-text model does not depend on any other models,
# so it can be used independently.
def get_speech_to_text():
    url = "https://api.marsview.ai/cb/v1/conversation/compute"
    payload = {
        "txnId": txn_id,
        "enableModels": [
            {
                "modelType": "speech_to_text",
                "modelConfig": {
                    "automatic_punctuation": True,
                    "custom_vocabulary": ["Marsview", "Communication"],
                    "speaker_separation": {
                        "num_speakers": 2
                    },
                    "multi_channel": {
                        "enable": True,
                        "channel_ids": {
                            "0": "0",
                            "1": "1"
                        }
                    },
                    "enableKeywords": True,
                    "enableTopics": False,
                    "aggressiveness": 2
                }
            }
        ]
    }
    headers = {"authorization": auth_token}
    response = requests.request("POST", url, headers=headers, json=payload)
    print(response.text)
    if response.status_code == 200 and response.json()["status"] == "true":
        return response.json()["data"]["enableModels"]["state"]["status"]
    else:
        raise Exception("Speech-to-text request failed: {}".format(response.text))

if __name__ == "__main__":
    get_speech_to_text()

Example Metadata Response

"data": {
"transcript": [
{
"sentence": "Good evening teresa.",
"channelId": 0
"startTime": 1390,
"endTime": 2690,
"speakers": [
"1"
],
"keywords": [
{
"keyword": "good evening teresa",
"metadata": [],
"type": "DNN"
}
],
"keySentence": "Good evening teresa.",
"topics": [
{
"tiers": [
{
"tierName": "Education",
"type": 1
}
],
"name": "Secondary Education"
},
{
"tiers": [
{
"tierName": "Education",
"type": 1
}
],
"name": "College Education"
},
],
"suggestedIntents": [
"foolish power school board",
"stressful situation",
"determination",
"good good job"
]
}
]
}

Response Object

| Field | Description |
| --- | --- |
| transcript | A list of sentences and their attributes identified by the speech-to-text model, split up based on the start and end time of each sentence in the input video/audio. |
| sentence | A sentence identified by the model in the given time frame. |
| startTime | Start time of the sentence in the input video/audio, in milliseconds. |
| endTime | End time of the sentence in the input video/audio, in milliseconds. |
| speakers | A list of speaker IDs whose voices are identified in the given time frame. Normally this list contains a single speaker ID. |
| keywords | A list of keywords identified in the sentence. |
| metadata | A list of possible contexts for a given keyword. |
| keySentence | The key sentence identified in the given time frame. |
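As a quick illustration of these fields, the sketch below walks the transcript list from the example response above and prints each sentence with its timing and speaker. The helper print_transcript is hypothetical (not part of any Marsview SDK) and assumes the response shape shown in the example.

def print_transcript(data):
    # `data` is assumed to be the "data" object from the example response above.
    for entry in data["transcript"]:
        start_s = entry["startTime"] / 1000.0   # times are in milliseconds
        end_s = entry["endTime"] / 1000.0
        speakers = ", ".join(entry["speakers"])
        print("[{:.2f}s - {:.2f}s] Speaker {}: {}".format(
            start_s, end_s, speakers, entry["sentence"]))
        for kw in entry.get("keywords", []):
            print("    keyword ({}): {}".format(kw["type"], kw["keyword"]))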

Sentence Level Keyword Types

| type | Description |
| --- | --- |
| DNN | Keywords generated by AI based on key concepts spoken and topic modeling. |
| Techphrase | Technology terms extracted from the conversation. |
| NER | Entities (custom, location, person, date, number, organization, date-time, date range, etc.) extracted from the conversation (PERSON, GPE, PRODUCT, ORG, EVENT). |
| Finance_phrase | Financial terms extracted from the conversation. |
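If you only need one class of keywords, you can filter on this type field. The sketch below is a hypothetical helper (keywords_of_type is not part of any Marsview SDK) that assumes the response shape from the example above and keeps only NER keywords:

def keywords_of_type(data, keyword_type="NER"):
    # Collect keywords whose "type" matches, across all transcript sentences.
    matches = []
    for entry in data["transcript"]:
        for kw in entry.get("keywords", []):
            if kw["type"] == keyword_type:
                matches.append(kw["keyword"])
    return matches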