Speech-to-text
Enable this model configuration to automatically convert speech to text with high accuracy.

Overview

Marsview Automatic Speech Recognition (ASR) technology accurately converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premise, and provides superior accuracy, speaker separation, punctuation, casing, word-level time markers, and more.

Model Features

| Feature | Description |
| --- | --- |
| Speech-to-text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts domain-specific terminology, proper nouns, and abbreviations by adding a simple list/taxonomy of words or phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file; each word in the transcription text can be associated with its speaker. |
| Sentence-level Keywords | Generates the most relevant topics, concepts, and discussion points from the conversation based on the overall scope of the discussion. |

modelTypeConfiguration

| Key | Value |
| --- | --- |
| modelType | speech_to_text |
| modelConfig | Model Configuration object for speech_to_text |

modelConfig Parameters

| modelConfig | Description | Default |
| --- | --- | --- |
| custom_vocabulary | A list of custom vocabulary terms. | [ ] |
| speaker_separation.num_speakers | The number of speakers in the conversation. Set it to -1 to determine the number of speakers automatically. | -1 |
| enableTopics | Boolean to enable or disable sentence-level topics. | False |
| enableKeywords | Boolean to enable or disable keywords. | True |
| topics.threshold | The threshold that determines whether topics identified by the model should be considered. It can be any value between 0 and 1. | 0.5 |
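The parameters above can be assembled into a modelConfig object before sending the request. A minimal sketch follows; the field names and defaults come from the table above, while the build_model_config helper itself is a hypothetical convenience, not part of the Marsview API.

```python
# Hypothetical helper that assembles a speech_to_text modelConfig using the
# documented defaults; only the field names come from the table above.
def build_model_config(custom_vocabulary=None, num_speakers=-1,
                       enable_topics=False, enable_keywords=True,
                       topics_threshold=0.5):
    # topics.threshold is documented as a value between 0 and 1.
    if not 0 <= topics_threshold <= 1:
        raise ValueError("topics.threshold must be between 0 and 1")
    return {
        "custom_vocabulary": custom_vocabulary or [],
        "speaker_separation": {"num_speakers": num_speakers},
        "enableTopics": enable_topics,
        "enableKeywords": enable_keywords,
        "topics": {"threshold": topics_threshold},
    }

config = build_model_config(custom_vocabulary=["Marsview"], num_speakers=2)
```

The resulting dictionary can be dropped into the "modelConfig" field of the request body shown below.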

Example Request

CURL

curl --location --request POST 'https://api.marsview.ai/cb/v1/conversation/compute' \
--header 'Content-Type: application/json' \
--header "Authorization: {{Insert Auth Token With Type}}" \
--data-raw '{
    "txnId": "{{Insert txn ID}}",
    "enableModels": [
        {
            "modelType": "speech_to_text",
            "modelConfig": {
                "automatic_punctuation": true,
                "custom_vocabulary": ["Marsview", "Communication"],
                "speaker_separation": {
                    "num_speakers": 2
                },
                "enableKeywords": true,
                "enableTopics": true,
                "topics": {
                    "threshold": 0.5
                }
            }
        }
    ]
}'
Python

import requests

auth_token = "replace this with your auth token"
txn_id = "Replace this with your transaction ID"

# Note: the speech-to-text model does not depend on any other models,
# so it can be used independently.

def get_speech_to_text():
    url = "https://api.marsview.ai/cb/v1/conversation/compute"
    payload = {
        "txnId": txn_id,
        "enableModels": [
            {
                "modelType": "speech_to_text",
                "modelConfig": {
                    "automatic_punctuation": True,
                    "custom_vocabulary": ["Marsview", "Communication"],
                    "speaker_separation": {
                        "num_speakers": 2
                    },
                    "enableKeywords": True,
                    "enableTopics": False
                }
            }
        ]
    }

    headers = {"authorization": "{}".format(auth_token)}

    response = requests.request("POST", url, headers=headers, json=payload)
    print(response.text)
    if response.status_code == 200 and response.json()["status"] == "true":
        return response.json()["data"]["enableModels"]["state"]["status"]
    else:
        raise Exception("Speech-to-text request failed: {}".format(response.text))

if __name__ == "__main__":
    get_speech_to_text()

Example Metadata Response

"data": {
    "transcript": [
        {
            "sentence": "Good evening teresa.",
            "startTime": 1390,
            "endTime": 2690,
            "speakers": [
                "1"
            ],
            "keywords": [
                {
                    "keyword": "good evening teresa",
                    "metadata": [],
                    "type": "DNN"
                }
            ],
            "keySentence": "Good evening teresa.",
            "topics": [
                {
                    "tiers": [
                        {
                            "tierName": "Education",
                            "type": 1
                        }
                    ],
                    "name": "Secondary Education"
                },
                {
                    "tiers": [
                        {
                            "tierName": "Education",
                            "type": 1
                        }
                    ],
                    "name": "College Education"
                }
            ],
            "suggestedIntents": [
                "foolish power school board",
                "stressful situation",
                "determination",
                "good good job"
            ]
        }
    ]
}

Response Object

| Field | Description |
| --- | --- |
| transcript | A list of sentences and their attributes identified by the speech-to-text model, split based on the start and end time of each sentence in the input video/audio. |
| sentence | A sentence identified by the model in the given time frame. |
| startTime | Start time of the sentence in the input video/audio, in milliseconds. |
| endTime | End time of the sentence in the input video/audio, in milliseconds. |
| speakers | A list of speaker IDs whose voices are identified in the given time frame. Normally this list has a single speaker ID. |
| keywords | A list of keywords identified in the sentence. |
| metadata | A list of possible contexts for a given keyword. |
| keySentence | The key sentence identified in the given time frame. |
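Once a response arrives, the transcript list can be walked to recover per-speaker text and sentence timings. A minimal sketch, assuming only the response shape shown in the example above (the summarize_transcript helper is illustrative, not part of the API):

```python
# Group transcript sentences by speaker and compute each sentence's duration
# in milliseconds, assuming the response layout from the example above.
def summarize_transcript(data):
    by_speaker = {}
    for entry in data["transcript"]:
        duration_ms = entry["endTime"] - entry["startTime"]
        for speaker in entry["speakers"]:
            by_speaker.setdefault(speaker, []).append(
                {"sentence": entry["sentence"], "duration_ms": duration_ms}
            )
    return by_speaker

data = {"transcript": [{"sentence": "Good evening teresa.",
                        "startTime": 1390, "endTime": 2690,
                        "speakers": ["1"]}]}
print(summarize_transcript(data))
# {'1': [{'sentence': 'Good evening teresa.', 'duration_ms': 1300}]}
```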

Sentence-level Keyword Types

| type | Description |
| --- | --- |
| DNN | Keywords generated by AI based on key concepts spoken and topic modeling. |
| Techphrase | Technology terms extracted from the conversation. |
| NER | Entities such as custom, location, person, date, number, organization, date-time, date range, etc., extracted from the conversation (PERSON, GPE, PRODUCT, ORG, EVENT). |
| Finance_phrase | Financial terms extracted from the conversation. |
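Because every keyword object carries a type field, keywords can be filtered by the generator that produced them, for example to keep only NER entities. A sketch assuming the response shape from the example above (keywords_of_type is an illustrative helper, not part of the API):

```python
# Collect keywords of a given type from all transcript entries, assuming the
# keyword objects carry the "type" values listed in the table above.
def keywords_of_type(data, keyword_type):
    return [kw["keyword"]
            for entry in data["transcript"]
            for kw in entry.get("keywords", [])
            if kw["type"] == keyword_type]

data = {"transcript": [{"keywords": [
    {"keyword": "good evening teresa", "metadata": [], "type": "DNN"}]}]}
print(keywords_of_type(data, "DNN"))   # ['good evening teresa']
print(keywords_of_type(data, "NER"))   # []
```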