Speech-to-text
Enable this model configuration to automatically convert speech to text.
Marsview Automatic Speech Recognition (ASR) technology accurately converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premise, and provides speaker separation, punctuation, casing, word-level time markers, and more.
| Feature | Description |
| --- | --- |
| Speech-to-text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts domain-specific terminology, proper nouns, and abbreviations via a simple list/taxonomy of words or phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file and associates each word in the transcript with its speaker. |
| Sentence-level Keywords | Generates the most relevant topics, concepts, and discussion points from the conversation based on the overall scope of the discussion. |
| Key | Value |
| --- | --- |
| modelType | speech_to_text |
| modelConfig | Model configuration object for speech_to_text |
| modelConfig | Description | Default |
| --- | --- | --- |
| custom_vocabulary | A list of custom vocabulary terms. | [ ] |
| speaker_separation.num_speakers | The number of speakers in the conversation. Set it to -1 to determine the number of speakers automatically. | -1 |
| multi_channel | The object used to configure multi-channel audio support, if applicable. | {} |
| multi_channel.enable | Boolean to enable or disable multi-channel support. | False |
| multi_channel.channel_ids | A dictionary that maps channel IDs of the input audio. | {} |
| enableTopics | Boolean to enable or disable sentence-level topics. | False |
| enableKeywords | Boolean to enable or disable keywords. | True |
| aggressiveness | Determines how aggressive the speaker separation should be. There is a trade-off between accurate speaker separation and preserving good context in the transcript for a given speaker: conversations in a meeting tend to have many interruptions and cross-talk, so splitting the transcript at every speaker transition fragments the context and creates many breaks in the transcript. 1 = better context management, 2 = balanced, 3 = accurate speaker separation. | 1 |
| topics.threshold | The threshold that determines whether topics identified by the model should be included. Can be any value between 0 and 1. | 0.5 |
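For reference, the defaults in the table above can be spelled out explicitly. The sketch below is illustrative only, written as a Python dict to match the example that follows; every key shown is optional:

# A modelConfig that simply restates the documented defaults
# (illustrative; omitting a key has the same effect).
default_stt_model = {
    "modelType": "speech_to_text",
    "modelConfig": {
        "custom_vocabulary": [],                     # no boosted terms
        "speaker_separation": {"num_speakers": -1},  # auto-detect speaker count
        "multi_channel": {"enable": False, "channel_ids": {}},
        "enableTopics": False,
        "enableKeywords": True,
        "aggressiveness": 1,                         # favor context over strict separation
        "topics": {"threshold": 0.5},
    },
}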
CURL
Python
curl --location --request POST 'https://api.marsview.ai/cb/v1/conversation/compute' \
--header 'Content-Type: application/json' \
--header "Authorization: {{Insert Auth Token With Type}}" \
--data-raw '{
    "txnId": "{{Insert txn ID}}",
    "enableModels": [
        {
            "modelType": "speech_to_text",
            "modelConfig": {
                "automatic_punctuation": true,
                "custom_vocabulary": ["Marsview", "Communication"],
                "speaker_separation": {
                    "num_speakers": 2
                },
                "multi_channel": {
                    "enable": true,
                    "channel_ids": {
                        "0": "0",
                        "1": "1"
                    }
                },
                "enableKeywords": true,
                "enableTopics": true,
                "aggressiveness": 3,
                "topics": {
                    "threshold": 0.5
                }
            }
        }
    ]
}'
import requests

auth_token = "Replace this with your auth token"
txn_id = "Replace this with your transaction ID"

# Note: the speech-to-text model does not depend on any other models,
# so it can be enabled independently.
def get_speech_to_text():
    url = "https://api.marsview.ai/cb/v1/conversation/compute"
    payload = {
        "txnId": txn_id,
        "enableModels": [
            {
                "modelType": "speech_to_text",
                "modelConfig": {
                    "automatic_punctuation": True,
                    "custom_vocabulary": ["Marsview", "Communication"],
                    "speaker_separation": {
                        "num_speakers": 2
                    },
                    "multi_channel": {
                        "enable": True,
                        "channel_ids": {
                            "0": "0",
                            "1": "1"
                        }
                    },
                    "enableKeywords": True,
                    "enableTopics": False,
                    "aggressiveness": 2
                }
            }
        ]
    }
    headers = {"authorization": auth_token}
    response = requests.request("POST", url, headers=headers, json=payload)
    print(response.text)
    if response.status_code == 200 and response.json()["status"] == "true":
        return response.json()["data"]["enableModels"]["state"]["status"]
    else:
        raise Exception("Speech-to-text compute request failed: {}".format(response.text))

if __name__ == "__main__":
    get_speech_to_text()
"data": {
"transcript": [
{
"sentence": "Good evening teresa.",
"channelId": 0
"startTime": 1390,
"endTime": 2690,
"speakers": [
"1"
],
"keywords": [
{
"keyword": "good evening teresa",
"metadata": [],
"type": "DNN"
}
],
"keySentence": "Good evening teresa.",
"topics": [
{
"tiers": [
{
"tierName": "Education",
"type": 1
}
],
"name": "Secondary Education"
},
{
"tiers": [
{
"tierName": "Education",
"type": 1
}
],
"name": "College Education"
},
],
"suggestedIntents": [
"foolish power school board",
"stressful situation",
"determination",
"good good job"
]
}
]
}
| Field | Description |
| --- | --- |
| transcript | A list of sentences and their attributes identified by the speech-to-text model, split up based on the start and end time of each sentence in the input video/audio. |
| sentence | A sentence identified by the model in the given time frame. |
| channelId | The audio channel the sentence was identified on (relevant when multi_channel is enabled). |
| startTime | Start time of the sentence in the input video/audio, in milliseconds. |
| endTime | End time of the sentence in the input video/audio, in milliseconds. |
| speakers | A list of speaker IDs whose voices are identified in the given time frame. Normally this list has a single speaker ID. |
| keywords | A list of keywords identified in the sentence. |
| metadata | A list of possible contexts for a given keyword. |
| keySentence | The key sentence identified in the given time frame. |
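As a short illustration, the transcript fields above are enough to compute per-speaker talk time. The sketch below assumes a parsed response shaped like the sample JSON in this section (the "data" object); speaker_talk_time is a hypothetical helper, not part of any Marsview SDK:

def speaker_talk_time(response_data):
    """Sum milliseconds of speech per speaker from a parsed "data" object."""
    totals = {}  # speaker ID -> total milliseconds
    for entry in response_data["transcript"]:
        duration_ms = entry["endTime"] - entry["startTime"]
        for speaker in entry["speakers"]:
            totals[speaker] = totals.get(speaker, 0) + duration_ms
    return totals

On the sample response above this returns {"1": 1300}, since the single sentence spans 1390 to 2690 ms.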
| Type | Description |
| --- | --- |
| DNN | Keywords generated by AI based on key concepts spoken and topic modeling. |
| Techphrase | Technology terms extracted from the conversation. |
| NER | Entities such as custom, location, person, date, number, organization, date-time, date range, etc. extracted from the conversation (PERSON, GPE, PRODUCT, ORG, EVENT). |
| Finance_phrase | Financial terms extracted from the conversation. |
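Keywords can likewise be filtered by their type field. Again a hypothetical sketch, assuming the sample response shape from this section:

def keywords_of_type(response_data, keyword_type):
    # Collect every keyword string whose "type" matches,
    # e.g. "DNN", "NER", "Techphrase", or "Finance_phrase".
    return [
        kw["keyword"]
        for entry in response_data["transcript"]
        for kw in entry["keywords"]
        if kw["type"] == keyword_type
    ]

For example, keywords_of_type(data, "DNN") on the sample above returns ["good evening teresa"].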