Speech-to-text
Enable this model configuration to automatically convert speech to text with high accuracy.
Overview
Marsview Automatic Speech Recognition (ASR) technology accurately converts speech into text in live or batch mode. The API can be deployed in the cloud or on-premise. It provides high accuracy, speaker separation, punctuation, casing, word-level time markers, and more.
Model Features
| Features | Feature Description |
| --- | --- |
| Speech-to-text | Accurately converts speech into text in live or batch mode. |
| Automatic Punctuation | Accurately adds punctuation to the transcribed text. |
| Custom Vocabulary | Boosts domain-specific terminology, proper nouns, and abbreviations by adding a simple list/taxonomy of words or phrases. |
| Speaker Separation | Automatically detects the number of speakers in your audio file; each word in the transcript can be associated with its speaker. |
| Sentence Level Keywords | The most relevant topics, concepts, and discussion points from the conversation are generated based on the overall scope of the discussion. |
modelType Configuration
| Configuration Keys | Value |
| --- | --- |
| modelType | speech_to_text |
| modelConfig | Model Configuration object for speech_to_text |
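
Taken together with the parameters below, the configuration entry for this model is an object with a modelType and a modelConfig. The following is a minimal sketch in Python that uses only the keys documented on this page; it is illustrative rather than a complete configuration.

```python
# Minimal sketch of the speech_to_text model configuration object.
# Only keys documented on this page are shown; the remaining
# modelConfig parameters from the table below would sit alongside
# multi_channel.
speech_to_text_model = {
    "modelType": "speech_to_text",
    "modelConfig": {
        "multi_channel": {
            "enable": False,      # multi-channel support off by default
            "channel_ids": {},    # maps channel IDs of the input audio
        },
    },
}
```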
modelConfig Parameters
| modelConfig | Description | Defaults |
| --- | --- | --- |
|  | A list of custom vocabulary terms. | [ ] |
|  | The number of speakers in the conversation. Set it to -1 to determine the number of speakers automatically. | -1 |
| multi_channel | The object used to configure the model to support multi-channel audio, if applicable. | {} |
| multi_channel.enable | Boolean to enable or disable multi_channel support. | False |
| multi_channel.channel_ids | A dictionary that maps the channel IDs of the input audio. | {} |
|  | Boolean to enable or disable sentence-level topics. | False |
|  | Boolean to enable or disable keywords. | True |
|  | Aggressiveness factor that determines how aggressive the speaker separation should be. There is a trade-off between accurate speaker separation and keeping good context of the transcript for a given speaker: since conversations in a meeting tend to have many interruptions and cross-talk, splitting the transcript at every speaker transition leads to poor context management and many breaks in the transcript. 1 = better context management, 2 = balanced, 3 = accurate speaker separation. | 1 |
|  | The threshold that determines whether the topics identified by the model should be kept. The threshold can be any value between 0 and 1. | 0.5 |
Example Request
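
A minimal sketch of a request that enables the speech_to_text model is shown below in Python. The endpoint URL, the enableModels wrapper, and the bearer-token header are illustrative placeholders rather than confirmed API details; only the modelType/modelConfig object reflects the configuration documented above.

```python
import requests

# NOTE: the endpoint URL, the enableModels wrapper and the auth header
# are illustrative placeholders, not confirmed API details.
API_URL = "https://api.example.com/v1/conversation/compute"  # placeholder
ACCESS_TOKEN = "<your-access-token>"                          # placeholder

payload = {
    "enableModels": [  # hypothetical wrapper around the model configuration
        {
            "modelType": "speech_to_text",
            "modelConfig": {
                "multi_channel": {
                    "enable": False,
                    "channel_ids": {},
                },
            },
        }
    ]
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(response.status_code, response.json())
```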
Example Metadata Response
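
The sketch below illustrates the shape implied by the Response Object table that follows. The field names (data, sentence, startTime, endTime, speakers, keywords, context) are illustrative assumptions, not confirmed field names.

```python
# Illustrative response shape only; field names are assumptions based
# on the Response Object descriptions below.
example_metadata_response = {
    "data": [
        {
            "sentence": "Let's review the quarterly revenue numbers.",  # key sentence in this time frame
            "startTime": 12040,   # start of the sentence, in milliseconds
            "endTime": 15310,     # end of the sentence, in milliseconds
            "speakers": [1],      # speaker IDs identified for this time frame
            "keywords": [
                {
                    "text": "quarterly revenue",        # keyword identified in the sentence
                    "context": ["revenue discussion"],  # possible contexts for the keyword
                }
            ],
        }
    ]
}
```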
Response Object
| Field | Description |
| --- | --- |
|  | A list of sentences and their attributes identified by the speech-to-text model, split up based on the start and end time of each sentence in the input video/audio. |
|  | A sentence identified by the model in the given time frame. |
|  | Start time of the sentence in the input video/audio, in milliseconds. |
|  | End time of the sentence in the input video/audio, in milliseconds. |
|  | A list of speaker IDs whose voices are identified in the given time frame. Normally this list contains a single speaker ID. |
|  | A list of keywords identified in the sentence. |
|  | A list of possible contexts for a given keyword. |
|  | The key sentence identified in the given time frame. |
Sentence Level Keywords type
| type | Description |
| --- | --- |
|  | Keywords generated by AI based on key concepts spoken and topic modeling. |
|  | Extracts technology terms from the conversation. |
|  | Extracts entities such as custom entities, location, person, date, number, organization, date-time, date range, etc. from the conversation (PERSON, GPE, PRODUCT, ORG, EVENT). |
|  | Extracts financial terms from the conversation. |
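
As an illustration of how the keyword types above could be consumed, the hypothetical helper below groups sentence-level keywords by a type field, assuming keyword objects follow the illustrative shape from the example response sketch.

```python
from collections import defaultdict

def group_keywords_by_type(sentences):
    """Group keyword texts by their `type` (hypothetical field name)."""
    grouped = defaultdict(list)
    for sentence in sentences:
        for keyword in sentence.get("keywords", []):
            grouped[keyword.get("type", "unknown")].append(keyword.get("text"))
    return dict(grouped)

# Example: group_keywords_by_type(example_metadata_response["data"])
```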