ASR Data
This tutorial discusses how to handle automatic speech recognition (ASR) tasks with DELTA.
Data description
For data preparation, refer to the directory egs/hkust/asr/v1.
By simply running ./run.sh, an open source dataset, HKUST, can be quickly downloaded and reformatted as shown below:
uttID: {
    "input": [
        {
            "feat": the file and the position where the features of the current utterance are stored,
            "name": "input1",
            "shape": [
                number_frames,
                dimension_feats
            ]
        }
    ],
    "output": [
        {
            "name": "target",
            "shape": [
                number_words,
                number_classes
            ],
            "text":
            "token":
            "tokenid":
        }
    ],
    "utt2spk": speaker index
}
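To make the layout concrete, the following minimal sketch parses one entry of this form and reads the two shape fields. Everything in the sample is hypothetical (the utterance ID, the feature location string, the shapes, and the speaker label), and the text, token, and tokenid fields are left out because their value format is not shown above. The helper name summarize is also an illustration, not part of DELTA.

```python
import json

# Hypothetical sample entry following the layout above. All values
# (utterance ID, feature location, shapes, speaker) are made up.
SAMPLE = """
{
  "utt_0001": {
    "input": [
      {"feat": "feats.ark:12", "name": "input1", "shape": [98, 40]}
    ],
    "output": [
      {"name": "target", "shape": [3, 5]}
    ],
    "utt2spk": "spk_01"
  }
}
"""


def summarize(entries):
    """Return (utt_id, num_frames, feat_dim, num_words, num_classes) per utterance."""
    rows = []
    for utt_id, entry in entries.items():
        # "shape" of the input is [number_frames, dimension_feats].
        num_frames, feat_dim = entry["input"][0]["shape"]
        # "shape" of the output is [number_words, number_classes].
        num_words, num_classes = entry["output"][0]["shape"]
        rows.append((utt_id, num_frames, feat_dim, num_words, num_classes))
    return rows


entries = json.loads(SAMPLE)
summary = summarize(entries)
print(summary)
```

Reading the shapes this way is useful for quick sanity checks, for example confirming that every utterance reports the same feature dimension and number of classes.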
Note that num_classes (shown as number_classes in the layout above) equals size_vocabulary + 2, where size_vocabulary is the size of the vocabulary. The value 0 and the largest value (num_classes - 1) are reserved for the blank label and the sos/eos label, respectively. For example, if the vocabulary consists of 3 different labels [a, b, c], then num_classes = 5 and the label indexing is {blank: 0, a: 1, b: 2, c: 3, sos/eos: 4}.
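The indexing rule can be sketched in a few lines of Python. This is only an illustration of the convention described above, not DELTA's own vocabulary code; the function name build_label_map and the placeholder strings "<blank>" and "<sos/eos>" are assumptions.

```python
def build_label_map(vocabulary):
    """Assign class indices: blank -> 0, vocabulary -> 1..V, sos/eos -> V + 1."""
    label_map = {"<blank>": 0}  # index 0 is reserved for the blank label
    for index, label in enumerate(vocabulary, start=1):
        label_map[label] = index
    # The largest index (num_classes - 1) is reserved for sos/eos.
    label_map["<sos/eos>"] = len(vocabulary) + 1
    return label_map


label_map = build_label_map(["a", "b", "c"])
num_classes = len(label_map)  # size_vocabulary + 2
print(label_map)
print(num_classes)
```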
Model training
- For ASR tasks, a default config file is provided in conf/asr-ctc.yml. Two CTC-based models, CTCAsrModel and CTC5BlstmAsrModel, are supported in DELTA. Their details can be found in delta/models/asr_model.py.
- After setting up the config file, run the following script to train an ASR model:
python3 delta/main.py --config egs/hkust/asr/v1/conf/asr-ctc.yml --cmd train_and_eval
- As in ESPnet, the class index of the blank label is set to 0 in AsrSeqTask. However, the default blank label used by TensorFlow's tf.nn.ctc_loss is num_classes - 1. To resolve this mismatch, the ctc_data_transform interface is provided in delta/utils/loss/loss_utils.py. For the logits generated by the ASR model, this interface moves the blank_label column to the end. For the input labels, it changes the value of blank_label elements to num_classes - 1 and reduces by 1 the value of every other label whose class index is greater than blank_label.
- In delta/utils/decode/tf_ctc.py, two methods, ctc_greedy_decode and ctc_beam_search_decode, are provided to perform greedy decoding and beam search decoding on the logits, respectively. At this stage, the same mismatch between the blank label index of the input logits and num_classes - 1 can occur, so the ctc_decode_blankid_to_last method is provided to address it. In addition, to undo the effect of the changed blank label index, ctc_decode_last_to_blankid should be applied to the decoding result (after repeated labels and blank symbols have been removed) to map the label indices back to the original indexing.
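The index remapping described in the last two items can be illustrated with plain Python lists. This is a minimal sketch of the arithmetic only: the function names are hypothetical, and the actual DELTA interfaces (ctc_data_transform, ctc_decode_blankid_to_last, ctc_decode_last_to_blankid) are not reproduced here and may differ in signature and input types.

```python
def move_blank_column_to_last(frame_logits, blank_index):
    """Move the blank entry of one frame's logits to the last position."""
    rest = frame_logits[:blank_index] + frame_logits[blank_index + 1:]
    return rest + [frame_logits[blank_index]]


def blank_to_last(label, blank_index, num_classes):
    """Remap a label so that the blank label becomes num_classes - 1."""
    if label == blank_index:
        return num_classes - 1
    if label > blank_index:
        return label - 1  # labels above the blank shift down by one
    return label


def last_to_blank(label, blank_index, num_classes):
    """Inverse of blank_to_last, used to restore the original indexing."""
    if label == num_classes - 1:
        return blank_index
    if label >= blank_index:
        return label + 1  # undo the downward shift
    return label


NUM_CLASSES = 5   # {blank: 0, a: 1, b: 2, c: 3, sos/eos: 4}
BLANK_INDEX = 0

frame = [10, 11, 12, 13, 14]  # one frame of logits, blank score first
moved = move_blank_column_to_last(frame, BLANK_INDEX)

labels = [1, 2, 3, 4, 0]
shifted = [blank_to_last(l, BLANK_INDEX, NUM_CLASSES) for l in labels]
restored = [last_to_blank(l, BLANK_INDEX, NUM_CLASSES) for l in shifted]

print(moved)     # blank score is now the last entry
print(shifted)   # labels in the "blank is last" indexing
print(restored)  # labels mapped back to the original indexing
```

With blank at index 0, every non-blank label shifts down by one before the loss or decoder is called, and shifts back up by one afterwards, so the rest of the pipeline keeps using the original indexing.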