ASR Data

This tutorial describes how to handle automatic speech recognition (ASR) tasks with DELTA.

Data description

For data preparation, refer to the directory 'egs/hkust/asr/v1'. Running ./run.sh there downloads HKUST, an open-source dataset, and reformats it as shown below:

uttID: {
   "input": [
      {
        "feat": the file and the position where the features of the current utterance are stored,
        "name": "input1",
        "shape": [
            number_frames,
            dimension_feats
        ]
      }
   ],
   "output": [
       {
           "name": "target",
           "shape": [
               number_words,
               number_classes
           ],
           "text":
           "token":
           "tokenid":
       }
   ],
   "utt2spk": speaker index
}
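
To make the layout concrete, here is a minimal sketch that parses one entry shaped like the template above. All values (utterance ID, file name, shapes, transcript) are made up for illustration. The sketch also assumes that "feat" uses a Kaldi-style "file:offset" string and that "tokenid" is a space-separated string of integers; this tutorial confirms neither, so check the files generated by run.sh before relying on them.

```python
import json

# Hypothetical entry: every value below is invented for illustration only.
entry_json = """
{
  "utt_0001": {
    "input": [
      {"feat": "feats.ark:42", "name": "input1", "shape": [120, 40]}
    ],
    "output": [
      {"name": "target", "shape": [3, 5], "text": "a b c",
       "token": "a b c", "tokenid": "1 2 3"}
    ],
    "utt2spk": "spk_01"
  }
}
"""

data = json.loads(entry_json)
for utt_id, utt in data.items():
    # Input shape is [number_frames, dimension_feats].
    num_frames, feat_dim = utt["input"][0]["shape"]
    # Output shape is [number_words, number_classes].
    num_words, num_classes = utt["output"][0]["shape"]
    # Assumption: "feat" is "<file>:<offset>" (Kaldi-style).
    feat_file, offset = utt["input"][0]["feat"].rsplit(":", 1)
    # Assumption: "tokenid" is a space-separated string of integers.
    token_ids = [int(t) for t in utt["output"][0]["tokenid"].split()]
    print(utt_id, num_frames, feat_dim, num_words, num_classes,
          feat_file, int(offset), token_ids)
```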

Note that num_classes = size_vocabulary + 2, where size_vocabulary is the size of the vocabulary. The value 0 and the largest value (num_classes - 1) are reserved for the blank label and the sos/eos label, respectively. For example, if the vocabulary consists of 3 labels [a, b, c], then num_classes = 5 and the label indexing is {blank:0, a:1, b:2, c:3, sos/eos:4}.
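
The indexing rule can be sketched in a few lines of Python. The helper name build_label_map and the placeholder strings "<blank>" and "<sos/eos>" are hypothetical and chosen only for this example; they are not taken from DELTA.

```python
def build_label_map(vocabulary):
    """Map labels to class indices: blank -> 0, vocabulary -> 1..V, sos/eos -> V + 1."""
    label_to_id = {"<blank>": 0}  # hypothetical placeholder name for the blank label
    for index, label in enumerate(vocabulary, start=1):
        label_to_id[label] = index
    # sos/eos takes the largest index, num_classes - 1.
    label_to_id["<sos/eos>"] = len(vocabulary) + 1
    return label_to_id


label_map = build_label_map(["a", "b", "c"])
num_classes = len(label_map)  # size_vocabulary + 2
print(label_map)
print(num_classes)
```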

Model training

  1. For ASR tasks, a default config file is provided in conf/asr-ctc.yml. DELTA supports two CTC-based models, CTCAsrModel and CTC5BlstmAsrModel. Their details can be found in delta/models/asr_model.py.
  2. After setting up the config file, run the following script to train an ASR model:
python3 delta/main.py --config egs/hkust/asr/v1/conf/asr-ctc.yml --cmd train_and_eval 
  3. As in ESPnet, the class index of the blank label is set to 0 in AsrSeqTask. However, the default blank label used by TensorFlow's tf.nn.ctc_loss is num_classes - 1. To resolve this mismatch, delta/utils/loss/loss_utils.py provides the ctc_data_transform interface. For the logits generated by the ASR model, this interface moves the blank_label column to the end. For the input labels, it changes the value of blank_label elements to num_classes - 1 and reduces by 1 the value of every other label whose class index is greater than blank_label.
  4. In delta/utils/decode/tf_ctc.py, two methods, ctc_greedy_decode and ctc_beam_search_decode, perform greedy and beam search decoding on the logits, respectively. The same mismatch can occur at this stage: the blank label index in the input logits may differ from num_classes - 1. The ctc_decode_blankid_to_last method is provided to address this. Specifically, to undo the change of blank label index, ctc_decode_last_to_blankid should be applied to the decoding result (from which repeated labels and blank symbols have already been removed) to map the label indices back.
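
The index remapping described in the last two items can be illustrated with a small NumPy sketch. This is not DELTA's implementation: the function names blank_to_last and last_to_blank are hypothetical stand-ins, and the code only mirrors the behavior described above (move the blank column to the end, shift the labels, then invert the shift after decoding).

```python
import numpy as np


def blank_to_last(logits, labels, blank_index=0):
    """Move the blank column to the end of the logits and remap labels to match."""
    num_classes = logits.shape[-1]
    # New column order: every class except blank, then blank last.
    order = [i for i in range(num_classes) if i != blank_index] + [blank_index]
    moved_logits = logits[..., order]
    labels = np.asarray(labels)
    moved_labels = np.where(
        labels == blank_index,
        num_classes - 1,  # blank becomes the last class
        np.where(labels > blank_index, labels - 1, labels),  # close the gap
    )
    return moved_logits, moved_labels


def last_to_blank(decoded, num_classes, blank_index=0):
    """Inverse mapping: restore the original class indices after decoding."""
    decoded = np.asarray(decoded)
    return np.where(
        decoded == num_classes - 1,
        blank_index,
        np.where(decoded >= blank_index, decoded + 1, decoded),
    )


# Two frames, five classes; column 0 is the blank class.
logits = np.arange(10).reshape(2, 5)
labels = [0, 1, 3, 4]

moved_logits, moved_labels = blank_to_last(logits, labels)
restored = last_to_blank(moved_labels, num_classes=5)
print(moved_logits)
print(moved_labels)
print(restored)
```

With blank_index = 0, every non-blank label shifts down by one and blank takes index num_classes - 1; applying last_to_blank to the shifted labels returns the original indices.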