Released Models

NLP Models

Sequence classification

We provide sequence classification models based on CNN, LSTM, HAN (hierarchical attention network), transformer, and other architectures.

Sequence labeling

We provide an LSTM-based sequence labeling model and an LSTM-CRF model.

Pairwise modeling

We implement text-pair matching models that compute the similarity between sentence representations encoded by two LSTMs.
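As a rough illustration of the similarity step, the sketch below scores two sentence vectors with cosine similarity. The source does not say which similarity function DELTA uses, so cosine similarity is an assumption here, and the two vectors are hypothetical stand-ins for the outputs of the two LSTM encoders.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sentence vectors (not real encoder outputs).
left_sentence = [0.2, 0.7, 0.1]
right_sentence = [0.4, 1.4, 0.2]
score = cosine_similarity(left_sentence, right_sentence)
```

Because the second vector is a scaled copy of the first, the score is at the maximum of the cosine range; orthogonal vectors would score zero.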

Sequence-to-sequence (seq2seq) modeling

We implement standard seq2seq models using LSTMs with attention and transformers. Note that the seq2seq structure is also used for speech recognition; in DELTA, this part is shared between the NLP and ASR tasks.

Multi-task modeling

We implement a multi-task model for sequence classification and labeling, in which the sequence-level loss and the step-level loss are computed simultaneously. This model is used to jointly train an intent recognizer and a named entity recognizer.
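One way to picture the combined objective is as a sequence-level cross-entropy term plus a step-level cross-entropy term. The sketch below is illustrative only: averaging over steps, the `step_weight` parameter, and all function names are assumptions, not DELTA's actual loss.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target])


def joint_loss(intent_probs, intent_label, tag_probs, tag_labels, step_weight=1.0):
    """Sequence-level loss plus a weighted, step-averaged labeling loss.

    step_weight is a hypothetical knob for balancing the two terms.
    """
    if len(tag_probs) != len(tag_labels):
        raise ValueError("one label is required per step")
    seq_loss = cross_entropy(intent_probs, intent_label)
    step_losses = [cross_entropy(p, t) for p, t in zip(tag_probs, tag_labels)]
    step_loss = sum(step_losses) / len(step_losses)
    return seq_loss + step_weight * step_loss


# Toy predictions: one intent distribution and one tag distribution per token.
loss = joint_loss(
    intent_probs=[0.5, 0.5],
    intent_label=0,
    tag_probs=[[0.5, 0.5], [0.5, 0.5]],
    tag_labels=[1, 0],
)
```

Both terms come from the same forward pass, so a single backward pass would update the shared encoder for both tasks.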

Pretraining integration

We implement an interface for integrating a pretrained model into a DELTA model. The pretrained model dynamically generates embeddings, which are concatenated with the word embeddings for different tasks. Specifically, a user can first pretrain an ELMo or BERT model and then build a DELTA model with the pretrained model. Both models are combined into a single TensorFlow graph for training and inference. ELMo or BERT models trained with the official open-source libraries can be used directly in DELTA.
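The concatenation step can be sketched per token as follows. This is a plain-Python illustration rather than DELTA's TensorFlow code; the function name and the toy vectors are hypothetical.

```python
def concat_token_embeddings(word_vectors, pretrained_vectors):
    """Concatenate, token by token, a word embedding with a pretrained embedding."""
    if len(word_vectors) != len(pretrained_vectors):
        raise ValueError("both inputs must cover the same number of tokens")
    # List concatenation: the output dimension is the sum of both dimensions.
    return [list(w) + list(p) for w, p in zip(word_vectors, pretrained_vectors)]


# Two tokens: 2-dim word embeddings, 3-dim contextual embeddings (toy values).
word_vectors = [[0.1, 0.2], [0.3, 0.4]]
pretrained_vectors = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
combined = concat_token_embeddings(word_vectors, pretrained_vectors)
```

Each token ends up with a 5-dimensional vector, which a downstream task model would consume in place of the word embedding alone.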

Speech models

Automatic speech recognition (ASR)

We provide an attention-based seq2seq ASR model. We also implement another popular type of ASR model based on connectionist temporal classification (CTC).
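For readers unfamiliar with CTC, the sketch below shows greedy (best-path) decoding: take the most likely label per frame, collapse consecutive repeats, then drop blanks. It is a generic illustration and not DELTA's decoder; placing the blank at index 0 is an assumption, since some toolkits put it at the last index.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding over per-frame label scores.

    blank=0 is an assumption of this sketch; adjust to match the model.
    """
    # Most likely label index for each frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy scores over 3 labels (index 0 is the blank) for 7 frames.
frame_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
labels = ctc_greedy_decode(frame_scores)
```

The blank frame between the two runs of label 1 is what lets the decoder emit that label twice in a row.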

Speaker verification/identification

We provide an X-vector text-independent model and an end-to-end model.

Speech emotion recognition

Several deep learning based approaches have recently been used successfully in speech emotion recognition, and we implement some of these models in DELTA.

Multimodal models


In our implementation, we use two sequential models (e.g., CNNs or LSTMs) to learn the sequence embeddings for speech and text separately, and then concatenate the learned embedding vectors for classification.
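The fusion described above can be sketched as follows. Mean pooling stands in for the CNN or LSTM encoders, and the linear classifier, weights, and function names are all hypothetical; only the encode-separately-then-concatenate structure comes from the description.

```python
import math


def mean_pool(sequence):
    """Average a sequence of equal-length vectors (stand-in for a CNN/LSTM encoder)."""
    length = len(sequence)
    return [sum(column) / length for column in zip(*sequence)]


def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(speech_frames, text_tokens, weights, bias):
    """Encode each modality separately, concatenate, then apply a linear classifier."""
    speech_vec = mean_pool(speech_frames)
    text_vec = mean_pool(text_tokens)
    fused = speech_vec + text_vec  # list concatenation of the two embeddings
    logits = [sum(w * x for w, x in zip(row, fused)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy inputs: 2 speech frames (2-dim), 2 text tokens (1-dim), 2 output classes.
probs = classify(
    speech_frames=[[1.0, 3.0], [3.0, 1.0]],
    text_tokens=[[0.0], [2.0]],
    weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```

The classifier input width equals the sum of the two embedding sizes, so either encoder can be swapped out without touching the other.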


We implement direct-concatenation data fusion in the data processing stage, so this type of multimodal training can be used directly with existing models in DELTA.