espnet2.asr package¶
espnet2.asr.maskctc_model¶
-
class
espnet2.asr.maskctc_model.
MaskCTCInference
(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]¶ Bases:
torch.nn.modules.module.Module
Mask-CTC-based non-autoregressive inference
Initialize Mask-CTC inference
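A minimal construction sketch, assuming asr_model is an already trained MaskCTCModel (building the full frontend/encoder/decoder stack by hand is omitted); the iteration count and threshold below are illustrative values only:

from espnet2.asr.maskctc_model import MaskCTCInference

# `asr_model` is assumed to be a trained MaskCTCModel instance
mask_ctc = MaskCTCInference(
    asr_model=asr_model,
    n_iterations=10,              # number of iterative refinement steps
    threshold_probability=0.99,   # CTC posterior threshold for keeping tokens
)
# At decode time the module is applied to the encoder output of a single
# utterance; see espnet2/bin/asr_inference_maskctc.py for the full pipeline.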
-
class
espnet2.asr.maskctc_model.
MaskCTCModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood(nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, then calls nll on each batch and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples in each batch when computing nll;
you may change this to avoid OOM or to increase GPU memory usage
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood(nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.ctc¶
-
class
espnet2.asr.ctc.
CTC
(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = None, zero_infinity: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module.
- Parameters
odim – dimension of outputs
encoder_output_size – number of encoder projection units
dropout_rate – dropout rate (0.0 ~ 1.0)
ctc_type – builtin or gtnctc
reduce – reduce the CTC loss into a scalar
ignore_nan_grad – Same as zero_infinity (kept for backward compatibility)
zero_infinity – Whether to zero infinite losses and the associated gradients.
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns
argmax applied 2d tensor (B, Tmax)
- Return type
torch.Tensor
-
forward
(hs_pad, hlens, ys_pad, ys_lens)[source]¶ Calculate CTC loss.
- Parameters
hs_pad – batch of padded hidden state sequences (B, Tmax, D)
hlens – batch of lengths of hidden state sequences (B)
ys_pad – batch of padded character id sequence tensor (B, Lmax)
ys_lens – batch of lengths of character sequence (B)
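A minimal usage sketch with dummy tensors; the sizes (odim=50, encoder projection size 256) are assumptions, and the shapes follow the parameter descriptions above:

import torch
from espnet2.asr.ctc import CTC

ctc = CTC(odim=50, encoder_output_size=256)

hs_pad = torch.randn(2, 100, 256)        # (B, Tmax, D) padded encoder outputs
hlens = torch.tensor([100, 80])          # (B,) hidden state lengths
ys_pad = torch.randint(1, 50, (2, 20))   # (B, Lmax) padded label ids
ys_lens = torch.tensor([20, 15])         # (B,) label lengths

loss = ctc(hs_pad, hlens, ys_pad, ys_lens)   # scalar CTC loss (reduce=True)
frame_ids = ctc.argmax(hs_pad)               # (B, Tmax) frame-wise argmax ids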
espnet2.asr.__init__¶
espnet2.asr.espnet_model¶
-
class
espnet2.asr.espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
CTC-attention hybrid Encoder-Decoder model
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood(nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, then calls nll on each batch and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples in each batch when computing nll;
you may change this to avoid OOM or to increase GPU memory usage
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood(nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
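A shape-oriented sketch of the training and inference entry points, assuming model is an already constructed ESPnetASRModel (assembling the frontend, SpecAug, normalization, encoder, decoder, and CTC modules is omitted); the waveform lengths and token ids below are dummy values:

import torch

speech = torch.randn(2, 16000)                 # (Batch, Length, ...) raw waveforms
speech_lengths = torch.tensor([16000, 12000])  # (Batch,)
text = torch.randint(1, 50, (2, 12))           # (Batch, Length) token ids
text_lengths = torch.tensor([12, 9])           # (Batch,)

# Training-style call: frontend + encoder + decoder + loss computation
loss, stats, weight = model(speech, speech_lengths, text, text_lengths)

# Inference-style call: frontend + encoder only (used by asr_inference.py)
encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)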
-
espnet2.asr.postencoder.__init__¶
espnet2.asr.postencoder.hugging_face_transformers_postencoder¶
Hugging Face Transformers PostEncoder.
-
class
espnet2.asr.postencoder.hugging_face_transformers_postencoder.
HuggingFaceTransformersPostEncoder
(input_size: int, model_name_or_path: str)[source]¶ Bases:
espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder
Hugging Face Transformers PostEncoder.
Initialize the module.
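A minimal usage sketch; it requires the transformers package, the model name below is only an example, and the pretrained weights are downloaded on first use:

import torch
from espnet2.asr.postencoder.hugging_face_transformers_postencoder import (
    HuggingFaceTransformersPostEncoder,
)

postencoder = HuggingFaceTransformersPostEncoder(
    input_size=256, model_name_or_path="bert-base-uncased"
)
feats = torch.randn(2, 50, 256)        # (B, T, input_size) encoder outputs
feats_lengths = torch.tensor([50, 42])
out, out_lengths = postencoder(feats, feats_lengths)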
espnet2.asr.postencoder.abs_postencoder¶
-
class
espnet2.asr.postencoder.abs_postencoder.
AbsPostEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.__init__¶
espnet2.asr.preencoder.sinc¶
Sinc convolutions for raw audio input.
-
class
espnet2.asr.preencoder.sinc.
LightweightSincConvs
(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Lightweight Sinc Convolutions.
Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597
To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and use the sliding-window frontend with frontend: sliding_window in your YAML configuration file, so that the process flow is:
Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
Note that this method also performs data augmentation in the time domain (as opposed to the spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.
Initialize the module.
- Parameters
fs – Sample rate.
in_channels – Number of input channels.
out_channels – Number of output channels (for each input channel).
activation_type – Choice of activation function.
dropout_type – Choice of dropout function.
windowing_type – Choice of windowing function.
scale_type – Choice of filter-bank initialization scale.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply Lightweight Sinc Convolutions.
The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.
The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.
The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization, multiplied by C_in.
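A minimal sketch of the expected input layout, following the shapes described above (D_in=400 as produced by the sliding-window frontend); the batch and length values are dummy values:

import torch
from espnet2.asr.preencoder.sinc import LightweightSincConvs

preencoder = LightweightSincConvs(fs=16000, in_channels=1, out_channels=256)

x = torch.randn(2, 50, 1, 400)         # (B, T, C_in, D_in) with D_in=400
x_lengths = torch.tensor([50, 38])
y, y_lengths = preencoder(x, x_lengths)   # y: (B, T, C_out * D_out)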
-
gen_lsc_block
(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]¶ Generate a convolutional block for Lightweight Sinc convolutions.
Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.
- Parameters
in_channels – Number of input channels.
out_channels – Number of output channels.
depthwise_kernel_size – Kernel size of the depthwise convolution.
depthwise_stride – Stride of the depthwise convolution.
depthwise_groups – Number of groups of the depthwise convolution.
pointwise_groups – Number of groups of the pointwise convolution.
dropout_probability – Dropout probability in the block.
avgpool – If True, an AvgPool layer is inserted.
- Returns
Neural network building block.
- Return type
torch.nn.Sequential
-
class
espnet2.asr.preencoder.sinc.
SpatialDropout
(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]¶ Bases:
torch.nn.modules.module.Module
Spatial dropout module.
Apply dropout to full channels of input tensors with shape (B, C, D)
Initialize.
- Parameters
dropout_probability – Dropout probability.
shape (tuple, list) – Shape of input tensors.
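A minimal sketch of dropping whole channels of a (B, C, D) tensor; the sizes are dummy values:

import torch
from espnet2.asr.preencoder.sinc import SpatialDropout

drop = SpatialDropout(dropout_probability=0.15)
x = torch.randn(4, 32, 100)            # (B, C, D)
y = drop(x)                            # whole channels are zeroed in training mode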
espnet2.asr.preencoder.abs_preencoder¶
-
class
espnet2.asr.preencoder.abs_preencoder.
AbsPreEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.linear¶
Linear Projection.
-
class
espnet2.asr.preencoder.linear.
LinearProjection
(input_size: int, output_size: int)[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Linear Projection Preencoder.
Initialize the module.
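A minimal usage sketch with dummy sizes:

import torch
from espnet2.asr.preencoder.linear import LinearProjection

preencoder = LinearProjection(input_size=80, output_size=64)
x = torch.randn(2, 100, 80)            # (B, T, input_size)
x_lengths = torch.tensor([100, 75])
y, y_lengths = preencoder(x, x_lengths)   # y: (B, T, output_size)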
espnet2.asr.specaug.abs_specaug¶
-
class
espnet2.asr.specaug.abs_specaug.
AbsSpecAug
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract class for the augmentation of spectrogram
The process-flow:
Frontend -> SpecAug -> Normalization -> Encoder -> Decoder
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.specaug.__init__¶
espnet2.asr.specaug.specaug¶
SpecAugment module.
-
class
espnet2.asr.specaug.specaug.
SpecAug
(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]¶ Bases:
espnet2.asr.specaug.abs_specaug.AbsSpecAug
Implementation of SpecAug.
- Reference:
Daniel S. Park et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”
Warning
When using cuda mode, time_warp is not reproducible due to torch.nn.functional.interpolate.
-
forward
(x, x_lengths=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
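A minimal usage sketch on a batch of padded feature sequences; the feature dimension, mask widths, and lengths below are illustrative values:

import torch
from espnet2.asr.specaug.specaug import SpecAug

specaug = SpecAug(
    apply_time_warp=True,
    time_warp_window=5,
    apply_freq_mask=True,
    freq_mask_width_range=(0, 20),
    num_freq_mask=2,
    apply_time_mask=True,
    time_mask_width_range=(0, 40),
    num_time_mask=2,
)
x = torch.randn(2, 200, 80)            # (B, T, F), e.g. log-mel features
x_lengths = torch.tensor([200, 150])
x_aug, x_lengths = specaug(x, x_lengths)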
espnet2.asr.layers.__init__¶
espnet2.asr.layers.cgmlp¶
MLP with convolutional gating (cgMLP) definition.
References
https://openreview.net/forum?id=RA-zVvZLYIy https://arxiv.org/abs/2105.08050
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalGatingMLP
(size: int, linear_units: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Gating MLP (cgMLP).
-
forward
(x, mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.layers.fastformer¶
Fastformer attention definition.
- Reference:
Wu et al., “Fastformer: Additive Attention Can Be All You Need” https://arxiv.org/abs/2108.09084 https://github.com/wuch15/Fastformer
-
class
espnet2.asr.layers.fastformer.
FastSelfAttention
(size, attention_heads, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Fast self-attention used in Fastformer.
espnet2.asr.encoder.branchformer_encoder¶
Branchformer encoder definition.
- Reference:
Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding,” in Proceedings of ICML, 2022.
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoder
(input_size: int, output_size: int = 256, use_attn: bool = True, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', use_cgmlp: bool = True, cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', merge_method: str = 'concat', cgmlp_weight: Union[float, List[float]] = 0.5, attn_branch_drop_rate: Union[float, List[float]] = 0.0, num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
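A minimal usage sketch with mostly default settings; the feature dimension, number of blocks, and lengths are illustrative values:

import torch
from espnet2.asr.encoder.branchformer_encoder import BranchformerEncoder

encoder = BranchformerEncoder(input_size=80, output_size=256, num_blocks=2)

xs_pad = torch.randn(2, 100, 80)       # (#batch, L, input_size)
ilens = torch.tensor([100, 90])
out, olens, _ = encoder(xs_pad, ilens) # out: (#batch, L', output_size); L' is reduced by conv2d subsampling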
-
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoderLayer
(size: int, attn: Optional[torch.nn.modules.module.Module], cgmlp: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_method: str, cgmlp_weight: float = 0.5, attn_branch_drop_rate: float = 0.0, stochastic_depth_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Branchformer encoder layer module.
- Parameters
size (int) – model dimension
attn – standard self-attention or efficient attention, optional
cgmlp – ConvolutionalGatingMLP, optional
dropout_rate (float) – dropout probability
merge_method (str) – concat, learned_ave, fixed_ave
cgmlp_weight (float) – weight of the cgmlp branch, between 0 and 1, used if merge_method is fixed_ave
attn_branch_drop_rate (float) – probability of dropping the attn branch, used if merge_method is learned_ave
stochastic_depth_rate (float) – stochastic depth probability
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type
torch.Tensor
espnet2.asr.encoder.__init__¶
espnet2.asr.encoder.longformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.longformer_encoder.
LongformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]¶ Bases:
espnet2.asr.encoder.conformer_encoder.ConformerEncoder
Longformer SA Conformer encoder module.
- Parameters
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of decoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
attention_windows (list) – Layer-wise attention window sizes for longformer self-attn
attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn
attention_mode (str) – Implementation for longformer self-attn. Default=”sliding_chunks” Choose ‘n2’, ‘tvm’ or ‘sliding_chunks’. More details in https://github.com/allenai/longformer
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
espnet2.asr.encoder.vgg_rnn_encoder¶
-
class
espnet2.asr.encoder.vgg_rnn_encoder.
VGGRNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
VGGRNNEncoder class.
- Parameters
input_size – The number of expected features in the input
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
hidden_size – The number of hidden features
output_size – The number of output features
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.abs_encoder¶
-
class
espnet2.asr.encoder.abs_encoder.
AbsEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.transformer_encoder¶
Transformer encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
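A minimal usage sketch with dummy shapes; the sizes are illustrative values:

import torch
from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

encoder = TransformerEncoder(input_size=80, output_size=256, num_blocks=2)
xs_pad = torch.randn(2, 100, 80)       # (B, L, D)
ilens = torch.tensor([100, 70])        # (B,)
out, olens, _ = encoder(xs_pad, ilens) # out: (B, L', output_size) after conv2d subsampling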
espnet2.asr.encoder.wav2vec2_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.wav2vec2_encoder.
FairSeqWav2Vec2Encoder
(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Wav2Vec2 encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
w2v_url – url to Wav2Vec2.0 pretrained model
w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.
normalize_before – whether to use layer_norm before the first block
finetune_last_n_layers – last n layers to be fine-tuned in Wav2Vec2.0; 0 means to fine-tune every layer if freeze_w2v=False.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward FairSeqWav2Vec2 Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.rnn_encoder¶
-
class
espnet2.asr.encoder.rnn_encoder.
RNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
RNNEncoder class.
- Parameters
input_size – The number of expected features in the input
output_size – The number of output features
hidden_size – The number of hidden features
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
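A minimal usage sketch; the feature dimension and lengths are dummy values, and the default subsample schedule (2, 2, 1, 1) shortens the output:

import torch
from espnet2.asr.encoder.rnn_encoder import RNNEncoder

encoder = RNNEncoder(input_size=80, hidden_size=320, output_size=320, num_layers=4)
xs_pad = torch.randn(2, 100, 80)       # (B, L, D)
ilens = torch.tensor([100, 60])
out, olens, states = encoder(xs_pad, ilens)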
espnet2.asr.encoder.contextual_block_conformer_encoder¶
Created on Sat Aug 21 17:27:16 2021.
@author: Keqi Deng (UCAS)
-
class
espnet2.asr.encoder.contextual_block_conformer_encoder.
ContextualBlockConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Conformer encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to use positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.conformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.conformer_encoder.
ConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Conformer encoder module.
- Parameters
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of decoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
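A minimal usage sketch with mostly default settings; the feature dimension and lengths are illustrative values:

import torch
from espnet2.asr.encoder.conformer_encoder import ConformerEncoder

encoder = ConformerEncoder(input_size=80, output_size=256, num_blocks=2)
xs_pad = torch.randn(2, 100, 80)       # (#batch, L, input_size)
ilens = torch.tensor([100, 85])
out, olens, _ = encoder(xs_pad, ilens) # out: (#batch, L', output_size)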
espnet2.asr.encoder.hubert_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertEncoder
(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning
- Parameters
input_size – input dim
hubert_url – url to Hubert pretrained model
hubert_dir_path – directory to download the Wav2Vec2.0 pretrained model.
output_size – dimension of attention
normalize_before – whether to use layer_norm before the first block
freeze_finetune_updates – steps during which all layers except the output layer are frozen before tuning the whole model (necessary to prevent overfitting).
dropout_rate – dropout rate
activation_dropout – dropout rate in activation function
attention_dropout – dropout rate in attention
- Hubert specific Args:
Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert ASR Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertPretrainEncoder
(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert pretrain encoder module, used only for the pretraining stage
- Parameters
input_size – input dim
output_size – dimension of attention
linear_units – dimension of feedforward layers
attention_heads – the number of heads of multi head attention
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
hubert_dict – target dictionary for Hubert pretraining
label_rate – label frame rate. -1 for sequence label
sample_rate – target sample rate.
use_amp – whether to use automatic mixed precision
normalize_before – whether to use layer_norm before the first block
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.contextual_block_transformer_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.contextual_block_transformer_encoder.
ContextualBlockTransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Transformer encoder module.
Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to use positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.transducer.__init__¶
espnet2.asr.transducer.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet2.asr.transducer.beam_search_transducer.
BeamSearchTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, nbest: int = 1, token_list: Optional[List[str]] = None)[source]¶ Bases:
object
Beam search implementation for Transducer.
Initialize Transducer search module.
- Parameters
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Beam size.
lm – LM class.
lm_weight – LM weight for soft fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum output sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
score_norm – Normalize final scores by length. (“default”)
nbest – Number of final hypotheses.
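A construction sketch, assuming model is an already trained transducer ESPnetASRModel exposing model.decoder and model.joint_network, and enc_out is the (T, D_enc) encoder output of a single utterance:

from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

beam_search = BeamSearchTransducer(
    decoder=model.decoder,
    joint_network=model.joint_network,
    beam_size=10,
    search_type="default",   # "tsd", "alsd", "nsc", or "maes" select the other algorithms
    nbest=1,
)
nbest_hyps = beam_search(enc_out)      # list of Hypothesis, best-scoring first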
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters
h – Encoder output sequences. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Beam search implementation.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters
enc_out – Encoder output sequence. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy search implementation.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
1-best hypotheses.
- Return type
hyp
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Modified Adaptive Expansion Search (mAES) implementation.
Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
nsc_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ N-step constrained beam search implementation.
Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
prefix_search
(hyps: List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis], enc_out_t: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Prefix search for NSC and mAES strategies.
Based on https://arxiv.org/pdf/1211.3711.pdf
-
sort_nbest
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]¶ Sort hypotheses by score or score given sequence length.
- Parameters
hyps – Hypothesis.
- Returns
Sorted hypothesis.
- Return type
hyps
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters
enc_out – Encoder output sequence. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
class
espnet2.asr.transducer.beam_search_transducer.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]¶ Bases:
espnet2.asr.transducer.beam_search_transducer.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
-
dec_out
= None¶
-
lm_scores
= None¶
-
-
class
espnet2.asr.transducer.beam_search_transducer.
Hypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
-
lm_state
= None¶
-
espnet2.asr.transducer.error_calculator¶
Error Calculator module for Transducer.
-
class
espnet2.asr.transducer.error_calculator.
ErrorCalculatorTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
Calculate CER and WER for transducer models.
- Parameters
decoder – Decoder module.
token_list – List of tokens.
sym_space – Space symbol.
sym_blank – Blank symbol.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculatorTransducer.
-
calculate_cer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns
Average sentence-level CER score.
-
calculate_wer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns
Average sentence-level WER score
-
convert_to_char
(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character sequences.
- Parameters
pred – Prediction label ID sequences. (B, U)
target – Target label ID sequences. (B, L)
- Returns
Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)
- Return type
char_pred
espnet2.asr.decoder.__init__¶
espnet2.asr.decoder.abs_decoder¶
-
class
espnet2.asr.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.decoder.rnn_decoder¶
-
class
espnet2.asr.decoder.rnn_decoder.
RNNDecoder
(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
-
-
espnet2.asr.decoder.rnn_decoder.
build_attention_list
(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]¶
espnet2.asr.decoder.mlm_decoder¶
Masked LM Decoder definition.
-
class
espnet2.asr.decoder.mlm_decoder.
MLMDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns
tuple containing: x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type
(tuple)
-
espnet2.asr.decoder.transducer_decoder¶
(RNN-)Transducer decoder definition.
-
class
espnet2.asr.decoder.transducer_decoder.
TransducerDecoder
(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
(RNN-)Transducer decoder module.
- Parameters
vocab_size – Output dimension.
rnn_type – (RNN-)Decoder layers type.
num_layers – Number of decoder layers.
hidden_size – Number of decoder units per layer.
dropout – Dropout rate for decoder layers.
dropout_embed – Dropout rate for embedding layer.
embed_pad – Embed/Blank symbol ID.
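A minimal sketch of encoding label sequences; the vocabulary size and label values are dummy values, with 0 reserved for the blank/pad symbol:

import torch
from espnet2.asr.decoder.transducer_decoder import TransducerDecoder

decoder = TransducerDecoder(vocab_size=50, hidden_size=320, num_layers=1, embed_pad=0)
labels = torch.randint(1, 50, (2, 12)) # (B, L) label id sequences
dec_out = decoder(labels)              # decoder output sequences for the joint network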
-
batch_score
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters
hyps – Hypotheses.
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
cache – Pairs of (dec_out, dec_states) for each label sequences. (keys)
use_lm – Whether to compute label ID sequences for LM.
- Returns
Decoder output sequences. (B, D_dec) dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec)) lm_labels: Label ID sequences for LM. (B,)
- Return type
dec_out
-
create_batch_states
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]
- Returns
Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type
states
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters
labels – Label ID sequences. (B, L)
- Returns
Decoder output sequences. (B, T, U, D_dec)
- Return type
dec_out
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]¶ Initialize decoder states.
- Parameters
batch_size – Batch size.
- Returns
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
rnn_forward
(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters
sequence – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Returns
RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type
sequence
-
score
(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters
hyp – Hypothesis.
cache – Pairs of (dec_out, state) for each label sequence. (key)
- Returns
Decoder output sequence. (1, D_dec) new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec)) label: Label ID for LM. (1,)
- Return type
dec_out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
idx – State ID to extract.
- Returns
- Decoder hidden state for given ID.
((N, 1, D_dec), (N, 1, D_dec))
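Example of the state-management helpers used during transducer beam search (a sketch with arbitrary sizes; the hypothesis bookkeeping done by the real beam search is omitted):
>>> import torch
>>> from espnet2.asr.decoder.transducer_decoder import TransducerDecoder
>>> decoder = TransducerDecoder(vocab_size=50, hidden_size=320)
>>> states = decoder.init_state(batch_size=4)     # zeroed ((N, 4, D_dec), (N, 4, D_dec))
>>> single = decoder.select_state(states, idx=0)  # state of hypothesis 0: ((N, 1, D_dec), (N, 1, D_dec))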
espnet2.asr.decoder.transformer_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.transformer_decoder.
BaseTransformerDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Base class of Transformer decoder module.
- Parameters
vocab_size – output dim
encoder_output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and the next state list for ys.
- Return type
tuple[torch.Tensor, List[Any]]
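Example usage during batched beam search (a sketch; the decoder configuration, encoder features and prefix tokens are made up, and the concrete TransformerDecoder defined further below is used as the scorer):
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> memory = torch.randn(2, 30, 256)     # (n_batch, xlen, n_feat) encoder output
>>> ys = torch.tensor([[1, 4], [1, 9]])  # (n_batch, ylen) hypothesis prefixes
>>> logp, states = decoder.batch_score(ys, [None, None], memory)
>>> # logp: (n_batch, n_vocab) scores for the next token; states: per-hypothesis caches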
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; otherwise an input tensor (batch, maxlen_out, #mels)
ys_in_lens – (batch)
- Returns
tuple containing:
- x: decoded token scores before softmax (batch, maxlen_out, token), if use_output_layer is True
- olens: (batch,)
- Return type
(tuple)
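Example of a teacher-forced forward pass (an illustrative sketch; the concrete TransformerDecoder subclass defined below is used, and all sizes are arbitrary):
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> hs_pad = torch.randn(2, 30, 256)           # encoder memory (batch, maxlen_in, feat)
>>> hlens = torch.tensor([30, 25])
>>> ys_in_pad = torch.randint(1, 50, (2, 12))  # token IDs with <sos> prepended
>>> ys_in_lens = torch.tensor([12, 9])
>>> x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
>>> # x: (batch, maxlen_out, vocab_size) pre-softmax token scores; olens: (batch,)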
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 in PyTorch < 1.2 and dtype=torch.bool in PyTorch >= 1.2
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type
y, cache
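Example of a single incremental decoding step (a sketch; subsequent_mask from the espnet transformer utilities is assumed to build the causal mask, and all sizes are arbitrary):
>>> import torch
>>> from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> memory = torch.randn(1, 30, 256)                      # encoded speech (batch, maxlen_in, feat)
>>> ys = torch.tensor([[1, 5, 7]])                        # running hypothesis (batch, ylen)
>>> tgt_mask = subsequent_mask(ys.size(-1)).unsqueeze(0)  # causal mask over the prefix
>>> y, cache = decoder.forward_one_step(ys, tgt_mask, memory, cache=None)
>>> # y scores the candidate next tokens; pass cache back in on the following step.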
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
TransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
espnet2.asr.frontend.__init__¶
espnet2.asr.frontend.abs_frontend¶
-
class
espnet2.asr.frontend.abs_frontend.
AbsFrontend
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
abstract
output_size
() → int[source]¶
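A minimal sketch of a custom frontend implementing this interface (the class name and feature size are invented for illustration; real frontends such as DefaultFrontend below do substantially more work):
>>> import torch
>>> from typing import Tuple
>>> from espnet2.asr.frontend.abs_frontend import AbsFrontend
>>> class IdentityFrontend(AbsFrontend):
...     """Hypothetical frontend that passes precomputed features through unchanged."""
...     def __init__(self, feat_dim: int = 80):
...         super().__init__()
...         self.feat_dim = feat_dim
...     def output_size(self) -> int:
...         # Feature dimension consumed by the encoder.
...         return self.feat_dim
...     def forward(self, input: torch.Tensor, input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
...         # (B, T, feat_dim) in, (B, T, feat_dim) out; lengths unchanged.
...         return input, input_lengths
>>> fe = IdentityFrontend(feat_dim=40)
>>> fe.output_size()
40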
espnet2.asr.frontend.fused¶
-
class
espnet2.asr.frontend.fused.
FusedFrontends
(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.s3prl¶
-
class
espnet2.asr.frontend.s3prl.
S3prlFrontend
(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str = None, multilayer_feature: bool = False)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Pretrained Representation frontend structure for ASR.
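Example (a heavily hedged sketch: it assumes the optional s3prl dependency is installed and that frontend_conf selects the pretrained upstream model via an "upstream" key; the key name and model name below are assumptions, not taken from this page):
>>> import torch
>>> from espnet2.asr.frontend.s3prl import S3prlFrontend
>>> # "upstream": "wav2vec2" is an assumed configuration; loading the model requires network access.
>>> frontend = S3prlFrontend(fs=16000, frontend_conf={"upstream": "wav2vec2"})
>>> wav = torch.randn(2, 16000)              # (B, T) raw waveform batch
>>> lens = torch.tensor([16000, 12000])
>>> feats, feat_lens = frontend(wav, lens)   # pretrained speech representations (B, T', D)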
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.default¶
-
class
espnet2.asr.frontend.default.
DefaultFrontend
(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Conventional frontend structure for ASR.
Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
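Example of extracting log-mel features from a raw waveform batch (an illustrative sketch with arbitrary shapes; with the default frontend_conf the enhancement steps are effectively disabled):
>>> import torch
>>> from espnet2.asr.frontend.default import DefaultFrontend
>>> frontend = DefaultFrontend(fs=16000, n_fft=512, hop_length=128, n_mels=80)
>>> wav = torch.randn(2, 16000)              # (B, T) raw single-channel waveforms
>>> lens = torch.tensor([16000, 12000])
>>> feats, feat_lens = frontend(wav, lens)   # feats: (B, n_frames, n_mels) log-mel features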
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.windowing¶
Sliding Window for raw audio input data.
-
class
espnet2.asr.frontend.windowing.
SlidingWindow
(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int = None, fs=None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Sliding Window.
Provides a sliding window over a batched continuous raw audio tensor. Optionally provides padding (currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.
Known issues: the output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded because padding is not implemented yet. No additional window function is currently applied to the input values.
Initialize.
- Parameters
win_length – Length of frame.
hop_length – Relative starting point of next frame.
channels – Number of input channels.
padding – Padding (placeholder, currently not implemented).
fs – Sampling rate (placeholder for compatibility, not used).
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply a sliding window on the input.
- Parameters
input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.
input_lengths – Input lengths within batch.
- Returns
Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.
- Return type
Tensor
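Example of framing a raw waveform batch (an illustrative sketch; the shapes follow the parameter descriptions above and the audio values are random):
>>> import torch
>>> from espnet2.asr.frontend.windowing import SlidingWindow
>>> frontend = SlidingWindow(win_length=400, hop_length=160, channels=1)
>>> wav = torch.randn(2, 16000)              # (B, T) raw single-channel audio
>>> lens = torch.tensor([16000, 12800])
>>> frames, frame_lens = frontend(wav, lens) # frames: (B, T_frames, C, win_length)
The framed output is meant to be consumed by a raw-audio pre-encoder such as Sinc convolutions, as noted above.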