espnet2.asr package¶
espnet2.asr.maskctc_model¶
-
class
espnet2.asr.maskctc_model.
MaskCTCInference
(asr_model: espnet2.asr.maskctc_model.MaskCTCModel, n_iterations: int, threshold_probability: float)[source]¶ Bases:
torch.nn.modules.module.Module
Mask-CTC-based non-autoregressive inference
Initialize Mask-CTC inference
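A minimal construction sketch, assuming asr_model is an already trained MaskCTCModel (building the full frontend/encoder/decoder stack by hand is omitted); the iteration count and threshold below are illustrative values only:

from espnet2.asr.maskctc_model import MaskCTCInference

# `asr_model` is assumed to be a trained MaskCTCModel instance
mask_ctc = MaskCTCInference(
    asr_model=asr_model,
    n_iterations=10,              # number of iterative refinement steps
    threshold_probability=0.99,   # CTC posterior threshold for keeping tokens
)
# At decode time the module is applied to the encoder output of a single
# utterance; see espnet2/bin/asr_inference_maskctc.py for the full pipeline.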
-
class
espnet2.asr.maskctc_model.
MaskCTCModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.mlm_decoder.MLMDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module] = None, ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', sym_mask: str = '<mask>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.asr.espnet_model.ESPnetASRModel
Hybrid CTC/Masked LM Encoder-Decoder model (Mask-CTC)
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood(nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, then calls nll on each batch and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples in each batch when computing nll;
you may change this to avoid OOM or to increase GPU memory usage
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood(nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.asr.ctc¶
-
class
espnet2.asr.ctc.
CTC
(odim: int, encoder_output_size: int, dropout_rate: float = 0.0, ctc_type: str = 'builtin', reduce: bool = True, ignore_nan_grad: bool = None, zero_infinity: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
CTC module.
- Parameters
odim – dimension of outputs
encoder_output_size – number of encoder projection units
dropout_rate – dropout rate (0.0 ~ 1.0)
ctc_type – builtin or gtnctc
reduce – reduce the CTC loss into a scalar
ignore_nan_grad – Same as zero_infinity (kept for backward compatibility)
zero_infinity – Whether to zero infinite losses and the associated gradients.
-
argmax
(hs_pad)[source]¶ argmax of frame activations
- Parameters
hs_pad (torch.Tensor) – 3d tensor (B, Tmax, eprojs)
- Returns
argmax applied 2d tensor (B, Tmax)
- Return type
torch.Tensor
-
forward
(hs_pad, hlens, ys_pad, ys_lens)[source]¶ Calculate CTC loss.
- Parameters
hs_pad – batch of padded hidden state sequences (B, Tmax, D)
hlens – batch of lengths of hidden state sequences (B)
ys_pad – batch of padded character id sequence tensor (B, Lmax)
ys_lens – batch of lengths of character sequence (B)
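A minimal usage sketch with dummy tensors; the sizes (odim=50, encoder projection size 256) are assumptions, and the shapes follow the parameter descriptions above:

import torch
from espnet2.asr.ctc import CTC

ctc = CTC(odim=50, encoder_output_size=256)

hs_pad = torch.randn(2, 100, 256)        # (B, Tmax, D) padded encoder outputs
hlens = torch.tensor([100, 80])          # (B,) hidden state lengths
ys_pad = torch.randint(1, 50, (2, 20))   # (B, Lmax) padded label ids
ys_lens = torch.tensor([20, 15])         # (B,) label lengths

loss = ctc(hs_pad, hlens, ys_pad, ys_lens)   # scalar CTC loss (reduce=True)
frame_ids = ctc.argmax(hs_pad)               # (B, Tmax) frame-wise argmax ids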
espnet2.asr.__init__¶
espnet2.asr.espnet_model¶
-
class
espnet2.asr.espnet_model.
ESPnetASRModel
(vocab_size: int, token_list: Union[Tuple[str, ...], List[str]], frontend: Optional[espnet2.asr.frontend.abs_frontend.AbsFrontend], specaug: Optional[espnet2.asr.specaug.abs_specaug.AbsSpecAug], normalize: Optional[espnet2.layers.abs_normalize.AbsNormalize], preencoder: Optional[espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder], encoder: espnet2.asr.encoder.abs_encoder.AbsEncoder, postencoder: Optional[espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder], decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, ctc: espnet2.asr.ctc.CTC, joint_network: Optional[torch.nn.modules.module.Module], ctc_weight: float = 0.5, interctc_weight: float = 0.0, ignore_id: int = -1, lsm_weight: float = 0.0, length_normalized_loss: bool = False, report_cer: bool = True, report_wer: bool = True, sym_space: str = '<space>', sym_blank: str = '<blank>', extract_feats_in_collect_stats: bool = True)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
CTC-attention hybrid Encoder-Decoder model
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)[source]¶ Compute negative log likelihood(nll) from transformer-decoder
To avoid OOM, this function separates the input into batches, then calls nll on each batch and combines the results. :param encoder_out: (Batch, Length, Dim) :param encoder_out_lens: (Batch,) :param ys_pad: (Batch, Length) :param ys_pad_lens: (Batch,) :param batch_size: int, number of samples in each batch when computing nll;
you may change this to avoid OOM or to increase GPU memory usage
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor, text: torch.Tensor, text_lengths: torch.Tensor, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
text – (Batch, Length)
text_lengths – (Batch,)
kwargs – “utt_id” is among the input.
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood(nll) from transformer-decoder
Normally, this function is called in batchify_nll.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
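A shape-oriented sketch of the training and inference entry points, assuming model is an already constructed ESPnetASRModel (assembling the frontend, SpecAug, normalization, encoder, decoder, and CTC modules is omitted); the waveform lengths and token ids below are dummy values:

import torch

speech = torch.randn(2, 16000)                 # (Batch, Length, ...) raw waveforms
speech_lengths = torch.tensor([16000, 12000])  # (Batch,)
text = torch.randint(1, 50, (2, 12))           # (Batch, Length) token ids
text_lengths = torch.tensor([12, 9])           # (Batch,)

# Training-style call: frontend + encoder + decoder + loss computation
loss, stats, weight = model(speech, speech_lengths, text, text_lengths)

# Inference-style call: frontend + encoder only (used by asr_inference.py)
encoder_out, encoder_out_lens = model.encode(speech, speech_lengths)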
-
espnet2.asr.postencoder.__init__¶
espnet2.asr.postencoder.hugging_face_transformers_postencoder¶
Hugging Face Transformers PostEncoder.
-
class
espnet2.asr.postencoder.hugging_face_transformers_postencoder.
HuggingFaceTransformersPostEncoder
(input_size: int, model_name_or_path: str)[source]¶ Bases:
espnet2.asr.postencoder.abs_postencoder.AbsPostEncoder
Hugging Face Transformers PostEncoder.
Initialize the module.
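A minimal usage sketch; it requires the transformers package, the model name below is only an example, and the pretrained weights are downloaded on first use:

import torch
from espnet2.asr.postencoder.hugging_face_transformers_postencoder import (
    HuggingFaceTransformersPostEncoder,
)

postencoder = HuggingFaceTransformersPostEncoder(
    input_size=256, model_name_or_path="bert-base-uncased"
)
feats = torch.randn(2, 50, 256)        # (B, T, input_size) encoder outputs
feats_lengths = torch.tensor([50, 42])
out, out_lengths = postencoder(feats, feats_lengths)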
espnet2.asr.postencoder.abs_postencoder¶
-
class
espnet2.asr.postencoder.abs_postencoder.
AbsPostEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.__init__¶
espnet2.asr.preencoder.sinc¶
Sinc convolutions for raw audio input.
-
class
espnet2.asr.preencoder.sinc.
LightweightSincConvs
(fs: Union[int, str, float] = 16000, in_channels: int = 1, out_channels: int = 256, activation_type: str = 'leakyrelu', dropout_type: str = 'dropout', windowing_type: str = 'hamming', scale_type: str = 'mel')[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Lightweight Sinc Convolutions.
Instead of using precomputed features, end-to-end speech recognition can also be done directly from raw audio using sinc convolutions, as described in “Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions” by Kürzinger et al. https://arxiv.org/abs/2010.07597
To use Sinc convolutions in your model instead of the default f-bank frontend, set this module as your pre-encoder with preencoder: sinc and use the sliding-window frontend with frontend: sliding_window in your YAML configuration file, so that the process flow is:
Frontend (SlidingWindow) -> SpecAug -> Normalization -> Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
Note that this method also performs data augmentation in the time domain (as opposed to the spectral domain in the default frontend). Use plot_sinc_filters.py to visualize the learned Sinc filters.
Initialize the module.
- Parameters
fs – Sample rate.
in_channels – Number of input channels.
out_channels – Number of output channels (for each input channel).
activation_type – Choice of activation function.
dropout_type – Choice of dropout function.
windowing_type – Choice of windowing function.
scale_type – Choice of filter-bank initialization scale.
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply Lightweight Sinc Convolutions.
The input shall be formatted as (B, T, C_in, D_in) with B as batch size, T as time dimension, C_in as channels, and D_in as feature dimension.
The output will then be (B, T, C_out*D_out) with C_out and D_out as output dimensions.
The current module structure only handles D_in=400, so that D_out=1. Remark for the multichannel case: C_out is the number of out_channels given at initialization, multiplied by C_in.
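A minimal sketch of the expected input layout, following the shapes described above (D_in=400 as produced by the sliding-window frontend); the batch and length values are dummy values:

import torch
from espnet2.asr.preencoder.sinc import LightweightSincConvs

preencoder = LightweightSincConvs(fs=16000, in_channels=1, out_channels=256)

x = torch.randn(2, 50, 1, 400)         # (B, T, C_in, D_in) with D_in=400
x_lengths = torch.tensor([50, 38])
y, y_lengths = preencoder(x, x_lengths)   # y: (B, T, C_out * D_out)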
-
gen_lsc_block
(in_channels: int, out_channels: int, depthwise_kernel_size: int = 9, depthwise_stride: int = 1, depthwise_groups=None, pointwise_groups=0, dropout_probability: float = 0.15, avgpool=False)[source]¶ Generate a convolutional block for Lightweight Sinc convolutions.
Each block consists of either a depthwise or a depthwise-separable convolution, together with dropout, a (batch-)normalization layer, and an optional average-pooling layer.
- Parameters
in_channels – Number of input channels.
out_channels – Number of output channels.
depthwise_kernel_size – Kernel size of the depthwise convolution.
depthwise_stride – Stride of the depthwise convolution.
depthwise_groups – Number of groups of the depthwise convolution.
pointwise_groups – Number of groups of the pointwise convolution.
dropout_probability – Dropout probability in the block.
avgpool – If True, an AvgPool layer is inserted.
- Returns
Neural network building block.
- Return type
torch.nn.Sequential
-
class
espnet2.asr.preencoder.sinc.
SpatialDropout
(dropout_probability: float = 0.15, shape: Union[tuple, list, None] = None)[source]¶ Bases:
torch.nn.modules.module.Module
Spatial dropout module.
Apply dropout to full channels of input tensors with shape (B, C, D)
Initialize.
- Parameters
dropout_probability – Dropout probability.
shape (tuple, list) – Shape of input tensors.
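A minimal sketch of dropping whole channels of a (B, C, D) tensor; the sizes are dummy values:

import torch
from espnet2.asr.preencoder.sinc import SpatialDropout

drop = SpatialDropout(dropout_probability=0.15)
x = torch.randn(4, 32, 100)            # (B, C, D)
y = drop(x)                            # whole channels are zeroed in training mode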
espnet2.asr.preencoder.abs_preencoder¶
-
class
espnet2.asr.preencoder.abs_preencoder.
AbsPreEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.preencoder.linear¶
Linear Projection.
-
class
espnet2.asr.preencoder.linear.
LinearProjection
(input_size: int, output_size: int)[source]¶ Bases:
espnet2.asr.preencoder.abs_preencoder.AbsPreEncoder
Linear Projection Preencoder.
Initialize the module.
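A minimal usage sketch with dummy sizes:

import torch
from espnet2.asr.preencoder.linear import LinearProjection

preencoder = LinearProjection(input_size=80, output_size=64)
x = torch.randn(2, 100, 80)            # (B, T, input_size)
x_lengths = torch.tensor([100, 75])
y, y_lengths = preencoder(x, x_lengths)   # y: (B, T, output_size)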
espnet2.asr.specaug.abs_specaug¶
-
class
espnet2.asr.specaug.abs_specaug.
AbsSpecAug
[source]¶ Bases:
torch.nn.modules.module.Module
Abstract class for the augmentation of spectrogram
The process-flow:
Frontend -> SpecAug -> Normalization -> Encoder -> Decoder
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
forward
(x: torch.Tensor, x_lengths: torch.Tensor = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.specaug.__init__¶
espnet2.asr.specaug.specaug¶
SpecAugment module.
-
class
espnet2.asr.specaug.specaug.
SpecAug
(apply_time_warp: bool = True, time_warp_window: int = 5, time_warp_mode: str = 'bicubic', apply_freq_mask: bool = True, freq_mask_width_range: Union[int, Sequence[int]] = (0, 20), num_freq_mask: int = 2, apply_time_mask: bool = True, time_mask_width_range: Union[int, Sequence[int], None] = None, time_mask_width_ratio_range: Union[float, Sequence[float], None] = None, num_time_mask: int = 2)[source]¶ Bases:
espnet2.asr.specaug.abs_specaug.AbsSpecAug
Implementation of SpecAug.
- Reference:
Daniel S. Park et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”
Warning
When using cuda mode, time_warp is not reproducible due to torch.nn.functional.interpolate.
-
forward
(x, x_lengths=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
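A minimal usage sketch on a batch of padded feature sequences; the feature dimension, mask widths, and lengths below are illustrative values:

import torch
from espnet2.asr.specaug.specaug import SpecAug

specaug = SpecAug(
    apply_time_warp=True,
    time_warp_window=5,
    apply_freq_mask=True,
    freq_mask_width_range=(0, 20),
    num_freq_mask=2,
    apply_time_mask=True,
    time_mask_width_range=(0, 40),
    num_time_mask=2,
)
x = torch.randn(2, 200, 80)            # (B, T, F), e.g. log-mel features
x_lengths = torch.tensor([200, 150])
x_aug, x_lengths = specaug(x, x_lengths)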
espnet2.asr.layers.__init__¶
espnet2.asr.layers.cgmlp¶
MLP with convolutional gating (cgMLP) definition.
References
https://openreview.net/forum?id=RA-zVvZLYIy https://arxiv.org/abs/2105.08050
-
class
espnet2.asr.layers.cgmlp.
ConvolutionalGatingMLP
(size: int, linear_units: int, kernel_size: int, dropout_rate: float, use_linear_after_conv: bool, gate_activation: str)[source]¶ Bases:
torch.nn.modules.module.Module
Convolutional Gating MLP (cgMLP).
-
forward
(x, mask)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.layers.fastformer¶
Fastformer attention definition.
- Reference:
Wu et al., “Fastformer: Additive Attention Can Be All You Need” https://arxiv.org/abs/2108.09084 https://github.com/wuch15/Fastformer
-
class
espnet2.asr.layers.fastformer.
FastSelfAttention
(size, attention_heads, dropout_rate)[source]¶ Bases:
torch.nn.modules.module.Module
Fast self-attention used in Fastformer.
espnet2.asr.encoder.branchformer_encoder¶
Branchformer encoder definition.
- Reference:
Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding,” in Proceedings of ICML, 2022.
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoder
(input_size: int, output_size: int = 256, use_attn: bool = True, attention_heads: int = 4, attention_layer_type: str = 'rel_selfattn', pos_enc_layer_type: str = 'rel_pos', rel_pos_type: str = 'latest', use_cgmlp: bool = True, cgmlp_linear_units: int = 2048, cgmlp_conv_kernel: int = 31, use_linear_after_conv: bool = False, gate_activation: str = 'identity', merge_method: str = 'concat', cgmlp_weight: Union[float, List[float]] = 0.5, attn_branch_drop_rate: Union[float, List[float]] = 0.0, num_blocks: int = 12, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', zero_triu: bool = False, padding_idx: int = -1, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Branchformer encoder module.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
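A minimal usage sketch with mostly default settings; the feature dimension, number of blocks, and lengths are illustrative values:

import torch
from espnet2.asr.encoder.branchformer_encoder import BranchformerEncoder

encoder = BranchformerEncoder(input_size=80, output_size=256, num_blocks=2)

xs_pad = torch.randn(2, 100, 80)       # (#batch, L, input_size)
ilens = torch.tensor([100, 90])
out, olens, _ = encoder(xs_pad, ilens) # out: (#batch, L', output_size); L' is reduced by conv2d subsampling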
-
-
class
espnet2.asr.encoder.branchformer_encoder.
BranchformerEncoderLayer
(size: int, attn: Optional[torch.nn.modules.module.Module], cgmlp: Optional[torch.nn.modules.module.Module], dropout_rate: float, merge_method: str, cgmlp_weight: float = 0.5, attn_branch_drop_rate: float = 0.0, stochastic_depth_rate: float = 0.0)[source]¶ Bases:
torch.nn.modules.module.Module
Branchformer encoder layer module.
- Parameters
size (int) – model dimension
attn – standard self-attention or efficient attention, optional
cgmlp – ConvolutionalGatingMLP, optional
dropout_rate (float) – dropout probability
merge_method (str) – concat, learned_ave, fixed_ave
cgmlp_weight (float) – weight of the cgmlp branch, between 0 and 1, used if merge_method is fixed_ave
attn_branch_drop_rate (float) – probability of dropping the attn branch, used if merge_method is learned_ave
stochastic_depth_rate (float) – stochastic depth probability
-
forward
(x_input, mask, cache=None)[source]¶ Compute encoded features.
- Parameters
x_input (Union[Tuple, torch.Tensor]) – Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size).
mask (torch.Tensor) – Mask tensor for the input (#batch, time).
cache (torch.Tensor) – Cache tensor of the input (#batch, time - 1, size).
- Returns
Output tensor (#batch, time, size). torch.Tensor: Mask tensor (#batch, time).
- Return type
torch.Tensor
espnet2.asr.encoder.__init__¶
espnet2.asr.encoder.longformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.longformer_encoder.
LongformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'abs_pos', selfattention_layer_type: str = 'lf_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, attention_windows: list = [100, 100, 100, 100, 100, 100], attention_dilation: list = [1, 1, 1, 1, 1, 1], attention_mode: str = 'sliding_chunks')[source]¶ Bases:
espnet2.asr.encoder.conformer_encoder.ConformerEncoder
Longformer SA Conformer encoder module.
- Parameters
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of decoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
attention_windows (list) – Layer-wise attention window sizes for longformer self-attn
attention_dilation (list) – Layer-wise attention dilation sizes for longformer self-attn
attention_mode (str) – Implementation for longformer self-attn. Default=”sliding_chunks” Choose ‘n2’, ‘tvm’ or ‘sliding_chunks’. More details in https://github.com/allenai/longformer
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
espnet2.asr.encoder.vgg_rnn_encoder¶
-
class
espnet2.asr.encoder.vgg_rnn_encoder.
VGGRNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
VGGRNNEncoder class.
- Parameters
input_size – The number of expected features in the input
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
hidden_size – The number of hidden features
output_size – The number of output features
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.abs_encoder¶
-
class
espnet2.asr.encoder.abs_encoder.
AbsEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.encoder.transformer_encoder¶
Transformer encoder definition.
-
class
espnet2.asr.encoder.transformer_encoder.
TransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Transformer encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
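A minimal usage sketch with dummy shapes; the sizes are illustrative values:

import torch
from espnet2.asr.encoder.transformer_encoder import TransformerEncoder

encoder = TransformerEncoder(input_size=80, output_size=256, num_blocks=2)
xs_pad = torch.randn(2, 100, 80)       # (B, L, D)
ilens = torch.tensor([100, 70])        # (B,)
out, olens, _ = encoder(xs_pad, ilens) # out: (B, L', output_size) after conv2d subsampling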
espnet2.asr.encoder.wav2vec2_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.wav2vec2_encoder.
FairSeqWav2Vec2Encoder
(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Wav2Vec2 encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
w2v_url – url to Wav2Vec2.0 pretrained model
w2v_dir_path – directory to download the Wav2Vec2.0 pretrained model.
normalize_before – whether to use layer_norm before the first block
finetune_last_n_layers – last n layers to be fine-tuned in Wav2Vec2.0; 0 means to fine-tune every layer if freeze_w2v=False.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward FairSeqWav2Vec2 Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.rnn_encoder¶
-
class
espnet2.asr.encoder.rnn_encoder.
RNNEncoder
(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Optional[Sequence[int]] = (2, 2, 1, 1))[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
RNNEncoder class.
- Parameters
input_size – The number of expected features in the input
output_size – The number of output features
hidden_size – The number of hidden features
bidirectional – If True, becomes a bidirectional LSTM
use_projection – Use projection layer or not
num_layers – Number of recurrent layers
dropout – dropout probability
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
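A minimal usage sketch; the feature dimension and lengths are dummy values, and the default subsample schedule (2, 2, 1, 1) shortens the output:

import torch
from espnet2.asr.encoder.rnn_encoder import RNNEncoder

encoder = RNNEncoder(input_size=80, hidden_size=320, output_size=320, num_layers=4)
xs_pad = torch.randn(2, 100, 80)       # (B, L, D)
ilens = torch.tensor([100, 60])
out, olens, states = encoder(xs_pad, ilens)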
espnet2.asr.encoder.contextual_block_conformer_encoder¶
Created on Sat Aug 21 17:27:16 2021.
@author: Keqi Deng (UCAS)
-
class
espnet2.asr.encoder.contextual_block_conformer_encoder.
ContextualBlockConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, cnn_module_kernel: int = 31, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Conformer encoder module.
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to use positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.conformer_encoder¶
Conformer encoder definition.
-
class
espnet2.asr.encoder.conformer_encoder.
ConformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: str = 'conv2d', normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 3, macaron_style: bool = False, rel_pos_type: str = 'legacy', pos_enc_layer_type: str = 'rel_pos', selfattention_layer_type: str = 'rel_selfattn', activation_type: str = 'swish', use_cnn_module: bool = True, zero_triu: bool = False, cnn_module_kernel: int = 31, padding_idx: int = -1, interctc_layer_idx: List[int] = [], interctc_use_conditioning: bool = False, stochastic_depth_rate: Union[float, List[float]] = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Conformer encoder module.
- Parameters
input_size (int) – Input dimension.
output_size (int) – Dimension of attention.
attention_heads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
num_blocks (int) – The number of decoder blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. If True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) If False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
rel_pos_type (str) – Whether to use the latest relative positional encoding or the legacy one. The legacy relative positional encoding will be deprecated in the future. More Details can be found in https://github.com/espnet/espnet/pull/2816.
encoder_pos_enc_layer_type (str) – Encoder positional encoding layer type.
encoder_attn_layer_type (str) – Encoder attention layer type.
activation_type (str) – Encoder activation function type.
macaron_style (bool) – Whether to use macaron style for positionwise layer.
use_cnn_module (bool) – Whether to use convolution module.
zero_triu (bool) – Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel (int) – Kernel size of convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, ctc: espnet2.asr.ctc.CTC = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Calculate forward propagation.
- Parameters
xs_pad (torch.Tensor) – Input tensor (#batch, L, input_size).
ilens (torch.Tensor) – Input length (#batch).
prev_states (torch.Tensor) – Not to be used now.
- Returns
Output tensor (#batch, L, output_size). torch.Tensor: Output length (#batch). torch.Tensor: Not to be used now.
- Return type
torch.Tensor
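A minimal usage sketch with mostly default settings; the feature dimension and lengths are illustrative values:

import torch
from espnet2.asr.encoder.conformer_encoder import ConformerEncoder

encoder = ConformerEncoder(input_size=80, output_size=256, num_blocks=2)
xs_pad = torch.randn(2, 100, 80)       # (#batch, L, input_size)
ilens = torch.tensor([100, 85])
out, olens, _ = encoder(xs_pad, ilens) # out: (#batch, L', output_size)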
espnet2.asr.encoder.hubert_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertEncoder
(input_size: int, hubert_url: str = './', hubert_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0, dropout_rate: float = 0.0, activation_dropout: float = 0.1, attention_dropout: float = 0.0, mask_length: int = 10, mask_prob: float = 0.75, mask_selection: str = 'static', mask_other: int = 0, apply_mask: bool = True, mask_channel_length: int = 64, mask_channel_prob: float = 0.5, mask_channel_other: int = 0, mask_channel_selection: str = 'static', layerdrop: float = 0.1, feature_grad_mult: float = 0.0)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert encoder module, used for loading pretrained weights and fine-tuning
- Parameters
input_size – input dim
hubert_url – url to Hubert pretrained model
hubert_dir_path – directory to download the Wav2Vec2.0 pretrained model.
output_size – dimension of attention
normalize_before – whether to use layer_norm before the first block
freeze_finetune_updates – steps during which all layers except the output layer are frozen before tuning the whole model (necessary to prevent overfitting).
dropout_rate – dropout rate
activation_dropout – dropout rate in activation function
attention_dropout – dropout rate in attention
- Hubert specific Args:
Please refer to: https://github.com/pytorch/fairseq/blob/master/fairseq/models/hubert/hubert.py
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert ASR Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
class
espnet2.asr.encoder.hubert_encoder.
FairseqHubertPretrainEncoder
(input_size: int = 1, output_size: int = 1024, linear_units: int = 1024, attention_heads: int = 12, num_blocks: int = 12, dropout_rate: float = 0.0, attention_dropout_rate: float = 0.0, activation_dropout_rate: float = 0.0, hubert_dict: str = './dict.txt', label_rate: int = 100, checkpoint_activations: bool = False, sample_rate: int = 16000, use_amp: bool = False, **kwargs)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
FairSeq Hubert pretrain encoder module, used only for the pretraining stage
- Parameters
input_size – input dim
output_size – dimension of attention
linear_units – dimension of feedforward layers
attention_heads – the number of heads of multi head attention
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
hubert_dict – target dictionary for Hubert pretraining
label_rate – label frame rate. -1 for sequence label
sample_rate – target sample rate.
use_amp – whether to use automatic mixed precision
normalize_before – whether to use layer_norm before the first block
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_length: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Forward Hubert Pretrain Encoder.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.encoder.contextual_block_transformer_encoder¶
Encoder definition.
-
class
espnet2.asr.encoder.contextual_block_transformer_encoder.
ContextualBlockTransformerEncoder
(input_size: int, output_size: int = 256, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.0, input_layer: Optional[str] = 'conv2d', pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.StreamPositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, padding_idx: int = -1, block_size: int = 40, hop_size: int = 16, look_ahead: int = 16, init_average: bool = True, ctx_pos_enc: bool = True)[source]¶ Bases:
espnet2.asr.encoder.abs_encoder.AbsEncoder
Contextual Block Transformer encoder module.
Details in Tsunoo et al. “Transformer ASR with contextual block processing” (https://arxiv.org/abs/1910.07204)
- Parameters
input_size – input dim
output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of encoder blocks
dropout_rate – dropout rate
attention_dropout_rate – dropout rate in attention
positional_dropout_rate – dropout rate after adding positional encoding
input_layer – input layer type
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concat attention layer’s input and output if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type – linear or conv1d
positionwise_conv_kernel_size – kernel size of positionwise conv1d layer
padding_idx – padding_idx for input_layer=embed
block_size – block size for contextual block processing
hop_size – hop size for block processing
look_ahead – look-ahead size for block processing
init_average – whether to use average as initial context (otherwise max values)
ctx_pos_enc – whether to use positional encoding to the context vectors
-
forward
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final=True, infer_mode=False) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
infer_mode – whether to be used for inference. This is used to distinguish between forward_train (train and validate) and forward_infer (decode).
- Returns
position embedded tensor and mask
-
forward_infer
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None, is_final: bool = True) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
-
forward_train
(xs_pad: torch.Tensor, ilens: torch.Tensor, prev_states: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]][source]¶ Embed positions in tensor.
- Parameters
xs_pad – input tensor (B, L, D)
ilens – input length (B)
prev_states – Not to be used now.
- Returns
position embedded tensor and mask
espnet2.asr.transducer.__init__¶
espnet2.asr.transducer.beam_search_transducer¶
Search algorithms for Transducer models.
-
class
espnet2.asr.transducer.beam_search_transducer.
BeamSearchTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: espnet2.asr_transducer.joint_network.JointNetwork, beam_size: int, lm: torch.nn.modules.module.Module = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, nbest: int = 1, token_list: Optional[List[str]] = None)[source]¶ Bases:
object
Beam search implementation for Transducer.
Initialize Transducer search module.
- Parameters
decoder – Decoder module.
joint_network – Joint network module.
beam_size – Beam size.
lm – LM class.
lm_weight – LM weight for soft fusion.
search_type – Search algorithm to use during inference.
max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
u_max – Maximum output sequence length. (ALSD)
nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
score_norm – Normalize final scores by length. (“default”)
nbest – Number of final hypotheses.
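A construction sketch, assuming model is an already trained transducer ESPnetASRModel exposing model.decoder and model.joint_network, and enc_out is the (T, D_enc) encoder output of a single utterance:

from espnet2.asr.transducer.beam_search_transducer import BeamSearchTransducer

beam_search = BeamSearchTransducer(
    decoder=model.decoder,
    joint_network=model.joint_network,
    beam_size=10,
    search_type="default",   # "tsd", "alsd", "nsc", or "maes" select the other algorithms
    nbest=1,
)
nbest_hyps = beam_search(enc_out)      # list of Hypothesis, best-scoring first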
-
align_length_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Alignment-length synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters
h – Encoder output sequences. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
default_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Beam search implementation.
Modified from https://arxiv.org/pdf/1211.3711.pdf
- Parameters
enc_out – Encoder output sequence. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
greedy_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Greedy search implementation.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
1-best hypotheses.
- Return type
hyp
-
modified_adaptive_expansion_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Modified Adaptive Expansion Search (mAES) implementation.
Based on/modified from https://ieeexplore.ieee.org/document/9250505 and NSC.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
nsc_beam_search
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ N-step constrained beam search implementation.
Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf. Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet until further modifications.
- Parameters
enc_out – Encoder output sequence. (T, D_enc)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
prefix_search
(hyps: List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis], enc_out_t: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis][source]¶ Prefix search for NSC and mAES strategies.
Based on https://arxiv.org/pdf/1211.3711.pdf
-
sort_nbest
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]]) → Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]][source]¶ Sort hypotheses by score or score given sequence length.
- Parameters
hyps – Hypothesis.
- Returns
Sorted hypothesis.
- Return type
hyps
-
time_sync_decoding
(enc_out: torch.Tensor) → List[espnet2.asr.transducer.beam_search_transducer.Hypothesis][source]¶ Time synchronous beam search implementation.
Based on https://ieeexplore.ieee.org/document/9053040
- Parameters
enc_out – Encoder output sequence. (T, D)
- Returns
N-best hypothesis.
- Return type
nbest_hyps
-
class
espnet2.asr.transducer.beam_search_transducer.
ExtendedHypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None, dec_out: List[torch.Tensor] = None, lm_scores: torch.Tensor = None)[source]¶ Bases:
espnet2.asr.transducer.beam_search_transducer.Hypothesis
Extended hypothesis definition for NSC beam search and mAES.
-
dec_out
= None¶
-
lm_scores
= None¶
-
-
class
espnet2.asr.transducer.beam_search_transducer.
Hypothesis
(score: float, yseq: List[int], dec_state: Union[Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]], torch.Tensor], lm_state: Union[Dict[str, Any], List[Any]] = None)[source]¶ Bases:
object
Default hypothesis definition for Transducer search algorithms.
-
lm_state
= None¶
-
espnet2.asr.transducer.error_calculator¶
Error Calculator module for Transducer.
-
class
espnet2.asr.transducer.error_calculator.
ErrorCalculatorTransducer
(decoder: espnet2.asr.decoder.abs_decoder.AbsDecoder, joint_network: torch.nn.modules.module.Module, token_list: List[int], sym_space: str, sym_blank: str, report_cer: bool = False, report_wer: bool = False)[source]¶ Bases:
object
Calculate CER and WER for transducer models.
- Parameters
decoder – Decoder module.
token_list – List of tokens.
sym_space – Space symbol.
sym_blank – Blank symbol.
report_cer – Whether to compute CER.
report_wer – Whether to compute WER.
Construct an ErrorCalculatorTransducer.
-
calculate_cer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level CER score.
- Parameters
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns
Average sentence-level CER score.
-
calculate_wer
(char_pred: torch.Tensor, char_target: torch.Tensor) → float[source]¶ Calculate sentence-level WER score.
- Parameters
char_pred – Prediction character sequences. (B, ?)
char_target – Target character sequences. (B, ?)
- Returns
Average sentence-level WER score
-
convert_to_char
(pred: torch.Tensor, target: torch.Tensor) → Tuple[List, List][source]¶ Convert label ID sequences to character sequences.
- Parameters
pred – Prediction label ID sequences. (B, U)
target – Target label ID sequences. (B, L)
- Returns
Prediction character sequences. (B, ?) char_target: Target character sequences. (B, ?)
- Return type
char_pred
espnet2.asr.decoder.__init__¶
espnet2.asr.decoder.abs_decoder¶
-
class
espnet2.asr.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,espnet.nets.scorer_interface.ScorerInterface
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.asr.decoder.rnn_decoder¶
-
class
espnet2.asr.decoder.rnn_decoder.
RNNDecoder
(vocab_size: int, encoder_output_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, sampling_probability: float = 0.0, dropout: float = 0.0, context_residual: bool = False, replace_sos: bool = False, num_encs: int = 1, att_conf: dict = {'aconv_chans': 10, 'aconv_filts': 100, 'adim': 320, 'aheads': 4, 'atype': 'location', 'awin': 5, 'han_conv_chans': -1, 'han_conv_filts': 100, 'han_dim': 320, 'han_heads': 4, 'han_mode': False, 'han_type': None, 'han_win': 5, 'num_att': 1, 'num_encs': 1})[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
init_state
(x)[source]¶ Get an initial state for decoding (optional).
- Parameters
x (torch.Tensor) – The encoded feature tensor
Returns: initial state
-
score
(yseq, state, x)[source]¶ Score new token (required).
- Parameters
y (torch.Tensor) – 1D torch.int64 prefix tokens.
state – Scorer state for prefix tokens
x (torch.Tensor) – The encoder feature that generates ys.
- Returns
- Tuple of
scores for next token that has a shape of (n_vocab) and next state for ys
- Return type
tuple[torch.Tensor, Any]
-
-
espnet2.asr.decoder.rnn_decoder.
build_attention_list
(eprojs: int, dunits: int, atype: str = 'location', num_att: int = 1, num_encs: int = 1, aheads: int = 4, adim: int = 320, awin: int = 5, aconv_chans: int = 10, aconv_filts: int = 100, han_mode: bool = False, han_type=None, han_heads: int = 4, han_dim: int = 320, han_conv_chans: int = -1, han_conv_filts: int = 100, han_win: int = 5)[source]¶
espnet2.asr.decoder.mlm_decoder¶
Masked LM Decoder definition.
-
class
espnet2.asr.decoder.mlm_decoder.
MLMDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed” input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens – (batch)
- Returns
tuple containing: x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
- Return type
(tuple)
-
espnet2.asr.decoder.transducer_decoder¶
(RNN-)Transducer decoder definition.
-
class
espnet2.asr.decoder.transducer_decoder.
TransducerDecoder
(vocab_size: int, rnn_type: str = 'lstm', num_layers: int = 1, hidden_size: int = 320, dropout: float = 0.0, dropout_embed: float = 0.0, embed_pad: int = 0)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
(RNN-)Transducer decoder module.
- Parameters
vocab_size – Output dimension.
rnn_type – (RNN-)Decoder layers type.
num_layers – Number of decoder layers.
hidden_size – Number of decoder units per layer.
dropout – Dropout rate for decoder layers.
dropout_embed – Dropout rate for embedding layer.
embed_pad – Embed/Blank symbol ID.
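A minimal sketch of encoding label sequences; the vocabulary size and label values are dummy values, with 0 reserved for the blank/pad symbol:

import torch
from espnet2.asr.decoder.transducer_decoder import TransducerDecoder

decoder = TransducerDecoder(vocab_size=50, hidden_size=320, num_layers=1, embed_pad=0)
labels = torch.randint(1, 50, (2, 12)) # (B, L) label id sequences
dec_out = decoder(labels)              # decoder output sequences for the joint network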
-
batch_score
(hyps: Union[List[espnet2.asr.transducer.beam_search_transducer.Hypothesis], List[espnet2.asr.transducer.beam_search_transducer.ExtendedHypothesis]], dec_states: Tuple[torch.Tensor, Optional[torch.Tensor]], cache: Dict[str, Any], use_lm: bool) → Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor][source]¶ One-step forward hypotheses.
- Parameters
hyps – Hypotheses.
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
cache – Pairs of (dec_out, dec_states) for each label sequences. (keys)
use_lm – Whether to compute label ID sequences for LM.
- Returns
Decoder output sequences. (B, D_dec) dec_states: Decoder hidden states. ((N, B, D_dec), (N, B, D_dec)) lm_labels: Label ID sequences for LM. (B,)
- Return type
dec_out
-
create_batch_states
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], new_states: List[Tuple[torch.Tensor, Optional[torch.Tensor]]], check_list: Optional[List] = None) → List[Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Create decoder hidden states.
- Parameters
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
new_states – Decoder hidden states. [N x ((1, D_dec), (1, D_dec))]
- Returns
Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type
states
-
forward
(labels: torch.Tensor) → torch.Tensor[source]¶ Encode source label sequences.
- Parameters
labels – Label ID sequences. (B, L)
- Returns
Decoder output sequences. (B, T, U, D_dec)
- Return type
dec_out
-
init_state
(batch_size: int) → Tuple[torch.Tensor, Optional[torch._VariableFunctionsClass.tensor]][source]¶ Initialize decoder states.
- Parameters
batch_size – Batch size.
- Returns
Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
-
rnn_forward
(sequence: torch.Tensor, state: Tuple[torch.Tensor, Optional[torch.Tensor]]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]][source]¶ Encode source label sequences.
- Parameters
sequence – RNN input sequences. (B, D_emb)
state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Returns
RNN output sequences. (B, D_dec) (h_next, c_next): Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
- Return type
sequence
-
score
(hyp: espnet2.asr.transducer.beam_search_transducer.Hypothesis, cache: Dict[str, Any]) → Tuple[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]], torch.Tensor][source]¶ One-step forward hypothesis.
- Parameters
hyp – Hypothesis.
cache – Pairs of (dec_out, state) for each label sequence. (key)
- Returns
Decoder output sequence. (1, D_dec) new_state: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec)) label: Label ID for LM. (1,)
- Return type
dec_out
-
select_state
(states: Tuple[torch.Tensor, Optional[torch.Tensor]], idx: int) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]¶ Get specified ID state from decoder hidden states.
- Parameters
states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec))
idx – State ID to extract.
- Returns
- Decoder hidden state for given ID.
((N, 1, D_dec), (N, 1, D_dec))
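Example of the state-management helpers used during transducer beam search (a sketch with arbitrary sizes; the hypothesis bookkeeping done by the real beam search is omitted):
>>> import torch
>>> from espnet2.asr.decoder.transducer_decoder import TransducerDecoder
>>> decoder = TransducerDecoder(vocab_size=50, hidden_size=320)
>>> states = decoder.init_state(batch_size=4)     # zeroed ((N, 4, D_dec), (N, 4, D_dec))
>>> single = decoder.select_state(states, idx=0)  # state of hypothesis 0: ((N, 1, D_dec), (N, 1, D_dec))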
espnet2.asr.decoder.transformer_decoder¶
Decoder definition.
-
class
espnet2.asr.decoder.transformer_decoder.
BaseTransformerDecoder
(vocab_size: int, encoder_output_size: int, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True)[source]¶ Bases:
espnet2.asr.decoder.abs_decoder.AbsDecoder
,espnet.nets.scorer_interface.BatchScorerInterface
Base class of Transformer decoder module.
- Parameters
vocab_size – output dim
encoder_output_size – dimension of attention
attention_heads – the number of heads of multi head attention
linear_units – the number of units of position-wise feed forward
num_blocks – the number of decoder blocks
dropout_rate – dropout rate
self_attention_dropout_rate – dropout rate for attention
input_layer – input layer type
use_output_layer – whether to use output layer
pos_enc_class – PositionalEncoding or ScaledPositionalEncoding
normalize_before – whether to use layer_norm before the first block
concat_after – whether to concatenate the attention layer’s input and output. If True, an additional linear layer is applied, i.e. x -> x + linear(concat(x, att(x))); if False, no additional linear layer is applied, i.e. x -> x + att(x).
-
batch_score
(ys: torch.Tensor, states: List[Any], xs: torch.Tensor) → Tuple[torch.Tensor, List[Any]][source]¶ Score new token batch.
- Parameters
ys (torch.Tensor) – torch.int64 prefix tokens (n_batch, ylen).
states (List[Any]) – Scorer states for prefix tokens.
xs (torch.Tensor) – The encoder feature that generates ys (n_batch, xlen, n_feat).
- Returns
- Tuple of
batchified scores for the next token with shape of (n_batch, n_vocab) and the next state list for ys.
- Return type
tuple[torch.Tensor, List[Any]]
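Example usage during batched beam search (a sketch; the decoder configuration, encoder features and prefix tokens are made up, and the concrete TransformerDecoder defined further below is used as the scorer):
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> memory = torch.randn(2, 30, 256)     # (n_batch, xlen, n_feat) encoder output
>>> ys = torch.tensor([[1, 4], [1, 9]])  # (n_batch, ylen) hypothesis prefixes
>>> logp, states = decoder.batch_score(ys, [None, None], memory)
>>> # logp: (n_batch, n_vocab) scores for the next token; states: per-hypothesis caches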
-
forward
(hs_pad: torch.Tensor, hlens: torch.Tensor, ys_in_pad: torch.Tensor, ys_in_lens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Forward decoder.
- Parameters
hs_pad – encoded memory, float32 (batch, maxlen_in, feat)
hlens – (batch)
ys_in_pad – input token ids, int64 (batch, maxlen_out) if input_layer == “embed”; otherwise an input tensor (batch, maxlen_out, #mels)
ys_in_lens – (batch)
- Returns
tuple containing:
- x: decoded token scores before softmax (batch, maxlen_out, token), if use_output_layer is True
- olens: (batch,)
- Return type
(tuple)
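Example of a teacher-forced forward pass (an illustrative sketch; the concrete TransformerDecoder subclass defined below is used, and all sizes are arbitrary):
>>> import torch
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> hs_pad = torch.randn(2, 30, 256)           # encoder memory (batch, maxlen_in, feat)
>>> hlens = torch.tensor([30, 25])
>>> ys_in_pad = torch.randint(1, 50, (2, 12))  # token IDs with <sos> prepended
>>> ys_in_lens = torch.tensor([12, 9])
>>> x, olens = decoder(hs_pad, hlens, ys_in_pad, ys_in_lens)
>>> # x: (batch, maxlen_out, vocab_size) pre-softmax token scores; olens: (batch,)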
-
forward_one_step
(tgt: torch.Tensor, tgt_mask: torch.Tensor, memory: torch.Tensor, cache: List[torch.Tensor] = None) → Tuple[torch.Tensor, List[torch.Tensor]][source]¶ Forward one step.
- Parameters
tgt – input token ids, int64 (batch, maxlen_out)
tgt_mask – input token mask, (batch, maxlen_out); dtype=torch.uint8 in PyTorch < 1.2 and dtype=torch.bool in PyTorch >= 1.2
memory – encoded memory, float32 (batch, maxlen_in, feat)
cache – cached output list of (batch, max_time_out-1, size)
- Returns
NN output value and cache per self.decoders. y.shape is (batch, maxlen_out, token)
- Return type
y, cache
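Example of a single incremental decoding step (a sketch; subsequent_mask from the espnet transformer utilities is assumed to build the causal mask, and all sizes are arbitrary):
>>> import torch
>>> from espnet.nets.pytorch_backend.transformer.mask import subsequent_mask
>>> from espnet2.asr.decoder.transformer_decoder import TransformerDecoder
>>> decoder = TransformerDecoder(vocab_size=50, encoder_output_size=256)
>>> memory = torch.randn(1, 30, 256)                      # encoded speech (batch, maxlen_in, feat)
>>> ys = torch.tensor([[1, 5, 7]])                        # running hypothesis (batch, ylen)
>>> tgt_mask = subsequent_mask(ys.size(-1)).unsqueeze(0)  # causal mask over the prefix
>>> y, cache = decoder.forward_one_step(ys, tgt_mask, memory, cache=None)
>>> # y scores the candidate next tokens; pass cache back in on the following step.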
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
DynamicConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolution2DTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
LightweightConvolutionTransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False, conv_wshare: int = 4, conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11), conv_usebias: int = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
-
class
espnet2.asr.decoder.transformer_decoder.
TransformerDecoder
(vocab_size: int, encoder_output_size: int, attention_heads: int = 4, linear_units: int = 2048, num_blocks: int = 6, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, self_attention_dropout_rate: float = 0.0, src_attention_dropout_rate: float = 0.0, input_layer: str = 'embed', use_output_layer: bool = True, pos_enc_class=<class 'espnet.nets.pytorch_backend.transformer.embedding.PositionalEncoding'>, normalize_before: bool = True, concat_after: bool = False)[source]¶ Bases:
espnet2.asr.decoder.transformer_decoder.BaseTransformerDecoder
espnet2.asr.frontend.__init__¶
espnet2.asr.frontend.abs_frontend¶
-
class
espnet2.asr.frontend.abs_frontend.
AbsFrontend
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
abstract
output_size
() → int[source]¶
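A minimal sketch of a custom frontend implementing this interface (the class name and feature size are invented for illustration; real frontends such as DefaultFrontend below do substantially more work):
>>> import torch
>>> from typing import Tuple
>>> from espnet2.asr.frontend.abs_frontend import AbsFrontend
>>> class IdentityFrontend(AbsFrontend):
...     """Hypothetical frontend that passes precomputed features through unchanged."""
...     def __init__(self, feat_dim: int = 80):
...         super().__init__()
...         self.feat_dim = feat_dim
...     def output_size(self) -> int:
...         # Feature dimension consumed by the encoder.
...         return self.feat_dim
...     def forward(self, input: torch.Tensor, input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
...         # (B, T, feat_dim) in, (B, T, feat_dim) out; lengths unchanged.
...         return input, input_lengths
>>> fe = IdentityFrontend(feat_dim=40)
>>> fe.output_size()
40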
espnet2.asr.frontend.fused¶
-
class
espnet2.asr.frontend.fused.
FusedFrontends
(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.s3prl¶
-
class
espnet2.asr.frontend.s3prl.
S3prlFrontend
(fs: Union[int, str] = 16000, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, download_dir: str = None, multilayer_feature: bool = False)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Speech Pretrained Representation frontend structure for ASR.
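Example (a heavily hedged sketch: it assumes the optional s3prl dependency is installed and that frontend_conf selects the pretrained upstream model via an "upstream" key; the key name and model name below are assumptions, not taken from this page):
>>> import torch
>>> from espnet2.asr.frontend.s3prl import S3prlFrontend
>>> # "upstream": "wav2vec2" is an assumed configuration; loading the model requires network access.
>>> frontend = S3prlFrontend(fs=16000, frontend_conf={"upstream": "wav2vec2"})
>>> wav = torch.randn(2, 16000)              # (B, T) raw waveform batch
>>> lens = torch.tensor([16000, 12000])
>>> feats, feat_lens = frontend(wav, lens)   # pretrained speech representations (B, T', D)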
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.default¶
-
class
espnet2.asr.frontend.default.
DefaultFrontend
(fs: Union[int, str] = 16000, n_fft: int = 512, win_length: int = None, hop_length: int = 128, window: Optional[str] = 'hann', center: bool = True, normalized: bool = False, onesided: bool = True, n_mels: int = 80, fmin: int = None, fmax: int = None, htk: bool = False, frontend_conf: Optional[dict] = {'badim': 320, 'bdropout_rate': 0.0, 'blayers': 3, 'bnmask': 2, 'bprojs': 320, 'btype': 'blstmp', 'bunits': 300, 'delay': 3, 'ref_channel': -1, 'taps': 5, 'use_beamformer': False, 'use_dnn_mask_for_wpe': True, 'use_wpe': False, 'wdropout_rate': 0.0, 'wlayers': 3, 'wprojs': 320, 'wtype': 'blstmp', 'wunits': 300}, apply_stft: bool = True)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Conventional frontend structure for ASR.
Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
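Example of extracting log-mel features from a raw waveform batch (an illustrative sketch with arbitrary shapes; with the default frontend_conf the enhancement steps are effectively disabled):
>>> import torch
>>> from espnet2.asr.frontend.default import DefaultFrontend
>>> frontend = DefaultFrontend(fs=16000, n_fft=512, hop_length=128, n_mels=80)
>>> wav = torch.randn(2, 16000)              # (B, T) raw single-channel waveforms
>>> lens = torch.tensor([16000, 12000])
>>> feats, feat_lens = frontend(wav, lens)   # feats: (B, n_frames, n_mels) log-mel features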
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
espnet2.asr.frontend.windowing¶
Sliding Window for raw audio input data.
-
class
espnet2.asr.frontend.windowing.
SlidingWindow
(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int = None, fs=None)[source]¶ Bases:
espnet2.asr.frontend.abs_frontend.AbsFrontend
Sliding Window.
Provides a sliding window over a batched continuous raw audio tensor. Optionally provides padding (currently not implemented). Combine this module with a pre-encoder compatible with raw audio data, for example Sinc convolutions.
Known issues: the output length is calculated incorrectly if the audio is shorter than win_length. WARNING: trailing values are discarded because padding is not implemented yet. No additional window function is currently applied to the input values.
Initialize.
- Parameters
win_length – Length of frame.
hop_length – Relative starting point of next frame.
channels – Number of input channels.
padding – Padding (placeholder, currently not implemented).
fs – Sampling rate (placeholder for compatibility, not used).
-
forward
(input: torch.Tensor, input_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Apply a sliding window on the input.
- Parameters
input – Input (B, T, C*D) or (B, T*C*D), with D=C=1.
input_lengths – Input lengths within batch.
- Returns
Output with dimensions (B, T, C, D), with D=win_length. Tensor: Output lengths within batch.
- Return type
Tensor
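Example of framing a raw waveform batch (an illustrative sketch; the shapes follow the parameter descriptions above and the audio values are random):
>>> import torch
>>> from espnet2.asr.frontend.windowing import SlidingWindow
>>> frontend = SlidingWindow(win_length=400, hop_length=160, channels=1)
>>> wav = torch.randn(2, 16000)              # (B, T) raw single-channel audio
>>> lens = torch.tensor([16000, 12800])
>>> frames, frame_lens = frontend(wav, lens) # frames: (B, T_frames, C, win_length)
The framed output is meant to be consumed by a raw-audio pre-encoder such as Sinc convolutions, as noted above.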