espnet2.enh package¶
espnet2.enh.__init__¶
espnet2.enh.espnet_enh_s2t_model¶
-
class
espnet2.enh.espnet_enh_s2t_model.
ESPnetEnhS2TModel
(enh_model: espnet2.enh.espnet_model.ESPnetEnhancementModel, s2t_model: Union[espnet2.asr.espnet_model.ESPnetASRModel, espnet2.st.espnet_model.ESPnetSTModel, espnet2.diar.espnet_model.ESPnetDiarizationModel], calc_enh_loss: bool = True, bypass_enh_prob: float = 0)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
Joint model of Enhancement and Speech to Text.
-
batchify_nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor, batch_size: int = 100)¶ Compute negative log likelihood (nll) from the transformer decoder.
To avoid OOM, this function separates the input into batches, then calls nll for each batch and combines the results.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
batch_size – int, number of samples per batch when computing nll; you may change this to avoid OOM or to increase GPU memory usage
-
collect_feats
(speech: torch.Tensor, speech_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
encode
(speech: torch.Tensor, speech_lengths: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by asr_inference.py
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
-
encode_diar
(speech: torch.Tensor, speech_lengths: torch.Tensor, num_spk: int) → Tuple[torch.Tensor, torch.Tensor][source]¶ Frontend + Encoder. Note that this method is used by diar_inference.py
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch, )
num_spk – int
-
forward
(speech: torch.Tensor, speech_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech – (Batch, Length, …)
speech_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths (see espnet2/iterators/chunk_iter_factory.py)
For Enh+ASR task – text_spk1: (Batch, Length), text_spk2: (Batch, Length), …, text_spk1_lengths: (Batch,), text_spk2_lengths: (Batch,), …
For other tasks – text: (Batch, Length), default None just to keep the argument order; text_lengths: (Batch,), default None for the same reason as speech_lengths
-
inherite_attributes
(inherite_enh_attrs: List[str] = [], inherite_s2t_attrs: List[str] = [])[source]¶
-
nll
(encoder_out: torch.Tensor, encoder_out_lens: torch.Tensor, ys_pad: torch.Tensor, ys_pad_lens: torch.Tensor) → torch.Tensor[source]¶ Compute negative log likelihood (nll) from the transformer decoder
Normally, this function is called in batchify_nll.
- Parameters
encoder_out – (Batch, Length, Dim)
encoder_out_lens – (Batch,)
ys_pad – (Batch, Length)
ys_pad_lens – (Batch,)
-
espnet2.enh.espnet_model¶
Enhancement model module.
-
class
espnet2.enh.espnet_model.
ESPnetEnhancementModel
(encoder: espnet2.enh.encoder.abs_encoder.AbsEncoder, separator: espnet2.enh.separator.abs_separator.AbsSeparator, decoder: espnet2.enh.decoder.abs_decoder.AbsDecoder, mask_module: Optional[espnet2.diar.layers.abs_mask.AbsMask], loss_wrappers: List[espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper], stft_consistency: bool = False, loss_type: str = 'mask_mse', mask_type: Optional[str] = None)[source]¶ Bases:
espnet2.train.abs_espnet_model.AbsESPnetModel
Speech enhancement or separation frontend model.
-
collect_feats
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor, **kwargs) → Dict[str, torch.Tensor][source]¶
-
forward
(speech_mix: torch.Tensor, speech_mix_lengths: torch.Tensor = None, **kwargs) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶ Frontend + Encoder + Decoder + Calc loss
- Parameters
speech_mix – (Batch, samples) or (Batch, samples, channels)
speech_ref – (Batch, num_speaker, samples) or (Batch, num_speaker, samples, channels)
speech_mix_lengths – (Batch,), default None for the chunk iterator, because the chunk iterator does not return speech_lengths (see espnet2/iterators/chunk_iter_factory.py)
kwargs – “utt_id” is among the input.
-
forward_enhance
(speech_mix: torch.Tensor, speech_lengths: torch.Tensor, additional: Optional[Dict] = None) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶
-
forward_loss
(speech_pre: torch.Tensor, speech_lengths: torch.Tensor, feature_mix: torch.Tensor, feature_pre: torch.Tensor, others: OrderedDict, speech_ref: torch.Tensor, noise_ref: torch.Tensor = None, dereverb_speech_ref: torch.Tensor = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor][source]¶
-
static
sort_by_perm
(nn_output, perm)[source]¶ Sort the input list of tensors by the specified permutation.
- Parameters
nn_output – List[torch.Tensor(Batch, …)], len(nn_output) == num_spk
perm – (Batch, num_spk) or List[torch.Tensor(num_spk)]
- Returns
List[torch.Tensor(Batch, …)]
- Return type
nn_output_new
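As a rough illustration of the documented behavior (not the actual implementation), reordering per-speaker outputs by a (Batch, num_spk) permutation can be sketched as:

import torch

nn_output = [torch.zeros(2, 3), torch.ones(2, 3)]      # num_spk = 2, each of shape (Batch, ...)
perm = torch.tensor([[0, 1], [1, 0]])                  # (Batch, num_spk)
stacked = torch.stack(nn_output, dim=1)                # (Batch, num_spk, ...)
nn_output_new = [stacked[torch.arange(2), perm[:, s]] for s in range(2)]
# nn_output_new[s][b] == nn_output[perm[b, s]][b]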
-
espnet2.enh.abs_enh¶
-
class
espnet2.enh.abs_enh.
AbsEnhancement
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, collections.OrderedDict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.__init__¶
espnet2.enh.layers.beamformer¶
Beamformer module.
-
espnet2.enh.layers.beamformer.
apply_beamforming_vector
(beamform_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶
-
espnet2.enh.layers.beamformer.
blind_analytic_normalization
(ws, psd_noise, eps=1e-08)[source]¶ Blind analytic normalization (BAN) for post-filtering
- Parameters
ws (torch.complex64/ComplexTensor) – beamformer vector (…, F, C)
psd_noise (torch.complex64/ComplexTensor) – noise PSD matrix (…, F, C, C)
eps (float) –
- Returns
normalized beamformer vector (…, F)
- Return type
ws_ban (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
generalized_eigenvalue_decomposition
(a: torch.Tensor, b: torch.Tensor, eps=1e-06)[source]¶ Solves the generalized eigenvalue decomposition through Cholesky decomposition.
ported from https://github.com/asteroid-team/asteroid/blob/master/asteroid/dsp/beamforming.py#L464
a @ e_vec = e_val * b @ e_vec
Cholesky decomposition on b: b = L @ L^H, where L is a lower triangular matrix.
Let C = L^-1 @ a @ L^-H; it is Hermitian.
=> C @ y = lambda * y
=> e_vec = L^-H @ y
Reference: https://www.netlib.org/lapack/lug/node54.html
- Parameters
a – A complex Hermitian or real symmetric matrix whose eigenvalues and eigenvectors will be computed. (…, C, C)
b – A complex Hermitian or real symmetric definite positive matrix. (…, C, C)
- Returns
generalized eigenvalues (ascending order) e_vec: generalized eigenvectors
- Return type
e_val
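For reference, the derivation above can be reproduced with native torch linear algebra; the helper below is a hypothetical sketch, not the library function itself:

import torch

def gevd_via_cholesky_sketch(a, b, eps=1e-6):
    """Solve a @ e_vec = e_val * b @ e_vec via the Cholesky reduction outlined above."""
    C = b.shape[-1]
    b = b + eps * torch.eye(C, dtype=b.dtype, device=b.device)
    L = torch.linalg.cholesky(b)                        # b = L @ L^H
    L_inv = torch.linalg.inv(L)
    cmat = L_inv @ a @ L_inv.conj().transpose(-1, -2)   # C = L^-1 @ a @ L^-H (Hermitian)
    e_val, y = torch.linalg.eigh(cmat)                  # C @ y = lambda * y
    e_vec = L_inv.conj().transpose(-1, -2) @ y          # e_vec = L^-H @ y
    return e_val, e_vec                                 # eigenvalues in ascending order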
-
espnet2.enh.layers.beamformer.
get_WPD_filter
(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:
h = (Rf^-1 @ Phi_{xx}) / tr[(Rf^-1) @ Phi_{xx}] @ u
- Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
- Parameters
Phi (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the PSD of zero-padded speech [x^T(t,f) 0 … 0]^T.
Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, (btaps+1) * C) is the reference_vector.
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(B, F, (btaps + 1) * C)
- Return type
filter_matrix (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_WPD_filter_v2
(Phi: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Rf: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: torch.Tensor, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector (v2).
- This implementation is more efficient than get_WPD_filter as it skips unnecessary computation with zeros.
- Parameters
Phi (torch.complex64/ComplexTensor) – (B, F, C, C) is speech PSD.
Rf (torch.complex64/ComplexTensor) – (B, F, (btaps+1) * C, (btaps+1) * C) is the power normalized spatio-temporal covariance matrix.
reference_vector (torch.Tensor) – (B, C) is the reference_vector.
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(B, F, (btaps+1) * C)
- Return type
filter_matrix (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_WPD_filter_with_rtf
(psd_observed_bar: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, normalize_ref_channel: Optional[int] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-15) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the WPD vector calculated with RTF.
WPD is the Weighted Power minimization Distortionless response convolutional beamformer. As follows:
h = (Rf^-1 @ vbar) / (vbar^H @ R^-1 @ vbar)
- Reference:
T. Nakatani and K. Kinoshita, “A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation,” in IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019, doi: 10.1109/LSP.2019.2911179. https://ieeexplore.ieee.org/document/8691481
- Parameters
psd_observed_bar (torch.complex64/ComplexTensor) – stacked observation covariance matrix
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
normalize_ref_channel (int) – reference channel for normalizing the RTF
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_covariances
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, bdelay: int, btaps: int, get_vector: bool = False) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Calculates the power normalized spatio-temporal covariance matrix of the framed signal.
- Parameters
Y – Complex STFT signal with shape (B, F, C, T)
inverse_power – Weighting factor with shape (B, F, T)
- Returns
(B, F, (btaps+1) * C, (btaps+1) * C) Correlation vector: (B, F, btaps + 1, C, C)
- Return type
Correlation matrix
-
espnet2.enh.layers.beamformer.
get_gev_vector
(psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the generalized eigenvalue (GEV) beamformer vector:
psd_speech @ h = lambda * psd_noise @ h
- Reference:
Blind acoustic beamforming based on generalized eigenvalue decomposition; E. Warsitz and R. Haeb-Umbach, 2007.
- Parameters
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition (only for torch builtin complex tensors)
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_lcmv_vector_with_rtf
(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], rtf_mat: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], reference_vector: Union[int, torch.Tensor, None] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the LCMV (Linearly Constrained Minimum Variance) vector calculated with RTF:
h = (Npsd^-1 @ rtf_mat) @ (rtf_mat^H @ Npsd^-1 @ rtf_mat)^-1 @ p
- Reference:
H. L. Van Trees, “Optimum array processing: Part IV of detection, estimation, and modulation theory,” John Wiley & Sons, 2004. (Chapter 6.7)
- Parameters
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (…, F, C, num_spk)
reference_vector (torch.Tensor or int) – (…, num_spk) or scalar
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_mvdr_vector
(psd_s, psd_n, reference_vector: torch.Tensor, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector:
h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
- Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
- Parameters
psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor) – (…, C)
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
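As a rough orientation, the Souden MVDR formula above can be sketched with native complex tensors as follows (hypothetical helper, shapes as in the parameter list; not the library code):

import torch

def mvdr_souden_sketch(psd_s, psd_n, reference_vector, diag_eps=1e-7, eps=1e-8):
    """h = (Npsd^-1 @ Spsd) / Tr(Npsd^-1 @ Spsd) @ u   for PSD matrices of shape (..., F, C, C)."""
    C = psd_n.shape[-1]
    eye = torch.eye(C, dtype=psd_n.dtype, device=psd_n.device)
    psd_n = psd_n + diag_eps * eye                        # diagonal loading
    numerator = torch.linalg.solve(psd_n, psd_s)          # Npsd^-1 @ Spsd, (..., F, C, C)
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)  # (..., F)
    ws = numerator / (trace[..., None, None] + eps)       # (..., F, C, C)
    u = reference_vector.to(ws.dtype)                     # reference vector (..., C)
    return torch.einsum("...fec,...c->...fe", ws, u)      # beamforming vector (..., F, C)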
-
espnet2.enh.layers.beamformer.
get_mvdr_vector_with_rtf
(psd_n: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_speech: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], psd_noise: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], iterations: int = 3, reference_vector: Union[int, torch.Tensor, None] = None, normalize_ref_channel: Optional[int] = None, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Return the MVDR (Minimum Variance Distortionless Response) vector calculated with RTF:
h = (Npsd^-1 @ rtf) / (rtf^H @ Npsd^-1 @ rtf)
- Reference:
On optimal frequency-domain multichannel linear filtering for noise reduction; M. Souden et al., 2010; https://ieeexplore.ieee.org/document/5089420
- Parameters
psd_n (torch.complex64/ComplexTensor) – observation/noise covariance matrix (…, F, C, C)
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
iterations (int) – number of iterations in power method
reference_vector (torch.Tensor or int) – (…, C) or scalar
normalize_ref_channel (int) – reference channel for normalizing the RTF
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_mwf_vector
(psd_s, psd_n, reference_vector: Union[torch.Tensor, int], use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the MWF (Multi-channel Wiener Filter) vector:
h = (Npsd^-1 @ Spsd) @ u
- Parameters
psd_s (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_n (torch.complex64/ComplexTensor) – power-normalized observation covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_power_spectral_density_matrix
(xs, mask, normalization=True, reduction='mean', eps: float = 1e-15)[source]¶ Return cross-channel power spectral density (PSD) matrix
- Parameters
xs (torch.complex64/ComplexTensor) – (…, F, C, T)
reduction (str) – “mean” or “median”
mask (torch.Tensor) – (…, F, C, T)
normalization (bool) –
eps (float) –
- Returns
psd (torch.complex64/ComplexTensor): (…, F, C, C)
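The mask-weighted PSD this computes can be sketched as below (mean-mask variant only, shapes as documented above; an illustration rather than the exact implementation):

import torch

def masked_psd_sketch(xs, mask, normalization=True, eps=1e-15):
    """psd(f) = sum_t m(t, f) x(t, f) x(t, f)^H, optionally divided by sum_t m(t, f)."""
    # xs: (..., F, C, T) complex, mask: (..., F, C, T) real
    mask = mask.mean(dim=-2)                                    # average the mask over channels -> (..., F, T)
    psd = torch.einsum("...ct,...et->...ce", xs * mask[..., None, :], xs.conj())
    if normalization:
        psd = psd / (mask.sum(dim=-1)[..., None, None] + eps)   # normalize by the total mask weight
    return psd                                                  # (..., F, C, C)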
-
espnet2.enh.layers.beamformer.
get_rank1_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the R1-MWF (Rank-1 Multi-channel Wiener Filter) vector
h = (Npsd^-1 @ Spsd) / (mu + Tr(Npsd^-1 @ Spsd)) @ u
- Reference:
[1] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document [2] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
- Parameters
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. When denoising_weight = 0, it corresponds to MVDR beamformer.
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [1]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
get_rtf
(psd_speech, psd_noise, mode='power', reference_vector: Union[int, torch.Tensor] = 0, iterations: int = 3, use_torch_solver: bool = True)[source]¶ Calculate the relative transfer function (RTF)
- Algorithm of the power method:
1) rtf = reference_vector
2) for i in range(iterations):
     rtf = (psd_noise^-1 @ psd_speech) @ rtf
     rtf = rtf / ||rtf||_2  # this normalization can be skipped
3) rtf = psd_noise @ rtf
4) rtf = rtf / rtf[…, ref_channel, :]
Note: the normalization at the reference channel in step 4) is not performed here.
- Parameters
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
mode (str) – one of (“power”, “evd”) “power”: power method “evd”: eigenvalue decomposition
reference_vector (torch.Tensor or int) – (…, C) or scalar
iterations (int) – number of iterations in power method
use_torch_solver (bool) – Whether to use solve instead of inverse
- Returns
(…, F, C, 1)
- Return type
rtf (torch.complex64/ComplexTensor)
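The power-method steps listed above translate almost line by line into torch; the helper below is a hypothetical sketch (steps 1-3 only, without the reference-channel normalization), not the library implementation:

import torch

def rtf_power_method_sketch(psd_speech, psd_noise, reference_channel=0, iterations=3):
    rtf = torch.zeros(psd_noise.shape[:-1] + (1,), dtype=psd_noise.dtype, device=psd_noise.device)
    rtf[..., reference_channel, 0] = 1.0                 # 1) start from a one-hot reference vector
    phi = torch.linalg.solve(psd_noise, psd_speech)      # psd_noise^-1 @ psd_speech
    for _ in range(iterations):                          # 2) power iterations
        rtf = phi @ rtf
        rtf = rtf / (rtf.norm(dim=-2, keepdim=True) + 1e-15)
    return psd_noise @ rtf                               # 3) RTF estimate of shape (..., F, C, 1)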
-
espnet2.enh.layers.beamformer.
get_rtf_matrix
(psd_speeches, psd_noises, diagonal_loading: bool = True, ref_channel: int = 0, rtf_iterations: int = 3, use_torch_solver: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Calculate the RTF matrix with each column the relative transfer function of the corresponding source.
-
espnet2.enh.layers.beamformer.
get_sdw_mwf_vector
(psd_speech, psd_noise, reference_vector: Union[torch.Tensor, int], denoising_weight: float = 1.0, approx_low_rank_psd_speech: bool = False, iterations: int = 3, use_torch_solver: bool = True, diagonal_loading: bool = True, diag_eps: float = 1e-07, eps: float = 1e-08)[source]¶ Return the SDW-MWF (Speech Distortion Weighted Multi-channel Wiener Filter) vector
h = (Spsd + mu * Npsd)^-1 @ Spsd @ u
- Reference:
[1] Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction; A. Spriet et al, 2004 https://dl.acm.org/doi/abs/10.1016/j.sigpro.2004.07.028 [2] Rank-1 constrained multichannel Wiener filter for speech recognition in noisy environments; Z. Wang et al, 2018 https://hal.inria.fr/hal-01634449/document [3] Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants; R. Serizel, 2014 https://ieeexplore.ieee.org/document/6730918
- Parameters
psd_speech (torch.complex64/ComplexTensor) – speech covariance matrix (…, F, C, C)
psd_noise (torch.complex64/ComplexTensor) – noise covariance matrix (…, F, C, C)
reference_vector (torch.Tensor or int) – (…, C) or scalar
denoising_weight (float) – a trade-off parameter between noise reduction and speech distortion. A larger value leads to more noise reduction at the expense of more speech distortion. The plain MWF is obtained with denoising_weight = 1 (by default).
approx_low_rank_psd_speech (bool) – whether to replace original input psd_speech with its low-rank approximation as in [2]
iterations (int) – number of iterations in power method, only used when approx_low_rank_psd_speech = True
use_torch_solver (bool) – Whether to use solve instead of inverse
diagonal_loading (bool) – Whether to add a tiny term to the diagonal of psd_n
diag_eps (float) –
eps (float) –
- Returns
(…, F, C)
- Return type
beamform_vector (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
gev_phase_correction
(vector)[source]¶ Phase correction to reduce distortions due to phase inconsistencies.
ported from https://github.com/fgnt/nn-gev/blob/master/fgnt/beamforming.py#L169
- Parameters
vector – Beamforming vector with shape (…, F, C)
- Returns
Phase corrected beamforming vectors
- Return type
w
-
espnet2.enh.layers.beamformer.
perform_WPD_filtering
(filter_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], bdelay: int, btaps: int) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Perform WPD filtering.
- Parameters
filter_matrix – Filter matrix (B, F, (btaps + 1) * C)
Y – Complex STFT signal with shape (B, F, C, T)
- Returns
(B, F, T)
- Return type
enhanced (torch.complex64/ComplexTensor)
-
espnet2.enh.layers.beamformer.
prepare_beamformer_stats
(signal, masks_speech, mask_noise, powers=None, beamformer_type='mvdr', bdelay=3, btaps=5, eps=1e-06)[source]¶ Prepare necessary statistics for constructing the specified beamformer.
- Parameters
signal (torch.complex64/ComplexTensor) – (…, F, C, T)
masks_speech (List[torch.Tensor]) – (…, F, C, T) masks for all speech sources
mask_noise (torch.Tensor) – (…, F, C, T) noise mask
powers (List[torch.Tensor]) – powers for all speech sources (…, F, T) used for wMPDR or WPD beamformers
beamformer_type (str) – one of the pre-defined beamformer types
bdelay (int) – delay factor, used for the WPD beamformer
btaps (int) – number of filter taps, used for the WPD beamformer
eps (torch.Tensor) – tiny constant
- Returns
- a dictionary containing all necessary statistics,
e.g. “psd_n”, “psd_speech”, “psd_distortion”
Note:
* When masks_speech is a tensor or a single-element list, all returned statistics are tensors;
* When masks_speech is a multi-element list, some returned statistics can be a list, e.g., “psd_n” for MVDR, “psd_speech” and “psd_distortion”.
- Return type
beamformer_stats (dict)
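Putting several helpers from this module together, a rough single-speaker MVDR usage sketch might look as follows (random tensors stand in for real STFTs and DNN masks; keys and shapes are taken from the documentation above):

import torch
from espnet2.enh.layers.beamformer import (
    apply_beamforming_vector,
    get_mvdr_vector,
    prepare_beamformer_stats,
)

B, F, C, T = 2, 257, 4, 100
spec = torch.randn(B, F, C, T, dtype=torch.complex64)   # mixture STFT (..., F, C, T)
mask_speech = torch.rand(B, F, C, T)                    # speech mask (from a DNN in practice)
mask_noise = torch.rand(B, F, C, T)                     # noise mask

stats = prepare_beamformer_stats(spec, [mask_speech], mask_noise)
u = torch.eye(C, dtype=spec.dtype)[0].expand(B, F, C)   # one-hot reference vector (..., C)
ws = get_mvdr_vector(stats["psd_speech"], stats["psd_n"], u)
enhanced = apply_beamforming_vector(ws, spec)           # (B, F, T)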
-
espnet2.enh.layers.beamformer.
signal_framing
(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, bdelay: int, do_padding: bool = False, pad_value: int = 0, indices: List = None) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Expand signal into several frames, with each frame of length frame_length.
- Parameters
signal – (…, T)
frame_length – length of each segment
frame_step – step for selecting frames
bdelay – delay for WPD
do_padding – whether or not to pad the input signal at the beginning of the time dimension
pad_value – value to fill in the padding
- Returns
if do_padding: (…, T, frame_length) else: (…, T - bdelay - frame_length + 2, frame_length)
- Return type
torch.Tensor
-
espnet2.enh.layers.beamformer.
tik_reg
(mat, reg: float = 1e-08, eps: float = 1e-08)[source]¶ Perform Tikhonov regularization (only modifying real part).
- Parameters
mat (torch.complex64/ComplexTensor) – input matrix (…, C, C)
reg (float) – regularization factor
eps (float) –
- Returns
regularized matrix (…, C, C)
- Return type
ret (torch.complex64/ComplexTensor)
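Conceptually, this kind of regularization adds a small real-valued, trace-scaled identity to the diagonal; a hypothetical sketch (an approximation of the documented behavior, not the exact code):

import torch

def tik_reg_sketch(mat, reg=1e-8, eps=1e-8):
    """mat + (reg * tr(mat) + eps) * I, so that only the real part of the diagonal changes."""
    C = mat.size(-1)
    eye = torch.eye(C, dtype=mat.dtype, device=mat.device)
    epsilon = mat.diagonal(dim1=-2, dim2=-1).sum(-1).real[..., None, None] * reg
    return mat + (epsilon + eps) * eye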
espnet2.enh.layers.dnn_wpe¶
-
class
espnet2.enh.layers.dnn_wpe.
DNN_WPE
(wtype: str = 'blstmp', widim: int = 257, wlayers: int = 3, wunits: int = 300, wprojs: int = 320, dropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask: bool = True, nmask: int = 1, nonlinear: str = 'sigmoid', iterations: int = 1, normalization: bool = False, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]¶ DNN_WPE forward function.
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq or Some dimension of the feature vector
- Parameters
data – (B, T, C, F)
ilens – (B,)
- Returns
(B, T, C, F) ilens: (B,) masks (torch.Tensor or List[torch.Tensor]): (B, T, C, F) power (List[torch.Tensor]): (B, F, T)
- Return type
enhanced (torch.Tensor or List[torch.Tensor])
-
predict_mask
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Predict mask for WPE dereverberation.
- Parameters
data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision
ilens (torch.Tensor) – (B,)
- Returns
(B, T, C, F) ilens (torch.Tensor): (B,)
- Return type
masks (torch.Tensor or List[torch.Tensor])
-
espnet2.enh.layers.dprnn¶
-
class
espnet2.enh.layers.dprnn.
DPRNN
(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]¶ Bases:
torch.nn.modules.module.Module
Deep dual-path RNN.
- Parameters
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_layers – int, number of stacked RNN layers. Default is 1.
bidirectional – bool, whether the RNN layers are bidirectional. Default is True.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.dprnn.
DPRNN_TAC
(rnn_type, input_size, hidden_size, output_size, dropout=0, num_layers=1, bidirectional=True)[source]¶ Bases:
torch.nn.modules.module.Module
Deep dual-path RNN with TAC applied to each layer/block.
- Parameters
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_layers – int, number of stacked RNN layers. Default is 1.
bidirectional – bool, whether the RNN layers are bidirectional. Default is True.
-
forward
(input, num_mic)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.dprnn.
SingleRNN
(rnn_type, input_size, hidden_size, dropout=0, bidirectional=False)[source]¶ Bases:
torch.nn.modules.module.Module
Container module for a single RNN layer.
- Parameters
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the RNN layers are bidirectional. Default is False.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.skim¶
-
class
espnet2.enh.layers.skim.
MemLSTM
(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cLN')[source]¶ Bases:
torch.nn.modules.module.Module
the Mem-LSTM of SkiM
- Parameters
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.
mem_type – ‘hc’, ‘h’, ‘c’ or ‘id’. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.
norm_type – gLN, cLN. cLN is for causal implementation.
-
extra_repr
() → str[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(hc, S)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.skim.
SegLSTM
(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]¶ Bases:
torch.nn.modules.module.Module
the Seg-LSTM of SkiM
- Parameters
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
bidirectional – bool, whether the LSTM layers are bidirectional. Default is False.
norm_type – gLN, cLN. cLN is for causal implementation.
-
forward
(input, hc)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.skim.
SkiM
(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='hc', norm_type='gLN', seg_overlap=False)[source]¶ Bases:
torch.nn.modules.module.Module
Skipping Memory Net
- Parameters
input_size – int, dimension of the input feature. Input shape should be (batch, length, input_size)
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
dropout – float, dropout ratio. Default is 0.
num_blocks – number of basic SkiM blocks
segment_size – segmentation size for splitting long features
bidirectional – bool, whether the RNN layers are bidirectional.
mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
norm_type – gLN, cLN. cLN is for causal implementation.
seg_overlap – bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
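A small usage sketch of SkiM under the documented (batch, length, input_size) convention; the layer sizes here are arbitrary placeholders:

import torch
from espnet2.enh.layers.skim import SkiM

model = SkiM(input_size=64, hidden_size=128, output_size=64,
             num_blocks=2, segment_size=20, mem_type="hc")
x = torch.randn(4, 240, 64)          # (batch, length, input_size); length chosen as a multiple of segment_size
y = model(x)                         # expected shape (batch, length, output_size)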
espnet2.enh.layers.complexnn¶
-
class
espnet2.enh.layers.complexnn.
ComplexBatchNorm
(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, complex_axis=1)[source]¶ Bases:
torch.nn.modules.module.Module
-
extra_repr
()[source]¶ Set the extra representation of the module
To print customized extra information, you should re-implement this method in your own modules. Both single-line and multi-line strings are acceptable.
-
forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.layers.complexnn.
ComplexConv2d
(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), dilation=1, groups=1, causal=True, complex_axis=1)[source]¶ Bases:
torch.nn.modules.module.Module
ComplexConv2d.
in_channels: real+imag
out_channels: real+imag
kernel_size: input [B,C,D,T], kernel size in [D,T]
padding: input [B,C,D,T], padding in [D,T]
causal: if causal, pad only the left side of the time dimension; otherwise pad both sides
-
forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.layers.complexnn.
ComplexConvTranspose2d
(in_channels, out_channels, kernel_size=(1, 1), stride=(1, 1), padding=(0, 0), output_padding=(0, 0), causal=False, complex_axis=1, groups=1)[source]¶ Bases:
torch.nn.modules.module.Module
ComplexConvTranspose2d.
in_channels: real+imag out_channels: real+imag
-
forward
(inputs)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.conv_utils¶
-
espnet2.enh.layers.conv_utils.
conv2d_output_shape
(h_w, kernel_size=1, stride=1, pad=0, dilation=1)[source]¶
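This presumably follows the standard 2D convolution output-size arithmetic; a self-contained sketch of that formula (hypothetical helper with the same argument order) is:

import math

def conv2d_output_shape_sketch(h_w, kernel_size=1, stride=1, pad=0, dilation=1):
    """Per axis: floor((L + 2*pad - dilation*(kernel - 1) - 1) / stride + 1)."""
    def pair(x):
        return x if isinstance(x, (tuple, list)) else (x, x)
    h_w, k, s, p, d = map(pair, (h_w, kernel_size, stride, pad, dilation))
    return tuple(
        math.floor((h_w[i] + 2 * p[i] - d[i] * (k[i] - 1) - 1) / s[i] + 1) for i in (0, 1)
    )

# e.g. a 257 x 100 input through a (3, 3) kernel with stride (1, 2) and padding (1, 0) -> (257, 49)
print(conv2d_output_shape_sketch((257, 100), (3, 3), (1, 2), (1, 0)))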
espnet2.enh.layers.fasnet¶
-
class
espnet2.enh.layers.fasnet.
BF_module
(input_dim, feature_dim, hidden_dim, output_dim, num_spk=2, layer=4, segment_size=100, bidirectional=True, dropout=0.0, fasnet_type='ifasnet')[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(input, num_mic)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.layers.fasnet.
FaSNet_base
(enc_dim, feature_dim, hidden_dim, layer, segment_size=24, nspk=2, win_len=16, context_len=16, dropout=0.0, sr=16000)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(input, num_mic)[source]¶ Abstract forward function.
input: shape (batch, max_num_ch, T)
num_mic: shape (batch, ), the number of channels for each input. Zero for fixed geometry configuration.
-
seg_signal_context
(x, window, context)[source]¶ Segmenting the signal into chunks with specific context.
- input:
x: size (B, ch, T)
window: int
context: int
-
espnet2.enh.layers.tcn¶
-
class
espnet2.enh.layers.tcn.
ChannelwiseLayerNorm
(channel_size, shape='BDT')[source]¶ Bases:
torch.nn.modules.module.Module
Channel-wise Layer Normalization (cLN).
-
class
espnet2.enh.layers.tcn.
Chomp1d
(chomp_size)[source]¶ Bases:
torch.nn.modules.module.Module
To ensure the output length is the same as the input.
-
class
espnet2.enh.layers.tcn.
DepthwiseSeparableConv
(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]¶ Bases:
torch.nn.modules.module.Module
-
class
espnet2.enh.layers.tcn.
GlobalLayerNorm
(channel_size, shape='BDT')[source]¶ Bases:
torch.nn.modules.module.Module
Global Layer Normalization (gLN).
-
class
espnet2.enh.layers.tcn.
TemporalBlock
(in_channels, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]¶ Bases:
torch.nn.modules.module.Module
-
class
espnet2.enh.layers.tcn.
TemporalConvNet
(N, B, H, P, X, R, C, norm_type='gLN', causal=False, mask_nonlinear='relu')[source]¶ Bases:
torch.nn.modules.module.Module
Basic module of TasNet.
- Parameters
N – Number of filters in autoencoder
B – Number of channels in bottleneck 1 * 1-conv block
H – Number of channels in convolutional blocks
P – Kernel size in convolutional blocks
X – Number of convolutional blocks in each repeat
R – Number of repeats
C – Number of speakers
norm_type – BN, gLN, cLN
causal – causal or non-causal
mask_nonlinear – which non-linear function to use to generate the mask
espnet2.enh.layers.complex_utils¶
Complex tensor utilities.
-
espnet2.enh.layers.complex_utils.
cat
(seq: Sequence[Union[torch_complex.tensor.ComplexTensor, torch.Tensor]], *args, **kwargs)[source]¶
-
espnet2.enh.layers.complex_utils.
complex_norm
(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=-1, keepdim=False) → torch.Tensor[source]¶
-
espnet2.enh.layers.complex_utils.
inverse
(c: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶
-
espnet2.enh.layers.complex_utils.
matmul
(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor]) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶
-
espnet2.enh.layers.complex_utils.
new_complex_like
(ref: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], real_imag: Tuple[torch.Tensor, torch.Tensor])[source]¶
-
espnet2.enh.layers.complex_utils.
reverse
(a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], dim=0)[source]¶
-
espnet2.enh.layers.complex_utils.
solve
(b: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], a: Union[torch.Tensor, torch_complex.tensor.ComplexTensor])[source]¶ Solve the linear equation ax = b.
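Note the argument order: solve(b, a) solves a x = b. For native complex tensors this is expected to behave like torch.linalg.solve, e.g.:

import torch

a = torch.randn(4, 3, 3, dtype=torch.complex64)
b = torch.randn(4, 3, 2, dtype=torch.complex64)
x = torch.linalg.solve(a, b)                # what solve(b, a) is expected to return for native tensors
assert torch.allclose(a @ x, b, atol=1e-4)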
espnet2.enh.layers.tcndenseunet¶
-
class
espnet2.enh.layers.tcndenseunet.
Conv2DActNorm
(in_channels, out_channels, ksz=(3, 3), stride=(1, 2), padding=(1, 0), upsample=False, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
Basic Conv2D + activation + instance norm building block.
-
forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.layers.tcndenseunet.
DenseBlock
(in_channels, out_channels, num_freqs, pre_blocks=2, freq_proc_blocks=1, post_blocks=2, ksz=(3, 3), activation=<class 'torch.nn.modules.activation.ELU'>, hid_chans=32)[source]¶ Bases:
torch.nn.modules.module.Module
single DenseNet block as used in iNeuBe model.
- Parameters
in_channels – number of input channels (image axis).
out_channels – number of output channels (image axis).
num_freqs – number of complex frequencies in the input STFT complex image-like tensor. The input is batch, image_channels, frames, freqs.
pre_blocks – dense block before point-wise convolution block over frequency axis.
freq_proc_blocks – number of frequency axis processing blocks.
post_blocks – dense block after point-wise convolution block over frequency axis.
ksz – kernel size used in densenet Conv2D layers.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
hid_chans – number of hidden channels in densenet Conv2D.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.tcndenseunet.
FreqWiseBlock
(in_channels, num_freqs, out_channels, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
FreqWiseBlock, see iNeuBe paper.
Block that applies pointwise 2D convolution over STFT-like image tensor on frequency axis. The input is assumed to be [batch, image_channels, frames, freq].
-
forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.layers.tcndenseunet.
TCNDenseUNet
(n_spk=1, in_freqs=257, mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
TCNDenseNet block from iNeuBe
Reference: Lu, Y.-J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.
- Parameters
n_spk – number of output sources/speakers.
in_freqs – number of complex STFT frequencies.
mic_channels – number of microphones channels (only fixed-array geometry supported).
hid_chans – number of channels in the subsampling/upsampling conv layers.
hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).
ksz_dense – kernel size in the densenet layers thorough iNeuBe.
ksz_tcn – kernel size in the TCN submodule.
tcn_repeats – number of repetitions of blocks in the TCN submodule.
tcn_blocks – number of blocks in the TCN submodule.
tcn_channels – number of channels in the TCN submodule.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
-
forward
(tf_rep)[source]¶ forward.
- Parameters
tf_rep (torch.Tensor) – 4D tensor (multi-channel complex STFT of mixture) of shape [B, T, C, F] batch, frames, microphones, frequencies.
- Returns
- complex 3D tensor monaural STFT of the targets
shape is [B, T, F] batch, frames, frequencies.
- Return type
out (torch.Tensor)
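A minimal usage sketch that takes the input/output shapes in the docstring above at face value (the configuration values are placeholders):

import torch
from espnet2.enh.layers.tcndenseunet import TCNDenseUNet

model = TCNDenseUNet(n_spk=1, in_freqs=257, mic_channels=2)
tf_rep = torch.randn(4, 50, 2, 257, dtype=torch.complex64)   # [B, T, C, F] multi-channel mixture STFT
out = model(tf_rep)                                          # monaural target STFT, [B, T, F] per the docstring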
-
class
espnet2.enh.layers.tcndenseunet.
TCNResBlock
(in_chan, out_chan, ksz=3, stride=1, dilation=1, activation=<class 'torch.nn.modules.activation.ELU'>)[source]¶ Bases:
torch.nn.modules.module.Module
single depth-wise separable TCN block as used in iNeuBe TCN.
- Parameters
in_chan – number of input feature channels.
out_chan – number of output feature channels.
ksz – kernel size.
stride – stride in depth-wise convolution.
dilation – dilation in depth-wise convolution.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
-
forward
(inp)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.mask_estimator¶
-
class
espnet2.enh.layers.mask_estimator.
MaskEstimator
(type, idim, layers, units, projs, dropout, nmask=1, nonlinear='sigmoid')[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(xs: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ Mask estimator forward function.
- Parameters
xs – (B, F, C, T)
ilens – (B,)
- Returns
The hidden vector (B, F, C, T) masks: A tuple of the masks. (B, F, C, T) ilens: (B,)
- Return type
hs (torch.Tensor)
-
espnet2.enh.layers.dnn_beamformer¶
DNN beamformer module.
-
class
espnet2.enh.layers.dnn_beamformer.
AttentionReference
(bidim, att_dim, eps=1e-06)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(psd_in: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, scaling: float = 2.0) → Tuple[torch.Tensor, torch.LongTensor][source]¶ Attention-based reference forward function.
- Parameters
psd_in (torch.complex64/ComplexTensor) – (B, F, C, C)
ilens (torch.Tensor) – (B,)
scaling (float) –
- Returns
(B, C) ilens (torch.Tensor): (B,)
- Return type
u (torch.Tensor)
-
-
class
espnet2.enh.layers.dnn_beamformer.
DNN_Beamformer
(bidim, btype: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, num_spk: int = 1, use_noise_mask: bool = True, nonlinear: str = 'sigmoid', dropout_rate: float = 0.0, badim: int = 320, ref_channel: int = -1, beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, mwf_mu: float = 1.0, eps: float = 1e-06, diagonal_loading: bool = True, diag_eps: float = 1e-07, mask_flooring: bool = False, flooring_thres: float = 1e-06, use_torch_solver: bool = True, btaps: int = 5, bdelay: int = 3)[source]¶ Bases:
torch.nn.modules.module.Module
DNN mask based Beamformer.
- Citation:
Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017; http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf
-
apply_beamforming
(data, ilens, psd_n, psd_speech, psd_distortion=None, rtf_mat=None, spk=0)[source]¶ Beamforming with the provided statistics.
- Parameters
data (torch.complex64/ComplexTensor) – (B, F, C, T)
ilens (torch.Tensor) – (B,)
psd_n (torch.complex64/ComplexTensor) – Noise covariance matrix for MVDR (B, F, C, C) Observation covariance matrix for MPDR/wMPDR (B, F, C, C) Stacked observation covariance for WPD (B,F,(btaps+1)*C,(btaps+1)*C)
psd_speech (torch.complex64/ComplexTensor) – Speech covariance matrix (B, F, C, C)
psd_distortion (torch.complex64/ComplexTensor) – Noise covariance matrix (B, F, C, C)
rtf_mat (torch.complex64/ComplexTensor) – RTF matrix (B, F, C, num_spk)
spk (int) – speaker index
- Returns
(B, F, T) ws (torch.complex64/ComplexTensor): (B, F) or (B, F, (btaps+1)*C)
- Return type
enhanced (torch.complex64/ComplexTensor)
-
forward
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor, powers: Optional[List[torch.Tensor]] = None, oracle_masks: Optional[List[torch.Tensor]] = None) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], torch.LongTensor, torch.Tensor][source]¶ DNN_Beamformer forward function.
- Notation:
B: Batch C: Channel T: Time or Sequence length F: Freq
- Parameters
data (torch.complex64/ComplexTensor) – (B, T, C, F)
ilens (torch.Tensor) – (B,)
powers (List[torch.Tensor] or None) – used for wMPDR or WPD (B, F, T)
oracle_masks (List[torch.Tensor] or None) – oracle masks (B, F, C, T) if not None, oracle_masks will be used instead of self.mask
- Returns
(B, T, F) ilens (torch.Tensor): (B,) masks (torch.Tensor): (B, T, C, F)
- Return type
enhanced (torch.complex64/ComplexTensor)
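A rough usage sketch with random inputs (untrained weights, so only the shapes are meaningful; the argument values are placeholders):

import torch
from espnet2.enh.layers.dnn_beamformer import DNN_Beamformer

bf = DNN_Beamformer(bidim=257, num_spk=1, use_noise_mask=True)
data = torch.randn(2, 100, 4, 257, dtype=torch.complex64)    # (B, T, C, F) mixture STFT
ilens = torch.LongTensor([100, 80])
enhanced, ilens, masks = bf(data, ilens)                     # enhanced: (B, T, F), masks: (B, T, C, F)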
-
predict_mask
(data: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.LongTensor) → Tuple[Tuple[torch.Tensor, ...], torch.LongTensor][source]¶ Predict masks for beamforming.
- Parameters
data (torch.complex64/ComplexTensor) – (B, T, C, F), double precision
ilens (torch.Tensor) – (B,)
- Returns
(B, T, C, F) ilens (torch.Tensor): (B,)
- Return type
masks (torch.Tensor)
espnet2.enh.layers.dptnet¶
-
class
espnet2.enh.layers.dptnet.
DPTNet
(rnn_type, input_size, hidden_size, output_size, att_heads=4, dropout=0, activation='relu', num_layers=1, bidirectional=True, norm_type='gLN')[source]¶ Bases:
torch.nn.modules.module.Module
Dual-path transformer network.
- Parameters
rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size (int) – dimension of the input feature. Input size must be a multiple of att_heads.
hidden_size (int) – dimension of the hidden state.
output_size (int) – dimension of the output size.
att_heads (int) – number of attention heads.
dropout (float) – dropout ratio. Default is 0.
activation (str) – activation function applied at the output of RNN.
num_layers (int) – number of stacked RNN layers. Default is 1.
bidirectional (bool) – whether the RNN layers are bidirectional. Default is True.
norm_type (str) – type of normalization to use after each inter- or intra-chunk Transformer block.
-
forward
(input)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.layers.dptnet.
ImprovedTransformerLayer
(rnn_type, input_size, att_heads, hidden_size, dropout=0.0, activation='relu', bidirectional=True, norm='gLN')[source]¶ Bases:
torch.nn.modules.module.Module
Container module of the (improved) Transformer proposed in [1].
- Reference:
Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation; Chen et al, Interspeech 2020.
- Parameters
rnn_type (str) – select from ‘RNN’, ‘LSTM’ and ‘GRU’.
input_size (int) – Dimension of the input feature.
att_heads (int) – Number of attention heads.
hidden_size (int) – Dimension of the hidden state.
dropout (float) – Dropout ratio. Default is 0.
activation (str) – activation function applied at the output of RNN.
bidirectional (bool, optional) – True for bidirectional Inter-Chunk RNN (Intra-Chunk is always bidirectional).
norm (str, optional) – Type of normalization to use.
-
forward
(x, attn_mask=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.layers.ifasnet¶
espnet2.enh.layers.wpe¶
-
espnet2.enh.layers.wpe.
get_correlations
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], inverse_power: torch.Tensor, taps, delay) → Tuple[Union[torch.Tensor, torch_complex.tensor.ComplexTensor], Union[torch.Tensor, torch_complex.tensor.ComplexTensor]][source]¶ Calculates weighted correlations of a window of length taps
- Parameters
Y – Complex-valued STFT signal with shape (F, C, T)
inverse_power – Weighting factor with shape (F, T)
taps (int) – Length of the correlation window
delay (int) – Delay for the weighting factor
- Returns
Correlation matrix of shape (F, taps*C, taps*C) Correlation vector of shape (F, taps, C, C)
-
espnet2.enh.layers.wpe.
get_filter_matrix_conj
(correlation_matrix: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], correlation_vector: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], eps: float = 1e-10) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Calculate (conjugate) filter matrix based on correlations for one freq.
- Parameters
correlation_matrix – Correlation matrix (F, taps * C, taps * C)
correlation_vector – Correlation vector (F, taps, C, C)
eps –
- Returns
(F, taps, C, C)
- Return type
filter_matrix_conj (torch.complex/ComplexTensor)
-
espnet2.enh.layers.wpe.
get_power
(signal, dim=-2) → torch.Tensor[source]¶ Calculates power for signal
- Parameters
signal – Single frequency signal with shape (F, C, T).
axis – reduce_mean axis
- Returns
Power with shape (F, T)
-
espnet2.enh.layers.wpe.
is_torch_1_9_plus
= True¶ WPE pytorch version: ported from https://github.com/fgnt/nara_wpe. Many functions aren't tested enough.
-
espnet2.enh.layers.wpe.
perform_filter_operation
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], filter_matrix_conj: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps, delay) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ - Parameters
Y – Complex-valued STFT signal of shape (F, C, T)
filter_matrix_conj – conjugate filter matrix
-
espnet2.enh.layers.wpe.
signal_framing
(signal: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], frame_length: int, frame_step: int, pad_value=0) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ Expands signal into frames of frame_length.
- Parameters
signal – (B * F, D, T)
- Returns
(B * F, D, T, W)
- Return type
torch.Tensor
-
espnet2.enh.layers.wpe.
wpe
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], taps=10, delay=3, iterations=3) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ WPE
- Parameters
Y – Complex valued STFT signal with shape (F, C, T)
taps – Number of filter taps
delay – Delay as a guard interval, such that X does not become zero.
iterations –
- Returns
(F, C, T)
- Return type
enhanced
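A minimal usage sketch following the documented (F, C, T) layout; the random tensor below merely stands in for a reverberant mixture STFT:

import torch
from espnet2.enh.layers.wpe import wpe

F_, C, T = 257, 4, 200
Y = torch.randn(F_, C, T, dtype=torch.complex64)     # multi-channel reverberant STFT (F, C, T)
Y_derev = wpe(Y, taps=10, delay=3, iterations=3)     # dereverberated STFT, same (F, C, T) shape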
-
espnet2.enh.layers.wpe.
wpe_one_iteration
(Y: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], power: torch.Tensor, taps: int = 10, delay: int = 3, eps: float = 1e-10, inverse_power: bool = True) → Union[torch.Tensor, torch_complex.tensor.ComplexTensor][source]¶ WPE for one iteration
- Parameters
Y – Complex valued STFT signal with shape (…, C, T)
power – Power with shape (…, T)
taps – Number of filter taps
delay – Delay as a guard interval, such that X does not become zero.
eps –
inverse_power (bool) –
- Returns
(…, C, T)
- Return type
enhanced
espnet2.enh.layers.dpmulcat¶
-
class
espnet2.enh.layers.dpmulcat.
DPMulCat
(input_size: int, hidden_size: int, output_size: int, num_spk: int, dropout: float = 0.0, num_layers: int = 4, bidirectional: bool = True, input_normalize: bool = False)[source]¶ Bases:
torch.nn.modules.module.Module
Dual-path RNN module with MulCat blocks.
- Parameters
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
output_size – int, dimension of the output size.
num_spk – int, the number of speakers in the output.
dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
num_layers – int, number of stacked MulCat blocks. (Default: 4)
input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)
-
forward
(input)[source]¶ Compute output after DPMulCat module.
- Parameters
input (torch.Tensor) – The input feature, a tensor of shape (batch, N, dim1, dim2). The RNN is applied along dim1 first and then along dim2.
- Returns
- list(torch.Tensor) or list(list(torch.Tensor)).
In training mode, the module returns the output of each DPMulCat block. In eval mode, it returns only the output of the last block.
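A hedged usage sketch of DPMulCat on random input; the (batch, N, dim1, dim2) layout follows the forward docstring, N is assumed to equal input_size, and all sizes are illustrative.

```python
import torch

from espnet2.enh.layers.dpmulcat import DPMulCat

# Illustrative configuration; output shapes depend on the internal projection.
model = DPMulCat(input_size=64, hidden_size=128, output_size=64, num_spk=2, num_layers=2)
x = torch.randn(3, 64, 50, 20)  # (batch, N, dim1, dim2)

model.train()
outputs = model(x)   # list with one entry per MulCat block (training mode)
model.eval()
outputs = model(x)   # only the last block's output is kept in eval mode
```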
-
class
espnet2.enh.layers.dpmulcat.
MulCatBlock
(input_size: int, hidden_size: int, dropout: float = 0.0, bidirectional: bool = True)[source]¶ Bases:
torch.nn.modules.module.Module
The MulCat block.
- Parameters
input_size – int, dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size – int, dimension of the hidden state.
dropout – float, the dropout rate in the LSTM layer. (Default: 0.0)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
espnet2.enh.layers.dc_crn¶
-
class
espnet2.enh.layers.dc_crn.
DC_CRN
(input_dim, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels=8, enc_kernel_size=(1, 3), enc_padding=(0, 1), enc_last_kernel_size=(1, 4), enc_last_stride=(1, 2), enc_last_padding=(0, 1), enc_layers=5, skip_last_kernel_size=(1, 3), skip_last_stride=(1, 1), skip_last_padding=(0, 1), glstm_groups=2, glstm_layers=2, glstm_bidirectional=False, glstm_rearrange=False, output_channels=2)[source]¶ Bases:
torch.nn.modules.module.Module
Densely-Connected Convolutional Recurrent Network (DC-CRN).
Reference: Fig. 3 and Section III-B in [1]
- Parameters
input_dim (int) – input feature dimension
input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers. Its length should be (number of DenselyConnectedBlock layers). It is recommended to use an even number of channels to avoid an AssertionError when glstm_bidirectional=True.
enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder
enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder
enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder
enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder
skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways
glstm_groups (int) – number of groups in each Grouped LSTM layer
glstm_layers (int) – number of Grouped LSTM layers
glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers
glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
output_channels (int) – number of output channels (must be an even number to recover both real and imaginary parts)
-
class
espnet2.enh.layers.dc_crn.
DenselyConnectedBlock
(in_channels, out_channels, hid_channels=8, kernel_size=(1, 3), padding=(0, 1), last_kernel_size=(1, 4), last_stride=(1, 2), last_padding=(0, 1), last_output_padding=(0, 0), layers=5, transposed=False)[source]¶ Bases:
torch.nn.modules.module.Module
Densely-Connected Convolutional Block.
- Parameters
in_channels (int) – number of input channels
out_channels (int) – number of output channels
hid_channels (int) – number of output channels in intermediate Conv layers
kernel_size (tuple) – kernel size for all but the last Conv layers
padding (tuple) – padding for all but the last Conv layers
last_kernel_size (tuple) – kernel size for the last GluConv layer
last_stride (tuple) – stride for the last GluConv layer
last_padding (tuple) – padding for the last GluConv layer
last_output_padding (tuple) – output padding for the last GluConvTranspose2d (only used when transposed=True)
layers (int) – total number of Conv layers
transposed (bool) – True to use GluConvTranspose2d in the last layer; False to use GluConv2d in the last layer
-
class
espnet2.enh.layers.dc_crn.
GLSTM
(hidden_size=1024, groups=2, layers=2, bidirectional=False, rearrange=False)[source]¶ Bases:
torch.nn.modules.module.Module
Grouped LSTM.
- Reference:
Efficient Sequence Learning with Group Recurrent Networks; Gao et al., 2018
- Parameters
hidden_size (int) – total hidden size of all LSTMs in each grouped LSTM layer i.e., hidden size of each LSTM is hidden_size // groups
groups (int) – number of LSTMs in each grouped LSTM layer
layers (int) – number of grouped LSTM layers
bidirectional (bool) – whether to use BLSTM or unidirectional LSTM
rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
-
class
espnet2.enh.layers.dc_crn.
GluConv2d
(in_channels, out_channels, kernel_size, stride, padding=0)[source]¶ Bases:
torch.nn.modules.module.Module
Conv2d with Gated Linear Units (GLU).
Input and output shapes are the same as regular Conv2d layers.
Reference: Section III-B in [1]
- Parameters
in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int/tuple) – kernel size in Conv2d
stride (int/tuple) – stride size in Conv2d
padding (int/tuple) – padding size in Conv2d
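A short sketch of GluConv2d on random data; since input and output shapes match a regular Conv2d, a (1, 3) kernel with (0, 1) padding and unit stride preserves the trailing dimensions. All sizes are illustrative.

```python
import torch

from espnet2.enh.layers.dc_crn import GluConv2d

conv = GluConv2d(in_channels=2, out_channels=16, kernel_size=(1, 3), stride=(1, 1), padding=(0, 1))
x = torch.randn(4, 2, 10, 257)  # (Batch, C_in, T, F), the usual Conv2d layout
y = conv(x)
print(y.shape)                  # expected (4, 16, 10, 257): GLU-gated channels, same T and F
```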
-
class
espnet2.enh.layers.dc_crn.
GluConvTranspose2d
(in_channels, out_channels, kernel_size, stride, padding=0, output_padding=(0, 0))[source]¶ Bases:
torch.nn.modules.module.Module
ConvTranspose2d with Gated Linear Units (GLU).
Input and output shapes are the same as regular ConvTranspose2d layers.
Reference: Section III-B in [1]
- Parameters
in_channels (int) – number of input channels
out_channels (int) – number of output channels
kernel_size (int/tuple) – kernel size in ConvTranspose2d
stride (int/tuple) – stride size in ConvTranspose2d
padding (int/tuple) – padding size in ConvTranspose2d
output_padding (int/tuple) – Additional size added to one side of each dimension in the output shape
espnet2.enh.separator.__init__¶
espnet2.enh.separator.dpcl_separator¶
-
class
espnet2.enh.separator.dpcl_separator.
DPCLSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Clustering Separator.
References
- [1] Deep clustering: Discriminative embeddings for segmentation and
separation; John R. Hershey. et al., 2016; https://ieeexplore.ieee.org/document/7471631
- [2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding
Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the feature vector for a tf-bin.
dropout – float, dropout ratio. Default is 0.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. tf_embedding: OrderedDict[
’tf_embedding’: learned embedding of all T-F bins (B, T * F, D),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.transformer_separator¶
-
class
espnet2.enh.separator.transformer_separator.
TransformerSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, use_scaled_pos_enc: bool = True, nonlinear: str = 'relu')[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Transformer separator.
- Parameters
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
adim (int) – Dimension of attention.
aheads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
layers (int) – The number of transformer blocks.
dropout_rate (float) – Dropout rate.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
use_scaled_pos_enc (bool) – use scaled positional encoding or not
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.dprnn_separator¶
-
class
espnet2.enh.separator.dprnn_separator.
DPRNNSeparator
(input_dim: int, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Dual-Path RNN (DPRNN) Separator
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 3.
unit – int, dimension of the hidden state.
segment_size – dual-path segment size
dropout – float, dropout ratio. Default is 0.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
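As with the other separators in this package, DPRNNSeparator consumes an encoded feature and its lengths and returns the per-speaker masked features plus an OrderedDict of masks. A minimal sketch with illustrative sizes (the encoder producing the 64-dim feature is assumed):

```python
import torch

from espnet2.enh.separator.dprnn_separator import DPRNNSeparator

separator = DPRNNSeparator(input_dim=64, num_spk=2, layer=2, unit=128, segment_size=10)
feats = torch.rand(3, 200, 64)           # (Batch, T, N) encoded feature
ilens = torch.tensor([200, 180, 150])    # input lengths
masked, olens, others = separator(feats, ilens)
print(len(masked), masked[0].shape)      # 2 separated streams, each (3, 200, 64)
print(list(others))                      # ['mask_spk1', 'mask_spk2']
```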
espnet2.enh.separator.dan_separator¶
-
class
espnet2.enh.separator.dan_separator.
DANSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Attractor Network Separator
- Reference:
DEEP ATTRACTOR NETWORK FOR SINGLE-MICROPHONE SPEAKER SEPARATION; Zhuo Chen. et al., 2017; https://pubmed.ncbi.nlm.nih.gov/29430212/
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the attribute vector for one tf-bin.
dropout – float, dropout ratio. Default is 0.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model e.g. “feature_ref”: list of reference spectra List[(B, T, F)]
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.dc_crn_separator¶
-
class
espnet2.enh.separator.dc_crn_separator.
DC_CRNSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, input_channels: List = [2, 16, 32, 64, 128, 256], enc_hid_channels: int = 8, enc_kernel_size: Tuple = (1, 3), enc_padding: Tuple = (0, 1), enc_last_kernel_size: Tuple = (1, 4), enc_last_stride: Tuple = (1, 2), enc_last_padding: Tuple = (0, 1), enc_layers: int = 5, skip_last_kernel_size: Tuple = (1, 3), skip_last_stride: Tuple = (1, 1), skip_last_padding: Tuple = (0, 1), glstm_groups: int = 2, glstm_layers: int = 2, glstm_bidirectional: bool = False, glstm_rearrange: bool = False, mode: str = 'masking', ref_channel: int = 0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Densely-Connected Convolutional Recurrent Network (DC-CRN) Separator
- Reference:
Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones; Tan et al., 2020 https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf
- Parameters
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
input_channels (list) – number of input channels for the stacked DenselyConnectedBlock layers Its length should be (number of DenselyConnectedBlock layers).
enc_hid_channels (int) – common number of intermediate channels for all DenselyConnectedBlock of the encoder
enc_kernel_size (tuple) – common kernel size for all DenselyConnectedBlock of the encoder
enc_padding (tuple) – common padding for all DenselyConnectedBlock of the encoder
enc_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the encoder
enc_layers (int) – common total number of Conv layers for all DenselyConnectedBlock layers of the encoder
skip_last_kernel_size (tuple) – common kernel size for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_stride (tuple) – common stride for the last Conv layer in all DenselyConnectedBlock of the skip pathways
skip_last_padding (tuple) – common padding for the last Conv layer in all DenselyConnectedBlock of the skip pathways
glstm_groups (int) – number of groups in each Grouped LSTM layer
glstm_layers (int) – number of Grouped LSTM layers
glstm_bidirectional (bool) – whether to use BLSTM or unidirectional LSTM in Grouped LSTM layers
glstm_rearrange (bool) – whether to apply the rearrange operation after each grouped LSTM layer
output_channels (int) – number of output channels (even number)
mode (str) – one of (“mapping”, “masking”). “mapping”: complex spectral mapping; “masking”: complex masking.
ref_channel (int) – index of the reference microphone
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ DC-CRN Separator Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [Batch, T, F] or [Batch, T, C, F]
ilens (torch.Tensor) – input lengths [Batch,]
- Returns
[(Batch, T, F), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.neural_beamformer¶
-
class
espnet2.enh.separator.neural_beamformer.
NeuralBeamformer
(input_dim: int, num_spk: int = 1, loss_type: str = 'mask_mse', use_wpe: bool = False, wnet_type: str = 'blstmp', wlayers: int = 3, wunits: int = 300, wprojs: int = 320, wdropout_rate: float = 0.0, taps: int = 5, delay: int = 3, use_dnn_mask_for_wpe: bool = True, wnonlinear: str = 'crelu', multi_source_wpe: bool = True, wnormalization: bool = False, use_beamformer: bool = True, bnet_type: str = 'blstmp', blayers: int = 3, bunits: int = 300, bprojs: int = 320, badim: int = 320, ref_channel: int = -1, use_noise_mask: bool = True, bnonlinear: str = 'sigmoid', beamformer_type: str = 'mvdr_souden', rtf_iterations: int = 2, bdropout_rate: float = 0.0, shared_power: bool = True, diagonal_loading: bool = True, diag_eps_wpe: float = 1e-07, diag_eps_bf: float = 1e-07, mask_flooring: bool = False, flooring_thres_wpe: float = 1e-06, flooring_thres_bf: float = 1e-06, use_torch_solver: bool = True)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.complex64/ComplexTensor) – mixed speech [Batch, Frames, Channel, Freq]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
List[torch.complex64/ComplexTensor] output lengths other predicted data: OrderedDict[
’dereverb1’: ComplexTensor(Batch, Frames, Channel, Freq), ‘mask_dereverb1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_noise1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_spk1’: torch.Tensor(Batch, Frames, Channel, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Channel, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Channel, Freq),
]
- Return type
enhanced speech (single-channel)
-
property
num_spk
¶
-
espnet2.enh.separator.ineube_separator¶
-
class
espnet2.enh.separator.ineube_separator.
iNeuBe
(n_spk=1, n_fft=512, stride=128, window='hann', mic_channels=1, hid_chans=32, hid_chans_dense=32, ksz_dense=(3, 3), ksz_tcn=3, tcn_repeats=4, tcn_blocks=7, tcn_channels=384, activation='elu', output_from='dnn1', n_chunks=3, freeze_dnn1=False, tik_eps=1e-08)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
iNeuBe, iterative neural/beamforming enhancement
Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPNET-Se Submission to the L3DAS22 Challenge. ICASSP 2022 p. 9201-9205.
NOTES: As outlined in the Reference, this model works best when coupled with the MultiResL1SpecLoss defined in criterions/time_domain.py. The model is trained with variance-normalized mixture input and target. E.g., with a mixture of shape [batch, microphones, samples], you normalize it by dividing by torch.std(mixture, (1, 2)). You must apply the same scaling to the target signal. In the Reference, the variance normalization was performed offline (we normalized by the std computed on the entire training set and not for each input separately). However, we found that normalizing each input and target separately also works well (see the sketch after the parameter list below).
- Parameters
n_spk – number of output sources/speakers.
n_fft – stft window size.
stride – stft stride.
window – STFT window type; choose between ‘hamming’, ‘hanning’, or None.
mic_channels – number of microphones channels (only fixed-array geometry supported).
hid_chans – number of channels in the subsampling/upsampling conv layers.
hid_chans_dense – number of channels in the densenet layers (reduce this to reduce VRAM requirements).
ksz_dense – kernel size in the densenet layers throughout iNeuBe.
ksz_tcn – kernel size in the TCN submodule.
tcn_repeats – number of repetitions of blocks in the TCN submodule.
tcn_blocks – number of blocks in the TCN submodule.
tcn_channels – number of channels in the TCN submodule.
activation – activation function to use in the whole iNeuBe model, you can use any torch supported activation e.g. ‘relu’ or ‘elu’.
output_from – output the estimate from ‘dnn1’, ‘mfmcwf’ or ‘dnn2’.
n_chunks – number of future and past frames to consider for mfMCWF computation.
freeze_dnn1 – whether to freeze dnn1 parameters during the training of dnn2.
tik_eps – diagonal loading in the mfMCWF computation.
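A minimal sketch of the per-utterance variance normalization described in the NOTES above; the [batch, microphones, samples] layout follows the NOTES, and the data here is random placeholder audio.

```python
import torch

# Random placeholder batch: [batch, microphones, samples].
mixture = torch.randn(4, 2, 16000)
target = torch.randn(4, 2, 16000)

# One scale per batch element, computed on the mixture only,
# then applied identically to mixture and target (as required above).
std = torch.std(mixture, (1, 2), keepdim=True)
mixture = mixture / std
target = target / std
```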
-
forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor) – batched multi-channel audio tensor with C audio channels and T samples [B, T, C]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data, currently unused in this model.
- Returns
- [(B, T), …] list of len n_spk
of mono audio tensors with T samples.
- ilens (torch.Tensor): (B,)
- others predicted data, e.g. masks: OrderedDict[
‘mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- additional (Dict or None): other data, currently unused in this model,
we return it also in output.
- Return type
enhanced (List[Union(torch.Tensor)])
-
static
mfmcwf
(mixture, estimate, n_chunks, tik_eps)[source]¶ Multi-frame multi-channel Wiener filter.
- Parameters
mixture (torch.Tensor) – multi-channel STFT complex mixture tensor, of shape [B, T, C, F] batch, frames, microphones, frequencies.
estimate (torch.Tensor) – monaural STFT complex estimate of target source [B, T, F] batch, frames, frequencies.
n_chunks (int) – number of past and future mfMCWF frames. If 0 then standard MCWF.
tik_eps (float) – diagonal loading for matrix inversion in MCWF computation.
- Returns
- monaural STFT complex estimate
of target source after MFMCWF [B, T, F] batch, frames, frequencies.
- Return type
beamformed (torch.Tensor)
-
property
num_spk
¶
-
static
unfold
(tf_rep, chunk_size)[source]¶ Unfold the STFT representation to add context in the microphone channel dimension.
- Parameters
tf_rep (torch.Tensor) – 3D tensor (monaural complex STFT) of shape [B, T, F] batch, frames, frequencies.
chunk_size (int) – number of past and future frames to consider.
- Returns
- complex 3D tensor STFT with context channel.
shape now is [B, T, C, F] batch, frames, context, frequencies. Basically same shape as a multi-channel STFT with C microphones.
- Return type
est_unfolded (torch.Tensor)
espnet2.enh.separator.dpcl_e2e_separator¶
-
class
espnet2.enh.separator.dpcl_e2e_separator.
DPCLE2ESeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'tanh', layer: int = 2, unit: int = 512, emb_D: int = 40, dropout: float = 0.0, alpha: float = 5.0, max_iteration: int = 500, threshold: float = 1e-05)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Deep Clustering End-to-End Separator
References
Single-Channel Multi-Speaker Separation using Deep Clustering; Yusuf Isik. et al., 2016; https://www.isca-speech.org/archive/interspeech_2016/isik16_interspeech.html
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 2.
unit – int, dimension of the hidden state.
emb_D – int, dimension of the feature vector for a tf-bin.
dropout – float, dropout ratio. Default is 0.
alpha – float, the clustering hardness parameter.
max_iteration – int, the max iterations of soft kmeans.
threshold – float, the threshold to end the soft k-means process.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.tcn_separator¶
-
class
espnet2.enh.separator.tcn_separator.
TCNSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, layer: int = 8, stack: int = 3, bottleneck_dim: int = 128, hidden_dim: int = 512, kernel: int = 3, causal: bool = False, norm_type: str = 'gLN', nonlinear: str = 'relu')[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Temporal Convolution Separator
- Parameters
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
layer – int, number of layers in each stack.
stack – int, number of stacks
bottleneck_dim – bottleneck dimension
hidden_dim – number of convolution channel
kernel – int, kernel size.
causal – bool, default False.
norm_type – str, choose from ‘BN’, ‘gLN’, ‘cLN’
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.conformer_separator¶
-
class
espnet2.enh.separator.conformer_separator.
ConformerSeparator
(input_dim: int, num_spk: int = 2, predict_noise: bool = False, adim: int = 384, aheads: int = 4, layers: int = 6, linear_units: int = 1536, positionwise_layer_type: str = 'linear', positionwise_conv_kernel_size: int = 1, normalize_before: bool = False, concat_after: bool = False, dropout_rate: float = 0.1, input_layer: str = 'linear', positional_dropout_rate: float = 0.1, attention_dropout_rate: float = 0.1, nonlinear: str = 'relu', conformer_pos_enc_layer_type: str = 'rel_pos', conformer_self_attn_layer_type: str = 'rel_selfattn', conformer_activation_type: str = 'swish', use_macaron_style_in_conformer: bool = True, use_cnn_in_conformer: bool = True, conformer_enc_kernel_size: int = 7, padding_idx: int = -1)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Conformer separator.
- Parameters
input_dim – input feature dimension
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
adim (int) – Dimension of attention.
aheads (int) – The number of heads of multi head attention.
linear_units (int) – The number of units of position-wise feed forward.
layers (int) – The number of transformer blocks.
dropout_rate (float) – Dropout rate.
input_layer (Union[str, torch.nn.Module]) – Input layer type.
attention_dropout_rate (float) – Dropout rate in attention.
positional_dropout_rate (float) – Dropout rate after adding positional encoding.
normalize_before (bool) – Whether to use layer_norm before the first block.
concat_after (bool) – Whether to concat attention layer’s input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x)
conformer_pos_enc_layer_type (str) – Encoder positional encoding layer type.
conformer_self_attn_layer_type (str) – Encoder attention layer type.
conformer_activation_type (str) – Encoder activation function type.
positionwise_layer_type (str) – “linear”, “conv1d”, or “conv1d-linear”.
positionwise_conv_kernel_size (int) – Kernel size of positionwise conv1d layer.
use_macaron_style_in_conformer (bool) – Whether to use macaron style for positionwise layer.
use_cnn_in_conformer (bool) – Whether to use convolution module.
conformer_enc_kernel_size (int) – Kernel size of the convolution module.
padding_idx (int) – Padding idx for input_layer=embed.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.asteroid_models¶
-
class
espnet2.enh.separator.asteroid_models.
AsteroidModel_Converter
(encoder_output_dim: int, model_name: str, num_spk: int, pretrained_path: str = '', loss_type: str = 'si_snr', **model_related_kwargs)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
The class to convert models from Asteroid to AbsSeparator.
- Parameters
encoder_output_dim – input feature dimension, default=1 after the NullEncoder
num_spk – number of speakers
loss_type – loss type of enhancement
model_name – Asteroid model names, e.g. ConvTasNet, DPTNet. Refers to https://github.com/asteroid-team/asteroid/blob/master/asteroid/models/__init__.py
pretrained_path – the name of the pretrained model from Asteroid in the HF hub. Refers to https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md and https://huggingface.co/models?filter=asteroid
model_related_kwargs – more args towards each specific asteroid model.
-
forward
(input: torch.Tensor, ilens: torch.Tensor = None, additional: Optional[Dict] = None)[source]¶ Whole forward of asteroid models.
- Parameters
input (torch.Tensor) – Raw Waveforms [B, T]
ilens (torch.Tensor) – input lengths [B]
additional (Dict or None) – other data included in model
- Returns
[(B, T), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, T), ‘mask_spk2’: torch.Tensor(Batch, T), … ‘mask_spkn’: torch.Tensor(Batch, T),
]
- Return type
estimated waveforms (List[torch.Tensor])
-
forward_rawwav
(input: torch.Tensor, ilens: torch.Tensor = None) → Tuple[torch.Tensor, torch.Tensor][source]¶ Output with waveforms.
-
property
num_spk
¶
espnet2.enh.separator.dptnet_separator¶
-
class
espnet2.enh.separator.dptnet_separator.
DPTNetSeparator
(input_dim: int, post_enc_relu: bool = True, rnn_type: str = 'lstm', bidirectional: bool = True, num_spk: int = 2, predict_noise: bool = False, unit: int = 256, att_heads: int = 4, dropout: float = 0.0, activation: str = 'relu', norm_type: str = 'gLN', layer: int = 6, segment_size: int = 20, nonlinear: str = 'relu')[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Dual-Path Transformer Network (DPTNet) Separator
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘RNN’, ‘LSTM’ and ‘GRU’.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
unit – int, dimension of the hidden state.
att_heads – number of attention heads.
dropout – float, dropout ratio. Default is 0.
activation – activation function applied at the output of RNN.
norm_type – type of normalization to use after each inter- or intra-chunk Transformer block.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 6.
segment_size – dual-path segment size
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.dccrn_separator¶
-
class
espnet2.enh.separator.dccrn_separator.
DCCRNSeparator
(input_dim: int, num_spk: int = 1, rnn_layer: int = 2, rnn_units: int = 256, masking_mode: str = 'E', use_clstm: bool = True, bidirectional: bool = False, use_cbn: bool = False, kernel_size: int = 5, kernel_num: List[int] = [32, 64, 128, 256, 256, 256], use_builtin_complex: bool = True, use_noise_mask: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
DCCRN separator.
- Parameters
input_dim (int) – input dimension.
num_spk (int, optional) – number of speakers. Defaults to 1.
rnn_layer (int, optional) – number of lstm layers in the crn. Defaults to 2.
rnn_units (int, optional) – rnn units. Defaults to 256.
masking_mode (str, optional) – usage of the estimated mask. Defaults to “E”.
use_clstm (bool, optional) – whether to use complex LSTM. Defaults to True.
bidirectional (bool, optional) – whether to use BLSTM. Defaults to False.
use_cbn (bool, optional) – whether to use complex BN. Defaults to False.
kernel_size (int, optional) – convolution kernel size. Defaults to 5.
kernel_num (list, optional) – output dimension of each layer of the encoder.
use_builtin_complex (bool, optional) – torch.complex if True, else ComplexTensor.
use_noise_mask (bool, optional) – whether to estimate the mask of noise.
-
apply_masks
(masks: List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], real: torch.Tensor, imag: torch.Tensor)[source]¶ apply masks
- Parameters
masks – est_masks, [(B, T, F), …]
real (torch.Tensor) – real part of the noisy spectrum, (B, F, T)
imag (torch.Tensor) – imag part of the noisy spectrum, (B, F, T)
- Returns
[(B, T, F), …]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
create_masks
(mask_tensor: torch.Tensor)[source]¶ create estimated mask for each speaker
- Parameters
mask_tensor (torch.Tensor) – output of decoder, shape(B, 2*num_spk, F-1, T)
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, F]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, F), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.skim_separator¶
-
class
espnet2.enh.separator.skim_separator.
SkiMSeparator
(input_dim: int, causal: bool = True, num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_overlap: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Skipping Memory (SkiM) Separator
- Parameters
input_dim – input feature dimension
causal – bool, whether the system is causal.
num_spk – number of target speakers.
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of SkiM blocks. Default is 3.
unit – int, dimension of the hidden state.
segment_size – segmentation size for splitting long features
dropout – float, dropout ratio. Default is 0.
mem_type – ‘hc’, ‘h’, ‘c’, ‘id’ or None. It controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
seg_overlap – Bool, whether the segmentation will reserve 50% overlap for adjacent segments. Default is False.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.fasnet_separator¶
-
class
espnet2.enh.separator.fasnet_separator.
FaSNetSeparator
(input_dim: int, enc_dim: int, feature_dim: int, hidden_dim: int, layer: int, segment_size: int, num_spk: int, win_len: int, context_len: int, fasnet_type: str, dropout: float = 0.0, sr: int = 16000, predict_noise: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
Filter-and-sum Network (FaSNet) Separator
- Parameters
input_dim – required by AbsSeparator. Not used in this model.
enc_dim – encoder dimension
feature_dim – feature dimension
hidden_dim – hidden dimension in DPRNN
layer – number of DPRNN blocks in iFaSNet
segment_size – dual-path segment size
num_spk – number of speakers
win_len – window length in milliseconds
context_len – context length in milliseconds
fasnet_type – ‘fasnet’ or ‘ifasnet’. Select from the original FaSNet or the implicit FaSNet (iFaSNet)
dropout – dropout rate. Default is 0.
sr – sample rate of the input audio
predict_noise – whether to output the estimated noise signal
-
forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor) – (Batch, samples, channels)
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
separated (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.abs_separator¶
-
class
espnet2.enh.separator.abs_separator.
AbsSeparator
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[Tuple[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
abstract property
num_spk
¶
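Custom separators plug into the enhancement model by subclassing AbsSeparator and implementing the abstract forward and num_spk members shown above. A toy sketch (the pass-through behaviour is purely illustrative, not part of the library):

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

import torch

from espnet2.enh.separator.abs_separator import AbsSeparator


class PassThroughSeparator(AbsSeparator):
    """Toy separator that returns the input unchanged for a single 'speaker'."""

    def __init__(self, input_dim: int):
        super().__init__()
        self._num_spk = 1

    def forward(
        self,
        input: torch.Tensor,
        ilens: torch.Tensor,
        additional: Optional[Dict] = None,
    ) -> Tuple[List[torch.Tensor], torch.Tensor, OrderedDict]:
        # No masks are estimated here, so the OrderedDict stays empty.
        return [input], ilens, OrderedDict()

    @property
    def num_spk(self):
        return self._num_spk
```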
espnet2.enh.separator.rnn_separator¶
-
class
espnet2.enh.separator.rnn_separator.
RNNSeparator
(input_dim: int, rnn_type: str = 'blstm', num_spk: int = 2, predict_noise: bool = False, nonlinear: str = 'sigmoid', layer: int = 3, unit: int = 512, dropout: float = 0.0)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
RNN Separator
- Parameters
input_dim – input feature dimension
rnn_type – string, select from ‘blstm’, ‘lstm’ etc.
bidirectional – bool, whether the inter-chunk RNN layers are bidirectional.
num_spk – number of speakers
predict_noise – whether to output the estimated noise signal
nonlinear – the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer – int, number of stacked RNN layers. Default is 3.
unit – int, dimension of the hidden state.
dropout – float, dropout ratio. Default is 0.
-
forward
(input: Union[torch.Tensor, torch_complex.tensor.ComplexTensor], ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[Union[torch.Tensor, torch_complex.tensor.ComplexTensor]], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
espnet2.enh.separator.svoice_separator¶
-
class
espnet2.enh.separator.svoice_separator.
Decoder
(kernel_size)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(est_source)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.separator.svoice_separator.
Encoder
(enc_kernel_size: int, enc_feat_dim: int)[source]¶ Bases:
torch.nn.modules.module.Module
-
forward
(mixture)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class
espnet2.enh.separator.svoice_separator.
SVoiceSeparator
(input_dim: int, enc_dim: int, kernel_size: int, hidden_size: int, num_spk: int = 2, num_layers: int = 4, segment_size: int = 20, bidirectional: bool = True, input_normalize: bool = False)[source]¶ Bases:
espnet2.enh.separator.abs_separator.AbsSeparator
SVoice model for speech separation.
- Reference:
Voice Separation with an Unknown Number of Multiple Speakers; E. Nachmani et al., 2020; https://arxiv.org/abs/2003.01531
- Parameters
enc_dim – int, dimension of the encoder module’s output. (Default: 128)
kernel_size – int, the kernel size of Conv1D layer in both encoder and decoder modules. (Default: 8)
hidden_size – int, dimension of the hidden state in RNN layers. (Default: 128)
num_spk – int, the number of speakers in the output. (Default: 2)
num_layers – int, number of stacked MulCat blocks. (Default: 4)
segment_size – dual-path segment size. (Default: 20)
bidirectional – bool, whether the RNN layers are bidirectional. (Default: True)
input_normalize – bool, whether to apply GroupNorm on the input Tensor. (Default: False)
-
forward
(input: torch.Tensor, ilens: torch.Tensor, additional: Optional[Dict] = None) → Tuple[List[torch.Tensor], torch.Tensor, collections.OrderedDict][source]¶ Forward.
- Parameters
input (torch.Tensor or ComplexTensor) – Encoded feature [B, T, N]
ilens (torch.Tensor) – input lengths [Batch]
additional (Dict or None) – other data included in model NOTE: not used in this model
- Returns
[(B, T, N), …] ilens (torch.Tensor): (B,) others predicted data, e.g. masks: OrderedDict[
’mask_spk1’: torch.Tensor(Batch, Frames, Freq), ‘mask_spk2’: torch.Tensor(Batch, Frames, Freq), … ‘mask_spkn’: torch.Tensor(Batch, Frames, Freq),
]
- Return type
masked (List[Union(torch.Tensor, ComplexTensor)])
-
property
num_spk
¶
-
espnet2.enh.separator.svoice_separator.
overlap_and_add
(signal, frame_step)[source]¶ Reconstructs a signal from a framed representation.
Adds potentially overlapping frames of a signal with shape […, frames, frame_length], offsetting subsequent frames by frame_step. The resulting tensor has shape […, output_size] where
output_size = (frames - 1) * frame_step + frame_length
- Args:
- signal: A […, frames, frame_length] Tensor. All dimensions may be unknown,
and rank must be at least 2.
- frame_step: An integer denoting overlap offsets.
Must be less than or equal to frame_length.
- Returns:
- A Tensor with shape […, output_size] containing the
overlap-added frames of signal’s inner-most two dimensions.
output_size = (frames - 1) * frame_step + frame_length
Based on
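A quick check of the documented shape relation output_size = (frames - 1) * frame_step + frame_length, on random data:

```python
import torch

from espnet2.enh.separator.svoice_separator import overlap_and_add

frames = torch.randn(8, 10, 32)        # [..., frames, frame_length]
signal = overlap_and_add(frames, frame_step=16)
print(signal.shape)                    # (8, (10 - 1) * 16 + 32) = (8, 176)
```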
espnet2.enh.encoder.__init__¶
espnet2.enh.encoder.stft_encoder¶
-
class
espnet2.enh.encoder.stft_encoder.
STFTEncoder
(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True, use_builtin_complex: bool = True)[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
STFT encoder for speech enhancement and separation
-
forward
(input: torch.Tensor, ilens: torch.Tensor)[source]¶ Forward.
- Parameters
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
-
property
output_dim
¶
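A hedged sketch of STFTEncoder on raw waveforms; the returned pair (complex spectrum, frame lengths) is assumed from the AbsEncoder interface, and output_dim is expected to be n_fft // 2 + 1 with the one-sided default.

```python
import torch

from espnet2.enh.encoder.stft_encoder import STFTEncoder

encoder = STFTEncoder(n_fft=512, hop_length=128)
wav = torch.randn(2, 16000)              # (Batch, sample) mixed speech
ilens = torch.tensor([16000, 12000])
spec, flens = encoder(wav, ilens)        # complex spectrum and frame lengths (assumed)
print(encoder.output_dim)                # 257 with the one-sided default
```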
-
espnet2.enh.encoder.abs_encoder¶
-
class
espnet2.enh.encoder.abs_encoder.
AbsEncoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
abstract property
output_dim
¶
espnet2.enh.encoder.conv_encoder¶
-
class
espnet2.enh.encoder.conv_encoder.
ConvEncoder
(channel: int, kernel_size: int, stride: int)[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
Convolutional encoder for speech enhancement and separation
-
forward
(input: torch.Tensor, ilens: torch.Tensor)[source]¶ Forward.
- Parameters
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
- Returns
mixed feature after encoder [Batch, flens, channel]
- Return type
feature (torch.Tensor)
-
property
output_dim
¶
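A sketch of ConvEncoder along the same lines; the feature layout [Batch, frames, channel] follows the forward docstring, while the second return value (frame lengths) and output_dim equalling the `channel` argument are assumptions.

```python
import torch

from espnet2.enh.encoder.conv_encoder import ConvEncoder

encoder = ConvEncoder(channel=256, kernel_size=32, stride=16)
wav = torch.randn(2, 16000)              # (Batch, sample) mixed speech
ilens = torch.tensor([16000, 16000])
feature, flens = encoder(wav, ilens)     # feature: (Batch, frames, channel); flens assumed
print(encoder.output_dim)                # 256, i.e. the `channel` argument (assumed)
```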
-
espnet2.enh.encoder.null_encoder¶
-
class
espnet2.enh.encoder.null_encoder.
NullEncoder
[source]¶ Bases:
espnet2.enh.encoder.abs_encoder.AbsEncoder
Null encoder.
-
forward
(input: torch.Tensor, ilens: torch.Tensor)[source]¶ Forward.
- Parameters
input (torch.Tensor) – mixed speech [Batch, sample]
ilens (torch.Tensor) – input lengths [Batch]
-
property
output_dim
¶
-
espnet2.enh.decoder.__init__¶
espnet2.enh.decoder.abs_decoder¶
-
class
espnet2.enh.decoder.abs_decoder.
AbsDecoder
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(input: torch.Tensor, ilens: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
espnet2.enh.decoder.conv_decoder¶
-
class
espnet2.enh.decoder.conv_decoder.
ConvDecoder
(channel: int, kernel_size: int, stride: int)[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
Transposed Convolutional decoder for speech enhancement and separation
espnet2.enh.decoder.stft_decoder¶
-
class
espnet2.enh.decoder.stft_decoder.
STFTDecoder
(n_fft: int = 512, win_length: int = None, hop_length: int = 128, window='hann', center: bool = True, normalized: bool = False, onesided: bool = True)[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
STFT decoder for speech enhancement and separation
espnet2.enh.decoder.null_decoder¶
-
class
espnet2.enh.decoder.null_decoder.
NullDecoder
[source]¶ Bases:
espnet2.enh.decoder.abs_decoder.AbsDecoder
Null decoder, return the same args.
espnet2.enh.loss.__init__¶
espnet2.enh.loss.criterions.tf_domain¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainAbsCoherence
(compute_on_mask=False, mask_type=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss
-
property
compute_on_mask
¶
-
forward
(ref, inf) → torch.Tensor[source]¶ time-frequency absolute coherence loss.
- Reference:
Independent Vector Analysis with Deep Neural Network Source Priors; Li et al 2020; https://arxiv.org/abs/2008.11273
- Parameters
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
- Returns
(Batch,)
- Return type
loss
-
property
mask_type
¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainCrossEntropy
(compute_on_mask=False, mask_type=None, ignore_id=-100, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss
-
property
compute_on_mask
¶
-
forward
(ref, inf) → torch.Tensor[source]¶ time-frequency cross-entropy loss.
- Parameters
ref – (Batch, T) or (Batch, T, C)
inf – (Batch, T, nclass) or (Batch, T, C, nclass)
- Returns
(Batch,)
- Return type
loss
-
property
mask_type
¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainDPCL
(compute_on_mask=False, mask_type='IBM', loss_type='dpcl', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss
-
property
compute_on_mask
¶
-
forward
(ref, inf) → torch.Tensor[source]¶ time-frequency Deep Clustering loss.
References
- [1] Deep clustering: Discriminative embeddings for segmentation and
separation; John R. Hershey. et al., 2016; https://ieeexplore.ieee.org/document/7471631
- [2] Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding
Vectors Based on Regular Simplex; Tanaka, K. et al., 2021; https://www.isca-speech.org/archive/interspeech_2021/tanaka21_interspeech.html
- Parameters
ref – List[(Batch, T, F) * spks]
inf – (Batch, T*F, D)
- Returns
(Batch,)
- Return type
loss
-
property
mask_type
¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainL1
(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss
-
property
compute_on_mask
¶
-
forward
(ref, inf) → torch.Tensor[source]¶ time-frequency L1 loss.
- Parameters
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
- Returns
(Batch,)
- Return type
loss
-
property
mask_type
¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainLoss
(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss
,abc.ABC
Base class for all frequency-domain enhancement loss modules.
-
abstract property
compute_on_mask
¶
-
property
is_dereverb_loss
¶
-
property
is_noise_loss
¶
-
abstract property
mask_type
¶
-
property
name
¶
-
property
only_for_test
¶
-
class
espnet2.enh.loss.criterions.tf_domain.
FrequencyDomainMSE
(compute_on_mask=False, mask_type='IBM', name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.tf_domain.FrequencyDomainLoss
-
property
compute_on_mask
¶
-
forward
(ref, inf) → torch.Tensor[source]¶ time-frequency MSE loss.
- Parameters
ref – (Batch, T, F) or (Batch, T, C, F)
inf – (Batch, T, F) or (Batch, T, C, F)
- Returns
(Batch,)
- Return type
loss
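A minimal usage sketch; real-valued spectra are used here purely to illustrate the documented shapes (complex spectra or masks may be expected depending on compute_on_mask and the surrounding model):
    import torch
    from espnet2.enh.loss.criterions.tf_domain import FrequencyDomainMSE

    mse_loss = FrequencyDomainMSE(compute_on_mask=False, mask_type="IBM")
    batch, frames, freqs = 4, 100, 129
    ref = torch.randn(batch, frames, freqs)  # reference spectrum (Batch, T, F)
    inf = torch.randn(batch, frames, freqs)  # estimated spectrum (Batch, T, F)
    loss = mse_loss(ref, inf)                # shape (Batch,)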
-
property
mask_type
¶
-
espnet2.enh.loss.criterions.__init__¶
espnet2.enh.loss.criterions.time_domain¶
-
class
espnet2.enh.loss.criterions.time_domain.
CISDRLoss
(filter_length=512, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
CI-SDR loss
- Reference:
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation; C. Boeddeker et al., 2021; https://arxiv.org/abs/2011.15003
- Parameters
ref – (Batch, samples)
inf – (Batch, samples)
filter_length (int) – length of the time-invariant filter that allows slight distortion via filtering
- Returns
(Batch,)
- Return type
loss
-
forward
(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
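A minimal usage sketch on raw waveforms, following the shapes documented above; the batch size and signal length are arbitrary:
    import torch
    from espnet2.enh.loss.criterions.time_domain import CISDRLoss

    ci_sdr_loss = CISDRLoss(filter_length=512)
    ref = torch.randn(2, 32000)   # (Batch, samples) reference signals
    inf = torch.randn(2, 32000)   # (Batch, samples) estimated signals
    loss = ci_sdr_loss(ref, inf)  # shape (Batch,)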
-
class
espnet2.enh.loss.criterions.time_domain.
MultiResL1SpecLoss
(window_sz=[512], hop_sz=None, eps=1e-08, time_domain_weight=0.5, name=None, only_for_test=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
Multi-Resolution L1 time-domain + STFT magnitude loss.
Reference: Lu, Y. J., Cornell, S., Chang, X., Zhang, W., Li, C., Ni, Z., … & Watanabe, S. Towards Low-Distortion Multi-Channel Speech Enhancement: The ESPnet-SE Submission to the L3DAS22 Challenge. ICASSP 2022, pp. 9201-9205.
-
window_sz
¶ (list) list of STFT window sizes.
-
hop_sz
¶ (list, optional) list of hop_sizes, default is each window_sz // 2.
-
eps
¶ (float) stability epsilon
-
time_domain_weight
¶ (float) weight for time domain loss.
-
forward
(target: torch.Tensor, estimate: torch.Tensor)[source]¶ forward.
- Parameters
target – (Batch, T)
estimate – (Batch, T)
- Returns
(Batch,)
- Return type
loss
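A minimal usage sketch with three STFT resolutions; the window sizes are illustrative, and hop sizes default to half of each window as described above:
    import torch
    from espnet2.enh.loss.criterions.time_domain import MultiResL1SpecLoss

    loss_fn = MultiResL1SpecLoss(window_sz=[256, 512, 1024], time_domain_weight=0.5)
    target = torch.randn(4, 16000)    # (Batch, T) reference waveform
    estimate = torch.randn(4, 16000)  # (Batch, T) estimated waveform
    loss = loss_fn(target, estimate)  # shape (Batch,)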
-
property
name
¶
-
class
espnet2.enh.loss.criterions.time_domain.
SDRLoss
(filter_length=512, use_cg_iter=None, clamp_db=None, zero_mean=True, load_diag=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
SDR loss.
- filter_length: int
The length of the distortion filter allowed (default:
512
)- use_cg_iter:
If provided, an iterative method is used to solve for the distortion filter coefficients instead of direct Gaussian elimination. This can speed up the computation of the metrics in case the filters are long. Using a value of 10 here has been shown to provide good accuracy in most cases and is sufficient when using this loss to train neural separation networks.
- clamp_db: float
clamp the output value in [-clamp_db, clamp_db]
- zero_mean: bool
When set to True, the mean of all signals is subtracted prior to computing the loss.
- load_diag:
If provided, this small value is added to the diagonal coefficients of the system matrices when solving for the filter coefficients. This can help stabilize the metric when some of the reference signals may sometimes be zero.
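A minimal usage sketch; use_cg_iter=10 follows the recommendation above, and the remaining values are illustrative:
    import torch
    from espnet2.enh.loss.criterions.time_domain import SDRLoss

    sdr_loss = SDRLoss(filter_length=512, use_cg_iter=10, clamp_db=30.0, zero_mean=True)
    ref = torch.randn(2, 32000)  # (Batch, samples) reference signals
    inf = torch.randn(2, 32000)  # (Batch, samples) estimated signals
    loss = sdr_loss(ref, inf)    # per-utterance loss values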
-
class
espnet2.enh.loss.criterions.time_domain.
SISNRLoss
(clamp_db=None, zero_mean=True, eps=None, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
SI-SNR (also known as SI-SDR) loss
A more stable SI-SNR loss with clamp from fast_bss_eval.
-
clamp_db
¶ float clamp the output value in [-clamp_db, clamp_db]
-
zero_mean
¶ bool When set to True, the mean of all signals is subtracted prior to computing the loss.
-
eps
¶ float Deprecated. Kept for compatibility.
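A minimal usage sketch; clamp_db=50.0 is an arbitrary choice to bound the loss value as described above:
    import torch
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss

    si_snr_loss = SISNRLoss(clamp_db=50.0, zero_mean=True)
    ref = torch.randn(4, 16000)   # (Batch, samples) reference speech
    inf = torch.randn(4, 16000)   # (Batch, samples) estimated speech
    loss = si_snr_loss(ref, inf)  # per-utterance loss values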
-
class
espnet2.enh.loss.criterions.time_domain.
SNRLoss
(eps=1.1920928955078125e-07, name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
-
forward
(ref: torch.Tensor, inf: torch.Tensor) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
espnet2.enh.loss.criterions.time_domain.
TimeDomainL1
(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
-
class
espnet2.enh.loss.criterions.time_domain.
TimeDomainLoss
(name, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss
,abc.ABC
Base class for all time-domain Enhancement loss modules.
-
property
is_dereverb_loss
¶
-
property
is_noise_loss
¶
-
property
name
¶
-
property
only_for_test
¶
-
class
espnet2.enh.loss.criterions.time_domain.
TimeDomainMSE
(name=None, only_for_test=False, is_noise_loss=False, is_dereverb_loss=False)[source]¶ Bases:
espnet2.enh.loss.criterions.time_domain.TimeDomainLoss
espnet2.enh.loss.criterions.abs_loss¶
-
class
espnet2.enh.loss.criterions.abs_loss.
AbsEnhLoss
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Base class for all Enhancement loss modules.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(ref, inf) → torch.Tensor[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
property
name
¶
-
property
only_for_test
¶
-
espnet2.enh.loss.wrappers.__init__¶
espnet2.enh.loss.wrappers.fixed_order¶
-
class
espnet2.enh.loss.wrappers.fixed_order.
FixedOrderSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
-
forward
(ref, inf, others={})[source]¶ A naive fixed-order solver.
- Parameters
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
- Returns
(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: reserved
- Return type
loss
-
espnet2.enh.loss.wrappers.abs_wrapper¶
-
class
espnet2.enh.loss.wrappers.abs_wrapper.
AbsLossWrapper
[source]¶ Bases:
torch.nn.modules.module.Module
,abc.ABC
Base class for all Enhancement loss wrapper modules.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
-
abstract
forward
(ref: List, inf: List, others: Dict) → Tuple[torch.Tensor, Dict, Dict][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
weight
= 1.0¶
-
espnet2.enh.loss.wrappers.pit_solver¶
-
class
espnet2.enh.loss.wrappers.pit_solver.
PITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True, flexible_numspk=False)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Permutation Invariant Training Solver.
- Parameters
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.
independent_perm (bool) –
If True, PIT will be performed in forward to find the best permutation; if False, the permutation from the last LossWrapper output will be inherited. NOTE (wangyou): you should be careful about the ordering of loss wrappers defined in the yaml config if this argument is False.
flexible_numspk (bool) – If True, num_spk will be taken from inf to handle flexible numbers of speakers. This is because ref may include dummy data in this case.
-
forward
(ref, inf, others={})[source]¶ PITSolver forward.
- Parameters
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
- Returns
(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: dict, in this PIT solver, permutation order will be returned
- Return type
loss
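A minimal usage sketch wrapping SISNRLoss for a two-speaker setup; the waveform shapes are illustrative, and the exact contents of stats and others depend on the criterion and solver:
    import torch
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss
    from espnet2.enh.loss.wrappers.pit_solver import PITSolver

    solver = PITSolver(criterion=SISNRLoss(), weight=1.0, independent_perm=True)
    n_spk = 2
    ref = [torch.randn(4, 16000) for _ in range(n_spk)]  # one reference per speaker
    inf = [torch.randn(4, 16000) for _ in range(n_spk)]  # one estimate per speaker
    loss, stats, others = solver(ref, inf)
    # loss: minimum loss under the best permutation
    # stats: training statistics; others: carries the selected permutation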
espnet2.enh.loss.wrappers.multilayer_pit_solver¶
-
class
espnet2.enh.loss.wrappers.multilayer_pit_solver.
MultiLayerPITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0, independent_perm=True)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Multi-Layer Permutation Invariant Training Solver.
Compute the PIT loss given inferences from multiple layers and a single reference. It also supports a single inference and a single reference at the evaluation stage.
- Parameters
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.
independent_perm (bool) – If True, PIT will be performed in forward to find the best permutation; If False, the permutation from the last LossWrapper output will be inherited. Note: You should be careful about the ordering of loss wrappers defined in the yaml config, if this argument is False.
-
forward
(ref, infs, others={})[source]¶ Permutation invariant training solver.
- Parameters
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
infs (Union[List[torch.Tensor], List[List[torch.Tensor]]]) – [(batch, …), …]
- Returns
(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: dict, in this PIT solver, permutation order will be returned
- Return type
loss
espnet2.enh.loss.wrappers.mixit_solver¶
-
class
espnet2.enh.loss.wrappers.mixit_solver.
MixITSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight: float = 1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
Mixture Invariant Training Solver.
- Parameters
criterion (AbsEnhLoss) – an instance of AbsEnhLoss
weight (float) – weight (between 0 and 1) of current loss for multi-task learning.
-
forward
(ref: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], inf: Union[List[torch.Tensor], List[torch_complex.tensor.ComplexTensor]], others: Dict = {})[source]¶ MixIT solver.
- Parameters
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …] x n_est
- Returns
(torch.Tensor): minimum loss with the best permutation
stats: dict, for collecting training status
others: dict, in this MixIT solver, the permutation order will be returned
- Return type
loss
-
property
name
¶
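A minimal usage sketch of mixture invariant training; the common mixture-of-mixtures setup with two reference mixtures and four estimated sources is assumed here:
    import torch
    from espnet2.enh.loss.criterions.time_domain import SISNRLoss
    from espnet2.enh.loss.wrappers.mixit_solver import MixITSolver

    solver = MixITSolver(criterion=SISNRLoss(), weight=1.0)
    ref = [torch.randn(4, 16000) for _ in range(2)]  # n_mix reference mixtures
    inf = [torch.randn(4, 16000) for _ in range(4)]  # n_est estimated sources
    loss, stats, others = solver(ref, inf)  # best mixing assignment is chosen internally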
espnet2.enh.loss.wrappers.dpcl_solver¶
-
class
espnet2.enh.loss.wrappers.dpcl_solver.
DPCLSolver
(criterion: espnet2.enh.loss.criterions.abs_loss.AbsEnhLoss, weight=1.0)[source]¶ Bases:
espnet2.enh.loss.wrappers.abs_wrapper.AbsLossWrapper
-
forward
(ref, inf, others={})[source]¶ A naive DPCL solver
- Parameters
ref (List[torch.Tensor]) – [(batch, …), …] x n_spk
inf (List[torch.Tensor]) – [(batch, …), …]
others (Dict) – other data used by this solver, e.g. “tf_embedding”: the learned embedding of all T-F bins (B, T * F, D)
- Returns
(torch.Tensor): minimum loss with the best permutation
stats: (dict), for collecting training status
others: reserved
- Return type
loss
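A minimal usage sketch pairing this solver with FrequencyDomainDPCL; the spectrogram and embedding shapes are illustrative, and "tf_embedding" is the key documented above:
    import torch
    from espnet2.enh.loss.criterions.tf_domain import FrequencyDomainDPCL
    from espnet2.enh.loss.wrappers.dpcl_solver import DPCLSolver

    solver = DPCLSolver(criterion=FrequencyDomainDPCL(), weight=0.5)
    batch, frames, freqs, emb_dim, n_spk = 4, 100, 129, 40, 2
    ref = [torch.rand(batch, frames, freqs) for _ in range(n_spk)]  # per-speaker spectra
    inf = [torch.rand(batch, frames, freqs) for _ in range(n_spk)]  # separator outputs
    others = {"tf_embedding": torch.randn(batch, frames * freqs, emb_dim)}
    loss, stats, others = solver(ref, inf, others)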
-