espnet2.bin package

espnet2.bin.st_inference

class espnet2.bin.st_inference.Speech2Text(st_train_config: Union[pathlib.Path, str] = None, st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, enh_s2t_task: bool = False)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("st_config.yml", "st.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

Speech2Text instance.

Return type

Speech2Text
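
A brief usage sketch of from_pretrained; the model tag below is a placeholder rather than a published model, and extra keyword arguments such as beam_size are assumed to be forwarded to the constructor:

>>> import soundfile
>>> # "someuser/some_st_model" is a hypothetical espnet_model_zoo tag
>>> speech2text = Speech2Text.from_pretrained("someuser/some_st_model", beam_size=10)
>>> audio, rate = soundfile.read("speech.wav")
>>> text, tokens, token_ints, hyp = speech2text(audio)[0]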

espnet2.bin.st_inference.get_parser()[source]
espnet2.bin.st_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: Optional[str], st_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, enh_s2t_task: bool)[source]
espnet2.bin.st_inference.main(cmd=None)[source]

espnet2.bin.aggregate_stats_dirs

espnet2.bin.aggregate_stats_dirs.aggregate_stats_dirs(input_dir: Iterable[Union[str, pathlib.Path]], output_dir: Union[str, pathlib.Path], log_level: str, skip_sum_stats: bool)[source]
espnet2.bin.aggregate_stats_dirs.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.aggregate_stats_dirs.main(cmd=None)[source]

espnet2.bin.split_scps

espnet2.bin.split_scps.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.split_scps.main(cmd=None)[source]
espnet2.bin.split_scps.split_scps(scps: List[str], num_splits: int, names: Optional[List[str]], output_dir: str, log_level: str)[source]

espnet2.bin.__init__

espnet2.bin.enh_scoring

espnet2.bin.enh_scoring.get_parser()[source]
espnet2.bin.enh_scoring.main(cmd=None)[source]
espnet2.bin.enh_scoring.scoring(output_dir: str, dtype: str, log_level: Union[int, str], key_file: str, ref_scp: List[str], inf_scp: List[str], ref_channel: int, flexible_numspk: bool)[source]

espnet2.bin.lm_train

espnet2.bin.lm_train.get_parser()[source]
espnet2.bin.lm_train.main(cmd=None)[source]

LM training.

Example

% python lm_train.py asr --print_config --optim adadelta
% python lm_train.py --config conf/train_asr.yaml

espnet2.bin.diar_train

espnet2.bin.diar_train.get_parser()[source]
espnet2.bin.diar_train.main(cmd=None)[source]

Speaker diarization training.

Example

% python diar_train.py diar --print_config --optim adadelta > conf/train_diar.yaml
% python diar_train.py --config conf/train_diar.yaml

espnet2.bin.launch

espnet2.bin.launch.get_parser()[source]
espnet2.bin.launch.main(cmd=None)[source]

espnet2.bin.hubert_train

espnet2.bin.hubert_train.get_parser()[source]
espnet2.bin.hubert_train.main(cmd=None)[source]

Hubert pretraining.

Example

% python hubert_train.py asr --print_config --optim adadelta > conf/hubert_asr.yaml
% python hubert_train.py --config conf/train_asr.yaml

espnet2.bin.asr_align

Perform CTC segmentation to align utterances within audio files.

class espnet2.bin.asr_align.CTCSegmentation(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, fs: int = 16000, ngpu: int = 0, batch_size: int = 1, dtype: str = 'float32', kaldi_style_text: bool = True, text_converter: str = 'tokenize', time_stamps: str = 'auto', **ctc_segmentation_args)[source]

Bases: object

Align text to audio using CTC segmentation.

Usage:

Initialize with the given ASR model and parameters. If needed, parameters for CTC segmentation can be set with set_config(·). Then call the instance as a function to align text within an audio file.

Example

>>> # example file included in the ESPnet repository
>>> import soundfile
>>> speech, fs = soundfile.read("test_utils/ctc_align_test.wav")
>>> # load an ASR model
>>> from espnet_model_zoo.downloader import ModelDownloader
>>> d = ModelDownloader()
>>> wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
>>> # Apply CTC segmentation
>>> aligner = CTCSegmentation(**wsjmodel)
>>> text = ["utt1 THE SALE OF THE HOTELS", "utt2 ON PROPERTY MANAGEMENT"]
>>> aligner.set_config(gratis_blank=True)
>>> segments = aligner(speech, text, fs=fs)
>>> print(segments)
utt1 utt 0.27 1.72 -0.1663 THE SALE OF THE HOTELS
utt2 utt 4.54 6.10 -4.9646 ON PROPERTY MANAGEMENT
On multiprocessing:

To parallelize the computation with multiprocessing, these three steps can be separated: (1) get_lpz: obtain the lpz, (2) prepare_segmentation_task: prepare the task, and (3) get_segments: perform CTC segmentation. Note that the function get_segments is a staticmethod and therefore independent of an already initialized CTCSegmentation object.
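
A minimal sketch of that three-step split, reusing the aligner and inputs from the example above; merging the result dictionary back via set() follows the get_segments() description below and is an assumption of this sketch:

>>> # (1) run the ASR model once to obtain CTC log posteriors
>>> lpz = aligner.get_lpz(speech)
>>> # (2) gather text, lpz, and timing configuration into a picklable task object
>>> task = aligner.prepare_segmentation_task(text, lpz, name="utt", speech_len=speech.shape[0])
>>> # (3) CPU-only alignment; a staticmethod, so it can run in a worker process
>>> result = CTCSegmentation.get_segments(task)
>>> task.set(**result)
>>> print(task)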

References

CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition 2020, Kürzinger, Winkelbauer, Li, Watzel, Rigoll https://arxiv.org/abs/2007.09127

More parameters are described in https://github.com/lumaku/ctc-segmentation

Initialize the CTCSegmentation module.

Parameters
  • asr_train_config – ASR model config file (yaml).

  • asr_model_file – ASR model file (pth).

  • fs – Sample rate of audio file.

  • ngpu – Number of GPUs. Set 0 for processing on CPU, set to 1 for processing on GPU. Multi-GPU aligning is currently not implemented. Default: 0.

  • batch_size – Currently, only batch size == 1 is implemented.

  • dtype – Data type used for inference. Set dtype according to the ASR model.

  • kaldi_style_text – A kaldi-style text file includes the name of the utterance at the start of the line. If True, the utterance name is expected as first word at each line. If False, utterance names are automatically generated. Set this option according to your input data. Default: True.

  • text_converter – How CTC segmentation handles text. “tokenize”: Use ESPnet 2 preprocessing to tokenize the text. “classic”: The text is preprocessed as in ESPnet 1 which takes token length into account. If the ASR model has longer tokens, this option may yield better results. Default: “tokenize”.

  • time_stamps – Method used to calculate the time stamps. Both “fixed” and “auto” use the sample rate; the ratio of samples per encoded frame is either determined automatically for each inference (“auto”) or fixed at a ratio that is initially determined by the module but can be changed via the parameter samples_to_frames_ratio (“fixed”). Recommended for longer audio files: “auto”.

  • **ctc_segmentation_args – Parameters for CTC segmentation.

choices_text_converter = ['tokenize', 'classic']
choices_time_stamps = ['auto', 'fixed']
config = CtcSegmentationParameters()
estimate_samples_to_frames_ratio(speech_len=215040)[source]

Determine the ratio of encoded frames to sample points.

This method helps to determine the time a single encoded frame occupies. As the sample rate is already known, only the ratio of samples per encoded CTC frame is needed. This function estimates it by performing one inference, which is only needed once.

Parameters

speech_len – Length of randomly generated speech vector for single inference. Default: 215040.

Returns

Estimated ratio.

Return type

samples_to_frames_ratio

fs = 16000
get_lpz(speech: Union[torch.Tensor, numpy.ndarray])[source]

Obtain CTC posterior log probabilities for given speech data.

Parameters

speech – Speech audio input.

Returns

Numpy vector with CTC log posterior probabilities.

Return type

lpz

static get_segments(task: espnet2.bin.asr_align.CTCSegmentationTask)[source]

Obtain segments for given utterance texts and CTC log posteriors.

Parameters

task – CTCSegmentationTask object that contains ground truth and CTC posterior probabilities.

Returns

Dictionary with alignments. Combine this with the task object to obtain a human-readable segments representation.

Return type

result

get_timing_config(speech_len=None, lpz_len=None)[source]

Obtain parameters to determine time stamps.

prepare_segmentation_task(text, lpz, name=None, speech_len=None)[source]

Preprocess text, and gather text and lpz into a task object.

Text is pre-processed and tokenized depending on configuration. If speech_len is given, the timing configuration is updated. Text, lpz, and configuration is collected in a CTCSegmentationTask object. The resulting object can be serialized and passed in a multiprocessing computation.

A minimal amount of text processing is done, i.e., splitting the utterances in text into a list and applying text_cleaner. It is recommended that you normalize the text beforehand, e.g., change numbers into their spoken equivalents, remove special characters, and convert UTF-8 characters to chars corresponding to your ASR model dictionary.

The text is tokenized based on the text_converter setting:

The “tokenize” method is more efficient and the easiest for models based on latin or cyrillic script that only contain the main chars, [“a”, “b”, …] or for Japanese or Chinese ASR models with ~3000 short Kanji / Hanzi tokens.

The “classic” method improves the accuracy of the alignments for models that contain longer tokens, but at a greater computational cost. The function scans for partial tokens, which may improve time resolution. For example, the word “▁really” will be broken down into ['▁', '▁r', '▁re', '▁real', '▁really']. The alignment will be based on the most probable activation sequence given by the network.

Parameters
  • text – List or multiline-string with utterance ground truths.

  • lpz – Log CTC posterior probabilities obtained from the CTC-network; numpy array shaped as ( <time steps>, <classes> ).

  • name – Audio file name. Choose a unique name, or the original audio file name, to distinguish multiple audio files. Default: None.

  • speech_len – Number of sample points. If given, the timing configuration is automatically derived from fs, the length of the speech, and the length of lpz. If None is given, make sure the timing parameters are correct; see time_stamps for reference! Default: None.

Returns

CTCSegmentationTask object that can be passed to get_segments() in order to obtain alignments.

Return type

task

samples_to_frames_ratio = None
set_config(**kwargs)[source]

Set CTC segmentation parameters.

Parameters for timing:

time_stamps: Select the method by which the CTC index duration is estimated, and thus how the time stamps are calculated.

fs: Sample rate.

samples_to_frames_ratio: If you want to directly determine the ratio of samples to CTC frames, set this parameter, and set time_stamps to “fixed”. Note: If you want to calculate the time stamps as in ESPnet 1, set this parameter to: subsampling_factor * frame_duration / 1000.

Parameters for text preparation:

set_blank: Index of blank in token list. Default: 0.

replace_spaces_with_blanks: Inserts blanks between words, which is useful for handling long pauses between words. Only used in text_converter="classic" preprocessing mode. Default: False.

kaldi_style_text: Determines whether the utterance name is expected as the first word of the utterance. Set at module initialization.

text_converter: How CTC segmentation handles text. Set at module initialization.

Parameters for alignment:

min_window_size: Minimum number of frames considered for a single utterance. The current default value of 8000 corresponds to roughly 4 minutes (depending on the ASR model) and should be OK in most cases. If your utterances are further apart, increase this value, or decrease it for smaller audio files.

max_window_size: Maximum window size. It should not be necessary to change this value.

gratis_blank: If True, the transition cost of blank is set to zero. Useful for long preambles or if there are large unrelated segments between utterances. Default: False.

Parameters for calculation of confidence score:

scoring_length: Block length to calculate confidence score. The default value of 30 should be OK in most cases.
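
A short sketch of tuning these options on the aligner from the example above; the concrete values (window size, subsampling factor of 4, 10 ms frames) are illustrative assumptions only:

>>> # relax blank transitions and enlarge the search window for sparse utterances
>>> aligner.set_config(gratis_blank=True, min_window_size=12000)
>>> # ESPnet 1 style timing, per the formula above (4 and 10 ms are illustrative)
>>> aligner.set_config(time_stamps="fixed", samples_to_frames_ratio=4 * 10 / 1000)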

text_converter = 'tokenize'
time_stamps = 'auto'
warned_about_misconfiguration = False
class espnet2.bin.asr_align.CTCSegmentationTask(**kwargs)[source]

Bases: object

Task object for CTC segmentation.

When formatted with str(·), this object returns results in a kaldi-style segments file formatting. The human-readable output can be configured with the printing options.

Properties:

text: Utterance texts, separated by line, but without the utterance name at the beginning of the line (as in kaldi-style text).

ground_truth_mat: Ground truth matrix (CTC segmentation).

utt_begin_indices: Utterance separators for the ground truth matrix.

timings: Time marks of the corresponding chars.

state_list: Estimated alignment of chars/tokens.

segments: Calculated segments as: (start, end, confidence score).

config: CTC segmentation configuration object.

name: Name of the aligned audio file (Optional). If given, the name is considered when generating the text.

utt_ids: The list of utterance names (Optional). This list should have the same length as the number of utterances.

lpz: CTC posterior log probabilities (Optional).

Properties for printing:

print_confidence_score: Includes the confidence score.

print_utterance_text: Includes the utterance text.

Initialize the module.

char_probs = None
config = None
done = False
ground_truth_mat = None
lpz = None
name = 'utt'
print_confidence_score = True
print_utterance_text = True
segments = None
set(**kwargs)[source]

Update properties.

Parameters

**kwargs – Key-value dict that contains all properties with their new values. Unknown properties are ignored.

state_list = None
text = None
timings = None
utt_begin_indices = None
utt_ids = None
espnet2.bin.asr_align.ctc_align(log_level: Union[int, str], asr_train_config: str, asr_model_file: str, audio: pathlib.Path, text: TextIO, output: TextIO, print_utt_text: bool = True, print_utt_score: bool = True, **kwargs)[source]

Provide the scripting interface to align text to audio.

espnet2.bin.asr_align.get_parser()[source]

Obtain an argument-parser for the script interface.

espnet2.bin.asr_align.main(cmd=None)[source]

Parse arguments and start the alignment in ctc_align(·).

espnet2.bin.mt_train

espnet2.bin.mt_train.get_parser()[source]
espnet2.bin.mt_train.main(cmd=None)[source]

MT training.

Example

% python mt_train.py st --print_config --optim adadelta > conf/train_mt.yaml
% python mt_train.py --config conf/train_mt.yaml

espnet2.bin.enh_inference

class espnet2.bin.enh_inference.SeparateSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, inference_config: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, ref_channel: Optional[int] = None, normalize_output_wav: bool = False, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False)[source]

Bases: object

SeparateSpeech class

Examples

>>> import soundfile
>>> separate_speech = SeparateSpeech("enh_config.yml", "enh.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> separate_speech(audio)
[separated_audio1, separated_audio2, ...]
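
For long recordings, the segment-wise options shown in the constructor signature can be used; a hedged sketch in which segment_size and hop_size are assumed to be in seconds and the values are purely illustrative:

>>> import soundfile
>>> separate_speech = SeparateSpeech(
...     "enh_config.yml",
...     "enh.pth",
...     segment_size=2.4,        # process the mixture in short segments (assumed: seconds)
...     hop_size=0.8,            # hop between adjacent segments (assumed: seconds)
...     normalize_segment_scale=True,
...     normalize_output_wav=True,
... )
>>> audio, rate = soundfile.read("long_mixture.wav")
>>> separated = separate_speech(audio)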
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]

Calculate the permutation between separated streams in two adjacent segments.

Parameters
  • ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • criterion (str) – one of (“si_snr”, “mse”, “corr”)

Returns

permutation for enh_wavs (Batch, num_spk)

Return type

perm (torch.Tensor)

static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build SeparateSpeech instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

SeparateSpeech instance.

Return type

SeparateSpeech

espnet2.bin.enh_inference.build_model_from_args_and_file(task, args, model_file, device)[source]
espnet2.bin.enh_inference.get_parser()[source]
espnet2.bin.enh_inference.get_train_config(train_config, model_file=None)[source]
espnet2.bin.enh_inference.humanfriendly_or_none(value: str)[source]
espnet2.bin.enh_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], inference_config: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, ref_channel: Optional[int], normalize_output_wav: bool, enh_s2t_task: bool)[source]
espnet2.bin.enh_inference.main(cmd=None)[source]
espnet2.bin.enh_inference.recursive_dict_update(dict_org, dict_patch, verbose=False, log_prefix='')[source]

Update dict_org with dict_patch in-place recursively.
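
An illustrative sketch of the in-place recursive merge; the exact handling of non-dict values is an assumption:

>>> dict_org = {"encoder": {"num_blocks": 12, "dropout": 0.1}, "optim": "adam"}
>>> dict_patch = {"encoder": {"dropout": 0.2}}
>>> recursive_dict_update(dict_org, dict_patch)  # modifies dict_org in place
>>> dict_org["encoder"]["dropout"]
0.2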

espnet2.bin.mt_inference

class espnet2.bin.mt_inference.Text2Text(mt_train_config: Union[pathlib.Path, str] = None, mt_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1)[source]

Bases: object

Text2Text class

Examples

>>> text2text = Text2Text("mt_config.yml", "mt.pth")
>>> text2text(src_text)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Text2Text instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

Text2Text instance.

Return type

Text2Text

espnet2.bin.mt_inference.get_parser()[source]
espnet2.bin.mt_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], mt_train_config: Optional[str], mt_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool)[source]
espnet2.bin.mt_inference.main(cmd=None)[source]

espnet2.bin.tts_inference

Script to run the inference of a text-to-speech model.

class espnet2.bin.tts_inference.Text2Speech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False, use_att_constraint: bool = False, backward_window: int = 1, forward_window: int = 3, speed_control_alpha: float = 1.0, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, vocoder_config: Union[pathlib.Path, str] = None, vocoder_file: Union[pathlib.Path, str] = None, dtype: str = 'float32', device: str = 'cpu', seed: int = 777, always_fix_seed: bool = False, prefer_normalized_feats: bool = False)[source]

Bases: object

Text2Speech class.

Examples

>>> from espnet2.bin.tts_inference import Text2Speech
>>> # Case 1: Load the local model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>> )
>>> # Case 2: Load the local model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     train_config="/path/to/config.yml",
>>>     model_file="/path/to/model.pth",
>>>     vocoder_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 3: Load the pretrained model and use Griffin-Lim vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>> )
>>> # Case 4: Load the pretrained model and the pretrained vocoder
>>> text2speech = Text2Speech.from_pretrained(
>>>     model_tag="kan-bayashi/ljspeech_tacotron2",
>>>     vocoder_tag="parallel_wavegan/ljspeech_parallel_wavegan.v1",
>>> )
>>> # Run inference and save as wav file
>>> import soundfile as sf
>>> wav = text2speech("Hello, World")["wav"]
>>> sf.write("out.wav", wav.numpy(), text2speech.fs, "PCM_16")

Initialize Text2Speech module.

static from_pretrained(model_tag: Optional[str] = None, vocoder_tag: Optional[str] = None, **kwargs)[source]

Build Text2Speech instance from the pretrained model.

Parameters
  • model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

  • vocoder_tag (Optional[str]) – Vocoder tag of the pretrained vocoders. Currently, the tags of parallel_wavegan are supported, which should start with the prefix “parallel_wavegan/”.

Returns

Text2Speech instance.

Return type

Text2Speech

property fs

Return sampling rate.

property use_lids

Return whether lid is needed in the inference.

property use_sids

Return whether sid is needed in the inference.

property use_speech

Return whether speech is needed in the inference.

property use_spembs

Return whether spemb is needed in the inference.
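
A hedged sketch of using these properties to pass only the auxiliary inputs the loaded model needs, reusing the text2speech instance from the examples above; the keyword names (spembs, sids, lids) and array contents are assumptions for illustration:

>>> import numpy as np
>>> kwargs = {}
>>> if text2speech.use_spembs:
...     kwargs["spembs"] = np.load("xvector.npy")   # hypothetical speaker embedding
>>> if text2speech.use_sids:
...     kwargs["sids"] = np.array([1])              # hypothetical speaker id
>>> if text2speech.use_lids:
...     kwargs["lids"] = np.array([0])              # hypothetical language id
>>> wav = text2speech("Hello, World", **kwargs)["wav"]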

espnet2.bin.tts_inference.get_parser()[source]

Get argument parser.

espnet2.bin.tts_inference.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], threshold: float, minlenratio: float, maxlenratio: float, use_teacher_forcing: bool, use_att_constraint: bool, backward_window: int, forward_window: int, speed_control_alpha: float, noise_scale: float, noise_scale_dur: float, always_fix_seed: bool, allow_variable_data_keys: bool, vocoder_config: Optional[str], vocoder_file: Optional[str], vocoder_tag: Optional[str])[source]

Run text-to-speech inference.

espnet2.bin.tts_inference.main(cmd=None)[source]

Run TTS model inference.

espnet2.bin.enh_s2t_train

espnet2.bin.enh_s2t_train.get_parser()[source]
espnet2.bin.enh_s2t_train.main(cmd=None)[source]

EnhS2T training.

Example

% python enh_s2t_train.py enh_s2t --print_config --optim adadelta > conf/train_enh_s2t.yaml
% python enh_s2t_train.py --config conf/train_enh_s2t.yaml

espnet2.bin.gan_tts_train

espnet2.bin.gan_tts_train.get_parser()[source]
espnet2.bin.gan_tts_train.main(cmd=None)[source]

GAN-based TTS training.

Example

% python gan_tts_train.py --print_config --optim1 adadelta
% python gan_tts_train.py --config conf/train.yaml

espnet2.bin.tts_train

espnet2.bin.tts_train.get_parser()[source]
espnet2.bin.tts_train.main(cmd=None)[source]

TTS training.

Example

% python tts_train.py asr --print_config --optim adadelta
% python tts_train.py --config conf/train_asr.yaml

espnet2.bin.diar_inference

class espnet2.bin.diar_inference.DiarizeSpeech(train_config: Union[pathlib.Path, str] = None, model_file: Union[pathlib.Path, str] = None, segment_size: Optional[float] = None, hop_size: Optional[float] = None, normalize_segment_scale: bool = False, show_progressbar: bool = False, normalize_output_wav: bool = False, num_spk: Optional[int] = None, device: str = 'cpu', dtype: str = 'float32', enh_s2t_task: bool = False, multiply_diar_result: bool = False)[source]

Bases: object

DiarizeSpeech class

Examples

>>> import soundfile
>>> diarization = DiarizeSpeech("diar_config.yaml", "diar.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> diarization(audio)
[(spk_id, start, end), (spk_id2, start2, end2)]
cal_permumation(ref_wavs, enh_wavs, criterion='si_snr')[source]

Calculate the permutation between separated streams in two adjacent segments.

Parameters
  • ref_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • enh_wavs (List[torch.Tensor]) – [(Batch, Nsamples)]

  • criterion (str) – one of (“si_snr”, “mse”, “corr”)

Returns

permutation for enh_wavs (Batch, num_spk)

Return type

perm (torch.Tensor)

decode(encoder_out, encoder_out_lens)[source]
encode(speech, lengths)[source]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build DiarizeSpeech instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

DiarizeSpeech instance.

Return type

DiarizeSpeech

permute_diar(waves, spk_prediction)[source]
espnet2.bin.diar_inference.get_parser()[source]
espnet2.bin.diar_inference.inference(output_dir: str, batch_size: int, dtype: str, fs: int, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], model_tag: Optional[str], allow_variable_data_keys: bool, segment_size: Optional[float], hop_size: Optional[float], normalize_segment_scale: bool, show_progressbar: bool, num_spk: Optional[int], normalize_output_wav: bool, multiply_diar_result: bool, enh_s2t_task: bool)[source]
espnet2.bin.diar_inference.main(cmd=None)[source]

espnet2.bin.asr_inference_k2

espnet2.bin.slu_train

espnet2.bin.slu_train.get_parser()[source]
espnet2.bin.slu_train.main(cmd=None)[source]

SLU training.

Example

% python slu_train.py slu --print_config --optim adadelta > conf/train_slu.yaml
% python slu_train.py --config conf/train_slu.yaml

espnet2.bin.slu_inference

class espnet2.bin.slu_inference.Speech2Understand(slu_train_config: Union[pathlib.Path, str] = None, slu_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]

Bases: object

Speech2Understand class

Examples

>>> import soundfile
>>> speech2understand = Speech2Understand("slu_config.yml", "slu.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2understand(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Understand instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

Speech2Understand instance.

Return type

Speech2Understand

espnet2.bin.slu_inference.get_parser()[source]
espnet2.bin.slu_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], slu_train_config: Optional[str], slu_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]
espnet2.bin.slu_inference.main(cmd=None)[source]

espnet2.bin.asr_transducer_inference

Inference class definition for Transducer models.

class espnet2.bin.asr_transducer_inference.Speech2Text(asr_train_config: Union[pathlib.Path, str, None] = None, asr_model_file: Union[pathlib.Path, str, None] = None, beam_search_config: Optional[Dict[str, Any]] = None, lm_train_config: Union[pathlib.Path, str, None] = None, lm_file: Union[pathlib.Path, str, None] = None, token_type: Optional[str] = None, bpemodel: Optional[str] = None, device: str = 'cpu', beam_size: int = 5, dtype: str = 'float32', lm_weight: float = 1.0, quantize_asr_model: bool = False, quantize_modules: Optional[List[str]] = None, quantize_dtype: str = 'qint8', nbest: int = 1, streaming: bool = False, chunk_size: int = 16, left_context: int = 32, right_context: int = 0, display_partial_hypotheses: bool = False)[source]

Bases: object

Speech2Text class for Transducer models.

Parameters
  • asr_train_config – ASR model training config path.

  • asr_model_file – ASR model path.

  • beam_search_config – Beam search config path.

  • lm_train_config – Language Model training config path.

  • lm_file – Language Model path.

  • token_type – Type of token units.

  • bpemodel – BPE model path.

  • device – Device to use for inference.

  • beam_size – Size of beam during search.

  • dtype – Data type.

  • lm_weight – Language model weight.

  • quantize_asr_model – Whether to apply dynamic quantization to ASR model.

  • quantize_modules – List of module names to apply dynamic quantization on.

  • quantize_dtype – Dynamic quantization data type.

  • nbest – Number of final hypotheses.

  • streaming – Whether to perform chunk-by-chunk inference.

  • chunk_size – Number of frames in chunk AFTER subsampling.

  • left_context – Number of frames in left context AFTER subsampling.

  • right_context – Number of frames in right context AFTER subsampling.

  • display_partial_hypotheses – Whether to display partial hypotheses.

Construct a Speech2Text object.

apply_frontend(speech: torch.Tensor, is_final: bool = False) → Tuple[torch.Tensor, torch.Tensor][source]

Forward frontend.

Parameters
  • speech – Speech data. (S)

  • is_final – Whether speech corresponds to the final (or only) chunk of data.

Returns

feats: Features sequence. (1, T_in, F)

feats_lengths: Features sequence length. (1, T_in, F)

Return type

feats

static from_pretrained(model_tag: Optional[str] = None, **kwargs) → espnet2.bin.asr_transducer_inference.Speech2Text[source]

Build Speech2Text instance from the pretrained model.

Parameters

model_tag – Model tag of the pretrained models.

Returns

Speech2Text instance.

hypotheses_to_results(nbest_hyps: List[espnet2.asr_transducer.beam_search_transducer.Hypothesis]) → List[Any][source]

Build partial or final results from the hypotheses.

Parameters

nbest_hyps – N-best hypothesis.

Returns

Results containing different representation for the hypothesis.

Return type

results

reset_inference_cache() → None[source]

Reset Speech2Text parameters.

streaming_decode(speech: Union[torch.Tensor, numpy.ndarray], is_final: bool = True) → List[espnet2.asr_transducer.beam_search_transducer.Hypothesis][source]

Speech2Text streaming call.

Parameters
  • speech – Chunk of speech data. (S)

  • is_final – Whether speech corresponds to the final chunk of data.

Returns

N-best hypothesis.

Return type

nbest_hypothesis
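
A hedged streaming sketch built from the methods documented above; the chunk length in samples and the unpacking of hypotheses_to_results() into the offline (text, token, token_int, hyp) layout are assumptions:

>>> import soundfile
>>> speech2text = Speech2Text(
...     asr_train_config="asr_config.yml",
...     asr_model_file="asr.pth",
...     streaming=True,
...     chunk_size=16,
...     left_context=32,
... )
>>> audio, rate = soundfile.read("speech.wav")
>>> hop = 8192                                   # samples fed per call (illustrative)
>>> for start in range(0, len(audio), hop):
...     chunk = audio[start:start + hop]
...     hyps = speech2text.streaming_decode(chunk, is_final=start + hop >= len(audio))
>>> results = speech2text.hypotheses_to_results(hyps)
>>> speech2text.reset_inference_cache()          # clear state before the next utterance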

espnet2.bin.asr_transducer_inference.get_parser()[source]

Get Transducer model inference parser.

espnet2.bin.asr_transducer_inference.inference(output_dir: str, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], asr_train_config: Optional[str], asr_model_file: Optional[str], beam_search_config: Optional[dict], lm_train_config: Optional[str], lm_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], key_file: Optional[str], allow_variable_data_keys: bool, quantize_asr_model: Optional[bool], quantize_modules: Optional[List[str]], quantize_dtype: Optional[str], streaming: Optional[bool], chunk_size: Optional[int], left_context: Optional[int], right_context: Optional[int], display_partial_hypotheses: bool) → None[source]

Transducer model inference.

Parameters
  • output_dir – Output directory path.

  • batch_size – Batch decoding size.

  • dtype – Data type.

  • beam_size – Beam size.

  • ngpu – Number of GPUs.

  • seed – Random number generator seed.

  • lm_weight – Weight of language model.

  • nbest – Number of final hypotheses.

  • num_workers – Number of workers.

  • log_level – Verbosity level for logs.

  • data_path_and_name_and_type

  • asr_train_config – ASR model training config path.

  • asr_model_file – ASR model path.

  • beam_search_config – Beam search config path.

  • lm_train_config – Language Model training config path.

  • lm_file – Language Model path.

  • model_tag – Model tag.

  • token_type – Type of token units.

  • bpemodel – BPE model path.

  • key_file – File key.

  • allow_variable_data_keys – Whether to allow variable data keys.

  • quantize_asr_model – Whether to apply dynamic quantization to ASR model.

  • quantize_modules – List of module names to apply dynamic quantization on.

  • quantize_dtype – Dynamic quantization data type.

  • streaming – Whether to perform chunk-by-chunk inference.

  • chunk_size – Number of frames in chunk AFTER subsampling.

  • left_context – Number of frames in left context AFTER subsampling.

  • right_context – Number of frames in right context AFTER subsampling.

  • display_partial_hypotheses – Whether to display partial hypotheses.

espnet2.bin.asr_transducer_inference.main(cmd=None)[source]

espnet2.bin.pack

class espnet2.bin.pack.ASRPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['asr_model_file', 'lm_file']
yaml_files = ['asr_train_config', 'lm_train_config']
class espnet2.bin.pack.DiarPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.EnhPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
class espnet2.bin.pack.EnhS2TPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['enh_s2t_model_file', 'lm_file']
yaml_files = ['enh_s2t_train_config', 'lm_train_config']
class espnet2.bin.pack.PackedContents[source]

Bases: object

files = []
yaml_files = []
class espnet2.bin.pack.STPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['st_model_file']
yaml_files = ['st_train_config']
class espnet2.bin.pack.TTSPackedContents[source]

Bases: espnet2.bin.pack.PackedContents

files = ['model_file']
yaml_files = ['train_config']
espnet2.bin.pack.add_arguments(parser: argparse.ArgumentParser, contents: Type[espnet2.bin.pack.PackedContents])[source]
espnet2.bin.pack.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.pack.main(cmd=None)[source]

espnet2.bin.asr_transducer_train

espnet2.bin.asr_transducer_train.get_parser()[source]

Get parser for ASR Transducer task.

espnet2.bin.asr_transducer_train.main(cmd=None)[source]

ASR Transducer training.

Example

% python asr_transducer_train.py asr --print_config --optim adadelta > conf/train_asr.yaml
% python asr_transducer_train.py --config conf/tuning/transducer/train_rnn_transducer.yaml

espnet2.bin.tokenize_text

espnet2.bin.tokenize_text.field2slice(field: Optional[str]) → slice[source]

Convert a field string to a slice.

Note that the field string accepts 1-based integers.

Examples

>>> field2slice("1-")
slice(0, None, None)
>>> field2slice("1-3")
slice(0, 3, None)
>>> field2slice("-3")
slice(None, 3, None)
espnet2.bin.tokenize_text.get_parser() → argparse.ArgumentParser[source]
espnet2.bin.tokenize_text.main(cmd=None)[source]
espnet2.bin.tokenize_text.tokenize(input: str, output: str, field: Optional[str], delimiter: Optional[str], token_type: str, space_symbol: str, non_linguistic_symbols: Optional[str], bpemodel: Optional[str], log_level: str, write_vocabulary: bool, vocabulary_size: int, remove_non_linguistic_symbols: bool, cutoff: int, add_symbol: List[str], cleaner: Optional[str], g2p: Optional[str])[source]

espnet2.bin.asr_inference_maskctc

class espnet2.bin.asr_inference_maskctc.Speech2Text(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', batch_size: int = 1, dtype: str = 'float32', maskctc_n_iterations: int = 10, maskctc_threshold_probability: float = 0.99)[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

Speech2Text instance.

Return type

Speech2Text

espnet2.bin.asr_inference_maskctc.get_parser()[source]
espnet2.bin.asr_inference_maskctc.inference(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, maskctc_n_iterations: int, maskctc_threshold_probability: float)[source]
espnet2.bin.asr_inference_maskctc.main(cmd=None)[source]

espnet2.bin.lm_calc_perplexity

espnet2.bin.lm_calc_perplexity.calc_perplexity(output_dir: str, batch_size: int, dtype: str, ngpu: int, seed: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], train_config: Optional[str], model_file: Optional[str], log_base: Optional[float], allow_variable_data_keys: bool)[source]
espnet2.bin.lm_calc_perplexity.get_parser()[source]
espnet2.bin.lm_calc_perplexity.main(cmd=None)[source]

espnet2.bin.asr_inference

class espnet2.bin.asr_inference.Speech2Text(asr_train_config: Union[pathlib.Path, str] = None, asr_model_file: Union[pathlib.Path, str] = None, transducer_conf: dict = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, ngram_scorer: str = 'full', ngram_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, ngram_weight: float = 0.9, penalty: float = 0.0, nbest: int = 1, streaming: bool = False, enh_s2t_task: bool = False, quantize_asr_model: bool = False, quantize_lm: bool = False, quantize_modules: List[str] = ['Linear'], quantize_dtype: str = 'qint8')[source]

Bases: object

Speech2Text class

Examples

>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
static from_pretrained(model_tag: Optional[str] = None, **kwargs)[source]

Build Speech2Text instance from the pretrained model.

Parameters

model_tag (Optional[str]) – Model tag of the pretrained models. Currently, the tags of espnet_model_zoo are supported.

Returns

Speech2Text instance.

Return type

Speech2Text

espnet2.bin.asr_inference.get_parser()[source]
espnet2.bin.asr_inference.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, ngram_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: Optional[str], asr_model_file: Optional[str], lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], ngram_file: Optional[str], model_tag: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, transducer_conf: Optional[dict], streaming: bool, enh_s2t_task: bool, quantize_asr_model: bool, quantize_lm: bool, quantize_modules: List[str], quantize_dtype: str)[source]
espnet2.bin.asr_inference.main(cmd=None)[source]

espnet2.bin.asr_train

espnet2.bin.asr_train.get_parser()[source]
espnet2.bin.asr_train.main(cmd=None)[source]

ASR training.

Example

% python asr_train.py asr --print_config --optim adadelta > conf/train_asr.yaml
% python asr_train.py --config conf/train_asr.yaml

espnet2.bin.st_inference_streaming

class espnet2.bin.st_inference_streaming.Speech2TextStreaming(st_train_config: Union[pathlib.Path, str], st_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]

Bases: object

Speech2TextStreaming class

Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)

Examples

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
apply_frontend(speech: torch.Tensor, prev_states=None, is_final: bool = False)[source]
assemble_hyps(hyps)[source]
reset()[source]
espnet2.bin.st_inference_streaming.get_parser()[source]
espnet2.bin.st_inference_streaming.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], st_train_config: str, st_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]
espnet2.bin.st_inference_streaming.main(cmd=None)[source]

espnet2.bin.enh_train

espnet2.bin.enh_train.get_parser()[source]
espnet2.bin.enh_train.main(cmd=None)[source]

Enhancement frontend training.

Example

% python enh_train.py enh --print_config --optim adadelta > conf/train_enh.yaml
% python enh_train.py --config conf/train_enh.yaml

espnet2.bin.asr_inference_streaming

class espnet2.bin.asr_inference_streaming.Speech2TextStreaming(asr_train_config: Union[pathlib.Path, str], asr_model_file: Union[pathlib.Path, str] = None, lm_train_config: Union[pathlib.Path, str] = None, lm_file: Union[pathlib.Path, str] = None, token_type: str = None, bpemodel: str = None, device: str = 'cpu', maxlenratio: float = 0.0, minlenratio: float = 0.0, batch_size: int = 1, dtype: str = 'float32', beam_size: int = 20, ctc_weight: float = 0.5, lm_weight: float = 1.0, penalty: float = 0.0, nbest: int = 1, disable_repetition_detection=False, decoder_text_length_limit=0, encoded_feat_length_limit=0)[source]

Bases: object

Speech2TextStreaming class

Details in “Streaming Transformer ASR with Blockwise Synchronous Beam Search” (https://arxiv.org/abs/2006.14941)

Examples

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
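
The instance can also be fed chunk by chunk; a hedged sketch in which the is_final keyword of the call and the chunk length (mirroring the sim_chunk_length option of inference()) are assumptions:

>>> import soundfile
>>> speech2text = Speech2TextStreaming("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> chunk = 2048                                 # samples per call (illustrative)
>>> for i in range(0, len(audio), chunk):
...     results = speech2text(audio[i:i + chunk], is_final=i + chunk >= len(audio))
>>> text, tokens, token_ints, hyp = results[0]
>>> speech2text.reset()                          # reset internal state for the next utterance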
apply_frontend(speech: torch.Tensor, prev_states=None, is_final: bool = False)[source]
assemble_hyps(hyps)[source]
reset()[source]
espnet2.bin.asr_inference_streaming.get_parser()[source]
espnet2.bin.asr_inference_streaming.inference(output_dir: str, maxlenratio: float, minlenratio: float, batch_size: int, dtype: str, beam_size: int, ngpu: int, seed: int, ctc_weight: float, lm_weight: float, penalty: float, nbest: int, num_workers: int, log_level: Union[int, str], data_path_and_name_and_type: Sequence[Tuple[str, str, str]], key_file: Optional[str], asr_train_config: str, asr_model_file: str, lm_train_config: Optional[str], lm_file: Optional[str], word_lm_train_config: Optional[str], word_lm_file: Optional[str], token_type: Optional[str], bpemodel: Optional[str], allow_variable_data_keys: bool, sim_chunk_length: int, disable_repetition_detection: bool, encoded_feat_length_limit: int, decoder_text_length_limit: int)[source]
espnet2.bin.asr_inference_streaming.main(cmd=None)[source]

espnet2.bin.st_train

espnet2.bin.st_train.get_parser()[source]
espnet2.bin.st_train.main(cmd=None)[source]

ST training.

Example

% python st_train.py st --print_config --optim adadelta > conf/train_st.yaml
% python st_train.py --config conf/train_st.yaml