medkit.text.segmentation.rush_sentence_tokenizer

This module requires extra dependencies that are not installed with the core dependencies of medkit. To install them, run pip install medkit-lib[rush-sentence-tokenizer].

Classes:

RushSentenceTokenizer([output_label, ...])

Sentence segmentation annotator based on PyRuSH.

class RushSentenceTokenizer(output_label='SENTENCE', path_to_rules=None, keep_newlines=True, uid=None)

Sentence segmentation annotator based on PyRuSH.

Instantiate the RuSH tokenizer

Parameters
  • output_label (str) – The output label of the created annotations. Default: “SENTENCE” (cf. DefaultConfig)

  • path_to_rules (Union[str, Path, None]) – Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” is used (it corresponds to “conf/rush_rules.tsv” in the PyRuSH repo)

  • keep_newlines (bool) – With the default rules, newline characters are not used to split sentences, so a sentence may contain one or more newline characters. If keep_newlines is False, these newlines are replaced by spaces.

  • uid (str) – Identifier of the tokenizer
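
The effect of keep_newlines on the text of a detected sentence can be illustrated in plain Python. This is only a sketch of the behaviour described above, not medkit's implementation: render_sentence_text is a hypothetical helper, and the one-for-one replacement of each newline by a single space is an assumption.

```python
def render_sentence_text(sentence_text: str, keep_newlines: bool = True) -> str:
    """Hypothetical helper mimicking the documented keep_newlines behaviour."""
    if keep_newlines:
        # The sentence text is left as-is, embedded newlines included
        return sentence_text
    # Assumption: each newline character becomes exactly one space
    return sentence_text.replace("\n", " ")


# A single sentence spanning two lines, as often found in clinical notes
sentence = "The patient was admitted\nfor chest pain."

kept = render_sentence_text(sentence, keep_newlines=True)
flattened = render_sentence_text(sentence, keep_newlines=False)

print(repr(kept))
print(repr(flattened))
```

Because the replacement is one character for one character in this sketch, the flattened text has the same length as the original, so character offsets into it stay aligned.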

Methods:

run(segments)

Return sentences detected in segments.

run(segments)

Return sentences detected in segments.

Parameters

segments (List[Segment]) – List of segments in which to look for sentences

Return type

List[Segment]

Returns

List[Segment] – Sentence segments found in segments
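
A minimal usage sketch of the tokenizer follows. It assumes that TextDocument can be imported from medkit.core.text and exposes its full text as a raw_segment attribute, and that the returned segments carry label, spans and text attributes; none of this is stated on this page. detect_sentences and print_sentences are hypothetical helpers, and the import is guarded because the module needs the optional dependencies mentioned at the top of this page.

```python
# Guarded import: the module needs the optional extra
#   pip install medkit-lib[rush-sentence-tokenizer]
try:
    from medkit.core.text import TextDocument  # assumed import path
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )
except ImportError:
    TextDocument = None
    RushSentenceTokenizer = None


def detect_sentences(text, keep_newlines=True):
    """Hypothetical helper: return the sentence segments found in text."""
    if RushSentenceTokenizer is None:
        raise RuntimeError(
            "medkit with the rush-sentence-tokenizer extra is not installed"
        )
    # Assumption: a TextDocument exposes its whole text as one raw segment
    doc = TextDocument(text=text)
    tokenizer = RushSentenceTokenizer(
        output_label="SENTENCE",
        keep_newlines=keep_newlines,
    )
    # run() takes a list of segments and returns a list of sentence segments
    return tokenizer.run([doc.raw_segment])


def print_sentences(text):
    """Hypothetical helper: print label, spans and text of each sentence."""
    for sentence in detect_sentences(text):
        print(sentence.label, sentence.spans, repr(sentence.text))
```

Calling print_sentences on a short note would print one line per detected sentence. The exact sentence boundaries depend on the PyRuSH rules in use, so no expected output is shown here.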