medkit.text.segmentation.rush_sentence_tokenizer

This module needs extra dependencies that are not installed with the core dependencies of medkit. To install them, use `pip install medkit-lib[rush-sentence-tokenizer]`.

Classes:

RushSentenceTokenizer([output_label, ...])

Sentence segmentation annotator based on PyRuSH.

class RushSentenceTokenizer(output_label='sentence', path_to_rules=None, keep_newlines=True, attrs_to_copy=None, uid=None)

Sentence segmentation annotator based on PyRuSH.

Instantiate the RuSH tokenizer.

Parameters:

output_label (str, optional) – The output label of the created annotations.
path_to_rules (str or Path, optional) – Path to the CSV or TSV rules file to provide to PyRuSH. If none is provided, “rush_tokenizer_default_rules.tsv” will be used (corresponds to “conf/rush_rules.tsv” in the PyRuSH repo).
keep_newlines (bool, default=True) – With the default rules, newline chars are not used to split sentences, therefore a sentence may contain one or more newline chars. If keep_newlines is False, newlines will be replaced by spaces.
attrs_to_copy (list of str, optional) – Labels of the attributes that should be copied from the input segment to the derived segment. Useful, for example, for propagating the section name.
uid (str, optional) – Identifier of the tokenizer.
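
The parameters above can be combined as in the following minimal sketch. The helper name `make_tokenizer` and the attribute label "section" are illustrative assumptions, not part of medkit. The import is done inside the function because the module requires the optional rush-sentence-tokenizer extra.

```python
def make_tokenizer(rules_path=None):
    """Build a RushSentenceTokenizer with non-default options (illustrative helper)."""
    # Lazy import: this module needs the optional PyRuSH dependency.
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )

    return RushSentenceTokenizer(
        output_label="sentence",
        # None falls back to the default rules file shipped with medkit.
        path_to_rules=rules_path,
        # Replace newline chars inside sentences with spaces.
        keep_newlines=False,
        # "section" is a hypothetical attribute label carried by input segments.
        attrs_to_copy=["section"],
    )
```

Passing a custom `rules_path` is only needed when the default rules do not suit the corpus.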

Methods:

`run`(segments)	Return sentences detected in segments.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return sentences detected in segments.

Parameters: segments (list of Segment) – List of segments in which to look for sentences
Return type: list[Segment]
Returns: list of Segment – Sentence segments found in segments
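
As a sketch of how `run` might be called, the example below wraps a raw string in a single input segment and returns the text and spans of each detected sentence. The helper name `split_into_sentences` and the "raw_text" label are illustrative assumptions. `Segment` and `Span` are imported from `medkit.core.text`.

```python
def split_into_sentences(text):
    """Return (text, spans) pairs for the sentences found in text (illustrative helper)."""
    # Lazy imports: the tokenizer module needs the optional PyRuSH dependency.
    from medkit.core.text import Segment, Span
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )

    # One input segment covering the whole raw text.
    segment = Segment(label="raw_text", text=text, spans=[Span(0, len(text))])

    tokenizer = RushSentenceTokenizer(output_label="sentence")
    sentences = tokenizer.run([segment])

    # Each returned item is a Segment labeled "sentence".
    return [(sentence.text, sentence.spans) for sentence in sentences]
```

Where exactly the text is split depends on the PyRuSH rules in use, so no expected output is shown here.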

property description: OperationDescription

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.
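
A possible way to use provenance tracing is sketched below. It assumes that `ProvTracer` can be imported from `medkit.core` and that `ProvTracer.get_prov` accepts the uid of a data item, which should be checked against the installed medkit version. The helper name `trace_sentences` is illustrative.

```python
def trace_sentences(text):
    """Tokenize text with provenance tracing enabled (illustrative helper)."""
    # Lazy imports: the tokenizer module needs the optional PyRuSH dependency.
    from medkit.core import ProvTracer
    from medkit.core.text import Segment, Span
    from medkit.text.segmentation.rush_sentence_tokenizer import (
        RushSentenceTokenizer,
    )

    segment = Segment(label="raw_text", text=text, spans=[Span(0, len(text))])

    tokenizer = RushSentenceTokenizer()
    prov_tracer = ProvTracer()
    # Must be set before run() so that the created sentences are traced.
    tokenizer.set_prov_tracer(prov_tracer)

    sentences = tokenizer.run([segment])

    # Assumption: get_prov() looks up a provenance record by data item uid.
    return [prov_tracer.get_prov(sentence.uid) for sentence in sentences]
```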
