medkit.core.text.utils

Functions:

`clean_multiple_whitespaces_in_sentence`(text, ...)	Replace multiple white-spaces between alphanumeric characters and lowercase characters with a single whitespace
`clean_newline_character`(text, spans[, ...])	Replace the newline character depending on its position in the text.
`clean_parentheses_eds`(text, spans)	Modify the text near the parentheses depending on its content.
`lstrip`(text[, start, chars])	Returns a copy of the string with leading characters removed and its corresponding new start index.
`replace_multiple_newline_after_sentence`(...)	Replace multiple space characters between a newline character \n and a capital letter or a number with a single newline character.
`replace_newline_inside_sentence`(text, spans)	Replace the newline character \n between lowercase letters or punctuation marks with a space
`replace_point_after_keywords`(text, spans, ...)	Replace the character '.' after a keyword and update its span.
`replace_point_before_keywords`(text, spans, ...)	Replace the character '.' before a keyword with a space and update its span.
`replace_point_in_numbers`(text, spans)	Replace the character '.' between numbers with the character ',' and update its span.
`replace_point_in_uppercase`(text, spans)	Replace the character '.' between uppercase characters with a space and update its span.
`rstrip`(text[, end, chars])	Returns a copy of the string with trailing characters removed and its corresponding new end index.
`strip`(text[, start, chars])	Returns a copy of the string with leading and trailing characters removed and its corresponding new start and end indexes.

replace_point_after_keywords(text, spans, keywords, strict=False, replace_by=' ')

Replace the character ‘.’ after a keyword and update its span. This can be used to replace dots that mark a person's title (e.g. M. or Mrs.) or dots that appear by mistake after keywords.

Parameters:

text (str) – The text to be modified
spans (list of AnySpan) – Spans associated to the text
keywords (list of str) – Word or pattern to match before a point
strict (bool, default=False) – If True, the keyword must be immediately followed by a point. If False, zero or more whitespace characters may separate the keyword from the point
replace_by (str, default=" ") – Replacement string

Return type:

tuple[str, list[AnySpan]]

Returns:

text (str) – The text with the replaced matches
spans (list of AnySpan) – The list of modified spans

Examples

>>> text = "Le Dr. a un rdv. Mme. Bernand est venue à 14h"
>>> spans = [Span(0, len(text))]
>>> keywords = ["Dr", "Mme"]
>>> text, spans = replace_point_after_keywords(text, spans, keywords, replace_by="")
>>> print(text)
Le Dr a un rdv. Mme Bernand est venue à 14h
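
The effect of `strict` can be pictured with a plain regular expression. The sketch below is only an illustration, not medkit's implementation: the helper name `sketch_replace_point_after_keywords` is hypothetical, it ignores spans entirely, it matches keywords literally, and it assumes that whitespace between the keyword and the point is kept.

```python
import re


def sketch_replace_point_after_keywords(text, keywords, strict=False, replace_by=" "):
    """Replace the '.' that follows a keyword (hypothetical helper, spans ignored)."""
    # Assumption: keywords are matched literally, so they are escaped here.
    alternatives = "|".join(re.escape(k) for k in keywords)
    # strict: the point must directly follow the keyword;
    # otherwise zero or more whitespace characters may sit in between.
    gap = "" if strict else r"\s*"
    pattern = re.compile(rf"\b(?:{alternatives}){gap}\.")
    # Keep everything that matched except the final point.
    return pattern.sub(lambda m: m.group(0)[:-1] + replace_by, text)


text = "Le Dr. a un rdv. Mme. Bernand est venue à 14h"
print(sketch_replace_point_after_keywords(text, ["Dr", "Mme"], replace_by=""))
# With strict=True, "Dr ." is left alone because a space precedes the point.
print(sketch_replace_point_after_keywords("Le Dr . Martin", ["Dr"], strict=True))
```

The point after "rdv" is untouched in both modes because "rdv" is not in the keyword list.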

replace_multiple_newline_after_sentence(text, spans)

Replace multiple space characters between a newline character \n and a capital letter or a number with a single newline character.

Parameters:

text (str) – The text to be modified
spans (list of AnySpan) – Spans associated to the text

Return type:

tuple[str, list[AnySpan]]

Returns:

text (str) – The cleaned text
spans (list of AnySpan) – The list of modified spans

replace_newline_inside_sentence(text, spans)

Replace the newline character \n between lowercase letters or punctuation marks with a space.

Parameters:

text (str) – The text to be modified
spans (list of AnySpan) – Spans associated to the text

Return type:

tuple[str, list[AnySpan]]

Returns:

text (str) – The cleaned text
spans (list of AnySpan) – The list of modified spans

clean_newline_character(text, spans, keep_endlines=False)

Replace the newline character depending on its position in the text. Newline characters that are not removed can either be kept as newlines or replaced by spaces. This function combines replace_multiple_newline_after_sentence() and replace_newline_inside_sentence().

Parameters:

text (str) – The text to be modified
spans (list of AnySpan) – Spans associated to the text
keep_endlines (bool, default=False) – Whether to keep the line endings as ‘.\n’ or replace them with ‘. ’

Return type:

tuple[str, list[AnySpan]]

Returns:

text (str) – The cleaned text
spans (list of AnySpan) – The list of modified spans

Examples

>>> text = "This is\n\n\ta sentence\nAnother\nsentence\n\nhere"
>>> spans = [Span(0, len(text))]
>>> # Keep the original text and spans so that both calls start from the same input
>>> clean_text, clean_spans = clean_newline_character(text, spans, keep_endlines=False)
>>> print(clean_text)
This is a sentence. Another sentence here

>>> clean_text, clean_spans = clean_newline_character(text, spans, keep_endlines=True)
>>> print(clean_text)
This is a sentence.
Another sentence here
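
The way the two newline rules combine can be sketched with regular expressions. This is an approximation under stated assumptions, not medkit's code: `sketch_clean_newline_character` is a hypothetical name, spans are ignored, and the character classes only cover ASCII letters, so accented letters would need a wider pattern.

```python
import re


def sketch_clean_newline_character(text, keep_endlines=False):
    """Two-step newline cleanup (hypothetical helper, spans ignored)."""
    # Step 1: a newline followed by extra whitespace and then a capital
    # letter or a digit is reduced to a single newline.
    text = re.sub(r"\n\s*(?=[A-Z0-9])", "\n", text)
    # Step 2: newlines between a lowercase letter (or , ; :) and a
    # lowercase letter sit inside a sentence, so they become one space.
    text = re.sub(r"(?<=[a-z,;:])\s*\n\s*(?=[a-z])", " ", text)
    # Remaining newlines are treated as sentence ends: add a point, then
    # keep the newline or turn it into a space.
    return text.replace("\n", ".\n" if keep_endlines else ". ")


raw = "This is\n\n\ta sentence\nAnother\nsentence\n\nhere"
print(sketch_clean_newline_character(raw, keep_endlines=False))
print(sketch_clean_newline_character(raw, keep_endlines=True))
```

Both calls start from the same `raw` string, which is why the second one still has a newline to keep.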

clean_multiple_whitespaces_in_sentence(text, spans)

Replace multiple whitespace characters between alphanumeric characters and lowercase characters with a single whitespace.

Examples

>>> text = "A   phrase    with  multiple   spaces     "
>>> spans = [Span(0, len(text))]
>>> text, spans = clean_multiple_whitespaces_in_sentence(text, spans)
>>> print(text)
A phrase with multiple spaces

Return type:

tuple[str, list[AnySpan]]

clean_parentheses_eds(text, spans)

Modify the text near the parentheses depending on its content. The rules are adapted for French documents.

Examples

>>> text = """
... Le test PCR est (-), pas de nouvelles.
... L'examen d'aujourd'hui est (+).
... Les bilans réalisés (biologique, métabolique en particulier à la recherche
... de GAMT et X fragile) sont revenus négatifs.
... Le patient a un traitement(debuté le 3/02).
... """
>>> spans = [Span(0, len(text))]
>>> text, spans = clean_parentheses_eds(text, spans)
>>> print(text)
Le test PCR est  negatif , pas de nouvelles.
L'examen d'aujourd'hui est  positif .
Les bilans réalisés sont revenus négatifs ; biologique, métabolique en particulier à la recherche
de GAMT et X fragile.
Le patient a un traitement,debuté le 3/02,.

Return type:

tuple[str, list[AnySpan]]

replace_point_in_uppercase(text, spans)

Replace the character ‘.’ between uppercase characters with a space and update its span.

Examples

>>> text = "Abréviation ING.DRT or RTT.J"
>>> spans = [Span(0, len(text))]
>>> text, spans = replace_point_in_uppercase(text, spans)
>>> print(text)
Abréviation ING DRT or RTT J

Return type:

tuple[str, list[AnySpan]]
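
As a rough illustration of the rule, a lookbehind and lookahead pair is enough to target a point that sits between two uppercase letters. The pattern below is an assumption for demonstration, not the one medkit uses, and it only covers ASCII capitals.

```python
import re

text = "Abréviation ING.DRT or RTT.J"
# A point with an ASCII uppercase letter on both sides becomes a space.
cleaned = re.sub(r"(?<=[A-Z])\.(?=[A-Z])", " ", text)
print(cleaned)
# One character replaces one character, so the text length is unchanged.
print(len(cleaned) == len(text))
```

Because the replacement has the same length as the match, character offsets computed on the original text still point at the same words afterwards.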

replace_point_in_numbers(text, spans)

Replace the character ‘.’ between numbers with the character ‘,’ and update its span.

Examples

>>> text = "La valeur est de 3.456."
>>> spans = [Span(0, len(text))]
>>> text, spans = replace_point_in_numbers(text, spans)
>>> print(text)
La valeur est de 3,456.

Return type:

tuple[str, list[AnySpan]]
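
The final point in the example survives because it is not followed by a digit. A minimal regex sketch of that behaviour, with the hypothetical name `sketch_replace_point_in_numbers` and no span handling, could look like this:

```python
import re


def sketch_replace_point_in_numbers(text):
    """Turn a '.' into ',' only when a digit stands on both sides (spans ignored)."""
    # Lookarounds check the neighbouring digits without consuming them,
    # so consecutive points such as in "1.2.3" are all replaced.
    return re.sub(r"(?<=\d)\.(?=\d)", ",", text)


print(sketch_replace_point_in_numbers("La valeur est de 3.456."))
```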

replace_point_before_keywords(text, spans, keywords)

Replace the character ‘.’ before a keyword with a space and update its span.

Return type:

tuple[str, list[AnySpan]]
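
No example is given for this function, so the following is a guess at the general idea rather than a description of medkit's behaviour. The helper name and the sample sentence are hypothetical, spans are ignored, keywords are matched literally, and allowing whitespace between the point and the keyword is an assumption.

```python
import re


def sketch_replace_point_before_keywords(text, keywords):
    """Replace a '.' that precedes a keyword with a space (hypothetical helper)."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    # Assumption: optional whitespace may separate the point from the keyword.
    pattern = re.compile(rf"\.(?=\s*(?:{alternatives})\b)")
    return pattern.sub(" ", text)


sample = "Douleur thoracique. avec irradiation."
print(sketch_replace_point_before_keywords(sample, ["avec"]))
```

Only the point before "avec" is replaced; the sentence-final point has no keyword after it and stays in place.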

lstrip(text, start=0, chars=None)

Returns a copy of the string with leading characters removed and its corresponding new start index.

Parameters:

text (str) – The text to strip.
start (int, default=0) – The start index from the original text if any.
chars (str, optional) – The list of characters to strip. Default behaviour is like str.lstrip([chars]).

Return type:

tuple[str, int]

Returns:

new_text (str) – New text
new_start (int) – New start index
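
The new start index is simply the old one shifted by the number of characters removed. A minimal sketch of that arithmetic, using the hypothetical name `sketch_lstrip` and built on `str.lstrip`, is shown here; it is not taken from medkit's source.

```python
def sketch_lstrip(text, start=0, chars=None):
    """Strip leading characters and shift the start index accordingly."""
    new_text = text.lstrip(chars)
    # The number of removed characters is the difference in length.
    new_start = start + (len(text) - len(new_text))
    return new_text, new_start


# The fragment starts at offset 10 in some larger document.
print(sketch_lstrip("  douleur", start=10))
```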

rstrip(text, end=None, chars=None)

Returns a copy of the string with trailing characters removed and its corresponding new end index.

Parameters:

text (str) – The text to strip.
end (int, optional) – The end index from the original text if any.
chars (str, optional) – The list of characters to strip. Default behaviour is like str.rstrip([chars]).

Return type:

tuple[str, int]

Returns:

new_text (str) – New text
new_end (int) – New end index

strip(text, start=0, chars=None)

Returns a copy of the string with leading and trailing characters removed and its corresponding new start and end indexes.

Parameters:

text (str) – The text to strip.
start (int, default=0) – The start index from the original text if any.
chars (str, optional) – The list of characters to strip. Default behaviour is like str.strip([chars]).

Return type:

tuple[str, int, int]

Returns:

new_text (str) – New text
new_start (int) – New start index
new_end (int) – New end index
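
To show how the two indexes relate to the original text, here is a small sketch built on `str.lstrip` and `str.rstrip`. The name `sketch_strip` is hypothetical, and treating `new_end` as an exclusive index (as in Python slicing) is an assumption.

```python
def sketch_strip(text, start=0, chars=None):
    """Strip both sides and return the text with its new start and end indexes."""
    without_leading = text.lstrip(chars)
    # Shift the start by the number of leading characters removed.
    new_start = start + (len(text) - len(without_leading))
    new_text = without_leading.rstrip(chars)
    # Assumption: the end index is exclusive, so it is start plus length.
    new_end = new_start + len(new_text)
    return new_text, new_start, new_end


document = "x" * 10 + "  douleur \n"
fragment = document[10:]
new_text, new_start, new_end = sketch_strip(fragment, start=10)
# Slicing the full document with the new indexes gives back the stripped text.
print(new_text, new_start, new_end, document[new_start:new_end] == new_text)
```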
