medkit.core.text.utils

Functions

clean_newline_character(→ tuple[str, ...)

Replace the newline character depending on its position in the text.

clean_parentheses_eds(→ tuple[str, ...)

Modify the text near the parentheses depending on its content.

clean_multiple_whitespaces_in_sentence(→ tuple[str, ...)

Normalize consecutive whitespaces in a sentence.

replace_point_after_keywords(→ tuple[str, ...)

Replace the dot character after a keyword and update its span.

replace_multiple_newline_after_sentence(→ tuple[str, ...)

Normalize consecutive newlines between sentences.

replace_newline_inside_sentence(→ tuple[str, ...)

Replace newline in a sentence.

replace_point_in_uppercase(→ tuple[str, ...)

Replace the dot character between uppercase characters with a space and update its span.

replace_point_in_numbers(→ tuple[str, ...)

Replace the dot character between numbers with a comma and update its span.

replace_point_before_keywords(→ tuple[str, ...)

Replace the dot character before a keyword with a space and update its span.

lstrip(→ tuple[str, int])

Return a copy of the string with leading characters removed and its corresponding new start index.

rstrip(→ tuple[str, int])

Return a copy of the string with trailing characters removed and its corresponding new end index.

strip(→ tuple[str, int, int])

Return a copy of the string with leading and trailing characters removed and its corresponding new start and end indexes.

Module Contents

medkit.core.text.utils.clean_newline_character(text: str, spans: list[medkit.core.text.span.AnySpan], keep_endlines: bool = False) → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace the newline character depending on its position in the text.

Endline characters that are not suppressed can either be kept as endlines or replaced by spaces. This function combines replace_multiple_newline_after_sentence() and replace_newline_inside_sentence().

Parameters:
text : str

The text to be modified

spans : list of AnySpan

Spans associated with the text

keep_endlines : bool, default=False

Whether to keep the endlines as '.\n' or replace them with '. '

Returns:
text : str

The cleaned text

spans : list of AnySpan

The list of modified spans

Examples

>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import clean_newline_character
>>> text = "This is\n\n\ta sentence\nAnother\nsentence\n\nhere"
>>> spans = [Span(0, len(text))]
>>> clean_text, clean_spans = clean_newline_character(text, spans, keep_endlines=False)
>>> print(clean_text)
This is a sentence. Another sentence here
>>> clean_text, clean_spans = clean_newline_character(text, spans, keep_endlines=True)
>>> print(clean_text)
This is a sentence.
Another sentence here

medkit.core.text.utils.clean_parentheses_eds(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Modify the text near the parentheses depending on its content.

The rules are adapted for French documents.

Examples

>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import clean_parentheses_eds
>>> text = """
... Le test PCR est (-), pas de nouvelles.
... L'examen d'aujourd'hui est (+).
... Les bilans réalisés (biologique, métabolique en particulier à la recherche
... de GAMT et X fragile) sont revenus négatifs.
... Le patient a un traitement(debuté le 3/02).
... """
>>> spans = [Span(0, len(text))]
>>> text, spans = clean_parentheses_eds(text, spans)
>>> print(text)
Le test PCR est  negatif , pas de nouvelles.
L'examen d'aujourd'hui est  positif .
Les bilans réalisés sont revenus négatifs ; biologique, métabolique en particulier à la recherche
de GAMT et X fragile.
Le patient a un traitement,debuté le 3/02,.

medkit.core.text.utils.clean_multiple_whitespaces_in_sentence(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Normalize consecutive whitespaces in a sentence.

Replace runs of multiple whitespaces between alphanumeric characters and lowercase characters with a single whitespace.

Examples

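>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import clean_multiple_whitespaces_in_sentence
>>> text = "A   phrase    with  multiple   spaces     "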
>>> spans = [Span(0, len(text))]
>>> text, spans = clean_multiple_whitespaces_in_sentence(text, spans)
>>> print(text)
A phrase with multiple spaces
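
The normalization described above can be approximated with a single regular expression. The following is a minimal sketch using only the standard `re` module. The name `collapse_inner_whitespace` and the pattern are assumptions made for illustration; the sketch does not update spans and is not medkit's actual implementation.

```python
import re

# Hypothetical pattern (not taken from medkit): a run of two or more spaces
# or tabs that sits between two word characters.
_INNER_WHITESPACE = re.compile(r"(?<=\w)[ \t]{2,}(?=\w)")


def collapse_inner_whitespace(text: str) -> str:
    """Collapse inner whitespace runs to one space; spans are not tracked."""
    return _INNER_WHITESPACE.sub(" ", text)


print(collapse_inner_whitespace("A   phrase    with  multiple   spaces"))
```

Because of the lookahead, whitespace at the very end of the string is left untouched by this sketch.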

medkit.core.text.utils.replace_point_after_keywords(text: str, spans: list[medkit.core.text.span.AnySpan], keywords: list[str], strict: bool = False, replace_by: str = ' ') → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace the dot character after a keyword and update its span.

Can be used to replace dots that follow a person's title (e.g. M. or Mrs.) or dots that appear by mistake after keywords.

Parameters:
text : str

The text to be modified

spans : list of AnySpan

Spans associated with the text

keywords : list of str

Words or patterns to match before a point

strict : bool, default=False

If True, the keyword must be immediately followed by a point. If False, zero or more whitespace characters may separate the keyword from the point

replace_by : str, default=" "

Replacement string

Returns:
text : str

The text with the replaced matches

spans : list of AnySpan

The list of modified spans

Examples

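>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import replace_point_after_keywords
>>> text = "Le Dr. a un rdv. Mme. Bernand est venue à 14h"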
>>> spans = [Span(0, len(text))]
>>> keywords = ["Dr", "Mme"]
>>> text, spans = replace_point_after_keywords(text, spans, keywords, replace_by="")
>>> print(text)
Le Dr a un rdv. Mme Bernand est venue à 14h
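
To make the effect of `strict` concrete, here is a minimal regex-based sketch of the behaviour described above. It is an illustration only: `drop_point_after_keywords` is a hypothetical name, keywords are treated as literal words (an assumption, since the real function may also accept patterns), spans are ignored, and this is not medkit's implementation.

```python
import re


def drop_point_after_keywords(text, keywords, strict=False, replace_by=" "):
    """Hypothetical sketch: replace the point that follows any keyword."""
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    # strict: the point must touch the keyword.
    # non-strict: any amount of whitespace may sit between keyword and point.
    gap = "" if strict else r"\s*"
    pattern = re.compile(rf"\b({alternation}){gap}\.")
    # A function replacement keeps `replace_by` literal (no backslash handling).
    return pattern.sub(lambda match: match.group(1) + replace_by, text)


print(drop_point_after_keywords(
    "Le Dr. a un rdv. Mme. Bernand est venue à 14h", ["Dr", "Mme"], replace_by=""
))
```

With `strict=True`, an input such as "Dr . X" is left unchanged by this sketch because the point does not directly follow the keyword.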

medkit.core.text.utils.replace_multiple_newline_after_sentence(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Normalize consecutive newlines between sentences.

Replace multiple space characters between a newline character \n and a capital letter or a number with a single newline character.

Parameters:
text : str

The text to be modified

spans : list of AnySpan

Spans associated with the text

Returns:
text : str

The cleaned text

spans : list of AnySpan

The list of modified spans
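
No example is given for this function, so the following is a rough sketch of the rule as stated in the description, using only the standard `re` module. The pattern and the name `collapse_newlines_before_sentence` are assumptions; the sketch only handles ASCII capitals and digits, does not update spans, and may differ from what medkit actually does.

```python
import re

# Hypothetical pattern: a newline, any further whitespace, then (lookahead)
# an uppercase ASCII letter or a digit that starts the next sentence.
_NEWLINES_BEFORE_SENTENCE = re.compile(r"\n\s*(?=[A-Z0-9])")


def collapse_newlines_before_sentence(text: str) -> str:
    """Reduce the whitespace run before a sentence start to one newline."""
    return _NEWLINES_BEFORE_SENTENCE.sub("\n", text)


print(repr(collapse_newlines_before_sentence("First sentence\n \n\tSecond one")))
```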

medkit.core.text.utils.replace_newline_inside_sentence(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace newline in a sentence.

Replace the newline character \n between lowercase letters or punctuation marks with a space.

Parameters:
text : str

The text to be modified

spans : list of AnySpan

Spans associated with the text

Returns:
text : str

The cleaned text

spans : list of AnySpan

The list of modified spans
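
As an illustration of the rule described above, the sketch below joins a line break that falls in the middle of a sentence. It relies on assumptions: the set of punctuation marks (`,`, `;`, `:`) and the ASCII-only lowercase range are chosen for the example, `join_newline_inside_sentence` is a hypothetical name, and spans are not updated. It is not medkit's implementation.

```python
import re

# Hypothetical pattern: a newline whose neighbours on both sides are
# lowercase ASCII letters or one of a few punctuation marks.
_NEWLINE_IN_SENTENCE = re.compile(r"(?<=[a-z,;:])\n(?=[a-z,;:])")


def join_newline_inside_sentence(text: str) -> str:
    """Replace a mid-sentence newline with a space; spans are not tracked."""
    return _NEWLINE_IN_SENTENCE.sub(" ", text)


print(join_newline_inside_sentence("the patient\nwas seen today"))
```

A newline that follows a full stop and precedes a capital letter, as in "End.\nNext", is left in place by this sketch.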

medkit.core.text.utils.replace_point_in_uppercase(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace the dot character between uppercase characters with a space and update its span.

Examples

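>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import replace_point_in_uppercase
>>> text = "Abréviation ING.DRT or RTT.J"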
>>> spans = [Span(0, len(text))]
>>> text, spans = replace_point_in_uppercase(text, spans)
>>> print(text)
Abréviation ING DRT or RTT J
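
The text transformation in this example can be reproduced with a lookbehind and a lookahead. This is a minimal sketch under stated assumptions: only ASCII uppercase letters are considered, `space_point_in_uppercase` is a hypothetical name, and the span update performed by the real function is left out.

```python
import re

# Hypothetical pattern: a dot with an uppercase ASCII letter on each side.
_POINT_IN_UPPERCASE = re.compile(r"(?<=[A-Z])\.(?=[A-Z])")


def space_point_in_uppercase(text: str) -> str:
    """Replace a dot between two uppercase letters with a space."""
    return _POINT_IN_UPPERCASE.sub(" ", text)


print(space_point_in_uppercase("Abréviation ING.DRT or RTT.J"))
```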

medkit.core.text.utils.replace_point_in_numbers(text: str, spans: list[medkit.core.text.span.AnySpan]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace the dot character between numbers with a comma and update its span.

Examples

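>>> from medkit.core.text.span import Span
>>> from medkit.core.text.utils import replace_point_in_numbers
>>> text = "La valeur est de 3.456."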
>>> spans = [Span(0, len(text))]
>>> text, spans = replace_point_in_numbers(text, spans)
>>> print(text)
La valeur est de 3,456.
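
The same result can be sketched with the standard `re` module, which also shows why the final dot of the sentence survives: it has no digit after it. The name `comma_point_in_numbers` is hypothetical, spans are not updated, and this is not medkit's implementation.

```python
import re

# Hypothetical pattern: a dot with a digit immediately before and after it.
_POINT_IN_NUMBER = re.compile(r"(?<=\d)\.(?=\d)")


def comma_point_in_numbers(text: str) -> str:
    """Replace a dot between two digits with a comma."""
    return _POINT_IN_NUMBER.sub(",", text)


print(comma_point_in_numbers("La valeur est de 3.456."))
```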

medkit.core.text.utils.replace_point_before_keywords(text: str, spans: list[medkit.core.text.span.AnySpan], keywords: list[str]) → tuple[str, list[medkit.core.text.span.AnySpan]]

Replace the dot character before a keyword with a space and update its span.
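
No example is given for this function. As a rough illustration of the described behaviour, the sketch below replaces a dot that precedes one of the keywords. Several details are assumptions: whitespace is tolerated between the dot and the keyword, keywords are matched as literal whole words, `blank_point_before_keywords` is a hypothetical name, and spans are not updated.

```python
import re


def blank_point_before_keywords(text, keywords):
    """Hypothetical sketch: replace a dot that precedes a keyword with a space."""
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    # The lookahead checks for optional whitespace and then a whole keyword,
    # without consuming either, so only the dot itself is replaced.
    pattern = re.compile(rf"\.(?=\s*(?:{alternation})\b)")
    return pattern.sub(" ", text)


print(blank_point_before_keywords("Pas de fièvre. avec toux", ["avec"]))
```

Because only the dot is substituted, the space that already followed it remains, so the output of this sketch contains two consecutive spaces.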

medkit.core.text.utils.lstrip(text: str, start: int = 0, chars: str | None = None) → tuple[str, int]

Return a copy of the string with leading characters removed and its corresponding new start index.

Parameters:
text : str

The text to strip.

start : int, default=0

The start index from the original text if any.

chars : str, optional

The list of characters to strip. Default behaviour is like str.lstrip([chars]).

Returns:
new_text : str

New text

new_start : int

New start index
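
The relationship between the stripped text and the new start index can be sketched in plain Python. `lstrip_with_start` is a hypothetical stand-in written for this illustration, not the medkit function itself.

```python
def lstrip_with_start(text, start=0, chars=None):
    """Hypothetical sketch: strip leading characters and shift the start index."""
    new_text = text.lstrip(chars)
    # Each removed leading character moves the start one position to the right.
    new_start = start + (len(text) - len(new_text))
    return new_text, new_start


print(lstrip_with_start("   hello", start=10))  # ('hello', 13)
```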

medkit.core.text.utils.rstrip(text: str, end: int | None = None, chars: str | None = None) → tuple[str, int]

Return a copy of the string with trailing characters removed and its corresponding new end index.

Parameters:
text : str

The text to strip.

end : int, optional

The end index from the original text if any.

chars : str, optional

The list of characters to strip. Default behaviour is like str.rstrip([chars]).

Returns:
new_text : str

New text

new_end : int

New end index
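
A plain-Python sketch of the end-index arithmetic follows. It assumes that an omitted `end` means the end of `text`, which the source does not state explicitly; `rstrip_with_end` is a hypothetical name, not the medkit function.

```python
def rstrip_with_end(text, end=None, chars=None):
    """Hypothetical sketch: strip trailing characters and shift the end index."""
    if end is None:
        # Assumption: without an explicit end, use the length of the text.
        end = len(text)
    new_text = text.rstrip(chars)
    # Each removed trailing character moves the end one position to the left.
    new_end = end - (len(text) - len(new_text))
    return new_text, new_end


print(rstrip_with_end("hello  "))  # ('hello', 5)
```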

medkit.core.text.utils.strip(text: str, start: int = 0, chars: str | None = None) → tuple[str, int, int]

Return a copy of the string with leading and trailing characters removed and its corresponding new start and end indexes.

Parameters:
text : str

The text to strip.

start : int, default=0

The start index from the original text if any.

chars : str, optional

The list of characters to strip. Default behaviour is like str.strip([chars]).

Returns:
new_text : str

New text

new_start : int

New start index

new_end : int

New end index
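
How the three return values relate to each other can be sketched in plain Python. `strip_with_indexes` is a hypothetical stand-in for illustration, and deriving the end index as the new start plus the length of the stripped text is an assumption about the intended semantics.

```python
def strip_with_indexes(text, start=0, chars=None):
    """Hypothetical sketch: strip both sides and return the new start and end."""
    without_leading = text.lstrip(chars)
    # Leading characters removed shift the start to the right.
    new_start = start + (len(text) - len(without_leading))
    new_text = without_leading.rstrip(chars)
    # Assumption: the end index is the start plus the remaining length.
    new_end = new_start + len(new_text)
    return new_text, new_start, new_end


print(strip_with_indexes("  hi  ", start=5))  # ('hi', 7, 9)
```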