medkit.io


Subpackages#

Submodules#

Classes#

BratInputConverter

Class in charge of converting brat annotations.

BratOutputConverter

Class for converting text documents to a brat collection file.

DoccanoClientConfig

Doccano client configuration.

DoccanoInputConverter

Convert doccano files (.JSONL) containing annotations for a given task.

DoccanoOutputConverter

Convert medkit files to doccano files (.JSONL) for a given task.

DoccanoTask

Supported doccano tasks.

RTTMInputConverter

Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.

RTTMOutputConverter

Class for conversions to Rich Transcription Time Marked (.rttm).

Package Contents#

class medkit.io.BratInputConverter(detect_cuis_in_notes: bool = True, notes_label: str = 'brat_note', uid: str | None = None)#

Bases: medkit.core.InputConverter

Class in charge of converting brat annotations.

Parameters:
detect_cuis_in_notesbool, default=True

If True, strings looking like CUIs in annotator notes of entities will be converted to UMLS normalization attributes rather than creating an Attribute with the whole note text as value.

notes_labelstr, default=”brat_note”

Label to use for attributes created from annotator notes.

uidstr, optional

Identifier of the converter.
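
The detect_cuis_in_notes behaviour can be illustrated with a small sketch. It assumes that a "string looking like a CUI" is the letter C followed by exactly seven digits, which is the usual UMLS format; the pattern medkit actually uses is not shown on this page, and find_cuis is a hypothetical helper name.

```python
import re

# Assumption: a CUI is "C" followed by exactly seven digits (usual UMLS format).
CUI_PATTERN = re.compile(r"\bC\d{7}\b")


def find_cuis(note: str) -> list[str]:
    """Return the CUI-like substrings found in an annotator note."""
    return CUI_PATTERN.findall(note)


# A note holding two CUIs separated by a space yields two matches,
# while free-text notes yield none.
print(find_cuis("C0011849 C0004096"))
print(find_cuis("patient reports pain"))
```

In this sketch, a note with no match would be kept as a plain attribute holding the whole note text, as described for detect_cuis_in_notes above.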

Attributes:
descriptionOperationDescription

Description of the operation

notes_label#
detect_cuis_in_notes#
uid#
_prov_tracer: medkit.core.ProvTracer | None = None#
property description: medkit.core.OperationDescription#
set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#
load(dir_path: str | pathlib.Path, ann_ext: str = ANN_EXT, text_ext: str = TEXT_EXT) list[medkit.core.text.TextDocument]#

Load brat annotations as text documents.

Create a list of TextDocuments from a folder containing text files and associated brat annotations files.

Parameters:
dir_pathstr or Path

The path to the directory containing the text files and the annotation files (.ann)

ann_extstr, optional

The extension of the brat annotation file (e.g. .ann)

text_extstr, optional

The extension of the text file (e.g. .txt)

Returns:
list of TextDocument

The list of TextDocuments
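
As a rough illustration of how text files and annotation files in a folder can be matched, the sketch below pairs files sharing a basename. It is not medkit's implementation: pair_brat_files is a hypothetical helper, and the ".ann" and ".txt" defaults are taken from the examples in the parameter descriptions, since the values of ANN_EXT and TEXT_EXT are not shown here.

```python
from pathlib import Path


def pair_brat_files(dir_path, ann_ext=".ann", text_ext=".txt"):
    """Return (ann_path, text_path) pairs for files sharing a basename."""
    dir_path = Path(dir_path)
    pairs = []
    for text_path in sorted(dir_path.glob("*" + text_ext)):
        ann_path = text_path.with_suffix(ann_ext)
        # Assumption: a text file without annotation file is skipped.
        if ann_path.exists():
            pairs.append((ann_path, text_path))
    return pairs
```

Each resulting pair has the shape expected by load_doc(ann_path, text_path) documented below.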

load_doc(ann_path: str | pathlib.Path, text_path: str | pathlib.Path) medkit.core.text.TextDocument#

Load a brat annotation and text file combo as a text document.

Create a TextDocument from a .ann file and its associated .txt file.

Parameters:
ann_pathstr or Path

The path to the brat annotation file.

text_pathstr or Path

The path to the text document file.

Returns:
TextDocument

The document containing the text and the annotations

load_annotations(ann_file: str | pathlib.Path) list[medkit.core.text.TextAnnotation]#

Load a brat annotation file as a list of annotations.

Load a .ann file and return a list of Annotation objects.

Parameters:
ann_filestr or Path

Path to the .ann file.

Returns:
list of TextAnnotation

The list of text annotations
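
To make the content of a .ann file more concrete, here is a minimal sketch of a parser for the most common brat standoff lines (entities, relations, attributes and annotator notes). The function name and the returned dictionaries are assumptions made for illustration; medkit returns TextAnnotation objects instead, and its parser may handle more cases.

```python
def parse_ann_line(line):
    """Parse one brat standoff line into a plain dict (None if unsupported)."""
    line = line.rstrip("\n")
    ann_id, _, rest = line.partition("\t")
    if ann_id.startswith("T"):
        # Entity: "T1<TAB>label start end[;start end...]<TAB>text"
        header, _, text = rest.partition("\t")
        label, _, offsets = header.partition(" ")
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        return {"kind": "entity", "id": ann_id, "label": label, "spans": spans, "text": text}
    if ann_id.startswith("R"):
        # Relation: "R1<TAB>label Arg1:T1 Arg2:T2"
        label, arg1, arg2 = rest.split()
        return {"kind": "relation", "id": ann_id, "label": label,
                "source": arg1.split(":")[1], "target": arg2.split(":")[1]}
    if ann_id.startswith("A"):
        # Attribute: "A1<TAB>label target [value]"
        parts = rest.split()
        value = parts[2] if len(parts) > 2 else None
        return {"kind": "attribute", "id": ann_id, "label": parts[0],
                "target": parts[1], "value": value}
    if ann_id.startswith("#"):
        # Note: "#1<TAB>AnnotatorNotes target<TAB>note text"
        header, _, text = rest.partition("\t")
        _, target = header.split()
        return {"kind": "note", "id": ann_id, "target": target, "text": text}
    return None
```

The entity branch supports discontinuous spans, which brat writes as several "start end" pairs separated by semicolons.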

class medkit.io.BratOutputConverter(anns_labels: list[str] | None = None, attrs: list[str] | None = None, notes_label: str = 'brat_note', ignore_segments: bool = True, convert_cuis_to_notes: bool = True, create_config: bool = True, top_values_by_attr: int = 50, uid: str | None = None)#

Bases: medkit.core.OutputConverter

Class for converting text documents to a brat collection file.

Hint

BRAT checks for coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.

Parameters:
anns_labelslist of str, optional

Labels of medkit annotations to convert into Brat annotations. If None (default), all the annotations will be converted.

attrslist of str, optional

Labels of medkit attributes to add to the annotations that will be included. If None (default), all medkit attributes found in the segments or relations will be converted to Brat attributes.

notes_labelstr, default=”brat_note”

Label of attributes that will be converted to annotator notes.

ignore_segmentsbool, default=True

If True medkit segments will be ignored. Only entities, attributes and relations will be converted to Brat annotations. If False the medkit segments will be converted to Brat annotations as well.

convert_cuis_to_notesbool, default=True

If True, UMLS normalization attributes will be converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs will be separated by spaces (e.g. “C0011849 C0004096”).

create_configbool, default=True

Whether to create a configuration file for the generated collection. This file defines the types of the generated annotations and is required for correct visualization in Brat.

top_values_by_attrint, default=50

Number of most common values per attribute to include in the configuration. This is useful when an attribute has a large number of values: only the most frequent ones are written to the config. By default, the 50 most common values of each attribute are included.

uidstr, optional

Identifier of the converter
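
The effect of top_values_by_attr can be sketched with collections.Counter. This is only an illustration of "keeping the most common values"; top_values is a hypothetical helper and the way medkit counts values internally is not documented here.

```python
from collections import Counter


def top_values(values, top_n=50):
    """Return the top_n most frequent values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(top_n)]


# With top_n=2, only the two most frequent values are kept.
print(top_values(["high", "low", "high", "medium", "high", "low"], top_n=2))
```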

Attributes:
descriptionOperationDescription

Description for the operation

uid#
anns_labels#
attrs#
notes_label#
ignore_segments#
convert_cuis_to_notes#
create_config#
top_values_by_attr#
property description: medkit.core.OperationDescription#
save(docs: list[medkit.core.text.TextDocument], dir_path: str | pathlib.Path, doc_names: list[str] | None = None)#

Save text documents as brat files.

Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files. A file named ‘annotation.conf’ may also be saved if required.

Parameters:
docslist of TextDocument

List of medkit doc objects to convert

dir_pathstr or Path

String or path object to save the generated files

doc_nameslist of str, optional

Optional list of names for the generated files. If None, the uid of each document is used as the name: ‘uid.txt’ holds the raw text of the document and ‘uid.ann’ the Brat annotations.

_convert_medkit_anns_to_brat(segments: list[medkit.core.text.Segment], relations: list[medkit.core.text.Relation], config: medkit.io._brat_utils.BratAnnConfiguration, raw_text: str) list[medkit.io._brat_utils.BratEntity | medkit.io._brat_utils.BratAttribute | medkit.io._brat_utils.BratRelation | medkit.io._brat_utils.BratNote]#

Convert Segments, Relations and Attributes into brat data structures.

Parameters:
segmentslist of Segment

Medkit segments to convert

relationslist of Relation

Medkit relations to convert

configBratAnnConfiguration

Optional BratAnnConfiguration structure. This object is updated with the types of the generated Brat annotations.

raw_textstr

Text of reference to get the original text of the annotations

Returns:
list of BratEntity or BratAttribute or BratRelation or BratNote

A list of brat annotations

static _ensure_text_and_spans(segment: medkit.core.text.Segment, raw_text: str) tuple[str, list[tuple[int, int]]]#

Ensure consistency between the segment and the raw text.

The text of a BRAT annotation can’t contain multiple white spaces (including a newline character). This method cleans the fragments’ text and adjusts their spans to point to the same location in the raw text.

Parameters:
segmentSegment

Segment to ensure

raw_textstr

Text of reference

Returns:
textstr

The cleaned text

spanslist of tuple

The adjusted spans
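
One possible way to clean the text and adjust the spans is sketched below. It assumes that a span is split wherever the raw text contains a run of two or more whitespace characters, or a single newline, carriage return or tab, and that the resulting fragments are joined with single spaces. The real method works on a Segment; this hypothetical clean_text_and_spans helper takes plain (start, end) tuples and may differ from medkit's rules.

```python
import re

# Assumption: these whitespace patterns are the ones brat cannot display.
_BREAK = re.compile(r"\s{2,}|[\n\r\t]")


def clean_text_and_spans(raw_text, spans):
    """Split spans on problematic whitespace and rebuild the cleaned text."""
    texts, new_spans = [], []
    for start, end in spans:
        cursor = start
        for match in _BREAK.finditer(raw_text, start, end):
            if match.start() > cursor:
                texts.append(raw_text[cursor:match.start()])
                new_spans.append((cursor, match.start()))
            cursor = match.end()
        if cursor < end:
            texts.append(raw_text[cursor:end])
            new_spans.append((cursor, end))
    return " ".join(texts), new_spans


raw = "severe heart\n  failure noted"
print(clean_text_and_spans(raw, [(7, 22)]))
```

Every adjusted span still points into the unmodified raw text, so the offsets written to the .ann file stay coherent with the .txt file.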

_convert_segment_to_brat(segment: medkit.core.text.Segment, nb_segment: int, raw_text: str) medkit.io._brat_utils.BratEntity#

Get a brat entity from a medkit segment.

Parameters:
segmentSegment

A medkit segment to convert into brat format

nb_segmentint

The current counter of brat segments

raw_textstr

Text of reference to get the original text of the segment

Returns:
BratEntity

The equivalent brat entity of the medkit segment

static _convert_relation_to_brat(relation: medkit.core.text.Relation, nb_relation: int, brat_entities_by_segment_id: dict[str, medkit.io._brat_utils.BratEntity]) tuple[medkit.io._brat_utils.BratRelation, medkit.io._brat_utils.RelationConf]#

Get a brat relation from a medkit relation.

Parameters:
relationRelation

A medkit relation to convert into brat format

nb_relationint

The current counter of brat relations

brat_entities_by_segment_iddict of str to BratEntity

A dict to map medkit ID to brat annotation

Returns:
relationBratRelation

The equivalent brat relation of the medkit relation

configRelationConf

Configuration of the brat relation

Raises:
ValueError

When the source or target was not found in the mapping object
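
The lookup and the error case can be sketched as follows. The sketch formats a relation line of the brat standoff format directly; format_relation and its plain-string mapping are hypothetical simplifications, whereas the real method returns BratRelation and RelationConf objects.

```python
def format_relation(label, nb_relation, source_id, target_id, brat_ids_by_segment_id):
    """Build a brat relation line from medkit-side segment identifiers."""
    for segment_id in (source_id, target_id):
        if segment_id not in brat_ids_by_segment_id:
            # Mirrors the documented ValueError for a missing source or target.
            raise ValueError(f"Segment {segment_id!r} has no brat entity")
    subj = brat_ids_by_segment_id[source_id]
    obj = brat_ids_by_segment_id[target_id]
    return f"R{nb_relation}\t{label} Arg1:{subj} Arg2:{obj}"


mapping = {"seg_a": "T1", "seg_b": "T2"}
print(format_relation("treats", 1, "seg_a", "seg_b", mapping))
```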

static _convert_attribute_to_brat(label: str, value: str | None, nb_attribute: int, target_brat_id: str, is_from_entity: bool) tuple[medkit.io._brat_utils.BratAttribute, medkit.io._brat_utils.AttributeConf]#

Get a brat attribute from a medkit attribute.

Parameters:
labelstr

Attribute label to convert into brat format

valuestr, optional

Attribute value

nb_attributeint

The current counter of brat attributes

target_brat_idstr

Corresponding target brat ID

Returns:
attributeBratAttribute

The equivalent brat attribute of the medkit attribute

configAttributeConf

Configuration of the brat attribute

static _convert_umls_attributes_to_brat_note(cuis: list[str], nb_note: int, target_brat_id: str) medkit.io._brat_utils.BratNote#

Get a brat note from a medkit umls norm attribute.

Parameters:
cuislist of str

CUIs to convert to a brat note

nb_noteint

The current counter of brat notes

target_brat_idstr

Corresponding target brat ID

Returns:
BratNote

The equivalent brat note of the medkit umls attribute
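
As a sketch of the resulting note, the helper below joins the CUIs with spaces, as described for convert_cuis_to_notes, and writes them as an AnnotatorNotes line of the brat standoff format. format_cui_note is a hypothetical name; the real method returns a BratNote object rather than a string.

```python
def format_cui_note(cuis, nb_note, target_brat_id):
    """Build a brat annotator note line holding space-separated CUIs."""
    return f"#{nb_note}\tAnnotatorNotes {target_brat_id}\t{' '.join(cuis)}"


print(format_cui_note(["C0011849", "C0004096"], 1, "T3"))
```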

static _convert_attributes_to_brat_note(values: list[Any], nb_note: int, target_brat_id: str) medkit.io._brat_utils.BratNote#

Get a brat note from medkit attribute values.

Parameters:
valueslist of Any

Attribute values

nb_noteint

The current counter of brat notes

target_brat_idstr

Corresponding target brat ID

Returns:
BratNote

The equivalent brat note of the medkit attribute values

class medkit.io.DoccanoClientConfig#

Doccano client configuration.

The default values are the default values used by doccano.

Attributes:
column_textstr, default=”text”

Name or key representing the text

column_labelstr, default=”label”

Name or key representing the label

column_text: str = 'text'#
column_label: str = 'label'#
class medkit.io.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)#

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.

The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).

Warning

If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.

Parameters:
taskDoccanoTask

The doccano task for the input converter

client_configDoccanoClientConfig, optional

Optional client configuration to define default values in doccano interface. This config can change, for example, the name of the text field or labels.

attr_labelstr, default=”doccano_category”

The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.

uidstr, optional

Identifier of the converter.

Attributes:
descriptionOperationDescription

Description for the operation.

uid#
client_config#
task#
attr_label#
_prov_tracer: medkit.core.ProvTracer | None = None#
set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:
prov_tracerProvTracer

The provenance tracer used to trace the provenance.

property description: medkit.core.OperationDescription#

Contains all the input converter init parameters.

load_from_directory_zip(dir_path: str | pathlib.Path) list[medkit.core.text.TextDocument]#

Load text documents from a directory of zip files.

The zip files should contain JSONL files coming from doccano.

Parameters:
dir_pathstr or Path

The path to the directory containing zip files.

Returns:
list of TextDocument

A list of TextDocuments

load_from_zip(input_file: str | pathlib.Path) list[medkit.core.text.TextDocument]#

Load text documents from a zip file.

Parameters:
input_filestr or Path

The path to the zip file containing a doccano JSONL file

Returns:
list of TextDocument

A list of TextDocuments
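
Reading JSONL records out of a zip archive can be sketched with the standard library alone. read_jsonl_from_zip is a hypothetical helper that returns raw dictionaries; the assumption that only members ending in ".jsonl" are read is ours, and the real converter turns each line into a TextDocument.

```python
import io
import json
import zipfile


def read_jsonl_from_zip(input_file):
    """Return the JSON records of every .jsonl member of a zip archive."""
    records = []
    with zipfile.ZipFile(input_file) as archive:
        for name in archive.namelist():
            # Assumption: other members (readme, images...) are ignored.
            if not name.endswith(".jsonl"):
                continue
            with archive.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    if line.strip():
                        records.append(json.loads(line))
    return records
```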

load_from_file(input_file: str | pathlib.Path) list[medkit.core.text.TextDocument]#

Load text documents from a JSONL file.

Parameters:
input_filestr or Path

The path to the JSONL file containing doccano annotations

Returns:
list of TextDocument

A list of TextDocuments

_check_crlf_character(documents: list[medkit.core.text.TextDocument])#

Check if the list of converted documents contains the CRLF character.

This character is the only indicator available to warn if there are alignment problems in the documents.
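
A minimal version of such a check, working on plain strings instead of TextDocument objects, could look like the sketch below. The helper name, the warning message and the boolean return value are assumptions made for illustration.

```python
import warnings


def warn_if_crlf(texts):
    """Warn and return True if any text contains a CRLF sequence."""
    found = any("\r\n" in text for text in texts)
    if found:
        warnings.warn(
            "CRLF found: offsets may be misaligned with the doccano project",
            stacklevel=2,
        )
    return found
```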

_parse_doc_line(doc_line: dict[str, Any]) medkit.core.text.TextDocument#

Parse a doc_line into a TextDocument depending on the task.

Parameters:
doc_linedict of str to Any

A dictionary representing an annotation from doccano

Returns:
TextDocument

A document with parsed annotations.

_parse_doc_line_relation_extraction(doc_line: dict[str, Any]) medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities and relations.

Parameters:
doc_linedict of str to Any

Dictionary with doccano annotation

Returns:
TextDocument

The document with annotations

_parse_doc_line_seq_labeling(doc_line: dict[str, Any]) medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities.

Parameters:
doc_linedict of str to Any

Dictionary with doccano annotation.

Returns:
TextDocument

The document with annotations
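
For sequence labeling, doccano exports each document as a text plus a list of [start, end, label] triples. The sketch below reads such a line into plain dictionaries; parse_seq_labeling is a hypothetical helper, and the column_text and column_label arguments mirror the DoccanoClientConfig defaults ("text" and "label"). The real method builds a TextDocument with entities.

```python
def parse_seq_labeling(doc_line, column_text="text", column_label="label"):
    """Turn a doccano sequence labeling line into text plus entity dicts."""
    text = doc_line[column_text]
    entities = []
    for start, end, label in doc_line[column_label]:
        entities.append({"label": label, "span": (start, end), "text": text[start:end]})
    return {"text": text, "entities": entities}


line = {"text": "Aspirin relieves pain", "label": [[0, 7, "drug"], [17, 21, "symptom"]]}
print(parse_seq_labeling(line))
```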

_parse_doc_line_text_classification(doc_line: dict[str, Any]) medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with an attribute.

Parameters:
doc_linedict of str to Any

Dictionary with doccano annotation.

Returns:
TextDocument

The document with its category

class medkit.io.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)#

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument a jsonline will be created.

Parameters:
taskDoccanoTask

The doccano task for the output converter

anns_labelslist of str, optional

Labels of medkit annotations to convert into doccano annotations. If None (default) all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

attr_labelstr, optional

The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.

ignore_segmentsbool, default=True

If True medkit segments will be ignored. Only entities will be converted to Doccano entities. If False the medkit segments will be converted to Doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

include_metadatabool, default=True

Whether to include medkit metadata in the converted documents

uidstr, optional

Identifier of the converter.

Attributes:
descriptionOperationDescription

Description for the operation.

uid#
task#
anns_labels#
attr_label#
ignore_segments#
include_metadata#
property description: medkit.core.OperationDescription#
save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)#

Convert and save a list of TextDocuments into a doccano file (.JSONL).

Parameters:
docslist of TextDocument

List of medkit doc objects to convert

output_filestr or Path

Path or string of the JSONL file where to save the converted documents
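
Writing one JSON object per line is all a JSONL file requires. The following sketch shows this with already-converted dictionaries; write_jsonl is a hypothetical helper, and the choice of ensure_ascii=False (to keep accented characters readable) is an assumption, not a documented medkit behaviour.

```python
import json
from pathlib import Path


def write_jsonl(doc_lines, output_file):
    """Write each dictionary as one JSON line of a UTF-8 file."""
    with Path(output_file).open("w", encoding="utf-8") as fp:
        for doc_line in doc_lines:
            fp.write(json.dumps(doc_line, ensure_ascii=False) + "\n")
```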

_convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) dict[str, Any]#

Convert a TextDocument into a dictionary depending on the task.

Parameters:
medkit_docTextDocument

Document to convert

Returns:
dict of str to Any

Dictionary with doccano annotation

_convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.

Parameters:
medkit_docTextDocument

Document to convert, it may contain entities and relations.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text, entities and relations.

_convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.

Parameters:
medkit_docTextDocument

Document to convert, it may contain entities.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text and its label (a list of tuples representing entities).

_convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano text classification task.

Parameters:
medkit_docTextDocument

Document to convert. It should contain at least one attribute to convert.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text and its label (a category(str)).

class medkit.io.DoccanoTask(*args, **kwds)#

Bases: enum.Enum

Supported doccano tasks.

Attributes:
TEXT_CLASSIFICATION

Documents with a category

RELATION_EXTRACTION

Documents with entities and relations (including IDs)

SEQUENCE_LABELING

Documents with entities in tuples

TEXT_CLASSIFICATION = 'text_classification'#
RELATION_EXTRACTION = 'relation_extraction'#
SEQUENCE_LABELING = 'sequence_labeling'#
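
Because the class derives from enum.Enum, a member can be looked up from its string value. The stand-in enum below reuses the values listed above to show this without importing medkit; DoccanoTaskSketch and task_from_name are hypothetical names introduced for the example.

```python
from enum import Enum


class DoccanoTaskSketch(Enum):
    """Stand-in mirroring the documented members and values."""

    TEXT_CLASSIFICATION = "text_classification"
    RELATION_EXTRACTION = "relation_extraction"
    SEQUENCE_LABELING = "sequence_labeling"


def task_from_name(name: str) -> DoccanoTaskSketch:
    """Look up a task from a user-provided string (ValueError if unknown)."""
    return DoccanoTaskSketch(name.strip().lower())


print(task_from_name("sequence_labeling"))
```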
class medkit.io.RTTMInputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker', converter_id: str | None = None)#

Bases: medkit.core.InputConverter

Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

For each turn in a .rttm file containing diarization information, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters:
turn_labelstr, default=”turn”

Label of segments representing turns in the .rttm file.

speaker_labelstr, default=”speaker”

Label of speaker attributes to add to each segment.

converter_idstr, optional

Identifier of the converter.

Attributes:
descriptionOperationDescription

Description for the operation.

uid#
turn_label#
speaker_label#
_prov_tracer: medkit.core.ProvTracer | None = None#
property description: medkit.core.OperationDescription#

Contains all the input converter init parameters.

set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:
prov_tracerProvTracer

The provenance tracer used to trace the provenance.

load(rttm_dir: str | pathlib.Path, audio_dir: str | pathlib.Path | None = None, audio_ext: str = '.wav') list[medkit.core.audio.AudioDocument]#

Load all .rttm files in a directory into a list of audio documents.

For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters:
rttm_dirstr or Path

Directory containing the .rttm files.

audio_dirstr or Path, optional

Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.

audio_extstr, default=”.wav”

File extension to use for audio files.

Returns:
list of AudioDocument

List of generated documents.

load_doc(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) medkit.core.audio.AudioDocument#

Load a single .rttm file into an audio document.

Parameters:
rttm_filestr or Path

Path to the .rttm file.

audio_filestr or Path

Path to the corresponding audio file.

Returns:
AudioDocument

Generated document.

load_turns(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) list[medkit.core.audio.Segment]#

Load a .rttm file as a list of segments.

Parameters:
rttm_filestr or Path

Path to the .rttm file.

audio_filestr or Path

Path to the corresponding audio file.

Returns:
list of Segment

Turn segments as found in the .rttm file.
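
To show what a turn looks like in a .rttm file, the sketch below parses one line into a dictionary. It assumes the usual ten space-separated RTTM fields (type, file id, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead); the field names and the parse_rttm_line helper are ours, and the real converter builds audio Segment objects instead.

```python
RTTM_FIELDS = [
    "type", "file_id", "channel", "onset", "duration",
    "ortho", "speaker_type", "speaker_name", "confidence", "lookahead",
]


def parse_rttm_line(line):
    """Parse one RTTM line into a dict with numeric onset and duration."""
    values = line.split()
    if len(values) != len(RTTM_FIELDS):
        raise ValueError(f"Expected {len(RTTM_FIELDS)} fields, got {len(values)}")
    row = dict(zip(RTTM_FIELDS, values))
    row["onset"] = float(row["onset"])
    row["duration"] = float(row["duration"])
    return row


row = parse_rttm_line("SPEAKER rec1 1 0.50 2.25 <NA> <NA> spk_A <NA> <NA>")
# The turn covers [onset, onset + duration] seconds of the audio file.
print(row["speaker_name"], row["onset"], row["onset"] + row["duration"])
```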

static _load_rows(rttm_file: pathlib.Path)#
_build_turn_segment(row: dict[str, Any], full_audio: medkit.core.audio.FileAudioBuffer) medkit.core.audio.Segment#
class medkit.io.RTTMOutputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker')#

Bases: medkit.core.OutputConverter

Class for conversions to Rich Transcription Time Marked (.rttm).

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters:
turn_labelstr, default=”turn”

Label of segments representing turns in the audio documents.

speaker_labelstr, default=”speaker”

Label of speaker attributes attached to each turn segment.

turn_label#
speaker_label#
save(docs: list[medkit.core.audio.AudioDocument], rttm_dir: str | pathlib.Path, doc_names: list[str] | None = None)#

Save a collection of audio documents to RTTM files in a directory.

Parameters:
docslist of AudioDocument

List of audio documents to save.

rttm_dirstr or Path

Directory into which the generated .rttm files will be stored.

doc_nameslist of str, optional

Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none provided, the document ids will be used.

save_doc(doc: medkit.core.audio.AudioDocument, rttm_file: str | pathlib.Path, rttm_doc_id: str | None = None)#

Save a single audio document to an RTTM file.

Parameters:
docAudioDocument

Audio document to save.

rttm_filestr or Path

Path of the generated .rttm file.

rttm_doc_idstr, optional

File uid to use for the generated .rttm file (2nd column). If none provided, the document uid will be used.

save_turn_segments(turn_segments: list[medkit.core.audio.Segment], rttm_file: str | pathlib.Path, rttm_doc_id: str | None)#

Save Segment objects into a .rttm file.

Parameters:
turn_segmentslist of Segment

Turn segments to save.

rttm_filestr or Path

Path of the generated .rttm file.

rttm_doc_idstr, optional

File uid to use for the generated .rttm file (2nd column).
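
The line written for each turn can be sketched as below, with the file id in the second column as noted above. format_rttm_row is a hypothetical helper taking start and end times in seconds; the fixed channel value 1, the three-decimal formatting and the <NA> placeholders are assumptions based on common RTTM usage, not on medkit's code.

```python
def format_rttm_row(rttm_doc_id, start, end, speaker):
    """Build one RTTM line for a speaker turn."""
    duration = end - start
    return (
        f"SPEAKER {rttm_doc_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


print(format_rttm_row("rec1", 0.5, 2.75, "spk_A"))
```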

_build_rttm_row(turn_segment: medkit.core.audio.Segment, rttm_doc_id: str | None) dict[str, Any]#