medkit.io#

Subpackages#

medkit.io.medkit_json

Submodules#

Classes#

`BratInputConverter`	Class in charge of converting brat annotations.
`BratOutputConverter`	Class for converting text documents to a brat collection file.
`DoccanoClientConfig`	Doccano client configuration.
`DoccanoInputConverter`	Convert doccano files (.JSONL) containing annotations for a given task.
`DoccanoOutputConverter`	Convert medkit files to doccano files (.JSONL) for a given task.
`DoccanoTask`	Supported doccano tasks.
`RTTMInputConverter`	Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.
`RTTMOutputConverter`	Class for conversions to Rich Transcription Time Marked (.rttm).

Package Contents#

class medkit.io.BratInputConverter(detect_cuis_in_notes: bool = True, notes_label: str = 'brat_note', uid: str | None = None)#

Bases: medkit.core.InputConverter

Class in charge of converting brat annotations.

Parameters:

detect_cuis_in_notes (bool, default=True): If True, strings that look like CUIs in the annotator notes of entities are converted to UMLS normalization attributes, rather than creating an Attribute with the whole note text as its value.
notes_label (str, default="brat_note"): Label to use for attributes created from annotator notes.
uid (str, optional): Identifier of the converter.
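
The effect of detect_cuis_in_notes can be pictured with a small standalone sketch. It assumes that a CUI is the letter C followed by seven digits (the usual UMLS form); the regular expression and the `split_note` helper are hypothetical and are not medkit's actual implementation.

```python
import re

# Assumption: a UMLS CUI is "C" followed by exactly seven digits.
CUI_PATTERN = re.compile(r"\bC\d{7}\b")


def split_note(note: str) -> tuple[list[str], str]:
    """Return the CUI-like strings found in a note and the remaining text."""
    cuis = CUI_PATTERN.findall(note)
    # Remove the CUIs and collapse the whitespace they leave behind
    remainder = " ".join(CUI_PATTERN.sub(" ", note).split())
    return cuis, remainder


cuis, remainder = split_note("C0011849 diabetes, see also C0004096")
print(cuis)       # ['C0011849', 'C0004096']
print(remainder)  # diabetes, see also
```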

Attributes:

description (OperationDescription): Description of the operation.

notes_label#

detect_cuis_in_notes#

uid#

_prov_tracer: medkit.core.ProvTracer | None = None#

property description: medkit.core.OperationDescription#

set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

load(dir_path: str | pathlib.Path, ann_ext: str = ANN_EXT, text_ext: str = TEXT_EXT) → list[medkit.core.text.TextDocument]#

Load brat annotations as text documents.

Create a list of TextDocuments from a folder containing text files and associated brat annotations files.

Parameters:

dir_path (str or Path): Path to the directory containing the text files and the annotation files (.ann).
ann_ext (str, optional): Extension of the brat annotation files (e.g. .ann).
text_ext (str, optional): Extension of the text files (e.g. .txt).

Returns:

list of TextDocument: The list of TextDocuments

load_doc(ann_path: str | pathlib.Path, text_path: str | pathlib.Path) → medkit.core.text.TextDocument#

Load a brat annotation and text file combo as a text document.

Create a TextDocument from a .ann file and its associated .txt file.

Parameters:

ann_path (str or Path): The path to the brat annotation file.
text_path (str or Path): The path to the text document file.

Returns:

TextDocument: The document containing the text and the annotations

load_annotations(ann_file: str | pathlib.Path) → list[medkit.core.text.TextAnnotation]#

Load a brat annotation file as a list of annotations.

Load a .ann file and return a list of Annotation objects.

Parameters:

ann_file (str or Path): Path to the .ann file.

Returns:

list of TextAnnotation: The list of text annotations
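
For orientation, a .ann file stores one annotation per line in brat's standoff format. Entity lines start with a T identifier, followed by a tab, the label with its character offsets, another tab, and the covered text. The sketch below parses only such entity lines using the standard library; `parse_entity_line` is a hypothetical helper and says nothing about how medkit implements the parsing.

```python
def parse_entity_line(line: str) -> dict:
    """Parse a brat entity line such as 'T1<TAB>Disease 0 8<TAB>diabetes'."""
    ann_id, type_and_spans, text = line.rstrip("\n").split("\t", 2)
    label, raw_spans = type_and_spans.split(" ", 1)
    # Discontinuous entities separate their fragments with ';'
    spans = [tuple(int(n) for n in frag.split()) for frag in raw_spans.split(";")]
    return {"id": ann_id, "label": label, "spans": spans, "text": text}


entity = parse_entity_line("T1\tDisease 0 8;15 19\tdiabetes mild")
print(entity["label"], entity["spans"])  # Disease [(0, 8), (15, 19)]
```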

class medkit.io.BratOutputConverter(anns_labels: list[str] | None = None, attrs: list[str] | None = None, notes_label: str = 'brat_note', ignore_segments: bool = True, convert_cuis_to_notes: bool = True, create_config: bool = True, top_values_by_attr: int = 50, uid: str | None = None)#

Bases: medkit.core.OutputConverter

Class for converting text documents to a brat collection file.

Hint

BRAT checks for coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.

Parameters:

anns_labels (list of str, optional): Labels of medkit annotations to convert into brat annotations. If None (default), all annotations are converted.
attrs (list of str, optional): Labels of medkit attributes to include in the converted annotations. If None (default), all medkit attributes found in the segments or relations are converted to brat attributes.
notes_label (str, default="brat_note"): Label of attributes that will be converted to annotator notes.
ignore_segments (bool, default=True): If True, medkit segments are ignored and only entities, attributes and relations are converted to brat annotations. If False, medkit segments are converted to brat annotations as well.
convert_cuis_to_notes (bool, default=True): If True, UMLS normalization attributes are converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs are separated by spaces (e.g. "C0011849 C0004096").
create_config (bool, default=True): Whether to create a configuration file for the generated collection. This file defines the types of the generated annotations and is required for correct visualization in brat.
top_values_by_attr (int, default=50): Number of most common values per attribute to include in the configuration. This is useful when an attribute has a large number of values: only the most frequent ones are written to the config.
uid (str, optional): Identifier of the converter.
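
The idea behind top_values_by_attr can be sketched with `collections.Counter`. This is only an illustration of "keep the N most common values"; the `top_values` helper and the sample data are hypothetical, and the converter's own selection logic may differ.

```python
from collections import Counter


def top_values(values: list[str], top: int = 50) -> list[str]:
    """Keep only the `top` most frequent values, most common first."""
    return [value for value, _ in Counter(values).most_common(top)]


observed = ["mild", "severe", "mild", "moderate", "mild", "severe"]
print(top_values(observed, top=2))  # ['mild', 'severe']
```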

Attributes:

description (OperationDescription): Description of the operation.

uid#

anns_labels#

attrs#

notes_label#

ignore_segments#

convert_cuis_to_notes#

create_config#

top_values_by_attr#

property description: medkit.core.OperationDescription#

save(docs: list[medkit.core.text.TextDocument], dir_path: str | pathlib.Path, doc_names: list[str] | None = None)#

Save text documents as brat files.

Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files. A file named ‘annotation.conf’ may also be saved if required.

Parameters:

docs (list of TextDocument): List of medkit documents to convert.
dir_path (str or Path): Directory in which to save the generated files.
doc_names (list of str, optional): Names to use for the generated files. If None, the uid of each document is used as its name: 'uid.txt' holds the raw text of the document and 'uid.ann' the brat annotations.

_convert_medkit_anns_to_brat(segments: list[medkit.core.text.Segment], relations: list[medkit.core.text.Relation], config: medkit.io._brat_utils.BratAnnConfiguration, raw_text: str) → list[medkit.io._brat_utils.BratEntity | medkit.io._brat_utils.BratAttribute | medkit.io._brat_utils.BratRelation | medkit.io._brat_utils.BratNote]#

Convert Segments, Relations and Attributes into brat data structures.

Parameters:

segments (list of Segment): Medkit segments to convert.
relations (list of Relation): Medkit relations to convert.
config (BratAnnConfiguration): Configuration structure; this object is updated with the types of the generated brat annotations.
raw_text (str): Reference text from which the original text of the annotations is taken.

Returns:

list of BratEntity or BratAttribute or BratRelation or BratNote: A list of brat annotations

static _ensure_text_and_spans(segment: medkit.core.text.Segment, raw_text: str) → tuple[str, list[tuple[int, int]]]#

Ensure consistency between the segment and the raw text.

The text of a BRAT annotation can’t contain multiple white spaces (including a newline character). This method cleans the text of the fragments and adjusts their spans to point to the same location in the raw text.

Parameters:

segment (Segment): Segment to check.
raw_text (str): Reference text.

Returns:

text (str): The cleaned text.
spans (list of tuple): The adjusted spans.
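
A minimal sketch of this kind of clean-up, assuming that a span is split wherever a newline or a run of two or more whitespace characters occurs and that the fragments are joined with single spaces. `clean_text_and_spans` is a hypothetical stand-in for illustration, not the method's actual code.

```python
import re

# Assumption: split on newlines (with surrounding blanks) or on 2+ whitespace chars
SEPARATOR = re.compile(r"\s*\n\s*|\s{2,}")


def clean_text_and_spans(raw_text, spans):
    """Return the cleaned text and spans that still index into raw_text."""
    texts, new_spans = [], []
    for start, end in spans:
        fragment = raw_text[start:end]
        pos = 0
        for match in SEPARATOR.finditer(fragment):
            if match.start() > pos:
                texts.append(fragment[pos:match.start()])
                new_spans.append((start + pos, start + match.start()))
            pos = match.end()
        if pos < len(fragment):
            texts.append(fragment[pos:])
            new_spans.append((start + pos, end))
    return " ".join(texts), new_spans


raw = "severe  chest\npain today"
text, spans = clean_text_and_spans(raw, [(0, 18)])
print(text)   # severe chest pain
print(spans)  # [(0, 6), (8, 13), (14, 18)]
```

Each returned span still selects its fragment from the unmodified raw text, which is the coherence brat checks for.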

_convert_segment_to_brat(segment: medkit.core.text.Segment, nb_segment: int, raw_text: str) → medkit.io._brat_utils.BratEntity#

Get a brat entity from a medkit segment.

Parameters:

segment (Segment): A medkit segment to convert into brat format.
nb_segment (int): The current counter of brat segments.
raw_text (str): Reference text from which the original text of the segment is taken.

Returns:

BratEntity: The equivalent brat entity of the medkit segment

static _convert_relation_to_brat(relation: medkit.core.text.Relation, nb_relation: int, brat_entities_by_segment_id: dict[str, medkit.io._brat_utils.BratEntity]) → tuple[medkit.io._brat_utils.BratRelation, medkit.io._brat_utils.RelationConf]#

Get a brat relation from a medkit relation.

Parameters:

relation (Relation): A medkit relation to convert into brat format.
nb_relation (int): The current counter of brat relations.
brat_entities_by_segment_id (dict of str to BratEntity): Mapping from medkit segment IDs to the corresponding brat entities.

Returns:

relation (BratRelation): The equivalent brat relation of the medkit relation.
config (RelationConf): Configuration of the brat relation.

Raises:

ValueError: When the source or target was not found in the mapping object

static _convert_attribute_to_brat(label: str, value: str | None, nb_attribute: int, target_brat_id: str, is_from_entity: bool) → tuple[medkit.io._brat_utils.BratAttribute, medkit.io._brat_utils.AttributeConf]#

Get a brat attribute from a medkit attribute.

Parameters:

label (str): Attribute label to convert into brat format.
value (str, optional): Attribute value.
nb_attribute (int): The current counter of brat attributes.
target_brat_id (str): Corresponding target brat ID.

Returns:

attribute (BratAttribute): The equivalent brat attribute of the medkit attribute.
config (AttributeConf): Configuration of the brat attribute.

static _convert_umls_attributes_to_brat_note(cuis: list[str], nb_note: int, target_brat_id: str) → medkit.io._brat_utils.BratNote#

Get a brat note from a medkit umls norm attribute.

Parameters:

cuis (list of str): CUIs to convert to a brat note.
nb_note (int): The current counter of brat notes.
target_brat_id (str): Corresponding target brat ID.

Returns:

BratNote: The equivalent brat note of the medkit umls attribute
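
As an illustration of the resulting note, the sketch below formats the line that a brat annotator note occupies in a .ann file, with the CUIs joined by spaces as described for convert_cuis_to_notes. `build_cui_note` is a hypothetical helper returning a plain string, whereas the actual method returns a BratNote object.

```python
def build_cui_note(cuis: list[str], nb_note: int, target_brat_id: str) -> str:
    """Format a brat note line whose value is the space-separated CUIs."""
    value = " ".join(cuis)
    # Standoff note lines: '#<n>', tab, 'AnnotatorNotes <target>', tab, note text
    return f"#{nb_note}\tAnnotatorNotes {target_brat_id}\t{value}"


print(build_cui_note(["C0011849", "C0004096"], 1, "T1"))
```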

static _convert_attributes_to_brat_note(values: list[Any], nb_note: int, target_brat_id: str) → medkit.io._brat_utils.BratNote#

Get a brat note from medkit attribute values.

Parameters:

values (list of Any): Attribute values.
nb_note (int): The current counter of brat notes.
target_brat_id (str): Corresponding target brat ID.

Returns:

BratNote: The equivalent brat note of the medkit attribute values

class medkit.io.DoccanoClientConfig#

Doccano client configuration.

The default values are the default values used by doccano.

Attributes:

column_text (str, default="text"): Name or key representing the text.
column_label (str, default="label"): Name or key representing the label.

column_text: str = 'text'#

column_label: str = 'label'#

class medkit.io.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)#

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.

The converter supports a custom configuration defining the parameters used by doccano when importing the data (cf. DoccanoClientConfig).

Warning

If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
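
The source of the misalignment can be illustrated in plain Python, which indexes strings by code point. The sketch assumes a text containing a combining accent; the example string is made up, and the one-position shift applies to this string only.

```python
text = "cafe\u0301 noir"  # "café noir" written with a combining acute accent

# Python indexes strings by code point: the accented "é" takes two positions.
code_point_offset = text.index("noir")

# A tool counting grapheme clusters sees "é" as a single character,
# so it would report the same word one position earlier.
grapheme_offset = code_point_offset - 1

print(code_point_offset)                          # 6
print(text[grapheme_offset:grapheme_offset + 4])  # ' noi' (shifted by one)
```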

Parameters:

task (DoccanoTask): The doccano task for the input converter.
client_config (DoccanoClientConfig, optional): Client configuration defining the default values used in the doccano interface. This config can change, for example, the name of the text field or of the labels.
attr_label (str, default="doccano_category"): The label to use for the medkit attribute that represents the doccano category. This applies to TEXT_CLASSIFICATION projects.
uid (str, optional): Identifier of the converter.

Attributes:

description (OperationDescription): Description of the operation.

uid#

client_config#

task#

attr_label#

_prov_tracer: medkit.core.ProvTracer | None = None#

set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer): The provenance tracer used to trace the provenance.

property description: medkit.core.OperationDescription#: Contains all the input converter init parameters.

load_from_directory_zip(dir_path: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a directory of zip files.

The zip files should contain JSONL files coming from doccano.

Parameters:

dir_path (str or Path): The path to the directory containing zip files.

Returns:

list of TextDocument: A list of TextDocuments

load_from_zip(input_file: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a zip file.

Parameters:

input_file (str or Path): The path to the zip file containing a doccano JSONL file.

Returns:

list of TextDocument: A list of TextDocuments

load_from_file(input_file: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a JSONL file.

Parameters:

input_file (str or Path): The path to the JSONL file containing doccano annotations.

Returns:

list of TextDocument: A list of TextDocuments
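
To give an idea of the input, the sketch below reads a JSONL export with the standard library. It assumes the sequence labeling layout, in which each line is a JSON object with a text key and a label list of [start, end, label] triples; the key names follow the DoccanoClientConfig defaults, while `read_seq_labeling` and the sample content are hypothetical.

```python
import io
import json

# Stand-in for a doccano export file; the content is made up for illustration.
jsonl_file = io.StringIO(
    '{"text": "Aspirin for headache", "label": [[0, 7, "DRUG"], [12, 20, "SYMPTOM"]]}\n'
    '{"text": "No treatment", "label": []}\n'
)


def read_seq_labeling(lines, column_text="text", column_label="label"):
    """Yield (text, entities) pairs, one per non-empty JSONL line."""
    for line in lines:
        if not line.strip():
            continue
        doc_line = json.loads(line)
        text = doc_line[column_text]
        entities = [(label, text[start:end]) for start, end, label in doc_line[column_label]]
        yield text, entities


docs = list(read_seq_labeling(jsonl_file))
print(docs[0][1])  # [('DRUG', 'Aspirin'), ('SYMPTOM', 'headache')]
```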

_check_crlf_character(documents: list[medkit.core.text.TextDocument])#

Check if the list of converted documents contains the CRLF character.

This character is the only indicator available to warn if there are alignment problems in the documents.

_parse_doc_line(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a doc_line into a TextDocument depending on the task.

Parameters:

doc_line (dict of str to Any): A dictionary representing an annotation from doccano.

Returns:

TextDocument: A document with parsed annotations.

_parse_doc_line_relation_extraction(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities and relations.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with annotations

_parse_doc_line_seq_labeling(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with annotations

_parse_doc_line_text_classification(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with an attribute.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with its category

class medkit.io.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)#

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument, one JSON line will be created.

Parameters:

task (DoccanoTask): The doccano task for the output converter.
anns_labels (list of str, optional): Labels of medkit annotations to convert into doccano annotations. If None (default), all the entities or relations are converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
attr_label (str, optional): The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
ignore_segments (bool, default=True): If True, medkit segments are ignored and only entities are converted to doccano entities. If False, medkit segments are converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
include_metadata (bool, default=True): Whether to include medkit metadata in the converted documents.
uid (str, optional): Identifier of the converter.

Attributes:

description (OperationDescription): Description of the operation.

uid#

task#

anns_labels#

attr_label#

ignore_segments#

include_metadata#

property description: medkit.core.OperationDescription#

save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)#

Convert and save a list of TextDocuments into a doccano file (.JSONL).

Parameters:

docs (list of TextDocument): List of medkit documents to convert.
output_file (str or Path): Path of the JSONL file in which to save the converted documents.
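
The output layout, one JSON object per line, can be sketched with the standard library. The dictionaries below imitate text classification lines using the default text and label keys; `write_jsonl` and the sample data are hypothetical and bypass the converter entirely.

```python
import io
import json


def write_jsonl(doc_lines, fp):
    """Write each dictionary as one JSON object per line."""
    for doc_line in doc_lines:
        # ensure_ascii=False keeps accented characters readable in the file
        fp.write(json.dumps(doc_line, ensure_ascii=False) + "\n")


buffer = io.StringIO()  # stand-in for an open output file
write_jsonl(
    [
        {"text": "Fièvre légère", "label": ["symptom"]},
        {"text": "Aspirin 500 mg", "label": ["treatment"]},
    ],
    buffer,
)
print(buffer.getvalue())
```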

_convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument into a dictionary depending on the task.

Parameters:

medkit_doc (TextDocument): Document to convert.

Returns:

dict of str to Any: Dictionary with doccano annotation

_convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.

Parameters:

medkit_doc (TextDocument): Document to convert; it may contain entities and relations.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain text, entities and relations.

_convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.

Parameters:

medkit_doc (TextDocument): Document to convert; it may contain entities.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain the text and its label (a list of tuples representing entities).

_convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano text classification task.

Parameters:

medkit_doc (TextDocument): Document to convert; it should contain at least one attribute to convert.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain the text and its label (a category, as a str).

class medkit.io.DoccanoTask(*args, **kwds)#

Bases: enum.Enum

Supported doccano tasks.

Attributes:

TEXT_CLASSIFICATION: Documents with a category
RELATION_EXTRACTION: Documents with entities and relations (including IDs)
SEQUENCE_LABELING: Documents with entities in tuples

TEXT_CLASSIFICATION = 'text_classification'#

RELATION_EXTRACTION = 'relation_extraction'#

SEQUENCE_LABELING = 'sequence_labeling'#

class medkit.io.RTTMInputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker', converter_id: str | None = None)#

Bases: medkit.core.InputConverter

Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.

Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.

For each turn in a .rttm file containing diarization information, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.

If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).

Parameters:

turn_label (str, default="turn"): Label of segments representing turns in the .rttm file.
speaker_label (str, default="speaker"): Label of speaker attributes to add to each segment.
converter_id (str, optional): Identifier of the converter.

Attributes:

description (OperationDescription): Description of the operation.

uid#

turn_label#

speaker_label#

_prov_tracer: medkit.core.ProvTracer | None = None#

property description: medkit.core.OperationDescription#: Contains all the input converter init parameters.

set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer): The provenance tracer used to trace the provenance.

load(rttm_dir: str | pathlib.Path, audio_dir: str | pathlib.Path | None = None, audio_ext: str = '.wav') → list[medkit.core.audio.AudioDocument]#

Load all .rttm files in a directory into a list of audio documents.

For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.

Parameters:

rttm_dir (str or Path): Directory containing the .rttm files.
audio_dir (str or Path, optional): Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.
audio_ext (str, default=".wav"): File extension to use for audio files.

Returns:

list of AudioDocument: List of generated documents.

load_doc(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) → medkit.core.audio.AudioDocument#

Load a single .rttm file into an audio document.

Parameters:

rttm_file (str or Path): Path to the .rttm file.
audio_file (str or Path): Path to the corresponding audio file.

Returns:

AudioDocument: Generated document.

load_turns(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) → list[medkit.core.audio.Segment]#

Load a .rttm file as a list of segments.

Parameters:

rttm_file (str or Path): Path to the .rttm file.
audio_file (str or Path): Path to the corresponding audio file.

Returns:

list of Segment: Turn segments as found in the .rttm file.
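
For reference, an RTTM line holds ten space-separated fields: type, file id, channel, turn onset, turn duration, orthography, speaker type, speaker name, confidence score and signal lookahead time. The sketch below extracts the values relevant to a turn from one such line; `parse_rttm_line`, the field names and the sample line are hypothetical and do not reflect the converter's internals.

```python
RTTM_FIELDS = [
    "type", "file_id", "channel", "onset", "duration",
    "ortho", "speaker_type", "speaker_name", "confidence", "lookahead",
]


def parse_rttm_line(line: str) -> dict:
    """Extract the file id, speaker name and time range (in seconds) of a turn."""
    row = dict(zip(RTTM_FIELDS, line.split()))
    start = float(row["onset"])
    return {
        "file_id": row["file_id"],
        "speaker": row["speaker_name"],
        "start": start,
        "end": start + float(row["duration"]),
    }


turn = parse_rttm_line("SPEAKER visit01 1 12.50 3.25 <NA> <NA> doctor <NA> <NA>")
print(turn)
```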

static _load_rows(rttm_file: pathlib.Path)#

_build_turn_segment(row: dict[str, Any], full_audio: medkit.core.audio.FileAudioBuffer) → medkit.core.audio.Segment#

class medkit.io.RTTMOutputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker')#

Bases: medkit.core.OutputConverter

Class for conversions to Rich Transcription Time Marked (.rttm).

Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.

There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.

Parameters:

turn_label (str, default="turn"): Label of segments representing turns in the audio documents.
speaker_label (str, default="speaker"): Label of speaker attributes attached to each turn segment.

turn_label#

speaker_label#

save(docs: list[medkit.core.audio.AudioDocument], rttm_dir: str | pathlib.Path, doc_names: list[str] | None = None)#

Save a collection of audio documents to RTTM files in a directory.

Parameters:

docs (list of AudioDocument): List of audio documents to save.
rttm_dir (str or Path): Directory into which the generated .rttm files will be stored.
doc_names (list of str, optional): Names to use as basenames and file ids for the generated .rttm files (2nd column). If not provided, the document ids are used.

save_doc(doc: medkit.core.audio.AudioDocument, rttm_file: str | pathlib.Path, rttm_doc_id: str | None = None)#

Save a single audio document to a RTTM file.

Parameters:

doc (AudioDocument): Audio document to save.
rttm_file (str or Path): Path of the generated .rttm file.
rttm_doc_id (str, optional): File uid to use for the generated .rttm file (2nd column). If not provided, the document uid is used.

save_turn_segments(turn_segments: list[medkit.core.audio.Segment], rttm_file: str | pathlib.Path, rttm_doc_id: str | None)#

Save Segment objects into a .rttm file.

Parameters:

turn_segments (list of Segment): Turn segments to save.
rttm_file (str or Path): Path of the generated .rttm file.
rttm_doc_id (str, optional): File uid to use for the generated .rttm file (2nd column).
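
The role of the file uid as the second column can be seen in a sketch that formats a single RTTM row. It assumes channel 1, three decimals for times, and `<NA>` for the unused fields; `format_rttm_row` is a hypothetical helper, and the converter may format its rows differently.

```python
def format_rttm_row(file_id: str, start: float, end: float, speaker: str) -> str:
    """Format one SPEAKER row; file_id goes in the 2nd column."""
    duration = end - start
    return (
        f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


print(format_rttm_row("visit01", 12.5, 15.75, "doctor"))
# SPEAKER visit01 1 12.500 3.250 <NA> <NA> doctor <NA> <NA>
```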

_build_rttm_row(turn_segment: medkit.core.audio.Segment, rttm_doc_id: str | None) → dict[str, Any]#