medkit.io#
Subpackages#
Submodules#
Classes#
BratInputConverter | Class in charge of converting brat annotations.
BratOutputConverter | Class for converting text documents to a brat collection file.
DoccanoClientConfig | Doccano client configuration.
DoccanoInputConverter | Convert doccano files (.JSONL) containing annotations for a given task.
DoccanoOutputConverter | Convert medkit files to doccano files (.JSONL) for a given task.
DoccanoTask | Supported doccano tasks.
RTTMInputConverter | Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.
RTTMOutputConverter | Class for conversions to Rich Transcription Time Marked (.rttm).
Package Contents#
- class medkit.io.BratInputConverter(detect_cuis_in_notes: bool = True, notes_label: str = 'brat_note', uid: str | None = None)#
Bases:
medkit.core.InputConverter
Class in charge of converting brat annotations.
- Parameters:
- detect_cuis_in_notesbool, default=True
If True, strings looking like CUIs in annotator notes of entities will be converted to UMLS normalization attributes rather than creating an Attribute with the whole note text as value.
- notes_labelstr, default=”brat_note”
Label to use for attributes created from annotator notes.
- uidstr, optional
Identifier of the converter.
- Attributes:
- descriptionOperationDescription
Description of the operation
- notes_label#
- detect_cuis_in_notes#
- uid#
- _prov_tracer: medkit.core.ProvTracer | None = None#
- property description: medkit.core.OperationDescription#
- set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#
- load(dir_path: str | pathlib.Path, ann_ext: str = ANN_EXT, text_ext: str = TEXT_EXT) list[medkit.core.text.TextDocument] #
Load brat annotations as text documents.
Create a list of TextDocuments from a folder containing text files and associated brat annotations files.
- Parameters:
- dir_pathstr or Path
The path to the directory containing the text files and the annotation files (.ann)
- ann_extstr, optional
The extension of the brat annotation file (e.g. .ann)
- text_extstr, optional
The extension of the text file (e.g. .txt)
- Returns:
- list of TextDocument
The list of TextDocuments
- load_doc(ann_path: str | pathlib.Path, text_path: str | pathlib.Path) medkit.core.text.TextDocument #
Load a brat annotation and text file combo as a text document.
Create a TextDocument from a .ann file and its associated .txt file.
- Parameters:
- ann_pathstr or Path
The path to the brat annotation file.
- text_pathstr or Path
The path to the text document file.
- Returns:
- TextDocument
The document containing the text and the annotations
- load_annotations(ann_file: str | pathlib.Path) list[medkit.core.text.TextAnnotation] #
Load a brat annotation file as a list of annotations.
Load a .ann file and return a list of Annotation objects.
- Parameters:
- ann_filestr or Path
Path to the .ann file.
- Returns:
- list of TextAnnotation
The list of text annotations
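To make the detect_cuis_in_notes behaviour more concrete, here is a minimal sketch, independent of medkit, of how the text of a brat annotator note could be checked for CUI-like strings. The regular expression and the helper name find_cuis are assumptions made for illustration; the detection logic actually used by BratInputConverter may differ.

```python
import re

# Assumption: a CUI is the letter "C" followed by seven digits (e.g. C0011849).
CUI_PATTERN = re.compile(r"\bC\d{7}\b")


def find_cuis(note_text: str) -> list[str]:
    """Return the CUI-like strings found in a note (hypothetical helper)."""
    return CUI_PATTERN.findall(note_text)


# A brat annotator note line: note ID, "AnnotatorNotes <target ID>", note text,
# separated by tabs.
note_line = "#1\tAnnotatorNotes T1\tC0011849 C0004096"
note_id, note_header, note_text = note_line.split("\t")
target_id = note_header.split(" ")[1]

print(target_id, find_cuis(note_text))
print(find_cuis("seen in a previous visit"))
```

A note with no CUI-like string yields an empty list, which corresponds to the case where the whole note text would be kept as a regular attribute value.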
- class medkit.io.BratOutputConverter(anns_labels: list[str] | None = None, attrs: list[str] | None = None, notes_label: str = 'brat_note', ignore_segments: bool = True, convert_cuis_to_notes: bool = True, create_config: bool = True, top_values_by_attr: int = 50, uid: str | None = None)#
Bases:
medkit.core.OutputConverter
Class for converting text documents to a brat collection file.
Hint
BRAT checks for coherence between span and text for each annotation. This converter adjusts the text and spans to get the right visualization and ensure compatibility.
- Parameters:
- anns_labelslist of str, optional
Labels of medkit annotations to convert into Brat annotations. If None (default) all the annotations will be converted
- attrslist of str, optional
Labels of medkit attributes to add in the annotations that will be included. If None (default) all medkit attributes found in the segments or relations will be converted to Brat attributes
- notes_labelstr, default=”brat_note”
Label of attributes that will be converted to annotator notes.
- ignore_segmentsbool, default=True
If True medkit segments will be ignored. Only entities, attributes and relations will be converted to Brat annotations. If False the medkit segments will be converted to Brat annotations as well.
- convert_cuis_to_notesbool, default=True
If True, UMLS normalization attributes will be converted to annotator notes rather than attributes. For entities with multiple UMLS attributes, CUIs will be separated by spaces (ex: “C0011849 C0004096”).
- create_configbool, default=True
Whether to create a configuration file for the generated collection. This file defines the types of annotations generated, it is necessary for the correct visualization on Brat.
- top_values_by_attrint, default=50
Number of most common values per attribute to list in the configuration. This is useful when an attribute has a large number of values: only the most frequent ones will be in the config. By default, the 50 most common values of each attribute are included.
- uidstr, optional
Identifier of the converter
- Attributes:
- descriptionOperationDescription
Description for the operation
- uid#
- anns_labels#
- attrs#
- notes_label#
- ignore_segments#
- convert_cuis_to_notes#
- create_config#
- top_values_by_attr#
- property description: medkit.core.OperationDescription#
- save(docs: list[medkit.core.text.TextDocument], dir_path: str | pathlib.Path, doc_names: list[str] | None = None)#
Save text documents as brat files.
Convert and save a collection or list of TextDocuments into a Brat collection. For each collection or list of documents, a folder is created with ‘.txt’ and ‘.ann’ files. A file named ‘annotation.conf’ is also saved when create_config is True.
- Parameters:
- docslist of TextDocument
List of medkit doc objects to convert
- dir_pathstr or Path
String or path object to save the generated files
- doc_nameslist of str, optional
Optional list with the names for the generated files. If None, the uid of each document is used as the name: ‘uid.txt’ contains the raw text of the document and ‘uid.ann’ the Brat annotations.
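The effect of top_values_by_attr on the generated configuration can be pictured with a short standalone sketch. It only illustrates the idea of keeping the most frequent values of an attribute; the helper name top_values is hypothetical and this is not the code used by the converter.

```python
from collections import Counter


def top_values(values: list[str], top_n: int = 50) -> list[str]:
    """Keep the top_n most frequent values, most frequent first (hypothetical helper)."""
    counts = Counter(values)
    return [value for value, _ in counts.most_common(top_n)]


# Values observed for one attribute across a collection of documents.
observed = ["mild", "severe", "mild", "moderate", "mild", "severe"]
print(top_values(observed, top_n=2))
```

With the default of 50, an attribute with fewer than 50 distinct values would simply have all of them listed.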
- _convert_medkit_anns_to_brat(segments: list[medkit.core.text.Segment], relations: list[medkit.core.text.Relation], config: medkit.io._brat_utils.BratAnnConfiguration, raw_text: str) list[medkit.io._brat_utils.BratEntity | medkit.io._brat_utils.BratAttribute | medkit.io._brat_utils.BratRelation | medkit.io._brat_utils.BratNote] #
Convert Segments, Relations and Attributes into brat data structures.
- Parameters:
- segmentslist of Segment
Medkit segments to convert
- relationslist of Relation
Medkit relations to convert
- configBratAnnConfiguration
Optional BratAnnConfiguration structure. This object is updated with the types of the generated Brat annotations.
- raw_textstr
Text of reference to get the original text of the annotations
- Returns:
- list of BratEntity or BratAttribute or BratRelation or BratNote
A list of brat annotations
- static _ensure_text_and_spans(segment: medkit.core.text.Segment, raw_text: str) tuple[str, list[tuple[int, int]]] #
Ensure consistency between the segment and the raw text.
The text of a BRAT annotation can’t contain multiple white spaces (including a newline character). This method cleans the fragments’ text and adjusts its spans to point to the same location in the raw text.
- Parameters:
- segmentSegment
Segment to ensure
- raw_textstr
Text of reference
- Returns:
- textstr
The cleaned text
- spanslist of tuple
The adjusted spans
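As an illustration of the clean-up described above, the sketch below splits each span wherever the raw text contains a line break or a run of several whitespace characters, so that every resulting fragment matches the raw text exactly. It is a standalone approximation: the function name clean_text_and_spans and the regular expression are assumptions, not the library's implementation.

```python
import re

# A fragment is a run of words separated by single spaces only.
FRAGMENT = re.compile(r"\S+(?: \S+)*")


def clean_text_and_spans(raw_text: str, spans: list[tuple[int, int]]):
    """Return (cleaned text, adjusted spans) for the given spans of raw_text."""
    texts = []
    new_spans = []
    for start, end in spans:
        # Search only inside the span; match offsets stay relative to raw_text.
        for match in FRAGMENT.finditer(raw_text, start, end):
            texts.append(match.group())
            new_spans.append((match.start(), match.end()))
    return " ".join(texts), new_spans


raw_text = "severe  chest\npain today"
text, spans = clean_text_and_spans(raw_text, [(0, 18)])
print(text)
print(spans)
```

Each adjusted span still points into the unmodified raw text, which is what lets BRAT verify the annotation text against the document.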
- _convert_segment_to_brat(segment: medkit.core.text.Segment, nb_segment: int, raw_text: str) medkit.io._brat_utils.BratEntity #
Get a brat entity from a medkit segment.
- Parameters:
- segmentSegment
A medkit segment to convert into brat format
- nb_segmentint
The current counter of brat segments
- raw_textstr
Text of reference to get the original text of the segment
- Returns:
- BratEntity
The equivalent brat entity of the medkit segment
- static _convert_relation_to_brat(relation: medkit.core.text.Relation, nb_relation: int, brat_entities_by_segment_id: dict[str, medkit.io._brat_utils.BratEntity]) tuple[medkit.io._brat_utils.BratRelation, medkit.io._brat_utils.RelationConf] #
Get a brat relation from a medkit relation.
- Parameters:
- relationRelation
A medkit relation to convert into brat format
- nb_relationint
The current counter of brat relations
- brat_entities_by_segment_iddict of str to BratEntity
A dict to map medkit ID to brat annotation
- Returns:
- relationBratRelation
The equivalent brat relation of the medkit relation
- configRelationConf
Configuration of the brat relation
- Raises:
- ValueError
When the source or target was not found in the mapping object
- static _convert_attribute_to_brat(label: str, value: str | None, nb_attribute: int, target_brat_id: str, is_from_entity: bool) tuple[medkit.io._brat_utils.BratAttribute, medkit.io._brat_utils.AttributeConf] #
Get a brat attribute from a medkit attribute.
- Parameters:
- labelstr
Attribute label to convert into brat format
- valuestr, optional
Attribute value
- nb_attributeint
The current counter of brat attributes
- target_brat_idstr
Corresponding target brat ID
- Returns:
- attributeBratAttribute
The equivalent brat attribute of the medkit attribute
- configAttributeConf
Configuration of the brat attribute
- static _convert_umls_attributes_to_brat_note(cuis: list[str], nb_note: int, target_brat_id: str) medkit.io._brat_utils.BratNote #
Get a brat note from a medkit umls norm attribute.
- Parameters:
- cuislist of str
CUIs to convert to a brat note
- nb_noteint
The current counter of brat notes
- target_brat_idstr
Corresponding target brat ID
- Returns:
- BratNote
The equivalent brat note of the medkit umls attribute
- static _convert_attributes_to_brat_note(values: list[Any], nb_note: int, target_brat_id: str) medkit.io._brat_utils.BratNote #
Get a brat note from medkit attribute values.
- Parameters:
- valueslist of Any
Attribute values
- nb_noteint
The current counter of brat notes
- target_brat_idstr
Corresponding target brat ID
- Returns:
- BratNote
The equivalent brat note of the medkit attribute values
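For reference, a brat annotator note is a single tab-separated line in the .ann file. The sketch below builds such a line from a list of values joined by spaces, mirroring the convert_cuis_to_notes behaviour described for this class. The function build_note_line is a hypothetical stand-in and does not reproduce the BratNote structure used internally.

```python
def build_note_line(nb_note: int, target_brat_id: str, values: list) -> str:
    """Format a brat annotator note line (hypothetical helper)."""
    note_text = " ".join(str(value) for value in values)
    return f"#{nb_note}\tAnnotatorNotes {target_brat_id}\t{note_text}"


line = build_note_line(1, "T1", ["C0011849", "C0004096"])
print(line)
```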
- class medkit.io.DoccanoClientConfig#
Doccano client configuration.
The default values are the default values used by doccano.
- Attributes:
- column_textstr, default=”text”
Name or key representing the text
- column_labelstr, default=”label”
Name or key representing the label
- column_text: str = 'text'#
- column_label: str = 'label'#
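The role of the two fields can be sketched without medkit: they name the keys under which the text and the label are stored in each JSONL record. The class ClientConfigSketch and the helper read_text_and_label below are illustrative stand-ins, and the renamed keys ("content", "tags") are invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class ClientConfigSketch:
    """Stand-in for DoccanoClientConfig with the same two fields and defaults."""

    column_text: str = "text"
    column_label: str = "label"


def read_text_and_label(json_line: str, config: ClientConfigSketch):
    """Pick the text and label out of one JSONL record using the configured keys."""
    record = json.loads(json_line)
    return record[config.column_text], record[config.column_label]


# Record using the default keys.
default_result = read_text_and_label(
    '{"text": "Patient has asthma.", "label": ["respiratory"]}',
    ClientConfigSketch(),
)

# Record from a hypothetical project that renamed both columns.
custom_result = read_text_and_label(
    '{"content": "Patient has asthma.", "tags": ["respiratory"]}',
    ClientConfigSketch(column_text="content", column_label="tags"),
)
print(default_result == custom_result)
```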
- class medkit.io.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)#
Convert doccano files (.JSONL) containing annotations for a given task.
For each line, a TextDocument will be created. The doccano files can be loaded from a directory with zip files or from a jsonl file.
The converter supports custom configuration to define the parameters used by doccano when importing the data (cf. DoccanoClientConfig).
Warning
If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
- Parameters:
- taskDoccanoTask
The doccano task for the input converter
- client_configDoccanoClientConfig, optional
Optional client configuration to define default values in doccano interface. This config can change, for example, the name of the text field or labels.
- attr_labelstr, default=”doccano_category”
The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.
- uidstr, optional
Identifier of the converter.
- Attributes:
- descriptionOperationDescription
Description for the operation.
- uid#
- client_config#
- task#
- attr_label#
- _prov_tracer: medkit.core.ProvTracer | None = None#
- set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#
Enable provenance tracing.
- Parameters:
- prov_tracerProvTracer
The provenance tracer used to trace the provenance.
- property description: medkit.core.OperationDescription#
Contains all the input converter init parameters.
- load_from_directory_zip(dir_path: str | pathlib.Path) list[medkit.core.text.TextDocument] #
Load text documents from a directory of zip files.
The zip files should contain JSONL files coming from doccano.
- Parameters:
- dir_pathstr or Path
The path to the directory containing zip files.
- Returns:
- list of TextDocument
A list of TextDocuments
- load_from_zip(input_file: str | pathlib.Path) list[medkit.core.text.TextDocument] #
Load text documents from a zip file.
- Parameters:
- input_filestr or Path
The path to the zip file containing a doccano JSONL file
- Returns:
- list of TextDocument
A list of TextDocuments
- load_from_file(input_file: str | pathlib.Path) list[medkit.core.text.TextDocument] #
Load text documents from a JSONL file.
- Parameters:
- input_filestr or Path
The path to the JSONL file containing doccano annotations
- Returns:
- list of TextDocument
A list of TextDocuments
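To give an idea of what one line of such a file carries, the following sketch reads sequence labeling records with the standard library only. It assumes each record stores its entities under "label" as [start, end, label] triples; the function parse_seq_labeling and the plain dictionaries it returns are illustrative and stand in for the TextDocument objects built by the converter.

```python
import json

# Two records, one per line, as they might appear in a JSONL file.
jsonl_content = (
    '{"text": "Aspirin relieves headache.", '
    '"label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]]}\n'
    '{"text": "No known allergy.", "label": []}\n'
)


def parse_seq_labeling(jsonl_text: str) -> list[dict]:
    """Turn each non-empty JSONL line into a dict with text and entities."""
    docs = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        entities = [
            {"label": label, "span": (start, end), "text": record["text"][start:end]}
            for start, end, label in record["label"]
        ]
        docs.append({"text": record["text"], "entities": entities})
    return docs


docs = parse_seq_labeling(jsonl_content)
for doc in docs:
    print(doc["text"], [entity["text"] for entity in doc["entities"]])
```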
- _check_crlf_character(documents: list[medkit.core.text.TextDocument])#
Check if the list of converted documents contains the CRLF character.
This character is the only indicator available to warn if there are alignment problems in the documents.
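The kind of check described here can be sketched in a few lines. The helper contains_crlf is hypothetical and works on plain strings rather than TextDocument objects. The reason CRLF matters is presumably that Python counts it as two characters, whereas a tool counting grapheme clusters would count it as one, shifting every offset that follows.

```python
def contains_crlf(texts: list[str]) -> bool:
    """Return True if any text contains a Windows-style line ending."""
    return any("\r\n" in text for text in texts)


texts = ["first line\r\nsecond line", "single line"]
if contains_crlf(texts):
    print("Warning: CRLF found, offsets may be misaligned.")

# Python counts CR and LF as two separate characters.
crlf_length = len("\r\n")
```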
- _parse_doc_line(doc_line: dict[str, Any]) medkit.core.text.TextDocument #
Parse a doc_line into a TextDocument depending on the task.
- Parameters:
- doc_linedict of str to Any
A dictionary representing an annotation from doccano
- Returns:
- TextDocument
A document with parsed annotations.
- _parse_doc_line_relation_extraction(doc_line: dict[str, Any]) medkit.core.text.TextDocument #
Parse a dictionary and return a TextDocument with entities and relations.
- Parameters:
- doc_linedict of str to Any
Dictionary with doccano annotation
- Returns:
- TextDocument
The document with annotations
- _parse_doc_line_seq_labeling(doc_line: dict[str, Any]) medkit.core.text.TextDocument #
Parse a dictionary and return a TextDocument with entities.
- Parameters:
- doc_linedict of str to Any
Dictionary with doccano annotation.
- Returns:
- TextDocument
The document with annotations
- _parse_doc_line_text_classification(doc_line: dict[str, Any]) medkit.core.text.TextDocument #
Parse a dictionary and return a TextDocument with an attribute.
- Parameters:
- doc_linedict of str to Any
Dictionary with doccano annotation.
- Returns:
- TextDocument
The document with its category
- class medkit.io.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)#
Convert medkit files to doccano files (.JSONL) for a given task.
For each TextDocument a jsonline will be created.
- Parameters:
- taskDoccanoTask
The doccano task for the output converter
- anns_labelslist of str, optional
Labels of medkit annotations to convert into doccano annotations. If None (default) all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
- attr_labelstr, optional
The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
- ignore_segmentsbool, default=True
If True medkit segments will be ignored. Only entities will be converted to Doccano entities. If False the medkit segments will be converted to Doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
- include_metadatabool, default=True
Whether to include medkit metadata in the converted documents
- uidstr, optional
Identifier of the converter.
- Attributes:
- descriptionOperationDescription
Description for the operation.
- uid#
- task#
- anns_labels#
- attr_label#
- ignore_segments#
- include_metadata#
- property description: medkit.core.OperationDescription#
- save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)#
Convert and save a list of TextDocuments into a doccano file (.JSONL).
- Parameters:
- docslist of TextDocument
List of medkit doc objects to convert
- output_filestr or Path
Path of the JSONL file in which to save the converted documents
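The output of save is a JSONL file, that is, one JSON object per line. The sketch below shows that file layout with the standard library only. The records are hand-written text classification examples using a "text" key and a "label" key, which is an assumption about the layout; save_jsonl is a hypothetical helper, not the converter's method.

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(records: list[dict], output_file: Path) -> None:
    """Write one JSON object per line."""
    with open(output_file, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"text": "Patient has asthma.", "label": ["respiratory"]},
    {"text": "Fracture of the left wrist.", "label": ["trauma"]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = Path(tmp_dir) / "docs.jsonl"
    save_jsonl(records, output_file)
    written_lines = output_file.read_text(encoding="utf-8").splitlines()

print(len(written_lines))
```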
- _convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) dict[str, Any] #
Convert a TextDocument into a dictionary depending on the task.
- Parameters:
- medkit_docTextDocument
Document to convert
- Returns:
- dict of str to Any
Dictionary with doccano annotation
- _convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) dict[str, Any] #
Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.
- Parameters:
- medkit_docTextDocument
Document to convert, it may contain entities and relations.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text, entities and relations.
- _convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) dict[str, Any] #
Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.
- Parameters:
- medkit_docTextDocument
Document to convert, it may contain entities.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text and its label (a list of tuples representing entities).
- _convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) dict[str, Any] #
Convert a TextDocument to a doc_line compatible with the doccano text classification task.
- Parameters:
- medkit_docTextDocument
Document to convert. It should contain at least one attribute to convert.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text and its label (a category, as a str).
- class medkit.io.DoccanoTask(*args, **kwds)#
Bases:
enum.Enum
Supported doccano tasks.
- Attributes:
- TEXT_CLASSIFICATION
Documents with a category
- RELATION_EXTRACTION
Documents with entities and relations (including IDs)
- SEQUENCE_LABELING
Documents with entities in tuples
- TEXT_CLASSIFICATION = 'text_classification'#
- RELATION_EXTRACTION = 'relation_extraction'#
- SEQUENCE_LABELING = 'sequence_labeling'#
- class medkit.io.RTTMInputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker', converter_id: str | None = None)#
Bases:
medkit.core.InputConverter
Class for conversions from Rich Transcription Time Marked (.rttm) into turn segments.
Convert Rich Transcription Time Marked (.rttm) files containing diarization information into turn segments.
For each turn in a .rttm file containing diarization information, a Segment will be created, with an associated Attribute holding the name of the turn speaker as value. The segments can be retrieved directly or as part of an AudioDocument instance.
If a ProvTracer is set, provenance information will be added for each segment and each attribute (referencing the input converter as the operation).
- Parameters:
- turn_labelstr, default=”turn”
Label of segments representing turns in the .rttm file.
- speaker_labelstr, default=”speaker”
Label of speaker attributes to add to each segment.
- converter_idstr, optional
Identifier of the converter.
- Attributes:
- descriptionOperationDescription
Description for the operation.
- uid#
- turn_label#
- speaker_label#
- _prov_tracer: medkit.core.ProvTracer | None = None#
- property description: medkit.core.OperationDescription#
Contains all the input converter init parameters.
- set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#
Enable provenance tracing.
- Parameters:
- prov_tracerProvTracer
The provenance tracer used to trace the provenance.
- load(rttm_dir: str | pathlib.Path, audio_dir: str | pathlib.Path | None = None, audio_ext: str = '.wav') list[medkit.core.audio.AudioDocument] #
Load all .rttm files in a directory into a list of audio documents.
For each .rttm file, there must be a corresponding audio file with the same basename, either in the same directory or in a separate audio directory.
- Parameters:
- rttm_dirstr or Path
Directory containing the .rttm files.
- audio_dirstr or Path, optional
Directory containing the audio files corresponding to the .rttm files, if they are not in rttm_dir.
- audio_extstr, default=”.wav”
File extension to use for audio files.
- Returns:
- list of AudioDocument
List of generated documents.
- load_doc(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) medkit.core.audio.AudioDocument #
Load a single .rttm file into an audio document.
- Parameters:
- rttm_filestr or Path
Path to the .rttm file.
- audio_filestr or Path
Path to the corresponding audio file.
- Returns:
- AudioDocument
Generated document.
- load_turns(rttm_file: str | pathlib.Path, audio_file: str | pathlib.Path) list[medkit.core.audio.Segment] #
Load a .rttm file as a list of segments.
- Parameters:
- rttm_filestr or Path
Path to the .rttm file.
- audio_filestr or Path
Path to the corresponding audio file.
- Returns:
- list of Segment
Turn segments as found in the .rttm file.
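For readers unfamiliar with the format, each line of a .rttm file describes one turn with ten space-separated fields, including the file id, the turn onset and duration in seconds, and the speaker name. The sketch below parses such a line into a plain dictionary. The field names and the function parse_rttm_line are chosen for illustration and do not reflect how the converter builds its Segment objects.

```python
# The ten fields of an RTTM line, in order (names chosen for this sketch).
RTTM_FIELDS = [
    "type", "file_id", "channel", "onset", "duration",
    "orthography", "speaker_type", "speaker_name", "confidence", "lookahead",
]


def parse_rttm_line(line: str) -> dict:
    """Extract file id, speaker and start/end times (in seconds) from one line."""
    values = line.split()
    if len(values) != len(RTTM_FIELDS):
        raise ValueError(f"Expected {len(RTTM_FIELDS)} fields, got {len(values)}")
    row = dict(zip(RTTM_FIELDS, values))
    onset = float(row["onset"])
    duration = float(row["duration"])
    return {
        "file_id": row["file_id"],
        "speaker": row["speaker_name"],
        "start": onset,
        "end": onset + duration,
    }


turn = parse_rttm_line("SPEAKER rec1 1 0.50 2.25 <NA> <NA> alice <NA> <NA>")
print(turn)
```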
- static _load_rows(rttm_file: pathlib.Path)#
- _build_turn_segment(row: dict[str, Any], full_audio: medkit.core.audio.FileAudioBuffer) medkit.core.audio.Segment #
- class medkit.io.RTTMOutputConverter(turn_label: str = 'turn', speaker_label: str = 'speaker')#
Bases:
medkit.core.OutputConverter
Class for conversions to Rich Transcription Time Marked (.rttm).
Build Rich Transcription Time Marked (.rttm) files containing diarization information from Segment objects.
There must be a segment for each turn, with an associated Attribute holding the name of the turn speaker as value. The segments can be passed directly or as part of AudioDocument instances.
- Parameters:
- turn_labelstr, default=”turn”
Label of segments representing turns in the audio documents.
- speaker_labelstr, default=”speaker”
Label of speaker attributes attached to each turn segment.
- turn_label#
- speaker_label#
- save(docs: list[medkit.core.audio.AudioDocument], rttm_dir: str | pathlib.Path, doc_names: list[str] | None = None)#
Save a collection of audio documents to RTTM files in a directory.
- Parameters:
- docslist of AudioDocument
List of audio documents to save.
- rttm_dirstr or Path
Directory into which the generated .rttm files will be stored.
- doc_nameslist of str, optional
Optional list of names to use as basenames and file ids for the generated .rttm files (2nd column). If none are provided, the document ids will be used.
- save_doc(doc: medkit.core.audio.AudioDocument, rttm_file: str | pathlib.Path, rttm_doc_id: str | None = None)#
Save a single audio document to a RTTM file.
- Parameters:
- docAudioDocument
Audio document to save.
- rttm_filestr or Path
Path of the generated .rttm file.
- rttm_doc_idstr, optional
File uid to use for the generated .rttm file (2nd column). If not provided, the document uid will be used.
- save_turn_segments(turn_segments: list[medkit.core.audio.Segment], rttm_file: str | pathlib.Path, rttm_doc_id: str | None)#
Save Segment objects into a .rttm file.
- Parameters:
- turn_segmentslist of Segment
Turn segments to save.
- rttm_filestr or Path
Path of the generated .rttm file.
- rttm_doc_idstr, optional
File uid to use for the generated .rttm file (2nd column).
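The shape of a generated line can be sketched as follows, with the file id in the second column as mentioned above. The function build_rttm_line, the fixed channel value of 1 and the three-decimal formatting are assumptions made for this example; the rows actually written by the converter may be formatted differently.

```python
def build_rttm_line(rttm_doc_id: str, start: float, end: float, speaker: str) -> str:
    """Format one turn as an RTTM line; unused fields are set to <NA>."""
    duration = end - start
    return (
        f"SPEAKER {rttm_doc_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


rttm_line = build_rttm_line("rec1", 0.5, 2.75, "alice")
print(rttm_line)
```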
- _build_rttm_row(turn_segment: medkit.core.audio.Segment, rttm_doc_id: str | None) dict[str, Any] #