Core Components

This page explains all core concepts defined in medkit.

For more details, please refer to medkit.core.

Data Structures

medkit document classes are used to access raw data and to store the annotations extracted from that data.

The Document and Annotation protocols are defined inside medkit.core. They define common properties and methods across all modalities. These protocols are then implemented for each modality (text, audio, image, etc.), with additional logic specific to the modality.

To facilitate the implementation of the Document protocol, the AnnotationContainer class is provided. It behaves like a list of annotations, with additional filtering methods and support for non-memory storage.

medkit.core also defines the Attribute class, which can be used to attach attributes to annotations of any modality. Attributes are held in an AttributeContainer; like AnnotationContainer, the role of this container is to provide additional methods that facilitate access to the list of attributes belonging to an annotation.

classDiagram
    direction LR
    class Document~Annotation~{
        <<protocol>>
        uid: str
        anns: AnnotationContainer~Annotation~
    }
    class Annotation{
        <<protocol>>
        uid: str
        label: str
        attrs: AttributeContainer
    }
    class Attribute{
        uid: str
        label: str
        value: Optional[Any]
    }
    Document *-- Annotation : contains\n(AnnotationContainer)
    Annotation *-- Attribute : contains\n(AttributeContainer)

Core protocols and classes
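The relationships in the diagram can be approximated with Python's typing.Protocol. This is only a minimal sketch, not medkit's implementation: plain lists stand in for AnnotationContainer and AttributeContainer, and ToyDocument and ToyAnnotation are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol, runtime_checkable


@dataclass
class Attribute:
    # Fields as listed in the diagram: uid, label, value
    uid: str
    label: str
    value: Optional[Any] = None


@runtime_checkable
class Annotation(Protocol):
    uid: str
    label: str
    attrs: list  # assumption: a plain list stands in for AttributeContainer


@runtime_checkable
class Document(Protocol):
    uid: str
    anns: list  # assumption: a plain list stands in for AnnotationContainer


# Hypothetical modality-specific classes; they satisfy the protocols
# structurally, without inheriting from them
@dataclass
class ToyAnnotation:
    uid: str
    label: str
    attrs: list = field(default_factory=list)


@dataclass
class ToyDocument:
    uid: str
    anns: list = field(default_factory=list)


ann = ToyAnnotation(uid="a1", label="disorder")
ann.attrs.append(Attribute(uid="t1", label="negated", value=False))
doc = ToyDocument(uid="d1", anns=[ann])

# Runtime protocol checks only verify that the expected members exist
print(isinstance(doc, Document), isinstance(ann, Annotation))
```

Because protocols are structural, each modality can add its own fields and logic while still exposing the common uid, anns, and attrs members.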

Currently, medkit.core.text implements a TextDocument class and a corresponding set of TextAnnotation subclasses. Similarly, medkit.core.audio provides an AudioDocument class and a corresponding Segment. Both modalities also subclass AnnotationContainer to provide modality-specific logic or filtering.

You may refer to the documentation specific to audio and text modalities.

Document

The Document protocol defines the minimal data structure for a medkit document. Regardless of the modality, each document is linked to a corresponding annotation container.

The AnnotationContainer class provides a set of methods to be implemented for each modality.

The goal is to give users a minimal set of common interfaces for accessing a document's annotations, whatever the modality.

Given a document named doc, one can:

browse its annotations:

for ann in doc.anns:
    ...

add a new annotation:

doc.anns.add(...)

get annotations filtered by label:

disorders = doc.anns.get(label="disorder")
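To make the three calls above concrete without depending on medkit itself, here is a small stand-in container. ToyAnnotation and ToyAnnotationContainer are hypothetical names, and the by-label index and duplicate check are assumptions about one reasonable design, not a description of medkit's AnnotationContainer.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class ToyAnnotation:
    uid: str
    label: str


class ToyAnnotationContainer:
    """Hypothetical stand-in: a list of annotations plus a by-label index."""

    def __init__(self) -> None:
        self._anns: List[ToyAnnotation] = []
        self._by_label: Dict[str, List[ToyAnnotation]] = {}

    def add(self, ann: ToyAnnotation) -> None:
        # Reject duplicates so that each uid identifies one annotation
        if any(existing.uid == ann.uid for existing in self._anns):
            raise ValueError(f"Annotation {ann.uid} already in container")
        self._anns.append(ann)
        self._by_label.setdefault(ann.label, []).append(ann)

    def get(self, *, label: Optional[str] = None) -> List[ToyAnnotation]:
        # Without a label, return every annotation in insertion order
        if label is None:
            return list(self._anns)
        return list(self._by_label.get(label, []))

    def __iter__(self) -> Iterator[ToyAnnotation]:
        return iter(self._anns)

    def __len__(self) -> int:
        return len(self._anns)


anns = ToyAnnotationContainer()
anns.add(ToyAnnotation(uid="a1", label="disorder"))
anns.add(ToyAnnotation(uid="a2", label="drug"))
anns.add(ToyAnnotation(uid="a3", label="disorder"))

disorders = anns.get(label="disorder")
print([ann.uid for ann in disorders])
```

Keeping a label index next to the list makes filtering by label a dictionary lookup rather than a scan over every annotation.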

For more details about the public API, please refer to medkit.core.document.Document and medkit.core.annotation_container.AnnotationContainer.

Annotations and Attributes

The Annotation protocol class provides the minimal data structure for a medkit annotation. Each annotation is linked to an attribute container.

The AttributeContainer class provides a set of common interfaces for accessing the attributes (Attribute) associated with an annotation, regardless of the underlying modality.

Given an annotation ann, one can:

browse the annotation attributes:

for attr in ann.attrs:
    ...

add a new attribute:

ann.attrs.add(...)

get attributes filtered by label:

normalized = ann.attrs.get(label="NORMALIZATION")

Operations

The Operation abstract class groups all the methods an operation needs in order to be compatible with medkit's processing pipeline and provenance tracking.

We have defined different subclasses depending on the nature of the operation, including text-specific and audio-specific operations in medkit.core.text and medkit.core.audio.

For more details about each modality, refer to the documentation of medkit.core.text and medkit.core.audio.

Every operation inheriting from the Operation abstract class should add the following lines to its __init__ method:

def __init__(self, ..., uid=None):
    ...
    # Pass all arguments to super (remove self)
    init_args = locals()
    init_args.pop("self")
    super().__init__(**init_args)
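The pattern above can be tried in isolation with a stand-in base class. ToyOperation and Lowercaser are hypothetical names, and the way the base class stores its arguments is an assumption made for this sketch, not medkit's Operation implementation.

```python
import uuid


class ToyOperation:
    """Hypothetical stand-in for an operation base class."""

    def __init__(self, uid=None, **kwargs):
        # locals() in a method that calls zero-argument super() may also
        # expose the implicit __class__ cell, so drop it if present
        kwargs.pop("__class__", None)
        self.uid = uid if uid is not None else str(uuid.uuid4())
        # Keep the remaining arguments so the operation can be described later
        self.init_args = kwargs


class Lowercaser(ToyOperation):
    def __init__(self, output_label, strip=True, uid=None):
        # Pass all arguments to super (remove self); locals() must be
        # called before any other local variable is defined
        init_args = locals()
        init_args.pop("self")
        super().__init__(**init_args)

        self.output_label = output_label
        self.strip = strip


op = Lowercaser("lower", uid="op-1")
print(op.uid, op.init_args)
```

Capturing locals() as the very first statement means newly added constructor parameters are forwarded automatically, which is presumably why the pattern is applied uniformly.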

Each operation is described by an OperationDescription.

Converters

Two abstract classes are defined for converting documents between the medkit format and other formats.
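As a rough illustration of the idea, the sketch below pairs one abstract class for reading with one for writing. The class names, the load and save method names, and the JSON Lines format are all assumptions chosen for this example; consult medkit.core.conversion for the actual API.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class ToyInputConverter(ABC):
    """Hypothetical base class: external format -> documents."""

    @abstractmethod
    def load(self, payload: str) -> List[Dict]:
        ...


class ToyOutputConverter(ABC):
    """Hypothetical base class: documents -> external format."""

    @abstractmethod
    def save(self, docs: List[Dict]) -> str:
        ...


class JsonLinesInput(ToyInputConverter):
    def load(self, payload: str) -> List[Dict]:
        # One JSON object per non-empty line
        return [json.loads(line) for line in payload.splitlines() if line.strip()]


class JsonLinesOutput(ToyOutputConverter):
    def save(self, docs: List[Dict]) -> str:
        return "\n".join(json.dumps(doc, sort_keys=True) for doc in docs)


docs = [{"uid": "d1", "text": "Fever."}, {"uid": "d2", "text": "No cough."}]
payload = JsonLinesOutput().save(docs)
restored = JsonLinesInput().load(payload)
print(restored == docs)
```

Separating the two directions lets a format support import only, export only, or both.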

For more details about the public APIs, refer to medkit.core.conversion.

Pipeline

The Pipeline class chains several operations together.

To better understand how to declare and use medkit pipelines, you may refer to the pipeline tutorial.

The DocPipeline class is a wrapper that runs an annotation pipeline on a list of documents and automatically attaches the output annotations to those documents.
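The division of labour between the two classes can be sketched with plain functions. This is a conceptual stand-in only: run_pipeline, run_on_docs, and ToyDoc are hypothetical names, strings stand in for annotations, and none of this reflects the actual Pipeline or DocPipeline signatures.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Step = Callable[[List[str]], List[str]]


@dataclass
class ToyDoc:
    text: str
    anns: List[str] = field(default_factory=list)


def split_sentences(texts: List[str]) -> List[str]:
    # Naive splitter: cut on full stops and drop empty pieces
    return [s.strip() for t in texts for s in t.split(".") if s.strip()]


def lowercase(texts: List[str]) -> List[str]:
    return [t.lower() for t in texts]


def run_pipeline(steps: List[Step], inputs: List[str]) -> List[str]:
    # Chain the steps: each step consumes the previous step's output
    data = inputs
    for step in steps:
        data = step(data)
    return data


def run_on_docs(steps: List[Step], docs: List[ToyDoc]) -> None:
    # Run the same chain on every document and attach the results to it
    for doc in docs:
        doc.anns.extend(run_pipeline(steps, [doc.text]))


docs = [ToyDoc("Fever. No cough."), ToyDoc("Mild Asthma.")]
run_on_docs([split_sentences, lowercase], docs)
print([doc.anns for doc in docs])
```

The chaining function knows nothing about documents, while the wrapper decides what is fed in and where the results are stored.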

For more details about the public APIs, refer to medkit.core.pipeline.

Provenance

Warning

This work is still under development and may change in the future.

Provenance is a medkit concept that makes it possible to track every operation and the role it played in extracting new knowledge.

With this mechanism, medkit will be able to report where a generated data item came from. This information is logged in a separate provenance store.

To better understand this concept, you may follow the provenance tutorial and/or refer to “how to make your own module” to learn what is required to enable provenance.
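The following sketch shows one way such a store could work conceptually: each generated item is recorded together with the operation that produced it and the items it was derived from. ToyProvStore, ProvRecord, and their methods are hypothetical, the operation names are illustrative strings, and this is not the medkit.core.prov_tracer API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ProvRecord:
    data_id: str
    op_name: str
    source_ids: Tuple[str, ...]


class ToyProvStore:
    """Hypothetical store kept separately from the data items themselves."""

    def __init__(self) -> None:
        self._records: Dict[str, ProvRecord] = {}

    def add(self, data_id: str, op_name: str, source_ids: Iterable[str]) -> None:
        self._records[data_id] = ProvRecord(data_id, op_name, tuple(source_ids))

    def lineage(self, data_id: str) -> List[ProvRecord]:
        # Walk back from an item to its sources, depth-first;
        # items with no record (such as raw inputs) end the walk
        record = self._records.get(data_id)
        if record is None:
            return []
        chain = [record]
        for source_id in record.source_ids:
            chain.extend(self.lineage(source_id))
        return chain


store = ToyProvStore()
store.add("sent-1", "SentenceTokenizer", ["raw-1"])
store.add("ent-1", "RegexpMatcher", ["sent-1"])

print([record.op_name for record in store.lineage("ent-1")])
```

Storing only identifiers keeps the provenance log independent of how the data items themselves are held in memory.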

For more details about the public APIs, refer to medkit.core.prov_tracer.
