Core Components#
This page explains the core concepts defined in medkit.
For more details, please refer to medkit.core.
Data Structures#
medkit document classes are used to access raw data,
as well as to store the annotations extracted from this raw data.
The Document and Annotation protocols are defined inside medkit.core.
They define common properties and methods across all modalities.
These protocols are then implemented for each modality (text, audio, image, etc.),
with additional logic specific to the modality.
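The idea of one shared protocol implemented by several modalities can be sketched with Python's `typing.Protocol`. This is a minimal illustration, not medkit's code: `DocumentLike`, `ToyTextDocument` and `ToyAudioDocument` are hypothetical names, and the members chosen for the protocol (`uid`, `anns`) are an assumption made for the example.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class DocumentLike(Protocol):
    """Hypothetical protocol: what every modality's document is assumed to expose."""

    uid: str
    anns: List[object]


class ToyTextDocument:
    """Toy text modality: common members plus the raw text."""

    def __init__(self, uid: str, text: str) -> None:
        self.uid = uid
        self.anns: List[object] = []
        self.text = text


class ToyAudioDocument:
    """Toy audio modality: common members plus a sample rate."""

    def __init__(self, uid: str, sample_rate: int) -> None:
        self.uid = uid
        self.anns: List[object] = []
        self.sample_rate = sample_rate


docs = [ToyTextDocument("d1", "some text"), ToyAudioDocument("d2", 16000)]

# Neither class inherits from DocumentLike; having the expected members is enough
all_conform = all(isinstance(d, DocumentLike) for d in docs)
```

Neither toy class inherits from the protocol; code written against `DocumentLike` works with both because they expose the same common members.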
To facilitate the implementation of the Document protocol,
the AnnotationContainer class is provided.
It behaves like a list of annotations, with additional filtering methods
and support for non-memory storage.
medkit.core also defines the Attribute class,
which can be used to attach attributes to annotations for any modality.
Similarly to AnnotationContainer, the role of the AttributeContainer class is
to provide additional methods that facilitate access to the list of attributes
belonging to an annotation.
Currently, medkit.core.text implements a TextDocument class
and a corresponding set of TextAnnotation subclasses.
Similarly, medkit.core.audio provides an AudioDocument class
and a corresponding Segment annotation class.
Both modalities also provide subclasses of AnnotationContainer
that add modality-specific logic or filtering.
For more details, you may refer to the documentation specific to the audio and text modalities.
Document#
The Document protocol defines the minimal data structure for a medkit document.
Regardless of the modality, each document is linked to a corresponding annotation container.
The AnnotationContainer class provides a set of methods to be implemented for each modality.
The goal is to give users a minimal set of common interfaces for accessing the annotations of a document, whatever its modality.
Given a document named doc, one can:
browse its annotations:
for ann in doc.anns:
    ...
add a new annotation:
doc.anns.add(...)
get annotations filtered by label:
disorders = doc.anns.get(label="disorder")
For more details about the public API, please refer to medkit.core.document.Document
and medkit.core.annotation_container.AnnotationContainer.
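The access patterns listed above can be tried out with a small stand-alone sketch. `ToyAnnotation`, `ToyAnnotationContainer` and `ToyDocument` are hypothetical stand-ins written for this page, not medkit's implementation; they only mimic the iteration, `add()` and `get(label=...)` behaviour described above, assuming each annotation carries a `label`.

```python
import uuid
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ToyAnnotation:
    """Hypothetical annotation: a label and a generated unique id."""

    label: str
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)


class ToyAnnotationContainer:
    """Hypothetical list-like container with filtering by label."""

    def __init__(self) -> None:
        self._anns: List[ToyAnnotation] = []

    def add(self, ann: ToyAnnotation) -> None:
        self._anns.append(ann)

    def get(self, label: Optional[str] = None) -> List[ToyAnnotation]:
        # Without a label, return a copy of the full list
        if label is None:
            return list(self._anns)
        return [a for a in self._anns if a.label == label]

    def __iter__(self) -> Iterator[ToyAnnotation]:
        return iter(self._anns)

    def __len__(self) -> int:
        return len(self._anns)


@dataclass
class ToyDocument:
    """Hypothetical document exposing its container as `anns`."""

    anns: ToyAnnotationContainer = field(default_factory=ToyAnnotationContainer)


doc = ToyDocument()
doc.anns.add(ToyAnnotation(label="disorder"))
doc.anns.add(ToyAnnotation(label="drug"))

disorders = doc.anns.get(label="disorder")
labels = [ann.label for ann in doc.anns]
```

An in-memory list is used here for simplicity; the container described above additionally supports non-memory storage, which this sketch does not attempt to reproduce.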
Annotations and Attributes#
The Annotation protocol class provides the minimal data structure
for a medkit annotation. Each annotation is linked to an attribute container.
The AttributeContainer class provides a set of common interfaces
for accessing the attributes (Attribute) associated with an annotation,
regardless of the underlying modality.
Given an annotation ann, one can:
browse the annotation attributes:
for attr in ann.attrs:
    ...
add a new attribute:
ann.attrs.add(...)
get attributes filtered by label:
normalized = ann.attrs.get(label="NORMALIZATION")
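One possible way to support filtering attributes by label is to keep a small label index next to the attribute list. The sketch below is an assumption made for illustration, not medkit's implementation: `ToyAttribute` and `ToyAttributeContainer` are hypothetical names, and the attribute value is a placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List


@dataclass
class ToyAttribute:
    """Hypothetical attribute: a label and an arbitrary value."""

    label: str
    value: Any = None


class ToyAttributeContainer:
    """Hypothetical container keeping a label -> positions index."""

    def __init__(self) -> None:
        self._attrs: List[ToyAttribute] = []
        self._index: Dict[str, List[int]] = defaultdict(list)

    def add(self, attr: ToyAttribute) -> None:
        # Record the position of the new attribute under its label
        self._index[attr.label].append(len(self._attrs))
        self._attrs.append(attr)

    def get(self, label: str) -> List[ToyAttribute]:
        # .get() avoids creating an empty index entry for unknown labels
        return [self._attrs[i] for i in self._index.get(label, [])]

    def __iter__(self) -> Iterator[ToyAttribute]:
        return iter(self._attrs)


attrs = ToyAttributeContainer()
attrs.add(ToyAttribute(label="NORMALIZATION", value="placeholder-code"))
attrs.add(ToyAttribute(label="negated", value=False))

normalized = attrs.get(label="NORMALIZATION")
all_labels = [attr.label for attr in attrs]
```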
Operations#
The Operation abstract class groups all the methods needed for
an operation to be compatible with medkit processing pipelines and provenance.
Different subclasses are defined depending on the nature of the operation,
including text-specific and audio-specific operations in medkit.core.text
and medkit.core.audio.
For more details about each modality, you can refer to its documentation.
For all operations inheriting from the Operation abstract class,
these 4 lines shall be added in the __init__ method:
def __init__(self, ..., uid=None):
    ...
    # Pass all arguments to super (remove self)
    init_args = locals()
    init_args.pop("self")
    super().__init__(**init_args)
Each operation is described with an OperationDescription.
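To show what forwarding the init arguments can achieve, here is a self-contained sketch. It assumes that the base class uses the received arguments to build the description of the operation; `ToyOperation`, `ToyDescription` and `ToyTokenizer` are hypothetical names and do not reproduce medkit's actual classes.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyDescription:
    """Hypothetical description: id, class name and init parameters."""

    uid: str
    class_name: str
    config: Dict[str, Any] = field(default_factory=dict)


class ToyOperation:
    """Hypothetical base class recording the init arguments it receives."""

    def __init__(self, uid: Optional[str] = None, **kwargs: Any) -> None:
        self.uid = uid if uid is not None else uuid.uuid4().hex
        # locals() in a method calling zero-argument super() may also contain
        # the implicit "__class__" entry, so dunder keys are dropped here
        config = {k: v for k, v in kwargs.items() if not k.startswith("__")}
        self.description = ToyDescription(
            uid=self.uid, class_name=type(self).__name__, config=config
        )


class ToyTokenizer(ToyOperation):
    def __init__(
        self, separator: str = " ", lowercase: bool = False, uid: Optional[str] = None
    ) -> None:
        # Pass all arguments to super (remove self)
        init_args = locals()
        init_args.pop("self")
        super().__init__(**init_args)

        self.separator = separator
        self.lowercase = lowercase


op = ToyTokenizer(separator=",", uid="tok-1")
```

Note that `locals()` has to be called before any other local variable is defined in `__init__`, otherwise those variables would be forwarded as well.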
Converters#
Two abstract classes have been defined for managing document conversion
between the medkit format and another one.
For more details about the public APIs, refer to medkit.core.conversion.
Pipeline#
Pipeline allows chaining several operations.
To better understand how to declare and use medkit pipelines, you may refer
to the pipeline tutorial.
The DocPipeline class is a wrapper that makes it possible
to run an annotation pipeline on a list of documents, automatically attaching
the output annotations to these documents.
For more details about the public APIs, refer to medkit.core.pipeline.
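The two ideas above, chaining operations and attaching the results to documents, can be sketched conceptually as follows. This is a deliberately simplified illustration and does not use or reproduce the medkit.core.pipeline API: `run_pipeline`, `run_on_docs`, `ToyDoc` and the two toy steps are hypothetical, and annotations are reduced to plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A toy "operation": takes a list of items and returns a new list of items
Step = Callable[[List[str]], List[str]]


def run_pipeline(steps: List[Step], inputs: List[str]) -> List[str]:
    """Feed the output of each step into the next one."""
    data = inputs
    for step in steps:
        data = step(data)
    return data


@dataclass
class ToyDoc:
    """Hypothetical document: raw text plus a plain list of annotations."""

    text: str
    anns: List[str] = field(default_factory=list)


def run_on_docs(steps: List[Step], docs: List[ToyDoc]) -> None:
    """Run the chain on each document and attach the outputs to it."""
    for doc in docs:
        doc.anns.extend(run_pipeline(steps, [doc.text]))


def split_sentences(texts: List[str]) -> List[str]:
    return [s.strip() for t in texts for s in t.split(".") if s.strip()]


def lowercase(texts: List[str]) -> List[str]:
    return [t.lower() for t in texts]


docs = [ToyDoc("Fever. No cough."), ToyDoc("Headache.")]
run_on_docs([split_sentences, lowercase], docs)
```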
Provenance#
Warning
This work is still under development. It may be changed in the future.
Provenance is a medkit concept that makes it possible to track all operations and
their role in the extraction of new knowledge.
With this mechanism, medkit will be able to provide provenance information about any generated data item. To log this information, a separate provenance store is used.
To better understand this concept, you may follow the provenance tutorial and/or refer to "how to make your own module" to learn what you have to do to enable provenance.
For more details about the public APIs, refer to medkit.core.prov_tracer.
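As a rough illustration of what a separate provenance store could record, the sketch below keeps, for each generated item, the operation that produced it and the items it was derived from. It is a conceptual toy under these assumptions and not the medkit.core.prov_tracer API: `ToyProv`, `ToyProvStore` and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyProv:
    """Hypothetical record: which operation produced an item, and from what."""

    data_id: str
    op_name: str
    source_ids: List[str]


class ToyProvStore:
    """Hypothetical store kept separately from the data items themselves."""

    def __init__(self) -> None:
        self._records: Dict[str, ToyProv] = {}

    def add(self, data_id: str, op_name: str, source_ids: List[str]) -> None:
        self._records[data_id] = ToyProv(data_id, op_name, list(source_ids))

    def lineage(self, data_id: str) -> List[str]:
        """List the operations involved in producing an item, walking back through its sources."""
        ops: List[str] = []
        pending = [data_id]
        while pending:
            record = self._records.get(pending.pop())
            if record is None:
                continue  # raw data: nothing was recorded for it
            ops.append(record.op_name)
            pending.extend(record.source_ids)
        return ops


store = ToyProvStore()
store.add("sentence-1", "sentence_splitter", ["raw-text"])
store.add("entity-1", "entity_matcher", ["sentence-1"])

ops = store.lineage("entity-1")
```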