Text Components#
This page contains all core text concepts of medkit.
For more details about public APIs, please refer to medkit.core.text.
Data Structures#
The TextDocument class implements the Document protocol.
It stores instances of TextAnnotation subclasses; TextAnnotation itself implements the Annotation protocol.
Document#
TextDocument relies on TextAnnotationContainer to manage the annotations.
Given a text document named doc, one can:
browse segments, entities, and relations:
    for entity in doc.anns.entities:
        ...
    for segment in doc.anns.segments:
        ...
    for relation in doc.anns.relations:
        ...
get and filter segments, entities, and relations:
    sentences_segments = doc.anns.get_segments(label="sentences")
    disorder_entities = doc.anns.get_entities(label="disorder")
    entity = ...
    relations = doc.anns.get_relations(label="before", source_id=entity.uid)
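The label-based filtering shown above can be pictured with a minimal, self-contained sketch. It does not import medkit: `ToyAnnotation` and `ToyContainer` are hypothetical stand-ins that only mimic lookup by label, and they should not be read as the actual TextAnnotationContainer implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid


@dataclass
class ToyAnnotation:
    """Hypothetical stand-in for a text annotation (not medkit's class)."""
    label: str
    text: str
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)


class ToyContainer:
    """Hypothetical container that indexes annotations by label."""

    def __init__(self) -> None:
        self._anns: List[ToyAnnotation] = []
        self._by_label: Dict[str, List[ToyAnnotation]] = {}

    def add(self, ann: ToyAnnotation) -> None:
        # Keep insertion order and maintain a per-label index
        self._anns.append(ann)
        self._by_label.setdefault(ann.label, []).append(ann)

    def get(self, label: Optional[str] = None) -> List[ToyAnnotation]:
        # Without a label, return every annotation in insertion order
        if label is None:
            return list(self._anns)
        return list(self._by_label.get(label, []))


anns = ToyContainer()
anns.add(ToyAnnotation(label="disorder", text="asthma"))
anns.add(ToyAnnotation(label="drug", text="salbutamol"))
anns.add(ToyAnnotation(label="disorder", text="eczema"))

disorders = anns.get(label="disorder")
print([a.text for a in disorders])
```

Keeping a per-label index makes filtered lookups cheap, which is one plausible reason to delegate annotation storage to a dedicated container rather than a plain list.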
For more details on common interfaces provided by core components, please refer to Document.
Annotations#
For the text modality, a TextDocument can only contain TextAnnotation instances.
Three subclasses are defined: Segment, Entity, and Relation.
Note
Each text annotation class inherits from the common interfaces provided by the core component (cf. Annotation).
For more details about public APIs, please refer to medkit.core.text.annotation.
Attributes#
Text annotations can receive attributes, which are instances of the core Attribute class.
Among attributes, medkit.core.text provides EntityNormAttribute for normalization attributes, so that normalization information has a common structure regardless of the operation that created it.
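To illustrate why a common structure for normalization information is useful, here is a small sketch that does not use medkit. `ToyNormAttribute` and its fields (`kb_name`, `kb_id`, `term`) are assumptions made for this example, not the EntityNormAttribute API; the two normalizer functions are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyNormAttribute:
    """Hypothetical normalization record (field names are assumptions)."""
    kb_name: str                # knowledge base the identifier comes from
    kb_id: str                  # identifier of the concept in that knowledge base
    term: Optional[str] = None  # optional preferred term


def normalize_with_dictionary(text: str) -> Optional[ToyNormAttribute]:
    # Hypothetical exact-match lookup in a tiny in-memory lexicon
    lexicon = {"asthma": "J45"}
    kb_id = lexicon.get(text.lower())
    if kb_id is None:
        return None
    return ToyNormAttribute("ICD-10", kb_id, term=text.lower())


def normalize_with_prefix_rule(text: str) -> Optional[ToyNormAttribute]:
    # Hypothetical rule-based matcher: any word starting with "asthm"
    if text.lower().startswith("asthm"):
        return ToyNormAttribute("ICD-10", "J45")
    return None


a = normalize_with_dictionary("Asthma")
b = normalize_with_prefix_rule("asthmatic")

# Downstream code can compare results without knowing which operation made them
same_concept = (a.kb_name, a.kb_id) == (b.kb_name, b.kb_id)
print(same_concept)
```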
Spans#
medkit relies on the concept of spans to track all text modifications made by the different operations.
medkit also provides a set of utilities for manipulating these spans when implementing new operations.
For more details about public APIs, please refer to medkit.core.text.span and medkit.core.text.span_utils.
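The idea of tracking modifications through spans can be sketched without medkit. In the example below, `ToySpan`, `ToyModifiedSpan`, and `replace_once` are hypothetical names invented for illustration; they show one way a replacement can remember which characters of the original text it stands for, and they are not the functions of medkit.core.text.span_utils.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ToySpan:
    """Character range [start, end) in the original text."""
    start: int
    end: int


@dataclass(frozen=True)
class ToyModifiedSpan:
    """Run of new characters that replaces ranges of the original text."""
    length: int
    replaced: Tuple[ToySpan, ...]


AnyToySpan = Union[ToySpan, ToyModifiedSpan]


def replace_once(
    text: str, start: int, end: int, new_text: str
) -> Tuple[str, List[AnyToySpan]]:
    """Replace text[start:end] in an unmodified text and describe the result."""
    new = text[:start] + new_text + text[end:]
    spans: List[AnyToySpan] = []
    if start > 0:
        spans.append(ToySpan(0, start))  # untouched prefix
    spans.append(ToyModifiedSpan(len(new_text), (ToySpan(start, end),)))
    if end < len(text):
        spans.append(ToySpan(end, len(text)))  # untouched suffix
    return new, spans


raw = "pt. has asthma"
clean, spans = replace_once(raw, 0, 3, "patient")

# The modified span still points back at the characters it replaced
origin = spans[0].replaced[0]
print(clean, "|", raw[origin.start:origin.end])
```

Because each modified span keeps a reference to the ranges it replaced, an annotation found in the cleaned text can still be located in the raw text.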
See also
You may also take a look at the spans examples.
Text Utilities#
These utilities provide preconfigured patterns for preprocessing text documents non-destructively. They are not designed to be used directly, but rather inside a cleaning operation.
For more details about public APIs, please refer to medkit.core.text.utils.
See also
medkit provides an EDSCleaner class, which combines all these utilities to clean French documents (related to EDS documents coming from PDF).
Operations#
Abstract subclasses of Operation have been defined for text to ease the development of text operations, depending on the kind of run method they implement.
The internal class _CustomTextOperation has been implemented so that users can call create_text_operation() to instantiate custom text operations more easily.
For more details about public APIs, please refer to medkit.core.text.operation.
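The pattern described here, an internal class wrapped by a factory function, can be sketched in plain Python. Everything below is hypothetical (`ToyTextOperation`, `_ToyCustomOperation`, `make_text_operation`) and only illustrates the design; it is not the signature or behavior of create_text_operation().

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class ToyTextOperation(ABC):
    """Hypothetical abstract base: every operation exposes a run method."""

    @abstractmethod
    def run(self, texts: List[str]) -> List[str]:
        ...


class _ToyCustomOperation(ToyTextOperation):
    """Internal wrapper turning a plain function into an operation."""

    def __init__(self, function: Callable[[str], str], name: str) -> None:
        self._function = function
        self.name = name

    def run(self, texts: List[str]) -> List[str]:
        # Apply the user function to each input text
        return [self._function(t) for t in texts]


def make_text_operation(
    function: Callable[[str], str], name: Optional[str] = None
) -> ToyTextOperation:
    # Users call the factory and never instantiate the internal class directly
    return _ToyCustomOperation(function, name or function.__name__)


uppercase = make_text_operation(str.upper, name="uppercase")
result = uppercase.run(["fever", "cough"])
print(result)
```

Hiding the wrapper class behind a factory lets the library change the internal class later without breaking user code that only calls the factory.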
See also
Please refer to this example of a custom operation.