I/O Components#
This page lists all components for converting, loading, and saving data.
For more details about the public APIs, please refer to medkit.io.
medkit-json#
medkit has utilities to export and import documents in JSON format.
You can use medkit.io.medkit_json.save_text_documents to save a list of documents,
and then medkit.io.medkit_json.load_text_documents to load them back into medkit.
Warning
load_text_documents is a generator function yielding a single document per iteration,
to prevent accidental memory spikes if the corpus is large.
To load the full corpus in memory, you may consume the generator into a list with:
from medkit.io.medkit_json import load_text_documents
docs = list(load_text_documents("/path/to/medkit/documents.jsonl"))
For more details, please refer to medkit.io.medkit_json.
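The difference between streaming and eager loading described in the warning above can be illustrated with a standard-library sketch. This is not medkit's implementation: `iter_jsonl` is a hypothetical helper, the record fields are made up, and the one-JSON-object-per-line layout is an assumption suggested by the `.jsonl` extension.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            if line.strip():
                yield json.loads(line)


# Write a tiny stand-in corpus; the field names are illustrative only.
records = [{"text": "First note."}, {"text": "Second note."}]
corpus_path = Path(tempfile.mkdtemp()) / "documents.jsonl"
with open(corpus_path, "w", encoding="utf-8") as fp:
    for record in records:
        fp.write(json.dumps(record) + "\n")

# Streaming: only one record is held in memory at a time.
n_chars = sum(len(record["text"]) for record in iter_jsonl(corpus_path))

# Eager: materialise every record at once, which is what wrapping
# a generator in list(...) does.
all_records = list(iter_jsonl(corpus_path))
```

Streaming is usually preferable when each document is processed once and then discarded; building a list only pays off when the corpus needs random access or several passes.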
Brat#
Brat is a web-based tool for text annotation.
medkit supports input and output conversions of Brat text documents.
For more details about the public API, please refer to medkit.io.brat.
See also
You may refer to this example for more information.
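For readers unfamiliar with the format, Brat stores annotations in a standoff `.ann` file next to the text file. The sketch below parses text-bound entities (`T` lines) and binary relations (`R` lines) with the standard library only. It is an illustration of the file format, not the medkit.io.brat API; `BratEntity`, `BratRelation`, and `parse_ann` are hypothetical names, and other annotation types (attributes, events, notes) are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BratEntity:
    uid: str
    label: str
    spans: List[Tuple[int, int]]  # several spans for discontinuous entities
    text: str


@dataclass
class BratRelation:
    uid: str
    label: str
    source: str
    target: str


def parse_ann(content: str) -> Tuple[List[BratEntity], List[BratRelation]]:
    """Parse entity and relation lines of a Brat .ann file (hypothetical helper)."""
    entities, relations = [], []
    for line in content.splitlines():
        if line.startswith("T"):
            uid, info, text = line.split("\t")
            label, offsets = info.split(" ", 1)
            # Discontinuous entities separate their fragments with ";".
            spans = [
                (int(frag.split()[0]), int(frag.split()[1]))
                for frag in offsets.split(";")
            ]
            entities.append(BratEntity(uid, label, spans, text))
        elif line.startswith("R"):
            uid, info = line.split("\t")[:2]
            label, arg1, arg2 = info.split()
            relations.append(
                BratRelation(uid, label, arg1.split(":")[1], arg2.split(":")[1])
            )
    return entities, relations


brat_text = "Patient has diabetes and is taking metformin."
brat_ann = (
    "T1\tDisease 12 20\tdiabetes\n"
    "T2\tDrug 35 44\tmetformin\n"
    "R1\ttreats Arg1:T2 Arg2:T1\n"
)
brat_entities, brat_relations = parse_ann(brat_ann)
```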
Doccano#
Doccano is a text annotation tool for multiple tasks.
medkit supports input and output conversions of doccano files (saved in JSONL format).
You can load annotations from a JSONL file or a ZIP directory.
Supported tasks#
Doccano Project | Task for converter | Example
Sequence labeling | |
Relation extraction | |
Client Configuration#
The doccano user interface allows custom configuration over certain annotation parameters.
The medkit.io.doccano.DoccanoClientConfig class contains the configuration to be used by the input converter.
You can modify the settings depending on the configuration of your project. If no custom configuration is provided, the converter will use the default doccano configuration.
Note
Metadata
Doccano to medkit: All the extra fields are imported as a dictionary in TextDocument.metadata.
medkit to Doccano: The TextDocument.metadata entries are exported as extra fields to the output data. Set include_metadata to False to exclude the extra fields.
For more details, please refer to medkit.io.doccano.
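The metadata round trip described in the note above can be sketched with plain dictionaries. This is a standard-library illustration, not the medkit.io.doccano converters: `split_record` and `merge_record` are hypothetical helpers, and the set of keys treated as annotation payload (`text`, `label`, `entities`, `relations`) is an assumption about the doccano export layout. Only the `include_metadata` flag name comes from the text above.

```python
import json
from typing import Tuple

# Keys assumed to carry the annotation payload; everything else is "extra".
ANNOTATION_KEYS = {"text", "label", "entities", "relations"}


def split_record(line: str) -> Tuple[dict, dict]:
    """Separate a doccano JSONL record into payload and extra fields."""
    record = json.loads(line)
    payload = {k: v for k, v in record.items() if k in ANNOTATION_KEYS}
    metadata = {k: v for k, v in record.items() if k not in ANNOTATION_KEYS}
    return payload, metadata


def merge_record(payload: dict, metadata: dict, include_metadata: bool = True) -> str:
    """Serialise a record, optionally adding the metadata as extra fields."""
    record = dict(payload)
    if include_metadata:
        record.update(metadata)
    return json.dumps(record)


doccano_line = json.dumps(
    {
        "text": "Dr. Martin saw the patient.",
        "label": [[4, 10, "PERSON"]],
        "source": "ward-3",  # extra field
        "year": 2021,        # extra field
    }
)
payload, metadata = split_record(doccano_line)
with_extras = merge_record(payload, metadata)
without_extras = merge_record(payload, metadata, include_metadata=False)
```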
spaCy#
medkit supports input and output conversions of spaCy documents.
Important
Using spaCy converters requires additional dependencies:
pip install 'medkit-lib[spacy]'
See also
You may refer to this example for more information.
For more details, please refer to medkit.io.spacy.
RTTM#
Rich Transcription Time Marked files (saved with the .rttm extension) contain diarization information.
medkit supports input and output conversions of audio documents in RTTM format.
For more details, refer to medkit.io.rttm.
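As background on the format, each `SPEAKER` line of an RTTM file has ten space-separated fields, of which the file identifier, onset, duration, and speaker name carry the diarization information. The sketch below reads those fields with the standard library. It is not the medkit.io.rttm API; `Turn` and `parse_rttm` are hypothetical names, and the sample content is made up.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float  # seconds
    end: float    # seconds
    speaker: str


def parse_rttm(content: str) -> List[Turn]:
    """Extract speaker turns from RTTM text (hypothetical helper)."""
    turns = []
    for line in content.splitlines():
        fields = line.split()
        # Fields: type, file, channel, onset, duration, ortho, stype, name, conf, slat
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(fields[1], onset, onset + duration, fields[7]))
    return turns


rttm_sample = (
    "SPEAKER rec1 1 0.00 2.50 <NA> <NA> spk_A <NA> <NA>\n"
    "SPEAKER rec1 1 2.50 1.25 <NA> <NA> spk_B <NA> <NA>\n"
)
turns = parse_rttm(rttm_sample)
```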
SRT#
SRT files (saved with the .srt extension) contain transcription information associated with an audio recording.
medkit supports input and output conversions of audio transcriptions in SRT format.
Important
Using SRT converters requires additional dependencies:
pip install 'medkit-lib[srt-io-converter]'
For more details, refer to medkit.io.srt.
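As background on the format, an SRT file is a sequence of blank-line-separated cues, each made of an index, a `start --> end` timing line, and one or more lines of text. The sketch below parses that layout with the standard library. It is not the medkit.io.srt API (which requires the extra dependency above); `Cue` and `parse_srt` are hypothetical names, and the sample transcript is made up.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(content: str) -> List[Cue]:
    """Parse SRT text into cues (hypothetical helper)."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        match = TIMING.match(lines[1])
        if match is None:
            continue  # skip malformed cues instead of failing
        groups = match.groups()
        cues.append(
            Cue(
                int(lines[0]),
                to_seconds(*groups[:4]),
                to_seconds(*groups[4:]),
                "\n".join(lines[2:]),
            )
        )
    return cues


srt_sample = (
    "1\n"
    "00:00:00,000 --> 00:00:02,500\n"
    "Hello, how are you feeling today?\n"
    "\n"
    "2\n"
    "00:00:02,500 --> 00:00:04,000\n"
    "Much better, thank you.\n"
)
cues = parse_srt(srt_sample)
```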