medkit.io.doccano#

Classes#

DoccanoTask

Supported doccano tasks.

DoccanoClientConfig

Doccano client configuration.

DoccanoInputConverter

Convert doccano files (.JSONL) containing annotations for a given task.

DoccanoOutputConverter

Convert medkit files to doccano files (.JSONL) for a given task.

Module Contents#

class medkit.io.doccano.DoccanoTask(*args, **kwds)#

Bases: enum.Enum

Supported doccano tasks.

Attributes:
TEXT_CLASSIFICATION

Documents with a category

RELATION_EXTRACTION

Documents with entities and relations (including IDs)

SEQUENCE_LABELING

Documents with entities in tuples

TEXT_CLASSIFICATION = 'text_classification'#
RELATION_EXTRACTION = 'relation_extraction'#
SEQUENCE_LABELING = 'sequence_labeling'#
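Because DoccanoTask is an enum.Enum, a member can be looked up from its string value, which is convenient when the task name comes from a configuration file. The sketch below uses a local stand-in enum that mirrors the values listed above so that it runs without medkit installed; with medkit available, the real class would be imported from medkit.io.doccano instead.

```python
from enum import Enum


class DoccanoTask(Enum):
    """Local stand-in mirroring the values documented for DoccanoTask."""

    TEXT_CLASSIFICATION = "text_classification"
    RELATION_EXTRACTION = "relation_extraction"
    SEQUENCE_LABELING = "sequence_labeling"


# Look up a member from its string value (e.g. read from a config file).
task = DoccanoTask("sequence_labeling")

# Enumerate every supported task value, in declaration order.
task_names = [member.value for member in DoccanoTask]
```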
class medkit.io.doccano.DoccanoClientConfig#

Doccano client configuration.

The defaults match the values used by doccano.

Attributes:
column_text : str, default="text"

Name or key representing the text

column_label : str, default="label"

Name or key representing the label

column_text: str = 'text'#
column_label: str = 'label'#
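To illustrate what the two fields control, here is a minimal sketch that reads one JSONL line using configurable key names. ClientConfigSketch and read_text_and_label are hypothetical stand-ins written for this example, not medkit code, and the column names "content" and "category" are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class ClientConfigSketch:
    """Hypothetical stand-in with the two documented fields and defaults."""

    column_text: str = "text"
    column_label: str = "label"


def read_text_and_label(line: str, config: ClientConfigSketch):
    """Extract the text and label of one JSONL line using the configured keys."""
    record = json.loads(line)
    return record[config.column_text], record[config.column_label]


# A project whose data was imported with custom column names (illustrative).
custom = ClientConfigSketch(column_text="content", column_label="category")
text, label = read_text_and_label(
    '{"content": "Patient stable.", "category": ["ok"]}', custom
)
```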
class medkit.io.doccano.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)#

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument is created. Doccano files can be loaded from a directory of zip files, from a single zip file, or from a JSONL file.

The converter accepts a custom configuration describing the parameters used by doccano when the data was imported (see DoccanoClientConfig).

Warning

If the option "Count grapheme clusters as one character" was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
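The following sketch shows why grapheme-based offsets can misalign. Python indexes strings by code point, so an offset computed by counting grapheme clusters points to a different position as soon as the text contains a multi-code-point cluster. Skipping combining marks is only a rough approximation of grapheme counting, used here to keep the example in the standard library.

```python
import unicodedata

# "é" written as "e" + combining acute accent: one grapheme cluster, two code points.
text = "cafe\u0301 noir"

# Python string indices count code points.
start_code_points = text.index("noir")

# Approximate the offset a grapheme-counting tool would report by
# ignoring combining marks that precede the target word.
start_graphemes = sum(
    1 for ch in text[:start_code_points] if not unicodedata.combining(ch)
)

# Slicing with the grapheme-based offset no longer yields the intended word.
misaligned = text[start_graphemes:start_graphemes + 4]
```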

Parameters:
task : DoccanoTask

The doccano task for the input converter

client_config : DoccanoClientConfig, optional

Client configuration describing the values used in the doccano interface. It can change, for example, the name of the text field or of the label field.

attr_label : str, default="doccano_category"

The label to use for the medkit attribute that represents the doccano category. Only relevant for TEXT_CLASSIFICATION projects.

uid : str, optional

Identifier of the converter.

Attributes:
description : str

Description for the operation.

uid#
client_config#
task#
attr_label#
_prov_tracer: medkit.core.ProvTracer | None = None#
set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:
prov_tracer : ProvTracer

The provenance tracer used to trace the provenance.

property description: medkit.core.OperationDescription#

Contains all the input converter init parameters.

load_from_directory_zip(dir_path: str | pathlib.Path) -> list[medkit.core.text.TextDocument]#

Load text documents from a directory of zip files.

The zip files should contain JSONL files coming from doccano.

Parameters:
dir_path : str or Path

The path to the directory containing zip files.

Returns:
list of TextDocument

A list of TextDocuments

load_from_zip(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]#

Load text documents from a zip file.

Parameters:
input_file : str or Path

The path to the zip file containing a doccano JSONL file

Returns:
list of TextDocument

A list of TextDocuments
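As a rough illustration of the kind of input this method expects, the sketch below reads every JSONL member of a zip archive with the standard library. read_jsonl_from_zip is a hypothetical helper, the member name all.jsonl is only an example, and the real method returns TextDocument objects rather than dictionaries.

```python
import io
import json
import zipfile


def read_jsonl_from_zip(zip_source):
    """Return the parsed lines of every .jsonl member of a zip archive."""
    records = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".jsonl"):
                continue
            with archive.open(name) as member:
                # Decode the binary member as UTF-8 text, line by line.
                for raw_line in io.TextIOWrapper(member, encoding="utf-8"):
                    if raw_line.strip():
                        records.append(json.loads(raw_line))
    return records


# Build a small archive in memory to keep the example self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("all.jsonl", '{"text": "No cough.", "label": ["negative"]}\n')
buffer.seek(0)
zip_records = read_jsonl_from_zip(buffer)
```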

load_from_file(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]#

Load text documents from a JSONL file.

Parameters:
input_file : str or Path

The path to the JSONL file containing doccano annotations

Returns:
list of TextDocument

A list of TextDocuments
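A JSONL file holds one JSON object per line. The sketch below shows the reading step in isolation with a hypothetical read_jsonl helper; the sample lines assume the sequence labeling layout with "text" and "label" keys, and the real method goes on to build one TextDocument per line.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "all.jsonl"
    jsonl_path.write_text(
        '{"text": "Fever since Monday.", "label": [[0, 5, "SYMPTOM"]]}\n'
        '{"text": "No cough.", "label": []}\n',
        encoding="utf-8",
    )
    records = read_jsonl(jsonl_path)
```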

_check_crlf_character(documents: list[medkit.core.text.TextDocument])#

Check whether any of the converted documents contains a CRLF sequence.

The presence of CRLF is the only available indicator that the documents may have alignment problems.
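A check of this kind can be sketched in a few lines. contains_crlf is a hypothetical helper operating on plain strings, not the library's implementation, which works on TextDocument objects.

```python
def contains_crlf(texts):
    """Return True if any text contains a carriage return + line feed pair."""
    return any("\r\n" in text for text in texts)


texts = ["First line\r\nSecond line", "No line break here"]
has_crlf = contains_crlf(texts)

# A lone line feed is not reported.
has_crlf_lf_only = contains_crlf(["First line\nSecond line"])
```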

_parse_doc_line(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument#

Parse a doc_line into a TextDocument depending on the task.

Parameters:
doc_line : dict of str to Any

A dictionary representing an annotation from doccano.

Returns:
TextDocument

A document with parsed annotations.

_parse_doc_line_relation_extraction(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities and relations.

Parameters:
doc_line : dict of str to Any

Dictionary with doccano annotation.

Returns:
TextDocument

The document with annotations
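For orientation, here is a sketch of a relation extraction line and of how relations can be resolved through entity IDs. The key names (entities, relations, start_offset, end_offset, from_id, to_id, type) are assumed from doccano's export format, and the result is a list of plain tuples, not medkit objects.

```python
doc_line = {
    "text": "Aspirin relieves headache.",
    "entities": [
        {"id": 0, "label": "DRUG", "start_offset": 0, "end_offset": 7},
        {"id": 1, "label": "SYMPTOM", "start_offset": 17, "end_offset": 25},
    ],
    "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
}

# Index entity texts by doccano ID so relations can be resolved.
spans_by_id = {
    ent["id"]: doc_line["text"][ent["start_offset"]:ent["end_offset"]]
    for ent in doc_line["entities"]
}

# Each relation becomes a (source text, relation type, target text) triple.
resolved = [
    (spans_by_id[rel["from_id"]], rel["type"], spans_by_id[rel["to_id"]])
    for rel in doc_line["relations"]
]
```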

_parse_doc_line_seq_labeling(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities.

Parameters:
doc_line : dict of str to Any

Dictionary with doccano annotation.

Returns:
TextDocument

The document with annotations
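In a sequence labeling line, each entity is a (start, end, label) tuple of character offsets into the text. The sketch below unpacks such tuples into plain dictionaries; the "label" key is the default column name, and the dictionaries stand in for the entities the converter would create.

```python
doc_line = {
    "text": "Fever since Monday.",
    "label": [[0, 5, "SYMPTOM"], [12, 18, "DATE"]],
}

# Unpack each (start, end, label) tuple and recover the covered text.
entities = [
    {"label": label, "span": (start, end), "text": doc_line["text"][start:end]}
    for start, end, label in doc_line["label"]
]
```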

_parse_doc_line_text_classification(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with an attribute.

Parameters:
doc_line : dict of str to Any

Dictionary with doccano annotation.

Returns:
TextDocument

The document with its category
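The sketch below shows the idea with plain dictionaries: the category found in the line is stored under the attribute label configured on the converter. It assumes the label is a one-element list, as in doccano's import format for text classification; the "attrs" dictionary is only a stand-in for a medkit attribute.

```python
doc_line = {"text": "Patient reports no pain.", "label": ["negative"]}
attr_label = "doccano_category"  # default documented for DoccanoInputConverter

# Assumption: a single category per document, so the first label is used.
document = {
    "text": doc_line["text"],
    "attrs": {attr_label: doc_line["label"][0]},
}
```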

class medkit.io.doccano.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)#

Convert medkit files to doccano files (.JSONL) for a given task.

One JSON line is created for each TextDocument.

Parameters:
task : DoccanoTask

The doccano task for the output converter

anns_labels : list of str, optional

Labels of medkit annotations to convert into doccano annotations. If None (default), all entities or relations are converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

attr_label : str, optional

The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.

ignore_segments : bool, default=True

If True, medkit segments are ignored and only entities are converted to doccano entities. If False, medkit segments are converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.

include_metadata : bool, default=True

Whether to include medkit metadata in the converted documents

uid : str, optional

Identifier of the converter.

Attributes:
description : str

Description for the operation.

uid#
task#
anns_labels#
attr_label#
ignore_segments#
include_metadata#
property description: medkit.core.OperationDescription#
save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)#

Convert and save a list of TextDocuments into a doccano file (.JSONL).

Parameters:
docs : list of TextDocument

List of medkit documents to convert

output_file : str or Path

Path of the JSONL file in which to save the converted documents
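Writing JSONL amounts to serializing one dictionary per line. The sketch below shows that step with the standard library on already converted dictionaries; the file name and the sample lines are illustrative, and the real method performs the conversion from TextDocument objects first.

```python
import json
import tempfile
from pathlib import Path

doc_lines = [
    {"text": "Fever since Monday.", "label": [[0, 5, "SYMPTOM"]]},
    {"text": "Pas de fièvre.", "label": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = Path(tmp_dir) / "output.jsonl"
    with open(output_file, "w", encoding="utf-8") as fp:
        for doc_line in doc_lines:
            # ensure_ascii=False keeps accented characters readable in the file.
            fp.write(json.dumps(doc_line, ensure_ascii=False) + "\n")
    written = output_file.read_text(encoding="utf-8").splitlines()
```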

_convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]#

Convert a TextDocument into a dictionary depending on the task.

Parameters:
medkit_doc : TextDocument

Document to convert

Returns:
dict of str to Any

Dictionary with doccano annotation

_convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.

Parameters:
medkit_doc : TextDocument

Document to convert; it may contain entities and relations.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text, entities and relations.

_convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.

Parameters:
medkit_doc : TextDocument

Document to convert; it may contain entities.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text and its label (a list of tuples representing entities).
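The shape of such a dictionary, together with the effect of the anns_labels filter, can be sketched as follows. The entity dictionaries are simplified stand-ins for medkit entities, and the "label" key assumes the default doccano column name.

```python
text = "Fever since Monday."
entities = [
    {"label": "SYMPTOM", "start": 0, "end": 5},
    {"label": "DATE", "start": 12, "end": 18},
]
anns_labels = ["SYMPTOM"]  # None would keep every entity

# Keep only the entities whose label was requested.
kept = [e for e in entities if anns_labels is None or e["label"] in anns_labels]

# Each kept entity becomes a (start, end, label) tuple.
doc_line = {
    "text": text,
    "label": [(e["start"], e["end"], e["label"]) for e in kept],
}
```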

_convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano text classification task.

Parameters:
medkit_doc : TextDocument

Document to convert; it should contain at least one attribute to convert.

Returns:
dict of str to Any

Dictionary with doccano annotation. It may contain text and its label (a category (str)).