medkit.io.doccano

Classes#

`DoccanoTask`	Supported doccano tasks.
`DoccanoClientConfig`	Doccano client configuration.
`DoccanoInputConverter`	Convert doccano files (.JSONL) containing annotations for a given task.
`DoccanoOutputConverter`	Convert medkit files to doccano files (.JSONL) for a given task.

Module Contents#

class medkit.io.doccano.DoccanoTask(*args, **kwds)#

Bases: enum.Enum

Supported doccano tasks.

Attributes:

TEXT_CLASSIFICATION: Documents with a category
RELATION_EXTRACTION: Documents with entities and relations (including IDs)
SEQUENCE_LABELING: Documents with entities in tuples

TEXT_CLASSIFICATION = 'text_classification'#

RELATION_EXTRACTION = 'relation_extraction'#

SEQUENCE_LABELING = 'sequence_labeling'#
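Because each member carries a string value, a task can be looked up from a configuration string. The sketch below uses a local stand-in enum (`TaskSketch`, a hypothetical name) that mirrors the three values listed above so that it runs without medkit installed; it is not the library class itself.

```python
from enum import Enum


class TaskSketch(Enum):
    """Stand-in mirroring the DoccanoTask values documented above."""

    TEXT_CLASSIFICATION = "text_classification"
    RELATION_EXTRACTION = "relation_extraction"
    SEQUENCE_LABELING = "sequence_labeling"


# Lookup by value, e.g. from a config file or a command-line argument.
task = TaskSketch("sequence_labeling")
print(task.name)

# An unknown value raises ValueError, which doubles as input validation.
try:
    TaskSketch("named_entity_recognition")
except ValueError as err:
    print(f"unsupported task: {err}")
```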

class medkit.io.doccano.DoccanoClientConfig#

Doccano client configuration.

The default values are the default values used by doccano.

Attributes:

column_text (str, default="text"): Name or key representing the text
column_label (str, default="label"): Name or key representing the label

column_text: str = 'text'#

column_label: str = 'label'#
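To illustrate what the two column names control, here is a minimal sketch with a hypothetical stand-in dataclass (`ClientConfigSketch`) and a hypothetical helper (`read_text_and_label`). It only assumes that the configured names are the keys under which the text and label appear in each JSONL line; the renamed columns are made-up values.

```python
import json
from dataclasses import dataclass


@dataclass
class ClientConfigSketch:
    """Hypothetical stand-in with the two documented fields and defaults."""

    column_text: str = "text"
    column_label: str = "label"


def read_text_and_label(line: str, config: ClientConfigSketch):
    """Extract the text and label from one JSONL line using the configured keys."""
    record = json.loads(line)
    return record[config.column_text], record[config.column_label]


# A project whose import settings renamed both columns (illustrative values).
custom = ClientConfigSketch(column_text="content", column_label="tags")
line = '{"content": "Patient reports mild fever.", "tags": ["symptom"]}'
text, label = read_text_and_label(line, custom)
```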

class medkit.io.doccano.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)#

Convert doccano files (.JSONL) containing annotations for a given task.

For each line, a TextDocument is created. The doccano files can be loaded from a directory of zip files, from a single zip file, or from a JSONL file.

The converter accepts a custom configuration describing the parameters used by doccano when importing the data (see DoccanoClientConfig).

Warning

If the option "Count grapheme clusters as one character" was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
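The alignment problem can be reproduced with plain Python strings, which are indexed by code point. The following sketch uses a made-up text and made-up spans; it shows how an offset computed by counting grapheme clusters selects the wrong characters when applied to a Python string.

```python
# "café" written with a combining accent: "e" followed by U+0301.
text = "cafe\u0301 noir"

# Python indexes by code point, so the accent occupies its own position.
code_point_span = (6, 10)
# A tool that counts each grapheme cluster once sees "é" as one character,
# so its offsets for the same word are shifted by one.
grapheme_span = (5, 9)

print(repr(text[slice(*code_point_span)]))  # 'noir'
print(repr(text[slice(*grapheme_span)]))    # ' noi'
```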

Parameters:

task (DoccanoTask): The doccano task for the input converter.
client_config (DoccanoClientConfig, optional): Client configuration describing the values set in the doccano interface. This config can change, for example, the name of the text field or of the label field.
attr_label (str, default="doccano_category"): The label to use for the medkit attribute that represents the doccano category. Only relevant for TEXT_CLASSIFICATION projects.
uid (str, optional): Identifier of the converter.

Attributes:

description (OperationDescription): Description of the operation.

uid#

client_config#

task#

attr_label#

_prov_tracer: medkit.core.ProvTracer | None = None#

set_prov_tracer(prov_tracer: medkit.core.ProvTracer)#

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer): The provenance tracer used to trace the provenance.

property description: medkit.core.OperationDescription#: Contains all the input converter init parameters.

load_from_directory_zip(dir_path: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a directory of zip files.

The zip files should contain JSONL files coming from doccano.

Parameters:

dir_path (str or Path): The path to the directory containing zip files.

Returns:

list of TextDocument: A list of TextDocuments

load_from_zip(input_file: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a zip file.

Parameters:

input_file (str or Path): The path to the zip file containing a doccano JSONL file.

Returns:

list of TextDocument: A list of TextDocuments
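The reading step behind the two zip-based loaders above can be sketched with the standard library alone. The helper names `read_jsonl_from_zip` and `read_jsonl_from_zip_dir` are hypothetical, and the sketch assumes archive members ending in `.jsonl`; it returns plain dictionaries rather than TextDocuments and is not medkit's implementation.

```python
import json
import zipfile
from pathlib import Path


def read_jsonl_from_zip(zip_path):
    """Return one dict per non-empty line of every .jsonl member in the archive."""
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".jsonl"):
                continue  # skip anything that is not a JSONL export
            with archive.open(name) as member:
                for raw_line in member:
                    line = raw_line.decode("utf-8").strip()
                    if line:
                        records.append(json.loads(line))
    return records


def read_jsonl_from_zip_dir(dir_path):
    """Apply read_jsonl_from_zip to every .zip file in a directory, sorted by name."""
    records = []
    for zip_path in sorted(Path(dir_path).glob("*.zip")):
        records.extend(read_jsonl_from_zip(zip_path))
    return records
```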

load_from_file(input_file: str | pathlib.Path) → list[medkit.core.text.TextDocument]#

Load text documents from a JSONL file.

Parameters:

input_file (str or Path): The path to the JSONL file containing doccano annotations.

Returns:

list of TextDocument: A list of TextDocuments
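A JSONL file holds one JSON object per line. As a rough sketch of the reading step only (the helper `iter_jsonl` is hypothetical and yields dictionaries, not TextDocuments), invalid lines can be reported with their line number:

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a UTF-8 JSONL file."""
    with Path(path).open(encoding="utf-8") as fp:
        for line_number, line in enumerate(fp, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Point at the offending line to ease debugging of exports.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
```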

_check_crlf_character(documents: list[medkit.core.text.TextDocument])#

Check if the list of converted documents contains the CRLF character.

This character is the only indicator available to warn if there are alignment problems in the documents.
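A check of this kind might look like the sketch below, which works on plain strings instead of TextDocuments. The function name `warn_if_crlf` and the warning message are assumptions made for illustration.

```python
import warnings


def warn_if_crlf(texts):
    """Emit a single warning if any text contains CRLF; return the affected count."""
    n_affected = sum(1 for text in texts if "\r\n" in text)
    if n_affected:
        warnings.warn(
            f"{n_affected} document(s) contain CRLF; annotation offsets may be misaligned",
            stacklevel=2,
        )
    return n_affected
```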

_parse_doc_line(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a doc_line into a TextDocument depending on the task.

Parameters:

doc_line (dict of str to Any): A dictionary representing an annotation from doccano.

Returns:

TextDocument: A document with parsed annotations.

_parse_doc_line_relation_extraction(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities and relations.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with annotations
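For orientation, a relation extraction line can be resolved as in the sketch below. The field names (`entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) are assumed to follow doccano's export format and should be checked against an actual export; the sample record is invented, and the result is a list of tuples, not medkit objects.

```python
import json

line = json.dumps({
    "text": "Aspirin relieves headache.",
    "entities": [
        {"id": 1, "label": "DRUG", "start_offset": 0, "end_offset": 7},
        {"id": 2, "label": "SYMPTOM", "start_offset": 17, "end_offset": 25},
    ],
    "relations": [{"id": 10, "type": "treats", "from_id": 1, "to_id": 2}],
})

record = json.loads(line)
text = record["text"]

# Index entities by their doccano ID so that relations can be resolved.
entities_by_id = {
    ent["id"]: (ent["label"], text[ent["start_offset"]:ent["end_offset"]])
    for ent in record["entities"]
}

# Each relation becomes (source entity, relation type, target entity).
relations = [
    (entities_by_id[rel["from_id"]], rel["type"], entities_by_id[rel["to_id"]])
    for rel in record["relations"]
]
```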

_parse_doc_line_seq_labeling(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with entities.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with annotations
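In the sequence labeling format, entities are stored as tuples. Assuming each tuple is `[start, end, label]` under the default `label` key (an assumption to verify against a real export), the spans can be validated against the text before use:

```python
import json

line = '{"text": "Aspirin relieves headache.", "label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]]}'
record = json.loads(line)
text = record["text"]

entities = []
for start, end, label in record["label"]:
    # Reject spans that fall outside the text instead of silently truncating.
    if not 0 <= start < end <= len(text):
        raise ValueError(f"span ({start}, {end}) is outside the text")
    entities.append({"label": label, "span": (start, end), "text": text[start:end]})
```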

_parse_doc_line_text_classification(doc_line: dict[str, Any]) → medkit.core.text.TextDocument#

Parse a dictionary and return a TextDocument with an attribute.

Parameters:

doc_line (dict of str to Any): Dictionary with doccano annotation.

Returns:

TextDocument: The document with its category
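A text classification line carries the category instead of spans. The sketch below assumes the label is a list whose first item is the category, and stores it under the default `attr_label` name documented above; the `document` dictionary is a hypothetical stand-in for a TextDocument with one attribute.

```python
import json

ATTR_LABEL = "doccano_category"  # default attr_label of the input converter

line = '{"text": "Patient reports mild fever.", "label": ["symptom_report"]}'
record = json.loads(line)

# Assumption: the category is the first item of the label list.
labels = record["label"]
document = {
    "text": record["text"],
    "attrs": {ATTR_LABEL: labels[0]} if labels else {},
}
```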

class medkit.io.doccano.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)#

Convert medkit files to doccano files (.JSONL) for a given task.

For each TextDocument, one JSON line will be created.

Parameters:

task (DoccanoTask): The doccano task for the output converter.
anns_labels (list of str, optional): Labels of medkit annotations to convert into doccano annotations. If None (default), all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
attr_label (str, optional): The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
ignore_segments (bool, default=True): If True, medkit segments will be ignored and only entities will be converted to doccano entities. If False, the medkit segments will be converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
include_metadata (bool, default=True): Whether to include medkit metadata in the converted documents.
uid (str, optional): Identifier of the converter.

Attributes:

description (OperationDescription): Description of the operation.

uid#

task#

anns_labels#

attr_label#

ignore_segments#

include_metadata#

property description: medkit.core.OperationDescription#

save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)#

Convert and save a list of TextDocuments into a doccano file (.JSONL).

Parameters:

docs (list of TextDocument): List of medkit documents to convert.
output_file (str or Path): Path of the JSONL file in which to save the converted documents.
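The writing step amounts to serializing one dictionary per line. A minimal sketch, assuming UTF-8 output and a hypothetical helper name `write_jsonl` (the conversion from TextDocument to dictionary is left out):

```python
import json
from pathlib import Path


def write_jsonl(records, output_file):
    """Write each dict as one JSON object per line, keeping non-ASCII text readable."""
    with Path(output_file).open("w", encoding="utf-8") as fp:
        for record in records:
            # ensure_ascii=False writes accented characters as-is, not as \u escapes.
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")
```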

_convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument into a dictionary depending on the task.

Parameters:

medkit_doc (TextDocument): Document to convert.

Returns:

dict of str to Any: Dictionary with doccano annotation

_convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.

Parameters:

medkit_doc (TextDocument): Document to convert; it may contain entities and relations.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain text, entities and relations.

_convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.

Parameters:

medkit_doc (TextDocument): Document to convert; it may contain entities.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain the text and its label (a list of tuples representing entities).
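How the `anns_labels` and `ignore_segments` options could shape a sequence labeling line is sketched below. Annotations are plain dictionaries with `label`, `start`, `end` and `is_entity` keys, a hypothetical stand-in for medkit entities and segments; the function name and the sorting of spans are also assumptions, not the library's behavior.

```python
def to_seq_labeling_line(text, annotations, anns_labels=None, ignore_segments=True):
    """Build a doccano-style sequence labeling dict from plain annotation dicts."""
    spans = []
    for ann in annotations:
        if ignore_segments and not ann["is_entity"]:
            continue  # segments are skipped unless explicitly requested
        if anns_labels is not None and ann["label"] not in anns_labels:
            continue  # keep only the requested labels
        spans.append([ann["start"], ann["end"], ann["label"]])
    return {"text": text, "label": sorted(spans)}


annotations = [
    {"label": "SYMPTOM", "start": 17, "end": 25, "is_entity": True},
    {"label": "DRUG", "start": 0, "end": 7, "is_entity": True},
    {"label": "sentence", "start": 0, "end": 26, "is_entity": False},
]
line = to_seq_labeling_line("Aspirin relieves headache.", annotations, anns_labels=["DRUG"])
```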

_convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) → dict[str, Any]#

Convert a TextDocument to a doc_line compatible with the doccano text classification task.

Parameters:

medkit_doc (TextDocument): Document to convert; it should contain at least one attribute to convert.

Returns:

dict of str to Any: Dictionary with doccano annotation. It may contain the text and its label (a category, as a str).