medkit.io.doccano
Classes
DoccanoTask | Supported doccano tasks.
DoccanoClientConfig | Doccano client configuration.
DoccanoInputConverter | Convert doccano files (.JSONL) containing annotations for a given task.
DoccanoOutputConverter | Convert medkit files to doccano files (.JSONL) for a given task.
Module Contents
- class medkit.io.doccano.DoccanoTask(*args, **kwds)
Bases:
enum.Enum
Supported doccano tasks.
- Attributes:
- TEXT_CLASSIFICATION
Documents with a category
- RELATION_EXTRACTION
Documents with entities and relations (including IDs)
- SEQUENCE_LABELING
Documents with entities in tuples
- TEXT_CLASSIFICATION = 'text_classification'
- RELATION_EXTRACTION = 'relation_extraction'
- SEQUENCE_LABELING = 'sequence_labeling'
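Because the members carry plain string values, a task can be selected from a configuration string through the standard `Enum` lookup by value. The sketch below uses a local stand-in enum (`Task`, a hypothetical name) that mirrors the documented values so that it runs without medkit installed; with medkit available, `DoccanoTask` would take its place.

```python
from enum import Enum


class Task(Enum):
    """Stand-in mirroring the values documented for DoccanoTask (not the medkit class)."""

    TEXT_CLASSIFICATION = "text_classification"
    RELATION_EXTRACTION = "relation_extraction"
    SEQUENCE_LABELING = "sequence_labeling"


def task_from_string(name: str) -> Task:
    # Enum lookup by value; raises ValueError for an unsupported task name.
    return Task(name)


selected = task_from_string("sequence_labeling")
```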
- class medkit.io.doccano.DoccanoClientConfig
Doccano client configuration.
The default values are the default values used by doccano.
- Attributes:
- column_text : str, default="text"
Name or key representing the text
- column_label : str, default="label"
Name or key representing the label
- column_text: str = 'text'
- column_label: str = 'label'
- class medkit.io.doccano.DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)
Convert doccano files (.JSONL) containing annotations for a given task.
For each line, a TextDocument will be created. The doccano files can be loaded from a directory of zip files, from a single zip file, or from a JSONL file.
The converter supports a custom configuration defining the parameters used by doccano when importing the data (cf. DoccanoClientConfig).
Warning
If the option Count grapheme clusters as one character was selected when creating the doccano project, the converted documents are likely to have alignment problems; the converter does not support this option.
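The misalignment comes from two ways of counting characters. The sketch below is a stdlib-only illustration, not part of medkit: Python indexes strings by code point, so an offset expressed in grapheme clusters lands too early once the text contains a combining character. The grapheme count used here is a rough approximation that only handles combining marks.

```python
import unicodedata

# "café noir" written with a combining acute accent (U+0301) after the "e".
text = "cafe\u0301 noir"

# Python strings are indexed by code point: "e" + U+0301 counts as two.
code_point_len = len(text)

# Rough grapheme count for this example: combining marks do not start a new cluster.
grapheme_len = sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Start of "noir" counted in grapheme clusters versus code points.
start_in_graphemes = 5
start_in_code_points = text.index("noir")

# Applying the grapheme-based offsets to the Python string selects the wrong span.
misaligned = text[start_in_graphemes:start_in_graphemes + 4]
aligned = text[start_in_code_points:start_in_code_points + 4]
```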
- Parameters:
- task : DoccanoTask
The doccano task for the input converter
- client_config : DoccanoClientConfig, optional
Optional client configuration matching the values defined in the doccano interface. This config can change, for example, the name of the text field or of the label field.
- attr_label : str, default="doccano_category"
The label to use for the medkit attribute that represents the doccano category. This is related to TEXT_CLASSIFICATION projects.
- uid : str, optional
Identifier of the converter.
- Attributes:
- description : str
Description for the operation.
- uid
- client_config
- task
- attr_label
- _prov_tracer: medkit.core.ProvTracer | None = None
- set_prov_tracer(prov_tracer: medkit.core.ProvTracer)
Enable provenance tracing.
- Parameters:
- prov_tracer : ProvTracer
The provenance tracer used to trace the provenance.
- property description: medkit.core.OperationDescription
Contains all the input converter init parameters.
- load_from_directory_zip(dir_path: str | pathlib.Path) -> list[medkit.core.text.TextDocument]
Load text documents from a directory of zip files.
The zip files should contain JSONL files coming from doccano.
- Parameters:
- dir_path : str or Path
The path to the directory containing zip files.
- Returns:
- list of TextDocument
A list of TextDocuments
- load_from_zip(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]
Load text documents from a zip file.
- Parameters:
- input_file : str or Path
The path to the zip file containing a doccano JSONL file
- Returns:
- list of TextDocument
A list of TextDocuments
- load_from_file(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]
Load text documents from a JSONL file.
- Parameters:
- input_file : str or Path
The path to the JSONL file containing doccano annotations
- Returns:
- list of TextDocument
A list of TextDocuments
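A typical call sequence can be sketched as follows. The sample lines assume the `[start, end, label]` layout that doccano uses for sequence labeling exports; the texts and labels are invented for illustration. `write_sample_jsonl` and `load_with_medkit` are hypothetical helper names, and only `load_with_medkit` requires medkit, which it imports lazily so the rest of the snippet runs with the standard library alone.

```python
import json
import tempfile
from pathlib import Path

# Two sequence labeling lines; each label is [start, end, label] (assumed layout).
SAMPLE_LINES = [
    {"text": "Aspirin was prescribed.", "label": [[0, 7, "DRUG"]]},
    {"text": "No fever reported.", "label": [[3, 8, "SYMPTOM"]]},
]


def write_sample_jsonl(path: Path) -> Path:
    """Write one JSON object per line, as in a JSONL file."""
    with path.open("w", encoding="utf-8") as fp:
        for line in SAMPLE_LINES:
            fp.write(json.dumps(line, ensure_ascii=False) + "\n")
    return path


def load_with_medkit(path: Path):
    """Load the file with the converter documented above (requires medkit)."""
    from medkit.io.doccano import DoccanoInputConverter, DoccanoTask

    converter = DoccanoInputConverter(task=DoccanoTask.SEQUENCE_LABELING)
    return converter.load_from_file(path)


sample_path = write_sample_jsonl(Path(tempfile.mkdtemp()) / "sample.jsonl")
```

With medkit installed, `load_with_medkit(sample_path)` would be expected to return one TextDocument per line of the file.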
- _check_crlf_character(documents: list[medkit.core.text.TextDocument])
Check if the list of converted documents contains the CRLF character.
This character is the only indicator available to warn if there are alignment problems in the documents.
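A check of this kind can be sketched in a few lines. The helper below (`texts_with_crlf`, a hypothetical name) is not medkit's implementation; it works on plain strings and only reports which texts contain a CRLF sequence, so that the caller can emit a warning.

```python
from __future__ import annotations


def texts_with_crlf(texts: list[str]) -> list[int]:
    """Return the indices of the texts containing a CRLF sequence."""
    return [i for i, text in enumerate(texts) if "\r\n" in text]


# Only the first text uses Windows-style line endings.
suspicious = texts_with_crlf(["line one\r\nline two", "line one\nline two"])
```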
- _parse_doc_line(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument
Parse a doc_line into a TextDocument depending on the task.
- Parameters:
- doc_line : dict of str to Any
A dictionary representing an annotation from doccano
- Returns:
- TextDocument
A document with parsed annotations.
- _parse_doc_line_relation_extraction(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument
Parse a dictionary and return a TextDocument with entities and relations.
- Parameters:
- doc_line : dict of str to Any
Dictionary with doccano annotation
- Returns:
- TextDocument
The document with annotations
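For orientation, the sketch below shows the kind of dictionary this task works with, assuming the layout doccano uses for relation extraction exports: entities carry an `id` with `start_offset`/`end_offset`, and relations refer to those IDs through `from_id`/`to_id`. The sample data and the helper `resolve_relations` are hypothetical and are not medkit's parsing code.

```python
# One relation extraction line (assumed doccano layout, invented content).
doc_line = {
    "text": "Aspirin relieves headache.",
    "entities": [
        {"id": 0, "label": "DRUG", "start_offset": 0, "end_offset": 7},
        {"id": 1, "label": "SYMPTOM", "start_offset": 17, "end_offset": 25},
    ],
    "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
}


def resolve_relations(line: dict) -> list:
    """Replace the entity IDs of each relation with the text they cover."""
    spans = {
        ent["id"]: line["text"][ent["start_offset"]:ent["end_offset"]]
        for ent in line["entities"]
    }
    return [
        (spans[rel["from_id"]], rel["type"], spans[rel["to_id"]])
        for rel in line["relations"]
    ]


triples = resolve_relations(doc_line)
```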
- _parse_doc_line_seq_labeling(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument
Parse a dictionary and return a TextDocument with entities.
- Parameters:
- doc_line : dict of str to Any
Dictionary with doccano annotation.
- Returns:
- TextDocument
The document with annotations
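In a sequence labeling line, entities are stored as tuples. The sketch below assumes the `[start, end, label]` order used by doccano and recovers the text covered by each entity; `extract_spans` is a hypothetical helper and the sample content is invented.

```python
# One sequence labeling line (assumed doccano layout, invented content).
doc_line = {
    "text": "Aspirin relieves headache.",
    "label": [[0, 7, "DRUG"], [17, 25, "SYMPTOM"]],
}


def extract_spans(line: dict) -> list:
    """Return (covered text, label) pairs for each entity tuple."""
    return [(line["text"][start:end], label) for start, end, label in line["label"]]


spans = extract_spans(doc_line)
```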
- _parse_doc_line_text_classification(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument
Parse a dictionary and return a TextDocument with an attribute.
- Parameters:
- doc_line : dict of str to Any
Dictionary with doccano annotation.
- Returns:
- TextDocument
The document with its category
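The role of the column names can be illustrated with a small stdlib-only sketch. It assumes that a text classification line stores its category in a list under the label key, and it takes the key names as arguments in the same spirit as `column_text` and `column_label` in DoccanoClientConfig. `read_category` is a hypothetical helper, not medkit's parser.

```python
from __future__ import annotations


def read_category(
    line: dict, column_text: str = "text", column_label: str = "label"
) -> tuple[str, str | None]:
    """Return the text and its first category, or None when no category is set."""
    labels = line.get(column_label) or []
    return line[column_text], (labels[0] if labels else None)


# Same document exported with the default keys and with custom keys.
default_line = {"text": "Patient is stable.", "label": ["follow_up"]}
custom_line = {"content": "Patient is stable.", "category": ["follow_up"]}

from_default = read_category(default_line)
from_custom = read_category(custom_line, column_text="content", column_label="category")
```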
- class medkit.io.doccano.DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)
Convert medkit files to doccano files (.JSONL) for a given task.
For each TextDocument, a JSON line will be created.
- Parameters:
- task : DoccanoTask
The doccano task for the output converter
- anns_labels : list of str, optional
Labels of medkit annotations to convert into doccano annotations. If None (default), all the entities or relations will be converted. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
- attr_label : str, optional
The label of the medkit attribute that represents the text category. Useful for TEXT_CLASSIFICATION converters.
- ignore_segments : bool, default=True
If True, medkit segments are ignored and only entities are converted to doccano entities. If False, medkit segments are converted to doccano entities as well. Useful for SEQUENCE_LABELING or RELATION_EXTRACTION converters.
- include_metadata : bool, default=True
Whether to include medkit metadata in the converted documents
- uid : str, optional
Identifier of the converter.
- Attributes:
- description : str
Description for the operation.
- uid
- task
- anns_labels
- attr_label
- ignore_segments
- include_metadata
- property description: medkit.core.OperationDescription
- save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)
Convert and save a list of TextDocuments into a doccano file (.JSONL).
- Parameters:
- docs : list of TextDocument
List of medkit documents to convert
- output_file : str or Path
Path of the JSONL file in which to save the converted documents
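A round trip through both converters can be sketched as below, using only the constructor and method signatures documented on this page. `convert_roundtrip` and `read_jsonl` are hypothetical helper names; `convert_roundtrip` requires medkit and imports it lazily, while `read_jsonl` is a stdlib-only way to inspect a saved file. The demo line is invented and assumes the `[start, end, label]` layout for sequence labeling.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def convert_roundtrip(input_file: Path, output_file: Path) -> None:
    """Load doccano annotations and save them again (requires medkit)."""
    from medkit.io.doccano import (
        DoccanoInputConverter,
        DoccanoOutputConverter,
        DoccanoTask,
    )

    task = DoccanoTask.SEQUENCE_LABELING
    docs = DoccanoInputConverter(task=task).load_from_file(input_file)
    DoccanoOutputConverter(task=task).save(docs, output_file=output_file)


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back as a list of dictionaries, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Stdlib-only demo of the inspection helper on a hand-written file.
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
demo_path.write_text(
    '{"text": "No fever.", "label": [[3, 8, "SYMPTOM"]]}\n', encoding="utf-8"
)
records = read_jsonl(demo_path)
```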
- _convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]
Convert a TextDocument into a dictionary depending on the task.
- Parameters:
- medkit_doc : TextDocument
Document to convert
- Returns:
- dict of str to Any
Dictionary with doccano annotation
- _convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]
Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.
- Parameters:
- medkit_doc : TextDocument
Document to convert; it may contain entities and relations.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text, entities and relations.
- _convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]
Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.
- Parameters:
- medkit_doc : TextDocument
Document to convert; it may contain entities.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text and its label (a list of tuples representing entities).
- _convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]
Convert a TextDocument to a doc_line compatible with the doccano text classification task.
- Parameters:
- medkit_doc : TextDocument
Document to convert; it should contain at least one attribute to convert.
- Returns:
- dict of str to Any
Dictionary with doccano annotation. It may contain text and its label (a category, as a str).