DocxToMarkdownPipeline
What it is
A document conversion pipeline that reads a DOCX file and produces Markdown by extracting paragraphs from word/document.xml and mapping basic Word paragraph styles (headings, lists) to Markdown syntax.
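The extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline's actual code: the helper name read_paragraph_texts is hypothetical, and wrapping zipfile.BadZipFile in a ValueError is an assumption about how invalid archives are reported.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraph_texts(docx_bytes: bytes) -> list[str]:
    """Return the plain text of each w:p paragraph (hypothetical helper)."""
    try:
        with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
            if "word/document.xml" not in archive.namelist():
                raise ValueError("Invalid DOCX content: missing word/document.xml")
            xml_bytes = archive.read("word/document.xml")
    except zipfile.BadZipFile as exc:
        # Assumption: bytes that are not a ZIP archive count as invalid DOCX content.
        raise ValueError("Invalid DOCX content: missing word/document.xml") from exc

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Only w:t nodes carry text; formatting elements are ignored.
        texts = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(texts))
    return paragraphs
```

Each returned string corresponds to one Word paragraph, before any style mapping or whitespace normalization is applied.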
Public API
- DocxToMarkdownPipelineConfiguration
  - Extends ConvertToMarkdownBasePipelineConfiguration.
  - mime_type: Defaults to application/vnd.openxmlformats-officedocument.wordprocessingml.document.
- DocxToMarkdownPipelineParameters
  - Extends ConvertToMarkdownBasePipelineParameters.
  - processor_iri: Defaults to http://ontology.naas.ai/abi/document/DocxToMarkdownProcessor.
- DocxToMarkdownPipeline
  - Extends ConvertToMarkdownBasePipeline.
  - convert_to_markdown(file: File) -> str: Converts a DOCX File to a Markdown string.
Configuration/Dependencies
- Depends on:
  - naas_abi_marketplace.domains.document.pipelines.ConvertToMarkdownBasePipeline (base classes/config/params)
  - naas_abi_marketplace...File for reading file bytes (file.read() must return bytes)
  - Standard libs: zipfile, xml.etree.ElementTree, re, io
  - pydantic.Field and typing.Annotated for configuration metadata
Usage
from naas_abi_marketplace.domains.document.pipelines.DocxToMarkdownPipeline import (
DocxToMarkdownPipeline,
)
from naas_abi_marketplace.domains.document.ontologies.classes.ontology_naas_ai.abi.document.File import (
File,
)
pipeline = DocxToMarkdownPipeline()
docx_file: File = ... # must support .read() -> bytes for a DOCX
markdown = pipeline.convert_to_markdown(docx_file)
print(markdown)
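To try the conversion without a Word file on disk, a minimal DOCX can be assembled in memory. The sketch below is an illustration under assumptions: build_minimal_docx and InMemoryFile are hypothetical names, the archive contains only word/document.xml (a real DOCX has more parts), and whether the pipeline accepts a stand-in object in place of a real File is not confirmed here.

```python
import io
import zipfile

# A tiny document body: one Heading1 paragraph and one plain paragraph.
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>'
    "<w:p><w:r><w:t>Body text.</w:t></w:r></w:p>"
    "</w:body>"
    "</w:document>"
)


def build_minimal_docx(document_xml: str = DOCUMENT_XML) -> bytes:
    """Return ZIP bytes containing only word/document.xml (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


class InMemoryFile:
    """Hypothetical stand-in that only provides read() -> bytes."""

    def __init__(self, content: bytes) -> None:
        self._content = content

    def read(self) -> bytes:
        return self._content


docx_file = InMemoryFile(build_minimal_docx())
```

Because the pipeline only reads word/document.xml, an archive like this is likely enough for a quick check, but it is not a file Word itself would open.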
Caveats
- Only processes word/document.xml; invalid DOCX content or a missing word/document.xml raises ValueError("Invalid DOCX content: missing word/document.xml").
- Markdown conversion is minimal:
  - Headings: only paragraph styles matching Heading1…Heading6.
  - Lists: any paragraph with numbering properties (w:numPr) becomes "- ..." (no ordered list numbering).
  - Inline formatting (bold/italic/links/images/tables) is not handled; text is extracted from w:t nodes only.
- Whitespace normalization collapses runs of spaces/tabs to a single space and trims ends; explicit w:tab becomes four spaces and w:br/w:cr become newline characters (which may be normalized).
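The mapping rules above can be illustrated for a single paragraph element. This is a sketch under assumptions, not the pipeline's implementation: the function names are hypothetical, the heading level is assumed to translate to the same number of # characters, and the handling of w:tab, w:br and w:cr is left out because the order in which the pipeline applies it relative to normalization is not stated.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # prefix for namespaced tag and attribute names

# Only style ids Heading1 through Heading6 count as headings.
HEADING_STYLE = re.compile(r"^Heading([1-6])$")


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def paragraph_to_markdown(paragraph: ET.Element) -> str:
    """Map one w:p element to a Markdown line (hypothetical helper)."""
    raw = "".join(node.text or "" for node in paragraph.iter(f"{W}t"))
    text = normalize_whitespace(raw)

    style = paragraph.find(f"{W}pPr/{W}pStyle")
    style_id = style.get(f"{W}val", "") if style is not None else ""
    match = HEADING_STYLE.match(style_id)
    if match:
        # Assumption: HeadingN maps to N leading '#' characters.
        return "#" * int(match.group(1)) + " " + text

    if paragraph.find(f"{W}pPr/{W}numPr") is not None:
        # Any numbered or bulleted paragraph becomes an unordered item.
        return "- " + text

    return text
```

A paragraph styled Heading2 would come out as a level-two heading, while a paragraph in a numbered Word list would still be emitted with a "- " prefix, which is the loss of ordering noted above.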