PptxToMarkdownPipeline
What it is
A pipeline that converts a .pptx (PowerPoint) file into Markdown by extracting text from each slide’s XML and formatting it as slide headers with bullet points.
Public API
- `PptxToMarkdownPipelineConfiguration`
  - `mime_type: str` - MIME type this pipeline targets. Default: `application/vnd.openxmlformats-officedocument.presentationml.presentation`
- `PptxToMarkdownPipelineParameters`
  - `processor_iri: str` - IRI identifying the processor. Default: `http://ontology.naas.ai/abi/document/PptxToMarkdownProcessor`
- `PptxToMarkdownPipeline`
  - `convert_to_markdown(file: File) -> str` - Reads PPTX content from `file`, extracts slide text, and returns Markdown.
  - Internal helpers (not intended as public API):
    - `_normalize_whitespace(value: str) -> str`
    - `_list_slide_paths(pptx_file: zipfile.ZipFile) -> list[str]`
    - `_extract_text_from_slide(slide_xml: bytes) -> list[str]`
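The bodies of the internal helpers are not shown in this document. As a rough illustration, here is a minimal sketch of what two of them might do, inferred only from their signatures and from the caveat about numeric slide ordering. The regular expressions, the function bodies, and the unprefixed function names are assumptions, not the pipeline's actual implementation.

```python
import io
import re
import zipfile

# Assumed pattern for slide parts inside the archive.
_SLIDE_PATH = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def normalize_whitespace(value: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends (assumed behavior)."""
    return re.sub(r"\s+", " ", value).strip()


def list_slide_paths(pptx_file: zipfile.ZipFile) -> list[str]:
    """Return slide XML paths ordered by numeric index, so slide10 follows slide9."""
    numbered = []
    for name in pptx_file.namelist():
        match = _SLIDE_PATH.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


# Build a tiny archive in memory to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for index in (10, 2, 1):
        archive.writestr(f"ppt/slides/slide{index}.xml", "<x/>")
    # Relationship parts live under ppt/slides/_rels and should be ignored.
    archive.writestr("ppt/slides/_rels/slide1.xml.rels", "<x/>")

with zipfile.ZipFile(buffer) as archive:
    slide_paths = list_slide_paths(archive)

print(slide_paths)
print(normalize_whitespace("  Quarterly \n  results\t2024 "))
```

Sorting on the parsed integer rather than on the raw filename is what keeps `slide10.xml` from landing between `slide1.xml` and `slide2.xml`.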
Configuration/Dependencies
- Depends on:
  - `naas_abi_marketplace...File` providing `read() -> bytes`
- Base classes:
  - `ConvertToMarkdownBasePipeline`
  - `ConvertToMarkdownBasePipelineConfiguration`
  - `ConvertToMarkdownBasePipelineParameters`
- Standard library:
  - `zipfile`, `io`, `re`, `xml.etree.ElementTree`
- `pydantic.Field` for parameter metadata
Usage
```python
from naas_abi_marketplace.domains.document.pipelines.PptxToMarkdownPipeline import (
    PptxToMarkdownPipeline,
)
from naas_abi_marketplace.domains.document.ontologies.classes.ontology_naas_ai.abi.document.File import File

pipeline = PptxToMarkdownPipeline()

pptx_file: File = ...  # must implement .read() returning PPTX bytes
md = pipeline.convert_to_markdown(pptx_file)
print(md)
```
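For local experiments it can help to have PPTX bytes on hand without opening PowerPoint. The sketch below builds a deliberately minimal archive containing a single slide part and wraps it in a hypothetical `InMemoryFile` stand-in that exposes `read() -> bytes`. It assumes the pipeline needs only `read()` and the `ppt/slides/slide*.xml` entries. Neither assumption is confirmed by this document, and the archive is not a complete presentation that PowerPoint would open.

```python
import io
import zipfile

# Minimal slide part using the standard PresentationML and DrawingML namespaces.
SLIDE_XML = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    "<p:cSld><p:spTree><p:sp><p:txBody>"
    "<a:p><a:r><a:t>Hello from slide 1</a:t></a:r></a:p>"
    "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
)


class InMemoryFile:
    """Hypothetical stand-in for File: it only provides read() -> bytes."""

    def __init__(self, content: bytes) -> None:
        self._content = content

    def read(self) -> bytes:
        return self._content


def build_minimal_pptx() -> bytes:
    """Zip a single slide part under the path the pipeline is documented to scan."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("ppt/slides/slide1.xml", SLIDE_XML)
    return buffer.getvalue()


pptx_file = InMemoryFile(build_minimal_pptx())

# With the package installed, this object could then be tried with:
# md = PptxToMarkdownPipeline().convert_to_markdown(pptx_file)
```

If the pipeline validates that its argument is a real `File` instance, the stand-in will not be accepted and the bytes would need to be wrapped in the actual class instead.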
Markdown output shape:
- For each non-empty slide:
  - `## Slide N`
  - `- <line 1>`
  - `- <line 2>`
  - `- ...`
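The output shape can be illustrated with a small rendering function. This is a sketch only: `render_markdown` is a hypothetical name, and it assumes that `N` is the slide's position in the deck, that empty slides are skipped without renumbering the rest, and that sections are separated by a blank line. The document does not specify any of these details.

```python
def render_markdown(slides: list[list[str]]) -> str:
    """Format per-slide text lines as '## Slide N' headers followed by bullets."""
    blocks = []
    for number, lines in enumerate(slides, start=1):
        if not lines:
            continue  # empty slides produce no section
        bullets = "\n".join(f"- {line}" for line in lines)
        blocks.append(f"## Slide {number}\n{bullets}")
    return "\n\n".join(blocks)


# Slide 2 has no text, so only slides 1 and 3 appear in the output.
markdown = render_markdown([["Title", "Subtitle"], [], ["Agenda"]])
print(markdown)
```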
Caveats
- Only extracts text present in slide XML text runs (`a:t`) and treats line breaks (`a:br`) as newline markers before whitespace normalization.
- Images, tables, speaker notes, and most formatting are not converted.
- If the input is not a valid PPTX zip (or slide XML cannot be read), `convert_to_markdown` raises `ValueError("Invalid PPTX content: unable to read slide XML")`.
- Slide numbering is derived from `ppt/slides/slide*.xml` filenames and sorted by their numeric index.
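The caveats above can be made concrete with a short sketch of text extraction and error mapping. It is one plausible reading, not the pipeline's code: grouping runs by paragraph (`a:p`), the set of exceptions being caught, and the unprefixed function names are all assumptions. Only the `a:t` and `a:br` handling and the error message come from this document.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace, which the "a:" prefix refers to in slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
A_T, A_BR, A_P = f"{{{A_NS}}}t", f"{{{A_NS}}}br", f"{{{A_NS}}}p"


def extract_text_from_slide(slide_xml: bytes) -> list[str]:
    """Collect a:t text per paragraph, split on a:br, then normalize whitespace."""
    lines = []
    root = ET.fromstring(slide_xml)
    for paragraph in root.iter(A_P):
        parts = []
        for node in paragraph.iter():
            if node.tag == A_T and node.text:
                parts.append(node.text)
            elif node.tag == A_BR:
                parts.append("\n")  # newline marker, split on it below
        for piece in "".join(parts).split("\n"):
            cleaned = re.sub(r"\s+", " ", piece).strip()
            if cleaned:
                lines.append(cleaned)
    return lines


def read_slides(content: bytes) -> list[list[str]]:
    """Open the archive and map read or parse failures to the documented ValueError."""
    try:
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            names = [
                name
                for name in archive.namelist()
                if re.fullmatch(r"ppt/slides/slide\d+\.xml", name)
            ]
            # Order by the number in the filename, not alphabetically.
            names.sort(key=lambda name: int(re.search(r"slide(\d+)\.xml$", name).group(1)))
            return [extract_text_from_slide(archive.read(name)) for name in names]
    except (zipfile.BadZipFile, KeyError, ET.ParseError) as exc:
        raise ValueError("Invalid PPTX content: unable to read slide XML") from exc


try:
    read_slides(b"not a zip archive")
except ValueError as error:
    print(error)
```

Because this sketch walks every `a:p` element, text inside table cells would surface as plain lines with no table structure, which is consistent with tables not being converted.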