PptxToMarkdownPipeline

What it is

A pipeline that converts a .pptx (PowerPoint) file into Markdown by extracting text from each slide’s XML and formatting it as slide headers with bullet points.

Public API

PptxToMarkdownPipelineConfiguration
- mime_type: str — MIME type this pipeline targets. Default:
  - application/vnd.openxmlformats-officedocument.presentationml.presentation
PptxToMarkdownPipelineParameters
- processor_iri: str — IRI identifying the processor. Default:
  - http://ontology.naas.ai/abi/document/PptxToMarkdownProcessor
PptxToMarkdownPipeline
- convert_to_markdown(file: File) -> str — Reads PPTX content from file, extracts slide text, and returns Markdown.
- Internal helpers (not intended as public API):
  - _normalize_whitespace(value: str) -> str
  - _list_slide_paths(pptx_file: zipfile.ZipFile) -> list[str]
  - _extract_text_from_slide(slide_xml: bytes) -> list[str]

Configuration/Dependencies

Depends on:
- naas_abi_marketplace...File providing read() -> bytes
- Base classes:
  - ConvertToMarkdownBasePipeline
  - ConvertToMarkdownBasePipelineConfiguration
  - ConvertToMarkdownBasePipelineParameters
- Standard library: zipfile, io, re, xml.etree.ElementTree
- pydantic.Field for parameter metadata

Usage

from naas_abi_marketplace.domains.document.pipelines.PptxToMarkdownPipeline import (
    PptxToMarkdownPipeline,
)
from naas_abi_marketplace.domains.document.ontologies.classes.ontology_naas_ai.abi.document.File import File

pipeline = PptxToMarkdownPipeline()

pptx_file: File = ...  # must implement .read() returning PPTX bytes
md = pipeline.convert_to_markdown(pptx_file)
print(md)

Markdown output shape:

For each non-empty slide:
- ## Slide N
- - <line 1>
- - <line 2>
- ...

Caveats

Only extracts text present in slide XML text runs (a:t) and treats line breaks (a:br) as newline markers before whitespace normalization.
Images, tables, speaker notes, and most formatting are not converted.
If the input is not a valid PPTX zip (or slide XML cannot be read), convert_to_markdown raises ValueError("Invalid PPTX content: unable to read slide XML").
Slide numbering is derived from ppt/slides/slide*.xml filenames and sorted by their numeric index.

What it is​

Public API​

Configuration/Dependencies​

Usage​

Caveats​

What it is

Public API

Configuration/Dependencies

Usage

Caveats