ArXivPaperPipeline
What it is
- A Pipeline that fetches an ArXiv paper by ID, builds an RDF graph (ABI/BFO ontology terms), writes the graph to a Turtle (.ttl) file, and optionally downloads the paper PDF and records its local path in the graph.
Public API
- ArXivPaperPipelineConfiguration (dataclass)
  - Holds runtime configuration:
    - arxiv_integration_config: ArXivIntegrationConfiguration
    - triple_store: ITripleStoreService (present in config but not used in this file)
    - storage_base_path: str (default: "storage/triplestore/application-level/arxiv")
    - pdf_storage_path: str (default: "datastore/application-level/arxiv")
- ArXivPaperPipelineParameters (pydantic model)
  - paper_id: str (ArXiv paper ID)
  - download_pdf: bool = True (whether to download the PDF)
- ArXivPaperPipeline(configuration)
  - run(parameters) -> rdflib.Graph
    - Validates parameter type (ArXivPaperPipelineParameters)
    - Fetches paper metadata via ArXivIntegration.get_paper(paper_id)
    - Adds to an ABIGraph:
      - Paper individual (type ABI.ArXivPaper, with label/description/url)
      - Published timestamp as a BFO.BFO_0000203 "temporal instant", linked via BFO.BFO_0000222
      - Authors as ABI.ArXivAuthor, linked via ABI.hasAuthor
      - Categories as ABI.ArXivCategory, linked via ABI.hasCategory
    - Serializes graph to a uniquely named .ttl file under storage_base_path
    - If download_pdf is true and pdf_url exists:
      - Downloads PDF to pdf_storage_path
      - Adds (paper, ABI.localFilePath, <local pdf path literal>) to the graph
      - Rewrites the .ttl to include the local file path
  - as_tools() -> list[BaseTool]
    - Returns a single LangChain StructuredTool named "arxiv_paper_pipeline" that calls run() with ArXivPaperPipelineParameters.
  - as_api(...) -> None
    - Present but currently does nothing (returns None without registering routes).
Configuration/Dependencies
- External services/libraries:
  - ArXivIntegration / ArXivIntegrationConfiguration (paper lookup)
  - requests (PDF download)
  - rdflib (Graph, Literal) and naas_abi_core.utils.Graph (ABIGraph, ABI, BFO)
  - langchain_core.tools.StructuredTool (tool wrapper)
  - fastapi.APIRouter (API hook is stubbed)
- Filesystem:
  - Ensures directories exist:
    - storage_base_path for Turtle files
    - pdf_storage_path for downloaded PDFs
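The directory handling and unique file naming described above could look roughly like the following sketch. The helper name unique_ttl_path and the "<paper_id>_<uuid>.ttl" naming scheme are assumptions. This document does not state how the pipeline actually names its files.

```python
import tempfile
import uuid
from pathlib import Path


def unique_ttl_path(storage_base_path: str, paper_id: str) -> Path:
    """Hypothetical helper: ensure the directory exists and return a fresh .ttl path in it."""
    base = Path(storage_base_path)
    # parents=True creates intermediate directories; exist_ok=True makes repeat calls safe.
    base.mkdir(parents=True, exist_ok=True)
    # Old-style ArXiv IDs such as "hep-th/9901001" contain a slash, which is not filename-safe.
    safe_id = paper_id.replace("/", "_")
    return base / f"{safe_id}_{uuid.uuid4().hex}.ttl"


# Demonstration in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    first = unique_ttl_path(f"{tmp}/storage/arxiv", "1706.03762")
    second = unique_ttl_path(f"{tmp}/storage/arxiv", "1706.03762")
    dir_created = first.parent.is_dir()
```

A random suffix keeps repeated runs for the same paper from overwriting each other, at the cost of accumulating files over time.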
Usage
from naas_abi_marketplace.applications.arxiv.pipelines.ArXivPaperPipeline import (
    ArXivPaperPipeline,
    ArXivPaperPipelineConfiguration,
    ArXivPaperPipelineParameters,
)
from naas_abi_marketplace.applications.arxiv.integrations.ArXivIntegration import (
    ArXivIntegrationConfiguration,
)

# Provide real integration configuration and a triple store service instance as required by your environment.
arxiv_cfg = ArXivIntegrationConfiguration(...)  # depends on integration implementation
triple_store = ...  # must implement ITripleStoreService (not used by this pipeline)

cfg = ArXivPaperPipelineConfiguration(
    arxiv_integration_config=arxiv_cfg,
    triple_store=triple_store,
)
pipeline = ArXivPaperPipeline(cfg)

g = pipeline.run(
    ArXivPaperPipelineParameters(
        paper_id="1706.03762",
        download_pdf=True,
    )
)
print(len(g))  # number of RDF triples
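Since run() returns the graph but not the path of the Turtle file it wrote, a caller who needs that file may have to look it up afterwards. The sketch below picks the most recently modified .ttl file in a directory. The helper name latest_ttl is hypothetical, and it assumes the files are written directly under storage_base_path rather than in subdirectories.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_ttl(storage_base_path: str) -> Optional[Path]:
    """Hypothetical helper: return the newest .ttl file in the directory, or None."""
    candidates = list(Path(storage_base_path).glob("*.ttl"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration with two empty files whose modification times are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = latest_ttl(tmp)
    older = Path(tmp) / "older.ttl"
    newer = Path(tmp) / "newer.ttl"
    older.write_text("")
    newer.write_text("")
    os.utime(older, (1_000, 1_000))
    os.utime(newer, (2_000, 2_000))
    found_name = latest_ttl(tmp).name
```

With the default configuration this would be called as latest_ttl("storage/triplestore/application-level/arxiv"). If several pipelines write to the same directory concurrently, modification time alone is not a reliable way to identify a specific run's output.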
Caveats
- as_api() is a stub and does not expose any FastAPI routes.
- triple_store is required in configuration but is not used by run() in this implementation.
- Side effects:
  - Writes .ttl files and optionally .pdf files to disk.
  - Uses print() for status/errors; PDF download errors are caught and only printed (pipeline still returns the graph).
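Because status and error messages go to print(), a caller that wants them in its own logs can capture standard output around the call. The following is a sketch of that pattern using only the standard library. The wrapper name run_capturing_prints, the logger name, and the stand-in function fake_run (with its made-up messages) are hypothetical and are not part of the pipeline.

```python
import contextlib
import io
import logging

logger = logging.getLogger("arxiv_pipeline_wrapper")  # hypothetical logger name


def run_capturing_prints(func, *args, **kwargs):
    """Call func, capture anything it prints, forward each line to logging."""
    buffer = io.StringIO()
    # redirect_stdout swaps sys.stdout globally for the duration of the block,
    # so this is not suitable when other threads print at the same time.
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    messages = [line for line in buffer.getvalue().splitlines() if line.strip()]
    for line in messages:
        logger.info(line)
    return result, messages


def fake_run(paper_id):
    """Stand-in for pipeline.run(); prints status lines like a print-based pipeline would."""
    print(f"Fetching {paper_id}")
    print("Error downloading PDF: timeout")
    return {"triples": 3}


result, messages = run_capturing_prints(fake_run, "1706.03762")
```

With the real pipeline, the call would take the form run_capturing_prints(pipeline.run, parameters). Inspecting the captured lines is one way to notice a failed PDF download, since the returned graph alone does not raise an error in that case.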