ArXivQueryWorkflow

What it is

A workflow that loads local ArXiv RDF/Turtle (.ttl) files into an RDFLib graph and provides SPARQL-backed queries to:

list authors for papers (by ID or title substring)
list papers (by author name or category substring)
execute custom SPARQL queries
retrieve the ArXiv ontology schema file (as Turtle)
expose these capabilities as LangChain tools and FastAPI endpoints

Public API

Configuration

ArXivQueryWorkflowConfiguration(storage_path: str = "storage/triplestore/application-level/arxiv")
- Directory containing .ttl files to load into the combined graph.

Pydantic parameter models

AuthorQueryParameters(paper_id: Optional[str], paper_title: Optional[str])
- Inputs for author lookup by paper ID and/or title substring.
PaperQueryParameters(author_name: Optional[str], category: Optional[str])
- Inputs for paper lookup by author name and/or category substring.
SchemaParameters()
- No fields (placeholder).
SparqlQueryParameters(query: str)
- SPARQL query string to execute.

Workflow

class ArXivQueryWorkflow(configuration: ArXivQueryWorkflowConfiguration)
- query_authors(parameters: AuthorQueryParameters) -> Dict[str, Any]
  - Returns {"papers": [...]} with paper id, title, and unique authors list.
  - If neither paper_id nor paper_title is provided, returns {"error": ...}.
- query_papers(parameters: PaperQueryParameters) -> Dict[str, Any]
  - Returns {"papers": [...]} with paper id, title, and optional pdf_url.
  - If neither author_name nor category is provided, returns {"error": ...}.
- get_schema(parameters: SchemaParameters) -> Dict[str, str]
  - Reads and returns {"schema": "<ttl content>"} from src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl.
  - If missing, returns {"error": ...}.
- execute_query(parameters: SparqlQueryParameters) -> Dict[str, Any]
  - Executes the provided SPARQL query and returns {"results": [ {var: value, ...}, ... ]}.
  - On failure returns {"error": ...}.
- get_frequent_authors() -> Dict[str, Any]
  - Returns {"authors": [{"name": str, "paper_count": int}, ...]} ordered by descending count.
  - On failure returns {"error": ...}.
- as_tools() -> list[langchain_core.tools.BaseTool]
  - Provides LangChain StructuredTools:
    - query_arxiv_authors
    - query_arxiv_papers
    - get_arxiv_schema
    - execute_arxiv_query
    - get_frequent_authors
- as_api(router: fastapi.APIRouter, ...) -> None
  - Registers FastAPI routes:
    - POST /arxiv/query-authors
    - POST /arxiv/query-papers
    - POST /arxiv/schema
    - POST /arxiv/query

Configuration/Dependencies

File storage:
- Loads all *.ttl files from ArXivQueryWorkflowConfiguration.storage_path.
Ontology file:
- src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl is read by get_schema().
Key dependencies (imported):
- rdflib (Graph, SPARQL querying)
- naas_abi_core.utils.Graph.ABIGraph (used as combined graph)
- fastapi (APIRouter) for as_api
- langchain_core.tools (StructuredTool) for as_tools
- pydantic for request/parameter models

Usage

Run queries directly (Python)

from naas_abi_marketplace.applications.arxiv.workflows.ArXivQueryWorkflow import (
    ArXivQueryWorkflow,
    ArXivQueryWorkflowConfiguration,
    AuthorQueryParameters,
    PaperQueryParameters,
    SparqlQueryParameters,
)

wf = ArXivQueryWorkflow(ArXivQueryWorkflowConfiguration(
    storage_path="storage/triplestore/application-level/arxiv"
))

print(wf.query_authors(AuthorQueryParameters(paper_id="2206.11097")))
print(wf.query_papers(PaperQueryParameters(author_name="smith")))
print(wf.execute_query(SparqlQueryParameters(query="""
PREFIX abi: <http://ontology.naas.ai/abi/>
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
""")))

Expose as FastAPI endpoints

from fastapi import FastAPI, APIRouter
from naas_abi_marketplace.applications.arxiv.workflows.ArXivQueryWorkflow import (
    ArXivQueryWorkflow, ArXivQueryWorkflowConfiguration
)

app = FastAPI()
router = APIRouter()
wf = ArXivQueryWorkflow(ArXivQueryWorkflowConfiguration())

wf.as_api(router)
app.include_router(router)

Caveats

Graph loading:
- If storage_path does not exist or contains no .ttl files, methods return empty result sets (and print warnings to stdout).
- Per-file parse errors are printed and skipped.
Query construction:
- query_authors() / query_papers() interpolate user input directly into SPARQL strings; malformed input may break queries.
API coverage:
- as_api() does not expose get_frequent_authors() as an endpoint (only available via direct call or as_tools()).
Schema path is fixed:
- get_schema() always reads from src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl (not configurable via workflow configuration).

What it is​

Public API​

Configuration​

Pydantic parameter models​

Workflow​

Configuration/Dependencies​

Usage​

Run queries directly (Python)​

Expose as FastAPI endpoints​

Caveats​

What it is

Public API

Configuration

Pydantic parameter models

Workflow

Configuration/Dependencies

Usage

Run queries directly (Python)

Expose as FastAPI endpoints

Caveats