ArXivQueryWorkflow
What it is
A workflow that loads local ArXiv RDF/Turtle (.ttl) files into an RDFLib graph and provides SPARQL-backed queries to:
- list authors for papers (by ID or title substring)
- list papers (by author name or category substring)
- execute custom SPARQL queries
- retrieve the ArXiv ontology schema file (as Turtle)
- expose these capabilities as LangChain tools and FastAPI endpoints
Public API
Configuration
- ArXivQueryWorkflowConfiguration(storage_path: str = "storage/triplestore/application-level/arxiv")
  - Directory containing the .ttl files to load into the combined graph.
Pydantic parameter models
- AuthorQueryParameters(paper_id: Optional[str], paper_title: Optional[str])
  - Inputs for author lookup by paper ID and/or title substring.
- PaperQueryParameters(author_name: Optional[str], category: Optional[str])
  - Inputs for paper lookup by author name and/or category substring.
- SchemaParameters()
  - No fields (placeholder).
- SparqlQueryParameters(query: str)
  - SPARQL query string to execute.
Workflow
class ArXivQueryWorkflow(configuration: ArXivQueryWorkflowConfiguration)
- query_authors(parameters: AuthorQueryParameters) -> Dict[str, Any]
  - Returns {"papers": [...]} with paper id, title, and unique authors list.
  - If neither paper_id nor paper_title is provided, returns {"error": ...}.
- query_papers(parameters: PaperQueryParameters) -> Dict[str, Any]
  - Returns {"papers": [...]} with paper id, title, and optional pdf_url.
  - If neither author_name nor category is provided, returns {"error": ...}.
- get_schema(parameters: SchemaParameters) -> Dict[str, str]
  - Reads src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl and returns {"schema": "<ttl content>"}.
  - If the file is missing, returns {"error": ...}.
- execute_query(parameters: SparqlQueryParameters) -> Dict[str, Any]
  - Executes the provided SPARQL query and returns {"results": [ {var: value, ...}, ... ]}.
  - On failure, returns {"error": ...}.
- get_frequent_authors() -> Dict[str, Any]
  - Returns {"authors": [{"name": str, "paper_count": int}, ...]} ordered by descending count.
  - On failure, returns {"error": ...}.
- as_tools() -> list[langchain_core.tools.BaseTool]
  - Provides LangChain StructuredTools:
    - query_arxiv_authors
    - query_arxiv_papers
    - get_arxiv_schema
    - execute_arxiv_query
    - get_frequent_authors
- as_api(router: fastapi.APIRouter, ...) -> None
  - Registers FastAPI routes:
    - POST /arxiv/query-authors
    - POST /arxiv/query-papers
    - POST /arxiv/schema
    - POST /arxiv/query
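Because the query methods report failures as an {"error": ...} dictionary instead of raising, callers may want a small wrapper that turns that into an exception. The following is a minimal sketch, not part of the workflow: unwrap and WorkflowError are hypothetical names, and the sample payloads are made-up data shaped like the documented return values.

```python
from typing import Any, Dict, List


class WorkflowError(RuntimeError):
    """Raised when a workflow result carries an "error" key (hypothetical helper)."""


def unwrap(result: Dict[str, Any], key: str) -> List[Any]:
    """Return result[key], or raise if the workflow reported an error."""
    if "error" in result:
        raise WorkflowError(str(result["error"]))
    # A missing key is treated as "no rows" rather than as a failure.
    return result.get(key, [])


# Stand-in payloads shaped like the documented return values (illustrative only).
ok = {"papers": [{"id": "0000.00000", "title": "Example title", "authors": ["A. Author"]}]}
failed = {"error": "example error message"}

papers = unwrap(ok, "papers")
print(papers[0]["title"])

try:
    unwrap(failed, "papers")
except WorkflowError as exc:
    print(f"query failed: {exc}")
```

With a real workflow instance, the same helper would be called as unwrap(wf.query_authors(...), "papers") or unwrap(wf.execute_query(...), "results").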
Configuration/Dependencies
- File storage:
  - Loads all *.ttl files from ArXivQueryWorkflowConfiguration.storage_path.
- Ontology file:
  - src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl is read by get_schema().
- Key dependencies (imported):
  - rdflib (Graph, SPARQL querying)
  - naas_abi_core.utils.Graph.ABIGraph (used as combined graph)
  - fastapi (APIRouter) for as_api
  - langchain_core.tools (StructuredTool) for as_tools
  - pydantic for request/parameter models
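The file-loading behaviour described above (every *.ttl file in storage_path goes into one graph, and a file that fails to parse is skipped) can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual loader: load_ttl_files and fake_parse are hypothetical names, the glob is assumed to be non-recursive, and the parser is passed in as a callback so the sketch runs without RDFLib installed.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def load_ttl_files(
    storage_path: str, parse_file: Callable[[Path], None]
) -> Tuple[List[Path], List[Path]]:
    """Call parse_file on each *.ttl file; return (loaded, skipped) paths."""
    root = Path(storage_path)
    if not root.is_dir():
        print(f"Warning: {root} does not exist")
        return [], []
    loaded: List[Path] = []
    skipped: List[Path] = []
    for path in sorted(root.glob("*.ttl")):  # non-recursive: an assumption
        try:
            parse_file(path)
            loaded.append(path)
        except Exception as exc:  # one bad file should not abort the whole load
            print(f"Skipping {path.name}: {exc}")
            skipped.append(path)
    if not loaded:
        print(f"Warning: no .ttl files loaded from {root}")
    return loaded, skipped


def fake_parse(path: Path) -> None:
    # Stand-in for a real Turtle parser: rejects files without an @prefix line.
    if "@prefix" not in path.read_text():
        raise ValueError("parse error")


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "good.ttl").write_text("@prefix ex: <http://example.org/> .\n")
    Path(tmp, "bad.ttl").write_text("not turtle")
    Path(tmp, "notes.txt").write_text("ignored")
    loaded, skipped = load_ttl_files(tmp, fake_parse)
    loaded_names = [p.name for p in loaded]
    skipped_names = [p.name for p in skipped]
    print(loaded_names, skipped_names)
```

With RDFLib, the callback could be something like lambda p: graph.parse(str(p), format="turtle") on a shared rdflib.Graph, so that every successfully parsed file adds its triples to the same graph.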
Usage
Run queries directly (Python)
from naas_abi_marketplace.applications.arxiv.workflows.ArXivQueryWorkflow import (
    ArXivQueryWorkflow,
    ArXivQueryWorkflowConfiguration,
    AuthorQueryParameters,
    PaperQueryParameters,
    SparqlQueryParameters,
)

wf = ArXivQueryWorkflow(ArXivQueryWorkflowConfiguration(
    storage_path="storage/triplestore/application-level/arxiv"
))

print(wf.query_authors(AuthorQueryParameters(paper_id="2206.11097")))
print(wf.query_papers(PaperQueryParameters(author_name="smith")))
print(wf.execute_query(SparqlQueryParameters(query="""
PREFIX abi: <http://ontology.naas.ai/abi/>
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5
""")))
Expose as FastAPI endpoints
from fastapi import FastAPI, APIRouter
from naas_abi_marketplace.applications.arxiv.workflows.ArXivQueryWorkflow import (
    ArXivQueryWorkflow, ArXivQueryWorkflowConfiguration
)

app = FastAPI()
router = APIRouter()
wf = ArXivQueryWorkflow(ArXivQueryWorkflowConfiguration())
wf.as_api(router)
app.include_router(router)
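Once the router is mounted, a client could call the registered routes over HTTP. The sketch below only builds the request for POST /arxiv/query-authors using the standard library. It assumes the route accepts a JSON body whose fields mirror AuthorQueryParameters and that the app is served at a hypothetical http://localhost:8000 with no extra path prefix; neither is confirmed here, so check the generated OpenAPI docs of your app.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical; wherever the app is served


def build_query_authors_request(paper_id: str) -> Request:
    """Build (but do not send) a POST request for the author lookup route."""
    # Assumed body shape: the fields of AuthorQueryParameters as a JSON object.
    body = json.dumps({"paper_id": paper_id}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/arxiv/query-authors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_authors_request("2206.11097")
print(req.get_method(), req.full_url)
# To send it once the server is running: urllib.request.urlopen(req)
```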
Caveats
- Graph loading:
  - If storage_path does not exist or contains no .ttl files, methods return empty result sets (and print warnings to stdout).
  - Per-file parse errors are printed and skipped.
- Query construction:
  - query_authors() / query_papers() interpolate user input directly into SPARQL strings; malformed input may break queries.
- API coverage:
  - as_api() does not expose get_frequent_authors() as an endpoint (only available via direct call or as_tools()).
- Schema path is fixed:
  - get_schema() always reads from src/custom/modules/arxiv_agent/ontologies/ArXivOntology.ttl (not configurable via workflow configuration).
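One way to reduce the risk noted under query construction is to escape user input before placing it inside a SPARQL string literal. The following is a minimal sketch of that idea, not what the workflow currently does: sparql_string_literal and build_title_filter are hypothetical helpers, and the FILTER pattern is an assumed example of a title-substring match, since the workflow's real query text is not shown here.

```python
# Characters that must be escaped inside a double-quoted SPARQL string literal.
_SPARQL_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def sparql_string_literal(value: str) -> str:
    """Return value as a double-quoted SPARQL literal with special characters escaped."""
    # Escaping character by character avoids double-escaping the backslashes
    # that earlier replacements would otherwise introduce.
    escaped = "".join(_SPARQL_ESCAPES.get(ch, ch) for ch in value)
    return f'"{escaped}"'


def build_title_filter(title_fragment: str) -> str:
    """Build a case-insensitive substring FILTER on ?title (assumed variable name)."""
    literal = sparql_string_literal(title_fragment.lower())
    return f"FILTER(CONTAINS(LCASE(STR(?title)), {literal}))"


print(build_title_filter('graph "neural" nets'))
```

An alternative worth considering is to keep the query text constant and pass values through the initBindings argument of RDFLib's Graph.query, which avoids string interpolation altogether.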