SanaxLinkedInSalesNavigatorExtractorPipeline

What it is

A pipeline that reads an Excel export generated by the Sanax LinkedIn Sales Navigator Chrome extension and converts rows into an RDFLib Graph, then inserts the triples into a configured triple store. It also writes the produced TTL and a copy of the processed Excel to object storage.

Public API

Configuration

  • SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration
    • triple_store: ITripleStoreService — target triple store (used for insert() and identifier lookups).
    • data_store_path: str = "datastore/sanax/linkedin_sales_navigator" — declared but not used in this file.
    • limit: int | None = None — optional row limit applied before processing.

Parameters

  • SanaxLinkedInSalesNavigatorExtractorPipelineParameters
    • file_path: str — Excel file path (supports paths prefixed with "storage/").
    • sheet_name: str = "LinkedIn Sales Navigator" — worksheet name.

Pipeline class

  • SanaxLinkedInSalesNavigatorExtractorPipeline(configuration)
    • run(parameters) -> rdflib.Graph
      • Reads Excel (object storage first, then local).
      • Validates required columns.
      • Builds RDF individuals for:
        • LinkedIn profile pages, people, locations, positions, organizations, LinkedIn company pages
        • “DataSource” + per-row “DataSourceComponent”
        • Two “Act of Association” individuals per row (role+company and company-level), with optional computed startDate.
      • Persists graph + logs (TTL + Excel) to object storage and inserts into the triple store.
    • calculate_start_date(duration_str: str, start_datetime: datetime | None = None) -> datetime | None
      • Converts durations like "7 years 5 months" to a start date by subtracting the duration from the first day of the provided/current month (UTC).
    • generate_graph_date(date: datetime, date_format: str = "%Y-%m-%dT%H:%M:%S.%fZ") -> tuple[URIRef, Graph]
      • Creates a date individual (URI based on epoch milliseconds) and returns (date_uri, graph_with_date_triples).
    • as_tools() -> list[langchain_core.tools.BaseTool]
      • Exposes a StructuredTool named linkedin_sales_navigator_import_excel that calls run().
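The exact implementation of calculate_start_date is not reproduced here, but a minimal sketch of the duration arithmetic it describes (subtracting "N years M months" from the first day of the reference month, in UTC) might look like this. The function name and signature mirror the API above; the regex-based parsing is an assumption about how tokens like "7 years 5 months" are recognized.

```python
import re
from datetime import datetime, timezone

def calculate_start_date(duration_str, start_datetime=None):
    """Estimate a start date by subtracting a '7 years 5 months' style
    duration from the first day of the reference month (UTC).
    Returns None for unrecognized formats."""
    if not isinstance(duration_str, str):
        return None
    years_match = re.search(r"(\d+)\s*year", duration_str)
    months_match = re.search(r"(\d+)\s*month", duration_str)
    if not years_match and not months_match:
        return None  # no 'year'/'month' token with a preceding digit
    years = int(years_match.group(1)) if years_match else 0
    months = int(months_match.group(1)) if months_match else 0
    ref = start_datetime or datetime.now(timezone.utc)
    # Work in whole months, anchored to the first day of the month.
    total = (ref.year * 12 + (ref.month - 1)) - (years * 12 + months)
    return datetime(total // 12, total % 12 + 1, 1, tzinfo=timezone.utc)
```

Month-level arithmetic avoids day-of-month edge cases entirely, which matches the "first day of the provided/current month" behavior described above.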

Configuration/Dependencies

  • Requires services provided via ABIModule.get_instance().engine.services:
    • Triple store service (also passed explicitly in configuration)
    • Object storage service (used by StorageUtils to get_excel, save_triples, save_excel)
  • Uses SPARQL identifier lookups via SPARQLUtils.get_identifiers() to deduplicate entities.
  • Excel schema requirements (must exist as columns):
    • Name, Job Title, Company, Company URL, Location, Time in Role, Time in Company, LinkedIn URL
  • Key libraries:
    • pandas, rdflib, fastapi (router type only), langchain_core (tools)
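The schema check above can be sketched as a small pandas helper. This is illustrative only (the pipeline's actual validation lives in _read_and_validate_excel, whose implementation is not shown here); the function name is hypothetical.

```python
import pandas as pd

# Columns the Sanax export must contain, per the schema listed above.
REQUIRED_COLUMNS = [
    "Name", "Job Title", "Company", "Company URL",
    "Location", "Time in Role", "Time in Company", "LinkedIn URL",
]

def validate_columns(df: pd.DataFrame) -> None:
    """Raise early, with the full list of missing columns, rather than
    failing later on a per-row KeyError."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
```

Validating all columns up front gives one actionable error message instead of whichever missing column happens to be touched first during row processing.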

Usage

Run the pipeline

from naas_abi_marketplace.applications.sanax.pipelines.SanaxLinkedInSalesNavigatorExtractorPipeline import (
    SanaxLinkedInSalesNavigatorExtractorPipeline,
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration,
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters,
)

# triple_store must implement ITripleStoreService
config = SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)

pipeline = SanaxLinkedInSalesNavigatorExtractorPipeline(config)

graph = pipeline.run(
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters(
        file_path="storage/datastore/linkedin_sales_navigator/sanax_extractor/Example.xlsx",
        sheet_name="LinkedIn Sales Navigator",
    )
)

print(len(graph))

Use as a LangChain tool

tool = SanaxLinkedInSalesNavigatorExtractorPipeline(
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)
).as_tools()[0]

result_graph = tool.func(
    file_path="storage/path/to/file.xlsx",
    sheet_name="LinkedIn Sales Navigator",
)

Caveats

  • If the Excel can be read neither from object storage nor locally, the code may evaluate len(df) before df is ever assigned, raising an UnboundLocalError instead of a clear read failure (the exact behavior depends on the exception flow in _read_and_validate_excel).
  • Local file metadata is read via os.path.getmtime("storage/..."); the file must exist at that constructed local path even if the Excel was read from object storage.
  • Duration parsing for Time in Role / Time in Company only recognizes tokens containing "year" and/or "month" with a preceding digit; invalid formats return None (and no startDate is added).
  • Deduplication logic uses SPARQL identifier maps; some entities (e.g., person keyed by name label) may collide across different individuals with the same label.
  • as_api(...) is a stub and does not register any FastAPI routes.
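To illustrate the first caveat, a defensive read pattern would try object storage, fall back to the local filesystem, and fail loudly instead of leaving df undefined. This is a sketch, not the pipeline's code: the storage object and its get_excel method stand in for the StorageUtils helper named above, and their signatures are assumptions.

```python
import pandas as pd

def read_excel_with_fallback(storage, file_path, sheet_name):
    """Read an Excel sheet from object storage first, then local disk.

    `storage.get_excel` is a stand-in for the StorageUtils helper; its
    exact signature is an assumption. Raising here guarantees `df` is
    always bound before any later `len(df)` call."""
    df = None
    try:
        df = storage.get_excel(file_path, sheet_name=sheet_name)
    except Exception:
        pass  # storage miss: fall through to the local read
    if df is None:
        try:
            df = pd.read_excel(file_path, sheet_name=sheet_name)
        except Exception as exc:
            raise FileNotFoundError(
                f"Could not read {file_path!r} from object storage or local disk"
            ) from exc
    return df
```

Chaining the original exception (`raise ... from exc`) keeps the underlying storage or filesystem error visible in the traceback while still surfacing a single, explicit failure mode.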