SanaxLinkedInSalesNavigatorExtractorPipeline

What it is

A pipeline that reads an Excel export generated by the Sanax LinkedIn Sales Navigator Chrome extension and converts rows into an RDFLib Graph, then inserts the triples into a configured triple store. It also writes the produced TTL and a copy of the processed Excel to object storage.

Public API

Configuration

  • SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration
    • triple_store: ITripleStoreService — target triple store (used for insert() and identifier lookups).
    • data_store_path: str = "datastore/sanax/linkedin_sales_navigator" — declared but not used in this file.
    • limit: int | None = None — optional row limit applied before processing.

Parameters

  • SanaxLinkedInSalesNavigatorExtractorPipelineParameters
    • file_path: str — Excel file path (supports paths prefixed with "storage/").
    • sheet_name: str = "LinkedIn Sales Navigator" — worksheet name.

Pipeline class

  • SanaxLinkedInSalesNavigatorExtractorPipeline(configuration)
    • run(parameters) -> rdflib.Graph
      • Reads Excel (object storage first, then local).
      • Validates required columns.
      • Builds RDF individuals for:
        • LinkedIn profile pages, people, locations, positions, organizations, LinkedIn company pages
        • “DataSource” + per-row “DataSourceComponent”
        • Two “Act of Association” individuals per row (role+company and company-level), with optional computed startDate.
      • Persists graph + logs (TTL + Excel) to object storage and inserts into the triple store.
    • calculate_start_date(duration_str: str, start_datetime: datetime | None = None) -> datetime | None
      • Converts durations like "7 years 5 months" to a start date by subtracting the duration from the first day of the provided/current month (UTC).
    • generate_graph_date(date: datetime, date_format: str = "%Y-%m-%dT%H:%M:%S.%fZ") -> tuple[URIRef, Graph]
      • Creates a date individual (URI based on epoch milliseconds) and returns (date_uri, graph_with_date_triples).
    • as_tools() -> list[langchain_core.tools.BaseTool]
      • Exposes a StructuredTool named linkedin_sales_navigator_import_excel that calls run().
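The exact implementation of calculate_start_date is not reproduced here, but a minimal sketch of the duration arithmetic it describes (subtracting "N years M months" from the first day of the reference month, in UTC) might look like this. The function name and signature mirror the API above; the regex-based parsing is an assumption about how tokens like "7 years 5 months" are recognized.

```python
import re
from datetime import datetime, timezone

def calculate_start_date(duration_str, start_datetime=None):
    """Estimate a start date by subtracting a '7 years 5 months' style
    duration from the first day of the reference month (UTC).
    Returns None for unrecognized formats."""
    if not isinstance(duration_str, str):
        return None
    years_match = re.search(r"(\d+)\s*year", duration_str)
    months_match = re.search(r"(\d+)\s*month", duration_str)
    if not years_match and not months_match:
        return None  # no 'year'/'month' token with a preceding digit
    years = int(years_match.group(1)) if years_match else 0
    months = int(months_match.group(1)) if months_match else 0
    ref = start_datetime or datetime.now(timezone.utc)
    # Work in whole months, anchored to the first day of the month.
    total = (ref.year * 12 + (ref.month - 1)) - (years * 12 + months)
    return datetime(total // 12, total % 12 + 1, 1, tzinfo=timezone.utc)
```

Month-level arithmetic avoids day-of-month edge cases entirely, which matches the "first day of the provided/current month" behavior described above.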

Configuration/Dependencies

  • Requires services provided via ABIModule.get_instance().engine.services:
    • Triple store service (also passed explicitly in configuration)
    • Object storage service (used by StorageUtils to get_excel, save_triples, save_excel)
  • Uses SPARQL identifier lookups via SPARQLUtils.get_identifiers() to deduplicate entities.
  • Excel schema requirements (must exist as columns):
    • Name, Job Title, Company, Company URL, Location, Time in Role, Time in Company, LinkedIn URL
  • Key libraries:
    • pandas, rdflib, fastapi (router type only), langchain_core (tools)
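The schema check above can be sketched as a small pandas helper. This is illustrative only (the pipeline's actual validation lives in _read_and_validate_excel, whose implementation is not shown here); the function name is hypothetical.

```python
import pandas as pd

# Columns the Sanax export must contain, per the schema listed above.
REQUIRED_COLUMNS = [
    "Name", "Job Title", "Company", "Company URL",
    "Location", "Time in Role", "Time in Company", "LinkedIn URL",
]

def validate_columns(df: pd.DataFrame) -> None:
    """Raise early, with the full list of missing columns, rather than
    failing later on a per-row KeyError."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
```

Validating all columns up front gives one actionable error message instead of whichever missing column happens to be touched first during row processing.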

Usage

Run the pipeline

from naas_abi_marketplace.applications.sanax.pipelines.SanaxLinkedInSalesNavigatorExtractorPipeline import (
    SanaxLinkedInSalesNavigatorExtractorPipeline,
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration,
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters,
)

# triple_store must implement ITripleStoreService
config = SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)

pipeline = SanaxLinkedInSalesNavigatorExtractorPipeline(config)

graph = pipeline.run(
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters(
        file_path="storage/datastore/linkedin_sales_navigator/sanax_extractor/Example.xlsx",
        sheet_name="LinkedIn Sales Navigator",
    )
)

print(len(graph))

Use as a LangChain tool

tool = SanaxLinkedInSalesNavigatorExtractorPipeline(
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)
).as_tools()[0]

result_graph = tool.func(
    file_path="storage/path/to/file.xlsx",
    sheet_name="LinkedIn Sales Navigator",
)

Caveats

  • If the Excel can be read neither from object storage nor locally, the code may evaluate len(df) before df is ever assigned, raising an UnboundLocalError instead of a clear read failure (the exact behavior depends on the exception flow in _read_and_validate_excel).
  • Local file metadata is read via os.path.getmtime("storage/..."); the file must exist at that constructed local path even if the Excel was read from object storage.
  • Duration parsing for Time in Role / Time in Company only recognizes tokens containing "year" and/or "month" with a preceding digit; invalid formats return None (and no startDate is added).
  • Deduplication logic uses SPARQL identifier maps; some entities (e.g., person keyed by name label) may collide across different individuals with the same label.
  • as_api(...) is a stub and does not register any FastAPI routes.
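To illustrate the first caveat, a defensive read pattern would try object storage, fall back to the local filesystem, and fail loudly instead of leaving df undefined. This is a sketch, not the pipeline's code: the storage object and its get_excel method stand in for the StorageUtils helper named above, and their signatures are assumptions.

```python
import pandas as pd

def read_excel_with_fallback(storage, file_path, sheet_name):
    """Read an Excel sheet from object storage first, then local disk.

    `storage.get_excel` is a stand-in for the StorageUtils helper; its
    exact signature is an assumption. Raising here guarantees `df` is
    always bound before any later `len(df)` call."""
    df = None
    try:
        df = storage.get_excel(file_path, sheet_name=sheet_name)
    except Exception:
        pass  # storage miss: fall through to the local read
    if df is None:
        try:
            df = pd.read_excel(file_path, sheet_name=sheet_name)
        except Exception as exc:
            raise FileNotFoundError(
                f"Could not read {file_path!r} from object storage or local disk"
            ) from exc
    return df
```

Chaining the original exception (`raise ... from exc`) keeps the underlying storage or filesystem error visible in the traceback while still surfacing a single, explicit failure mode.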