SanaxLinkedInSalesNavigatorExtractorPipeline
What it is
A pipeline that reads an Excel export generated by the Sanax LinkedIn Sales Navigator Chrome extension and converts rows into an RDFLib Graph, then inserts the triples into a configured triple store. It also writes the produced TTL and a copy of the processed Excel to object storage.
Public API
Configuration
SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration
- triple_store: ITripleStoreService - target triple store (used for insert() and identifier lookups).
- data_store_path: str = "datastore/sanax/linkedin_sales_navigator" - declared but not used in this file.
- limit: int | None = None - optional row limit applied before processing.
Parameters
SanaxLinkedInSalesNavigatorExtractorPipelineParameters
- file_path: str - Excel file path (supports paths prefixed with "storage/").
- sheet_name: str = "LinkedIn Sales Navigator" - worksheet name.
Pipeline class
SanaxLinkedInSalesNavigatorExtractorPipeline(configuration)
- run(parameters) -> rdflib.Graph
  - Reads the Excel file (object storage first, then local).
  - Validates required columns.
  - Builds RDF individuals for:
    - LinkedIn profile pages, people, locations, positions, organizations, LinkedIn company pages
    - “DataSource” + per-row “DataSourceComponent”
    - Two “Act of Association” individuals per row (role+company and company-level), with optional computed startDate.
  - Persists graph + logs (TTL + Excel) to object storage and inserts into the triple store.
- calculate_start_date(duration_str: str, start_datetime: datetime | None = None) -> datetime | None
  - Converts durations like "7 years 5 months" to a start date by subtracting the duration from the first day of the provided/current month (UTC).
- generate_graph_date(date: datetime, date_format: str = "%Y-%m-%dT%H:%M:%S.%fZ") -> tuple[URIRef, Graph]
  - Creates a date individual (URI based on epoch milliseconds) and returns (date_uri, graph_with_date_triples).
- as_tools() -> list[langchain_core.tools.BaseTool]
  - Exposes a StructuredTool named linkedin_sales_navigator_import_excel that calls run().
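The date handling described for calculate_start_date and generate_graph_date can be illustrated with a minimal standard-library sketch. This is not the pipeline's actual code: the function names, the regular expressions, and the http://example.org/date/ base URI are assumptions made for illustration, and the second helper returns a plain string rather than an rdflib URIRef.

```python
import re
from datetime import datetime, timezone


def calculate_start_date_sketch(duration_str, start_datetime=None):
    """Subtract a '<N> years <M> months' duration from the first day of a month."""
    if not isinstance(duration_str, str):
        return None  # e.g. an empty Excel cell
    # A digit run followed by a token starting with "year" or "month".
    years = re.search(r"(\d+)\s*year", duration_str)
    months = re.search(r"(\d+)\s*month", duration_str)
    if years is None and months is None:
        return None  # unrecognized format: no start date

    total_months = (12 * int(years.group(1)) if years else 0) + (
        int(months.group(1)) if months else 0
    )

    # Anchor on the first day of the provided (or current) month, in UTC.
    base = start_datetime or datetime.now(timezone.utc)
    if base.tzinfo is None:
        # Assumption of this sketch: naive datetimes are already in UTC.
        base = base.replace(tzinfo=timezone.utc)
    else:
        base = base.astimezone(timezone.utc)
    base = base.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

    # Work on a zero-based month index so year borrowing is handled uniformly.
    index = base.year * 12 + (base.month - 1) - total_months
    return base.replace(year=index // 12, month=index % 12 + 1)


def date_uri_sketch(date, base_uri="http://example.org/date/"):
    """Build a URI string whose last segment is the epoch time in milliseconds."""
    # base_uri is a placeholder namespace, not the one used by the pipeline.
    return f"{base_uri}{int(date.timestamp() * 1000)}"


start = calculate_start_date_sketch("7 years 5 months", datetime(2024, 6, 15))
print(start, date_uri_sketch(start))
```

Doing the subtraction on a month index, rather than on days, keeps the result on the first of a month regardless of month length.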
Configuration/Dependencies
- Requires services provided via ABIModule.get_instance().engine.services:
  - Triple store service (also passed explicitly in configuration)
  - Object storage service (used by StorageUtils to get_excel, save_triples, save_excel)
- Uses SPARQL identifier lookups via SPARQLUtils.get_identifiers() to deduplicate entities.
- Excel schema requirements (must exist as columns):
  - Name, Job Title, Company, Company URL, Location, Time in Role, Time in Company, LinkedIn URL
- Key libraries:
  - pandas, rdflib, fastapi (router type only), langchain_core (tools)
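The Excel schema requirement can be checked before calling the pipeline. The following is a minimal sketch, assuming a hypothetical helper named missing_columns that is not part of the pipeline. It accepts any iterable of column names and does exact, case-sensitive matching, which may differ from how the pipeline itself validates.

```python
# Column names taken from the schema requirements listed above.
REQUIRED_COLUMNS = [
    "Name",
    "Job Title",
    "Company",
    "Company URL",
    "Location",
    "Time in Role",
    "Time in Company",
    "LinkedIn URL",
]


def missing_columns(columns):
    """Return the required column names absent from `columns`, in schema order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example with a header row that lacks two required columns.
header = ["Name", "Job Title", "Company", "Company URL", "Location", "LinkedIn URL"]
print(missing_columns(header))
```

With pandas, the same helper could be called as missing_columns(df.columns) after loading the worksheet.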
Usage
Run the pipeline
from naas_abi_marketplace.applications.sanax.pipelines.SanaxLinkedInSalesNavigatorExtractorPipeline import (
    SanaxLinkedInSalesNavigatorExtractorPipeline,
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration,
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters,
)

# triple_store must implement ITripleStoreService
config = SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)
pipeline = SanaxLinkedInSalesNavigatorExtractorPipeline(config)

graph = pipeline.run(
    SanaxLinkedInSalesNavigatorExtractorPipelineParameters(
        file_path="storage/datastore/linkedin_sales_navigator/sanax_extractor/Example.xlsx",
        sheet_name="LinkedIn Sales Navigator",
    )
)
print(len(graph))  # number of triples produced
Use as a LangChain tool
tool = SanaxLinkedInSalesNavigatorExtractorPipeline(
    SanaxLinkedInSalesNavigatorExtractorPipelineConfiguration(triple_store=triple_store)
).as_tools()[0]

result_graph = tool.func(
    file_path="storage/path/to/file.xlsx",
    sheet_name="LinkedIn Sales Navigator",
)
Caveats
- If the Excel cannot be read from storage and local reading also fails, the code may attempt len(df) without df being defined (depends on exception flow in _read_and_validate_excel).
- Local file metadata is read via os.path.getmtime("storage/..."); the file must exist at that constructed local path even if the Excel was read from object storage.
- Duration parsing for Time in Role / Time in Company only recognizes tokens containing "year" and/or "month" with a preceding digit; invalid formats return None (and no startDate is added).
- Deduplication logic uses SPARQL identifier maps; some entities (e.g., person keyed by name label) may collide across different individuals with the same label.
- as_api(...) is a stub and does not register any FastAPI routes.
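The first caveat describes a fallback read whose result may end up undefined. One way to avoid that failure mode is sketched below. It is not the pipeline's implementation: read_with_fallback and the reader callables are hypothetical names, and treating a None return as a failed read is an assumption.

```python
def read_with_fallback(readers):
    """Try each (name, reader) pair in order and return the first usable result.

    Raises FileNotFoundError listing every failure if no reader succeeds, so the
    caller never continues with an undefined result.
    """
    errors = []
    for name, reader in readers:
        try:
            result = reader()
        except Exception as exc:  # broad on purpose: any read error triggers fallback
            errors.append(f"{name}: {exc}")
            continue
        if result is not None:
            return result
        errors.append(f"{name}: returned nothing")
    raise FileNotFoundError("could not read Excel from any source (" + "; ".join(errors) + ")")


# Stand-in readers for illustration; real ones would wrap object storage and local reads.
def storage_reader():
    raise IOError("object not found")


def local_reader():
    return [{"Name": "Example Person"}]


rows = read_with_fallback([("storage", storage_reader), ("local", local_reader)])
print(len(rows))
```

Raising a single error that names every attempted source makes a failed import easier to diagnose than a later NameError on the missing result.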