PubMedCentralDownloader
What it is
A small utility class that locates and downloads a PubMed Central (PMC) PDF for a given PMCID. It looks up the PMCID in a local oa_file_list.txt mapping file, then streams or downloads the file from the NCBI PMC FTP endpoint.
Public API
- Class: PubMedCentralDownloader
  - find_pdf_path(pmcid: str, oa_file_list_path: str) -> str
    - Streams through the given OA file list to find the relative FTP path (PDF or .tar.gz) associated with a PMCID.
    - Raises FileNotFoundError if no matching entry is found.
  - open_pmc_pdf_stream(pmcid: str, oa_file_list_path: str = "oa_file_list.txt") -> BinaryIO
    - Returns an open binary stream containing the PMC PDF bytes.
    - If the OA entry points to a .pdf, returns the underlying HTTP raw stream.
    - If it points to a .tar.gz, downloads the archive content, extracts the first .pdf found, and returns a BytesIO stream.
    - Raises FileNotFoundError for missing PDFs or unsupported path types.
    - Caller is responsible for closing the returned stream.
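The .tar.gz branch described above can be sketched as follows. This is a minimal illustration using only the standard library, not the class's actual code: the helper name first_pdf_from_tar_gz is hypothetical, and it assumes the archive bytes have already been downloaded in full.

```python
import io
import tarfile
from typing import BinaryIO


def first_pdf_from_tar_gz(archive_bytes: bytes) -> BinaryIO:
    """Return a BytesIO holding the first .pdf member of a gzip-compressed tar.

    Hypothetical helper: it mirrors the documented behavior (first PDF found,
    FileNotFoundError otherwise) but is not taken from the real implementation.
    """
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive:
            # Skip directories and non-PDF files.
            if member.isfile() and member.name.lower().endswith(".pdf"):
                extracted = archive.extractfile(member)
                if extracted is not None:
                    # Copy the bytes out so the result outlives the archive.
                    return io.BytesIO(extracted.read())
    raise FileNotFoundError("No PDF found in archive")
```

Copying the member into a fresh BytesIO keeps the returned stream usable after the archive is closed, which matches the note that the caller owns and closes the stream.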
Configuration/Dependencies
- External dependency: requests (imported dynamically via importlib.import_module("requests"))
- Local input file: oa_file_list.txt (default name; path configurable via parameters)
- Remote base URL constant: PMC_FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"
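As a rough illustration of how the base URL and a relative path from the OA file list might combine into a download URL, consider the sketch below. The function build_pmc_url and the example path are hypothetical; the source does not show how the class composes its URLs.

```python
from urllib.parse import urljoin

# Constant as documented above.
PMC_FTP_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/"


def build_pmc_url(relative_path: str) -> str:
    """Join the base URL with a relative path from the OA file list.

    Hypothetical helper. A leading slash is stripped because urljoin would
    otherwise treat the path as absolute and drop the /pub/pmc/ prefix.
    """
    return urljoin(PMC_FTP_BASE, relative_path.lstrip("/"))


# Example with a made-up relative path:
url = build_pmc_url("oa_package/aa/bb/PMC1234567.tar.gz")
```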
Usage
from naas_abi_marketplace.applications.pubmed.integrations.PubMedAPI.PubMedCentralDownloader import (
    PubMedCentralDownloader,
)

downloader = PubMedCentralDownloader()
pmcid = "PMC1234567"

# The stream is closed automatically when the with block exits.
with downloader.open_pmc_pdf_stream(pmcid, oa_file_list_path="oa_file_list.txt") as pdf_stream:
    pdf_bytes = pdf_stream.read()

# Save the PDF bytes next to the script, named after the PMCID.
with open(f"{pmcid}.pdf", "wb") as f:
    f.write(pdf_bytes)
Caveats
- open_pmc_pdf_stream fully downloads .tar.gz archives into memory (response.content) before extracting a PDF.
- find_pdf_path expects the OA file list to contain tab-delimited (or whitespace-delimited) lines where:
  - parts[0] is the relative path, and
  - one of parts[1:] equals the provided pmcid.
- Network call uses a fixed timeout=60 and a hardcoded User-Agent string.
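The line format that find_pdf_path expects can be made concrete with a short sketch of the lookup. This is an assumption-based illustration, not the actual method: the function name find_relative_path is hypothetical, and splitting on tabs first with a whitespace fallback is one plausible reading of "tab-delimited (or whitespace-delimited)".

```python
def find_relative_path(pmcid: str, oa_file_list_path: str) -> str:
    """Scan the OA file list line by line and return the matching relative path.

    Hypothetical sketch of the documented contract: parts[0] is the relative
    path, and a line matches when any of parts[1:] equals the PMCID.
    """
    with open(oa_file_list_path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Prefer tab-delimited fields; fall back to any whitespace.
            parts = line.split("\t") if "\t" in line else line.split()
            if len(parts) > 1 and pmcid in parts[1:]:
                return parts[0]
    raise FileNotFoundError(f"No entry for {pmcid} in {oa_file_list_path}")
```

Reading the file one line at a time keeps memory use flat even when the list is large, which fits the "streams through" wording in the API description.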