Python: Extract UNSPSC from a PDF

Using the UNSPSC (United Nations Standard Products and Services Code) to identify hazardous materials is useful because it is a standardized, globally consistent classification system. It makes products and services, including hazardous materials, easier to identify and categorize.

Please note that the data source may contain information that is not up-to-date and could also be subject to copyright protection. It’s important to verify the currency and legal permissions before using such data.

The code below is not finished, and a lot of information gets lost when the table is taken from the PDF file. Moreover, UNSPSC codes with a non-numeric FSC are disregarded. I am most interested in suggestions for improvements.

%python
!pip install PyPDF2

import io
import re

import pandas as pd
import PyPDF2
import requests

# Download the UNSPSC-FSC crosswalk PDF and stop early on an HTTP error.
response = requests.get('https://www.commerce.gov/sites/default/files/2023-07/UNSPSC-FSC%20Master%20Crosswalk.pdf')
response.raise_for_status()

reader = PyPDF2.PdfReader(io.BytesIO(response.content))

rows = {'UNSPSC': [], 'FSC': [], 'FSC_DS': []}

# Column order of the table in the PDF:
# UNSPSC_C SEGMENT FAMILY UCLASS COMMODITY UNSPSC_TITLE FSC FSC_DS
for page in reader.pages:
    lines = page.extract_text().split('\n')
    # Skip the first line of each page, which holds the table header.
    for line in lines[1:]:
        # A standalone four-digit token is taken to be the FSC.
        # Rows whose FSC is not purely numeric are skipped.
        match = re.search(r'\W([0-9]{4})\W', line)
        if match:
            rows['UNSPSC'].append(line[:8].strip())
            rows['FSC'].append(match.group(1))
            # Everything after the matched FSC is its description. Using the
            # match position avoids hitting the same digits inside the UNSPSC code.
            rows['FSC_DS'].append(line[match.end(1):].strip())

# Build the DataFrame once, after all pages have been read.
df = pd.DataFrame(rows)

df[df['FSC_DS'] == 'CHEMICALS']
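One possible improvement is to parse each line with a single anchored pattern, so that FSC values containing letters are kept as well. The sketch below is only that: it assumes each extracted line starts with the eight-digit UNSPSC code and ends with a four-character FSC followed by an upper-case description, which I have not checked against the real PDF text. `ROW_PATTERN`, `parse_row` and the sample lines are made up for illustration. Requiring at least one digit in the FSC is a guess, meant to stop upper-case words in the title from being read as codes.

```python
import re

# Assumed row layout (not verified against the PDF):
# 8-digit UNSPSC, free text, 4-character FSC, upper-case description.
ROW_PATTERN = re.compile(
    r"^(?P<unspsc>\d{8})\s+"                     # UNSPSC code at the start of the line
    r"(?P<middle>.*?)\s+"                        # hierarchy columns and title, kept as one string
    r"(?P<fsc>(?=[A-Z0-9]*\d)[A-Z0-9]{4})\s+"    # FSC: 4 characters, at least one digit
    r"(?P<fsc_ds>[A-Z0-9][A-Z0-9 ,&/().\-]*)$"   # description: upper case up to the end of the line
)


def parse_row(line):
    """Return a dict for one table row, or None if the line does not fit the pattern."""
    match = ROW_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Invented sample lines, for illustration only (not taken from the crosswalk).
samples = [
    "12345678 12 34 56 78 Example product title 6810 CHEMICALS",
    "87654321 87 65 43 21 Example service title J099 EXAMPLE SERVICE",
    "UNSPSC_C SEGMENT FAMILY UCLASS COMMODITY UNSPSC_TITLE FSC FSC_DS",
]
parsed = [parse_row(sample) for sample in samples]
```

Lines that return `None` could be collected in a separate list and inspected by hand, which would show how much of the table the pattern still misses.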
