The UNSPSC (United Nations Standard Products and Services Code) is a standardized, global classification system for products and services. That consistency makes it useful for identifying and categorizing hazardous materials.
Please note that the data source may be out of date and may be subject to copyright protection. Verify that the data is current and that you have permission to use it before relying on it.
The code below is unfinished, and a lot of information is lost when the table is extracted from the PDF file. Moreover, UNSPSC codes with a non-numeric FSC are disregarded. I am most interested in suggestions for improvements.
%python
!pip install PyPDF2
import io
import re as regex

import pandas as pd
import PyPDF2
import requests

# Download the UNSPSC-FSC crosswalk PDF and open it from memory
response = requests.get('https://www.commerce.gov/sites/default/files/2023-07/UNSPSC-FSC%20Master%20Crosswalk.pdf')
response.raise_for_status()
reader = PyPDF2.PdfReader(io.BytesIO(response.content))

rows = {'UNSPSC': [], 'FSC': [], 'FSC_DS': []}
for page in reader.pages:
    # Columns: UNSPSC_C SEGMENT FAMILY UCLASS COMMODITY UNSPSC_TITLE FSC FSC_DS
    lines = page.extract_text().split('\n')
    # Skip the first line of each page, which is assumed to hold the column headers
    for line in lines[1:]:
        # A numeric FSC is a standalone 4-digit group; non-numeric FSCs are skipped
        fsc = regex.search(r"\W([0-9]{4})\W", line)
        if fsc:
            rows['UNSPSC'].append(line[:8].strip())
            rows['FSC'].append(fsc.group(1))
            # Everything after the matched FSC is taken as its description
            rows['FSC_DS'].append(line[fsc.end(1):].strip())

# Build the DataFrame once, after all pages have been read
df = pd.DataFrame(rows)
df[df["FSC_DS"] == "CHEMICALS"]
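One possible way to keep the UNSPSC title and the non-numeric FSC codes is to parse each extracted line with a single regular expression that uses named groups. The sketch below is only an illustration under several assumptions that have not been checked against the PDF: that each line follows the column order from the comment above, that an FSC is four characters made up of digits or capital letters, and that the FSC description contains no lowercase letters. The names `LINE_PATTERN` and `parse_line` are hypothetical, and the sample line in the usage note is invented, not taken from the crosswalk.

```python
import re

# Assumed line layout (from the column comment in the code above):
# UNSPSC_C SEGMENT FAMILY UCLASS COMMODITY UNSPSC_TITLE FSC FSC_DS
LINE_PATTERN = re.compile(
    r"^(?P<UNSPSC>\d{8})\s+"      # 8-digit UNSPSC code
    r"(?:\d{2}\s+){4}"            # segment, family, class, commodity (2 digits each)
    r"(?P<UNSPSC_TITLE>.*?)\s+"   # title, matched lazily
    r"(?P<FSC>[0-9A-Z]{4})\s+"    # assumption: FSC is 4 digits or capital letters
    r"(?P<FSC_DS>[^a-z]+)$"       # assumption: description has no lowercase letters
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not fit the assumed layout."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None
```

Calling `parse_line("12345678 12 34 56 78 Example title R425 SUPPORT SERVICES")` on that invented line returns a dict with the keys `UNSPSC`, `UNSPSC_TITLE`, `FSC` and `FSC_DS`. Header lines and anything else that does not start with an 8-digit code return `None`, so the per-page line counter would no longer be needed. Titles that themselves contain a four-character capitalized word could still be split in the wrong place, so the result should be spot-checked against the PDF.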
