OpenAI Text Parsing(Python)

Loading...

Showcases OpenAIs ability to parse out relevent information from an entire PDF page.

Adding this command cell just incase user is missing openai python library.

pip install openai
Python interpreter will be restarted. Collecting openai Downloading openai-0.27.6-py3-none-any.whl (71 kB) Collecting tqdm Downloading tqdm-4.65.0-py3-none-any.whl (77 kB) Requirement already satisfied: requests>=2.20 in /databricks/python3/lib/python3.9/site-packages (from openai) (2.26.0) Collecting aiohttp Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB) Requirement already satisfied: idna<4,>=2.5 in /databricks/python3/lib/python3.9/site-packages (from requests>=2.20->openai) (3.2) Requirement already satisfied: charset-normalizer~=2.0.0 in /databricks/python3/lib/python3.9/site-packages (from requests>=2.20->openai) (2.0.4) Requirement already satisfied: urllib3<1.27,>=1.21.1 in /databricks/python3/lib/python3.9/site-packages (from requests>=2.20->openai) (1.26.7) Requirement already satisfied: certifi>=2017.4.17 in /databricks/python3/lib/python3.9/site-packages (from requests>=2.20->openai) (2021.10.8) Collecting frozenlist>=1.1.1 Downloading frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB) Collecting yarl<2.0,>=1.0 Downloading yarl-1.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (269 kB) Collecting async-timeout<5.0,>=4.0.0a3 Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB) Collecting aiosignal>=1.1.2 Downloading aiosignal-1.3.1-py3-none-any.whl (7.6 kB) Collecting multidict<7.0,>=4.5 Downloading multidict-6.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB) Requirement already satisfied: attrs>=17.3.0 in /databricks/python3/lib/python3.9/site-packages (from aiohttp->openai) (21.2.0) Installing collected packages: multidict, frozenlist, yarl, async-timeout, aiosignal, tqdm, aiohttp, openai Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 frozenlist-1.3.3 multidict-6.0.4 openai-0.27.6 tqdm-4.65.0 yarl-1.9.2 Python interpreter will be restarted.
pip install pdfplumber
Python interpreter will be restarted. Collecting pdfplumber Downloading pdfplumber-0.9.0-py3-none-any.whl (46 kB) Collecting Wand>=0.6.10 Downloading Wand-0.6.11-py2.py3-none-any.whl (143 kB) Collecting Pillow>=9.1 Downloading Pillow-9.5.0-cp39-cp39-manylinux_2_28_x86_64.whl (3.4 MB) Collecting pdfminer.six==20221105 Downloading pdfminer.six-20221105-py3-none-any.whl (5.6 MB) Requirement already satisfied: charset-normalizer>=2.0.0 in /databricks/python3/lib/python3.9/site-packages (from pdfminer.six==20221105->pdfplumber) (2.0.4) Collecting cryptography>=36.0.0 Downloading cryptography-40.0.2-cp36-abi3-manylinux_2_28_x86_64.whl (3.7 MB) Requirement already satisfied: cffi>=1.12 in /databricks/python3/lib/python3.9/site-packages (from cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (1.14.6) Requirement already satisfied: pycparser in /databricks/python3/lib/python3.9/site-packages (from cffi>=1.12->cryptography>=36.0.0->pdfminer.six==20221105->pdfplumber) (2.20) Installing collected packages: cryptography, Wand, Pillow, pdfminer.six, pdfplumber Attempting uninstall: cryptography Found existing installation: cryptography 3.4.8 Not uninstalling cryptography at /databricks/python3/lib/python3.9/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-d3129cee-e864-4618-9c69-e1f62321ba74 Can't uninstall 'cryptography'. No files were found to uninstall. Attempting uninstall: Pillow Found existing installation: Pillow 8.4.0 Not uninstalling pillow at /databricks/python3/lib/python3.9/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-d3129cee-e864-4618-9c69-e1f62321ba74 Can't uninstall 'Pillow'. No files were found to uninstall. Successfully installed Pillow-9.5.0 Wand-0.6.11 cryptography-40.0.2 pdfminer.six-20221105 pdfplumber-0.9.0 Python interpreter will be restarted.

We read directly from the PDF that was imported to Databricks

File Location

/Workspace/Users/chhirsh@gmail.com/Alimak-Scando-650-Brochure-w-max-lift-logo.pdf

Script searches through entire PDF marking any pages that contain the words "Technical specifications"

import os
import openai
import pdfplumber
import re
import pandas as pd

openai.api_key = "sk-uCplXmEuLWmxxxxxxxxxxxxxZqpcU7hXqhgWBbLaBj"
results = []

#Open File an search for "Technical specifications"

pdf_file = "/Workspace/Users/chhirsh@gmail.com/Alimak-Scando-650-Brochure-w-max-lift-logo.pdf"
pattern = re.compile(r"Technical specifications", flags=re.IGNORECASE)
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 0):
        content = pg.extract_text()
        for match in pattern.finditer(content):
            print('Keyword:',match.group(), '|  Page:', page_nr, '| Location: ',content.index(match.group()))
            foundPageNum = page_nr
Keyword: TECHNICAL SPECIFICATIONS | Page: 3 | Location: 765

Show the Text extracted from the PDF

table_page = pdf.pages[foundPageNum]
testouput01 = table_page.extract_text_simple(x_tolerance=3, y_tolerance=3)

testouput01
Out[2]: 'ALIMAK SCANDO 650\nMain characteristics\n• Robust design and excellent comfort \n• Modular car design with extensions and multiple\ndoor and ramp options\n• Single and dual car configurations \n• Triple entrance possible at ground level and at any\nlanding level\n• A high-efficiency helical gearbox provides lower\npower consumption and reduced operational costs\n• ALC-II collective control system with group control\nand internal fault diagnostics system. Control panel\nmovable between two sides \n• Built-in stainless steel electrical cabinet maximises\ninternal hoist dimensions for increased lifting\ncapacity \n• Wide range of optional equipment and functions\n• Remote Monitoring System — A3\n• Fully compliant with EN, ANSI, AS, PB and the\nmajority of all national regulations\nTECHNICAL SPECIFICATIONS\nALIMAK SCANDO 650 DOL FC\nMotor control Direct-on-line (DOL) Frequency control (FC)\nMax. payload capacity 1,500–3,200 kg (3,300–7,054 lbs) * 1,500–3,200 kg (3,300–7,054 lbs) *\nMax. travelling speed 38 m/min. (125 fpm) 66 m/min.(216 fpm)\nMax. lifting height 250 / 400 m (825 / 1300 ft) * 250 / 400 m (825 / 1,300 ft) *\nCar width, internal 1.5 m (4’-11”) 1.5 m (4’-11”)\nCar length, internal 2.8–5.0 m (9\'- 2 1⁄4"-16\'- 4 3⁄4") 2.8–5.0 m (9\'- 2 1⁄4"-16\'- 4 3⁄4")\nCar height, internal 2.3 m (7’-6 1/2”) 2.3 m (7’-6 1/2”)\nMax. load space per car 17 m3(610 ft3) 17 m3(610 ft3)\nNo. of motors 2 x 11 kW 2 or 3 x 11 kW\nSafety device type ALIMAK GFD ALIMAK GFD\nPower supply range 400–500 V, 50 or 60 Hz, 3 phase 400–500 V, 50 or 60 Hz, 3 phase\nFuse ratings From 60 A From 63 A \nType of mast 650, tubular steel with integrated rack 650, tubular steel with integrated rack\nLength mast section 1.508 m (4’-11 3⁄8”) 1.508 m (4’-11 3⁄8”)\nWeight mast section with 1 rack 118 kg (260 lbs) 118 kg (260 lbs)\nRack module 5 5 \n* Increased payload capacity and lifting height and on request.\nFor other demands or specifications, please consult your Alimak representative.\n9\n1\n0\nwww.alimak.com 2\nb. \ne\nF\nN/\nE\n7 \nPictures are illustrative only and do not necessarily show the configuration of products on the market at a given point in time. Products must be used in conformity with safe practice and applicable 21\nstatutes, regulations, codes and ordinances. Specifications of products and equipment shown herein are subject to change without notice. Copyright © 2019 Alimak Group. All rights reserved. 1\nAlimak and Scando are registered trademarks of Alimak Group. '

Using the example text from the Alimak Scando 650 brocure, AI will will read through and extract the relevant info requested.

the response from the AI returned in dilimeted format, with the "split" function we are able to write each line of the response to a point in a python array.

response = openai.Completion.create(
  model="text-davinci-003",
  prompt=f"Please extract all mechanical measurment contianing numbers into the table format [section | Spec] from the input text.\n\n\"\"\"{page_text}\"\"\"",
  temperature=0.7,
  max_tokens=3000,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

output = response.choices[0].text.split('\n')
output
Out[2]: ['', '', 'Section | Spec', 'Motor control | Direct-on-line (DOL) Frequency control (FC)', 'Max. payload capacity | 1,500–3,200 kg (3,300–7,054 lbs)', 'Max. travelling speed | 38 m/min. (125 fpm) 66 m/min.(216 fpm)', 'Max. lifting height | 250 / 400 m (825 / 1300 ft)', 'Car width, internal | 1.5 m (4’-11”)', 'Car length, internal | 2.8–5.0 m (9\'- 2 1⁄4"-16\'- 4 3⁄4")', 'Car height, internal | 2.3 m (7’-6 1/2”)', 'Max. load space per car | 17 m3(610 ft3)', 'No. of motors | 2 x 11 kW 2 or 3 x 11 kW', 'Safety device type | ALIMAK GFD', 'Power supply range | 400–500 V, 50 or 60 Hz, 3 phase', 'Fuse ratings | From 60 A', 'Type of mast | 650, tubular steel with integrated rack', 'Length mast section | 1.508 m (4’-11 3⁄8”)', 'Weight mast section with 1 rack | 118 kg (260 lbs)', 'Rack module | 5']

After the output format is converted to an array we can then piviot that array into a data table to easily view.

for row in output:
    temp = row.split('|')
    results.append(temp)

output_df = pd.DataFrame(results,columns = ['Section', 'Spec'])

# Add adtional columns with plan to expand more makes and models
output_df['Make'] = 'Alimak'
output_df['Model'] = 'Scando 650'

display(output_df)
 
Section
Spec
Make
Model
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
null
Alimak
Scando 650
null
Alimak
Scando 650
Section
Spec
Alimak
Scando 650
Motor control
Direct-on-line (DOL) Frequency control (FC)
Alimak
Scando 650
Max. payload capacity
1,500–3,200 kg (3,300–7,054 lbs)
Alimak
Scando 650
Max. travelling speed
38 m/min. (125 fpm) 66 m/min.(216 fpm)
Alimak
Scando 650
Max. lifting height
250 / 400 m (825 / 1300 ft)
Alimak
Scando 650
Car width, internal
1.5 m (4’-11”)
Alimak
Scando 650
Car length, internal
2.8–5.0 m (9'- 2 1⁄4"-16'- 4 3⁄4")
Alimak
Scando 650
Car height, internal
2.3 m (7’-6 1/2”)
Alimak
Scando 650
Max. load space per car
17 m3(610 ft3)
Alimak
Scando 650
No. of motors
2 x 11 kW 2 or 3 x 11 kW
Alimak
Scando 650
Safety device type
ALIMAK GFD
Alimak
Scando 650
Power supply range
400–500 V, 50 or 60 Hz, 3 phase
Alimak
Scando 650
Fuse ratings
From 60 A
Alimak
Scando 650
Type of mast
650, tubular steel with integrated rack
Alimak
Scando 650
Length mast section
1.508 m (4’-11 3⁄8”)
Alimak
Scando 650
Weight mast section with 1 rack
118 kg (260 lbs)
Alimak
Scando 650
Rack module
5
Alimak
Scando 650
19 rows

Write the data to a Delta Table

data = spark.createDataFrame(output_df)

data.write.mode("overwrite").saveAsTable("hive_metastore.default.bronze_heavy_machinery")

Verify the Table was created

%sql

SELECT * FROM default.bronze_heavy_machinery;
 
Section
Spec
Make
Model
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
null
Alimak
Scando 650
null
Alimak
Scando 650
Section
Spec
Alimak
Scando 650
Motor control
Direct-on-line (DOL) Frequency control (FC)
Alimak
Scando 650
Max. payload capacity
1,500–3,200 kg (3,300–7,054 lbs)
Alimak
Scando 650
Max. travelling speed
38 m/min. (125 fpm) 66 m/min.(216 fpm)
Alimak
Scando 650
Max. lifting height
250 / 400 m (825 / 1300 ft)
Alimak
Scando 650
Car width, internal
1.5 m (4’-11”)
Alimak
Scando 650
Car length, internal
2.8–5.0 m (9'- 2 1⁄4"-16'- 4 3⁄4")
Alimak
Scando 650
Car height, internal
2.3 m (7’-6 1/2”)
Alimak
Scando 650
Max. load space per car
17 m3(610 ft3)
Alimak
Scando 650
No. of motors
2 x 11 kW 2 or 3 x 11 kW
Alimak
Scando 650
Safety device type
ALIMAK GFD
Alimak
Scando 650
Power supply range
400–500 V, 50 or 60 Hz, 3 phase
Alimak
Scando 650
Fuse ratings
From 60 A
Alimak
Scando 650
Type of mast
650, tubular steel with integrated rack
Alimak
Scando 650
Length mast section
1.508 m (4’-11 3⁄8”)
Alimak
Scando 650
19 rows

Clean Up the table using SQL and create silver level

%sql

CREATE TABLE hive_metastore.default.silver_heavy_machinery AS (

SELECT * FROM default.bronze_heavy_machinery
WHERE 
Section NOT IN ('','null','Section ')
OR
Spec NOT IN ('','null',' Spec')
);
Query returned no results

Display created silver level table

%sql

SELECT * FROM default.silver_heavy_machinery;
 
Section
Spec
Make
Model
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Motor control
Direct-on-line (DOL) Frequency control (FC)
Alimak
Scando 650
Max. payload capacity
1,500–3,200 kg (3,300–7,054 lbs)
Alimak
Scando 650
Max. travelling speed
38 m/min. (125 fpm) 66 m/min.(216 fpm)
Alimak
Scando 650
Max. lifting height
250 / 400 m (825 / 1300 ft)
Alimak
Scando 650
Car width, internal
1.5 m (4’-11”)
Alimak
Scando 650
Car length, internal
2.8–5.0 m (9'- 2 1⁄4"-16'- 4 3⁄4")
Alimak
Scando 650
Car height, internal
2.3 m (7’-6 1/2”)
Alimak
Scando 650
Max. load space per car
17 m3(610 ft3)
Alimak
Scando 650
No. of motors
2 x 11 kW 2 or 3 x 11 kW
Alimak
Scando 650
Safety device type
ALIMAK GFD
Alimak
Scando 650
Power supply range
400–500 V, 50 or 60 Hz, 3 phase
Alimak
Scando 650
Fuse ratings
From 60 A
Alimak
Scando 650
Type of mast
650, tubular steel with integrated rack
Alimak
Scando 650
Length mast section
1.508 m (4’-11 3⁄8”)
Alimak
Scando 650
Weight mast section with 1 rack
118 kg (260 lbs)
Alimak
Scando 650
Rack module
5
Alimak
Scando 650
16 rows