SA617 - Cleaning ASHE strings and preparing SOC knowledgebase and SOC DIRECT LOOKUP by peter-spencer-ons · Pull Request #23 · ONSdigital/soc-classification-utils

peter-spencer-ons · 2026-05-05T22:33:26Z

✨ Summary

Prepare datasets for knowledgebase and Direct Lookup for SOC, using data from ASHE.

Changes include adapting prompts and LLM methods that are not meant to be used in the main branch, but exclusively in SA617, without merging.

📜 Changes Introduced

Adapt prompts and methods within src to be usable for the task.
Add a prompt and method to correct misspelled words in the ASHE dataset.
Prepare scripts for:
- correcting misspellings
- creating the knowledgebase
- assigning SOC codes using an LLM
- selecting unambiguous codes for Direct Lookup

✅ Checklist

Please confirm you've completed these checks before requesting a review.

Code is formatted using Black
Imports are sorted using isort
Code passes linting with Ruff, Pylint, and Mypy
Security checks pass using Bandit
API and Unit tests are written and pass using pytest
Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
DocStrings follow Google-style and are added as per Pylint recommendations
Documentation has been updated if needed

🔍 How to Test

Select a subset of the ASHE dataset.
Create a .env file (please reach out for details).
Run:

ashe_clean_2026_04.py
assign_soc_code_2026_03.py
soc_kb_2026_04.ipynb
create_soc_lookup_2026_04.ipynb

ivyONS · 2026-05-12T08:35:38Z

+
+import dotenv
+import pandas as pd
+from occupational_classification.data_access.soc_data_access import combine_job_title


does not exist?
edit: the lock file doesn't match pyproject.toml, so I had a newer version of occupational_classification

ivyONS · 2026-05-12T09:01:21Z

+
+### Remove duplicates ###
+data["documents"] = data["documents"].str.strip()  # Remove leading space
+data = data.drop_duplicates(subset="documents", keep="last")  # remove duplicates


Do I read this correctly that you drop duplicates ONLY before running the LLM spellcheck?
Shall we run it afterwards as well? Shall we check that the duplicates have the same code assigned?
edit: I see it is handled in the last notebook.

Also, shall we deduplicate on the text-code pair?
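
Both questions can be checked with a few lines of pandas. This is only a sketch: the `documents` column comes from the diff above, while the `code` column name and the sample rows (including the placeholder codes) are assumptions made for illustration.

```python
import pandas as pd

# Illustrative rows only; `code` is an assumed column name and the
# values are placeholders, not real SOC codes.
data = pd.DataFrame(
    {
        "documents": [" nurse", "nurse", "teacher", "teacher ", "plumber"],
        "code": ["1111", "1111", "2222", "2223", "3333"],
    }
)

# Normalise whitespace first so that " nurse" and "nurse" compare equal.
data["documents"] = data["documents"].str.strip()

# Deduplicate on the (text, code) pair rather than on the text alone.
deduped = data.drop_duplicates(subset=["documents", "code"], keep="last")

# Texts that still appear with more than one distinct code are ambiguous.
codes_per_text = deduped.groupby("documents")["code"].nunique()
conflicting = sorted(codes_per_text[codes_per_text > 1].index)

print(len(deduped), conflicting)
```

Deduplicating on the pair keeps conflicting assignments visible, so they can be reviewed instead of being silently resolved by `keep="last"`.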

ivyONS · 2026-05-12T10:38:56Z

+    data = pd.read_csv(f"{output_folder}/{file_name}{input_file_name}.csv")
+    print("Database loaded from local.")
+except FileNotFoundError:
+    print("KNOWLEDGE_BUCKET not found in .env file. Please set it.")


The if/else logic seems reversed, and the message is not quite on point (a missing file is not the same as a missing env variable).
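
One way to keep the two failure modes apart is to test for the local file first and only then look at the environment. The sketch below is hypothetical: `resolve_source` and the example bucket URL are not part of the PR, and the actual bucket-loading code is not shown in this thread.

```python
import tempfile
from pathlib import Path


def resolve_source(local_path: Path, env: dict[str, str]) -> tuple[str, str]:
    """Decide where to read the data from, reporting each failure separately."""
    if local_path.exists():
        return ("local", str(local_path))

    # The local file is missing, so only now does the bucket setting matter.
    bucket = env.get("KNOWLEDGE_BUCKET")
    if bucket:
        return ("bucket", bucket)

    raise RuntimeError(
        f"{local_path} does not exist and KNOWLEDGE_BUCKET is not set in .env."
    )


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "data.csv"
    existing.write_text("documents\nnurse\n", encoding="utf-8")
    missing = Path(tmp) / "missing.csv"

    # Hypothetical bucket URL, used only to exercise the fallback branch.
    local_result = resolve_source(existing, {})
    bucket_result = resolve_source(missing, {"KNOWLEDGE_BUCKET": "gs://example-bucket/"})

print(local_result[0], bucket_result)
```

In the script itself, `os.environ` could be passed as `env` after `dotenv.load_dotenv()` has run.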

ivyONS · 2026-05-12T11:30:29Z

+)
+
+# %%
+# soc_kb.to_csv(f"{knowledge_bucket}SOC_KB.csv")


don't need to include index
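
For reference, a minimal sketch of writing the CSV without the index column. The frame contents and the `code` column name are invented for illustration, and an in-memory buffer stands in for the bucket path.

```python
import io

import pandas as pd

# Placeholder rows; the real soc_kb frame is built earlier in the notebook.
soc_kb = pd.DataFrame({"documents": ["nurse"], "code": ["1111"]})

buffer = io.StringIO()
soc_kb.to_csv(buffer, index=False)  # index=False omits the row-index column
csv_text = buffer.getvalue()

print(csv_text)
```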

ivyONS · 2026-05-12T12:14:21Z

+            # print("SAVED TO BUCKET")
+
+
+if not os.path.exists(f"{output_folder}/{file_prefix}{file_suffix}.csv"):


Before writing, check for records with NaN in 'corrected_spelling'.
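
A possible shape for that check, as a sketch. The `label` and `corrected_spelling` column names appear elsewhere in this diff; the sample rows are invented, and falling back to the original text is just one policy (it matches the later commit "replace reasoning in the corrected_spelling with original label", but the exact handling there is not shown).

```python
import pandas as pd

# Illustrative rows only.
df = pd.DataFrame(
    {
        "label": ["nurse", "techer", "plumer"],
        "corrected_spelling": ["nurse", "teacher", None],
    }
)

# Count the records for which the LLM returned no correction.
missing = df["corrected_spelling"].isna()
print(f"{missing.sum()} record(s) without a corrected spelling")

# One possible policy: fall back to the original text for those records.
df["corrected_spelling"] = df["corrected_spelling"].fillna(df["label"])
```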

ivyONS · 2026-05-12T13:18:09Z

+    )
+
+# asyncio.run(split_in_batches(not_in_list))
+asyncio.run(split_in_batches(in_list))


I don't think we want to modify the terms that are in the index; shouldn't we be processing not_in_list?
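
To make the distinction concrete, here is a sketch of how the two subsets could be derived. How the script actually builds `in_list` and `not_in_list` is not visible in this thread, so the membership test and the sample titles below are assumptions.

```python
import pandas as pd

# Hypothetical set of titles already present in the SOC index.
index_titles = {"nurse", "teacher"}

job_titles = pd.Series(["nurse", "techer", "teacher", "plumer"])

# Titles found in the index need no correction; the rest are candidates
# for the LLM spellcheck.
is_known = job_titles.isin(index_titles)
in_list = job_titles[is_known]
not_in_list = job_titles[~is_known]

print(not_in_list.tolist())
```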

ivyONS · 2026-05-12T13:21:47Z

+
+        if current_batch_id + 1 == final_batch:
+            df = df.drop_duplicates(
+                subset=["corrected_spelling", "label"], keep="last"


Maybe we should not deduplicate anywhere in this script and leave it to postprocessing (in soc_kb_2026_04)?

Deduplicating here only affects a subset of the data (either in_list or not_in_list, depending on which subset is passed to split_in_batches). I think it's worth doing here too: less storage space, and clearer data being saved.

ivyONS · 2026-05-12T13:26:16Z

+
+
+async def batching(job_titles_column: pd.Series, batch_id: int):
+    """Takes next batch from the dataset of size 10.


ivyONS · 2026-05-12T13:27:41Z

+    with open(
+        f"{output_folder}/{file_prefix}{file_suffix}.json", encoding="utf-8"
+    ) as file:
+        recent_batch_id = json.load(file)["completed_batches"]


have you tried the interruption? I think there may be a duplicated batch

I didn't see any duplication after interrupting. Happy to discuss further
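
For the discussion, a self-contained sketch of a resume scheme that cannot repeat a batch. It assumes `completed_batches` stores the number of finished batches (not the id of the last one), which is exactly where an off-by-one duplicate would come from; `run_batches` and `stop_after` are hypothetical and only simulate the interruption.

```python
import json
import tempfile
from pathlib import Path

BATCH_SIZE = 10  # the batching docstring mentions batches of size 10


def run_batches(items, checkpoint: Path, stop_after=None):
    """Process items in batches, resuming after the last completed batch."""
    completed = 0
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text(encoding="utf-8"))
        completed = state["completed_batches"]

    processed = []
    total = -(-len(items) // BATCH_SIZE)  # ceiling division
    for batch_id in range(completed, total):
        if stop_after is not None and batch_id >= stop_after:
            break  # simulate an interruption before this batch starts
        batch = items[batch_id * BATCH_SIZE : (batch_id + 1) * BATCH_SIZE]
        processed.extend(batch)
        # Record the batch only after its results have been stored.
        checkpoint.write_text(
            json.dumps({"completed_batches": batch_id + 1}), encoding="utf-8"
        )
    return processed


items = list(range(35))
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    first = run_batches(items, ckpt, stop_after=2)  # interrupted run
    second = run_batches(items, ckpt)  # resumed run

print(len(first), len(second), first + second == items)
```

If the results were saved but the interruption hit before the checkpoint was updated, the resumed run would redo that one batch, which is the case worth testing by hand.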

ivyONS · 2026-05-12T13:32:15Z

+
+
+### Create a dictionary for short list ###
+s_list = load_soc_framework(


I don't think this is used anywhere. s_list is also defined (and used) in create_soc_lookup (at line 303).

ivyONS · 2026-05-12T13:37:53Z

+
+
+### Read the data ###
+def load_soc_index(filepath: str) -> pd.DataFrame:


this is defined at like three different places

peter-spencer-ons added 28 commits March 30, 2026 13:11

Adapt prompt for SOC KNOWLEDGEBASE - remove job description, level of…

abda90f

… education, manage others and industry description (keep only job title)

Add prompt for cleaning text strings

930bb75

Add functionality for cleaning text strings

1866384

add codes to corrected titles

f8b3b1f

spell check job titles

f97a22c

ignore csv files

d117afa

ignore parquet and json files

450cbca

save soc knowledgebase

9b23e16

change logic rregarding deduplicaiton

0501860

change deduplication logic

b49c43d

correct output from fixing prompt

1d831b6

create SOC knowledgebase

4440585

rename file

438dbfc

rename files

5e4738d

rename files

ecd45be

prompt attempt2

633187a

prompt attempt3

c4aa376

prompt attempt4

73d9ade

prompt attempt5

4f7fe85

prompt attempt5

0e36e72

prompt attempt6

100edb4

prompt attempt7

efc0978

prompt attempt8

265dad2

clean file to assigning soc code

877c771

clean commented out code from llm and promp files

36b0975

faster saving

5cbf18d

clean layout

cc7f148

create soc lookup - notebook

a5d89a6

peter-spencer-ons requested a review from ivyONS May 6, 2026 09:49

merge main

d2c90ed

ivyONS reviewed May 12, 2026

Comment thread notebooks/ashe_clean_2026_04.py

ivyONS reviewed May 12, 2026

Comment thread notebooks/ashe_clean_2026_04.py Outdated

ivyONS reviewed May 12, 2026

Comment thread notebooks/ashe_clean_2026_04.py

ivyONS reviewed May 12, 2026

peter-spencer-ons added 11 commits May 14, 2026 11:26

update poetery lock, add env template

ad0d103

correct imports after new version

cdf362c

add docstrings to scripts

cc7ddbb

lint

11725d7

Small changes: import logic, typehints, not saving index, replace Non…

f47f50f

…e with original text

remove async where not needed

67d01f6

remove unused method, variable and import

b239c58

replace reasoning in the corrected_spelling with original label

6b33a45

abbreviations in prompt

95703a8

poetry

87fa56e

typo in a prompt for correcting typos

c2cceb7

2 participants
