SA617 - Cleaning ASHE strings and preparing SOC knowledgebase and SOC DIRECT LOOKUP #23

Draft
peter-spencer-ons wants to merge 40 commits into main from SA617_soc_knowledgebase

@peter-spencer-ons
Contributor

✨ Summary

Prepare datasets for knowledgebase and Direct Lookup for SOC, using data from ASHE.

The changes adapt prompts and LLM methods for use in SA617 only; these adaptations are not meant to be merged into the main branch.

📜 Changes Introduced

  • Adapt prompts and methods within src to be usable for the task.
  • Add a prompt and method to correct misspelled words from the ASHE dataset.
  • Prepare scripts for:
    • correcting misspellings
    • creating knowledgebase
    • assigning SOC codes with LLM use
    • selecting unambiguous codes for Direct Lookup

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

Select a subset of the ASHE dataset.
Create a .env file (please reach out for details).
Run:

  • ashe_clean_2026_04.py
  • assign_soc_code_2026_03.py
  • soc_kb_2026_04.ipynb
  • create_soc_lookup_2026_04.ipynb

@peter-spencer-ons requested a review from ivyONS on May 6, 2026 09:49
Comment thread notebooks/ashe_clean_2026_04.py Outdated

import dotenv
import pandas as pd
from occupational_classification.data_access.soc_data_access import combine_job_title

@ivyONS ivyONS May 12, 2026

Does this import not exist?
Edit: the lock file doesn't match pyproject.toml, so I had a newer version of occupational_classification installed.

Comment thread notebooks/ashe_clean_2026_04.py
Comment thread notebooks/ashe_clean_2026_04.py Outdated
Comment thread notebooks/ashe_clean_2026_04.py

### Remove duplicates ###
data["documents"] = data["documents"].str.strip()  # Remove leading and trailing whitespace
data = data.drop_duplicates(subset="documents", keep="last")  # Remove duplicate documents

@ivyONS ivyONS May 12, 2026

Do I read this correctly that you drop duplicates ONLY before running the LLM spellcheck?
Shall we run it afterwards as well? Shall we check that the duplicates have the same code assigned?
Edit: I see this is handled in the last notebook.

Also, shall we deduplicate on the (text, code) pair?
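
A minimal sketch of the two checks raised in this thread: deduplicating on the (text, code) pair, and finding texts that ended up with more than one code. The column names `corrected_spelling` and `label` come from the snippets quoted in this PR; the helper name, the example rows, and the placeholder codes are hypothetical.

```python
import pandas as pd


def split_consistent_and_conflicting(
    df: pd.DataFrame,
    text_col: str = "corrected_spelling",
    code_col: str = "label",
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Deduplicate on (text, code) and separate texts that map to several codes."""
    # Keep one row per distinct (text, code) pair.
    pairs = df.drop_duplicates(subset=[text_col, code_col])
    # Count how many distinct codes each text received.
    codes_per_text = pairs.groupby(text_col)[code_col].nunique()
    conflicting_texts = codes_per_text[codes_per_text > 1].index
    is_conflicting = pairs[text_col].isin(conflicting_texts)
    return pairs[~is_conflicting], pairs[is_conflicting]


# Placeholder data: the codes below are made up for illustration only.
example = pd.DataFrame(
    {
        "corrected_spelling": ["nurse", "nurse", "driver", "driver"],
        "label": ["1111", "1111", "2222", "3333"],
    }
)
consistent, conflicting = split_consistent_and_conflicting(example)
```

Here `consistent` holds the single "nurse" row, while both "driver" rows land in `conflicting` for manual review. Only the consistent rows would be candidates for an unambiguous lookup.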

Comment thread notebooks/assign_soc_code_2026_03.py Outdated
    data = pd.read_csv(f"{output_folder}/{file_name}{input_file_name}.csv")
    print("Database loaded from local.")
except FileNotFoundError:
    print("KNOWLEDGE_BUCKET not found in .env file. Please set it.")

The if/else logic seems reversed. The messaging is also not quite on point: a file that doesn't exist is not the same as an environment variable that isn't set.
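
One possible shape for this loading step, assuming the intent is to read a local copy when it exists and fall back to the bucket otherwise. The function name `load_input` and the fallback order are assumptions, not the PR's code. Reading a bucket URL with pandas also assumes the matching filesystem package is installed.

```python
import os
from pathlib import Path

import pandas as pd


def load_input(local_path: str, bucket_env_var: str = "KNOWLEDGE_BUCKET") -> pd.DataFrame:
    """Load a CSV locally if present, otherwise from the bucket named in the environment."""
    path = Path(local_path)
    if path.exists():
        print(f"Loading data from local file: {path}")
        return pd.read_csv(path)

    # The local file is missing, so the bucket is the only remaining source.
    bucket = os.getenv(bucket_env_var)
    if bucket is None:
        # Report both conditions separately so the message matches the actual cause.
        raise RuntimeError(
            f"{path} does not exist locally and {bucket_env_var} is not set "
            "in the environment or .env file."
        )

    # Assumes the bucket value ends with a path separator (hypothetical convention).
    print(f"Loading data from bucket: {bucket}")
    return pd.read_csv(f"{bucket}{path.name}")
```

Checking `path.exists()` and the environment variable in separate branches keeps each message tied to the condition that actually failed.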

Comment thread notebooks/soc_kb_2026_04.py Outdated
)

# %%
# soc_kb.to_csv(f"{knowledge_bucket}SOC_KB.csv")

No need to include the index (pass index=False).
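
A small illustration of the index point, writing to an in-memory buffer instead of the bucket. The frame contents are placeholders; `soc_kb` is the variable name from the snippet above.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real knowledgebase.
soc_kb = pd.DataFrame({"documents": ["title a", "title b"], "label": ["0000", "0001"]})

with_index = io.StringIO()
soc_kb.to_csv(with_index)  # default: the row index is written as an unnamed first column

without_index = io.StringIO()
soc_kb.to_csv(without_index, index=False)  # only the real columns are written

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
```

With the default, the header starts with an empty column name, which shows up as an extra `Unnamed: 0` column when the file is read back with `pd.read_csv`.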

# print("SAVED TO BUCKET")


if not os.path.exists(f"{output_folder}/{file_prefix}{file_suffix}.csv"):

@ivyONS ivyONS May 12, 2026

Before writing, check for records with NaN in 'corrected_spelling'.
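
A sketch of such a check, assuming the goal is to set aside rows with a missing correction before the file is written. The helper name and the example rows are hypothetical; the `corrected_spelling` column name is from the PR.

```python
import pandas as pd


def split_missing_corrections(
    df: pd.DataFrame, column: str = "corrected_spelling"
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (rows with a correction, rows where the correction is missing)."""
    missing = df[column].isna()
    return df[~missing], df[missing]


# Placeholder data for illustration only.
example = pd.DataFrame(
    {
        "documents": ["nurs", "drivr", "teacher"],
        "corrected_spelling": ["nurse", None, "teacher"],
    }
)
complete, missing = split_missing_corrections(example)
if not missing.empty:
    print(f"{len(missing)} record(s) have no corrected spelling and need another pass.")
```

The missing rows could be re-queued for the spellcheck step or logged, depending on what the pipeline should do with them.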

Comment thread notebooks/ashe_clean_2026_04.py Outdated
)

# asyncio.run(split_in_batches(not_in_list))
asyncio.run(split_in_batches(in_list))

I don't think we want to modify the terms that are in the index. Shouldn't we be processing not_in_list instead?

Comment thread notebooks/ashe_clean_2026_04.py Outdated

if current_batch_id + 1 == final_batch:
    df = df.drop_duplicates(
        subset=["corrected_spelling", "label"], keep="last"

@ivyONS ivyONS May 12, 2026

maybe we should not deduplicate anywhere in this script and leave it to postprocessing (in soc_kb_2026_04)?

@peter-spencer-ons (Contributor, Author)

Deduplicating here happens only on a subset of the data (either in_list or not_in_list, depending on which subset we pass to the split_in_batches method). I think it's worth doing here too: it takes less storage space and makes the saved data clearer.



async def batching(job_titles_column: pd.Series, batch_id: int):
    """Takes next batch from the dataset of size 10.

why async?
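
For comparison, a synchronous sketch of the same helper. It assumes batches are contiguous slices of 10 rows and that `batch_id` is zero-based; neither is confirmed by the snippet above. Slicing a Series does no I/O, so nothing here needs to be awaited.

```python
import pandas as pd

BATCH_SIZE = 10  # batch size stated in the docstring above


def batching(
    job_titles_column: pd.Series, batch_id: int, batch_size: int = BATCH_SIZE
) -> pd.Series:
    """Return the batch with the given zero-based id as a contiguous slice."""
    start = batch_id * batch_size
    return job_titles_column.iloc[start : start + batch_size]


# Placeholder titles for illustration only.
titles = pd.Series([f"title {i}" for i in range(25)])
first_batch = batching(titles, 0)
last_batch = batching(titles, 2)  # shorter, since only 5 titles remain
```

Only the LLM call itself would need to be a coroutine; a plain function like this can be called from inside one without `await`.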

with open(
    f"{output_folder}/{file_prefix}{file_suffix}.json", encoding="utf-8"
) as file:
    recent_batch_id = json.load(file)["completed_batches"]

Have you tried interrupting the run? I think a batch may end up duplicated.

@peter-spencer-ons (Contributor, Author)

I didn't see any duplication after interrupting. Happy to discuss further
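
One way to settle this could be a small simulation of the resume logic. The sketch below is not the PR's code: it assumes the checkpoint stores the number of completed batches under `completed_batches` (the key from the snippet above) and that it is written only after a batch's results are saved. The function names and the in-memory `processed` list are hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def load_completed(checkpoint: Path) -> int:
    """Return how many batches were completed, or 0 if there is no checkpoint yet."""
    if not checkpoint.exists():
        return 0
    with open(checkpoint, encoding="utf-8") as file:
        return json.load(file)["completed_batches"]


def run(batches, checkpoint: Path, results: list, stop_before=None) -> None:
    """Process batches from the checkpoint onwards, optionally stopping early."""
    for batch_id in range(load_completed(checkpoint), len(batches)):
        if stop_before is not None and batch_id == stop_before:
            return  # simulated interruption before this batch is processed
        results.extend(batches[batch_id])  # stand-in for saving the batch output
        with open(checkpoint, "w", encoding="utf-8") as file:
            json.dump({"completed_batches": batch_id + 1}, file)


batches = [["a", "b"], ["c", "d"], ["e"]]
processed: list = []
with tempfile.TemporaryDirectory() as folder:
    checkpoint_path = Path(folder) / "progress.json"
    run(batches, checkpoint_path, processed, stop_before=1)  # interrupted run
    run(batches, checkpoint_path, processed)  # resumed run
```

In this arrangement every item appears exactly once after the resumed run. A batch could still repeat if the interruption lands between saving the results and writing the checkpoint, which is the case worth testing against the real script.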

Comment thread notebooks/assign_soc_code_2026_03.py Outdated


### Create a dictionary for short list ###
s_list = load_soc_framework(

@ivyONS ivyONS May 12, 2026

I don't think this is used anywhere. The s_list is also defined (and used) in create_soc_lookup (at line 303).



### Read the data ###
def load_soc_index(filepath: str) -> pd.DataFrame:

This is defined in about three different places.

2 participants