SA617 - Cleaning ASHE strings and preparing SOC knowledgebase and SOC DIRECT LOOKUP #23
peter-spencer-ons wants to merge 40 commits into
… education, manage others and industry description (keep only job title)
import dotenv
import pandas as pd
from occupational_classification.data_access.soc_data_access import combine_job_title
This import does not exist?
Edit: the lock file doesn't match pyproject.toml, so I had a newer version of occupational_classification installed.
### Remove duplicates ###
data["documents"] = data["documents"].str.strip()  # Remove leading space
data = data.drop_duplicates(subset="documents", keep="last")  # remove duplicates
Do I read this correctly that you drop duplicates ONLY before running the LLM spellcheck?
Shall we run it after as well? Shall we check that the duplicates have the same code assigned?
Edit: I see it is handled in the last notebook.

Also, shall we deduplicate on the text-code pair?
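One way to act on both suggestions above is to deduplicate on the text-code pair and then list any text that still maps to more than one code. This is an illustrative sketch only: the `label` column name is borrowed from the batching script and may differ in this file, and both helper names are hypothetical.

```python
import pandas as pd


def dedupe_text_code(
    data: pd.DataFrame, text_col: str = "documents", code_col: str = "label"
) -> pd.DataFrame:
    """Strip whitespace, then drop rows that repeat the same (text, code) pair."""
    cleaned = data.copy()
    cleaned[text_col] = cleaned[text_col].str.strip()
    return cleaned.drop_duplicates(subset=[text_col, code_col], keep="last")


def find_conflicting_codes(
    data: pd.DataFrame, text_col: str = "documents", code_col: str = "label"
) -> pd.DataFrame:
    """Return rows whose text is assigned more than one distinct code."""
    codes_per_text = data.groupby(text_col)[code_col].transform("nunique")
    return data[codes_per_text > 1]


# Made-up rows, only to show the behaviour (column names are assumptions).
example = pd.DataFrame(
    {
        "documents": [" nurse", "nurse", "nurse", "baker"],
        "label": ["2231", "2231", "6131", "5432"],
    }
)
deduped = dedupe_text_code(example)
conflicts = find_conflicting_codes(deduped)
print(conflicts)
```

Deduplicating on `documents` alone with `keep="last"` would silently keep one of the conflicting codes, so running the conflict check first makes that choice visible.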
    data = pd.read_csv(f"{output_folder}/{file_name}{input_file_name}.csv")
    print("Database loaded from local.")
except FileNotFoundError:
    print("KNOWLEDGE_BUCKET not found in .env file. Please set it.")
The if/else logic seems reversed. The message is also not quite on point: a missing file is not the same as a missing environment variable.
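To illustrate the distinction raised above, the sketch below keeps the two failure modes apart: a missing local file triggers the bucket fallback, and only a missing `KNOWLEDGE_BUCKET` variable produces the .env message. The function name and the local-first fallback order are assumptions, not the PR's actual control flow.

```python
import os
from pathlib import Path


def resolve_database_source(
    local_csv: Path, bucket_env_var: str = "KNOWLEDGE_BUCKET"
) -> str:
    """Return the location to read from: the local file if present, else the bucket."""
    if local_csv.exists():
        print("Database found locally.")
        return str(local_csv)

    # The local file is missing, so the bucket variable is now required.
    bucket = os.getenv(bucket_env_var)
    if not bucket:
        raise RuntimeError(
            f"{local_csv} does not exist and {bucket_env_var} is not set "
            "in the .env file. Please set it."
        )
    print(f"{local_csv.name} not found locally; falling back to the bucket.")
    # Assumes the bucket value ends with a separator, as in f"{knowledge_bucket}SOC_KB.csv".
    return f"{bucket}{local_csv.name}"
```

The returned string can then be passed to `pd.read_csv`; reading directly from a bucket path may need an extra filesystem package, depending on the environment.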
)

# %%
# soc_kb.to_csv(f"{knowledge_bucket}SOC_KB.csv")
# print("SAVED TO BUCKET")

if not os.path.exists(f"{output_folder}/{file_prefix}{file_suffix}.csv"):
Before writing, check for records with NaN in 'corrected_spelling'.
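A minimal sketch of the check suggested above, assuming the results live in a DataFrame with a `corrected_spelling` column. The helper name and the sample rows are hypothetical.

```python
import pandas as pd


def split_missing_corrections(
    df: pd.DataFrame, column: str = "corrected_spelling"
) -> tuple:
    """Split rows into (complete, missing) based on NaN in the given column."""
    missing_mask = df[column].isna()
    return df[~missing_mask], df[missing_mask]


# Made-up rows, only to show the behaviour.
results = pd.DataFrame(
    {
        "documents": ["nurse", "bakr", "techer"],
        "corrected_spelling": ["nurse", None, "teacher"],
    }
)
complete, missing = split_missing_corrections(results)
if not missing.empty:
    print(f"{len(missing)} record(s) have no corrected spelling; review before writing.")
```

Whether the missing rows should be retried, dropped, or written to a separate file is a decision for the author; the sketch only surfaces them.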
)

# asyncio.run(split_in_batches(not_in_list))
asyncio.run(split_in_batches(in_list))
I don't think we want to modify the terms that are already in the index. Shouldn't we be processing not_in_list instead?
if current_batch_id + 1 == final_batch:
    df = df.drop_duplicates(
        subset=["corrected_spelling", "label"], keep="last"
Maybe we should not deduplicate anywhere in this script and leave it to postprocessing (in soc_kb_2026_04)?

Deduplicating here only affects the subset of the data being processed (either in_list or not_in_list, depending on which one is passed to split_in_batches). I think it's worth doing here too: it uses less storage space and the saved data is clearer.
async def batching(job_titles_column: pd.Series, batch_id: int):
    """Takes next batch from the dataset of size 10.

with open(
    f"{output_folder}/{file_prefix}{file_suffix}.json", encoding="utf-8"
) as file:
    recent_batch_id = json.load(file)["completed_batches"]
Have you tried interrupting a run? I think there may be a duplicated batch.

I didn't see any duplication after interrupting. Happy to discuss further.
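For reference in that discussion, here is a sketch of one resume pattern that cannot duplicate a batch, even if the run stops after results are written but before the checkpoint is updated. It is not the PR's implementation: the `completed_batches` key and the batch size of 10 come from the snippet above, while the function names, the `progress.json` file name, and the one-file-per-batch layout are assumptions.

```python
import json
from pathlib import Path
from typing import Optional

BATCH_SIZE = 10  # batch size mentioned in the batching docstring


def load_completed_batches(checkpoint: Path) -> int:
    """Read how many batches finished; 0 if no checkpoint exists yet."""
    if not checkpoint.exists():
        return 0
    with open(checkpoint, encoding="utf-8") as file:
        return json.load(file)["completed_batches"]


def run_batches(items: list, out_dir: Path, stop_after: Optional[int] = None) -> None:
    """Process items in batches, resuming from the checkpoint.

    stop_after simulates an interruption before that batch id is processed.
    """
    checkpoint = out_dir / "progress.json"  # hypothetical file name
    start = load_completed_batches(checkpoint)
    total = -(-len(items) // BATCH_SIZE)  # ceiling division
    for batch_id in range(start, total):
        if stop_after is not None and batch_id >= stop_after:
            return
        batch = items[batch_id * BATCH_SIZE:(batch_id + 1) * BATCH_SIZE]
        # One file per batch id: re-running a batch overwrites it instead of appending.
        batch_file = out_dir / f"batch_{batch_id}.json"
        batch_file.write_text(json.dumps(batch), encoding="utf-8")
        # The checkpoint is advanced only after the batch output is on disk.
        checkpoint.write_text(
            json.dumps({"completed_batches": batch_id + 1}), encoding="utf-8"
        )
```

If the script instead appends every batch to a single CSV, a stop between the append and the checkpoint update would repeat that batch on resume, which may be the case worth testing.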
### Create a dictionary for short list ###
s_list = load_soc_framework(
I don't think this is used anywhere. s_list is also defined (and used) in create_soc_lookup (at line 303).
### Read the data ###
def load_soc_index(filepath: str) -> pd.DataFrame:
This is defined in about three different places.
…e with original text
✨ Summary
Prepare datasets for knowledgebase and Direct Lookup for SOC, using data from ASHE.
Changes include adapting prompts and LLM methods that are not meant to be used in the main branch, but exclusively in SA617, without merging.
📜 Changes Introduced
✅ Checklist
terraform fmt & terraform validate)
🔍 How to Test
Select a subset of the ASHE dataset.
Create a .env file (please reach out for details).
Run:
ashe_clean_2026_04.py
assign_soc_code_2026_03.py
soc_kb_2026_04.ipynb
create_soc_lookup_2026_04.ipynb