Fix IndexError in kaggle_submission.py for sequences exceeding max_length #11
drqsatoshi merged 2 commits into main
…ion.py Co-authored-by: drqsatoshi <240532885+drqsatoshi@users.noreply.github.com>
def create_submission(predictions, output_path, num_structures=5):
    """
    Create submission file from predictions.

    Args:
        predictions: dict with 'coords' (N, L, num_structures, 3) and 'sequences'
        output_path: path to save submission.csv
        num_structures: number of structures to predict (default 5)
    """
    coords = predictions['coords']  # Expected shape: (num_seq, max_len, num_struct, 3)
    sequences = predictions['sequences']
    rows = []
    residue_idx = 0
    for seq_idx in range(len(sequences)):
        seq = sequences[seq_idx]
        for pos, nucleotide in enumerate(seq):
            row = {
                'id': residue_idx,
                'resname': nucleotide,
                'resid': pos + 1
            }
            # Add coordinates for each structure
            for struct_idx in range(num_structures):
                x, y, z = coords[seq_idx, pos, struct_idx, :]
                row[f'x_{struct_idx + 1}'] = x
                row[f'y_{struct_idx + 1}'] = y
                row[f'z_{struct_idx + 1}'] = z
            rows.append(row)
            residue_idx += 1
    df = pd.DataFrame(rows)
    df.to_csv(output_path, index=False)
    print(f"Saved submission to {output_path}")
    return df
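To make the resulting CSV layout concrete, here is a minimal sketch of the column order that the row dictionaries above produce. The helper name submission_columns is hypothetical and does not appear in the PR; it only mirrors the order in which keys are inserted into each row.

```python
def submission_columns(num_structures=5):
    """Hypothetical helper: column names in the order create_submission inserts them."""
    columns = ['id', 'resname', 'resid']
    for struct_idx in range(num_structures):
        # Same 1-based suffixes as the f-strings in create_submission
        columns += [f'x_{struct_idx + 1}', f'y_{struct_idx + 1}', f'z_{struct_idx + 1}']
    return columns


columns = submission_columns()
print(len(columns))   # 3 identifier columns + 3 coordinates * 5 structures
print(columns[:6])
```

With the default of 5 structures this gives 18 columns, so every residue row needs a valid coords entry for each of the 5 structures.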
Pull request overview
This PR fixes an IndexError when generating Kaggle submissions for RNA sequences longer than the model’s max_length (500), ensuring sequence length and predicted coordinate arrays stay aligned.
Changes:
- Cap seq_len in predict_structures to min(len(row['sequence']), max_length) so it never exceeds the model’s output length.
- Truncate the stored sequence to row['sequence'][:max_length] so the sequence length matches the coordinate array’s first dimension.
  seq_len = min(len(row['sequence']), max_length)
  target_id = row['target_id']
- sequence = row['sequence']
+ sequence = row['sequence'][:max_length]  # Truncate sequence to max_length
This change fixes the IndexError for sequences longer than max_length, but no automated test covers the case where len(row['sequence']) > max_length in the predict_structures → create_submission pipeline. A regression test that constructs a sequence longer than max_length and asserts that submission generation completes without index errors (and that only the first max_length residues are emitted) would help prevent this bug from reappearing.
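A regression test along these lines might look like the sketch below. It is stdlib-only and makes several assumptions: predict_structures_stub and build_rows are hypothetical stand-ins for the real predict_structures and create_submission (which rely on the model and pandas), nested lists stand in for the coordinate array, and the zero coordinates are placeholder data.

```python
MAX_LENGTH = 500  # value stated in the PR description


def predict_structures_stub(sequence, max_length=MAX_LENGTH, num_structures=5):
    """Hypothetical stand-in for predict_structures with the fix applied."""
    seq_len = min(len(sequence), max_length)  # capped, as in the PR
    truncated = sequence[:max_length]         # stored sequence matches coords
    # coords[pos][struct_idx] is an (x, y, z) tuple of placeholder zeros
    coords = [[(0.0, 0.0, 0.0)] * num_structures for _ in range(seq_len)]
    return truncated, coords


def build_rows(sequence, coords, num_structures=5):
    """Simplified mirror of the row-building loop in create_submission."""
    rows = []
    for pos, nucleotide in enumerate(sequence):
        row = {'id': pos, 'resname': nucleotide, 'resid': pos + 1}
        for struct_idx in range(num_structures):
            x, y, z = coords[pos][struct_idx]
            row[f'x_{struct_idx + 1}'] = x
            row[f'y_{struct_idx + 1}'] = y
            row[f'z_{struct_idx + 1}'] = z
        rows.append(row)
    return rows


def test_long_sequence_is_truncated():
    long_sequence = "ACGU" * 150  # 600 residues, longer than MAX_LENGTH
    sequence, coords = predict_structures_stub(long_sequence)
    rows = build_rows(sequence, coords)  # must not raise IndexError
    assert len(rows) == MAX_LENGTH
    assert rows[-1]['resid'] == MAX_LENGTH
    assert rows[-1]['resname'] == long_sequence[MAX_LENGTH - 1]


test_long_sequence_is_truncated()
```

A test against the real functions would presumably call predict_structures and create_submission directly and write to a temporary path; the stubs here only demonstrate the assertions worth making.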
create_submission throws IndexError: index 500 is out of bounds for axis 0 with size 500 when processing RNA sequences longer than max_length (500).
Problem
In predict_structures, coordinates are truncated to max_length due to model output shape, but the sequence and seq_len retain original length. When create_submission iterates over the full sequence, it accesses out-of-bounds indices.
Changes
- Cap seq_len to max_length in predict_structures
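The length mismatch behind the error can be reproduced in a few lines. This is a minimal sketch under stated assumptions: model_coords_stub and walk are hypothetical names, and a plain Python list stands in for the NumPy coordinate array, so the exception message differs from the one quoted above even though the failure mode is the same.

```python
MAX_LENGTH = 500  # model output length given in the PR description


def model_coords_stub(sequence, max_length=MAX_LENGTH):
    """Hypothetical stand-in for the model: one (x, y, z) per residue, capped at max_length."""
    return [(0.0, 0.0, 0.0)] * min(len(sequence), max_length)


def walk(sequence, coords):
    """Index coords by residue position, as the loop in create_submission does."""
    return [coords[pos] for pos, _ in enumerate(sequence)]


sequence = "A" * 600  # longer than MAX_LENGTH
coords = model_coords_stub(sequence)

try:
    walk(sequence, coords)  # position 500 does not exist in coords
    raised = False
except IndexError:
    raised = True

# The fix: truncate the sequence to the same length as the coordinates
aligned = walk(sequence[:MAX_LENGTH], coords)
print(raised, len(aligned))
```

Truncating the sequence at the point where it is stored keeps every downstream consumer aligned with the coordinates, which is why the change lives in predict_structures instead of in create_submission.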