Simplify byte_pair_merge by hauntsaninja · Pull Request #255 · openai/tiktoken

hauntsaninja · 2024-02-09T22:56:46Z

Based on suggestion in #239 (specifically 8f5dd7d)

Like that commit, this:

- Does the init in a single loop and saves a loop if there are no merges
- Simplifies get_rank and no longer uses it in init (so you don't need multiple skip values)

Unlike that commit:

- We drop optimisations enabled by ignoring single tokens. These didn't show any benefit on benchmarks for me (this makes sense given typical piece sizes, but let me know if that's unexpected!). Given this, I opted for the simpler version.
- I preserve some of the comments from the original that I think are still useful
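To make the points above concrete, here is a minimal sketch of a merge with single-loop init. It is an illustration, not the diff itself: the name byte_pair_merge_sketch, the u32 Rank alias, and the requirement that the piece has at least two bytes are assumptions made for this example.

```rust
use std::collections::HashMap;

// Assumption for this sketch: ranks are u32 and Rank::MAX means "no such merge".
type Rank = u32;

/// Returns (start, rank) parts; consecutive starts delimit the merged tokens.
/// Assumes `piece.len() > 1`.
fn byte_pair_merge_sketch(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<(usize, Rank)> {
    assert!(piece.len() > 1, "sketch assumes at least two bytes");

    // One pass builds `parts` and tracks the best initial pair at the same time.
    let mut parts: Vec<(usize, Rank)> = Vec::with_capacity(piece.len() + 1);
    let mut min_rank: (Rank, usize) = (Rank::MAX, usize::MAX);
    for i in 0..piece.len() - 1 {
        let rank = *ranks.get(&piece[i..i + 2]).unwrap_or(&Rank::MAX);
        if rank < min_rank.0 {
            min_rank = (rank, i);
        }
        parts.push((i, rank));
    }
    // Two sentinels: the last byte (no pair starts there) and the end of the piece.
    parts.push((piece.len() - 1, Rank::MAX));
    parts.push((piece.len(), Rank::MAX));

    // Rank of the pair starting at parts[i], evaluated before parts[i + 1] is
    // removed, hence the i + 3 lookahead. Used only in the merge loop, not in init.
    let get_rank = |parts: &Vec<(usize, Rank)>, i: usize| -> Rank {
        if i + 3 < parts.len() {
            *ranks
                .get(&piece[parts[i].0..parts[i + 3].0])
                .unwrap_or(&Rank::MAX)
        } else {
            Rank::MAX
        }
    };

    // If init found no mergeable pair, this loop never runs.
    while min_rank.0 != Rank::MAX {
        let i = min_rank.1;
        if i > 0 {
            parts[i - 1].1 = get_rank(&parts, i - 1);
        }
        parts[i].1 = get_rank(&parts, i);
        parts.remove(i + 1);

        // Rescan for the next lowest-rank pair, skipping the end sentinel.
        min_rank = (Rank::MAX, usize::MAX);
        for (j, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
            if rank < min_rank.0 {
                min_rank = (rank, j);
            }
        }
    }
    parts
}
```

Because init only ever looks up two-byte windows, it can index `ranks` directly, and `get_rank` is left with a single bounds check for the merge loop.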

Let me know what you think! Once we figure this one out, we'll look at the linearithmic fix (I have some thoughts there, still doing some benchmarking).

Co-authored-by: @paplorinc

src/lib.rs

l0rinc · 2024-02-10T11:21:42Z

src/lib.rs

+    parts.push((piece.len(), Rank::MAX));

    let get_rank = {
        #[inline(always)]


Did you see any effect from the inlining here?
I didn't, and even the linter complained, since this is a closure that captures some parameters.

Good question, I dimly remember it being useful in #31 (but it was also used in an additional place then). I can double check :-) Which linter?

l0rinc

We drop optimisations enabled by ignoring single tokens.

The parts.len() > 3 check means that once we're down to 2 tokens, we don't need any more iterations: we don't have to try merging them into a single token, since we've already filtered single tokens out. That won't show up in the benchmarks, though, since it's basically constant time, so I agree, the code is simpler this way :)
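One way to picture that early exit, as a sketch under assumptions and not the code from either commit: suppose the caller has already handled pieces that are themselves a single token, and the merge keeps one boundary per token start plus the piece length. Two remaining tokens then correspond to three boundaries, and the only possible merge would be the whole piece, which is known to be absent. The names split_piece and merge_unless_two_left, the Rank alias, and the deliberately naive rescan are all hypothetical.

```rust
use std::collections::HashMap;

// Assumption for this sketch: ranks are u32.
type Rank = u32;

/// Naive merge over token boundaries: `bounds` holds every token start plus piece.len().
/// Precondition (enforced by `split_piece`): the whole piece is not itself in `ranks`.
fn merge_unless_two_left(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<usize> {
    let mut bounds: Vec<usize> = (0..=piece.len()).collect();
    // bounds.len() == 3 means two tokens remain. Merging them would give the
    // whole piece, which the precondition rules out, so we can stop.
    while bounds.len() > 3 {
        // Lowest-rank adjacent pair; ties go to the leftmost pair.
        let best = (0..bounds.len() - 2)
            .filter_map(|i| ranks.get(&piece[bounds[i]..bounds[i + 2]]).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                bounds.remove(i + 1);
            }
            None => break,
        }
    }
    bounds
}

/// Filters out single-token pieces first, which is what makes the early exit safe.
fn split_piece(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<usize> {
    if ranks.contains_key(piece) {
        return vec![0, piece.len()];
    }
    merge_unless_two_left(ranks, piece)
}
```

The saving is one final lookup per piece, which matches the observation that it is constant time and unlikely to register in benchmarks.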

Thanks for adding the comments back, please see my inline comments.
If you can, please add me as a coauthor here.

Thanks!

Co-authored-by: Lőrinc Pap <1841944+paplorinc@users.noreply.github.com>

hauntsaninja · 2024-02-11T08:20:52Z

Thank you for the original change and follow-up review! I've marked you as co-author on the commit :-)

backport of openai#255

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Co-authored-by: Lőrinc Pap <1841944+paplorinc@users.noreply.github.com>

Simplify byte_pair_merge

66a57ba

l0rinc reviewed Feb 10, 2024

l0rinc approved these changes Feb 10, 2024

Apply suggestions from code review

2cc09e0

Co-authored-by: Lőrinc Pap <1841944+paplorinc@users.noreply.github.com>

l0rinc approved these changes Feb 11, 2024

hauntsaninja merged commit 1b9faf2 into main Feb 11, 2024

hauntsaninja deleted the byte-pair-merge branch February 11, 2024 08:20

stephentoub mentioned this pull request Feb 20, 2024

Ensure tiktoken implementation up-to-date with OpenAI reference implementation dotnet/machinelearning#7019

Open

tmm1 mentioned this pull request Oct 17, 2024

Simplify byte_pair_merge anysphere/tiktoken-rs#17

Merged
