Skip to content

ACP: Introduce something akin to reverse ancestor method/iterator for std::path #745

@asder8215

Description

@asder8215

Proposal

Problem statement

Currently, in std::path, there are multiple methods that produces an iterator for Path that lets you iterate through components of a path or parents of a path. For example, say I have this:

let path = Path::new("/foo/bar/text.txt");

If we use path.iter(), we get the following from iterating through Iter<'_>:

"/"
"foo"
"bar"
"text.txt"

If we use path.components(), we get the following result from iterating through an Components<'_> iterator:

RootDir
Normal("foo")
Normal("bar")
Normal("text.txt")

And if we use path.ancestors(), we get the following from iterating through a Ancestor<'_> iterator:

"/foo/bar/text.txt"
"/foo/bar"
"/foo"
"/"

The Ancestors<'_> iterator is very useful in many programs or functions that want to iteratively apply changes to the current path and its parent paths. However, we currently don't have an iterator in std::path that mirrors what the Ancestors<'_> iterator does in the opposite direction. That is, from let path = Path::new("/foo/bar/text.txt");, we should be able to iterate through an iterator that shows the following:

"/"
"/foo"
"/foo/bar"
"/foo/bar/text.txt"

And note: Ancestors<'_> only implements Iterator and FusedIterator (though if it implemented DoubleEndedIterator, I don't think it would produce the above results, and just strip from the beginning components, like "foo/bar/text.txt", "bar/text.txt", "text.txt", "").

Motivating examples or use cases

The motivation behind this ACP came from my time changing the implementation for std::fs::create_dir_all() a while back from a recursive version of creating the current directory and its parent components to an iterative version; this was to avoid a stack overflow issue that could occur on Windows with a deeply nested path of uncreated directories (see this issue).

In my iterative implementation of create_dir_all() (see here), I used the Ancestors<'_> iterator to find the first (parent) path to create before I can move in the forward direction to create the rest of the directories. How I went about creating the remaining directories was through using a counter (in traversing to the first path to create), Vec::with_capacity(counter), and extending the Vec with ancestors.take(counter). This allowed me to transform the Vec<&Path> into a reversed iterator, allowing me to create the remaining directories. However, doing this still costed me a O(N) space (similar to the recursive solution, just instead of stack it's on heap) when we could have went with an optimized O(1) (whilst keeping time complexity at O(N)) by reusing the path given to us.

The idea for this proposal was actually from @Mark-Simulacrum in reviewing my PR that commented on the allocation for Vec<&Path> being unfortunate as we lacked a "children"/reversed ancestor API, and he also gave his own sketch on what this API could look like.

All together, I think it would make std::path consistent to introduce a "children"/reversed ancestor API with how we have an ancestor API, and many people could possible have use for such an API.

Solution sketch

This is a solution sketch of the Children API I came up with (which differs from the one Mark proposed in my create_dir_all() PR).

#![feature(stmt_expr_attributes)] // for the cfg blocks I use here
#![feature(path_trailing_sep)] // for removing trailing '/' or '\\'
use std::{ffi::OsStr, path::{Path, Prefix}};

#[derive(Copy, Clone, Debug)] // consistent with traits that Ancestor<'a> derives
struct Children<'a> {
    path: &'a [u8],
    #[cfg(target_os = "windows")]
    prefix: Option<Prefix<'a>>, // we don't need this on non-windows OS
    end: usize, // the end index to show our forward path
}

// this is taken from std::path from Prefix::len
#[cfg(target_os = "windows")]
fn prefix_len(prefix: Prefix) -> usize {
    use self::Prefix::*;
    fn os_str_len(s: &OsStr) -> usize {
        s.as_encoded_bytes().len()
    }
    match prefix {
        Verbatim(x) => 4 + os_str_len(x),
        VerbatimUNC(x, y) => {
            8 + os_str_len(x) + if os_str_len(y) > 0 { 1 + os_str_len(y) } else { 0 }
        }
        VerbatimDisk(_) => 6,
        UNC(x, y) => 2 + os_str_len(x) + if os_str_len(y) > 0 { 1 + os_str_len(y) } else { 0 },
        DeviceNS(x) => 4 + os_str_len(x),
        Disk(_) => 2,
    }
}

// Note this may run on certain platforms like
// UEFI or Solid ASP3 (which uses '\\'), so instead of these functions
// we rely on is_sep_byte() that's already privately imported in
// std::path from std::sys::path
#[inline]
#[cfg(not(target_os = "windows"))]
fn is_sep_byte(b: u8) -> bool {
    b == b'/'
}
#[inline]
#[cfg(target_os = "windows")]
fn is_sep_byte(b: u8) -> bool {
    b == b'\\' || b == b'/'
}

impl<'a> Iterator for Children<'a> {
    type Item = &'a Path;

    fn next(&mut self) -> Option<Self::Item> {
        // we've reached the end of the iterator if our
        // end is >= path given to us
        if self.end >= self.path.len() {
            return None;
        }
        
        // if we have a prefix, we can just start iterating through bytes
        // after prefix len
        #[cfg(target_os = "windows")]
        if let Some(prefix) = self.prefix {
            self.end += prefix_len(prefix);
            self.prefix = None; // prefix consumed
        }
  
        let mut check_trailing_sep = false;
        // find the first separator byte that delimits the path
        while self.end < self.path.len() {
            if is_sep_byte(self.path[self.end]) {
                check_trailing_sep = true;
                self.end += 1;
                break; 
            }
            self.end += 1;
        }
        
        // if there are trailing sep bytes, then we continue until 
        // we don't see anymore sep bytes
        while check_trailing_sep {
            if self.end >= self.path.len() {
                break;
            } else if !is_sep_byte(self.path[self.end]) {
                break;
            }
            self.end += 1;
        }
        
        // SAFETY: A valid path must have been passed into our iterator
        // and our `OsStr` encode bytes with "an unspecified, platform-specific,
        // self-synchronizing superset of UTF-8", which means that we wouldn't
        // touch the middle of a code point if we delimit on ascii separators like '/' or '\\'
        // (which are 1 byte)
        // To shorten this code, we could use Path::from_u8_slice() as seen for .ancestors(), 
        // but that's a private function it seems...
        let sliced_path = unsafe { OsStr::from_encoded_bytes_unchecked(&self.path[0..self.end]) };
        // we trim the trailing sep (which note doesn't do anything to '/' path) to
        // have returned path be consistent with Ancestor iterator 
        Some(Path::new(sliced_path).trim_trailing_sep())
    }
}

impl Path {
// A mirror on how .ancestors() returns its Ancestor<'_> iterator
// but we could probably just reduce this into one function with
// cfg blocks on what to return for specific platforms?
#[cfg(target_os = "windows")]
fn children(&self, path: &Path) -> Children<'_> {
    Children { path: path.as_os_str().as_encoded_bytes(), prefix: parse_prefix(path), end: 0}
}
#[cfg(not(target_os = "windows"))]
fn children(&self) -> Children<'_> {
    Children { path: self.as_os_str().as_encoded_bytes(), end: 0}
}
}

This implementation of the children/reversed ancestor API uses a reference to the path (no extra allocation of path needed) and consumes the iterator in O(N) or O(# of components path has). You can try it out a version of it reimplementing the private functions into accessible functions here.

Alternatives

Alternatively:

  • A reverse ancestor API could be handwritten, which I've shown in the solution sketch (or people could use heap allocated reversed iterator on a Vec approach that I've done for create_dir_all(), but that isn't optimal on space).

  • We could also use Mark's Children API here:

use std::path::Components;
use std::ffi::OsStr;
use std::path::Path;

fn main() {
    let path = Path::new("/tmp/foo/bar.txt");
    let mut children = Children::new(path);
    
    // Desired sequence:
    // /tmp
    // /tmp/foo
    // /tmp/foo/bar.txt

    dbg!(children.collect::<Vec<_>>());
}

struct Children<'a> {
    original: &'a Path,
    iterated: Components<'a>,
    done: bool,
}

impl<'a> Children<'a> {
    fn new(path: &'a Path) -> Self {
        let mut iterated = path.components();
        iterated.next();
        Children { original: path, iterated, done: false }
    }
}

impl<'a> Iterator for Children<'a> {
    type Item = &'a Path;
    
    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
    
        let orig = self.original.as_os_str().as_encoded_bytes();
        let remaining_suffix = self.iterated.as_path().as_os_str().as_encoded_bytes();
        if remaining_suffix.is_empty() {
            self.done = true;
        }
        self.iterated.next();
        
        let prefix = orig.strip_suffix(remaining_suffix).expect("path ends with own suffix");
    
        // SAFETY: TODO, but I think justified:
        // we only split on something `components()` would split on,
        // and it splits only on path separators (\\ or / depending on platform) which are UTF-8.
        // We could use https://doc.rust-lang.org/nightly/std/ffi/struct.OsStr.html#method.slice_encoded_bytes
        // to guarantee safety at the cost of a panic if we get it wrong.
        unsafe {
            Some(Path::new(OsStr::from_encoded_bytes_unchecked(prefix)))
        }
    }
}

The issue I think with this solution is that to walk through the Children<'_> iterator on each step you strip the remaining components/suffix from the original string in this part:

let prefix = orig.strip_suffix(remaining_suffix).expect("path ends with own suffix");

Walking through the full Children<'_> iterator here would be O(N^2) (or O(components^2))

  • As mentioned in the discussion below, we could also change how Ancestors<'_> struct looks like from this:
#[derive(Copy, Clone, Debug)]
#[must_use = "iterators are lazy and do nothing unless consumed"]
#[stable(feature = "path_ancestors", since = "1.28.0")]
pub struct Ancestors<'a> {
    next: Option<&'a Path>,
}

#[stable(feature = "path_ancestors", since = "1.28.0")]
impl<'a> Iterator for Ancestors<'a> {
    type Item = &'a Path;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        let next = self.next;
        self.next = next.and_then(Path::parent);
        next
    }
}

#[stable(feature = "path_ancestors", since = "1.28.0")]
impl FusedIterator for Ancestors<'_> {}

To something like this:

#[derive(Copy, Clone, Debug)]
#[must_use = "iterators are lazy and do nothing unless consumed"]
#[stable(feature = "path_ancestors", since = "1.28.0")]
pub struct Ancestors<'a> {
    next: &'a [u8],
    front: usize,
    back: usize,
}

#[stable(feature = "path_ancestors", since = "1.28.0")]
impl<'a> Iterator for Ancestors<'a> {
    type Item = &'a Path;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        ... 
    }
}

impl<'a> DoubleEndedIterator for Ancestors<'a> {
    type Item = &'a Path;
    
    #[inline]
    fn next_back(&mut self) -> Option<Self::Item> {
        ... 
    }
}

#[stable(feature = "path_ancestors", since = "1.28.0")]
impl FusedIterator for Ancestors<'_> {}

This would preserve how Ancestors<'_> iterator works currently (actually optimizing it a bit from using Path::parent, which uses Components<'_> .next_back() and .as_path() to get the parent path), and only introduce the DoubleEndedIterator trait to it, which shouldn't be a breaking change. We can take advantage of a front and back index strategy to get a slice of the path properly on .next()/.next_back(). This option would prevent us from creating a whole new iterator for the forward path when they can use a reversible iterator on Ancestors<'_>.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    ACP-acceptedAPI Change Proposal is accepted (seconded with no objections)T-libs-apiapi-change-proposalA proposal to add or alter unstable APIs in the standard libraries

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions