Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for BufRead::has_data_left #86423

Open
1 of 3 tasks
YuhanLiin opened this issue Jun 18, 2021 · 13 comments
Open
1 of 3 tasks

Tracking Issue for BufRead::has_data_left #86423

YuhanLiin opened this issue Jun 18, 2021 · 13 comments
Labels
A-io Area: std::io, std::fs, std::net and std::path C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@YuhanLiin
Copy link
Contributor

YuhanLiin commented Jun 18, 2021

Feature gate: #![feature(buf_read_has_data_left)]

This is a tracking issue for adding has_data_left to BufRead. Its purpose is to determine whether a reader has any data left over to read.

Example

let file = File::open("foo.txt")?;
let reader = BufReader::new(file);
let is_not_empty = reader.has_data_left()?;

Steps / History

Unresolved Questions

  • Still need to finalize the name of has_data_left and whether the API should check the positive case or the negative case.
  • Internals discussion (note that the post uses is_at_eof as the name of the new method).
@YuhanLiin YuhanLiin added C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 18, 2021
@m-ou-se m-ou-se added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-io Area: std::io, std::fs, std::net and std::path and removed T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 18, 2021
@jamesluke
Copy link
Contributor

Bikeshedding on the name: has_data_left() implies that there is a finite amount of data, we have been sucking it up, and eventually there will be no more. While it's already obvious to users that a reader could produce an infinite amount of data, what's less obvious is that a reader can seem to run dry (returning 0 from a call to read()), then spring back to life again on subsequent calls. So something like has_more_data() would fit a bit better — it inspires less certainty.

Better still would be something that make clear what the function is really doing: performing an action, not simply returning some descriptive state flag. Perhaps a name like check_for_data(), which indicates that the function 1) runs a test, and 2) returns an answer valid only for a given moment. As a bonus, it's more obvious why the function returns an io::Result: it's doing I/O!

(I'm afraid I don't know how to rewrite check_for_data() to make it sound like it returns a bool, but I'm sure there's something.)


Note: It's possible I've misunderstood the intention behind the proposal. If the goal for a final implementation is to return false only on those permanent "fuse blown, no more data ever" conditions that can occur, this bikeshed is moot.

A second note: Is this the correct place to reply? I'm assuming so because the PR was already merged, but chime in if I'm committing a faux pas.

@YuhanLiin
Copy link
Contributor Author

TBH when I came up with this I didn't even consider the fact that readers can "spring back to life". The method I'm adding is only meant to check for EOF, rather than the definitive end of the reader (I'm don't think that's even possible for something like a File). You make a good point about the naming. I think something like check_for_eof, check_eof or check_at_eof can work.

@Stargateur
Copy link
Contributor

Stargateur commented Nov 1, 2021

I don't like this I think the correct solution would be to introduce a better API then fill_buf() like:

fn fill_buf2(&mut self) -> Result<Option<&[u8]>> {
    let buf = self.fill_buf()?;
    if buf.is_empty() {
        None
    } else {
        Some(buf)
    }
}

This is in my opinion way nicer to use and clearly hint (actually force at compile time) to do think properly. It's also allow to avoid call fill_buf() twice and doesn't require implementation to optimize it.

use std::io::prelude::*;
use std::io::BufReader;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let f = File::open("log.txt")?;
    let mut reader = BufReader::new(f);

    while let Some(amazing_log_line) = reader.fill_buf2()? {
        println!("{:?}", amazing_log_line);
        read.consume(amazing_log_line.len());
    }

    Ok(())
}

Random Off topic Idea:

  1. A funny optimization could be to have a NonEmptySlice like NonZero integer type that could allow the Option to be zero size.
  2. I think we could also add an api that auto consume too (with a Drop implementation) for simple use case.
  3. A consume_buf that consume buf.len() would be nice too.

@YuhanLiin
Copy link
Contributor Author

The thing with fill_buf2 is that most usages of BufRead involve methods that are higher-level than fill_buf and consume, so you'd never directly use the results of fill_buf2. In your example, to read log lines you'd need to call read_line rather than fill_buf2, since fill_buf does not have any guarantees about how much data it reads. Your loop would need to be something like:

while let Some(_) = reader.fill_buf2()? {
    let mut line = String::new();
    reader.read_line(&mut line)?;
    println!("{:?}", line);
}

Since higher-level methods like read_line may call fill_buf and consume implicitly, you end up possibly calling fill_buf twice per iteration, just like with the current API. Note that this isn't a bad thing, because if the first fill_buf call from has_data_left reads in the whole line, then read_line won't need to call fill_buf at all. In essence, the loop body only calls fill_buf as many times as it needs to buffer one line, regardless of API changes. The only use case of fill_buf2 for users of BufRead (who use higher-level methods to gather data) is for EOF checking, and in that case I believe or more explicit API like has_data_left is better to use.

Your suggestions for coordinating fill_buf calls with consume calls already exists in the implementations of higher-level BufRead method, which is why users stick to those methods. We don't need higher-level versions of consume because users won't call consume directly anyways.

@Stargateur
Copy link
Contributor

Stargateur commented Nov 2, 2021

that wrong this issue is about fill_buf, a normal use of read_line should not use fill_buf or consume since the purpose of these function is to avoid using them. The purpose of read_line is too reuse the buffer:

let mut line = String::new();
while let Ok(n) = reader.read_line(&mut line)? {
    if n == 0 {
        break;
    }
    println!("{:?}", line);
    line.clear();
}

Or accumulate octets into the line.

And once again a better api could be:

fn read_line(&mut self, buf: &mut String) -> Result<Option<NonZeroUsize>> { ... }

What you propose with has_data_left doesn't fix the real problem of BufRead my proposition is to rethink this trait to allow better use at compile time (aka the purpose of Rust) your solution is very C like. (no judgement I love C)

@Stargateur
Copy link
Contributor

Stargateur commented Nov 2, 2021

I think BufRead is a very old API that doesn't reflect the Rust of 2021. Maybe this require a RFC but here a start, we should:

  • Introduce 1 new trait let's say for now BufRead2 that will have low level API
  • Introduce 1 new trait BufReadExt that will have high level API

What is nice about this is BufRead2 can be implemented for every T that implement BufRead.
This pattern is often used in Async for example Future and FutureExt.

consume also should return an Result cause like we see here implementation could choice to return an Error instead of ingoring the problem, also, the current code allow to overflow and so to "reverse" the consumed octets. (self.pos + (usize::MAX - 50) for example)

pub trait BufRead2: Read {
    fn read_buf(&mut self) -> IoResult<Option<&[u8]>>; // a way to express non empty slice ?
    fn consume(&mut self, n: usize) -> Result<???,  ???> ;
}

The current problem of helper function today is there are error prone, there is a crate that exist for the sole purpose to avoid this pitfall CharReader

The problem is no obvious way to limit what is being read using the helper method. Only Read::take() could allow this I think the BufReadExt should have a similar way to do that since BufRead is Read too it's possible to have a generic implementation of BufReadExt that can use take to implement this feature to have a "limit by line" instead of only a limit on the whole Reader.

Nevermind, one can do buf_reader.take(42).read_line() thanks to impl<T: BufRead> BufRead for Take<T> (we should add some hint in the doc)

trait BufReadExt: BufRead2 {
    fn read_until(&mut self, delimiter: u8, buf: &mut Vec<u8>) -> IoResult<Option<NonZeroUsize>>;
    fn read_line(&mut self, buf: &mut String) -> IoResult<Option<NonZeroUsize>>;
    fn split(self, delimiter: u8) -> Split<Self>
    where
        Self: Sized;
    fn lines(self) -> Lines<Self>
    where
        Self: Sized;

@YuhanLiin
Copy link
Contributor Author

It seems like your suggestion is to add a version of the BufRead methods that return Result<Option<...>>. I do believe that it would be an improvement over the existing API and remove the need for has_data_left, but it's also a significant addition that warrants a RFC. You should bring up BufReadExt to the internal.rust-lang forum first and mention this issue in your post. Also, BufRead2 is not necessary since you can just implement BufReadExt in terms of BufRead.

@ratulb
Copy link

ratulb commented Dec 21, 2021

What is the point of adding this API - if this piggy backs on an already exposed API? Anyone can call fill_buf and find the size of the returned buffer length!

If the API could figure out before calling fill_buf if the underlying Read has already been exhausted - and return that - that would be a real use case.

@YuhanLiin
Copy link
Contributor Author

Without this API you'd need to need to call .fill_buf().is_empty(). It's not obvious that this expression is the right way to check if a reader is at EOF, so the main advantage of the new API is discoverability and clarity. Writing and reading .has_data_left() is a clearer way to convey the programmer's intent.

As for figuring out if a reader has been exhausted without calling fill_buf, I'm not sure if that's possible in general. Maybe with Seek, but you'd need even more syscalls.

@ratulb
Copy link

ratulb commented Dec 22, 2021

fill_buf blocks on a BufReader created from a TcpStream - after 'whole data' is consumed and fill_buf is called again. In that scenario, this API might not be useful.

The problem is we don't know how long is the 'whole data' up front - which is where we need has_data_left to tell us if there is any more data left without blocking.

@YuhanLiin
Copy link
Contributor Author

In the case of TCP, I don't think what you're proposing is possible. TCP connections terminate by sending a FIN flag, so if you want to know if there's any more data left you'll need to wait for the last packet.

@ratulb
Copy link

ratulb commented Dec 24, 2021

In that case - has_data_left might not behave the same way for TCP and File IO.

@YuhanLiin
Copy link
Contributor Author

For files, EOF is reached when the file pointer is at the end of the file and nothing can be read. In that case, has_data_left reads nothing and returns false. For TCP streams, EOF means that the connection has terminated. In that case, has_data_left detects that the TCP connection has terminated and returns false. The thing with TCP is that the other side needs to shut down the connection explicitly for it to count as an EOF, so the definition of EOF is slightly different from a file. This means has_data_left will behave differently as well, which is a good thing because it's detecting EOFs in both cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io Area: std::io, std::fs, std::net and std::path C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants