Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonical Rust grammar distinct from parser (tracking issue for RFC #1331) #30942

Closed
nikomatsakis opened this issue Jan 15, 2016 · 56 comments
Closed
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools A-grammar Area: The grammar of Rust B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. P-medium Medium priority

Comments

@nikomatsakis
Copy link
Contributor

This is the tracking issue for rust-lang/rfcs#1331, which specifies a procedure for creating a canonical grammar apart from the compiler. This is a multi-phase process. I think the first step, honestly, is just to lay out a firm plan of how to proceed -- what kind of automatic testing to use and so forth. The RFC provides a general plan but it needs to be made more concrete. Once this issue has an owner, I (or they) can update this summary and try to keep it up-to-date.

@steveklabnik steveklabnik added A-docs B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. labels Jan 15, 2016
@nagisa nagisa added the A-grammar Area: The grammar of Rust label Jan 18, 2016
@steveklabnik
Copy link
Member

@nagisa , as it was your RFC, do you have opinions on the strategy we take here? Or should I be thinking about this.

EDIT: @nagisa mentioned on IRC that they are too busy, so I'll come up with aplan.

@SimonSapin
Copy link
Contributor

I’m curious what kind of testing can check that a grammar matches the not-grammar-based parser.

@steveklabnik
Copy link
Member

Well, it can show the presence of bugs, but not the absence of them.

https://github.com/rust-lang/rust/blob/master/mk/grammar.mk is what we used to do. Just vestigial at this point :)

@matklad
Copy link
Member

matklad commented Jan 27, 2016

I’m curious what kind of testing can check that a grammar matches the not-grammar-based parser.

In intellij-rust, we have regression tests for the parser, which consist of a Rust file and a serialized AST. Then we check that the parser produces the expected AST. We also have some negative "you shall not parse" tests. The same technique can be applied to the reference grammar, if we ensure that the serialized AST is sufficiently abstract and doesn't leak grammar "implementation details".

The question here is what tool should be used for the reference grammar? I think it has to goals:

  1. Create two rust parsers, which can be verified against each other.
  2. Make it easier to learn and reason about Rust grammar.

One option is to continue to use Bison. In my opinion, it doesn't fulfill the second goal: grammar is difficult to read because of large amount of duplication and low level details. Regular expression extensions would be really nice to have in the canonical grammar.

Another extreme option is to build a custom parser generator (LALRPOP ?) :) It would have a nice side effect of making Rust more suitable for efficiently implementing programming languages: a realm currently occupied by C/C++.

@nikomatsakis
Copy link
Contributor Author

I have opinions but not much time. I've been interesting in porting the Rust grammar to LALRPOP -- I think it'd be much cleaner, as LALRPOP has a number of features that should allow us to avoid some of the abuse of precedence declarations and the like. (I found the existing grammar not very useful in evaluating syntactic changes because of those precedence declarations, for example.)

(As an aside, this would probably be a good stress test for LALRPOP, I assume we'd have to fix various bugs or add new features to scale it up that large.)

@nikomatsakis
Copy link
Contributor Author

So this morning I started porting the yacc grammar in the repository to LALRPOP, just to see what would happen. I made some progress. But I was wondering: is that grammar the most up to date version, or is the one in the repo more up to date? Does anyone know what the differences are between them?

@fhahn
Copy link
Contributor

fhahn commented Jan 28, 2016

I was thinking about working on that as well, but I unfortunately don't have too much
time in the next month to work on this. What do you think about creating a
repo to collaborate? (I could take care of that as well)
On Jan 28, 2016 18:51, "Niko Matsakis" [email protected] wrote:

So this morning I started porting the yacc grammar in the repository to
LALRPOP, just to see what would happen. I made some progress. But I was
wondering: is that grammar the most up to date version, or is the one in
the repo more up to date? Does anyone know what the differences are between
them?


Reply to this email directly or view it on GitHub
#30942 (comment).

@nagisa
Copy link
Member

nagisa commented Jan 28, 2016

The (parser-lalr) grammar originally comes from https://github.com/bleibig/rust-grammar, which hasn’t been updated in a while.

@nikomatsakis
Copy link
Contributor Author

@fhahn

I was thinking about working on that as well, but I unfortunately don't have too much time in the next month to work on this. What do you think about creating a repo to collaborate? (I could take care of that as well)

I will do this. I am right now just working through some LALRPOP issues that arose (I'm not sure if it's a bug or if LALRPOP just needs optimization; I suspect both :) but once I get that sorted out a bit I will open up a repo and post the link here. Hopefully tomorrow. It seems @TyOverby may be interested as well, and no doubt others. For example, I was talking to @jorendorff, who developed https://github.com/jorendorff/rust-grammar/, as well.

@nikomatsakis
Copy link
Contributor Author

OK, I put up a definite work in progress here: https://github.com/nikomatsakis/rustypop

I also described some conventions that I am aiming towards, but have not yet achieved.

This is hard work to parallelize in the early stages, but if you're interested in hacking on it, let me know over IRC or what have you (or just open some PRs). I'll probably alternate between working on it and improving LALRPOP (for example, I am now highly motivated to get a better printout on shift-reduce errors). =)

@glaebhoerl
Copy link
Contributor

If we can generate a parser from the official grammar, and if we can run rustc in parse-only mode, and if we can make the two of them produce output and errors in identical format, then I wonder if we could use AFL to automatically generate test cases for them. Of course even this could not prove that they match, but it might give a higher level of assurance than a relatively smaller number of ad hoc human-devised test cases.

@eternaleye
Copy link
Contributor

@glaebhoerl: Well, there's also that the entire formalism the official grammar is based on on is called "generative grammars" - using them to create working parsers came after using them to create exemplars, which is essentially a matter of transforming the grammar into a tree and executing a depth-first search.

For example:

start := possibility start?
possibility := light | neutral | dark
light := "Solitari" | "Gandalf"
neutral := "Lunitari" | "Switzerland"
dark := "Nuitari" | "Chocolate"

We form the following tree:

  • start
    • possibility
      • light
        • "Solitari"
        • "Gandalf"
      • ...
    • possibility start
    • ...

With that, we then walk the tree, generating each possible string covered by the grammar.

One can do this efficiently, without actually generating the tree, by assigning each optional subrule a bit, treating the collection of those bits as a number, and counting.

Enabling a subrule will sometimes reveal another optional; in that case, push the current "number" onto a queue, and when you've finished counting the "basic optionals", pop items off the queue and count the newly-revealed optionals again, adding them to the queue if they reveal more.

That ensures one will generate every rule, in their shortest exemplars, before continuing on to the next shortest, etc.

@nikomatsakis
Copy link
Contributor Author

OK, so there's probably something horribly wrong with it, but the rustypop crate now builds without any shift/reduce conflicts. I haven't actually tried RUNNING the code it produces, of course, and I have all empty actions, so it will only yield a "true/false" result. Plus I need to adapt the rustc tokenizer. But it seems like progress. :)

Enough progress that it may be possible to start parallelizing the work (debugging shift/reduce conflicts is kind of a serial task...).

Of course, I also expect that as soon as we try using this grammar we'll find that I did some bone-headed things that resolved all conflicts by just not parsing anything at all, or something like that.

Anyway, I plan to write up a blog post about the approach I took, since it made use of a number of LALRPOP features to try and pare down duplication. I also took the liberty of ripping out various bits of obsolete syntax from the existing rust-parser.y, and made a few arbitrary judgement calls about dubious bits of our grammar. ;)

@nikomatsakis
Copy link
Contributor Author

Oh, I should mention that it generates a 500MB .rs file. Definitely need to do some work reducing the size of LALRPOP's output (@fhahn has actually been pursuing the most immediately obvious strategy, and I've been meaning to implement various other cool optimizations that I've read about...)

@nikomatsakis
Copy link
Contributor Author

Hmm, now I see some more conflicts. So maybe premature. But still, getting close I think. :)

@matklad
Copy link
Member

matklad commented May 17, 2016

Another tool which can be used for the canonical grammar is antlr4. I have not used it myself, but I think it should be mentioned in this thread.

@nagisa
Copy link
Member

nagisa commented May 17, 2016

I believe we used to use antlr4 before the yacc-based grammar got merged.

@matklad
Copy link
Member

matklad commented May 17, 2016

@jorendorff
Copy link
Contributor

I used antlr4 to make https://github.com/jorendorff/rust-grammar and it has a few drawbacks:

  • antlr4 is too lax about ambiguity: if two productions match a program, antlr simply selects whichever production appeared first in the grammar. So my grammar is almost certainly full of ambiguities I don't know about.
  • antlr4 doesn't support lookahead except by a gruesome hack
  • There are places where we really want a production to take boolean arguments, instead of having two productions, like assign_expr and assign_expr_no_struct (the production for if-statements is IF expr_no_struct block to force parentheses around extremely strange conditions like if Range { start: a, stop: b }.count() > 0 { do_something() }). antlr4's support for arguments to productions is unusable, which is why my grammar has a second complete copy of the expression grammar tagged with _no_struct.

@DemiMarie
Copy link
Contributor

@jorendorff I suggest using a macro preprocessor or build step to reduce some of the code duplication.

@alexcrichton
Copy link
Member

cc #15880 (we'll want a bot for this)

@matklad
Copy link
Member

matklad commented Sep 17, 2016

Hey, and what about lexer / parser split?

Perhaps we should create a canonical lexical structure grammar before jumping onto the grammar for the whole language?

cc @dns2utf8

@nagisa
Copy link
Member

nagisa commented Sep 17, 2016

@matklad I’d consider formalising lexer an inherent part of grammar formalisation.

That being said, unlike the grammar, which has been extended significantly over time, lexer has stayed considerably constant (i.e. is a different problem space) and the reference is still pretty good at capturing the lexical structure of the language.

@matklad
Copy link
Member

matklad commented Sep 17, 2016

I’d consider formalising lexer an inherent part of grammar formalisation.

The point is that lexer can be formalized before the rest of the grammar, so it is a good independent first step. Having an executable semi declarative specification of the lexer would help for technical reasons:

  • We will be able to add grammar testing infrastructure and make sure that it is executed during tests.
  • Having a tool which takes a rust file and produces tokens and spans will be useful for creating the grammar itself, because you wouldn't need to implement your own lexer. And it is important, because if you test your grammar against a large corpus of Rust code, you'll hit a lot of lexer bugs with a hand written lexer.

is still pretty good at capturing the lexical structure of the language.

It can be better though! There are some corner cases like 16 >> 2 vs collect::<Vec<_>>, raw string literals, self::foo::bar vs self::foo ::bar vs self ::foo::bar. Also, the reference is not executable, so it must have some bugs (like this one)

@dns2utf8
Copy link
Contributor

I would like to create a complete grammar first.
If the src/grammar/*.g4 files covered everything we were able to run a generated lexer against the tests and spot cases where one of them accepts where the other does not and find differences.

I am currently at RustFest.eu if somebody is here too and would like to talk a little about it.

@Mark-Simulacrum Mark-Simulacrum removed the C-feature-request Category: A feature request, i.e: not implemented / a PR. label Jul 27, 2017
kennytm added a commit to kennytm/rust that referenced this issue Oct 9, 2017
Update grammar to parse current rust syntax

Mainly addressing rust-lang#32723. This PR updates the bison grammar so that it can parse the current rust syntax, except for feature-gated syntax additions. It has been tested with all the tests in run-pass.

The grammar in this repo doesn't have build logic anymore, but you can test it out in https://github.com/bleibig/rust-grammar, which has all of what's in this PR. If you are interested in having build logic and grammar tests again, I can look into implementing that as well.

I'm aware that things are somewhat undecided as to what an official rust grammar should be from the discussion in rust-lang#30942. With this PR we can go back to having an up-to-date flex/bison based grammar, but the rustypop grammar looks interesting as well.
@willy610
Copy link

Hello

Each open source jungle is difficult in the beginning.
Therefore, the following posts may be placed in the wrong place. Someone can certainly correct this to the right forum.

Well well

I lack a grammar for Rust. For users. There are fragments in some documents but no cohesive.
The grammar does not have to be the basis for compilation but should help to understand the structure of the language RUST.

I am 70 years old and have used RUST for a few months and have done a tool

https://github.com/willy610/bnf2railroad

who reads a grammar in EBNF style and produces so-called railroad views. The views are raw TTY, html with anchors and svg.

I have always thought that these railroad views have been a good complement to other documentation such as examples and small projects.
My long experience of C, Smalltalk, SQL, Mathematica, JS and Java has of course helped me to get RUST relatively quickly. RUST is a very ambitious and radical approach to programming.

Upon learning, I have interpreted the grammar files contained in RUST documentation. They are a bit incomplete and do not have the quality that other languages ​​offer. Much is undefined.

Together with the tool there are grammar for Pascal, Smalltalk, Lua, Json, rege and fragments of RUST.

The working method could be:

  1. Create an EBNF file for a section to be documented
  2. Produce output in the form of railroad
  3. Finally, run all the files in a scan to get complete documentation
  4. If necessary, do an analysis using the tool for undefined, unused and duplicate

I have also used the tool to generate better help information for using the tool (parameters of the program)

Perhaps the tool could be useful internally for developers of RUST for, for example, specification, bug report etc

@nikomatsakis
Copy link
Contributor Author

@harpocrates I'm curious, what is the status of your rust parser? @matklad and I have been talking about trying to get a better effort going here. The rough plan is to start with your grammar (probably converted to LALRPOP and then perhaps clean it up). Also, would you be interested in being involved in any such effort?

@harpocrates
Copy link
Contributor

@nikomatsakis my rust parser is complete. I would be interested in helping out with any effort to convert to LALRPOP. What would be the best way to proceed?

Note that converting to LALRPOP may not be completely trivial. My grammar is currently written for Happy (a Haskell parser generator) and makes use of a handful of Happy features, namely:

  • parameterized productions (e.g. to reduce the boilerplate of defining many very similar productions for the different classes of expressions in Rust)
  • pushing back tokens (e.g. a >> token in something like Vec<Vec<i32>>, gets split into > and >)

@nikomatsakis
Copy link
Contributor Author

@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that.

@nikomatsakis
Copy link
Contributor Author

nikomatsakis commented Mar 29, 2018

@harpocrates

Note that converting to LALRPOP may not be completely trivial. My grammar is currently written for Happy (a Haskell parser generator) and makes use of a handful of Happy features, namely:

LALRPOP supports parameterized productions. It does not support pushing back tokens, but there is an alternative way to handle >> in any case. (You distinguish "> followed by >" as one token and "> followed by something else" as another kind of token.)

The trickier thing will be precedence, since LALRPOP does not support those sorts of declarations. We ought to be able to refactor the grammar though (I actually converted the existing .y grammar once already, in rustypop, but I figured I would rather start with your grammar since it seems far more complete and better tested.)

The first thing I had planned to do, in any case, is to write a tool to convert your grammar into LALRPOP syntax, and see how LALRPOP feels about it. The next priority would be a testing harness. It's rather late here so I'll have to write more later but I'd love to collaborate, particularly since my personal time is quite limited.

Separately, @matklad and I have been talking about extending LALRPOP with a way to generate default actions that will fire off events suitable for building a generic parse tree, roughly as described in their RFC.

@willy610
Copy link

willy610 commented Mar 30, 2018 via email

@willy610
Copy link

willy610 commented Mar 31, 2018 via email

@ehuss
Copy link
Contributor

ehuss commented Mar 31, 2018

You may also be interested in following rust-lang/reference#221. The people working on the reference have been making great progress, and the up-to-date grammar (at https://brauliobz.github.io/rust-reference/grammar.html) appears to be getting closer to complete.

@nikomatsakis
Copy link
Contributor Author

@willy610

The status of my product is stable and no bigger issues more than proper UNICOE/UTF processing remains.

Sorry for disappearing! After writing those messages, I got totally overwhelmed (and this week i'm actually on vacation, so I'll probably be slow to reply again.) I'm a bit confused about which project you are referring to -- do you mean the railroad diagrams you referenced here? If so, that seems like a cool visualization technique, but presumably it requires an EBNF grammar to start? At the moment, that last part is what I am most interesting in obtaining; note though that if we had a working LALRPOP grammar, it "desugars down" to a plain CFG internally, so we ought to be able to use your tool to visualize it.


@ehuss

The people working on the reference have been making great progress, and the up-to-date grammar...appears to be getting closer to complete

That's great! Is it presently being tested? And, if so, how?

@willy610
Copy link

willy610 commented Apr 18, 2018 via email

@nagisa
Copy link
Member

nagisa commented Apr 19, 2018 via email

@ehuss
Copy link
Contributor

ehuss commented Apr 19, 2018

That's great! Is it presently being tested? And, if so, how?

I don't know, I don't think there is anything formal set up. I think @brauliobz is doing most of the work, and he once mentioned that he was using Antlr4 to test.

@harpocrates
Copy link
Contributor

@nikomatsakis I'm sorry for the long overdue response.

@harpocrates looking at your parser, I was curious, how have you tested it? I see a few tests in the repository, but I was wondering if you tested against e.g. the sources of the Rust compiler, or crates.io, or something like that.

I have a script that automatically scrapes large files from repos under the rust-lang organization. I use rustc -Z ast-json -Z parse-only as an oracle to tell me if my parser's output is correct. Any difference in AST outputs causes tests to fail. Last time I ran the script, it collected (and ran tests) on upwards of a million lines of Rust code.

The next priority would be a testing harness. It's rather late here so I'll have to write more later but I'd love to collaborate, particularly since my personal time is quite limited.

I should have more free time in the coming months and I'd be happy to help on whatever you guys need the most help on (porting over a concrete grammar, generally improving LALRPOP, etc.)

@steveklabnik
Copy link
Member

Triage: we have a grammar.md, but the intention is to move it into the reference.

@steveklabnik
Copy link
Member

Triage: we have a grammar WG working on the grammar now, and their work will end up in the reference.

Is this issue still worth keeping open?

@pmatos
Copy link
Contributor

pmatos commented Jan 9, 2019

@steveklabnik excellent. is there a way to keep up with how things are going with regards to the work the grammar wg is doing?

@glaebhoerl
Copy link
Contributor

@DevQps
Copy link
Contributor

DevQps commented Aug 8, 2019

@steveklabnik Since there is now a dedicated repository and nobody has responded since January 8, I guess we can close this issue now right?

@steveklabnik
Copy link
Member

Yep!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools A-grammar Area: The grammar of Rust B-RFC-approved Blocker: Approved by a merged RFC but not yet implemented. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. P-medium Medium priority
Projects
None yet
Development

No branches or pull requests