Code Review Asked by CAD97 on October 27, 2021
As part of my prep for the Code Review community-challenge (which looks like it will be Write your own programming language), I’ve been working on an LL(1) lexer generator in Rust.
Yes, lexer generators exist already, and full-on lexer/parser generators as well. My time might’ve been better spent building a Rust target for ANTLR (as that’s the tool I know best already).
Right now though, I’m just looking at my model for transitions in the finite automaton that gets built up to turn the regex-ish into an actual lexer. Because the lexer generator I’m building is Unicode-first, a simple table of input symbols would be 0x110000
characters wide. That’s pretty much an instant veto. Instead, I’m using a enum that allows globbing of sets of characters into one transition:
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Transition {
Empty,
Character(char),
Range(char, char),
}
In the future, it will also support specifying blocks of characters by Unicode properties.
For the NFA model, just using the character and/or ranges specified by the user is fine; the ambiguity is inbuilt into the model anyway. But to make a lexer, I have to disambiguate different transitions, such that only one can be chosen at any time.
As a simple example, take the simple grammar:
If: "if" ;
Ident: [[:alpha:]][[:alnum:]]* ;
The NFA is trivial:
But the DFA requires allowing for the 'i'
and 'f'
transitions being subsets of the transitions that the rule Ident
follows.
To that end, and the main point of asking this review, I have a function Transition::disambiguate
which takes two transitions and returns a set of transitions that cover all combinations of those two transitions. So disambiguate( 'a'..='e', 'c' )
returns the set { 'a'..='b', 'c', 'd'..='e' }
.
use std::{char, fmt};
use std::cmp::{max, min};
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Transition {
Empty,
Character(char),
Range(char, char),
}
impl Transition {
pub fn overlaps(&self, other: &Transition) -> bool {
match (*self, *other) {
(Transition::Empty, Transition::Empty) => true,
(Transition::Empty, _) | (_, Transition::Empty) => false,
(Transition::Character(x), Transition::Character(y)) => x == y,
(Transition::Range(lo, hi), Transition::Character(c)) |
(Transition::Character(c), Transition::Range(lo, hi)) => lo <= c && c <= hi,
(Transition::Range(lo1, hi1), Transition::Range(lo2, hi2)) => hi1 >= lo2 || lo1 <= hi2,
}
}
}
impl From<(char, char)> for Transition {
fn from(pair: (char, char)) -> Transition {
let (lo, hi) = pair;
if lo == hi {
Transition::Character(hi)
} else {
Transition::Range(lo, hi)
}
}
}
impl From<char> for Transition {
fn from(char: char) -> Transition {
Transition::Character(char)
}
}
// TODO: move
fn before(c: char) -> char {
if c as u32 == 0 {
panic!("Tried to get character before '\0'")
}
let mut point = c as u32 - 1;
if point == 0xDFFF {
point = 0xD800 - 1;
}
char::from_u32(point).expect("Unexpected surrogate codepoint")
}
// TODO: move
fn after(c: char) -> char {
if c == char::MAX {
panic!("Tried to get character after char::MAX")
}
let mut point = c as u32 + 1;
if point == 0xD800 {
point = 0xDFFF + 1;
}
char::from_u32(point).expect("Unexpected surrogate codepoint")
}
impl Transition {
pub fn disambiguate(lhs: Transition, rhs: Transition) -> Vec<Transition> {
if !Transition::overlaps(&lhs, &rhs) {
return vec![lhs, rhs];
}
if lhs == rhs {
return vec![lhs];
}
match (lhs, rhs) {
(Transition::Character(_), Transition::Character(_)) |
(Transition::Empty, _) |
(_, Transition::Empty) => unreachable!(),
(Transition::Range(lo, hi), Transition::Character(c)) |
(Transition::Character(c), Transition::Range(lo, hi)) => {
assert!(lo <= c && c <= hi);
if lo == c {
vec![Transition::from(c), Transition::from((after(c), hi))]
} else if c == hi {
vec![Transition::from((lo, before(c))), Transition::from(c)]
} else {
assert!(lo < c && c < hi);
vec![
Transition::from((lo, before(c))),
Transition::from(c),
Transition::from((after(c), hi)),
]
}
},
(Transition::Range(lo1, hi1), Transition::Range(lo2, hi2)) => {
assert!((lo2 <= lo1 && lo1 <= hi2) || (lo1 <= lo2 && lo2 <= hi1));
let points = [
min(lo1, lo2),
max(lo1, lo2),
min(hi1, hi2),
max(hi1, hi2),
];
vec![
Transition::from((points[0], before(points[1]))),
Transition::from((points[1], points[2])),
Transition::from((after(points[2]), points[3])),
]
},
}
}
}
impl fmt::Display for Transition {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match *self {
Transition::Empty => write!(f, "ε"),
Transition::Character(c) => write!(f, "{:?}", c),
Transition::Range(low, high) => write!(f, "{:?}..={:?}", low, high),
}
}
}
#[cfg(test)]
#[cfg_attr(rustfmt, rustfmt_skip)]
mod test {
use super::Transition;
#[test]
fn disambiguate_char_range() {
assert_eq!(
Transition::disambiguate(
Transition::Character('c'),
Transition::Range('a', 'e'),
),
vec![
Transition::Range('a', 'b'),
Transition::Character('c'),
Transition::Range('d', 'e'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Character('a'),
Transition::Range('a', 'b'),
),
vec![
Transition::Character('a'),
Transition::Character('b'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Character('b'),
Transition::Range('a', 'b'),
),
vec![
Transition::Character('a'),
Transition::Character('b'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Character('a'),
Transition::Range('a', 'c'),
),
vec![
Transition::Character('a'),
Transition::Range('b', 'c'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Character('c'),
Transition::Range('a', 'c'),
),
vec![
Transition::Range('a', 'b'),
Transition::Character('c'),
]
)
}
#[test]
fn disambiguate_range_range() {
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'c'),
Transition::Range('c', 'e'),
),
vec![
Transition::Range('a', 'b'),
Transition::Character('c'),
Transition::Range('d', 'e'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'b'),
Transition::Range('b', 'c'),
),
vec![
Transition::Character('a'),
Transition::Character('b'),
Transition::Character('c'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'c'),
Transition::Range('c', 'd'),
),
vec![
Transition::Range('a', 'b'),
Transition::Character('c'),
Transition::Character('d'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'b'),
Transition::Range('b', 'd'),
),
vec![
Transition::Character('a'),
Transition::Character('b'),
Transition::Range('c', 'd'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'e'),
Transition::Range('b', 'd'),
),
vec![
Transition::Character('a'),
Transition::Range('b', 'd'),
Transition::Character('e'),
]
);
assert_eq!(
Transition::disambiguate(
Transition::Range('a', 'c'),
Transition::Range('b', 'd'),
),
vec![
Transition::Character('a'),
Transition::Range('b', 'c'),
Transition::Character('d'),
]
)
}
}
I have cut down fn disambiguate
a lot from its first iteration, and all of the tests I added continue to pass. I’m just afraid that I may be missing some corner cases in my tests, and that reading that mess of a match isn’t the easiest task. unreachable!()
is also a (weak) smell, since your code could potentially be structured such that you don’t need it.
Transition::Range
is an inclusive range and constructing a decreasing or size one range is an error in use.
Oh, and rustfmt.toml
:
fn_single_line = true
match_block_trailing_comma = true
wrap_match_arms = false
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP