JavaScript RegEx Syntax
I'm writing C# code to parse JavaScript tokens, and my knowledge of JavaScript is not 100%.
One thing that threw me is that JavaScript regular expressions are not enclosed in quotes. How does the parser detect where they start and end? It looks like they start with / but can contain any character after that.
Note that I'm not asking about the syntax needed to match certain characters, which is what all the results of my Google searches are about. I just want to know the rules for determining where a regular expression literal starts and where it ends.
I would consider the following regexp to be a reasonable approximation.
/(\\/|[^/])+/([a-zA-Z])*
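To see what "approximation" means in practice, here is a minimal sketch that runs that pattern through .NET's Regex class. The class name RegexApproximation and the method LooksLikeRegexLiteral are hypothetical names chosen for illustration, and the \A and \z anchors are an added assumption so that the whole input has to match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
static class RegexApproximation
{
    // The approximation from the text, anchored (an assumption) with \A and \z
    // so the entire input string must be one literal.
    private static readonly Regex Pattern =
        new Regex(@"\A/(\\/|[^/])+/([a-zA-Z])*\z");

    // Returns true if text matches the approximate pattern.
    public static bool LooksLikeRegexLiteral(string text)
    {
        return Pattern.IsMatch(text);
    }
}
```

With this sketch, "/ab+c/gi" and the escaped-slash case /a\/b/ are accepted. It differs from the formal rules in at least two ways: it rejects /[/]/ (an unescaped slash inside a character class), and it accepts a body that contains a line break.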
The rules are formally defined as follows:
(See 7.8.5 for every production below.)

    RegularExpressionLiteral ::
        / RegularExpressionBody / RegularExpressionFlags

    RegularExpressionBody ::
        RegularExpressionFirstChar RegularExpressionChars

    RegularExpressionChars ::
        [empty]
        RegularExpressionChars RegularExpressionChar

    RegularExpressionFirstChar ::
        RegularExpressionNonTerminator but not one of * or \ or / or [
        RegularExpressionBackslashSequence
        RegularExpressionClass

    RegularExpressionChar ::
        RegularExpressionNonTerminator but not \ or / or [
        RegularExpressionBackslashSequence
        RegularExpressionClass

    RegularExpressionBackslashSequence ::
        \ RegularExpressionNonTerminator

    RegularExpressionNonTerminator ::
        SourceCharacter but not LineTerminator

    RegularExpressionClass ::
        [ RegularExpressionClassChars ]

    RegularExpressionClassChars ::
        [empty]
        RegularExpressionClassChars RegularExpressionClassChar

    RegularExpressionClassChar ::
        RegularExpressionNonTerminator but not ] or \
        RegularExpressionBackslashSequence

    RegularExpressionFlags ::
        [empty]
        RegularExpressionFlags IdentifierPart
Here is some quick and dirty code that might get you started.
    using System.Collections.Generic;
    using System.Text;

    class CharStream
    {
        private readonly Stack<int> _states;
        private readonly string _input;
        private readonly int _length;
        private int _index;

        public char Current
        {
            get { return _input[_index]; }
        }

        public CharStream(string input)
        {
            _states = new Stack<int>();
            _input = input;
            _length = input.Length;
            _index = -1; // positioned before the first character
        }

        // Advances to the next character; returns false at end of input.
        public bool Next()
        {
            if (_index + 1 >= _length)
                return false;
            _index++;
            return true;
        }

        // Advances only if the next character is c.
        public bool ExpectNext(char c)
        {
            if (_index + 1 >= _length)
                return false;
            if (_input[_index + 1] != c)
                return false;
            _index++;
            return true;
        }

        public bool Back()
        {
            if (_index <= 0)
                return false;
            _index--;
            return true;
        }

        // Saves the current position so a failed production can rewind.
        public void PushState()
        {
            _states.Push(_index);
        }

        // Rewinds to the saved position and returns default(T) (null for string),
        // so a failing parser can write "return cs.PopState<string>();".
        public T PopState<T>()
        {
            _index = _states.Pop();
            return default(T);
        }

        // Forgets the saved position without rewinding (called on success).
        public void DropState()
        {
            _states.Pop();
        }
    }

    // Wrapper class added only so the static methods compile.
    static class RegexLiteralParser
    {
        public static string ParseRegularExpressionLiteral(CharStream cs)
        {
            string body, flags;
            cs.PushState();
            if (!cs.ExpectNext('/'))
                return cs.PopState<string>();
            if ((body = ParseRegularExpressionBody(cs)) == null)
                return cs.PopState<string>();
            if (!cs.ExpectNext('/'))
                return cs.PopState<string>();
            if ((flags = ParseRegularExpressionFlags(cs)) == null)
                return cs.PopState<string>();
            cs.DropState();
            return "/" + body + "/" + flags;
        }

        static string ParseRegularExpressionBody(CharStream cs)
        {
            string firstChar, chars;
            cs.PushState();
            if ((firstChar = ParseRegularExpressionFirstChar(cs)) == null)
                return cs.PopState<string>();
            if ((chars = ParseRegularExpressionChars(cs)) == null)
                return cs.PopState<string>();
            cs.DropState();
            return firstChar + chars;
        }

        // Zero or more RegularExpressionChar; never fails, may return "".
        static string ParseRegularExpressionChars(CharStream cs)
        {
            var sb = new StringBuilder();
            string @char;
            while ((@char = ParseRegularExpressionChar(cs)) != null)
                sb.Append(@char);
            return sb.ToString();
        }

        // The remaining productions are stubs left to be filled in.
        static string ParseRegularExpressionFirstChar(CharStream cs) { return null; }
        static string ParseRegularExpressionChar(CharStream cs) { return null; }
        static string ParseRegularExpressionBackslashSequence(CharStream cs) { return null; }
        static string ParseRegularExpressionNonTerminator(CharStream cs) { return null; }
        static string ParseRegularExpressionClass(CharStream cs) { return null; }
        static string ParseRegularExpressionClassChars(CharStream cs) { return null; }
        static string ParseRegularExpressionClassChar(CharStream cs) { return null; }
        static string ParseRegularExpressionFlags(CharStream cs) { return null; }
    }
As for how you find the end of the literal: the trick is to recursively follow the productions I have listed. Consider the production RegularExpressionBody. Simply reading the production tells me that it requires a RegularExpressionFirstChar followed by RegularExpressionChars. Notice how RegularExpressionChars is either [empty] or RegularExpressionChars RegularExpressionChar. In other words, it is defined in terms of itself. Once that production terminates with [empty], you know that the only valid character should be the closing /. If that is not found, it is not a valid literal.
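Because RegularExpressionChars is just "zero or more RegularExpressionChar", the recursion can also be flattened into a loop. The following is a minimal sketch under that assumption, not the parser from the code above: RegexLiteralScanner, FindEnd, IsLineTerminator and IsFlagChar are hypothetical names, the caller is assumed to have already decided that the / at start begins a regular expression rather than a division, and IsFlagChar only approximates IdentifierPart.

```csharp
using System;

// Hypothetical helper, not part of the original post.
static class RegexLiteralScanner
{
    // LineTerminator: LF, CR, U+2028, U+2029.
    static bool IsLineTerminator(char c)
    {
        return c == '\n' || c == '\r' || c == '\u2028' || c == '\u2029';
    }

    // Rough stand-in (an assumption) for IdentifierPart.
    static bool IsFlagChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_' || c == '$';
    }

    // Returns the index just past the literal (flags included) that starts at
    // source[start], or -1 if no valid literal starts there.
    public static int FindEnd(string source, int start)
    {
        if (start < 0 || start >= source.Length || source[start] != '/')
            return -1;

        int i = start + 1;
        // RegularExpressionFirstChar excludes '*' and '/'.
        if (i >= source.Length || source[i] == '*' || source[i] == '/')
            return -1;

        bool inClass = false; // true while inside [ ... ]
        for (; i < source.Length; i++)
        {
            char c = source[i];
            if (IsLineTerminator(c))
                return -1; // a literal cannot span lines

            if (c == '\\')
            {
                // RegularExpressionBackslashSequence: skip the escaped character.
                i++;
                if (i >= source.Length || IsLineTerminator(source[i]))
                    return -1;
            }
            else if (inClass)
            {
                // Inside a class only ']' is special; '/' and '[' are ordinary.
                if (c == ']')
                    inClass = false;
            }
            else if (c == '[')
            {
                inClass = true;
            }
            else if (c == '/')
            {
                // Closing slash found; consume RegularExpressionFlags.
                i++;
                while (i < source.Length && IsFlagChar(source[i]))
                    i++;
                return i;
            }
        }
        return -1; // input ended before the closing '/'
    }
}
```

For example, FindEnd("/ab+c/gi;", 0) stops at the semicolon, and a slash inside a character class, as in /[/]/, does not end the literal.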