Perl6regex on Perl5regex

### Note: This Document is a Draft

------------------
Introduction:

"Perl6regex-on-Perl5regex" is a "Perl 6 regex engine" that uses Perl 5 regexes to implement the matching, and Perl 5 code to implement the OO "Match" structure.

The implementation so far is compatible with Perl 5.8.8.


------------------
The compilation is implemented as follows:

- a regex Grammar is run on the Perl 6 regex source code, and returns an AST

- the AST is annotated for positional capture numbering, and for "capture to array" flags

- the Perl 5 regexes and the Perl 5 methods are emitted

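The three compilation steps can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the AST shape, the node types, and the annotate() and emit() helpers are hypothetical and are not the engine's actual data structures. The parsing step is skipped and the AST is written by hand.

```perl
use strict;
use warnings;

# Hypothetical hand-written AST for the Perl 6 regex  (a) [ (b) ]* c
# (the real engine gets its AST from a regex Grammar; this shape is assumed).
my $ast = [
    { type => 'capture', body => [ { type => 'literal', text => 'a' } ] },
    { type => 'star',    body => [
        { type => 'capture', body => [ { type => 'literal', text => 'b' } ] },
    ] },
    { type => 'literal', text => 'c' },
];

# Annotation pass: number the positional captures and flag the ones that
# sit under a quantifier, because those have to capture to an array.
# Simplification: the numbering does not restart inside nested captures.
sub annotate {
    my ( $nodes, $state, $quantified ) = @_;
    for my $node (@$nodes) {
        if ( $node->{type} eq 'capture' ) {
            $node->{position} = $state->{next}++;
            $node->{to_array} = $quantified ? 1 : 0;
        }
        annotate( $node->{body}, $state, $quantified || $node->{type} eq 'star' )
            if $node->{body};
    }
}

# Emit pass: produce Perl 5 regex source from the annotated AST.
sub emit {
    my ($nodes) = @_;
    my $out = '';
    for my $node (@$nodes) {
        if    ( $node->{type} eq 'literal' ) { $out .= quotemeta $node->{text} }
        elsif ( $node->{type} eq 'capture' ) { $out .= '(' . emit( $node->{body} ) . ')' }
        else                                 { $out .= '(?:' . emit( $node->{body} ) . ')*' }
    }
    return $out;
}

annotate( $ast, { next => 0 }, 0 );
my $source = emit($ast);    # '(a)(?:(b))*c'
```

After the annotation pass, the capture around "b" carries position 1 and the "capture to array" flag, while the capture around "a" carries position 0 and no flag.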

------------------
At runtime:

- while the regex is matching, it generates a linked list of operations

- the operation list is rolled back on backtracking.
  "Safe-backtracking" is implemented with "local" redeclarations inside the Perl 5 regex (see [1], [2]).

- after the match finishes, the operations are interpreted, and the result is a Match object.
  The interpreter is implemented as a subroutine in the Match class.

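The backtracking-safe operation list can be sketched with the "local" technique described in perlre. This is a minimal illustration only: the variable names ($ops, $done), the operation format, and the interpret() helper are assumptions, since the engine's real operations are not specified in this document.

```perl
use strict;
use warnings;

# Package variables, because local() does not work on "my" variables.
our $ops;     # head of the linked list: [ operation, rest-of-list ]
our $done;    # non-localized copy taken when the match succeeds

# Each successful (a) prepends one operation with local(). If the regex
# backtracks over that point, the local() is undone and the list is rolled
# back. Localized values are also unwound when the match ends, so the last
# code block copies the list into $done.
my $str     = 'aaab';
my $matched = $str =~ /
    ^
    (?{ $ops = undef })
    (?:
        (a)
        (?{ local $ops = [ { text => $^N, to => pos() }, $ops ] })
    )*
    ab
    (?{ $done = $ops })
/x;

# Interpret the finished list. It is stored newest-first, so unshift
# restores the order in which the operations were recorded.
sub interpret {
    my ($list) = @_;
    my @captures;
    while ($list) {
        my ( $op, $rest ) = @$list;
        unshift @captures, $op;
        $list = $rest;
    }
    return \@captures;
}

my $captures = interpret($done);
```

The group first matches three "a" characters, then has to give one back so that "ab" can match. The operation recorded for the third "a" disappears with the backtrack, and interpret() sees only two operations.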

------------------
The operations mini-language is implemented like this:

op-list
... TODO ...

------------------
Differences from the Perl 6 specification

* <after> only matches fixed-width patterns,
  because that's how Perl 5 "(?<=pattern)" works.
  There is no fix for this problem yet.

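The lookbehind limitation comes from the Perl 5 engine itself and can be observed by compiling patterns at run time. In this small sketch the compiles() helper is hypothetical. The error text differs between Perl versions, so only success or failure is checked.

```perl
use strict;
use warnings;

# Returns 1 if the pattern source compiles as a Perl 5 regex, 0 otherwise.
# The pattern is interpolated so that it is compiled at run time, inside eval.
sub compiles {
    my ($source) = @_;
    my $ok = eval { my $re = qr/$source/; 1 };
    return $ok ? 1 : 0;
}

my $fixed    = compiles('(?<=ab)c');    # fixed-width lookbehind: accepted
my $variable = compiles('(?<=a+)c');    # unbounded lookbehind: rejected
```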
------------------
Fixable Differences from the Perl 6 specification

* <?after> and <?before> do not create a lexical scope:
  this means that <?before (.) > wrongly does a positional capture.
  This is fixable by adding a discard_capture operation.

* return() in blocks doesn't cause the regex to succeed, and doesn't terminate the regex.
  The Perl 5.10 version should use (*ACCEPT).

* The $/ inside regex closures is a copy of the matching $/.
  This means that modifying $/ inside a closure does not modify the match.
  This can be fixed with some magic in the Match class.

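Two of the points above can be seen in plain Perl 5. The first match shows that a group inside a lookahead still sets a positional capture, which is the behaviour that leaks into <?before (.) >. The second is the (*ACCEPT) example from perlre and needs Perl 5.10 or later. How the engine would emit (*ACCEPT) for return() is not decided in this document, so this only shows the underlying construct.

```perl
use strict;
use warnings;

# A capturing group inside a lookahead still fills $1 in Perl 5.
my ($leaked) = 'xyz' =~ /(?=(.))xy/;

# (*ACCEPT) ends the match successfully at that point (Perl 5.10+).
# Pattern taken from the perlre documentation: the match succeeds even
# though "D" and "E" are never matched.
my @groups = 'AB' =~ /(A(A|B(*ACCEPT)|C)D)(E)/x;
```

Here $leaked holds "x", and @groups holds "AB", "B" and an undefined third element, because the third group never took part in the match.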
------------------
TODO list:

* longest-token and multi-regex

* identify possible perl5.8 bugs that could justify requiring perl5.10

* regexes inside code blocks may have side effects on the enclosing regex; this needs further testing

* the Match class needs some tweaks to follow the MOP calling convention better
** hash, array, from, to should be Perl 6 objects; autoboxing can fix that

* backtracking controls; token/rule/regex

* the $_ and $/ scopes need to be fixed

* in order to support Matcher methods, OUTER::<$/> needs to be implemented

* rule/subrule parameters

* the way inheritance works right now is by eval'ing the regex variable in the grammar's namespace;
  this is supposed to be refined later

* calling subrules in other grammars
** there should probably be a method that returns the regex, because directly accessing the $_regex_name variable doesn't work with inheritance.
** code blocks should probably be installed as methods, because regexes are inlined as strings, which breaks lexical scoping, package names, and inheritance.

* <at()>

* infix:<~~>

* variable interpolation

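The idea of exposing each regex through a method, so that subrule lookup follows ordinary method resolution, could look roughly like this. The package and method names are hypothetical, and this is not how the engine currently stores its regexes.

```perl
use strict;
use warnings;

package My::Grammar::Base;
sub new    { return bless {}, shift }
sub number { return qr/\d+/ }

# A rule that uses a subrule asks the object for it, so an override in a
# subclass is picked up by normal method dispatch.
sub pair {
    my ($self) = @_;
    my $number = $self->number;
    return qr/$number,$number/;
}

package My::Grammar::Hex;
our @ISA = ('My::Grammar::Base');
sub number { return qr/[0-9a-f]+/ }

package main;

my $base_pair = My::Grammar::Base->new->pair;
my $hex_pair  = My::Grammar::Hex->new->pair;

my $base_ok = 'ff,10' =~ /^$base_pair$/ ? 1 : 0;    # decimal digits only
my $hex_ok  = 'ff,10' =~ /^$hex_pair$/  ? 1 : 0;    # inherited pair(), overridden number()
```

The inherited pair() rule matches "ff,10" only through the Hex grammar, because the subrule is resolved on the object and not through a package variable.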
------------------
Blogs:

http://pugs.blogs.com/pugs/2007/07/perl6-regex-on-.html

------------------
References:

[1] http://www.justatheory.com/computers/programming/perl/regex_named_captures.html

[2] perldoc perlre