PARSER_DOC written by Mitchell Foral

Overview:
I will assume the reader has a decent knowledge of how Ragel works and of
Ragel syntax. If not, please review the Ragel manual found at:
http://research.cs.queensu.ca/~thurston/ragel/

All parsers must at least:
* Call a callback function when a line of code is parsed.
* Call a callback function when a line of comment is parsed.
* Call a callback function when a blank line is parsed.
Additionally, a parser can call the callback function with the position of
each entity parsed.

Take a look at c.rl, and keep it open for reference while reading this
document, to better understand how parsers work and how to write one.

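As a rough illustration of the three required callbacks, the sketch below uses
one callback function that tallies lines by kind. The callback signature and
the 'lcode', 'lcomment', and 'lblank' strings are the ones used by the parse_c
function in c.rl; the line_tally struct, the tally variable, and the
count_lines name are assumptions made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tally of line kinds; not part of the parser sources. */
struct line_tally {
  int code;
  int comment;
  int blank;
};

/* A single global tally keeps the callback signature unchanged. */
static struct line_tally tally;

/* Has the signature of the callback passed to parse_c(). The start and end
 * offsets are ignored because this callback only counts lines. */
static void count_lines(const char *lang, const char *entity,
                        int start, int end) {
  (void)lang; (void)start; (void)end;
  assert(entity != NULL);
  if (strcmp(entity, "lcode") == 0) tally.code++;
  else if (strcmp(entity, "lcomment") == 0) tally.comment++;
  else if (strcmp(entity, "lblank") == 0) tally.blank++;
}
```

A function like count_lines would be passed as the callback argument of a
parser function with the count flag set.
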
Writing a Parser:
First create your parser in ext/ohcount_native/ragel_parsers/. Its name
should be the language you are parsing with a '.rl' extension. You will not
have to manually compile any parsers, as the Rakefile does this automatically
for you. Every parser must have the following at the top:

/************************* Required for every parser *************************/
#ifndef RAGEL_C_PARSER
#define RAGEL_C_PARSER

#include "ragel_parser_macros.h"

// the name of the language
const char *C_LANG = "c";

// the language's entities
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

// constants associated with the entities
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

/*****************************************************************************/

And the following at the bottom:

/************************* Required for every parser *************************/

/* Parses a string buffer with C/C++ code.
 *
 * @param *buffer The string to parse.
 * @param length The length of the string to parse.
 * @param count Integer flag specifying whether or not to count lines. If yes,
 *   uses the Ragel machine optimized for counting. Otherwise uses the Ragel
 *   machine optimized for returning entity positions.
 * @param *callback Callback function. If count is set, callback is called for
 *   every line of code, comment, or blank with 'lcode', 'lcomment', and
 *   'lblank' respectively. Otherwise callback is called for each entity found.
 */
void parse_c(char *buffer, int length, int count,
  void (*callback) (const char *lang, const char *entity, int start, int end)
  ) {
  init

  %% write init;
  cs = (count) ? c_en_c_line : c_en_c_entity;
  %% write exec;

  // if no newline at EOF; callback contents of last line
  if (count) { process_last_line(C_LANG) }
}

#endif

/*****************************************************************************/

(Your parser will go between these two blocks.)

The code can be found in the existing c.rl parser. You will need to change:
* RAGEL_[lang]_PARSER - Replace [lang] with your language name. So if you
  are writing a C parser, it would be RAGEL_C_PARSER.
* [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
  the name of your language to parse. [lang] is your language name. For C it
  would be C_LANG.
* [lang]_entities - Set the variable name to be [lang]_entities (e.g.
  c_entities). The value is an array of the entity names your language has.
  For example, C has comment, string, number, etc. entities. You should
  definitely have "space" and "any" entities. "any" entities are typically
  used for entity machines (discussed later) and match any character that
  is not recognized so the parser does not do something unpredictable.
* enum - Change the values of the enum to correspond with your entities, in
  the same order, so that if in your parser you look up
  [lang]_entities[ENTITY], you will get the associated entity's string name.
* parse_[lang] - Set the function name to parse_[lang] where again, [lang]
  is the name of your language. In the case of C, it is parse_c.
* [lang]_en_[lang]_line - The line counting machine.
* [lang]_en_[lang]_entity - The entity machine.

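Because the enum constants are used as indexes into the entities array, the
two must stay in the same order and have the same length. The sketch below
shows one possible way to guard against a mismatch at compile time. The
language "x", the X_ENTITY_COUNT sentinel, and the x_entity_name helper are
hypothetical and are not required by the parser framework.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical entity table for an imaginary language "x". */
static const char *x_entities[] = {
  "space", "comment", "string", "any"
};

/* X_ENTITY_COUNT is an extra sentinel added for this sketch only. */
enum {
  X_SPACE = 0, X_COMMENT, X_STRING, X_ANY,
  X_ENTITY_COUNT
};

/* Compile-time check that the table and the enum have the same length:
 * the array size becomes -1 (a compile error) if they differ. */
typedef char x_entities_match_enum[
  (sizeof(x_entities) / sizeof(x_entities[0]) == X_ENTITY_COUNT) ? 1 : -1];

/* Looks up the string name of an entity constant. */
static const char *x_entity_name(int entity) {
  assert(entity >= 0 && entity < X_ENTITY_COUNT);
  return x_entities[entity];
}
```

With this in place, adding a name to the array without adding a matching enum
constant (or the reverse) stops the build instead of silently mislabeling
entities.
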
You may be asking why you have to rename variables and functions. Parsers are
header files, and if variables in different header files have the same name,
the compiler complains. Also, when you have languages embedded inside each
other, any identifiers with the same name can easily be mixed up. It is also
important to prefix your Ragel definitions with your language name to avoid
conflicts with other parsers.

Additional variables available to parsers are in the "ragel_parser_macros.h"
file. Take a look at it and try to understand what the variables are used for.
They will make more sense later on.

Now you can define your Ragel parser. Name your machine after your language,
'write data', and include 'common.rl', a file with common Ragel definitions,
actions, etc. For example:
  %%{
    machine c;
    write data;
    include "common.rl";

    ...
  }%%

Before you begin to write patterns for each entity in your language, you need
to understand how the parser should work.

Each parser has two machines: one optimized for counting lines of code,
comments, and blanks; the other for identifying entity positions in the
buffer.

Line Counting Machine:
This machine should be written as a line-by-line parser for multiple lines.
This means you match any combination of entities except a newline up until
you do reach a newline. If the line contains only spaces, or nothing at all,
it is blank. If the line contains only a comment, optionally preceded by
spaces, it is a comment. If the line contains anything other than a comment
after the spaces (if there are any), it is a line of code. You will do this
using a Ragel scanner.
The callback function will be called for each line parsed.

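The classification rule above can be sketched in plain C, without Ragel, for a
single line. This is only an illustration of the rule: it assumes a made-up
language in which a comment starts with "//" and runs to the end of the line.
The line_kind enum and the classify_line function are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum line_kind { LINE_BLANK, LINE_COMMENT, LINE_CODE };

/* Classifies one line (given without its newline). Assumes, for
 * illustration only, that a comment starts with "//". */
static enum line_kind classify_line(const char *line) {
  assert(line != NULL);
  /* Skip leading whitespace. */
  while (*line != '\0' && isspace((unsigned char)*line)) line++;
  if (*line == '\0') return LINE_BLANK;       /* nothing but spaces */
  if (line[0] == '/' && line[1] == '/') return LINE_COMMENT;
  return LINE_CODE;                           /* anything else is code */
}
```

Note that a line with code followed by a trailing comment is classified as
code, because the first non-space entity on the line decides its kind.
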
Scanner Parser Structure:
A scanner parser will look like this:
  [lang]_line := |*
    entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
    entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
    ...
    entityn ${ entity = ENTITYN; } => [lang]_ccallback;
  *|;
(As usual, replace [lang] with your language name.)
Each entity is the pattern for an entity to match, the last one typically
being the newline entity. For each match, the entity variable is set to a
constant defined in the enum, and the main action is called (you will need
to create this action above the scanner).

As soon as you can tell whether a line is code or comment, you should call
the appropriate 'code' or 'comment' action defined in common.rl. It is not
necessary to worry about whether or not these actions are called more than
once for a given line; the first call to either sets the status of the line
permanently. Sometimes you cannot call 'code' or 'comment' for one reason or
another. Do not worry, as this is discussed later.

When you reach a newline, you will need to decide whether the current line
is a line of code, comment, or blank. This is easy. Simply check if the
line_contains_code or whole_line_comment variables are set to 1. If
neither is, the line is blank. Then call the callback function
(not action) with an "lcode", "lcomment", or "lblank" string, and the
start and end positions of that line (including the newline). The start
position of the line is in the line_start variable. It should be set at
the beginning of every line either through the 'code' or 'comment'
actions, or manually in the main action. Finally, the line_contains_code,
whole_line_comment, and line_start state variables must be reset. All this
should be done within the main action shown below.
Note: For most parsers, the std_newline(lang) macro is sufficient and does
everything in the main action mentioned above. The lang parameter is the
[lang]_LANG string.

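The newline steps above can be sketched as an ordinary C function. This is not
the std_newline macro itself, only an approximation of the steps described
here. The line_state struct, the use of integer offsets with -1 meaning "not
set", and the handle_newline and record_line names are assumptions for this
example.

```c
#include <assert.h>
#include <string.h>

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Hypothetical stand-ins for the parser's state variables. */
struct line_state {
  int line_contains_code;
  int whole_line_comment;
  int line_start;          /* offset of the line's first char, -1 if unset */
};

/* Remembers the most recent callback so it can be inspected. */
static char last_kind[16];
static int last_start, last_end;
static void record_line(const char *lang, const char *entity,
                        int start, int end) {
  (void)lang;
  strncpy(last_kind, entity, sizeof(last_kind) - 1);
  last_kind[sizeof(last_kind) - 1] = '\0';
  last_start = start;
  last_end = end;
}

/* Called for a newline token spanning offsets [ts, te). Reports the line
 * that just ended, then resets the per-line state. */
static void handle_newline(struct line_state *s, const char *lang,
                           int ts, int te, line_callback callback) {
  const char *kind;
  assert(s != NULL && callback != NULL);
  if (s->line_start < 0) s->line_start = ts;  /* e.g. an empty line */
  if (s->line_contains_code) kind = "lcode";
  else if (s->whole_line_comment) kind = "lcomment";
  else kind = "lblank";
  callback(lang, kind, s->line_start, te);    /* te includes the newline */
  s->line_contains_code = 0;
  s->whole_line_comment = 0;
  s->line_start = -1;
}
```

Checking line_contains_code before whole_line_comment means a line that has
both code and a comment is reported as code.
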
Main Action Structure:
The main action looks like this:
  action [lang]_ccallback {
    switch(entity) {
    case ENTITY1:
      ...
      break;
    case ENTITY2:
      ...
      break;
    ...
    case ENTITYN:
      ...
      break;
    }
  }

Defining Patterns for Entities:
Now it is time to write patterns for each entity in your language. That
does not seem very hard, except when your entity can cover multiple lines.
Comments and strings in particular can do this. To make an accurate line
counter, you will need to count the lines covered by multi-line entities.
When you detect a newline inside your multi-line entity, you should set
the entity variable to be INTERNAL_NL (-2) and call the main action. The
main action should have a case for INTERNAL_NL separate from the newline
entity. In it, you will check if the current line is code or comment and
call the callback function with the appropriate string ("lcode" or
"lcomment") and beginning and end of the line (including the newline).
Afterwards, you will reset the line_contains_code and whole_line_comment
state variables. Then set the line_start variable to be p, the current
Ragel buffer position. Because line_contains_code and whole_line_comment
have been reset, any non-newline and non-space character in the multi-line
pattern should set line_contains_code or whole_line_comment back to 1.
Otherwise you would count the line as blank.
Note: For most parsers, the std_internal_newline(lang) macro is sufficient
and does everything in the main action mentioned above. The lang parameter
is the [lang]_LANG string.

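The handling of internal newlines described above can be approximated in plain
C. The sketch below walks the text of one multi-line entity, reports each line
that ends inside it, and marks the line again whenever it sees a visible
character. It is not the std_internal_newline macro; the line_state struct,
the integer offsets, and the scan_multiline and record_line names are
hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef void (*line_callback)(const char *lang, const char *entity,
                              int start, int end);

/* Hypothetical stand-ins for the parser's state variables. */
struct line_state {
  int line_contains_code;
  int whole_line_comment;
  int line_start;
};

/* Counts how many lines of each kind were reported. */
static int n_code, n_comment, n_blank;
static void record_line(const char *lang, const char *entity,
                        int start, int end) {
  (void)lang; (void)start; (void)end;
  if (strcmp(entity, "lcode") == 0) n_code++;
  else if (strcmp(entity, "lcomment") == 0) n_comment++;
  else n_blank++;
}

/* Walks a multi-line entity in buf[from..to). is_comment selects which
 * flag a visible character sets. A line ending with the newline at offset
 * p is reported as [line_start, p + 1), so the newline is included. */
static void scan_multiline(const char *buf, int from, int to, int is_comment,
                           struct line_state *s, const char *lang,
                           line_callback callback) {
  int p;
  assert(buf != NULL && s != NULL && callback != NULL);
  for (p = from; p < to; p++) {
    char c = buf[p];
    if (c == '\n') {
      /* Internal newline: report the finished line, then reset. */
      const char *kind = s->line_contains_code ? "lcode"
                       : s->whole_line_comment ? "lcomment" : "lblank";
      callback(lang, kind, s->line_start, p + 1);
      s->line_contains_code = 0;
      s->whole_line_comment = 0;
      s->line_start = p + 1;
    } else if (c != ' ' && c != '\t' && c != '\r') {
      /* A visible character marks the current line again. */
      if (is_comment) s->whole_line_comment = 1;
      else s->line_contains_code = 1;
    }
  }
}
```

The last line of the entity is deliberately not reported here: it is still
open, and the regular newline entity reports it once the line actually ends.
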
For multi-line matches, it is important to call the 'code' or 'comment'
actions (mentioned earlier) before an internal newline is detected so the
line_contains_code and whole_line_comment variables are properly set. For
other entities, you can use the 'code' macro inside the main action, which
executes the same code as the Ragel 'code' action. Other C macros are
'comment' and 'ls'; the latter is typically used for the SPACE entity when
defining line_start.

Also for multi-line matches, it may be necessary to use the 'enqueue' and
'commit' actions. If it is possible that a multi-line entity will not have
an ending delimiter (for example a string), use the 'enqueue' action as
soon as the start delimiter has been detected, and the 'commit' action as
soon as the end delimiter has been detected. This will eliminate the
potential for any counting errors.

Notes:
* You can be a bit sloppy with the line counting machine. For example the
  only C entities that can contain newlines are strings and comments, so
  INTERNAL_NL would only be necessary inside them. Other than those,
  anything other than spaces is considered code, so do not waste your time
  defining specific patterns for other entities.

Parsers with Embedded Languages:
Notation: [lang] is the parent language, [elang] is the embedded language.

To write a parser with embedded languages (such as HTML with embedded CSS
and JavaScript), you should first #include the embedded parser(s) above your
Ragel code. The header file is "[elang]_parser.h".

Next, after the inclusion of 'common.rl', add '#EMBED([elang])' on
separate lines for each embedded language. The Rakefile looks for these
special comments to embed the language for you automatically.

In your main action, you need to add a case for another entity,
CHECK_BLANK_ENTRY. It should call the 'check_blank_entry([lang]_LANG)' macro.
A blank entry is an entry into an embedded language where the rest of the
line, up to the newline, is blank. For example, a CSS entry in HTML is
something like:
  <style type="text/css">
If there is no CSS code after the entry (a blank entry), the line should
be counted as HTML code, and the 'check_blank_entry' macro handles this.
But you may be asking, "how do I get to the CHECK_BLANK_ENTRY entity?".
This will be discussed in just a bit.
Also use the emb_newline and emb_internal_newline macros instead of the
std_newline and std_internal_newline macros.

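To make the idea of a blank entry concrete, here is a small C sketch of the
test involved: is everything between the end of the entry and the next newline
whitespace? This is not the check_blank_entry macro, whose implementation
lives in the macros header; the rest_of_line_is_blank function and its
parameters are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if everything from offset pos up to the next newline (or the
 * end of the buffer) is a space, tab, or carriage return, 0 otherwise.
 * Hypothetical helper written for this sketch only. */
static int rest_of_line_is_blank(const char *buf, int length, int pos) {
  assert(buf != NULL && pos >= 0 && pos <= length);
  for (; pos < length && buf[pos] != '\n'; pos++) {
    if (buf[pos] != ' ' && buf[pos] != '\t' && buf[pos] != '\r') return 0;
  }
  return 1;
}
```

For the HTML example above, pos would be the offset just past the closing '>'
of the style tag. A result of 1 means the line holds only the entry and should
be counted as parent language code.
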
For each embedded language you will have to define an entry and outry. An
entry is the pattern that transitions from the parent language into the
child language. An outry is the pattern from child to parent. You will
need to put your entries in your [lang]_line machine. You will also need
to re-create each embedded language's line machine (define as
[lang]_[elang]_line; e.g. html_css_line) and put outry patterns in those.
Entries typically would be defined as [lang]_[elang]_entry, and outries
as [lang]_[elang]_outry.
Note: An outry should have a 'check_blank_outry' action so the line is not
mistakenly counted as a line of embedded language code if it is actually a
line of parent code.

Entry pattern actions should be:
  [lang]_[elang]_entry @{ entity = CHECK_BLANK_ENTRY; } @[lang]_ccallback
    @{ saw([lang]_LANG) } => { fcall [lang]_[elang]_line; };
This checks for a blank entry and, if it finds one, counts the line
as a line of parent language code. If it is not a blank entry, the macro does
nothing. The machine then transitions into the child language.

Outry pattern actions should be:
  @{ p = ts; fret; };
This sets the current Ragel parser position to the beginning
of the outry so the line is counted as a line of parent language code if
no child code is on the same line. The machine then transitions back into
the parent language.

Entity Identifying Machine:
This machine does not have to be written as a line-by-line parser. It only
has to identify the positions of language entities, such as whitespace,
comments, strings, etc. in sequence. As a result, it can be written much
faster and more easily, with less thought, than a line counter. Using a
scanner is most efficient.
The callback function will be called for each entity parsed.

Scanner Structure:
  [lang]_entity := |*
    entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
    entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
    ...
    entityn ${ entity = ENTITYN; } => [lang]_ecallback;
  *|;

Main Action Structure:
  action [lang]_ecallback {
    callback([lang]_LANG, [lang]_entities[entity], cint(ts), cint(te));
  }

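To show what the entity callback receives, here is a hand-written C sketch of
a very small entity scanner for an imaginary language with only "space" and
"any" entities. It stands in for the Ragel-generated machine, which is not
reproduced here. The names X_LANG, x_entities, scan_entities, and
record_entity are hypothetical, and converting ts and te to offsets by
subtracting the buffer start is an assumption about what cint does.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef void (*entity_callback)(const char *lang, const char *entity,
                                int start, int end);

/* Hypothetical two-entity language used only for this sketch. */
static const char *X_LANG = "x";
static const char *x_entities[] = { "space", "any" };
enum { X_SPACE = 0, X_ANY };

/* Records each reported entity so the spans can be inspected. */
#define MAX_SPANS 16
static struct { int is_space; int start; int end; } spans[MAX_SPANS];
static int span_count;

static void record_entity(const char *lang, const char *entity,
                          int start, int end) {
  (void)lang;
  if (span_count < MAX_SPANS) {
    spans[span_count].is_space = (strcmp(entity, "space") == 0);
    spans[span_count].start = start;
    spans[span_count].end = end;
    span_count++;
  }
}

/* Splits the buffer into maximal runs of whitespace ("space") and of
 * everything else ("any"). ts and te play the role of Ragel's token start
 * and token end pointers. */
static void scan_entities(const char *buffer, int length,
                          entity_callback callback) {
  const char *p = buffer;
  const char *pe = buffer + length;
  assert(buffer != NULL && callback != NULL);
  while (p < pe) {
    const char *ts = p;
    const char *te;
    int entity = isspace((unsigned char)*p) ? X_SPACE : X_ANY;
    while (p < pe &&
           (isspace((unsigned char)*p) ? X_SPACE : X_ANY) == entity) p++;
    te = p;
    callback(X_LANG, x_entities[entity],
             (int)(ts - buffer), (int)(te - buffer));
  }
}
```

Every byte of the buffer ends up inside exactly one reported span, which is
the role the "any" entity plays in a real entity machine.
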
Note: the 'ls', 'code', 'comment', 'enqueue' and 'commit' actions are
completely unnecessary in the entity machine.

Parsers for Embedded Languages:
TODO: