Changeset 60fa1b1b96dfa1b4a9b3972bdbb8735975b04c25
- Timestamp: 05/23/2008 02:13:19 PM
- Author: mitchell <mitchell@frost.(none)>
- git-committer: mitchell <mitchell@frost.(none)> 1211577199 -0400
- git-parent: [8f0a32131acb4f7ee7d93dff7d4a8b35259d0959]
- git-author: mitchell <mitchell@frost.(none)> 1211577199 -0400
- Message: Updated PARSER_DOC.
| rebaab65 | r60fa1b1 | |
|---|---|---|
| 3 | 3 | Overview: |
| 4 | 4 | I will assume the reader has a decent knowledge of how Ragel works and the |
| 5 | | Ragel syntax. |
| | 5 | Ragel syntax. If not, please review the Ragel manual found at: |
| | 6 | http://research.cs.queensu.ca/~thurston/ragel/ |
| | 7 | |
| 6 | 8 | All parsers must at least: |
| 7 | 9 | * Call a callback function when a line of code is parsed. |
| … | … | |
| 15 | 17 | |
| 16 | 18 | Writing a Parser: |
| 17 | | First create your parser in ext/ohcount_native/ragel_parsers/. It's name |
| 18 | | should be the language you're parsing with a '.rl' extension. Every parser |
| | 19 | First create your parser in ext/ohcount_native/ragel_parsers/. Its name |
| | 20 | should be the language you are parsing with a '.rl' extension. Every parser |
| 19 | 21 | must have the following at the top: |
| 20 | 22 | |
| … | … | |
| 28 | 30 | const char *c_entities[] = { |
| 29 | 31 | "space", "comment", "string", "number", "preproc", |
| 30 | | "keyword", "identifier", "operator", "newline", "any" |
| | 32 | "keyword", "identifier", "operator", "any" |
| 31 | 33 | }; |
| 32 | 34 | |
| … | … | |
| 34 | 36 | enum { |
| 35 | 37 | C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC, |
| 36 | | C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_NEWLINE, C_ANY |
| | 38 | C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY |
| 37 | 39 | }; |
| 38 | 40 | |
| … | … | |
| 102 | 104 | (Your parser will go between these two blocks.) |
| 103 | 105 | |
| 104 | | The code can be found in the existing c.rl parser. You'll need to change: |
| | 106 | The code can be found in the existing c.rl parser. You will need to change: |
| 105 | 107 | * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be |
| 106 | 108 | the name of your language to parse. [lang] is your language name. So if |
| 107 | | you're writing a C parser, it would be C_LANG. |
| | 109 | you are writing a C parser, it would be C_LANG. |
| 108 | 110 | * [lang]_entities - Set the variable name to be [lang]_entities (e.g. |
| 109 | 111 | c_entries) The value is an array of string entities your language has. |
| 110 | 112 | For example C has comment, string, number, etc. entities. You should |
| 111 | | definately have "space", and "newline" entities. If your language has |
| 112 | | escaped newlines (or continuations), have an "escaped_newline" entity as |
| 113 | | well. |
| | 113 | definately have "space", and "any" entities. "any" entities are typically |
| | 114 | used for entity machines (discussed later) and match any character that |
| | 115 | is not recognized so the parser does not do something unpredictable. |
| 114 | 116 | * enum - Change the value of the enum to correspond with your entities. So |
| 115 | | if in your parser you look up [lang]_entities[ENTITY], you'll get the |
| | 117 | if in your parser you look up [lang]_entities[ENTITY], you will get the |
| 116 | 118 | associated entity's string name. |
| 117 | 119 | * parse_[lang] - Set the function name to parse_[lang] where again, [lang] |
| … | … | |
| 123 | 125 | variables have the same name in header files (which is what parsers are), |
| 124 | 126 | the compiler complains. Also, when you have languages embedded inside each |
| 125 | | other, any identifiers with the same name can easily be mixed up. It's also |
| | 127 | other, any identifiers with the same name can easily be mixed up. It is also |
| 126 | 128 | important to prefix your Ragel definitions with your language to avoid |
| 127 | 129 | conflicts with other parsers. |
| … | … | |
| 162 | 164 | [lang]_line := |* |
| 163 | 165 | entity1 ${ entity = ENTITY1; } => [lang]_ccallback; |
| 164 | | entity1 ${ entity = ENTITY2; } => [lang]_ccallback; |
| | 166 | entity2 ${ entity = ENTITY2; } => [lang]_ccallback; |
| 165 | 167 | ... |
| 166 | 168 | entityn ${ entity = ENTITYN; } => [lang]_ccallback; |
| … | … | |
| 214 | 216 | Defining Patterns for Entities: |
| 215 | 217 | Now it is time to write patterns for each entity in your language. That |
| 216 | | doesn't seem very hard, except when your entity can cover multiple lines. |
| | 218 | does not seem very hard, except when your entity can cover multiple lines. |
| 217 | 219 | Comments and strings in particular can do this. To make an accurate line |
| 218 | 220 | counter, you will need to count the lines covered by multi-line entities. |
| 219 | 221 | When you detect a newline inside your multi-line entity, you should set |
| 220 | | the entity variable to be INTERNAL_NL (-1) and call the main action. The |
| | 222 | the entity variable to be INTERNAL_NL (-2) and call the main action. The |
| 221 | 223 | main action should have a case for INTERNAL_NL separate from the newline |
| 222 | 224 | entity. In it, you will check if the current line is code or comment and |
| … | … | |
| 244 | 246 | * You can be a bit sloppy with the line counting machine. For example the |
| 245 | 247 | only C entities that can contain newlines are strings and comments, so |
| 246 | | INTERNAL_NEWLINE would only be needed inside those. Other than those, |
| 247 | | anything other than spaces is considered code, so don't waste your time |
| | 248 | INTERNAL_NL would only be necessary inside them. Other than those, |
| | 249 | anything other than spaces is considered code, so do not waste your time |
| 248 | 250 | defining specific patterns for other entities. |
| 249 | 251 | |
| 250 | 252 | Entity Identifying Machine: |
| 251 | | This machine doesn't have to be written as a line-by-line parser. It only |
| | 253 | This machine does not have to be written as a line-by-line parser. It only |
| 252 | 254 | has to identify the positions of language entities, such as whitespace, |
| 253 | 255 | comments, strings, etc. in sequence. As a result they can be written much |
| … | … | |
| 259 | 261 | [lang]_entity := |* |
| 260 | 262 | entity1 ${ entity = ENTITY1; } => [lang]_ecallback; |
| 261 | | entity1 ${ entity = ENTITY2; } => [lang]_ecallback; |
| | 263 | entity2 ${ entity = ENTITY2; } => [lang]_ecallback; |
| 262 | 264 | ... |
| 263 | 265 | entityn ${ entity = ENTITYN; } => [lang]_ecallback; |
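
For readers following the change from "newline" to "any" and from INTERNAL_NL (-1) to INTERNAL_NL (-2), here is a minimal C sketch of how the revised entity list and a line-counting main action might fit together. The c_entities array and the enum are copied from r60fa1b1. Everything else is an assumption made for illustration and is not taken from the ohcount sources: the NEWLINE value of -1, the line_counts struct, and the on_entity and finish_line helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Entity names as listed in r60fa1b1 ("newline" removed). */
const char *c_entities[] = {
  "space", "comment", "string", "number", "preproc",
  "keyword", "identifier", "operator", "any"
};

/* Enum values index c_entities in the same order. */
enum {
  C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
};

#define INTERNAL_NL (-2) /* value stated in the revised text */
#define NEWLINE     (-1) /* ASSUMPTION: not shown in this changeset */

/* Hypothetical per-file totals plus flags for the line being scanned. */
typedef struct {
  int code, comment, blank;
  int line_has_code, line_has_comment;
} line_counts;

/* Hypothetical helper: classify the finished line and reset the flags. */
static void finish_line(line_counts *lc) {
  if (lc->line_has_code) lc->code++;
  else if (lc->line_has_comment) lc->comment++;
  else lc->blank++;
  lc->line_has_code = lc->line_has_comment = 0;
}

/* Hypothetical stand-in for the main action a parser would call. */
void on_entity(line_counts *lc, int entity) {
  assert(lc != NULL);
  switch (entity) {
  case INTERNAL_NL: /* newline inside a multi-line comment or string */
    finish_line(lc);
    break;
  case NEWLINE:     /* ordinary end of line, kept as a separate case */
    finish_line(lc);
    break;
  case C_SPACE:     /* whitespace never changes the classification */
    break;
  case C_COMMENT:
    lc->line_has_comment = 1;
    break;
  default:          /* anything other than spaces counts as code */
    lc->line_has_code = 1;
    break;
  }
}
```

In this sketch a caller would report C_COMMENT again after each INTERNAL_NL while it is still inside a multi-line comment, so every continuation line is counted as a comment line. A line that contains both code and a comment is counted as code.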