Changeset 60fa1b1b96dfa1b4a9b3972bdbb8735975b04c25

Timestamp:
05/23/2008 02:13:19 PM
Author:
mitchell <mitchell@frost.(none)>
git-committer:
mitchell <mitchell@frost.(none)> 1211577199 -0400
git-parent:

[8f0a32131acb4f7ee7d93dff7d4a8b35259d0959]

git-author:
mitchell <mitchell@frost.(none)> 1211577199 -0400
Message:

Updated PARSER_DOC.

Files:

  • PARSER_DOC

    --- PARSER_DOC (rebaab65)
    +++ PARSER_DOC (r60fa1b1)
    @@ -3,5 +3,7 @@
     Overview:
       I will assume the reader has a decent knowledge of how Ragel works and the
    -  Ragel syntax.
    +  Ragel syntax. If not, please review the Ragel manual found at:
    +    http://research.cs.queensu.ca/~thurston/ragel/
    +
       All parsers must at least:
         * Call a callback function when a line of code is parsed.
    @@ -15,6 +17,6 @@
     
     Writing a Parser:
    -  First create your parser in ext/ohcount_native/ragel_parsers/. It's name
    -  should be the language you're parsing with a '.rl' extension. Every parser
    +  First create your parser in ext/ohcount_native/ragel_parsers/. Its name
    +  should be the language you are parsing with a '.rl' extension. Every parser
       must have the following at the top:
     
    @@ -28,5 +30,5 @@
     const char *c_entities[] = {
       "space", "comment", "string", "number", "preproc",
    -  "keyword", "identifier", "operator", "newline", "any"
    +  "keyword", "identifier", "operator", "any"
     };
     
    @@ -34,5 +36,5 @@
     enum {
       C_SPACE = 0, C_COMMENT, C_STRING, C_NUMBER, C_PREPROC,
    -  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_NEWLINE, C_ANY
    +  C_KEYWORD, C_IDENTIFIER, C_OPERATOR, C_ANY
     };
     
    @@ -102,16 +104,16 @@
       (Your parser will go between these two blocks.)
     
    -  The code can be found in the existing c.rl parser. You'll need to change:
    +  The code can be found in the existing c.rl parser. You will need to change:
         * [lang]_LANG - Set the variable name to be [lang]_LANG and its value to be
           the name of your language to parse. [lang] is your language name. So if
    -      you're writing a C parser, it would be C_LANG.
    +      you are writing a C parser, it would be C_LANG.
         * [lang]_entities - Set the variable name to be [lang]_entities (e.g.
           c_entries) The value is an array of string entities your language has.
           For example C has comment, string, number, etc. entities. You should
    -      definately have "space", and "newline" entities. If your language has
    -      escaped newlines (or continuations), have an "escaped_newline" entity as
    -      well
    +      definately have "space", and "any" entities. "any" entities are typically
    +      used for entity machines (discussed later) and match any character that
    +      is not recognized so the parser does not do something unpredictable
         * enum - Change the value of the enum to correspond with your entities. So
    -      if in your parser you look up [lang]_entities[ENTITY], you'll get the
    +      if in your parser you look up [lang]_entities[ENTITY], you will get the
           associated entity's string name.
         * parse_[lang] - Set the function name to parse_[lang] where again, [lang]
    @@ -123,5 +125,5 @@
         variables have the same name in header files (which is what parsers are),
         the compiler complains. Also, when you have languages embedded inside each
    -    other, any identifiers with the same name can easily be mixed up. It's also
    +    other, any identifiers with the same name can easily be mixed up. It is also
         important to prefix your Ragel definitions with your language to avoid
         conflicts with other parsers.
    @@ -162,5 +164,5 @@
             [lang]_line := |*
               entity1 ${ entity = ENTITY1; } => [lang]_ccallback;
    -          entity1 ${ entity = ENTITY2; } => [lang]_ccallback;
    +          entity2 ${ entity = ENTITY2; } => [lang]_ccallback;
               ...
               entityn ${ entity = ENTITYN; } => [lang]_ccallback;
    @@ -214,9 +216,9 @@
         Defining Patterns for Entities:
           Now it is time to write patterns for each entity in your language. That
    -      doesn't seem very hard, except when your entity can cover multiple lines.
    +      does not seem very hard, except when your entity can cover multiple lines.
           Comments and strings in particular can do this. To make an accurate line
           counter, you will need to count the lines covered by multi-line entities.
           When you detect a newline inside your multi-line entity, you should set
    -      the entity variable to be INTERNAL_NL (-1) and call the main action. The
    +      the entity variable to be INTERNAL_NL (-2) and call the main action. The
           main action should have a case for INTERNAL_NL separate from the newline
           entity. In it, you will check if the current line is code or comment and
    @@ -244,10 +246,10 @@
           * You can be a bit sloppy with the line counting machine. For example the
             only C entities that can contain newlines are strings and comments, so
    -        INTERNAL_NEWLINE would only be needed inside those. Other than those,
    -        anything other than spaces is considered code, so don't waste your time
    +        INTERNAL_NL would only be necessary inside them. Other than those,
    +        anything other than spaces is considered code, so do not waste your time
             defining specific patterns for other entities.
     
       Entity Identifying Machine:
    -    This machine doesn't have to be written as a line-by-line parser. It only
    +    This machine does not have to be written as a line-by-line parser. It only
         has to identify the positions of language entities, such as whitespace,
         comments, strings, etc. in sequence. As a result they can be written much
    @@ -259,5 +261,5 @@
           [lang]_entity := |*
             entity1 ${ entity = ENTITY1; } => [lang]_ecallback;
    -        entity1 ${ entity = ENTITY2; } => [lang]_ecallback;
    +        entity2 ${ entity = ENTITY2; } => [lang]_ecallback;
             ...
             entityn ${ entity = ENTITYN; } => [lang]_ecallback;