Hello,
I'm working on the implementation of a polyglot to support literate programming. I have some code ready to support WEB (Pascal + TeX, the 'first' literate programming framework) and CWEB (C + TeX), which I'm keeping in my git copy of the ohcount repository.
I've hit two problems so far.
One of them is a segmentation fault, which I can work around by setting MAX_CS_STACK to 2048 in common.h. 1024 is still not enough, and I haven't investigated further to find the sweet spot. The value seems to depend on the length of the file being parsed, so I suspect some kind of parser state leak.
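To illustrate what I mean by a leak, here is a toy model in Python. It is not ohcount's actual code: `max_depth`, the "enter"/"leave" tokens and the `leaky` flag are all made up for the example. The idea is that a parser which pushes a return state on entering an embedded section, but never pops it on leaving, needs a stack proportional to the number of sections in the file.

```python
def max_depth(tokens, leaky=False):
    """Return the deepest state-stack level reached while scanning tokens.

    'enter' pushes a return state, 'leave' pops it. With leaky=True the
    pop is skipped, mimicking a parser that forgets to return.
    """
    stack = []
    deepest = 0
    for tok in tokens:
        if tok == "enter":
            stack.append("return-state")
            deepest = max(deepest, len(stack))
        elif tok == "leave" and not leaky and stack:
            stack.pop()
    return deepest


# A hypothetical file with 1500 consecutive (non-nested) sections.
tokens = ["enter", "text", "leave"] * 1500
print(max_depth(tokens))              # balanced: depth stays at 1
print(max_depth(tokens, leaky=True))  # leaky: depth grows to 1500
```

In this model a limit of 1024 overflows and 2048 does not, which is the pattern I see, but I have not confirmed that this is what really happens in the parser.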
The second problem is the matter of code vs comment. Consider the typical WEB file: you can find TeX code and comments in it, and Pascal code and comments. The TeX code is actually the 'comment' area of the source file. This is why I tried two different approaches.
In the first, the non-program code is considered comment, without further specification (lp-polyglot branch on my git). This correctly evaluates the comment/code ratio, but fails to identify the language used in the documentation area of the file (so it doesn't count towards TeX experience, for example).
The other approach parses the TeX code as TeX code and the Pascal code as Pascal code. This contributes correctly to the 'programming experience' of the authors, but fails to properly identify the comment/code ratio of the source.
So my second question is: is it possible to both identify the sections for their 'language', and consider one language as documentation of the other?
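To make the question concrete, what I have in mind is something like the following Python sketch, where each parsed span carries both a language and a role. This is only a model of the idea: `Span`, `summarize` and the field names are hypothetical and do not correspond to anything in ohcount, and the line counts are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    language: str  # e.g. "tex" or "pascal"
    role: str      # "code" or "comment", relative to the host file
    lines: int


def summarize(spans):
    """Return (lines per language, lines per role) for a list of spans."""
    by_language = Counter()
    by_role = Counter()
    for s in spans:
        by_language[s.language] += s.lines
        by_role[s.role] += s.lines
    return by_language, by_role


# A WEB-like file: TeX documentation, Pascal code, a Pascal comment.
spans = [
    Span("tex", "comment", 40),
    Span("pascal", "code", 25),
    Span("pascal", "comment", 5),
]
by_language, by_role = summarize(spans)
print(dict(by_language))  # {'tex': 40, 'pascal': 30}
print(dict(by_role))      # {'comment': 45, 'code': 25}
```

With both tags kept, the TeX lines would count towards TeX experience and towards the comment side of the comment/code ratio at the same time.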
If I can solve these two issues (identification and stack overrun) I'll be able to submit the code for inclusion.