Integrating Ragel, a State Machine Compiler, into Ohcount

by Mitchell Foral, University of Virginia, U.S.A.

Introduction

As indicated by its developers, Ohcount, a source code line counter written in Ruby and C, is "built on a custom-written state-machine engine" which is "tricky and error prone" and may discourage contributors of additional languages with its custom syntax. There has been a suggestion to replace the current engine with Ragel, a fast and flexible state machine compiler. According to its website, Ragel is good for (among other things) "lexical analysis of programming languages".
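To make the idea of a line-counting state machine concrete, here is a minimal hand-written sketch in C. It is neither Ohcount's actual engine nor Ragel-generated code. The names LineCounts, State, and count_lines are hypothetical, and the sketch assumes a C-like language with // and /* */ comments while ignoring string literals entirely, which a real lexer could not do.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical types and names for illustration only; they are not
 * taken from Ohcount or Ragel. */
typedef struct { int code, comment, blank; } LineCounts;

typedef enum { S_NORMAL, S_LINE_COMMENT, S_BLOCK_COMMENT } State;

/* Classify every line of a NUL-terminated source string as code,
 * comment, or blank. A line containing both code and a comment counts
 * as code. Simplifying assumption: string literals are not recognized,
 * so a "//" inside a string is treated as the start of a comment. */
LineCounts count_lines(const char *src) {
    LineCounts n = {0, 0, 0};
    State st = S_NORMAL;
    int saw_code = 0, saw_comment = 0;
    size_t line_len = 0;

    assert(src != NULL);

    for (size_t i = 0; ; i++) {
        char c = src[i];

        if (c == '\n' || c == '\0') {
            /* Count the line unless it is an empty tail after the
             * final newline. */
            if (c == '\n' || line_len > 0) {
                if (saw_code) n.code++;
                else if (saw_comment) n.comment++;
                else n.blank++;
            }
            if (c == '\0') break;
            /* A line comment ends at the newline; a block comment
             * carries over to the next line. */
            if (st == S_LINE_COMMENT) st = S_NORMAL;
            saw_code = saw_comment = 0;
            line_len = 0;
            continue;
        }

        line_len++;
        switch (st) {
        case S_NORMAL:
            if (c == '/' && src[i + 1] == '/') {
                st = S_LINE_COMMENT; saw_comment = 1; i++;
            } else if (c == '/' && src[i + 1] == '*') {
                st = S_BLOCK_COMMENT; saw_comment = 1; i++;
            } else if (!isspace((unsigned char)c)) {
                saw_code = 1;
            }
            break;
        case S_LINE_COMMENT:
            saw_comment = 1;
            break;
        case S_BLOCK_COMMENT:
            if (c == '*' && src[i + 1] == '/') {
                st = S_NORMAL; saw_comment = 1; i++;
            } else if (!isspace((unsigned char)c)) {
                saw_comment = 1;
            }
            break;
        }
    }
    return n;
}
```

Even this toy version needs explicit bookkeeping for each state transition, and adding string literals or a second embedded language would multiply the cases. A state machine compiler such as Ragel is meant to generate that kind of transition logic from a declarative description.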

Project Proposal

I propose to replace Ohcount's current state machine engine with Ragel and port as many existing language descriptions as possible to Ragel's syntax.

My Background

I come from the beautiful city of Albuquerque, New Mexico, in the southwestern part of the United States. I lived there for the first 16 years of my life before completing high school in Arlington, Virginia, U.S.A. with the highest academic honors and then heading off to college. I am currently a sophomore at the University of Virginia, dual-majoring in Aerospace Engineering and Physics.

In addition to my strengths in mathematics and science, I have an equal aptitude for computer programming and software development. I have been coding for seven years, starting in 7th grade with Visual Basic 6. My main project was an advanced numerical calculator (similar to Matlab) that parsed expressions containing integrals, differentials, and other functions and returned the result. It also had an advanced function grapher that could graph in both 2 and 3 dimensions (the latter using OpenGL). Since then I have gained experience programming in C/C++, Java, Ruby, PHP, Python, and most recently Lua.

I am a huge proponent of open-source software. I currently have five open-source projects under my name (see http://caladbolg.net/projects.php) and have contributed to two others. I ventured into the world of Linux during my freshman year of high school and tried what seemed to be every distribution under the sun over the next couple of years. I eventually settled on Arch Linux and run it on all three of my computers.

My programming philosophy is cleanliness, minimalism, and efficiency. I cannot stand dirty or bloated code; code has to be presentable and maintainable. I like to keep the code I write as minimal and efficient as possible. The dwm window manager is one of my biggest inspirations.

Related Experience

I believe that I am the most qualified candidate for completing this project because of my ongoing, direct experience with lexical analysis of programming languages using a parsing expression grammar (PEG), whose syntax is quite similar to Ragel’s. I am also very familiar with C/C++, Ruby, and embedded languages.

I am currently the developer of SciTE-st (http://caladbolg.net/scite-st.php), a fork of Scintilla/SciTE that implements dynamic language lexers for source code coloring using the LPeg (http://www.inf.puc-rio.br/~roberto/lpeg.html) PEG library. I have written language descriptions in Lua using LPeg for at least 12 different languages including C/C++, HTML (including embedded CSS, JavaScript, Ruby, and PHP), and Ruby. On average, it took me no more than two days to learn a language's syntax and write a working lexer for it. The number of lexing bugs I encountered after writing the initial lexer for a given language was minimal because of the efficiency and design of PEGs. I believe that my experience with Ohcount and Ragel will be similar and that errors will be kept to a minimum, solving one of the problems of the current Ohcount engine.

In addition to SciTE-st, I am also the author of Textadept (http://caladbolg.net/textadept.php), a minimalist text editor that uses dynamic lexers and makes heavy use of Lua embedded in C for other extensibility purposes. While I have not had direct experience with embedding Ruby in C, I have had no problems working with Lua in C and expect a smooth transition to Ruby in C.

My familiarity with Ruby is extensive. My two open-source projects, Mr. GUID for debugging Ruby code and rSQLiteGUI for simple administration of SQLite databases, were hobbies to familiarize myself with Ruby/GTK2 programming. Other smaller personal projects include an Asteroids clone written with Ruby and SDL, an implementation of the multi-player card game "Thirteen" (also known as Tien Len) with artificial intelligence, an RPN (reverse Polish notation) interface to a remote Maxima session, and some Ruby on Rails websites.

Timeline

Before the Google Summer of Code actually starts, there is a period between Apr 24 and May 26 for "bonding". With the help of my mentor and the current developers, I plan to use this time to familiarize myself with Ohcount's internals and understand how everything works. I will also need to understand how Ragel can be used with C/C++ in order to begin formulating ways of integrating it with Ohcount. When work actually begins, I will be available full-time (8+ hours a day) on a regular basis. I do not believe it will take long (no more than 2 weeks) to have a basic implementation of Ragel in Ohcount. Interfacing it with the current API and documenting it will take longer (between a month and a month and a half). If a new API is necessary, another week will be spent getting input from the developers on what kind of API they want. For the remaining period of time (a few weeks), as many of the current languages as possible will be ported to the new Ragel syntax. The first few will be the most difficult (2 or 3 languages may take up to a week and a half) as I get used to the syntax, general structure, efficiency, etc., but later ports should go much more quickly.

May 26 - Jun 6

Have a basic implementation of Ragel in Ohcount. There will be no API or method to extract all kinds of data yet. The documentation will be minimal for now, as the code could be quite volatile.

Jun 9 - Jun 13

If a new API is necessary, discuss it with the developers. Otherwise, it might be advantageous to spend this week developing a strategy for interfacing with the current API.

Jun 16 - Jul 18

Expand the basic implementation of Ragel to interface with the existing or new Ohcount API and then document it extensively.

Jul 21 - Aug 8

Port as many languages as possible to the Ragel syntax. The goal is at least 3 for a firm start.

Outcomes

At the completion of this project, Ohcount will use the Ragel engine instead of its custom-written one. The result will be a faster, more robust, and less error-prone implementation for analyzing source code. Since Ragel is more standardized, existing language descriptions in its syntax can be used with Ohcount with little or no modification. The use of Ragel may also make it easier to extract more information from the source code being analyzed and provide it to developers using Ohcount.