Google Summer of Code 2008
We have applied to be a mentoring organization for the GSoC 2008 program. We're looking for talented developers who share a passion for software development metrics, code analysis and data visualization. We hope to find candidates who are interested in rounding out Ohcount's feature set, as well as those who are interested in pushing its features in new directions (see below).
Proposals
Please feel free to list your proposals here.
* http://labs.ohloh.net/ohcount/wiki/ragel_gsoc_proposal
Requirements for acceptance
- You must be proficient in Ruby and C. Some projects require knowledge of web technologies (HTML, CSS & JavaScript), but to contribute successfully to Ohcount, you must already know Ruby and C.
- You must be able to work on your Summer of Code project full time for the summer (i.e., we will favour applicants who have no summer school or full-time job elsewhere).
- Knowledge of and/or experience with parsers & compilers is advised. Counting lines of code requires some understanding of the basic concepts behind compilers. Learning on the job is possible, but somewhat difficult.
- You must be willing and able to work in a structured way (e.g., weekly progress reports). We'll be available daily on IM and/or IRC, but prefer targeted, scheduled conversations rather than ad-hoc, interrupt-driven ones.
- You must be ready to participate in a short, remote interview (phone, voice IM or just IM).
Project Ideas
Projects are governed by some constraints:
1. Specific: The project ideas must describe new functionality/architecture that is tightly scoped and targeted.
2. Measurable: Gauging the success (or failure) of a project should be obvious.
3. Attainable: Project ideas should constrain themselves to evolving current functionality, and not require complete re-architecting of the entire system.
4. Realistic: While what is attainable in a summer's coding can vary greatly depending on a student's abilities, it's important to realize that each project should be accomplishable in a matter of weeks, not months (remember, devs are mostly (ahem) optimistic by nature).
5. Timely: Projects should identify some time-specific milestones. Avoid all-or-nothing projects, where participants can't see ongoing progress in clear, measurable increments. It serves no one to work all summer only to realize, within a few days of the end, that the project will fail.
Improve Current Languages & Implement More Language(s)
Synopsis: Ohcount currently has over 20 open tickets, either to support more programming languages (BASIC, OCaml, Make) or to improve existing ones (Emacs, Python docstrings...).
Community Benefit: Comprehensive language statistics are Ohcount's primary purpose, so supporting more languages directly helps the general Ohcount community.
Technical Details: Writing these languages requires researching and identifying the key characteristics of each language, defining the DFA required to parse the blanks, comments and code. Finally, the student will author the state machine using Ruby. As with all projects, every feature/new language will be implemented in a test-driven way, with unit tests being written before the actual implementation.
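To give a feel for the kind of state machine involved, here is a minimal sketch in plain Ruby. It does not use Ohcount's actual engine or syntax; the method name `classify_lines` and the C-like comment rules are assumptions made for illustration, and string literals containing comment markers are deliberately ignored.

```ruby
# Hypothetical sketch: label each line of a C-like source as
# :blank, :comment or :code. This is NOT Ohcount's engine or syntax.
def classify_lines(source)
  in_block = false # state carried across lines: inside /* ... */ ?
  source.each_line.map do |line|
    text = line.chomp
    saw_code = false
    saw_comment = false
    i = 0
    while i < text.length
      if in_block
        # Any visible character inside a block comment counts as comment.
        saw_comment = true unless text[i] =~ /\s/
        if text[i, 2] == "*/"
          in_block = false
          i += 2
        else
          i += 1
        end
      elsif text[i, 2] == "/*"
        in_block = true
        saw_comment = true
        i += 2
      elsif text[i, 2] == "//"
        saw_comment = true
        break # the rest of the line is comment
      else
        saw_code = true unless text[i] =~ /\s/
        i += 1
      end
    end
    # Code outranks comment, which outranks blank (an assumed ranking).
    if saw_code
      :code
    elsif saw_comment
      :comment
    else
      :blank
    end
  end
end
```

In a test-driven workflow, a small sample such as `"int x; // set\n\n/* note */\n"` and its expected labels would be written first, and the machine then grown until the test passes.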
Difficulty: Easy/Medium (depending on which languages)
Skills: Familiarity w/ Ruby and Regular Expressions, with a little C. The student will have to learn the syntax of whichever target languages are to be improved or implemented.
Replace hand-coded State Machine Engine w/ Ragel, the State Machine Compiler
Synopsis: Ohcount is built on a custom-written state-machine engine. This was done for expediency early on. Meanwhile, a better alternative has come up: [Ragel](http://www.cs.queensu.ca/~thurston/ragel/). This project would consist of replacing the internal engine with Ragel.
Community Benefit: While the current state machine engine works, it is slow and offers limited functionality. As a result, authoring specific language detectors is difficult, for two reasons:
- The current engine is built on regular expressions, which makes language definitions tricky to write and error-prone.
- The current engine uses a custom syntax. Every contributor wanting to improve or author a new language needs to ramp up on Ohcount's custom syntax. Ragel syntaxes, on the other hand, are more broadly reusable.
By switching to Ragel, Ohcount will become significantly faster and more likely to attract more programming language contributions from the general community.
Technical Details:
- Integrate Ragel into the current native (C) build process
- Determine the Ohcount-specific Ragel routines to track each language's code/blank/comments.
- Implement a proof-of-concept language (C, by default), using an already-specified Ragel definition, integrating the code/blank/comments tracking routines.
- Begin porting the current ohcount languages to the Ragel syntax.
Difficulty: Advanced
Skills: C & Ruby, Regular Expressions and experience with parsers/state machines/compilers
Author Web Visualization Front End
Synopsis: Provide a way to visualize the results of Ohcount's analysis. Specifically, a web-based code browser that decorates code on a line-by-line basis, showing what language was detected and whether each line is code, comment or blank.
Community Benefit: After running Ohcount, most users are eager to better understand how the numbers came about. A general way to browse & visualize the results would be very useful. It would also encourage more accuracy, since more eyeballs could review the Ohcount results.
Technical Details: This would require tweaking Ohcount's output formats (or perhaps using them as-is) and merging this data with a general-purpose code-browsing library to emit decorated HTML. Each line of code would have a color-coded icon representing what language the line was attributed to, and a Code/Blank/Comment icon as well. The actual code would also be formatted by the generic code-browsing library.
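As a rough illustration of the merging step, the sketch below turns per-line results into a decorated HTML table. The `AnnotatedLine` record, the `render_html` method and the CSS class names are all hypothetical; Ohcount's real output formats may look quite different, and a real implementation would delegate syntax highlighting to a code-browsing library.

```ruby
require "cgi"

# Hypothetical per-line record; not Ohcount's actual output format.
AnnotatedLine = Struct.new(:language, :kind, :text)

# Emit one table row per source line, with CSS classes that a stylesheet
# could map to color-coded language and code/comment/blank icons.
def render_html(lines)
  rows = lines.each_with_index.map do |line, index|
    language = CGI.escapeHTML(line.language.to_s)
    kind = CGI.escapeHTML(line.kind.to_s)
    format(
      '<tr class="lang-%s kind-%s"><td>%d</td><td>%s</td><td>%s</td><td><pre>%s</pre></td></tr>',
      language, kind, index + 1, language, kind, CGI.escapeHTML(line.text)
    )
  end
  "<table class=\"ohcount\">\n#{rows.join("\n")}\n</table>"
end

sample = [
  AnnotatedLine.new("ruby", :comment, "# compare a < b"),
  AnnotatedLine.new("ruby", :code, "puts a < b"),
  AnnotatedLine.new("ruby", :blank, "")
]
puts render_html(sample)
```

Escaping the source text before embedding it matters here, since analyzed files routinely contain `<`, `>` and `&`.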
Difficulty: Medium
Skills: Requires JavaScript, HTML & CSS skills.
Make Ohcount More Precise: Count Characters
Synopsis: Ohcount classifies each line of a file as a blank, a line of code or a line of comment, and records which language was detected. This project would have Ohcount also tally Code/Blanks/Comments at the character level, providing much more detailed reports.
Community Benefit: Ohloh uses the current Lines Of Code findings in its rolled-up general reports. Many users have asked for a visual breakdown of a source code file: showing, for each line, what language Ohcount determined it to be, and whether it thought it was blank, code or comment. Character-level counts would give the general community much more of an apples-to-apples comparison of projects and developers, since some people, projects and languages prefer inline comments over block comments, which Ohcount currently reports vastly differently (inline comments are mostly ignored).
Technical Details: Ohcount currently traverses State Machines whilst parsing a source code file. There is currently a "tally" trigger that fires on every NEWLINE. This tally routine looks at all the states that were contained on the current line and chooses the most significant one and labels this line with that state. All other states are discarded. Instead, the student would implement more flexible counters that would tally every state seen on every line.
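The difference between the two tallying strategies can be sketched as follows. This is an illustration only: the span representation, the method names and the `SIGNIFICANCE` ranking are assumptions, not taken from Ohcount's source.

```ruby
# Assumed ranking (not from Ohcount): code outranks comment outranks blank.
SIGNIFICANCE = { blank: 0, comment: 1, code: 2 }.freeze

# Each line is a list of [state, text] spans, as a parser might emit them.

# Line-level behaviour as described above: keep only the most significant
# state seen on each line and discard the rest.
def tally_by_line(lines)
  counts = Hash.new(0)
  lines.each do |spans|
    winner = spans.map(&:first).max_by { |state| SIGNIFICANCE.fetch(state) } || :blank
    counts[winner] += 1
  end
  counts
end

# Proposed behaviour: count the characters attributed to every state.
def tally_by_character(lines)
  counts = Hash.new(0)
  lines.each do |spans|
    spans.each { |state, text| counts[state] += text.length }
  end
  counts
end

lines = [
  [[:code, "x = 1 "], [:comment, "# set x"]], # inline comment after code
  [[:comment, "# note"]],
  []                                          # empty line
]
p tally_by_line(lines)
p tally_by_character(lines)
```

In the first sample line the inline comment vanishes from the line-level tally because code wins, whereas the character-level tally still credits its seven characters to comments.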
Difficulty: Medium/Advanced
Skills: Requires strong C & Ruby skills, as well as experience with parsers.
As usual, feel free to contact us by email (info@ohloh.net) or in the Ohloh forums to suggest more ideas or to discuss the ones above.