Ticket #211 (closed enhancement: fixed)

Opened 4 months ago

Last modified 2 months ago

Differentiate C from C++

Reported by: robin@ohloh.net
Assigned to: robin@ohloh.net
Priority: like
Milestone: current
Component: detector
Version: current
Severity: bad
Keywords:
Cc: ciaranm

Description

Currently, all C and C++ is thrown together into a bucket called C/C++.

With some work, the detector could be improved in this regard. I suspect most of the trouble comes in sorting out the *.h files.

If we find foo.cpp and foo.h, but foo.h does not use any C++ features (i.e., it could compile as straight C), should foo.h be considered C or C++? Would the answer to this question change if foo.cpp were not present? If there is no foo.c or foo.cpp file, do we need to probe foo.h to make a determination?

Are there any cases where files are named with a *.c extension, but are actually C++? Does this necessitate a deep probe of all *.c files?

My main concern for this detection is performance. It might be a problem for Ohloh if the detection is slow and requires deeply probing a lot of source files.

Attachments

split-cncpp-into-c-cpp-1.patch (17.9 kB) - added by ciaranm on 03/12/2008 07:12:50 PM.

Change History

01/18/2008 08:24:13 PM changed by josh

To the best of my knowledge, you can't put C++ in a file named foo.c without going to a lot of extra effort to tell the compiler that you've used C++ (the same amount of effort you'd have to go to to convince it that a file named foo.anythingelse contained C++). I don't know of any projects which do this, and I can't think of any sane reason to do it.

Source files don't require much effort; a source file containing C++ should have one of these extensions: .cc, .cp, .cxx, .cpp, .CPP, .c++, or .C. (Note the case-sensitivity on the last one.)

As you said, the primary trouble occurs with header files. .hh or .H unambiguously identifies a C++ header. However, a .h file could contain C, C++, Objective-C, or Objective-C++.

I don't think Ohcount should look at what files include a header. It should only look at content. I'd suggest that if Ohcount cannot determine a header file's language based on its contents, it should count it as C.

01/22/2008 08:41:47 AM changed by maciejkaminski

"I don't think Ohcount should look at what files include a header. It should only look at content."

Certainly I agree.

What's more, .h files already have to be disambiguated, since they can contain objective_c code, and IMO _exactly_ the same algorithm can be applied to detect whether they are C++ (test whether there are any source files and test for C++-specific keywords).

03/12/2008 04:59:10 PM changed by ciaranm

It's probably more reliable to look at the includes for a .h header. If it includes any of the standard library headers (<string> etc), it's C++. If it includes any of the standard library wrappers for C headers (<cstdlib> and so on), it's C++. If it includes any headers named .hh, .H or .hpp (boost uses .hpp), it's C++.
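
The include-based test described above could look roughly like the following. This is only a sketch, not code from Ohcount or from any patch: the function name, the regular expression, and the short list of standard headers are assumptions made for illustration, and a real list would be much longer.

```python
import re

# Assumed, deliberately incomplete sample of C++ standard library headers,
# including the C++ wrappers for C headers (<cstdlib> and friends).
CPP_STD_HEADERS = {
    "string", "vector", "map", "iostream", "algorithm",
    "cstdlib", "cstdio", "cstring",
}

# Header extensions that identify C++ on their own (case-sensitive).
CPP_HEADER_EXTENSIONS = (".hh", ".H", ".hpp")

# Matches '#include <name>' and '#include "name"', allowing spaces after '#'.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def includes_suggest_cpp(source):
    """Return True if any #include in a .h file points at a C++ header."""
    for name in INCLUDE_RE.findall(source):
        if name in CPP_STD_HEADERS:
            return True
        if name.endswith(CPP_HEADER_EXTENSIONS):
            return True
    return False
```

Note that <string.h> does not match here, because only the extensionless name "string" is in the assumed header set.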

03/12/2008 07:12:50 PM changed by ciaranm

  • attachment split-cncpp-into-c-cpp-1.patch added.

03/12/2008 07:12:58 PM changed by ciaranm

  • cc set to ciaranm.

This patch does the following:

  • Splits cncpp into c and cpp.
  • Uses the file extension to differentiate between the two, except for .h files. The current code does a lowercase lookup in the extensions list. To handle .C files, we first try a case sensitive lookup, and then fall back to a lowercase lookup.
  • For .h files, tries detecting cpp based upon whether they include any of the Standard Library or Technical Report headers, any header that is itself recognised as a C++ header through file extension, or a few C++ specific keywords that aren't likely to appear otherwise.
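
For illustration, the two-step extension lookup from the second bullet might be sketched like this. It is not the patch's actual code: the table names, the function name, and the table contents are hypothetical, and returning None for .h is just one way to signal that the file contents still need to be probed.

```python
import os

# Hypothetical tables. Entries here must match the extension's exact case.
CASE_SENSITIVE_EXTENSIONS = {".C": "cpp", ".H": "cpp", ".CPP": "cpp"}

# Entries here are matched after lowercasing the extension.
LOWERCASE_EXTENSIONS = {
    ".c": "c",
    ".cc": "cpp", ".cp": "cpp", ".cxx": "cpp", ".cpp": "cpp", ".c++": "cpp",
    ".hh": "cpp", ".hpp": "cpp",
}


def language_from_extension(filename):
    """Return 'c', 'cpp', or None when the extension alone cannot decide."""
    ext = os.path.splitext(filename)[1]
    # Step 1: case-sensitive lookup, so that foo.C is C++ rather than C.
    if ext in CASE_SENSITIVE_EXTENSIONS:
        return CASE_SENSITIVE_EXTENSIONS[ext]
    # Step 2: fall back to the lowercase lookup.
    return LOWERCASE_EXTENSIONS.get(ext.lower())
```

In this sketch a plain .h file yields None, which a caller would treat as "probe the contents".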

The keyword code is rather crude. In particular, it will incorrectly report cpp if a keyword is used inside a comment. Is there an easy way to reuse existing comment detection code to skip comments here?
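
One cheap approximation, short of reusing a real parser, would be to blank out comments and string literals with a regular expression before searching for keywords. The sketch below is an assumption-laden illustration, not part of the patch: the keyword list and all names are hypothetical, and it ignores corner cases such as line comments continued with a backslash.

```python
import re

# Matches comments and literals in one left-to-right pass, so that '//'
# inside a string is consumed as part of the string, not as a comment.
COMMENT_OR_LITERAL_RE = re.compile(
    r'//[^\n]*'              # line comment
    r'|/\*.*?\*/'            # block comment (non-greedy, may span lines)
    r'|"(?:\\.|[^"\\])*"'    # string literal with escapes
    r"|'(?:\\.|[^'\\])*'",   # character literal with escapes
    re.DOTALL,
)

# Hypothetical keyword list; the keywords used by the patch are not given.
CPP_KEYWORD_RE = re.compile(r'\b(?:template|typename|namespace)\b')


def strip_comments_and_literals(source):
    """Replace each comment or literal with a single space."""
    return COMMENT_OR_LITERAL_RE.sub(" ", source)


def keywords_suggest_cpp(source):
    """Return True if a C++-only keyword appears outside comments/literals."""
    stripped = strip_comments_and_literals(source)
    return CPP_KEYWORD_RE.search(stripped) is not None
```

This costs one extra pass over the file, which may or may not be acceptable given the performance concern raised in the description.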

03/14/2008 03:31:15 PM changed by robin@ohloh.net

  • status changed from new to closed.
  • resolution set to fixed.

Great patch.

I have applied this patch to the Ohloh main line.

This is a big patch for us, in the sense that it is going to take us a while to manage the transition. We have a LOT of code to recount, so C, C++, and C/C++ are going to live side-by-side for several months.

I think the strategy of looking at include file names is pretty clever.

Regarding the keyword match, I think you're out of luck on excluding comments from your grep. I'm sure some motivated person could get this to work, but what I fear is the amount of extra work it implies: we would have to parse the file using the cpp parser, and then look in the resulting code to see if it's really cpp.

Anyways, with a change this big I'm a bit nervous because something always seems to go wrong when you scale out to 2 billion lines of code, but I'm optimistic. Thanks very much for this patch!