Towards an integrated Fortran parser for g95 and gfortran?

Two Fortran compilers are now available under the terms of the GNU General Public License: g95 and gfortran. Their source code is freely available, and it contains something of great interest to anyone who maintains large amounts of Fortran source code: a Fortran 95 parser. People who deal with large Fortran packages will understand what I mean: as a package grows, its code has to be treated as data. Whoever has data needs to process it, and data processing requires appropriate tools. Hence a parser giving access to the structure of the code is an invaluable tool.

I- First attempts:

This is how the idea of using the integrated parser of g95 came to me. It took me quite a long time to figure out what to do with the internal structures of the g95 compiler. My first idea was to dump these structures in a format very close to the compiler's internal representation. The dump contained a lot of very useful information, but it was not what I wanted: it was too far removed from the source code, even with references to the positions of Fortran entities. Taking the following piece of code as an example:

      PROGRAM MAIN
      REAL(KIND=4) :: X
      X = 1.
      END PROGRAM

my first attempts yielded something like:

  <statement id="0xbdf7b30" type="PROGRAM" loc="[0,6,0,18]"/>
  <statement id="0xbdf8420" type="TYPE_DECLARATION" loc="[1,6,1,23]" 
    decl_type="0x705820" decl_kind="0xbdf7fe0" decl_symbols="0xbdf8290"/>
  <statement id="0xbdf8f90" type="ASSIGNMENT" loc="[2,6,2,12]" expr1="0xbdf8100"
    expr2="0xbdf8b00"/>
  <expr id="0xbdf8100" type="VARIABLE" loc="[2,6,2,7]" symbol="0xbdf8290"/>
  <expr id="0xbdf8b00" type="CONSTANT" loc="[2,10,2,12]" value="1.E+0"/>
  <statement id="0xbdf9550" type="END_PROGRAM" loc="[3,6,3,17]"/>

This was quite close to the compiler's internal representation, with a lot of cross-references. It was also very verbose, and it did not exhibit the tree-like structure that should arise when parsing code.

II- A more interesting approach:

Source code is both a document (for the programmer) and a syntax tree (for the compiler). I wanted something that could represent both aspects. The ability to model both data and documents is a feature of XML, which can be used for encoding large data structures as well as for writing books. It is also very easy to search XML documents with XPath, transform them with XSLT, etc. The tools available for XML offer very interesting possibilities, and I will provide an example in section IV.

My second approach is therefore to represent the parser output using both the data-centric and the document-centric features of XML. For the example above, the parser output becomes:

<?xml version="1.0"?><fortran95 xmlns=""><G-S-lst><G-S N="MAIN_" defined="1" flavor="PROGRAM"/><G-S N="main" defined="1" flavor="PROGRAM"/></G-S-lst><file name="main2.F" width="72" form="fixed"><L/><executable-program>      <program-unit-lst><program-unit><S-lst><S G-N="MAIN_" defined="1" flavor="PROGRAM"/><S G-N="main" defined="1" flavor="PROGRAM"/><G-S-ref l-N="main" G-N="main"/><S defined="1" flavor="VARIABLE"/></S-lst><stmt-lst><stmt T="program">PROGRAM <program-N><s N="main"><c>MAIN</c></s></program-N></stmt>
<L/>      <stmt T="T-decl"><T-spec><T-spec T="REAL">REAL(KIND=<kind-selector><E T="literal" val="4" CST-T="INTEGER">4</E></kind-selector>)</T-spec></T-spec> :: <entity-decl-lst><entity-decl><obj-N><s N="x"><c>X</c></s></obj-N></entity-decl></entity-decl-lst></stmt>
<L/>      <stmt T="assgt"><var><E T="var"><S><s N="x"><c>X</c></s></S></E></var> <assgt><a N="="><c>=</c></a></assgt> <E><E T="literal" val="1.E+0" CST-T="REAL">1.</E></E></stmt>
<L/>      <stmt T="end-program">END PROGRAM</stmt></stmt-lst></program-unit></program-unit-lst></executable-program></file></fortran95>

Note that this time the internal structures of the compiler and the source code are interleaved. The result is a kind of source code annotated with XML tags. The tree structure of XML is now used, and no cross-references appear.
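
To illustrate this dual nature, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is not part of the modified g95; it only takes the assignment statement from the output above and shows that the text nodes alone give back the source, while the elements form a tree. The helper name outline is an illustrative choice.

```python
import xml.etree.ElementTree as ET

# The assignment statement from the parser output above, copied verbatim.
STMT = ('<stmt T="assgt"><var><E T="var"><S><s N="x"><c>X</c></s></S></E></var> '
        '<assgt><a N="="><c>=</c></a></assgt> '
        '<E><E T="literal" val="1.E+0" CST-T="REAL">1.</E></E></stmt>')

stmt = ET.fromstring(STMT)

# Document view: concatenating the text nodes gives back the source text.
source_text = "".join(stmt.itertext())


def outline(elem, depth=0):
    """Data view: one indented line per element, with its tag and attributes."""
    attrs = " ".join(f'{k}="{v}"' for k, v in elem.attrib.items())
    lines = ["  " * depth + f"{elem.tag} {attrs}".rstrip()]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(stmt)
print(source_text)
print("\n".join(tree_lines))
```

Dropping the tags is enough to get the Fortran text back, which is what makes the format usable as a document as well as a data structure.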

Such output is very easy to manipulate: for instance, one can search for statements that contain a particular symbol, add a new dimension to an array, expand some loops, etc.
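
As an example of the first kind of manipulation, the following Python sketch searches for the statements that mention a given symbol. It assumes, from the sample output above, that every symbol occurrence is an s element whose N attribute holds the lower-cased name. The sample is a trimmed copy of that output (symbol tables and some wrapper elements removed), and statements_using is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the parser output shown above (symbol tables omitted).
SAMPLE = """<fortran95><file name="main2.F" width="72" form="fixed"><stmt-lst><stmt T="program">PROGRAM <program-N><s N="main"><c>MAIN</c></s></program-N></stmt>
      <stmt T="T-decl"><T-spec><T-spec T="REAL">REAL(KIND=<kind-selector><E T="literal" val="4" CST-T="INTEGER">4</E></kind-selector>)</T-spec></T-spec> :: <entity-decl-lst><entity-decl><obj-N><s N="x"><c>X</c></s></obj-N></entity-decl></entity-decl-lst></stmt>
      <stmt T="assgt"><var><E T="var"><S><s N="x"><c>X</c></s></S></E></var> <assgt><a N="="><c>=</c></a></assgt> <E><E T="literal" val="1.E+0" CST-T="REAL">1.</E></E></stmt>
      <stmt T="end-program">END PROGRAM</stmt></stmt-lst></file></fortran95>"""


def statements_using(root, symbol):
    """Return (statement type, source text) for each <stmt> mentioning `symbol`."""
    hits = []
    for stmt in root.iter("stmt"):
        # Assumption: <s N="..."> carries the lower-cased symbol name.
        if stmt.find(f".//s[@N='{symbol}']") is not None:
            # Joining the text nodes recovers the statement as written.
            hits.append((stmt.get("T"), "".join(stmt.itertext())))
    return hits


root = ET.fromstring(SAMPLE)
hits_x = statements_using(root, "x")
for kind, text in hits_x:
    print(kind, "|", text)
```

The same query can be written as a single XPath expression, //stmt[.//s/@N='x'], in any tool that supports full XPath.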

III- How does it work?

I have hacked the g95 parser so that it dumps its output in XML. The result (I mean the changes I made to the g95 code) is not very nice, because I tried to minimize the changes to the original code, but it could be improved if it were accepted into the official gfortran or g95 source.

The idea is to control the simplification of expressions, add a concept of statement (which neither g95 nor gfortran has), and sort the Fortran source text being parsed into different categories (comment, string, Hollerith, etc.). It is also necessary to record a very precise location for every item.

I have tested this on a package of one million statements, and it works.

IV- A Fortran browser:

The output of the parser can be viewed in Firefox using a stylesheet and XBL bindings. This approach is very powerful, and I think it is possible to create a very nice source code browser using XBL. Here is a small example:


In the example above, you can click on a local variable to get a small window displaying all the statements where this variable appears. Then click on a file position in the small window to make the main window jump to that location. It is also possible to left-click on any symbol to get a context menu and highlight every occurrence of that symbol. Clicking on an included file folds or unfolds it. Note also that lines are numbered by the XBL itself (the original XML does not contain line numbers).
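
The line numbering can also be reproduced outside the browser. The Python sketch below is not the XBL binding itself. It assumes, from the sample output in section II, that each <L/> element marks the start of a source line, and it counts these markers in document order. The sample is a trimmed copy of that output in which only the <L/> and <stmt> elements are kept, and number_statements is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the output from section II: inner markup of statements removed.
SAMPLE = """<file name="main2.F"><L/><stmt-lst><stmt T="program">PROGRAM MAIN</stmt>
<L/>      <stmt T="T-decl">REAL(KIND=4) :: X</stmt>
<L/>      <stmt T="assgt">X = 1.</stmt>
<L/>      <stmt T="end-program">END PROGRAM</stmt></stmt-lst></file>"""


def number_statements(root):
    """Return (line number, statement type, text) for every <stmt> element."""
    line = 0
    numbered = []
    # iter() walks the tree in document order, so each <L/> is seen
    # before the statements that start on its line.
    for elem in root.iter():
        if elem.tag == "L":
            line += 1  # assumption: one <L/> per source line
        elif elem.tag == "stmt":
            numbered.append((line, elem.get("T"), "".join(elem.itertext())))
    return numbered


numbered = number_statements(ET.fromstring(SAMPLE))
for line, kind, text in numbered:
    print(f"{line:4d}  {kind:12s} {text}")
```

A statement continued over several lines is reported at the line where it starts, since only the markers preceding the stmt element are counted.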

I have produced XML for larger files, and the loading time of the XML and XBL in Firefox is acceptable.

V- Download:

You can download the modified source of g95 here. Using it is straightforward:

$ f95 -xml main.F

The command above will create a main.xml file.