


General-purpose programming languages offer a wide range of constructs for solving any category of problem. For some problems, though, this generality becomes a limitation. Such is the case when special algorithms are required to carry out operations that are simple in the chosen domain. As a result, the quality of the source code diminishes while its complexity increases.

In most cases the problem is solved by developing a specialized library or by employing a third-party one. There are situations, however, when the syntax of the language is a limitation in itself. A case in point is working with relational databases from object-oriented programming languages by means of SQL. The usual solution is to resort to a database access library. SQL queries then become character strings passed as arguments to methods that mediate access to the database.

Results are read through classes that iterate over the returned tuples and offer access to the fields of the current tuple. A tuple is an ordered list of values from the data set returned by the database engine. The underlying problem is the mismatch between the object-oriented language and the declarative SQL language.
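
The mismatch can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module purely as a stand-in for a database access library; the employee table, its columns and its rows are hypothetical. The query is an opaque string as far as the host language is concerned, and the results come back as plain tuples that are read by position.

```python
import sqlite3

# In-memory database; the "employee" table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ann", 5200), ("Bob", 4100)],
)

# The SQL text is only a string argument: the host language cannot check it.
cursor = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > ? ORDER BY name",
    (4500,),
)

# Each result row is a tuple; its fields are accessed by position.
rows = [(row[0], row[1]) for row in cursor]
conn.close()
```

A mistake inside the query string is reported only when the statement is executed, not when the host program is translated.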

Adapting the syntax to the problem at hand makes the source code easier to write and understand. This is the reason preprocessors emerged.

Let us consider two languages L and L' with grammars G and G', such that G ⊂ G' (G' extends G with supplementary constructions), and two programs P ∈ L and P' ∈ L'. A preprocessor is a program that implements a transformation algorithm from each program P' into an equivalent program P. The transformation operation is called preprocessing. L is the preprocessed language and L' is the preprocessor's language.

Preprocessing takes place before the translator for language L is invoked, in order to translate source code expressed in grammar G' into source code in grammar G.

Most often, the derived grammar G' differs from G only lexically, through terms that begin with a predefined character, usually #. These markers delimit the supplementary constructions defined by L'. The preprocessor scans the source text line by line and translates each marked line into instructions of language L. The translation is either template-based or relies on more complex processing that uses previously extracted information and may modify the generated code at several points.
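
The template-based variant of this scan can be sketched in a few lines. The sketch is written in Python only for illustration, and the directives #inc and #swap and their templates are hypothetical. Lines that do not start with # are copied unchanged.

```python
# Templates for the hypothetical directives of L'; {0} and {1} are arguments.
TEMPLATES = {
    "#swap": "tmp = {0}; {0} = {1}; {1} = tmp;",
    "#inc": "{0} = {0} + 1;",
}

def preprocess(source):
    """Translate every line marked with # by filling in its template."""
    out = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # An unknown directive raises KeyError in this sketch.
            name, *args = stripped.split()
            out.append(TEMPLATES[name].format(*args))
        else:
            out.append(line)  # ordinary line of L: copied as is
    return "\n".join(out)

program = "x = 1;\n#inc x\n#swap x y\n"
generated = preprocess(program)
```

Here `generated` holds three lines of the target language, with the two marked lines replaced by their expansions.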

Preprocessors offer a mechanism for conditional generation of code in language L. Conditional generation relies on the ability to define preprocessing symbols. A defined preprocessing symbol is used as a flag. Conditional preprocessing directives test whether a symbol is defined and, depending on the result, emit or skip sequences of the code generated by the preprocessor. This mechanism is used extensively in portable C source code, where different headers are included for different platforms. Depending on the type of translator assigned to language L, conditional preprocessing is called conditional compilation, conditional assembly or conditional interpretation.
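
A minimal sketch of conditional generation follows, again in Python for illustration only. It handles a simplified, hypothetical set of directives (#define, #ifdef, #else, #endif) modeled loosely on C, and it is not a description of how any real C preprocessor is implemented. The symbol name WINDOWS is an assumption of the example.

```python
def preprocess_conditionals(source, defined=()):
    """Keep or drop lines depending on which preprocessing symbols are defined."""
    symbols = set(defined)
    out = []
    stack = []  # one flag per open #ifdef: is that branch currently selected?
    for line in source.splitlines():
        parts = line.split()
        directive = parts[0] if parts else ""
        active = all(stack)  # emit only if every enclosing branch is selected
        if directive == "#ifdef":
            stack.append(parts[1] in symbols)
        elif directive == "#else":
            stack[-1] = not stack[-1]
        elif directive == "#endif":
            stack.pop()
        elif directive == "#define":
            if active:
                symbols.add(parts[1])
        elif active:
            out.append(line)
    return "\n".join(out)

source = "\n".join([
    "#ifdef WINDOWS",
    "#include <windows.h>",
    "#else",
    "#include <unistd.h>",
    "#endif",
    "int main(void) { return 0; }",
])
for_windows = preprocess_conditionals(source, {"WINDOWS"})
for_other = preprocess_conditionals(source)
```

The same source text yields two different generated programs, depending only on whether the flag is defined.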

Another common preprocessor extension to the original language is the macro-instruction. A macro-instruction is a named section of text that is inserted into the code generated by the preprocessor at each location where the name is referenced. Although there are usually no constraints, the text of a macro-instruction is source code in language L. Macro-instructions can have parameters, in which case the inserted text is expanded according to the parameter values.
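
The expansion of macro-instructions with and without parameters can be sketched as plain text substitution. The Python code below is an illustrative assumption, not the algorithm of any particular preprocessor; the macros PI and AREA are hypothetical, arguments may not contain nested parentheses or commas, and recursive macros are not guarded against.

```python
import re

# Hypothetical macro table: name -> (parameter names, body text).
MACROS = {
    "PI": ((), "3.14159"),
    "AREA": (("r",), "(PI * (r) * (r))"),
}

def substitute(params, body, args):
    """Replace each parameter name in body with the matching argument text."""
    if not params:
        return body
    mapping = dict(zip(params, args))
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, params)) + r")\b")
    # A single pass keeps inserted argument text from being substituted again.
    return pattern.sub(lambda m: mapping[m.group(1)], body)

def expand(text):
    """Expand macro references repeatedly until the text stops changing."""
    while True:
        previous = text
        for name, (params, body) in MACROS.items():
            if params:
                # Parameterized macro: NAME(arg, ...) without nested parentheses.
                pattern = r"\b%s\(([^()]*)\)" % re.escape(name)
                text = re.sub(
                    pattern,
                    lambda m: substitute(
                        params, body, [a.strip() for a in m.group(1).split(",")]
                    ),
                    text,
                )
            else:
                # Plain macro: every whole-word occurrence of NAME.
                text = re.sub(r"\b%s\b" % re.escape(name), lambda m: body, text)
        if text == previous:
            return text

expanded = expand("a = AREA(radius + 1);")
```

The argument text is inserted verbatim, which is why the body of AREA wraps each use of its parameter in parentheses.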

The macro-instruction mechanism is used in various language translators: compilers, macro-processors, assembly macro-processors, etc. Its usage is limited, though, by several disadvantages:

  • Semantics modification on expansion – if a macro-instruction that contains an expression is expanded inside another expression, operator precedence may change the logic of the surrounding expression. The risk is removed by enclosing the macro-instruction body in a high-priority construction; in C, this means wrapping it in parentheses. The preprocessor does not verify that this is done, because the programmer is supposed to know what they are doing;

  • No expansion of preprocessing directives – a preprocessing directive placed in the body of a macro-instruction is left unprocessed;

  • Side effects of expansion – in various scenarios macro-instructions have unwanted side effects on local or global variables, due to faulty usage.
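
The first and third drawbacks can be reproduced with a small simulation. The sketch below performs the textual expansion in Python and evaluates the resulting text with Python's own arithmetic, whose precedence for +, - and * matches that of C. The macro SQUARE(x) and the helper next_id are hypothetical.

```python
import re

def expand(body, param, arg):
    """Textually replace whole-word occurrences of param in body with arg."""
    return re.sub(r"\b%s\b" % re.escape(param), lambda m: arg, body)

# Hypothetical macro SQUARE(x), written without and with protective parentheses.
unsafe_body = "x * x"
safe_body = "((x) * (x))"

# Expansion of SQUARE(1 + 2) inside the larger expression 10 - SQUARE(1 + 2).
unsafe_expr = "10 - " + expand(unsafe_body, "x", "1 + 2")
safe_expr = "10 - " + expand(safe_body, "x", "1 + 2")

# eval is applied only to the fixed strings built above.
unsafe_value = eval(unsafe_expr)  # precedence rearranges the expression
safe_value = eval(safe_expr)      # parentheses preserve the intended meaning

# Side effect: an argument with side effects is evaluated once per occurrence.
calls = []

def next_id():
    calls.append(1)
    return len(calls)

side_effect_expr = expand(safe_body, "x", "next_id()")
side_effect_value = eval(side_effect_expr)
```

The unprotected body yields 13 instead of the intended 1. Parentheses fix the precedence problem but not the side effect: the argument next_id() appears twice in the expanded text and is therefore called twice.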

Preprocessors have spread because they can solve problems not tackled by the original language. A very widespread preprocessor was the FORTRAN language preprocessor. During the 1970s, FORTRAN did not contain control structures for structured programming. For this reason, keywords were defined to express the IF-THEN-ELSE, DO-UNTIL and WHILE structures. The programmer would write sequences using these keywords and the preprocessor would transform them into sequences that an ordinary FORTRAN compiler could understand. The lines that needed preprocessing were marked with #. Later evolution led to the emergence of an extended language: the preprocessor operated on the extended language, carried out the syntactic and semantic analysis, and translated the result into a classic FORTRAN program. The resulting program was then taken over by the classic FORTRAN compiler.

Advantages of preprocessed languages:

  • Increased programming productivity – it is much easier to write a simplified source code;

  • Improved code maintenance – modifications in the system are easier to do because the code is more compact;

  • Increased code comprehension – additional instructions in L' language offer better understanding of the application in a shorter time;

  • Decoupling of the abstract description provided by the instructions of the preprocessor's language from the actual implementation – the code generated by the preprocessor can be optimized without any modification to the original source code in L'.

Preprocessors have several uses, but their usage is decreasing, mainly because of their disadvantages:

  • Difficult debugging of applications written in a preprocessed language,

  • Non-standard extensions of the original language,

  • Lack of tools for the new language.

Preprocessors complicate application debugging because the errors reported by the compiler for language L refer to the source code generated by the preprocessor. Some programming languages, like C and C++, offer a mapping between the lines of the original source and the lines of the code generated by preprocessing. In C and C++ this is achieved with the #line directive. Such directives are only a partial solution, though. They are of no help when an exact analysis of the code generated for the preprocessed instructions in G' is needed, in order to determine the influence of the generated code on the program.
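
The idea behind such a line mapping can be sketched as follows. This is an analogue of what #line achieves, not the C mechanism itself: the hypothetical preprocessor below records, for every generated line, the number of the original line that produced it, so that a line number reported against the generated code can be translated back. The #swap directive is an assumption of the example.

```python
def preprocess_with_map(source):
    """Expand '#swap a b' lines and record the origin of each generated line."""
    generated = []  # lines of the generated program
    origin = []     # origin[i] is the 1-based source line behind generated line i + 1
    for number, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if parts and parts[0] == "#swap":
            a, b = parts[1], parts[2]
            expansion = ["tmp = %s;" % a, "%s = %s;" % (a, b), "%s = tmp;" % b]
        else:
            expansion = [line]
        generated.extend(expansion)
        origin.extend([number] * len(expansion))
    return generated, origin

def original_line(origin, generated_line):
    """Map a 1-based line number in the generated code back to the source."""
    return origin[generated_line - 1]

source = "x = 1;\n#swap x y\nz = x;"
generated, origin = preprocess_with_map(source)
reported_at = original_line(origin, 5)  # an error reported on generated line 5
```

The mapping identifies which original line is responsible, but, as noted above, it says nothing about which of the three generated statements behind line 2 actually caused a problem.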

Non-standard extensions of the original language through preprocessing mechanisms add new constructions to the original language. Excessive use of these constructions in the source code diminishes program clarity for readers who are not familiar with the additional concepts.

Because L' differs from L, tools that work on source code in L stop working on source code in L'. Editors with syntax highlighting, code analyzers, software metric programs, visual interface editors, etc. built for L do not work for L', especially when L' is a nonstandard extension. For standard or widely used preprocessors there are utility programs that support the modified syntax, but the selection may be much narrower than for the original language L. For example, for accessing relational databases from Java™ there is a standard preprocessor, SQLJ, promoted by the dominant database vendors. Still, SQLJ is not very widespread and few utility programs are adapted to the SQLJ extensions.

The preprocessor is a software product with a high degree of generality that accepts any program written in a certain language, including constructions marked with a special character, and generates program sequences in a standard language. The result of preprocessing is a valid program in the standard language, directly usable by its translator.

The special character used for preprocessors is usually #. A line in the source code that begins with # is a line to be preprocessed.

Many domain-specific languages (DSLs) use a preprocessor to transform their source code into a language recognized by an existing translator.

Preprocessing is also used to define derived data types shared by a large number of source files. There are situations when, due to frequent use, some data types and the structures that operate on them are assimilated into the language. For example, C and C++ programs for Windows used the low-level Win32 API for a long time. The Win32 API defined a series of data types for working with operating system resources. Simple data types like UINT, BOOL or PCHAR came to be used instead of the underlying unsigned int, int and char* types.