

Now that you know a little more about how GeSHi works, you’re almost ready to begin writing some code. But first, you should think a little bit about the language you are going to support, and get a general idea of its structure in your head.

I suggest you do this on a piece of paper, although if you are adept with a drawing program you could do it without leaving your chair.

First Step

The first thing you need to think about is:

From the first byte in a source file in my language, what can be highlighted?

I find it best to do these things by example, so here we go:

Say we are writing the language files for PHP. Here’s a simple example file:

<?php
echo "hello world";

What is the first byte of this file? It is the < at the start of the <?php.

Just before this byte, what “context” are we in?

Do you think we are in the PHP context? No!

One of the “features” of PHP files is that they are in fact HTML, except for the code between <?php and ?> markers. So in our case, the “root” context is HTML!
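To make the idea concrete, here is a minimal sketch in Python (not GeSHi code, and not how GeSHi does it internally) that splits a PHP file into HTML and PHP runs. The function name split_php_contexts is made up for this illustration, and the scan is deliberately naive: it assumes a ?> never appears inside a PHP string or comment.

```python
def split_php_contexts(source):
    """Split a PHP file into (context, text) pairs.

    Everything outside <?php ... ?> is treated as the HTML root context.
    Hypothetical helper for illustration only.
    """
    segments = []
    pos = 0
    while pos < len(source):
        start = source.find("<?php", pos)
        if start == -1:
            # No more PHP blocks: the rest of the file is HTML.
            segments.append(("html", source[pos:]))
            break
        if start > pos:
            segments.append(("html", source[pos:start]))
        end = source.find("?>", start)
        # A missing closing marker means PHP runs to the end of the file.
        stop = len(source) if end == -1 else end + len("?>")
        segments.append(("php", source[start:stop]))
        pos = stop
    return segments


sample = '<p>Hi</p>\n<?php echo "hello world"; ?>\necho "not code";'
segments = split_php_contexts(sample)
for context, text in segments:
    print(context, repr(text))
```

Note how the trailing echo line ends up in the HTML context: outside the markers it is just text.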

However, largely, PHP (and other web scripting languages) are the exception, not the rule. If we think about C for example:

int main() {
  return 0;
}

From the first byte in this file we are in the “C” context. For example, the int keyword right at the start of the file is a C keyword. Compare this with the PHP example, where if “echo” were the first thing in the file it would be output as HTML.

So, think about this for your language. Can keywords be highlighted? What about symbols like + and -? Or are all keywords inside other markers, like < and > in HTML?

This context is known as the “root” context. For most languages we just refer to it as “the [language] context”, e.g. “the delphi context”. In PHP of course, the root context is “the HTML context”.


A large part of specifying a language for GeSHi is defining a “tree” of contexts, anchored by the root context. You now have to decide what contexts are children of the root context, and work out what should be highlighted in each of them.

This is an important task to get right: a mistake made now may be hard to correct later.

Following on our C example, we have a root context. In the root context we know that there are some things to be highlighted:

  • Keywords like return, if, while, for
  • Data types like int, double
  • Symbols like +, -, =, >, <
  • Numbers (both integers and floating point)
  • Variables

There are some more things, although my C is rusty at best. Your language may have a root context with things very similar to this, or maybe a lot different. You don’t need to worry about listing every symbol or keyword at this point; it’s just important to know that they exist and can be highlighted.
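If it helps to see the list as data, here is a rough Python sketch of the root context’s highlightable groups, plus a tiny classifier. Everything here (C_ROOT, classify, highlightables) is invented for illustration and is not GeSHi’s language file format; the keyword and type sets are deliberately incomplete.

```python
import re

# Hypothetical, incomplete description of what the C root context highlights.
C_ROOT = {
    "keywords": {"return", "if", "else", "while", "for"},
    "types": {"int", "double", "char", "void"},
    "symbols": set("+-=<>"),
}

# Floating point numbers, integers, identifiers, or any other single character.
TOKEN = re.compile(r"\d+\.\d+|\d+|[A-Za-z_]\w*|\S")


def classify(token):
    """Return the highlight group for one token, or "other"."""
    if token in C_ROOT["keywords"]:
        return "keyword"
    if token in C_ROOT["types"]:
        return "type"
    if token in C_ROOT["symbols"]:
        return "symbol"
    if token[0].isdigit():
        return "number"
    return "other"


def highlightables(code):
    """Tokenise a line of root-context code and classify each token."""
    return [(token, classify(token)) for token in TOKEN.findall(code)]


result = highlightables("int x = 40 + 2.5;")
for token, group in result:
    print(token, group)
```

Variables like x fall into "other" here, which is fine at this stage: knowing that the group exists matters more than matching it.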

You’ll notice that strings and comments have not been mentioned. That’s because they are handled as child contexts. What kinds of strings, comments and similar constructs does C have? Here’s a short list:

  • // comments, which end at the end of the line
  • #preprocessor commands, which end at the end of the line, but only if there is no \ at the end
  • "strings with double quotes"
  • character literals, only one character long, like 'a'

These are all children of the root context. So, currently the tree looks like this:

C (root context)
+-- // comments
+-- #preprocessor
+-- "string"
+-- 'c' haracters
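The same tree can be written down as a small data structure. This is only a sketch in Python; the Context class and the render helper are hypothetical names for this illustration, not part of GeSHi.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """One node in the context tree (hypothetical, for illustration)."""
    name: str
    children: list = field(default_factory=list)


def render(context, depth=0):
    """Return the tree as a list of indented lines, one per context."""
    lines = ["    " * depth + context.name]
    for child in context.children:
        lines.extend(render(child, depth + 1))
    return lines


c_root = Context("c", [
    Context("// comments"),
    Context("#preprocessor"),
    Context('"string"'),
    Context("'c' haracters"),
])

print("\n".join(render(c_root)))
```

None of the children has children of its own yet; that is what the next step works out.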

Now, for each child, do what you did for the root context. Work out what can be highlighted inside each one, and work out if it has any children.

Eventually you may end up with something like this:

C (root context)
|  - keywords
|  - symbols
|  - numbers
|  - variables
|  - struct fields (e.g. foo->bar)
+-  // comments
|    - start with //
|    - end with \n
|    - nothing else interesting
+-  # preprocessor
|    - start with #
|    - ends with \n, but only if no \ before it
|    - first part should be a valid preprocessor command (like #include)
+-  "strings"
|    - starts and ends with "
|    - \ is the escape character
|    - can't span multiple lines
+-  'c' haracters
|    - starts and ends with '
|    - must contain exactly one character or escape sequence (e.g. 'a' or '\n')
|    - \ is escape character
|    - can't span multiple lines

This is your “context tree”, and what we will be turning into code.
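As a sanity check on a tree like this, it can help to see that the start, end and escape rules are enough to carve a source file into contexts. The Python sketch below does that for the four child contexts above. It is an assumption-laden illustration, not GeSHi’s parser: the names CHILDREN, find_end and split_contexts are made up, a # anywhere in the root starts a preprocessor context, and the “can’t span multiple lines” rules are not enforced.

```python
# (name, start delimiter, end delimiter, escape character or None)
CHILDREN = [
    ("comment", "//", "\n", None),
    ("preprocessor", "#", "\n", "\\"),
    ("string", '"', '"', "\\"),
    ("char", "'", "'", "\\"),
]


def find_end(source, pos, end, escape):
    """Return the index just past the end delimiter, skipping escaped characters."""
    while pos < len(source):
        if escape and source[pos] == escape:
            pos += 2  # skip the escape character and whatever it escapes
            continue
        if source.startswith(end, pos):
            return pos + len(end)
        pos += 1
    return len(source)  # unterminated: the context runs to the end of input


def split_contexts(source):
    """Split C source into (context name, text) pairs using CHILDREN."""
    segments = []
    pos = root_start = 0
    while pos < len(source):
        for name, start, end, escape in CHILDREN:
            if source.startswith(start, pos):
                if pos > root_start:
                    segments.append(("root", source[root_start:pos]))
                stop = find_end(source, pos + len(start), end, escape)
                segments.append((name, source[pos:stop]))
                pos = root_start = stop
                break
        else:
            pos += 1  # no child starts here, so we are still in the root
    if root_start < len(source):
        segments.append(("root", source[root_start:]))
    return segments


source = (
    '#include <stdio.h>\n'
    'int main() {\n'
    '  puts("a \\"b\\""); // hi\n'
    "  return 'x';\n"
    '}\n'
)
segments = split_contexts(source)
for name, text in segments:
    print(name, repr(text))
```

Keywords, symbols and numbers would then only be looked for inside the "root" segments, which is exactly why strings and comments are separate contexts.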

Note that in this example the tree lists variables without saying how to recognise them. You may find that in your language there are a few things that seem too hard to match with the tree structure described. There’s no need to worry about them yet. Just work out the general conditions under which they appear (e.g. variables only ever appear in the root context and must be declared with a type before them, like int foo;), and remember them for later. GeSHi has a neat feature called the “Code Parser” that can deal with them.
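To show the kind of condition worth writing down, here is a rough Python sketch of the “declared with a type before it” rule. It is not the Code Parser and says nothing about its real interface; mark_variables and TYPES are hypothetical names, and the rule is crude enough that a function name such as main in int main() would be tagged as a variable too.

```python
TYPES = {"int", "double", "char"}


def mark_variables(tokens):
    """Tag identifiers as variables once they have been declared after a type."""
    declared = set()
    tagged = []
    previous = None
    for token in tokens:
        # An identifier directly after a type name counts as a declaration.
        if previous in TYPES and token.isidentifier():
            declared.add(token)
        tagged.append((token, "variable" if token in declared else "plain"))
        previous = token
    return tagged


tagged = mark_variables(["int", "foo", ";", "foo", "=", "bar", ";"])
for token, kind in tagged:
    print(token, kind)
```

Here foo is tagged both where it is declared and where it is used later, while bar stays plain because it was never declared.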


lang/dev/tutorial/2.txt · Last modified: 2011/09/01 13:03