GeSHi Bug Tracker - GeSHi
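ID: 101
Category: core
Severity: feature
Reproducibility: always
Date Submitted: 02-01-07 02:42
Last Update: 02-01-07 09:32
Reporter: Knut
Priority: low
Status: new
Resolution: open
Projection: none
ETA: none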
0000101: GeSHiContext regex
The idea is to add an attribute to the class, which contains the delimiters.
This could help in several cases, for example to treat C family preprocessor directives differently.
When adding children in a language, you can specify that a delimiter is a regex by prefixing it with 'REGEX', as in 'REGEX#some(foo|bar)#'. This does not, however, record the matches (or does it? Unsure here, fill me in, Nigel).

Anyway, the basic idea is that the matches are recorded so they can be processed later.

Would this be useful, or would it be better just to implement it in the code parser?

Notes
(0000478)
BenBE   
02-01-07 07:28   
Basically, GeSHi already uses the results gathered during the delimiter search. However, you can't rely on that information alone to decide where a context gets split into parts:

A nice example would be:
Text := 'Hello' + // '
  'World!';

Splitting this source by its delimiters (the single quotes for a string) would break the highlighting, since you need to take into account the // that introduces a single-line comment.

Anyway, you could make better use of the information gathered here by avoiding subsequent calls to geshi_get_position (@nigel: Remember the mail about that function ;-)) and calculating offsets relative to the current parsing position instead.
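To make the problem concrete, here is a minimal sketch in plain PHP (not GeSHi code; the variable names are made up for illustration). It compares a naive quote-pair match with a match that also tries the // comment at each position:

```php
<?php
// Hypothetical illustration only; this is not how GeSHi parses.
$source = "Text := 'Hello' + // '\n  'World!';";

// Naive approach: treat every pair of single quotes as one string.
preg_match_all("/'[^']*'/", $source, $matches);
$naive = $matches[0];
// The second match starts at the quote inside the comment and swallows
// the line break, so 'World!' is never seen as a string of its own.

// Comment-aware approach: the comment alternative is tried at each position
// too, so a quote inside a comment can never open a string.
preg_match_all("~//[^\n]*|'[^']*'~", $source, $matches);
$aware = array_values(array_filter($matches[0], function ($m) {
    return $m[0] === "'"; // keep only the matches that are strings
}));

var_dump($naive, $aware);
```

The naive list contains 'Hello' and a bogus string made of the comment's quote plus the line break; the comment-aware list contains 'Hello' and 'World!'.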

Given the above source you get:
0x00 Text unknown
0x04 space
0x05 := symbol
0x07 space
0x08 ' single_string
0x09 Hello unknown
0x0E ' single_string
0x0F space
0x10 + symbol
0x11 space
0x12 // single_comment
0x14 space
0x15 ' single_string
0x16 \n
0x17 space
0x19 ' single_string
0x1A World unknown
0x1F ! unknown
0x20 ' single_string
0x21 ; symbol

Getting this list shouldn't be a problem (for long sources it should be limited to a reasonable size, cf. my mail).
Now you can use this:
a) for getting the first position (side effect)
AND
b) for getting the next starter without a new search (if there is a context beginning at a higher offset, then the current one has ended).
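A rough sketch of how such a list could be gathered in one pass, assuming plain PHP with preg_match_all and PREG_OFFSET_CAPTURE. The function name scan_delimiters, the $window parameter and the delimiter set are assumptions made for this example, not GeSHi's API, and only delimiters are listed (no "unknown" or space entries):

```php
<?php
// Hypothetical helper; not part of GeSHi. Scans at most $window bytes
// starting at $start and returns [offset, text, kind] for every delimiter.
function scan_delimiters($source, $start = 0, $window = 4096)
{
    // Assumed delimiter set for the sample above, one capture group per kind.
    $kinds = [1 => 'single_comment', 2 => 'single_string', 3 => 'symbol', 4 => 'newline'];
    $regex = "~(//)|(')|(:=|[+;])|(\n)~";

    $chunk = substr($source, $start, $window);
    preg_match_all($regex, $chunk, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $found = [];
    foreach ($matches as $match) {
        foreach ($kinds as $group => $kind) {
            // Groups that did not take part in the match are absent or have offset -1.
            if (isset($match[$group]) && $match[$group][1] >= 0) {
                $found[] = [$start + $match[$group][1], $match[$group][0], $kind];
            }
        }
    }
    return $found;
}

$source = "Text := 'Hello' + // '\n  'World!';";
foreach (scan_delimiters($source) as list($offset, $text, $kind)) {
    printf("0x%02X %s %s\n", $offset, json_encode($text), $kind);
}
```

Note that the scan is context-free on purpose: the quote inside the comment is still listed, and it is up to the parser to skip it. A delimiter that straddles the end of the window would be missed, so a real implementation would have to handle that boundary.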

If, for example, GeSHi finds the token at 0x00 and finishes highlighting it, it moves on and finds 0x05 to be the next token to render (I'm not sure at the moment whether the first step already includes 0x04, but I assume so). Now you can skip all positions before 0x05 (i.e. 0x00 and 0x04). Not much gain so far ...
But let's go on to 0x12:
GeSHi finds it, parses it as a comment (as the starter implies) and returns at location 0x17. Without searching for splitters again, GeSHi can now skip the space at 0x14, the ' at 0x15 and the \n (the ender) at 0x16, and finds the next item it has to care about at location 0x17 (or 0x19 if spaces are ignored because they belong to no context). What we gain here is quite simple: although we looked into the string only once, we gathered enough information to not have to look into it again (until we reach the end of the buffer we already analyzed).
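The walk described above could look roughly like this. It is a sketch under assumptions: the list layout [offset, text, kind] and the function next_starter are invented for illustration and do not exist in GeSHi.

```php
<?php
// Hypothetical sketch; the data layout and names are assumptions.
// Pre-scanned delimiter entries for the sample: [offset, text, kind].
$delimiters = [
    [0x08, "'", 'single_string'], [0x0E, "'", 'single_string'],
    [0x12, '//', 'single_comment'], [0x15, "'", 'single_string'],
    [0x16, "\n", 'newline'], [0x19, "'", 'single_string'],
];

// Return the index of the first cached entry at or after $position,
// starting from $index, without searching the source text again.
function next_starter(array $delimiters, $index, $position)
{
    while ($index < count($delimiters) && $delimiters[$index][0] < $position) {
        $index++;
    }
    return $index;
}

$index = next_starter($delimiters, 0, 0x12);  // lands on the // entry

// The comment context opened at 0x12 ends with the line break at 0x16,
// so parsing resumes at 0x17.
$resume = 0x17;
$index = next_starter($delimiters, $index + 1, $resume);
$next = $delimiters[$index];                  // the quote at 0x19

printf("next starter: 0x%02X (%s)\n", $next[0], $next[2]);
```

The quote at 0x15 and the line break at 0x16 are stepped over by comparing offsets only; the source text itself is not searched a second time.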

The prerequisites for this to work would be:
a) all splitters and starters in one big regexp (cf. your mail ;-))
b) an intelligent "look-ahead" system to guess up to which position an analysis session should go (if it reaches too far into a long context it wastes time; if it's too short it sends our performance gain to /dev/null ;-)).
(0000482)
Knut   
02-01-07 09:32   
Splitting the source is not the idea. What I meant was that the results of preg_match are recorded, so they can be stored. Example:

function geshi_foo_foo (&$context)
{
    $context->addChild('bar');
}

function geshi_foo_foo_bar (&$context)
{
    $context->addDelimiters('REGEX#some(foo|bar)#', 'foobar');
}

The source fed to it is:

some bogus
wah
zomgh
s // Commentah
somefoo some content blah foobar


Then, the matches will be stored as array('somefoo', 'foobar'), for example.

But, if you feed it the source:

stgjnfdehg
fg zomg
somebar dgdfg foobar

The matches will be stored as array('somebar', 'foobar').

Even though it's the same context, really.
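As a minimal sketch of what "recording the matches" could mean, assuming plain PHP: the function record_delimiter_matches is made up for this example, nothing like it is claimed to exist in GeSHiContext, and the 'REGEX' prefix is assumed to have been stripped already so that only the pattern itself is passed in.

```php
<?php
// Hypothetical sketch of the proposal; the function name is invented.
// Returns array(matched starter, matched ender), or null if the context
// does not occur in $source.
function record_delimiter_matches($source, $startRegex, $endLiteral)
{
    // Find the starter and remember what the regex actually matched.
    if (!preg_match($startRegex, $source, $start, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    // Look for the literal ender only after the end of the starter.
    $afterStart = $start[0][1] + strlen($start[0][0]);
    if (strpos($source, $endLiteral, $afterStart) === false) {
        return null;
    }
    return [$start[0][0], $endLiteral];
}

$first  = "some bogus\nwah\nzomgh\ns // Commentah\nsomefoo some content blah foobar\n";
$second = "stgjnfdehg\nfg zomg\nsomebar dgdfg foobar\n";

$a = record_delimiter_matches($first, '#some(foo|bar)#', 'foobar');
$b = record_delimiter_matches($second, '#some(foo|bar)#', 'foobar');
var_dump($a, $b);
```

For the two sources above this yields array('somefoo', 'foobar') and array('somebar', 'foobar'): the same context, but with a different recorded starter.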

Get the idea?

~Knut