GeSHi - Generic Syntax Highlighter Syntax Coloriser for PHP
  

Viewing Issue Simple Details
ID: 0000101
Category: [GeSHi] core
Severity: feature
Reproducibility: always
Date Submitted: 02-01-07 02:42
Last Update: 02-01-07 09:32
Reporter: Knut
View Status: public
Assigned To:
Priority: low
Resolution: open
Status: new
Product Version:
Summary: 0000101: GeSHiContext regex
Description: The idea is to add an attribute to the class that holds the delimiter matches.
This could help in several cases, for example to treat C-family preprocessor directives differently.
Additional Information: When adding children in a language, you can specify that the delimiters are regexes by using '#REGEX #'. This does not, however, record the matches (or does it? I'm unsure here, fill me in, Nigel).

Anyway, the basic idea is that the matches are recorded so they can be processed later.

Would this be useful, or would it be better just to implement it in the code parser?
Attached Files

- Relationships

- Notes
(0000478)
BenBE
02-01-07 07:28

Basically, GeSHi already uses the results gathered while searching for delimiters. However, you can't rely on that information alone to find where a context is split into parts:

A nice example would be:
Text := 'Hello' + // '
  'World!';

Splitting this source at its delimiters (the single quotes of a string) would break the highlighting, because you have to account for the // that introduces a single-line comment.

Anyway, you could make better use of the information gathered here: avoid subsequent calls to geshi_get_position (@nigel: remember the mail about that function ;-)) by calculating offsets relative to the current parsing position.

Given the above source you get:
0x00 Text unknown
0x04 space
0x05 := symbol
0x07 space
0x08 ' single_string
0x09 Hello unknown
0x0E ' single_string
0x0F space
0x10 + symbol
0x11 space
0x12 // single_comment
0x14 space
0x15 ' single_string
0x16 \n
0x17 space
0x19 ' single_string
0x1A World unknown
0x1F ! unknown
0x20 ' single_string
0x21 ; symbol

Getting this list shouldn't be a problem (for long sources it should be limited to a reasonable size, cf. my mail).
Now you can use it:
a) for getting the first position (side effect)
AND
b) for getting the next starter without a new search (if there is a context beginning at a higher offset, then the current one has ended).
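
As a rough illustration of how such a list could be gathered in a single pass, here is a minimal sketch. It assumes one combined pattern with named groups; the function name scan_delimiters and the group names are made up for this example and are not part of GeSHi. It records only delimiter hits (quotes, // and newlines), not every token in the list above.

```php
<?php
// Hypothetical helper (not a GeSHi function): scan the buffer once and
// record every delimiter hit together with its byte offset.
function scan_delimiters(string $source): array
{
    // One combined pattern; the named groups are illustrative only.
    $pattern = "~(?P<single_comment>//)|(?P<single_string>')|(?P<newline>\\n)~";
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $hits = [];
    foreach ($matches as $m) {
        foreach (['single_comment', 'single_string', 'newline'] as $type) {
            // With PREG_OFFSET_CAPTURE an unmatched group is either missing
            // or reported with offset -1, so check for both.
            if (isset($m[$type]) && $m[$type][1] >= 0) {
                $hits[] = [
                    'offset' => $m[$type][1],
                    'text'   => $m[$type][0],
                    'type'   => $type,
                ];
                break;
            }
        }
    }
    return $hits;
}

// The example source from this note.
$source = "Text := 'Hello' + // '\n  'World!';";
foreach (scan_delimiters($source) as $hit) {
    printf("0x%02X %s %s\n", $hit['offset'], json_encode($hit['text']), $hit['type']);
}
```

The hits come back in source order, so the first entry doubles as the "first position" from a), and later entries are candidates for b).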

If, for example, GeSHi finds the string at 0x00 and finishes highlighting it, it steps on to 0x05 and finds that 0x05 is the next token to be rendered (I'm not sure at the moment whether the first step already includes 0x04, but I assume so). Now you can skip all positions before 0x05 (i.e. 0x00 and 0x04). Not much gain so far ...
But let's go on to 0x12:
GeSHi finds it, parses it as a comment (as the starter implies) and returns at location 0x17. Without searching again for splitters, GeSHi can now skip the space at 0x14, the ' at 0x15 and the \n (the ender) at 0x16, finding its next item of interest at location 0x17 (or 0x19 if spaces are ignored, as they belong to no context). What we gain here is quite simple: although we looked into the string only once, we gathered enough information to not have to look into it again (until we reach the end of the buffer we have already analyzed).
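
The skipping step could look roughly like the sketch below. It assumes a precomputed list of delimiter hits (offset plus type) and walks that list instead of searching the source again. The function name resolve_contexts, the array layout and the two hard-coded ender rules are assumptions for illustration, not GeSHi internals. The hit list deliberately stops at 0x19 to mimic a look-ahead window that ends inside the second string.

```php
<?php
// Hypothetical sketch (not GeSHi code): pair starters with their enders by
// walking a precomputed hit list, skipping hits swallowed by a context.
function resolve_contexts(array $hits): array
{
    $contexts = [];
    $count = count($hits);
    for ($i = 0; $i < $count; $i++) {
        $start = $hits[$i];
        if ($start['type'] === 'single_comment') {
            $ender = 'newline';        // a // comment runs to the end of the line
        } elseif ($start['type'] === 'single_string') {
            $ender = 'single_string';  // a string runs to the next quote
        } else {
            continue;                  // e.g. a newline outside any context
        }
        // Advance through the list instead of searching the source again.
        $j = $i + 1;
        while ($j < $count && $hits[$j]['type'] !== $ender) {
            $j++;
        }
        // null means the ender lies beyond the part of the buffer scanned so far.
        $end = $j < $count ? $hits[$j]['offset'] : null;
        $contexts[] = ['type' => $start['type'], 'start' => $start['offset'], 'end' => $end];
        $i = $j; // every hit up to and including the ender is consumed
    }
    return $contexts;
}

// Delimiter hits for the example source, limited to a window ending after 0x19.
$hits = [
    ['offset' => 0x08, 'type' => 'single_string'],
    ['offset' => 0x0E, 'type' => 'single_string'],
    ['offset' => 0x12, 'type' => 'single_comment'],
    ['offset' => 0x15, 'type' => 'single_string'], // inside the comment, gets skipped
    ['offset' => 0x16, 'type' => 'newline'],
    ['offset' => 0x19, 'type' => 'single_string'],
];
print_r(resolve_contexts($hits));
```

The quote at 0x15 never opens a string here because the comment starting at 0x12 consumes it. The string opened at 0x19 comes back with a null end, which is the point where a new analysis session would have to start.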

The prerequisites for this to work would be:
a) all splitters and starters in one big regexp (cf. your mail ;-))
b) an intelligent "look-ahead" system to guess up to which position an analysis session should go (if it reaches too far into a long context it wastes time; if it's too short it sends our performance gain to /dev/null ;-)).
 
(0000482)
Knut
02-01-07 09:32

Splitting the source is not the idea. What I meant was that the preg matches are recorded, so they can be stored. Example:

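// Root context of language 'foo': registers a child context named 'bar'.
function geshi_foo_foo (&$context)
{
    $context->addChild('bar');
}

// Child context 'bar': starts at a regex match and ends at 'foobar'.
function geshi_foo_foo_bar (&$context)
{
    $context->addDelimiters('REGEX#some(foo|bar)#', 'foobar');
}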

If the source fed to it is:

some bogus
wah
zomgh
s // Commentah
somefoo some content blah foobar


Then, the matches will be stored as array('somefoo', 'foobar'), for example.

But, if you feed it the source:

stgjnfdehg
fg zomg
somebar dgdfg foobar

The matches will be stored as array('somebar', 'foobar')

Even though it's the same context, really.
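
A minimal sketch of what recording the matches could mean, assuming a start delimiter given as a regex and an end delimiter given as a literal string. The function record_delimiter_matches is a hypothetical stand-in written for this example; it is not a GeSHiContext method and says nothing about how GeSHi handles delimiters internally.

```php
<?php
// Hypothetical stand-in (not a GeSHiContext method): find a context's
// delimiters and return the exact text each one matched.
function record_delimiter_matches(string $source, string $startRegex, string $endLiteral): ?array
{
    // Find the start delimiter and remember the text it actually matched.
    if (!preg_match($startRegex, $source, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    [$startText, $startOffset] = $m[0];

    // Look for the literal end delimiter after the start match.
    $endOffset = strpos($source, $endLiteral, $startOffset + strlen($startText));
    if ($endOffset === false) {
        return null;
    }
    return [$startText, $endLiteral];
}

// The two example sources from this note.
$first  = "some bogus\nwah\nzomgh\ns // Commentah\nsomefoo some content blah foobar\n";
$second = "stgjnfdehg\nfg zomg\nsomebar dgdfg foobar\n";

var_export(record_delimiter_matches($first, '#some(foo|bar)#', 'foobar'));
echo "\n";
var_export(record_delimiter_matches($second, '#some(foo|bar)#', 'foobar'));
echo "\n";
```

Both calls use the same delimiter definition, yet the recorded start text differs between the two sources, which is the information the request asks to keep.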

Get the idea?

~Knut
 

- Issue History
Date Modified Username Field Change
02-01-07 02:42 Knut New Issue
02-01-07 02:43 Knut Status new => assigned
02-01-07 02:43 Knut Assigned To  => Knut
02-01-07 02:43 Knut Assigned To Knut =>
02-01-07 02:43 Knut Status assigned => new
02-01-07 07:28 BenBE Note Added: 0000478
02-01-07 09:32 Knut Note Added: 0000482

  

