GeSHi Bug Tracker - GeSHi
Viewing Issue Advanced Details
ID: 86 | Category: core | Severity: minor | Reproducibility: always | Date Submitted: 07-28-06 23:17 | Last Update: 09-14-06 00:05
Reporter: nigel
Assigned To: Netocrat
Priority: normal
Status: assigned
Product Version: 1.1.2alpha2
Resolution: open
none    
none  
0000086: Single character context should support proper single characters
The single character class does not support "wide" single characters such as '\xFFFF' (in C). The issue is that \xFFFF represents a single character, even though it spans several characters of source text.

The fact that characters of length over 1 can only occur via regex may be helpful in detecting whether the single char context should start - i.e. after finding a ', just look at the next part of the string comparing with all the characters/regexes specified.

This leads to an optimisation: once we know the single char context has started, we may already know its length, so a single substr() call can extract the contents of the context.
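
A minimal sketch of the detection idea described above, written in Python rather than GeSHi's PHP. The escape patterns, the function name match_char_literal and its return shape are assumptions made for illustration only; they are not taken from GeSHi. After finding the opening ', the code compares the next part of the string against each pattern, checks that the closing delimiter follows, and then extracts the contents with one slice because the length is already known.

```python
import re

# Hypothetical escape patterns for C character literals (not GeSHi's tables).
ESCAPE_PATTERNS = [
    re.compile(r"\\x[0-9A-Fa-f]+"),  # hex escape such as \xFFFF
    re.compile(r"\\[0-7]{1,3}"),     # octal escape such as \101
    re.compile(r"\\."),              # simple escape such as \n or \'
]


def match_char_literal(source, quote_pos, delimiter="'"):
    """Return (content, end_pos) if a valid literal starts at quote_pos, else None."""
    if not source.startswith(delimiter, quote_pos):
        return None
    body_start = quote_pos + len(delimiter)

    # Collect the length of every candidate that matches right after the quote.
    lengths = []
    for pattern in ESCAPE_PATTERNS:
        m = pattern.match(source, body_start)
        if m:
            lengths.append(m.end() - body_start)
    # A plain (non-escape, non-delimiter) character has length 1.
    if body_start < len(source) and source[body_start] not in ("\\", delimiter):
        lengths.append(1)

    # Accept the longest candidate that is followed by the end delimiter.
    for length in sorted(lengths, reverse=True):
        body_end = body_start + length
        if source.startswith(delimiter, body_end):
            # The length is known, so one slice (substr) yields the contents.
            return source[body_start:body_end], body_end + len(delimiter)
    return None
```

With these assumed patterns, '\xFFFF' is accepted as one character, while 'ab' is rejected because no candidate is followed by the closing quote.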
 class.geshisinglecharcontext.php.patch (6,225 bytes) 07-30-06 03:29
 functions.geshi.php.patch (929 bytes) 07-30-06 03:30
 class.geshisinglecharcontext.php.v2.patch (6,738 bytes) 08-01-06 12:50
 functions.geshi.php.v2.patch (2,829 bytes) 08-01-06 12:51

Notes
(0000415)
Netocrat   
07-30-06 03:29   
I've generated patches that attempt to deal with this. They are against the files geshi/functions.geshi.php and geshi/classes/class.geshisinglecharcontext.php.

The new code performs full checking for validity, including the end delimiter, in getContextStartData(), and stores the result so that _getContextEndData() simply returns the stored data. This is as close as seems possible to the optimisation you were hoping for, though there is potential for other optimisations.
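
To illustrate the "validate at start, return stored data at end" arrangement, here is a rough Python sketch. It is not the patch itself: the class name, the method names in snake_case, the dict layout and the simplified validation (one plain character, or a backslash plus one character) are all assumptions.

```python
class SingleCharContextSketch:
    """Hypothetical stand-in for a single-character context; not GeSHi code."""

    def __init__(self, delimiter="'"):
        self.delimiter = delimiter
        self._end_data = None  # filled in by get_context_start_data()

    def get_context_start_data(self, code, start=0):
        """Find the first fully valid literal and remember where it ends."""
        self._end_data = None
        dlen = len(self.delimiter)
        pos = code.find(self.delimiter, start)
        while pos != -1:
            body_start = pos + dlen
            if code.startswith("\\", body_start):
                length = 2  # simplified: backslash plus one character
            elif body_start < len(code) and not code.startswith(self.delimiter, body_start):
                length = 1  # one plain character
            else:
                length = 0  # empty or truncated literal: not valid
            # Full validity check, including the end delimiter, happens here.
            if length and code.startswith(self.delimiter, body_start + length):
                self._end_data = {"pos": body_start + length, "len": dlen}
                return {"pos": pos, "len": dlen}
            pos = code.find(self.delimiter, pos + 1)
        return None

    def get_context_end_data(self):
        """No searching needed: return what the start check already stored."""
        return self._end_data
```

For the input "it's 'b'", the apostrophe in "it's" is skipped because no closing delimiter follows one character later, and the end data is already available once the start has been found.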

As well as resolving this issue, the patches:
* remove any assumptions on the length of the start delimiter, to support e.g. C's wide characters that begin with L' (the end delimiter is still assumed to have length 1).
* add a setDisallowEmptyChars() method to specify that empty characters are illegal, as they are in C: '' is a syntax error
* introduce recognition of $offset to geshi_get_position() when $needle is a REGEX: this is required to support the first patch.
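
As an illustration of the last item, the following Python sketch shows what honouring an offset for both plain-string and regex needles can look like. The function name get_position, its parameters and its return shape are hypothetical and do not reproduce the real signature of geshi_get_position().

```python
import re


def get_position(haystack, needle, offset=0, needle_is_regex=False):
    """Return {"pos", "len"} for the first match at or after offset, else None.

    Hypothetical helper for illustration; not GeSHi's geshi_get_position().
    """
    if needle_is_regex:
        # Pattern.search() accepts a start position, so the offset is honoured
        # without slicing the haystack (positions stay relative to the whole string).
        m = re.compile(needle).search(haystack, offset)
        if m is None:
            return None
        return {"pos": m.start(), "len": m.end() - m.start()}
    pos = haystack.find(needle, offset)
    if pos == -1:
        return None
    return {"pos": pos, "len": len(needle)}
```

Keeping positions relative to the full string means the caller can feed the result straight back in as the next offset.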

I'll attach the patches to this bug report.
(0000416)
nigel   
07-30-06 14:13   
Looks good so far. I'm guessing that supporting a delimiter longer than length one would not be too hard now if it was required.
(0000417)
Netocrat   
07-31-06 00:40   
It would be easier, and the places where that assumption has been made are easier to spot.
(0000423)
Netocrat   
08-01-06 12:56   
...and I've done that, as well as allowing for arbitrary-length escape characters, and also fixing the issue indicated by a // WARN comment: now the most inclusive matching escape sequence is found, rather than the first one encountered. The changes are in the v2 patches that I've just uploaded.
(0000425)
nigel   
08-08-06 00:00   
As mentioned in e-mail: you may add them :)
(0000447)
nigel   
09-14-06 00:05   
Netocrat: As far as I can tell (after finally doing escape character grouping and updating the C language file) this seems to be fixed. The only thing I haven't done is to put the test cases in my test system, but that's another bug. Is this OK to be marked as resolved?