GeSHi - Generic Syntax Highlighter Syntax Coloriser for PHP
  

Viewing Issue Simple Details
ID: 0000028
Category: [GeSHi] core
Severity: feature
Reproducibility: N/A
Date Submitted: 12-02-05 10:44
Last Update: 12-11-05 22:31
Reporter: BenBE
View Status: public
Assigned To: nigel
Priority: low
Resolution: fixed
Status: closed
Product Version: (not specified)
Summary: 0000028: Parse Spaces as separate tokens
Description: When parsing source such as

procedure ABC; register;

you get output like

GeSHiDelphiCodeParser::parseToken(procedure...)
GeSHiDelphiCodeParser::parseToken( ABC...)
GeSHiDelphiCodeParser::parseToken(;...)
GeSHiDelphiCodeParser::parseToken( ...)
GeSHiDelphiCodeParser::parseToken(register...)
GeSHiDelphiCodeParser::parseToken(;...)
GeSHiDelphiCodeParser::parseToken(...)

where sometimes the space is part of the token, sometimes it's a separate token.

Could you modify the parser to always give spaces as separate tokens?
Additional Information: It seems the spaces may appear before or after the actual token. For a keyword, they always seem to be a separate token. Always making spaces separate tokens would be a bit easier to handle inside the code parser, IMHO.
Attached Files

- Relationships
has duplicate 0000029 (closed, nigel): Splitting tokens at breaks

- Notes
(0000081)
nigel
12-02-05 17:34

Yes, this should be done. Most of the time people would want to ignore whitespace, so perhaps at the start of the parseToken method we could have this (only if needed):

if ($this->isWhitespace($token)) {
    return array($token, $context_name, $data);
}

But if you wanted it you could ignore that code.

I'll have to write the isWhitespace method and also split whitespace out.

I think I will send ALL whitespace in at once, if it occurs. e.g. if the code looks like this:

int foo;

Then all the spaces between "int" and "foo" (and newlines and tabs) will get passed in at once.
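As a rough illustration of passing each whitespace run in as one token, here is a minimal sketch. It is not GeSHi's tokenizer: the function name splitKeepingWhitespace is hypothetical, and it separates on whitespace only, so punctuation such as ";" stays attached to the preceding word.

```php
<?php
// Hypothetical helper (not part of GeSHi): split source so that every run of
// spaces, tabs and newlines becomes exactly one token of its own.
function splitKeepingWhitespace($source)
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the captured whitespace runs in the result;
    // PREG_SPLIT_NO_EMPTY drops empty strings at the start or end.
    return preg_split(
        '/(\s+)/',
        $source,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

$tokens = splitKeepingWhitespace("int \t\n foo;");
// $tokens is array('int', " \t\n ", 'foo;')
```

Joining the returned tokens with implode('', $tokens) gives back the original string, which is the property a highlighter needs so that no whitespace is lost.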
 
(0000084)
BenBE
12-03-05 09:21

Well, for my usage I'd need a method more like

if (trim($token) != $token) {
    parseCode($whitespacebefore);
    parseCode(trim($token));
    parseCode($whitespaceafter);
}

This also requires the stack feature of the code parser to be implemented which you already noted in the source.

Just skipping the whitespace would cause crazy results ;-) Wouldn't it :P?

It's more like

if($this->isWhitespace($token))
{
    $this->StoreToken(array($token, $context_name, $data));
    return false;
}

Thus e.g. for the default keyword I only have to do a

return $this->StorageClear();

and every token in the storage gets flushed to the result string.
 
(0000088)
nigel
12-03-05 18:20

Well when you return the token it is added to the highlighted source, so it's not so crazy.

The "stack" thing is not really an essential thing btw, it's just something that might be useful in some cases. You don't have to use it, and in fact you are free to use whatever way you like in working stuff out.

But the stack will be useful so I will be doing it.

Anyways, I'll get on to this shortly.
 
(0000101)
nigel
12-05-05 22:37

Okay, I have made it split by whitespace. There is an isWhitespace method if you need it.

Note that currently this source is not highlighted correctly:


                Angle1: Extended;
                Angle2: Extended;
                );
            1: (
                C: TVector3D;
                R: TVector3D;
                A1: Extended;
                A2: Extended;
                );
    End;

    TCircle2D = Packed Record
        Case Integer Of
            0: (
                Center: TVector2D;
                Radius: Extended;
                );
            1: (
                C: TVector2D;
                R: Extended;
                );

The 0 is gobbled.

I'm not sure if you understand totally how the parseToken method works, so I'll give a brief description:

parseToken takes:
  * $token - the stuff to parse
  * $context_name - the name of the context that the token is in
  * $data - extra stuff

This method will be called once for each token encountered.

It returns one of three things:

  * The default case: an array($token, $context_name, $data). This returns A token, not necessarily THE token that was passed in, remember. Although much of the time it will be the token passed in, e.g. in the case of whitespace or stuff that is already in the correct context.
  * false. This is if you want to store the token that you received. You're not interested in passing anything back just yet.

    However, since this method is called exactly once per token, you need to be able to pass back more than one token every now and then.
  * array(array($token, $context_name, $data), array($token, $context_name, $data), ...). This is the case that allows you to pass back more than one. Often used just after you looked at the previous token and returned false.

So, you have to make sure that:

  * Every token that is passed in gets passed out *in order*. The third return value is considered ordered by ascending key, so put the oldest stuff first.
  * Every token *does* get passed out. You have to have a way of remembering what tokens were passed in if you return false at one time. That's why I was using a stack at the start. You don't have to use a stack, although it could be helpful, so that's why I'll probably move it into the parent.

You're free to define any number of extra fields and methods to help in parsing.
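The three return values and the two ordering rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual Delphi code parser: the class name, the $pending field, the flush() method, the 'delphi' context name, and the choice of "procedure" as the token to hold back are all hypothetical.

```php
<?php
// Hypothetical sketch of the parseToken contract described above.
class BufferingParserSketch
{
    // Tokens received but not yet passed back, oldest first.
    private $pending = array();

    public function isWhitespace($token)
    {
        return $token !== '' && trim($token) === '';
    }

    public function parseToken($token, $context_name, $data)
    {
        $entry = array($token, $context_name, $data);

        if (empty($this->pending)) {
            if (strtolower($token) === 'procedure') {
                $this->pending[] = $entry;  // store it, decide later
                return false;
            }
            return $entry;                  // default case: one token back
        }

        // Something is already stored, so this token must queue behind it.
        $this->pending[] = $entry;
        if ($this->isWhitespace($token)) {
            return false;                   // keep waiting for a real token
        }
        return $this->flush();              // several tokens, oldest first
    }

    // Assumed helper for end of input: hand back whatever is still stored.
    public function flush()
    {
        $out = $this->pending;
        $this->pending = array();
        return $out;
    }
}

// Driver: feeds tokens in and collects everything that comes back, in order.
function runSketch($tokens)
{
    $parser = new BufferingParserSketch();
    $emitted = array();
    foreach ($tokens as $token) {
        $result = $parser->parseToken($token, 'delphi', null);
        if ($result === false) {
            continue;
        }
        // A list of tokens has an array at index 0; a single token has a string.
        $batch = is_array($result[0]) ? $result : array($result);
        foreach ($batch as $entry) {
            $emitted[] = $entry;
        }
    }
    foreach ($parser->flush() as $entry) {
        $emitted[] = $entry;
    }
    return $emitted;
}

$out = runSketch(array('procedure', ' ', 'ABC', ';', ' ', 'register', ';'));
```

Once anything is stored, every later token is queued behind it rather than returned directly. Otherwise a whitespace token could overtake the stored keyword and break the "in order" rule.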


---

If you make a stack implementation that works, I can move it into the parent class. Hopefully those notes above should help you in finding out why some tokens are being destroyed.
 
(0000102)
nigel
12-05-05 23:11

Actually, it appears that a bug in my first attempt at splitting by spaces was causing the problems I was seeing, so scrub that whole thing ;).

Might be useful as a reference though, I should put it on the wiki.
 
(0000103)
BenBE
12-06-05 01:04

I can't reproduce the gobble bug with this source using my current Delphi code parser (DCP). It seems to have been caused by another issue that I have already silently fixed.

I have to do some adapting anyway since whitespace now gets returned correctly and I no longer have to check it on my own.

I do understand the return values of the code parser; only the caching still causes some trouble. That's why some annoying little caching bugs appear now and then.
 
(0000104)
BenBE
12-06-05 04:46

OK, here is an example of source that is handled completely wrong:

TestAsmNOP
    MOV EAX, EAX
JMP @@FinishEnd

How does the starter\ender detection by regexp work?
#[^A-Za-z0-9]asm[^A-Za-z0-9]#

I'd need a way to say "starter and ender are single words surrounded by whitespace". The current method fails on any starter or ender contained in another word. Getting this to work matters right now because the block detection I integrated in the current release relies internally on getting correct tokens.
 
(0000105)
BenBE
12-06-05 04:57

In connection with the whitespace detection I implemented a stack for the DCP to ease staying in order ;-) Maybe you can look over this and do the missing docs stuff. Maybe you even could use it as a basis for your stack implementation. My current implementation still requires direct access to the stack elements from outside; please keep this in mind when changing visibility.
 
(0000107)
nigel
12-06-05 09:52

Well, you want a non-word character OR the start of the string, followed by the keyword, followed by a non-word character OR the end of the string.

Looks a bit like:

/[^\w|^]asm[^\w|$]/

But that might say "not a word character AND not the start", I'm not too sure. You'll have to check that.
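The doubt raised above is justified: inside a character class, "|", "^" (when not first) and "$" are literal characters, so [^\w|^] means "one character that is not a word character, not | and not ^" and still requires a character to be present. A minimal sketch of the intended pattern uses alternation groups instead. The helper name wholeWordPattern and the case-insensitive flag are assumptions (Delphi keywords are case-insensitive), and \W counts "_" as a word character, unlike the [^A-Za-z0-9] class quoted earlier in the thread.

```php
<?php
// Hypothetical helper: build a pattern that matches $word only when it is
// bounded by a non-word character or by the start/end of the string.
function wholeWordPattern($word)
{
    return '/(?:^|\W)' . preg_quote($word, '/') . '(?:\W|$)/i';
}

$asm = wholeWordPattern('asm');
$end = wholeWordPattern('end');

$insideWord  = preg_match($asm, 'TestAsmNOP');       // 0: "Asm" is inside a word
$insideWord2 = preg_match($end, 'JMP @@FinishEnd');  // 0: "End" is inside a word
$bareKeyword = preg_match($asm, 'asm');              // 1: start and end of string
$inContext   = preg_match($asm, "begin asm\n  MOV EAX, EAX"); // 1

// The character-class form needs a real character on both sides,
// so it misses the keyword at the very start or end of the string.
$classForm = preg_match('/[^\w|^]asm[^\w|$]/', 'asm'); // 0
```

Where the boundary character itself does not need to be consumed, the shorter /\basm\b/i matches the same whole-word cases.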

I will look at the stack stuff.
 
(0000111)
BenBE
12-06-05 12:40

The problem is fixed. Whitespace now gets returned as a separate token, as required.
 
(0000128)
nigel
12-11-05 22:31

Issue closed.
 

- Issue History
Date Modified Username Field Change
12-02-05 10:44 BenBE New Issue
12-02-05 10:44 BenBE Status new => assigned
12-02-05 10:44 BenBE Assigned To  => nigel
12-02-05 17:34 nigel Note Added: 0000081
12-03-05 02:27 BenBE Relationship added has duplicate 0000029
12-03-05 09:21 BenBE Note Added: 0000084
12-03-05 18:20 nigel Note Added: 0000088
12-05-05 22:37 nigel Note Added: 0000101
12-05-05 23:11 nigel Note Added: 0000102
12-06-05 01:04 BenBE Note Added: 0000103
12-06-05 04:46 BenBE Note Added: 0000104
12-06-05 04:57 BenBE Note Added: 0000105
12-06-05 09:52 nigel Note Added: 0000107
12-06-05 12:40 BenBE Status assigned => resolved
12-06-05 12:40 BenBE Fixed in Version  => 1.1.1alpha3
12-06-05 12:40 BenBE Resolution open => fixed
12-06-05 12:40 BenBE Note Added: 0000111
12-11-05 22:31 nigel Status resolved => closed
12-11-05 22:31 nigel Note Added: 0000128

  

