Building a best Cmd_TokenizeString/COM_Parse

Discuss programming topics for the various GPL'd game engine sources.
OmegaPhlare
Posts: 4
Joined: Mon May 07, 2012 6:58 am

Building a best Cmd_TokenizeString/COM_Parse

Post by OmegaPhlare »

Please excuse me for being very long-winded and for telling you things you already know; please do take the time to read this, because I'd like to start a conversation about it.

I am currently building my own game and I've been studying the Quake source for ideas and inspiration. I'm currently dealing with the command buffer and developer console. My game is written in Microsoft's C#, so I don't have as much control over creating and destroying strings as I would in C/C++. I've reinterpreted the code in the best way I could see it working with immutable strings and object-oriented programming. I'm not having a problem with any of this; what I am wondering about is the best way to go about processing strings. I'm not trying to fix something that isn't broken: I have the opportunity and the desire to write something possibly better.

I've been making some observations with GLQuake, Half-Life, Half-Life2, and Rage. I could check on idTech 2, 3, and 4, but since Rage is the latest I think it's fine already. All these games have the "echo" command.

Code:

] echo this is a test
this is a test
At a glance, everything you wrote after echo is tokenized and output to the console separated by spaces. Confirming with the Quake source, your whole input line is given to the command buffer. The command buffer is a gigantic C string which gets executed front to back. It searches for newlines '\n' or semicolons ';' and takes the preceding text to be executed. When executing that text, one of the first things that happens is the text is sent to be tokenized in Cmd_TokenizeString.

Including the command itself, every token is separated by a single space, and quoted strings become tokens that can contain whitespace. The rules are actually different between games, but that doesn't really matter when everyone follows convention. The tokenizing process was rewritten for the Source engine, and it was also rewritten at some point after idTech1, probably not by the same person, so there had to be at least two people who felt it would be nice to change it.

The Quake tokenizer is so greedy with words that it will consume quotation marks, ignoring any quoting rules.

Code:

] echo foo bar
foo bar
] echo foo "bar"
foo bar
] echo foo"bar"
foo"bar"
] echo foo"bar
foo"bar
But both RAGE and Half-Life2 will see a quotation and immediately begin a new token or end it!

Code:

] echo foo"bar"
foo bar
] echo"foo"
foo
] echo"foo"bar
foo bar
] "echo"foo"bar"
foo bar
Quake makes a final token out of a quotation left open.

Code:

] echo testing "testing
testing testing
But both RAGE and Half-Life2 will discard the unfinished quotation.

Code:

] echo testing "testing
testing
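For what it's worth, the behaviour shown in these examples can be reproduced with a fairly small loop. The following is only a sketch in C of the rules as observed above (a quote mark always delimits a token, and an unterminated quoted token is discarded); it is not the actual Source or RAGE code, and the names and the 256-char token limit are just for illustration:

```c
#include <ctype.h>
#include <string.h>

/* Sketch of the observed Source/RAGE-style tokenizing: a '"' always
   terminates the current token and starts a quoted one, and a quoted
   token whose closing '"' never arrives is thrown away.
   Assumes tokens shorter than 256 chars. Returns the token count. */
int tokenize(const char *text, char argv[][256], int max_args)
{
    int argc = 0;
    const char *p = text;
    while (*p && argc < max_args) {
        while (*p && isspace((unsigned char)*p))
            p++;                            /* skip whitespace */
        if (!*p)
            break;
        if (*p == '"') {                    /* quoted token */
            const char *start = ++p;
            while (*p && *p != '"')
                p++;
            if (!*p)
                break;                      /* open quote: discard token */
            size_t len = (size_t)(p - start);
            memcpy(argv[argc], start, len);
            argv[argc++][len] = '\0';
            p++;                            /* step past the closing '"' */
        } else {                            /* bare word: ends at space or '"' */
            const char *start = p;
            while (*p && !isspace((unsigned char)*p) && *p != '"')
                p++;
            size_t len = (size_t)(p - start);
            memcpy(argv[argc], start, len);
            argv[argc++][len] = '\0';
        }
    }
    return argc;
}
```

With this, `echo"foo"bar` splits into three tokens and `echo testing "testing` drops the dangling quoted token, matching the console transcripts above.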
And then there's finally the issue that Quake is fooled into mishandling quotations even though it already greedily consumes them within tokens. This happens because the command buffer's semicolon splitting is done before an individual command is tokenized.

Code:

] echo start ;end
start
Unknown command "end"
] echo start"" ;end
start""
Unknown command "end"
] echo start" ;end
start" ;end
But because RAGE and Half-Life2 discard open quotations, this just isn't possible with them to begin with.
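For reference, Quake's split-before-tokenize behaviour comes down to a loop roughly like the one below. This is a C sketch of the logic described above (break on '\n' anywhere, but on ';' only while outside quotes, which is exactly why the open quote in `echo start" ;end` swallows the semicolon); it is paraphrased from memory, not the literal Cbuf_Execute code:

```c
#include <string.h>

/* Extract one command from a buffer, Quake-command-buffer style:
   break on '\n' anywhere, but on ';' only when an even number of '"'
   characters has been seen so far (i.e. not inside a quoted string).
   Returns the command's length; *rest points past the break character. */
size_t next_command(const char *text, const char **rest)
{
    int quotes = 0;
    size_t i;
    for (i = 0; text[i]; i++) {
        if (text[i] == '"')
            quotes++;
        if (text[i] == ';' && !(quotes & 1))
            break;              /* don't split inside quotes */
        if (text[i] == '\n')
            break;
    }
    /* skip the delimiter itself, unless we stopped at end of buffer */
    *rest = text[i] ? text + i + 1 : text + i;
    return i;
}
```

So `echo start ;end` splits into two commands, while `echo start" ;end` stays one because the unclosed quote keeps the quote count odd.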

So I'm wondering now, is there a best way to do it, or a proper way, a standard? How do operating systems do it? I know that the entry point of a C program conventionally has an argument count and argument vector (int argc and char *argv[]). I don't even know who is in charge of creating the argument vector, or where the code for doing that is ever defined. If there is some kind of manifesto on how the argument vector is handled, I'd like to read it.

I tried my hand at creating a tokenizer function that works exactly like Quake, and it does work, except that it doesn't create single tokens out of these characters: "{ } ) ( ' :"

Code:

int i = 0;              // start index of the current token
int j = 0;              // scan position
_argc = 0;
while (j < text.Length)
{
	if (!Char.IsWhiteSpace(text[j]))
	{
		if (_argc == _argv.Length)
			return;     // out of argument slots
		if (text[j] == '"')
		{
			// Quoted token: everything up to the closing quote, or to
			// the end of input if the quote is left open (matching
			// Quake's habit of keeping the final unterminated token).
			i = ++j;
			while (j < text.Length && text[j] != '"') j++;
			_argv[_argc++] = text.Substring(i, j - i);
		}
		else
		{
			// A "//" comment ends the command.
			if (text[j] == '/' && j + 1 < text.Length && text[j + 1] == '/')
				return;
			// Bare word: greedy up to the next whitespace, consuming
			// any quote marks it runs into, just like Quake.
			i = j++;
			while (j < text.Length && !Char.IsWhiteSpace(text[j])) j++;
			_argv[_argc++] = text.Substring(i, j - i);
		}
	}
	j++;
}
Anybody have any improvements or suggestions on how I could even begin to make my tokenizer act more like the one in RAGE? Is there any reason why the characters "{ } ) ( ' :" are parsed into their own tokens? On 2003-12-07, LordHavoc released an update to DarkPlaces, where one of his changes was:
Fixed a bug with console parsing that existed in almost all versions of quake except quakeworld by switching to the qwcl COM_Parse for console parsing (in english: fixed connect commands involving a port, like 127.0.0.1:26000), thanks very much to Fuh for mentioning this bug.
I think the reason this bug existed is because COM_Parse tokenized the colon away from the address, but I don't know for sure without looking at the changes. Half-Life's solution to this problem was not to read tokens, but instead to treat everything written after "connect" as being the argument to connect.
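That Half-Life approach is easy to mimic: skip the command name and hand the rest of the line to the command untouched, which is the same idea as Quake's Cmd_Args(). A hypothetical sketch in C (the function name is mine, not from either engine):

```c
#include <ctype.h>
#include <string.h>

/* Skip the first token of a command line and return a pointer to the
   untouched remainder, so a command like "connect 127.0.0.1:26000"
   can parse its own argument without any tokenizer interference. */
const char *cmd_args(const char *line)
{
    while (*line && isspace((unsigned char)*line))
        line++;                 /* leading whitespace */
    while (*line && !isspace((unsigned char)*line))
        line++;                 /* the command name itself */
    while (*line && isspace((unsigned char)*line))
        line++;                 /* gap before the arguments */
    return line;
}
```

The colon in the address never gets a chance to become its own token because the argument string is never tokenized at all.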

Answer what you like: I am here with an open mind to learn.
Spike
Posts: 2914
Joined: Fri Nov 05, 2004 3:12 am
Location: UK
Contact:

Re: Building a best Cmd_TokenizeString/COM_Parse

Post by Spike »

com_parse is a generic function that can be used for all sorts of things, including console, saved games, and entity lumps from bsps.
both saved games and entity lumps make extensive use of { and } chars, for instance. and something like '{classname worldspawn}' mis-parsing would be annoying when you're just trying to debug something quickly.

No, you don't need those chars for the console. But for generic text parsing (like doom3's various text-based data files), handling '(2,3)' as 5 separate tokens means that you can actually parse such text in a sane way without caring about specific chars or whitespace.

If you want to be fancy, you can have some 'if (strchr(punctuation, c))' line to check if c is one of your punctuation chars, and then just pass in a string as an argument to the function, allowing arbitrary-but-single-char punctuation to be parsed as separate tokens.
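Here's a minimal sketch in C of that idea, with the punctuation set passed in as a string; the function name and the exact punctuation set in the comment are just for illustration:

```c
#include <ctype.h>
#include <string.h>

/* COM_Parse-style token reader where any character found in `punct`
   (e.g. "{}()':,") becomes a single-character token, so "(2,3)" splits
   into "(", "2", ",", "3", ")".  Writes the token into `token` and
   returns a pointer just past it, or NULL at end of input. */
const char *parse_token(const char *p, const char *punct,
                        char *token, size_t size)
{
    size_t len = 0;
    while (*p && isspace((unsigned char)*p))
        p++;                            /* skip leading whitespace */
    if (!*p)
        return NULL;
    if (strchr(punct, *p)) {            /* single-char punctuation token */
        token[0] = *p++;
        token[1] = '\0';
        return p;
    }
    while (*p && !isspace((unsigned char)*p) && !strchr(punct, *p)) {
        if (len + 1 < size)             /* truncate oversized tokens */
            token[len++] = *p;
        p++;
    }
    token[len] = '\0';
    return p;
}
```

Because the punctuation set is an argument, the console can pass an empty string while a data-file parser passes the full set, from one function.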
mh
Posts: 2292
Joined: Sat Jan 12, 2008 1:38 am

Re: Building a best Cmd_TokenizeString/COM_Parse

Post by mh »

Way I see it is that there are two goals here.

Be robust.
Maintain compatibility.

If the second goal is not a concern of yours (i.e. you're not handling Quake data) then you can do what you like. Convert everything to XML and use an XML parser, even. :twisted:

Otherwise you're going to find that you may need to retain what looks like bugs and weirdness in the original. Quake has a lot of odd cases where you might think that you've fixed what looks like a bug, but a few days/weeks/months down the line somebody's mod that relied on that behaviour is going to blow up.

Erring on the safe side and grabbing QuakeWorld's version of COM_Parse seems the best option here - at least it's known to work with the majority of Quake data, and is known to the community so you can ask for help if you run into trouble.
We had the power, we had the space, we had a sense of time and place
We knew the words, we knew the score, we knew what we were fighting for
OmegaPhlare
Posts: 4
Joined: Mon May 07, 2012 6:58 am

Re: Building a best Cmd_TokenizeString/COM_Parse

Post by OmegaPhlare »

Oh, that's interesting. If you keep it in a way that creates single tokens out of punctuation, then you can make files that are more easily human-readable while avoiding having to write your own additional handling. Exactly like you said, Spike: if I had to parse a coordinate like (2,8), it would be a pain in the ass for a function using that data to have to strip the punctuation away and do its own two-number split, when it could have just been done already by creating those 5 tokens.

I had completely neglected to think about compatibility, and in my case I'm lucky that I don't actually need it. Those developers were able to change things without any repercussions because there was no existing user content whose backwards compatibility had to be met. I agree it just wouldn't be wise to modify this code if I ever begin working on the Quake source. I'm going to take your advice and look up QuakeWorld's COM_Parse, which sounds different.

Ultimately I think I will end up making mine more robust but incompatible. What I learned from this is that there is no good reason to create a "better" Cmd_TokenizeString or COM_Parse for Quake itself, because doing so would bring more harm than good. Good judgement; I'll try to follow that.
Spike
Posts: 2914
Joined: Fri Nov 05, 2004 3:12 am
Location: UK
Contact:

Re: Building a best Cmd_TokenizeString/COM_Parse

Post by Spike »

pretty much.
if you don't need compatibility then keep it simple. seriously.
if you need a com_parse, make one, but don't make it part of the command parser.
If you do pass scripts via commands, just keep the original 'Cmd_Args()' result and parse that directly. Trying to do both commands and scripts with the same function (without some 'script' argument) is pointless as you cannot really meet all expectations.

Actually, skipping // comments in your command parser can be quite handy for stuffcmd-based extensions and not spamming people with old versions. :)
stuffcmd is evil though; if you choose to support it, it should have a separate set of commands from the regular console. an example bug: 'record pak0.pak' - yes, now you need to reinstall quake. the worst part is that this example accepts multiple leading ../