Regular expressions cannot handle very complex strings that follow a grammar, such as programming-language source code, annotations describing compound data types for methods, mathematical expressions, calculations, formulas, and so on. The reason is that these string forms contain so many rules that we simply have to process them in smaller chunks.
When a computer processes PHP source code, for example, it first breaks it into many small parts that carry their own meaning. These parts are called "tokens", and they represent the smallest self-contained building blocks of the language.
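PHP even exposes this first step directly: the built-in `token_get_all()` function splits PHP source code into the lexer tokens described above. A minimal illustration:

```php
<?php
// PHP's own lexer, available via token_get_all(): it returns an array
// where each element is either [token id, matched text, line number]
// for a named token, or a plain string for a single-character token.
$tokens = token_get_all('<?php echo 1 + 2;');

foreach ($tokens as $token) {
    if (is_array($token)) {
        // Named token, e.g. T_OPEN_TAG, T_ECHO, T_LNUMBER, T_WHITESPACE
        echo token_name($token[0]) . ': "' . $token[1] . '"' . "\n";
    } else {
        // Single-character token such as "+" or ";"
        echo 'char: "' . $token . '"' . "\n";
    }
}
```

This is exactly the "smallest self-contained building blocks" idea in practice: the source is no longer a string of characters but a list of typed pieces.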
The processing of a string/language is divided into several phases: first the input is split into tokens (lexical analysis), then the token stream is assembled into a structure such as a syntax tree (parsing), and finally that structure is interpreted or evaluated.
Another big advantage of this approach is that we know each token's position in the string (the line, and the exact characters where the token starts and ends), so if an exception is thrown, we can point precisely at the location of the problem.
Imagine, for example, that you are implementing an algorithm for evaluating a mathematical expression. Mathematics has many rules, such as operator precedence, parentheses, function calls, and so on.
If we can split the input string into elementary tokens, we can work with it on a completely different level. For example, we can easily find individual parentheses, take the tokens from an opening parenthesis to its matching closing one, pass that subexpression to a recursive function for processing, and so on.
Tokenization allows us to solve even complex parsing problems very elegantly.
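As a sketch of that parenthesis idea (the function name here is illustrative, not part of the tokenizer shown below): once the expression is a flat list of tokens, finding the matching closing parenthesis is a simple depth counter, and the slice between the pair can be handed to a recursive evaluator.

```php
<?php
// Illustrative sketch: find the index of the ")" that matches the "("
// at $start in a flat token list, so the subexpression between them
// can be processed recursively. The function name is hypothetical.
function findMatchingParenthesis(array $tokens, int $start): int
{
    $depth = 0;
    for ($i = $start; $i < count($tokens); $i++) {
        if ($tokens[$i] === '(') {
            $depth++;
        } elseif ($tokens[$i] === ')') {
            $depth--;
            if ($depth === 0) {
                return $i;
            }
        }
    }
    throw new \LogicException('Unbalanced parentheses.');
}

// "2 * (3 + (4 - 1))" split into elementary tokens:
$tokens = ['2', '*', '(', '3', '+', '(', '4', '-', '1', ')', ')'];
$end = findMatchingParenthesis($tokens, 2); // index of the matching ")"
$inner = array_slice($tokens, 3, $end - 3); // the subexpression between the pair
```

Here `$inner` holds the tokens `3 + (4 - 1)`, ready to be passed back into the same evaluation routine.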
We don't need much knowledge to write our own tokenizer. Basically, we just need to understand the principles of regular expressions and write a small parsing object.
For the purposes of this article, I have prepared a basic version of a tokenizer based on the Latte (Nette) tokenizer. The author of the original implementation is David Grudl, whom I would like to thank for such a simple function that solves the whole problem for you.
final class Token
{
    public string $value;

    public int $offset;

    public string $type;
}


final class Tokenizer
{
    public const TokenTypes = [
        'array' => 'array',
        '<' => '\<',
        '>' => '\>',
        '{' => '\{',
        '}' => '\}',
        'or' => '\|',
        'list' => '\[\]',
        'type' => '[a-zA-Z]+',
        'space' => '\s+',
        'comma' => ',',
        'other' => '.+?',
    ];

    /**
     * @return array<int, Token>
     */
    public static function tokenize(string $haystack): array
    {
        // Build one big alternation: each token type becomes one capturing
        // group. The "A" modifier anchors every match at the current offset.
        $re = '~(' . implode(')|(', self::TokenTypes) . ')~A';
        $types = array_keys(self::TokenTypes);
        preg_match_all($re, $haystack, $tokenMatch, PREG_SET_ORDER);

        $len = 0;
        $count = count($types);
        $tokens = [];
        foreach ($tokenMatch as $match) {
            // The index of the first non-empty group tells us the token type.
            $type = null;
            for ($i = 1; $i <= $count; $i++) {
                if (isset($match[$i]) === false) {
                    break;
                }
                if ($match[$i] !== '') {
                    $type = $types[$i - 1];
                    break;
                }
            }

            $token = new Token;
            $token->value = $match[0];
            $token->offset = $len;
            $token->type = (string) $type;
            $tokens[] = $token;
            $len += strlen($match[0]);
        }

        // If the matches do not cover the whole input, report the exact
        // line and column where tokenization stopped.
        if ($len !== strlen($haystack)) {
            $text = substr($haystack, 0, $len);
            $line = substr_count($text, "\n") + 1;
            $col = $len - strrpos("\n" . $text, "\n") + 1;
            $token = str_replace("\n", '\n', substr($haystack, $len, 10));
            throw new \LogicException(sprintf('Unexpected "%s" on line %s, column %s.', $token, $line, $col));
        }

        return $tokens;
    }
}
This tokenizer can parse even a complex string such as the following (the format is deliberately interspersed with spaces to show that the tokenizer handles a wide range of cases):
array<int, array<bool, array<string, float> > >
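To see the core trick in isolation, here is a condensed, self-contained demo of the same technique (a reduced token table, not the full class above): each token type is one capturing group, and the index of the first non-empty group identifies the type of the match.

```php
<?php
// Condensed demo of the tokenizer's core trick: one capturing group
// per token type; the first non-empty group identifies the type.
$types = [
    'array' => 'array',
    '<' => '\<',
    '>' => '\>',
    'type' => '[a-zA-Z]+',
    'space' => '\s+',
    'comma' => ',',
];
$re = '~(' . implode(')|(', $types) . ')~A'; // "A" anchors each match

preg_match_all($re, 'array<int, float>', $matches, PREG_SET_ORDER);

$names = array_keys($types);
$result = [];
foreach ($matches as $match) {
    for ($i = 1; $i < count($match); $i++) {
        if ($match[$i] !== '') {
            $result[] = $names[$i - 1] . ':' . $match[0];
            break;
        }
    }
}
// $result is:
// ['array:array', '<:<', 'type:int', 'comma:,', 'space: ', 'type:float', '>:>']
```

Note that the `array` alternative must come before the generic `type` pattern, otherwise `[a-zA-Z]+` would swallow the keyword first; the order of the token table is part of the grammar.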
Articles written by Jan Barášek © 2009-2024