3 changes: 2 additions & 1 deletion README.md
@@ -200,9 +200,10 @@ This section discusses available shortcode parsers. Regardless of the parser tha
 - mismatching closing shortcode (`[code]content[/codex]`) will be ignored, opening tag will be interpreted as self-closing shortcode, eg. `[code /]`,
 - overlapping shortcodes (`[code]content[inner][/code]content[/inner]`) will be interpreted as self-closing, eg. `[code]content[inner /][/code]`, second closing tag will be ignored,

-There are three included parsers in this library:
+There are four included parsers in this library:

 - `RegularParser` is the most powerful and correct parser available in this library. It contains the actual parser designed to handle all the issues with shortcodes like proper nesting or detecting invalid shortcode syntax. It is slightly slower than regex-based parser described below,
+- `TarsParser` produces exactly the same result as `RegularParser`, including proper nesting and invalid syntax detection, but it lexes every tag in a single regular expression pass and resolves nesting with a flat stack instead of a recursive token parser. This makes it several times faster than `RegularParser` and much lighter on memory, since it never builds a token array for the whole input. It is a good choice when you want `RegularParser`'s correctness without its cost,
 - `RegexParser` uses a handcrafted regular expression dedicated to handle shortcode syntax as much as regex engine allows. It is fastest among the parsers included in this library, but it can't handle nesting properly, which means that nested shortcodes with the same name are also considered overlapping - (assume that shortcode `[c]` returns its content) string `[c]x[c]y[/c]z[/c]` will be interpreted as `xyz[/c]` (first closing tag was matched to first opening tag). This can be solved by aliasing handler name, because for example `[c]x[d]y[/d]z[/c]` will be processed correctly,
 - `WordpressParser` contains code copied from the latest currently available WordPress (4.3.1). It is also a regex-based parser, but the included regular expression is quite weak, it for example won't support BBCode syntax (`[name="param" /]`). This parser by default supports the shortcode name rule, but can break it when created with one of the named constructors (`createFromHandlers()` or `createFromNames()`) that change its behavior to catch only configured names. All of it is intentional to keep the compatibility with what WordPress is capable of if you need that compatibility.

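The "single regex pass plus flat stack" idea described for `TarsParser` in the README entry can be sketched in isolation. This is a minimal illustration under stated assumptions, not the library's code: the function name `sketchResolveNesting()` is hypothetical, and the tag pattern is deliberately reduced to lowercase names without parameters. Unlike the real parsers, it also reports nested nodes instead of absorbing them into their parent's content.

```php
<?php
// Hypothetical helper (not part of the library): pairs `[name]` with `[/name]`
// using one regex pass and a flat stack of currently open tags.
function sketchResolveNesting($text)
{
    // one pass finds every opening and closing tag together with its byte offset
    preg_match_all('~\[(/?)([a-z]+)\]~', $text, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $nodes = array(); // one entry per opening tag
    $stack = array(); // indexes into $nodes for tags that are still open

    foreach($tags as $tag) {
        $start = $tag[0][1];
        $end = $start + strlen($tag[0][0]);
        $name = $tag[2][0];

        if('/' === $tag[1][0]) {
            // closing tag: walk the stack from the top to find the innermost
            // open node with the same name; unmatched closing tags are ignored
            for($s = count($stack) - 1; $s >= 0; $s--) {
                if($nodes[$stack[$s]]['name'] === $name) {
                    $nodes[$stack[$s]]['closeStart'] = $start;
                    // everything opened after the matched node stays unclosed
                    $stack = array_slice($stack, 0, $s);
                    break;
                }
            }
            continue;
        }

        $nodes[] = array('name' => $name, 'openEnd' => $end, 'closeStart' => null);
        $stack[] = count($nodes) - 1;
    }

    $result = array();
    foreach($nodes as $node) {
        // content is null for tags that were never closed
        $content = null === $node['closeStart']
            ? null
            : substr($text, $node['openEnd'], $node['closeStart'] - $node['openEnd']);
        $result[] = array($node['name'], $content);
    }

    return $result;
}

// same-name nesting from the README example is paired correctly
$nested = sketchResolveNesting('[c]x[c]y[/c]z[/c]');
// a mismatching closing tag leaves the opening tag without content
$mismatch = sketchResolveNesting('[code]content[/codex]');
```

Here `$nested` holds the outer `c` with content `x[c]y[/c]z` and the inner `c` with content `y`, while `$mismatch` holds a single `code` entry whose content is `null`.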
264 changes: 264 additions & 0 deletions src/Parser/TarsParser.php
@@ -0,0 +1,264 @@
+<?php
+namespace Thunder\Shortcode\Parser;
+
+use Thunder\Shortcode\Shortcode\ParsedShortcode;
+use Thunder\Shortcode\Shortcode\Shortcode;
+use Thunder\Shortcode\Syntax\CommonSyntax;
+use Thunder\Shortcode\Syntax\SyntaxInterface;
+
+/**
+ * TarsParser - a fast, robust shortcode parser.
+ *
+ * Strategy: a single PCRE pass lexes every individual shortcode tag (both
+ * opening and closing) in C, then a linear stack-based pass resolves nesting
+ * in PHP. This combines RegexParser's raw scanning speed with RegularParser's
+ * robustness: the lexer regex understands quoted values and escapes, so an
+ * unterminated quote like `[a k="v]` correctly fails to lex as a tag instead
+ * of inventing a bogus parameter. Nesting, mismatched closing tags and
+ * open-only shortcodes are then resolved exactly like the default parser.
+ *
+ * @author Andy Miller
+ *
+ * @psalm-type TarsNode = array{
+ *   0: string, 1: string, 2: string|null, 3: int, 4: int, 5: int,
+ *   6: int|null, 7: bool, 8: int|null, 9: int|null, 10: bool
+ * }
+ */
+final class TarsParser implements ParserInterface
+{
+    /** @var non-empty-string */
+    private $tagRegex;
+    /** @var non-empty-string */
+    private $paramRegex;
+    /** @var non-empty-string */
+    private $delimiter;
+    /** @var positive-int */
+    private $delimiterLength;
+
+    /** @param SyntaxInterface|null $syntax */
+    public function __construct($syntax = null)
+    {
+        if(null !== $syntax && false === $syntax instanceof SyntaxInterface) {
+            throw new \LogicException('Parameter $syntax must be an instance of SyntaxInterface.');
+        }
+
+        $syntax = $syntax ?: new CommonSyntax();
+        $this->delimiter = $syntax->getParameterValueDelimiter();
+        $this->delimiterLength = strlen($this->delimiter);
+
+        $o = preg_quote($syntax->getOpeningTag(), '~');
+        $c = preg_quote($syntax->getClosingTag(), '~');
+        $m = preg_quote($syntax->getClosingTagMarker(), '~');
+        $e = preg_quote($syntax->getParameterValueSeparator(), '~');
+        $d = preg_quote($this->delimiter, '~');
+
+        $ws = '\s*';
+        $special = $o.'|'.$c.'|'.$m.'|'.$e.'|'.$d;
+        $notSpecial = '(?!'.$special.')';
+        // a single "string token": one escape sequence, or one maximal run of
+        // non-special, non-whitespace characters (possessive so it never gives back)
+        $stringTok = '(?:\\\\.|(?:'.$notSpecial.'[^\s\\\\])++)';
+        // a value globs consecutive string tokens; atomic so the lexer commits like
+        // RegularParser instead of backtracking into a different tokenization
+        $stringRun = '(?>'.$stringTok.'+)';
+        // a delimited value; the body is possessive so an escape sequence is never
+        // given back to let the value re-close at an earlier (escaped) delimiter
+        $quoted = $d.'(?:\\\\.|(?!'.$d.').)*+'.$d;
+        $value = '(?>'.$quoted.'|'.$stringRun.')';
+        // shortcode name; must end on a token boundary so `[foo.bar]` is rejected wholesale
+        $name = '[a-zA-Z0-9_*-]+';
+        $boundary = '(?=\s|'.$special.'|$)';
+        // a parameter name is a single string token, not a glued run
+        $params = '(?<params>(?:'.$ws.$stringTok.'(?:'.$ws.$e.$ws.$value.')?)*+)';
+        $bbCode = '(?:'.$e.$ws.'(?<bbCode>'.$value.')'.$ws.')?+';
+
+        $close = $o.$ws.$m.$ws.'(?<cname>'.$name.')'.$ws.$c;
+        $open = $o.$ws.'(?<name>'.$name.')'.$boundary.$ws.$bbCode.$params.$ws.'(?<self>'.$m.')?'.$ws.$c;
+
+        $this->tagRegex = '~(?:'.$close.'|'.$open.')~us';
+        $this->paramRegex = '~'.$ws.'(?<pn>'.$stringTok.')(?:'.$ws.$e.$ws.'(?<pv>'.$value.'))?~us';
+    }

+    /**
+     * @param string $text
+     *
+     * @return ParsedShortcode[]
+     */
+    public function parse($text)
+    {
+        $count = preg_match_all($this->tagRegex, $text, $matches, PREG_OFFSET_CAPTURE);
+        if(false === $count || preg_last_error() !== PREG_NO_ERROR) {
+            throw new \RuntimeException(sprintf('PCRE failure `%s`.', preg_last_error()));
+        }
+        if(0 === $count) {
+            return array();
+        }
+
+        // pure-ASCII text lets us treat byte offsets as character offsets directly
+        $ascii = !preg_match('~[\x80-\xff]~', $text);
+        $lastByte = 0;
+        $lastChar = 0;
+
+        /** @psalm-var list<TarsNode> $nodes */
+        $nodes = array();
+        /** @psalm-var list<int> $stack */
+        $stack = array();
+        $depth = 0;
+        $cnames = $matches['cname'];
+        $names = $matches['name'];
+        $selfs = $matches['self'];
+        $bbCodes = $matches['bbCode'];
+        $params = $matches['params'];
+
+        foreach($matches[0] as $i => $whole) {
+            $byteStart = $whole[1];
+            $byteEnd = $byteStart + strlen($whole[0]);
+
+            if(isset($cnames[$i][1]) && $cnames[$i][1] !== -1) {
+                // closing tag: match the innermost open node of the same name.
+                // RegularParser rejects a closing name that is falsy in PHP (`'0'`)
+                // via `if(!$closingName = ...)`, so we faithfully ignore it too.
+                $cname = $cnames[$i][0];
+                if('0' === $cname) {
+                    continue;
+                }
+                for($s = $depth - 1; $s >= 0; $s--) {
+                    $node = $stack[$s];
+                    if($nodes[$node][0] === $cname) {
+                        $nodes[$node][7] = true; // closed
+                        $nodes[$node][8] = $byteStart; // closeStart
+                        $nodes[$node][9] = $byteEnd; // closeEnd
+                        $stack = array_slice($stack, 0, $s);
+                        $depth = $s;
+                        break;
+                    }
+                }
+                continue;
+            }
+
+            // opening tag - char offset (byte offset is fine for pure-ASCII text)
+            if($ascii) {
+                $offset = $byteStart;
+            } else {
+                if($byteStart > $lastByte) {
+                    /** @psalm-suppress PossiblyFalseArgument */
+                    $lastChar += mb_strlen(substr($text, $lastByte, $byteStart - $lastByte), 'utf-8');
+                    $lastByte = $byteStart;
+                }
+                $offset = $lastChar;
+            }
+
+            $self = isset($selfs[$i][1]) && $selfs[$i][1] !== -1;
+
+            // node tuple: [0]name [1]paramsRaw [2]bbCodeRaw [3]offset [4]start
+            // [5]openEnd [6]parent [7]closed [8]closeStart [9]closeEnd [10]selfClosing
+            // parameter/bbCode parsing is deferred to build() so absorbed nodes never pay for it
+            $nodes[] = array(
+                $names[$i][0],
+                isset($params[$i][1]) && $params[$i][1] !== -1 ? $params[$i][0] : '',
+                isset($bbCodes[$i][1]) && $bbCodes[$i][1] !== -1 ? $bbCodes[$i][0] : null,
+                $offset,
+                $byteStart,
+                $byteEnd,
+                $depth ? $stack[$depth - 1] : null,
+                $self,
+                $self ? $byteEnd : null,
+                $self ? $byteEnd : null,
+                $self,
+            );
+
+            if(false === $self) {
+                $stack[$depth++] = count($nodes) - 1;
+            }
+        }
+
+        return $this->build($nodes, $text);
+    }

+    /**
+     * @psalm-param array<int, TarsNode> $nodes
+     * @param string $text
+     *
+     * @return ParsedShortcode[]
+     */
+    private function build(array $nodes, $text)
+    {
+        $shortcodes = array();
+        // A node is absorbed (part of a closed ancestor's content) iff its parent is
+        // closed or itself absorbed. Parents always precede children, so a single
+        // forward pass resolves this in O(1) per node instead of walking ancestors.
+        /** @psalm-var array<int,bool> $absorbed */
+        $absorbed = array();
+        foreach($nodes as $id => $node) {
+            $parent = $node[6];
+            if(null !== $parent && ($nodes[$parent][7] || $absorbed[$parent])) {
+                $absorbed[$id] = true;
+                continue;
+            }
+            $absorbed[$id] = false;
+
+            if($node[7]) {
+                // a closed node always has integer close offsets (set on close or self-close)
+                /** @psalm-suppress PossiblyNullOperand */
+                $content = $node[10] ? null : substr($text, $node[5], $node[8] - $node[5]);
+                /** @psalm-suppress PossiblyNullOperand */
+                $string = substr($text, $node[4], $node[9] - $node[4]);
+            } else {
+                $content = null;
+                $string = substr($text, $node[4], $node[5] - $node[4]);
+            }
+
+            $parameters = '' === $node[1] ? array() : $this->parseParameters($node[1]);
+            $bbCode = null === $node[2] ? null : $this->extractValue($node[2]);
+
+            /** @psalm-suppress PossiblyFalseArgument */
+            $shortcode = new Shortcode($node[0], $parameters, $content, $bbCode);
+            /** @psalm-suppress PossiblyFalseArgument */
+            $shortcodes[] = new ParsedShortcode($shortcode, $string, $node[3]);
+        }
+
+        return $shortcodes;
+    }

+    /**
+     * @param string $text
+     *
+     * @psalm-return array<string,string|null>
+     */
+    private function parseParameters($text)
+    {
+        if('' === $text || false === preg_match_all($this->paramRegex, $text, $matches, PREG_SET_ORDER)) {
+            return array();
+        }
+
+        $parameters = array();
+        foreach($matches as $match) {
+            if(!isset($match['pn']) || '' === $match['pn']) {
+                continue;
+            }
+            $hasValue = isset($match['pv']) && '' !== $match['pv'];
+            $parameters[$match['pn']] = $hasValue ? $this->extractValue($match['pv']) : null;
+        }
+
+        return $parameters;
+    }
+
+    /**
+     * @param string $value
+     *
+     * @return string
+     * @psalm-suppress InvalidFalsableReturnType
+     */
+    private function extractValue($value)
+    {
+        $dl = $this->delimiterLength;
+        if(strlen($value) >= 2 * $dl
+            && strncmp($value, $this->delimiter, $dl) === 0
+            && substr($value, -$dl) === $this->delimiter) {
+            /** @psalm-suppress FalsableReturnStatement */
+            return substr($value, $dl, -$dl);
+        }
+
+        return $value;
+    }
+}
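The incremental byte-to-character offset conversion used in `parse()` for non-ASCII input can be shown on its own. The sketch below is an illustration only: the function name `sketchCharOffsets()` and the sample text are assumptions, and it requires the `mbstring` extension. It relies on the same idea as the loop above: offsets arrive in ascending order, so each call only measures the slice since the previous offset instead of re-counting from the start of the text.

```php
<?php
// Hypothetical helper: converts ascending byte offsets into UTF-8 character
// offsets, measuring only the bytes added since the previous offset.
function sketchCharOffsets($text, array $byteOffsets)
{
    $lastByte = 0; // byte position already accounted for
    $lastChar = 0; // character position corresponding to $lastByte
    $result = array();

    foreach($byteOffsets as $byteOffset) {
        if($byteOffset > $lastByte) {
            $lastChar += mb_strlen(substr($text, $lastByte, $byteOffset - $lastByte), 'UTF-8');
            $lastByte = $byteOffset;
        }
        $result[] = $lastChar;
    }

    return $result;
}

// multi-byte characters before each tag make byte and character offsets differ
$text = 'zażółć [a] gęślą [b]';
$byteOffsets = array(strpos($text, '[a]'), strpos($text, '[b]'));
$charOffsets = sketchCharOffsets($text, $byteOffsets);
```

With a UTF-8 encoded source file, `$byteOffsets` is `[11, 24]` while `$charOffsets` is `[7, 17]`, which matches what `mb_strpos()` reports for the same tags.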
41 changes: 23 additions & 18 deletions tests/ParserTest.php
@@ -4,6 +4,7 @@
 use PHPUnit\Framework\Attributes\DataProvider;
 use Thunder\Shortcode\HandlerContainer\HandlerContainer;
 use Thunder\Shortcode\Parser\RegularParser;
+use Thunder\Shortcode\Parser\TarsParser;
 use Thunder\Shortcode\Parser\ParserInterface;
 use Thunder\Shortcode\Parser\RegexParser;
 use Thunder\Shortcode\Parser\WordpressParser;
@@ -254,6 +255,7 @@ public static function provideShortcodes()
             $syntax = array_shift($test);

             $result[] = array_merge(array(new RegexParser($syntax)), $test);
+            $result[] = array_merge(array(new TarsParser($syntax)), $test);
             $result[] = array_merge(array(new RegularParser($syntax)), $test);
             if(!in_array($key, $wordpressSkip, true)) {
                 $result[] = array_merge(array(new WordpressParser()), $test);
@@ -265,17 +267,18 @@ public static function provideShortcodes()

     public function testIssue77()
     {
-        $parser = new RegularParser();
+        // TarsParser must reproduce RegularParser's backtracking behaviour exactly
+        foreach(array(new RegularParser(), new TarsParser()) as $parser) {
+            $this->assertShortcodes($parser->parse('[a][x][/x][x k="v][/x][y]x[/y]'), array(
+                new ParsedShortcode(new Shortcode('a', array(), null, null), '[a]', 0),
+                new ParsedShortcode(new Shortcode('x', array(), '', null), '[x][/x]', 3),
+                new ParsedShortcode(new Shortcode('y', array(), 'x', null), '[y]x[/y]', 22),
+            ));

-        $this->assertShortcodes($parser->parse('[a][x][/x][x k="v][/x][y]x[/y]'), array(
-            new ParsedShortcode(new Shortcode('a', array(), null, null), '[a]', 0),
-            new ParsedShortcode(new Shortcode('x', array(), '', null), '[x][/x]', 3),
-            new ParsedShortcode(new Shortcode('y', array(), 'x', null), '[y]x[/y]', 22),
-        ));
-
-        $this->assertShortcodes($parser->parse('[a k="v][x][/x]'), array(
-            new ParsedShortcode(new Shortcode('x', array(), '', null), '[x][/x]', 8),
-        ));
+            $this->assertShortcodes($parser->parse('[a k="v][x][/x]'), array(
+                new ParsedShortcode(new Shortcode('x', array(), '', null), '[x][/x]', 8),
+            ));
+        }
     }

     public function testIssue119()
@@ -287,15 +290,16 @@ public function testIssue119()
             '[a k="x\"y"]inner[/a]' => new ParsedShortcode(new Shortcode('a', array('k' => 'x\"y'), 'inner', null), '[a k="x\"y"]inner[/a]', 0),
             '[mention id=1 name="foo\"ff\""][/mention]' => new ParsedShortcode(new Shortcode('mention', array('id' => '1', 'name' => 'foo\"ff\"'), '', null), '[mention id=1 name="foo\"ff\""][/mention]', 0),
         );
-        $parser = new RegularParser();
-        foreach($cases as $input => $expected) {
-            $this->assertShortcodes($parser->parse($input), array($expected));
-        }
+        foreach(array(new RegularParser(), new TarsParser()) as $parser) {
+            foreach($cases as $input => $expected) {
+                $this->assertShortcodes($parser->parse($input), array($expected));
+            }

-        $this->assertShortcodes($parser->parse('[a k="x\"y"]inner[/a] [mention id=1 name="foo\"ff\""][/mention]'), array(
-            new ParsedShortcode(new Shortcode('a', array('k' => 'x\"y'), 'inner', null), '[a k="x\"y"]inner[/a]', 0),
-            new ParsedShortcode(new Shortcode('mention', array('id' => '1', 'name' => 'foo\"ff\"'), '', null), '[mention id=1 name="foo\"ff\""][/mention]', 22),
-        ));
+            $this->assertShortcodes($parser->parse('[a k="x\"y"]inner[/a] [mention id=1 name="foo\"ff\""][/mention]'), array(
+                new ParsedShortcode(new Shortcode('a', array('k' => 'x\"y'), 'inner', null), '[a k="x\"y"]inner[/a]', 0),
+                new ParsedShortcode(new Shortcode('mention', array('id' => '1', 'name' => 'foo\"ff\"'), '', null), '[mention id=1 name="foo\"ff\""][/mention]', 22),
+            ));
+        }
     }

     public function testWordPress()
@@ -338,6 +342,7 @@ public function testWordpressInvalidNamesException()
     public function testInstances()
     {
         static::assertInstanceOf('Thunder\Shortcode\Parser\WordPressParser', new WordpressParser());
+        static::assertInstanceOf('Thunder\Shortcode\Parser\TarsParser', new TarsParser());
         static::assertInstanceOf('Thunder\Shortcode\Parser\RegularParser', new RegularParser());
     }
 }
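The escaped-quote cases in `testIssue119()` depend on the quoted-value pattern being possessive, as the constructor comment in `TarsParser` notes. The sketch below isolates that effect with a simplified pattern. It is an assumption-laden illustration: the variable names and the fixed `"` delimiter are chosen for the example and are not the library's actual regex.

```php
<?php
// Simplified, hypothetical patterns: a quoted value whose body is either an
// escape sequence (backslash + any character) or any character except `"`.
$body = '(?:\\\\.|(?!").)';
$possessive = '~"'.$body.'*+"~s'; // body never gives characters back
$greedy = '~"'.$body.'*"~s';      // body may backtrack

// a terminated value with an escaped quote is matched as a whole
preg_match($possessive, 'k="x\"y" rest', $m);
$complete = $m[0];

// an unterminated value: the only remaining quote is the escaped one
$unterminated = 'k="x\"';
$possessiveMatches = preg_match($possessive, $unterminated);
$greedyMatches = preg_match($greedy, $unterminated, $g);
```

For the unterminated input the possessive pattern does not match at all, while the greedy one backtracks, treats the backslash as an ordinary character, and closes the value at the escaped quote (`$g[0]` is `"x\"`). That second behaviour is the kind of re-tokenization the possessive body is meant to rule out.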