238 changes: 226 additions & 12 deletions fyi/semgrep-grammars/src/semgrep-php/grammar.js
@@ -2,6 +2,21 @@
semgrep-php

Extends the standard php grammar with semgrep pattern constructs.

Notes on the lexer interaction with PHP variables:
- PHP variables natively start with `$` (e.g. `$foo`), so a metavariable
such as `$FOO` already parses as a `variable_name`. We do NOT need to
introduce a separate metavariable-as-variable token.
- For metavariables in identifier positions (class names, function names,
type names, etc.) we use `semgrep_metavar_ident`, a single token that
follows the uppercase metavariable convention (e.g. `$FOO`). Since `$FOO`
would otherwise lex as `$` + `name`, the token is given a higher lexical
precedence. In identifier positions a `variable_name` is not legal, so
there is no ambiguity.
- PHP's variable-variable syntax `$$F` lexes naturally as `$` + `$F`, where
`$F` is a `variable_name`. We rely on `semgrep_metavar_ident` only being
accepted in identifier positions, not variable positions, so `$$FOO` keeps
its variable-variable meaning.
*/
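
The token-level claims in the notes above can be checked outside tree-sitter. The sketch below is a hypothetical stand-alone script, not part of grammar.js. It reuses the two regular expressions this change gives to `semgrep_metavar_ident` and `semgrep_variadic_metavariable`, with `^`/`$` anchors added so that whole strings are tested; the helper name `classify` is an assumption made for illustration. It only exercises the patterns and says nothing about parser context, which is what keeps `$FOO` a `variable_name` in variable positions.

```javascript
// Hypothetical stand-alone check; not part of grammar.js.
// Anchored copies of the patterns used by the semgrep-php tokens.
const METAVAR_IDENT = /^\$[A-Z_][A-Z_0-9]*$/;           // semgrep_metavar_ident
const VARIADIC_METAVAR = /^\$\.\.\.[A-Z_][A-Z_0-9]*$/;  // semgrep_variadic_metavariable

// Classify a whole string by which pattern (if any) it matches.
function classify(text) {
  if (VARIADIC_METAVAR.test(text)) return 'variadic-metavariable';
  if (METAVAR_IDENT.test(text)) return 'metavariable-ident';
  return 'neither';
}

for (const sample of ['$FOO', '$foo', '$$FOO', '$...ARGS', '$_X1']) {
  console.log(sample, '->', classify(sample));
}
```

`$$FOO` matches neither pattern as a whole, which is consistent with the variable-variable reading (`$` followed by `$FOO`) described in the notes.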

const base_grammar = require('tree-sitter-php/grammar');
@@ -10,22 +25,221 @@ module.exports = grammar(base_grammar, {
name: 'php',

conflicts: ($, previous) => previous.concat([
// semgrep_ellipsis as expression vs as statement: a bare `...` can be
// either an `expression_statement` (when followed by `;`) or a
// standalone `semgrep_ellipsis` statement.
[$._expression, $.program],
[$._expression, $.compound_statement],
[$._expression, $.while_statement],
[$._expression, $.if_statement],
[$._expression, $.foreach_statement],
[$._expression, $.else_clause],
[$._expression, $.else_if_clause],
[$._expression, $.colon_block],
[$._expression, $.default_statement],
[$._expression, $.declare_statement],
[$._expression, $.match_block],
[$._expression, $.for_statement],
]),

/*
Support for semgrep ellipsis ('...') and metavariables ('$FOO'),
if they're not already part of the base grammar.
*/
rules: {
/*
semgrep_ellipsis: $ => '...',
/*
Semgrep tokens
*/

// A bare ellipsis. PHP already uses `...` for variadic parameters and
// unpacking; we keep those rules untouched and prefer the native
// interpretations by giving `semgrep_ellipsis` a negative precedence.
semgrep_ellipsis: $ => prec(-1, '...'),

// `<... expr ...>` - matches an expression "deeply" inside another.
semgrep_deep_ellipsis: $ => seq('<...', $._expression, '...>'),

// `$...ARGS` - matches a sequence of arguments / parameters.
semgrep_variadic_metavariable: $ => /\$\.\.\.[A-Z_][A-Z_0-9]*/,

// `$FOO` used in identifier positions (class / function / type / attribute
// name). This is given a higher token precedence than the default
// `$` + `name` decomposition so it lexes as a single token, but it is only
// accepted in places where an identifier (not a variable) is expected, so
// it does not collide with `variable_name`.
semgrep_metavar_ident: $ => token(prec(1, /\$[A-Z_][A-Z_0-9]*/)),

// Convenience: a `name` or a metavariable used as an identifier.
_semgrep_extended_name: $ => choice($.name, $.semgrep_metavar_ident),

/*
Wire ellipsis into expression and statement positions
*/

_expression: ($, previous) => {
return choice(
_expression: ($, previous) => choice(
previous,
$.semgrep_ellipsis,
$.semgrep_deep_ellipsis,
),

_statement: ($, previous) => choice(
previous,
$.semgrep_ellipsis,
),

// Allow `...` inside class / interface / trait bodies.
_member_declaration: ($, previous) => choice(
previous,
$.semgrep_ellipsis,
),

// Allow `...` inside enum bodies.
_enum_member_declaration: ($, previous) => choice(
previous,
$.semgrep_ellipsis,
),

// Allow `...` and `$...ARGS` as a parameter.
formal_parameters: $ => seq(
'(',
commaSep(choice(
$.simple_parameter,
$.variadic_parameter,
$.property_promotion_parameter,
$.semgrep_ellipsis,
...previous.members
);
}
*/
$.semgrep_variadic_metavariable,
)),
optional(','),
')'
),

// Allow `$...ARGS` as a function-call argument. `...` (semgrep_ellipsis)
// already matches inside expressions so it works as an argument too.
argument: ($, previous) => choice(
previous,
$.semgrep_variadic_metavariable,
),

// Allow `...` as a match arm. We use `prec.dynamic` to prefer treating a
// standalone `...` as a top-level match arm rather than as the start of a
// `match_condition_list`.
match_block: $ => prec.left(seq(
'{',
commaSep1(choice(
$.match_conditional_expression,
$.match_default_expression,
prec.dynamic(1, $.semgrep_ellipsis),
)),
optional(','),
'}'
)),

// Allow ellipsis inside attribute argument lists. `arguments` is the
// generic call-arguments rule, which is already covered above via
// `argument` and the expression-level ellipsis. Nothing extra needed.

/*
Wire metavariable-as-identifier into name positions
*/

class_declaration: $ => prec.right(seq(
optional(field('attributes', $.attribute_list)),
optional(field('modifier', choice($.final_modifier, $.abstract_modifier))),
keyword('class'),
field('name', $._semgrep_extended_name),
optional($.base_clause),
optional($.class_interface_clause),
field('body', $.declaration_list),
optional($._semicolon)
)),

interface_declaration: $ => seq(
keyword('interface'),
field('name', $._semgrep_extended_name),
optional($.base_clause),
field('body', $.declaration_list)
),

trait_declaration: $ => seq(
keyword('trait'),
field('name', $._semgrep_extended_name),
field('body', $.declaration_list)
),

enum_declaration: $ => prec.right(seq(
optional(field('attributes', $.attribute_list)),
keyword('enum'),
field('name', $._semgrep_extended_name),
optional(seq(':', $._type)),
optional($.class_interface_clause),
field('body', $.enum_declaration_list)
)),

_function_definition_header: $ => seq(
keyword('function'),
optional('&'),
field('name', choice(
$.name,
alias($._reserved_identifier, $.name),
$.semgrep_metavar_ident,
)),
field('parameters', $.formal_parameters),
optional($._return_type)
),

// Type names (used in parameter types, return types, etc.).
named_type: $ => choice(
$.name,
$.qualified_name,
$.semgrep_metavar_ident,
),

// `extends` / `implements`: allow metavar in the base list.
base_clause: $ => seq(
keyword('extends'),
commaSep1(choice(
$.name,
alias($._reserved_identifier, $.name),
$.qualified_name,
$.semgrep_metavar_ident,
))
),

class_interface_clause: $ => seq(
keyword('implements'),
commaSep1(choice(
$.name,
alias($._reserved_identifier, $.name),
$.qualified_name,
$.semgrep_metavar_ident,
))
),

// Attribute names.
attribute: $ => seq(
choice(
$.name,
alias($._reserved_identifier, $.name),
$.qualified_name,
$.semgrep_metavar_ident,
),
optional(field('parameters', $.arguments))
),
}
});

function commaSep1(rule) {
return seq(rule, repeat(seq(',', rule)));
}

function commaSep(rule) {
return optional(commaSep1(rule));
}
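
To see what shape `commaSep1` and `commaSep` produce, the helpers can be run against stand-in combinators. In the sketch below, `seq`, `repeat`, and `optional` are hypothetical string-rendering stubs (the real ones are globals provided by the tree-sitter DSL), and `node` and `render` are illustration-only helpers; the two `commaSep` functions keep the same bodies as above.

```javascript
// Hypothetical stubs that render a rule as EBNF-like text instead of building a grammar.
const node = (text) => ({ text });
const render = (x) => (typeof x === 'string' ? `'${x}'` : x.text); // strings are literals
const seq = (...members) => node(`(${members.map(render).join(' ')})`);
const repeat = (rule) => node(`${render(rule)}*`);
const optional = (rule) => node(`${render(rule)}?`);

// Same bodies as the helpers in grammar.js.
function commaSep1(rule) {
  return seq(rule, repeat(seq(',', rule)));
}

function commaSep(rule) {
  return optional(commaSep1(rule));
}

const param = node('param'); // stands in for a rule reference such as $.simple_parameter
console.log(commaSep1(param).text); // one or more, comma-separated
console.log(commaSep(param).text);  // zero or more, comma-separated
```

Neither helper accepts a trailing comma on its own, which is why `formal_parameters` and `match_block` follow the list with `optional(',')`.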

// Mirrors the helper in tree-sitter-php's grammar.js so we can re-declare
// rules that use case-insensitive keywords.
function keyword(word, aliasAsWord = true) {
let pattern = '';
for (const letter of word) {
pattern += `[${letter}${letter.toLocaleUpperCase()}]`;
}
let result = new RegExp(pattern);
if (aliasAsWord) result = alias(result, word);
return result;
}
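
As a quick illustration of what `keyword` builds, the sketch below reproduces only the pattern construction and leaves out the `alias(...)` call, which exists only inside the tree-sitter DSL. The name `keywordRegExp` is hypothetical, and the `^`/`$` anchors are added here so that whole words can be tested.

```javascript
// Hypothetical stand-alone variant of keyword(); the tree-sitter alias() step is omitted.
function keywordRegExp(word) {
  let pattern = '';
  for (const letter of word) {
    // One character class per letter: the letter as written plus its upper-case form.
    pattern += `[${letter}${letter.toLocaleUpperCase()}]`;
  }
  return new RegExp(`^${pattern}$`);
}

const classKeyword = keywordRegExp('class');
console.log(classKeyword.source);
console.log(classKeyword.test('CLASS'), classKeyword.test('Class'), classKeyword.test('clas'));
```

The helper assumes lower-case input: each character class pairs the given letter with its upper-case form, so passing `'Class'` would yield `[CC]` for the first letter and no longer accept `class`.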
45 changes: 32 additions & 13 deletions fyi/versions
@@ -2,29 +2,48 @@ File: semgrep-grammars/src/tree-sitter-php/LICENSE
Git repo name: tree-sitter-php
Latest commit in repo: 9f2586b1be335449181547ca51badb44c97c38b4
Last change in file:
commit 9f2586b1be335449181547ca51badb44c97c38b4
Author: Sjoerd Langkemper <sjoerd-github@linuxonly.nl>
Date: Sun Aug 1 13:20:58 2021 +0200
commit 026a06f28d1bc3bcab7712ca62b317104f7f721d
Author: Max Brunsfeld <maxbrunsfeld@gmail.com>
Date: Tue Jul 21 09:49:24 2020 -0700

Add grammar for fixed precedence
Add license
---
File: semgrep-grammars/src/tree-sitter-php/grammar.js
Git repo name: tree-sitter-php
Latest commit in repo: 9f2586b1be335449181547ca51badb44c97c38b4
Last change in file:
commit 9f2586b1be335449181547ca51badb44c97c38b4
commit a9714abbc40ca5bf7daff09d4fae8e46c3ca9b25
Author: Sjoerd Langkemper <sjoerd-github@linuxonly.nl>
Date: Sun Aug 1 13:20:58 2021 +0200
Date: Sun Aug 1 13:01:02 2021 +0200

Add grammar for fixed precedence
Fix precedence between concatenation and shift

Closes #93.
---
File: semgrep-grammars/src/semgrep-php/grammar.js
Git repo name: ocaml-tree-sitter-semgrep
Latest commit in repo: 091f5438fc0c15b80217f00e5b94ec0e55517383
Git repo name: agent-a3d5670e81f06ded6
Latest commit in repo: f7cc144dee9de93027d15575765209085538f10f
Last change in file:
commit dfdc3f32aedb879b16c9f8fedcf9f250ed002142
Author: pad <yoann.padioleau@gmail.com>
Date: Mon Aug 9 12:06:52 2021 +0200
commit fb6f5b03f42c3f798e9464017004f6c7ceb92010
Author: brandonspark <brandon@semgrep.com>
Date: Wed Apr 29 17:34:56 2026 -0700

Adding tree-sitter-php
[php] add Semgrep grammar augmentation (was empty)

Wire up `semgrep_ellipsis`, `semgrep_deep_ellipsis`,
`semgrep_variadic_metavariable` (`$...ARGS`), and `semgrep_metavar_ident`
into the previously-empty semgrep-php grammar so PHP patterns can use
metavariables in non-variable positions and ellipses in any expression,
statement, member, parameter, argument, and match-arm position.

PHP's native `$FOO` parses as a `variable_name`, so we only added a
metavariable token for *identifier* positions (class/interface/trait/
enum/function/method names, type references, base/interface clauses,
and attribute names). The variable-variable form `$$F` keeps its native
meaning since `semgrep_metavar_ident` is never accepted in variable
positions.

Closes LANG-474, LANG-475.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---