Skip to main content

Context-Free Grammar

Main Source:

  • Book chapter 2
  • Neso Academy playlist 65
  • Neso Academy playlist 73-77
  • Neso Academy playlist 78-81
  • Neso Academy playlist 82-83

Context-free grammar is the grammar that generates context-free language, which is recognized by pushdown automata. The production rule is:

AαA \rightarrow \alpha, where α={VΣ}\alpha = \{V \cup \Sigma^*\} and AVA \in V.

Similar to the general and regular grammar, a non-terminal symbol can produce α\alpha which can be any symbol from the set of all possible strings of symbols that can be generated using both the non-terminal symbols and the terminal symbols of the grammar, including the empty string.

info

From different sources, TT is sometimes used as Σ\Sigma. This implies that ({VΣ}\{V \cup \Sigma^*\} is equivalent with {VT}\{V \cup T\}).

The difference between regular grammar and context-free grammar lies in the production rules. Regular grammar is more restrictive, as the production rule must place non-terminal symbols either on the right or left side (i.e., non-terminals cannot be placed in the middle of the string). In contrast, context-free grammar allows the placement of non-terminals anywhere.

For example, the grammar G={(S,A),(a,b),(SaAb,AaAbϵ)}G = \{(S, A), (a, b), (S \rightarrow aAb, A \rightarrow aAb| \epsilon)\} is considered as a context-free grammar.

We can generate: SaAbS \rightarrow aAb
aaAbb\rightarrow aaAbb (by AaAbA \rightarrow aAb)
aaaAbbb\rightarrow aaaAbbb (by AaAbA \rightarrow aAb)
aaabbb\rightarrow aaabbb (by AϵA \rightarrow \epsilon)

The language of this grammar will be in the form of anbna^n b^n.

Derivation Trees

Also known as parse tree, it is an ordered rooted tree that graphically represent the derivation process of a context-free grammar. The five properties of derivation tree:

  1. The root of a derivation tree is SS.
  2. Every leaf node is a terminal symbol T{λ}T \cup \{\lambda\}.
  3. Every vertex node is a non-terminal symbol in VV.
  4. When a vertex AVA \in V has a set of children labeled from a1,a2,...,ana_1, a_2, ..., a_n, then vertex AA will have production rule in the form of Aa1,a2,...,anA \rightarrow a_1, a_2, ..., a_n.
  5. A leaf labeled with λ\lambda has no siblings, that is, a vertex with a child labeled λ\lambda can have no other children.

In other word, each level of the tree represents a step in the derivation process. Starting from the root, each level corresponds to the application of a production rule. The nodes at a particular level are derived from the nodes at the previous level.

Derivation tree
Source: https://youtu.be/u4-rpIlV9NI?si=mAYV4hyYXS9j7sfh&t=244

We see in the example that starting from SS, it produces terminal 00 and non-terminal BB. We will make it as the child of SS, and continue the derivation process of BB. This is done for all production rule. If there exist multiple production rule for one non-terminal symbol, then we will have to use both of them in distinct node. When the production rule is recursive (i.e., A1AAA \rightarrow 1AA), we may stop the derivation and fill the leaf nodes with ϵ\epsilon.

We also call any step of the derivation as a sentential form. A derivation tree in which the leaves contain a label from VT{λ}V \cup T \cup \{\lambda\}, or in other words, if any of the leaf still contains a non-terminal symbol, then the tree is said to be a partial derivation tree.

info

See also tree data structure.
ϵ\epsilon is same as λ\lambda.

Left & Right Derivation

When deriving a string with a grammar, there are two approaches, leftmost derivation and rightmost derivation. In leftmost derivation, the leftmost non-terminal in the current sentential form is always selected for expansion. While in the rightmost derivation, we select the rightmost non-terminal symbols. There is also mixed derivation, in which the two approaches is combined.

For example, a simple grammar with production rule SABS \rightarrow AB, AaA \rightarrow a, BaB \rightarrow a.

Using this grammar, the string "ab" can be derived in the following ways:

  • Leftmost Derivation: SABaBabS \Rightarrow AB \Rightarrow aB \Rightarrow ab
  • Rightmost Derivation: SABAbabS \Rightarrow AB \Rightarrow Ab \Rightarrow ab

Another example with the graphical derivation tree:

Left and right derivation tree
Source: https://youtu.be/u4-rpIlV9NI?si=mWlwEeGp_w5Ky123&t=740

In the left derivation process, we chose to apply the production rule SaSSS \rightarrow aSS, because applying the other rule SaASS \rightarrow aAS would require us to replace AA. The production rule of AA involve occurrence of bb, if we use the AbaA \rightarrow ba production rule, we will have bb as the second character of the string (which is invalid). The process of derivation continues like this until we successfully create a parse tree with the correct characters combined from all the leaf nodes.

Parsing

Parsing is the process of finding a derivation for a string from a given grammar. It is a way to recognize a string and to determine if a string belong to the grammar.

The basic idea of parsing is, finding (or making) a parse tree. The manual way is to analyze a given string and the grammar, then carefully construct a parse tree (like what we did in the previous example above).

A brute force way of parsing, also called exhaustive search parsing involve generating all possible strings of the same length as the input and checking if any of them matches the input. This approach is obviously inefficient. A practical parsing algorithms would be designed to avoid exhaustive search by employing various optimization techniques.

Ambiguity

In parsing, ambiguity refers to a situation where a given grammar can produce more than one valid parse tree for a particular input string. Depending on which production rule is applied on each step, a different but valid parse tree could be generated.

The grammar G=({S},{a,b},S,SaSbbSaSSλ)G = (\{S\}, \{a, b\}, S, S \rightarrow aSb | bSa | SS | \lambda) generates strings having an equal number of "a"'s and "b"'s. The string "abab" can be generated in two distinct ways as shown in the following.

Ambiguity in parsing
Source: Book page 129

Furthermore, the string "abab" has two distinct leftmost derivations:

  • SaSbabSabababS \Rightarrow aSb \Rightarrow abSab \Rightarrow abab
  • SSSaSbSabSabaSbababS \Rightarrow SS \Rightarrow aSbS \Rightarrow abS \Rightarrow abaSb \Rightarrow abab

And two distinct rightmost derivations:

  • SaSbabSabababS \Rightarrow aSb \Rightarrow abSab \Rightarrow abab
  • SSSSaSbSabaSbabababS \Rightarrow SS \Rightarrow SaSb \Rightarrow Sab \Rightarrow aSbab \Rightarrow abab

Ambiguity can happen even in our daily lives, such as when encountering this ambiguous math expression "2 + 3 × 4". Without parenthesis, it wouldn't be clear in which order do we process this expression. We can interpret "2 + 3 × 4" as either "(2 + 3) × 4" or "2 + (3 × 4)".

We will say a string that leads to this situation "ambiguously derivable". A grammar GG is said to be ambiguous if there exists at least one string in L(G)L(G) which is ambiguously derivable. Ambiguity is a property of grammar, it is not always possible to find equivalent unambiguous grammar.

Simplification

The simplification of context-free grammar involves transforming or modifying the grammar to make it simpler or more manageable while preserving its language. This includes removing useless symbols, unreachable symbols, and simplifying the production rules.

For example, if we have production rule AaBaA \rightarrow aBa and ByB \rightarrow y, we can simplify this by directly substituting BB in the AA production rule. So, the simplified production rule is AayaA \rightarrow aya. This technique is also called substitution rule.

There are three steps in CFG simplification, reduction of CFG, removal of unit production, and removal of null production.

Reduction of CFG

The reduction involve eliminating useless symbols in the grammar, including non-terminals and terminals that do not contribute to generating any valid string in the language. It is divided into two phases.

  1. Removal of useless symbols, non-terminals and terminals that do not contribute to generating any valid string in the language.
  2. Removal of unreachable symbols, symbols that do not participate in the derivation of any sentential form.

A helpful procedure from the video:

The reduction of CFG
Source: https://youtu.be/EF09zxzpVbk?si=qtn-ff6Wb-TJ2Zxv&t=323

With example grammar GG with production rules PP: SACBS \rightarrow AC|B, AaA \rightarrow a, CcBCC \rightarrow c|BC, EaAeE \rightarrow aA|e.

Reduction phase 1
Source: https://youtu.be/EF09zxzpVbk?si=gc_WXN5N0yI5d6et&t=604

  1. Create a set w1w_1 that includes non-terminals that derives to some terminals.
  2. Create another set w2w_2 that includes non-terminals that derives to all symbol in the previous step.
  3. It is repeated until we obtained the same set in two consecutive steps (end at w3w_3).
  4. The new grammar GG' will contain non-terminal symbols from the last set w3w_3, and terminal symbols that are derived by those non-terminal symbols.
  5. A new production rule is created in which the symbol that doesn't appear in the non-terminal of GG' won't be included. In the example, we see that we removed production rule SBS \rightarrow B, because BB itself doesn't generate any terminal symbol (i.e., BB is a useless symbol).

Reduction phase 2
Source: https://youtu.be/EF09zxzpVbk?si=ra95MHcYogTYxDC9&t=779

  1. Create a set y1y_1 starting from the start symbol SS.
  2. Create another set y2y_2 that includes all the symbol (including non-terminals and terminals) that can be derived from the previous set.
  3. It is repeated until we obtained the same set in two consecutive steps (end in y4y_4).
  4. A new grammar GG'' we include the non-terminals and terminals from the last set y4y_4.
  5. We will alter the production rule by only including those that are in the non-terminals of GG''. In the example, we removed symbol EE entirely, because in fact no other symbol can reach it (i.e., EE is an unreachable symbol).

Removal of Unit Productions

Unit production refers to any production rule in the form ABA \rightarrow B, or a non-terminal that transform to another non-terminal. Unit production is sometimes redundant and can be simplified using substitution rule.

If BB production rule is in the form By1y2...ynB \rightarrow y_1|y_2|...|y_n and there is a ABA \xRightarrow{*} B or a sequence of derivation that leads from AA to BB, then we can delete ABA \rightarrow B and replace it with Ay1y2...ynA \rightarrow y_1|y_2|...|y_n.

Unit production removal
Source: https://youtu.be/B2o75KpzfU4?si=NUhYWR4P7fznC6Lp&t=219

Unit production removal part 1
Source: https://youtu.be/B2o75KpzfU4?si=PYaESnF-Flc8mljR&t=493

The production rule involves a sequence of derivation from non-terminals to another non-terminals. We can substitute the final derivation step that derive to terminals to each sequence of derivation. Finally, we can remove all unreachable symbols.

Removal of Null Productions

A null production is a production rule in the form of AϵA \rightarrow \epsilon. If the variable doesn't directly map to ϵ\epsilon, this mean it needs a sequence of derivation AϵA \xRightarrow{*} \epsilon, then we call the variable nullable.

Null production removal
Source: https://youtu.be/mlXYQ8ug2v4?si=i7_75uZdYmBzYOCL&t=131

Null production removal example part 1
Source: https://youtu.be/mlXYQ8ug2v4?si=sSOBcDT_ll4KWpl6&t=396

The first step is to remove all null productions from a variable and replace the variable occurrences in other production rules with epsilon. For example, we are going to remove AϵA \rightarrow \epsilon first. SABACS \rightarrow ABAC contains 2 AA, and there is three possibility of removal. Changing the first AA to ϵ\epsilon, so it becomes SABCS \rightarrow ABC, changing the second AA to become SBACS \rightarrow BAC, and changing both AA to be just SBCS \rightarrow BC. AA also appear in production rule of AA itself, and we can replace the AA in AaAA \rightarrow aA to ϵ\epsilon to become just AaA \rightarrow a.

Next step is to remove all null production of BB, and obtain the simplified CFG.

Null production removal example part 2
Source: https://youtu.be/mlXYQ8ug2v4?si=7OuiYOepKinpa4z7&t=492

Normal Forms

A context-free grammar is said to be in normal forms when they follow specific structure or standards.

Chomsky Normal Form

Chomsky Normal Form (CNF) is when the context-free grammar follow the production rule in form:

  • ABCA \rightarrow BC (where AA, BB, and CC are non-terminals)
  • AaA \rightarrow a (where AA is a non-terminal and aa is a terminal)
  • SϵS \rightarrow \epsilon (where SS is the start symbol and ϵ\epsilon represents the empty string)

In CNF, all production rules on the right-hand side should either be two non-terminals or a terminal, except for the rule allowing the start symbol to derive the empty string. Other standard are the elimination of unit productions and null productions, as well as the conversion of longer productions.

Procedure of conversion from CFG to CNF
Source: https://youtu.be/Mh-UQVmAxnw?si=EZD7ndYRKnodk7mJ&t=143

The step 4 is a way to enforce maximum of two non-terminals on the right-hand side. It is done by adding another non-terminals that has the same production rule. For example, instead of ABCDA \rightarrow B|C|D, we can add EE that ECDE \rightarrow C|D, and change AA to ABEA \rightarrow B|E.

Additionally, when a production rule contains both non-terminal and terminal symbols simultaneously, we can follow the step 5. The step 5 shows that we can separate non-terminal and a terminal by making another non-terminal that produce the terminal.

Example

CFG to CNF example
Source: https://youtu.be/FNPSlnj3Vt0?si=Ao7POmbn_eLJCk1O&t=725

  1. Start symbol can't appear on the right-hand side, so we can make a new start symbol SS' that produce SS.
  2. Removing null production by following this.
  3. Removing unit production by following this.
  4. A new non-terminal XX is added here.
  5. YY is added to replace aa that is combined with aBaB.

Greibach Normal Form

In Greibach Normal Form (GNF), every production rule in the grammar is in the form:

  • AaγA \rightarrow a\gamma (where AA is a non-terminal, aa is a terminal, and γ\gamma is a possibly empty sequence of non-terminals)

The right-hand side of the production rule starts with a terminal symbol, followed by a sequence of non-terminals.

CFG to GNF procedure
Source: https://youtu.be/ZCbJan6CGNM?si=pS0u9kuYGdlGbHC8&t=758

In order to convert from CFG to GNF, we have to convert it to CNF first. If CNF restrict the length of right-hand side of a production, GNF allows for longer right-hand sides compared to CNF.

Example

CFG to GNF example part 1
Source: https://youtu.be/ZCbJan6CGNM?si=DiBqkV6DAv0zHJwb&t=767

  1. Keep in mind that we have to confirm it is in CNF first. The step 3 told us to change the name of non-terminals to numbered AA, such as A1,A2A_1, A_2.
  2. The production rule will be altered so that the AA numbering in left-hand side is always lower than the right-hand side.

CFG to GNF example part 2
https://youtu.be/rauqqM0nfuI?si=FplhJA1avFCa6QsS&t=294

  1. A left recursion is when non-terminal on the left-hand side of a production rule appears as the first symbol on the right-hand side. Removing left recursion involve adding new variable to remove it. Upon adding the new variable, we will rewrite the production rule A4A_4 twice. The first time will be same, following the old A4A_4, the second time will be rewriting A4A_4 but adding the new variable to each production rule.

CFG to GNF example part 3
Source: https://youtu.be/rauqqM0nfuI?si=cSCk3rJx4KJAg9KT&t=633

  1. After altering A4A_4, there is still a problem, a non-terminal symbol appears as the first element on the right-hand side of the production rule for A1A_1. The A1A2A3A_1 \rightarrow A_2A_3 can be replaced to A1bA3A_1 \rightarrow bA_3. Then, the first A4A_4 in A1A4A4A_1 \rightarrow A_4A_4 is substituted with the whole A4A_4 production rule. For example, the first A4A_4 is changed to bb and then added with the second. Next, A4A_4 is replaced again with bA3A4bA_3A_4 and added with the second A4A_4, and so on for each production rule in A4A_4.
  2. There is also a problem in ZZ, we also address this by doing the similar thing.

Pumping Lemma for Context-Free Languages

Pumping lemma for context-free languages is to prove that a language is not context-free. In contrast, pumping lemma for regular language proves that a language is not regular.

The lemma states:

Any context-free language AA has a pumping length OO, such that any string SS in AA, where SP|S| \ge P can be divided into five parts: S=uvxyzS = uvxyz, satisfying the following conditions:

  1. uvixyizuv^{i}xy^{i}z is in AA for every i0i \ge 0
  2. vy>0|vy| > 0
  3. vxyP|vxy| \le P

Similar to the previous pumping lemma, it is a proof based on contradiction. A helpful procedure from the video:

Pumping lemma for context-free languages procedure
Source: https://youtu.be/jRhqx1_KcCk?si=PQWy2IpLX7wc_wfp&t=259

Example

Proof with pumping lemma part 1
Source: https://youtu.be/eQ0XkUk3qGk?si=2DO7eT_GCWy0jLFQ&t=429

  1. Pick a pumping length PP and a string SS. PP is chosen to be 4 and S=apbpcpS = a^pb^pc^p, so S=a4b4c4S = a^4b^4c^4.

Proof with pumping lemma part 2
Source: https://youtu.be/eQ0XkUk3qGk?si=RyCf4AbLvHGW-yxr&t=700

  1. Consider some case of language such that uvixyizAuv^ixy^iz \notin A for some ii. Remember that the language must be in the form of anbncna^nb^nc^n according to the original language.
  2. It is shown that in either two case, both fail to satisfy the three pumping lemma condition at the same time, so the language is proven to not be a context-free.