Parsing
{{redirect|Parser|the computer programming language|Parser (CGI language)}}
In [[computer science]] and [[linguistics]], '''parsing''', or, more formally, '''syntactic analysis''', is the process of analyzing a sequence of [[Token (parser)|tokens]] to determine its grammatical structure with respect to a given (more or less) [[formal grammar]]. A '''parser''' is one of the components of an [[interpreter]] or [[compiler]]: it captures the implied hierarchy of the input text, transforms it into a form suitable for further processing (often some kind of [[parse tree]], [[abstract syntax tree]] or other hierarchical structure), and normally checks for syntax errors at the same time. The parser often uses a separate [[lexical analyser]] to create tokens from the sequence of input characters. Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as [[Yet Another Compiler Compiler|Yacc]]) from a grammar written in [[Backus-Naur form]].
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of [[Inflection|inflected]] languages, such as the [[Romance languages]] or [[Latin]].
Parsers can also be constructed as executable specifications of grammars in functional programming languages. Frost, Hafiz and Callaghan<ref>Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." ''10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN'', Volume 4902/2008, Pages: 167-181, January 2008, San Francisco.</ref> have built on the work of others to construct a set of [[higher-order function]]s (called [[parser combinators]]) which allow top-down parsers with polynomial time and space complexity to be constructed as executable specifications of ambiguous grammars containing left-recursive productions. The [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] site has more about the algorithms and implementation details.
== Human languages ==
:''Also see [[:Category:Natural language parsing]]''
In some [[machine translation]] and [[natural language processing]] systems, human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial [[syntactic ambiguity|ambiguity]] in the structure of human language. In order to parse natural language data, researchers must first agree on the [[grammar]] to be used. The choice of syntax is affected by both [[linguistic]] and computational concerns; for instance some parsing systems use [[lexical functional grammar]], but in general, parsing for grammars of this type is known to be [[NP-complete]]. [[Head-driven phrase structure grammar]] is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn [[Treebank]]. [[Shallow parsing]] aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is [[dependency grammar]] parsing.
Most modern parsers are at least partly [[statistics|statistical]]; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. ''(See [[machine learning]].)'' Approaches which have been used include straightforward [[PCFG]]s (probabilistic context-free grammars), [[maximum entropy]], and [[neural net]]s. Most of the more successful systems use ''lexical'' statistics (that is, they consider the identities of the words involved, as well as their [[part of speech]]). However, such systems are vulnerable to [[overfitting]] and require some kind of smoothing to be effective.{{Fact|date=May 2008}}
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties, as they can with manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are computationally very difficult to parse; in general, even if the desired structure is not [[context-free]], some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the [[CKY algorithm]], usually with some [[heuristic (computer science)|heuristic]] to prune away unlikely analyses to save time. ''(See [[chart parsing]].)'' However, some systems trade accuracy for speed using, e.g., linear-time versions of the [[Shift-reduce parsing|shift-reduce]] algorithm. A somewhat recent development has been [[parse reranking]], in which the parser proposes some large number of analyses and a more complex system selects the best option.
The result of a parse is normally a branching structure in which each part is divided into its subparts.
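As a minimal sketch of the chart-based approach mentioned above, the following Python function is a plain CKY-style recognizer for a grammar in Chomsky normal form. The toy grammar, the layout of the rule tables and the name <code>cky_recognize</code> are assumptions made for illustration; the sketch does no pruning and uses no probabilities, unlike the statistical systems discussed in this section.

```python
from collections import defaultdict


def cky_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start`.

    lexical maps a word to the set of nonterminals that produce it.
    binary maps a pair (B, C) to the set of nonterminals A with a rule A -> B C.
    """
    n = len(words)
    # chart[i, j] holds every nonterminal that derives words[i:j].
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexical.get(word, ()))
    # Build longer spans from pairs of adjacent shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        chart[i, j] |= binary.get((left, right), set())
    return start in chart[0, n]


# Hypothetical toy grammar, for illustration only.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "sees": {"V"}}
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
```

With this toy grammar, <code>cky_recognize("the dog sees the cat".split(), LEXICAL, BINARY)</code> returns <code>True</code>, while the scrambled input "dog the sees" is rejected. The three nested span loops give the cubic running time in the sentence length that makes pruning attractive for large grammars.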
== Programming languages ==
The most common use of a parser is as a component of a [[compiler]] or [[interpreter]]. This parses the [[source code]] of a [[computer programming language]] to create some form of internal representation. Programming languages tend to be specified in terms of a [[context-free grammar]] because fast and efficient parsers can be written for them. Parsers are written by hand or generated by [[parser generator]]s.
Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a grammar is limited: it cannot remember the presence of a construct over an arbitrarily long input, which is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.
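The two-pass strategy can be illustrated with a small Python sketch. The toy language (lines of the form <code>let NAME</code> and <code>use NAME</code>) and both function names are hypothetical, chosen only to show a relaxed syntactic pass followed by a separate declaration check.

```python
def parse_statements(source):
    """Relaxed pass: accept any sequence of 'let NAME' / 'use NAME' lines."""
    statements = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        well_formed = (
            len(parts) == 2
            and parts[0] in ("let", "use")
            and parts[1].isidentifier()
        )
        if not well_formed:
            raise SyntaxError(f"line {line_number}: malformed statement")
        statements.append((parts[0], parts[1], line_number))
    return statements


def check_declarations(statements):
    """Later pass: report names that are used before they are declared."""
    declared = set()
    errors = []
    for kind, name, line_number in statements:
        if kind == "let":
            declared.add(name)
        elif name not in declared:
            errors.append(f"line {line_number}: '{name}' used before declaration")
    return errors
```

Here the input <code>"let x\nuse x\nuse y"</code> passes the syntactic pass, and only the second pass reports that <code>y</code> is used before declaration.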
===Overview of process===
[[image:Parser_Flow.gif|right|Flow of data in a typical parser]]
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first stage is token generation, or [[lexical analysis]], by which the input character stream is split into meaningful symbols defined by a grammar of [[regular expression]]s. For example, a calculator program would look at an input such as "<code>12*(3+4)^2</code>" and split it into the tokens <code>12</code>, <code>*</code>, <code>(</code>, <code>3</code>, <code>+</code>, <code>4</code>, <code>)</code>, <code>^</code>, and <code>2</code>, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters <code>*</code>, <code>+</code>, <code>^</code>, <code>(</code> and <code>)</code> mark the start of a new token, so meaningless tokens like "<code>12*</code>" or "<code>(3</code>" will not be generated.
The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a [[context-free grammar]] which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with [[attribute grammar]]s.
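The lexical and syntactic stages described so far can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names, the nested-tuple tree format and the grammar itself (<code>+</code> binds loosest, <code>*</code> tighter, and <code>^</code> tightest and to the right) are choices made for this example. The parser is a hand-written recursive-descent parser.

```python
import re

# Assumed token grammar: unsigned integers and the operators from the example.
TOKEN_RE = re.compile(r"\d+|[*+^()]")


def tokenize(text):
    """Lexical analysis: split the character stream into token strings."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character: {text[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def parse(tokens):
    """Syntactic analysis: build a nested-tuple tree or raise SyntaxError."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr -> term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():  # term -> factor ('*' factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():  # factor -> base ('^' factor)?
        node = base()
        if peek() == "^":
            advance()
            node = ("^", node, factor())
        return node

    def base():  # base -> NUMBER | '(' expr ')'
        token = advance()
        if token is not None and token.isdigit():
            return int(token)
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token: {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return tree
```

For the input <code>12*(3+4)^2</code>, <code>tokenize</code> yields the nine tokens listed above and <code>parse</code> returns <code>("*", 12, ("^", ("+", 3, 4), 2))</code>. An incomplete input such as <code>12*</code> raises <code>SyntaxError</code>.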
The final phase is [[Semantic analysis (computer science)|semantic parsing]] or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
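The difference between the two kinds of action can be sketched on a syntax tree written as nested tuples. The tree format, the function names and the stack-machine instruction names (<code>PUSH</code>, <code>ADD</code>, <code>MUL</code>, <code>POW</code>) are assumptions made for this illustration.

```python
import operator

# Assumed mapping from operator tokens to their meaning.
OPERATIONS = {"+": operator.add, "*": operator.mul, "^": operator.pow}
INSTRUCTIONS = {"+": "ADD", "*": "MUL", "^": "POW"}


def evaluate(node):
    """Interpreter-style action: compute the value of the expression."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPERATIONS[op](evaluate(left), evaluate(right))


def generate(node):
    """Compiler-style action: emit code for a hypothetical stack machine."""
    if isinstance(node, int):
        return [f"PUSH {node}"]
    op, left, right = node
    return generate(left) + generate(right) + [INSTRUCTIONS[op]]


# Tree for 12*(3+4)^2, written out by hand.
TREE = ("*", 12, ("^", ("+", 3, 4), 2))
```

Here <code>evaluate(TREE)</code> returns 588, whereas <code>generate(TREE)</code> returns the instruction list <code>PUSH 12, PUSH 3, PUSH 4, ADD, PUSH 2, POW, MUL</code> without computing anything.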
==Types of parsers==
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in two main ways:
*[[Top-down parsing]] - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for [[parse tree]]s using a top-down expansion of the given [[formal grammar]] rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate [[ambiguity]] by expanding all alternative right-hand sides of grammar rules.<ref>Aho, A.V., Sethi, R. and Ullman, J.D. (1986) "Compilers: principles, techniques, and tools." ''Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA.''</ref> [[LL parser]]s and [[recursive-descent parser]]s are examples of top-down parsers, which cannot accommodate [[left recursion|left-recursive]] productions. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left recursion and may require exponential time and space while parsing ambiguous [[context-free grammar]]s, Frost, Hafiz and Callaghan<ref>Frost, R., Hafiz, R. and Callaghan, P. (2007) "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars." ''10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE'', Pages: 109-120, June 2007, Prague.</ref><ref>Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." ''10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN'', Volume 4902/2008, Pages: 167-181, January 2008, San Francisco.</ref> have created a more sophisticated top-down parsing algorithm which accommodates [[ambiguity]] and [[left recursion]] in polynomial time and which generates polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with respect to a given CFG.
*[[Bottom-up parsing]] - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. [[LR parser]]s are examples of bottom-up parsers. Another term used for this type of parsing is shift-reduce parsing.
Another important distinction is whether the parser generates a ''leftmost derivation'' or a ''rightmost derivation'' (see [[context-free grammar]]). LL parsers will generate a leftmost [[derivation]] and LR parsers will generate a rightmost derivation (although usually in reverse).{{Fact|date=January 2008}}
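The remark about a rightmost derivation in reverse can be made concrete with a small Python sketch. It is a naive, greedy shift-reduce loop for the assumed toy grammar E → E + n | n, not a table-driven LR parser; the rule list and the function name are hypothetical.

```python
# Assumed toy grammar, longest right-hand side first so that the
# greedy loop prefers E -> E + n over E -> n when both could apply.
RULES = [("E", ("E", "+", "n")), ("E", ("n",))]


def shift_reduce(tokens):
    """Return the reductions performed, in the order they were applied."""
    stack, reductions = [], []
    remaining = list(tokens)
    while True:
        for head, body in RULES:
            # Reduce if the top of the stack matches a right-hand side.
            if tuple(stack[-len(body):]) == body:
                del stack[-len(body):]
                stack.append(head)
                reductions.append((head, body))
                break
        else:
            # No reduction was possible: shift the next token, or stop.
            if not remaining:
                break
            stack.append(remaining.pop(0))
    if stack != ["E"]:
        raise SyntaxError(f"could not reduce input to E; stack is {stack}")
    return reductions
```

For the input <code>n + n + n</code> the function records E → n first, then E → E + n twice. Read backwards, these reductions spell out the rightmost derivation E ⇒ E + n ⇒ E + n + n ⇒ n + n + n.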
== Examples of parsers ==
=== Top-down parsers ===
Some of the parsers that use [[top-down parsing]] include:
* [[Recursive descent parser]]
* [[LL parser]] ('''L'''eft-to-right, '''L'''eftmost derivation)
* [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] - eXecutable SpecificAtIons of GrAmmars. Contains publications related to a top-down parsing algorithm that supports left recursion and ambiguity in polynomial time and space.
=== Bottom-up parsers ===
Some of the parsers that use [[bottom-up parsing]] include:
* Precedence parser
** [[Operator-precedence parser]]
** [[Simple precedence parser]]
* BC (bounded context) parsing
* [[LR parser]] ('''L'''eft-to-right, '''R'''ightmost derivation)
** [[SLR parser|Simple LR (SLR) parser]]
** [[LALR parser]]
** [[Canonical LR parser|Canonical LR (LR(1)) parser]]
** [[GLR parser]]
* [[CYK algorithm|CYK parser]]
==References==