[10640010] |OpenOffice.org
[10640020] |'''OpenOffice.org''' ('''OO.o''' or '''OOo''') is a [[cross-platform]] [[office suite|office application suite]] available for a number of different computer [[operating system]]s.
[10640030] |It supports the ISO standard '''[[OpenDocument]] Format (ODF)''' for data interchange as its default [[file format]], as well as the [[Microsoft Office]] 97–2003 formats and the [[Microsoft Office]] 2007 format (in version 3), among many others.
[10640040] |OpenOffice.org was originally derived from [[StarOffice]], an office suite developed by [[StarDivision]] and acquired by [[Sun Microsystems]] in August 1999.
[10640050] |The [[source code]] of the suite was released in July 2000 with the aim of reducing the dominant [[market share]] of [[Microsoft Office]] by providing a free, open and high-quality alternative; later versions of StarOffice are based upon OpenOffice.org with additional proprietary components.
[10640060] |OpenOffice.org is [[free software]], available under the [[GNU Lesser General Public License]] (LGPL).
[10640070] |The project and software are informally referred to as ''OpenOffice'', but this term is a [[trademark]] held by another party, requiring the project to adopt ''OpenOffice.org'' as its formal name.
[10640080] |== History ==
[10640090] |The suite was originally developed as the [[proprietary software]] application suite StarOffice by the German company [[StarDivision]]; Sun Microsystems purchased the code in 1999.
[10640100] |In August 1999 version 5.2 of StarOffice was made available free of charge.
[10640110] |On [[July 19]], [[2000]], Sun Microsystems announced that it was making the source code of StarOffice available for download under both the LGPL and the [[Sun Industry Standards Source License]] (SISSL) with the intention of building an open source development community around the software.
[10640120] |The new project was known as OpenOffice.org, and its website went live on [[October 13]], [[2000]].
[10640130] |Work on version 2.0 began in early 2003 with the following goals: better interoperability with Microsoft Office; better performance, with improved speed and lower memory usage; greater [[Scripting language|scripting]] capabilities; better integration, particularly with [[GNOME]]; an easier-to-find and easier-to-use database front-end for creating reports, forms and queries; a new built-in [[SQL]] database; and improved [[usability]].
[10640140] |A [[beta version]] was released on [[March 4]], [[2005]].
[10640150] |On [[September 2]], [[2005]] Sun announced that it was retiring the SISSL.
[10640160] |As a consequence, the OpenOffice.org Community Council announced that it would no longer [[dual license]] the office suite, and future versions would use only the LGPL.
[10640170] |On [[October 20]], [[2005]], OpenOffice.org 2.0 was formally released to the public.
[10640180] |Eight weeks after the release of Version 2.0, an update, OpenOffice.org 2.0.1, was released.
[10640190] |It fixed minor bugs and introduced new features.
[10640200] |As of the 2.0.3 release, OpenOffice.org changed its release cycle from an 18-month cycle to releasing updates, feature enhancements and bug fixes every three months.
[10640210] |Currently, versions with new features (so-called "feature releases") are released every six months, alternating with so-called "bug fix releases" issued midway between two feature releases, so that a release appears every three months.
[10640220] |=== StarOffice ===
[10640230] |Sun subsidizes the development of OpenOffice.org in order to use it as a base for its commercial [[proprietary software|proprietary]] StarOffice application software.
[10640240] |Releases of StarOffice since version 6.0 have been based on the OpenOffice.org source code, with some additional proprietary components, including:
[10640250] |* Additional bundled fonts (especially [[CJK|East Asian language]] fonts).
[10640260] |* [[Adabas D]] database.
[10640270] |* Additional document [[Template (word processing)|templates]].
[10640280] |* [[Clip art]].
[10640290] |* Sorting functionality for Asian versions.
[10640300] |* Additional file filters.
[10640310] |* Migration assessment tool (Enterprise Edition).
[10640320] |* Macro migration tool (Enterprise Edition).
[10640330] |* Configuration management tool (Enterprise Edition).
[10640340] |OpenOffice.org therefore inherited many features from the original StarOffice upon which it was based, including the [[OpenOffice.org XML]] file format, which it retained until version 2, when it was replaced by the ISO standard [[OpenDocument]] Format (ODF).
[10640350] |== Features ==
[10640360] |According to its [[mission statement]], the OpenOffice.org project aims "''To create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format.''"
[10640370] |OpenOffice.org aims to compete with Microsoft Office and emulate its look and feel where suitable.
[10640380] |It can read and write most of the [[file formats]] found in Microsoft Office and in many other applications, an essential feature of the suite for many users.
[10640390] |OpenOffice.org has been found to be able to open files of older versions of Microsoft Office and damaged files that newer versions of Microsoft Office itself cannot open.
[10640400] |However, it cannot open older Word for Macintosh (MCW) files.
[10640410] |=== Platforms ===
[10640420] |Platforms for which OO.o is available include [[Microsoft Windows]], [[Linux]], [[Solaris Operating System|Solaris]], [[BSD]], [[OpenVMS]], [[OS/2]] and [[IRIX]].
[10640430] |The current primary development platforms are Microsoft Windows, Linux and Solaris.
[10640440] |A port for [[Mac OS X]] exists for OS X machines which have the [[X Window System]] component installed.
[10640450] |A port to OS X's native [[Aqua (user interface)|Aqua user interface]] is in progress, and is scheduled for completion for the 3.0 milestone.
[10640460] |[[NeoOffice]] is an independent [[Fork (software development)|fork]] of OpenOffice, specially adapted for Mac OS X.
[10640470] |=== Version compatibility ===
[10640480] |*Windows 95: up to v1.1.5
[10640490] |*Windows 98-Vista: up to v2.4, development releases of v3.0
[10640500] |*Mac OS 10.2: up to v1.1.2
[10640510] |*Mac OS 10.3: up to v2.1
[10640520] |*Mac OS 10.4-10.5: up to v2.4, development releases of v3.0 ([[Apple-Intel architecture|Intel]] only)
[10640530] |*OS/2 and eComStation: up to v2.0.4
[10640540] |=== Components ===
[10640550] |OpenOffice.org is a collection of applications that work together closely to provide the features expected from a modern office suite.
[10640560] |Many of the components are designed to mirror those available in Microsoft Office.
[10640570] |The components available include:
[10640580] |*[[QuickStart]]er
[10640590] |:A small program for Windows and Linux that runs when the computer starts.
[10640600] |It loads the core files and libraries for OpenOffice.org during computer startup and allows the suite applications to start more quickly when selected later.
[10640610] |The amount of time it takes to open OpenOffice.org applications was a common complaint in version 1.0 of the suite.
[10640620] |Substantial improvements were made in this area for version 2.2.
[10640630] |*The [[Macro (computer science)|macro]] recorder
[10640640] |:Is used to record user actions and replay them later to help with automating tasks, using [[OpenOffice.org Basic]] (see [[OpenOffice.org#OpenOffice.org Basic|below]]).
[10640650] |It is not possible to download these components individually on Windows, though they can be installed separately.
[10640660] |Most Linux distributions break the components into individual packages which may be downloaded and installed separately.
[10640670] |=== OpenOffice.org Basic ===
[10640680] |OpenOffice.org Basic is a programming language similar to Microsoft [[Visual Basic for Applications]] (VBA) based on [[StarOffice Basic]].
[10640690] |In addition to the macros, the upcoming Novell edition of OpenOffice.org 2.0 supports running Microsoft VBA macros, a feature expected to be incorporated into the mainstream version soon.
[10640700] |OpenOffice.org Basic is available in the Writer and Calc applications.
[10640710] |Programs are written as units called subroutines or macros, with each macro performing a different task, such as counting the words in a paragraph.
[10640720] |OpenOffice.org Basic is especially useful in doing repetitive tasks that have not been integrated in the program.
[10640730] |As the OpenOffice.org database, called "Base", uses documents created under the Writer application for reports and forms, one could say that Base can also be programmed with OpenOffice.org Basic.
[10640740] |== File formats ==
[10640750] |OpenOffice.org pioneered the ISO/IEC standard [[OpenDocument]] file formats (ODF), which it uses natively, by default.
[10640760] |It also supports reading (and in some cases writing) a large number of legacy proprietary file formats (e.g.: [[WordPerfect]] through libwpd, [[StarOffice]], [[Lotus software]], [[Microsoft Works|MS Works]] through libwps, [[Rich Text Format]]), most notably including [[Microsoft Office]] formats. The OpenDocument specification was "approved for release as an ISO and IEC International Standard" under the name ISO/IEC 26300:2006.
[10640770] |=== Microsoft Office interoperability ===
[10640780] |In response to Microsoft's recent movement towards using the [[Office Open XML]] format in [[Microsoft Office 2007]], [[Novell]] has released an [[Office Open XML]] converter for OOo under a liberal [[BSD license]] (along with [[GNU GPL]] and [[LGPL]] licensed libraries), that will be submitted for inclusion into the OpenOffice.org project.
[10640790] |This allows OOo to read and write Office Open XML word processing documents (.docx).
[10640800] |Currently it works only with the latest Novell edition of OpenOffice.org.
[10640810] |[[Sun Microsystems]] has developed an ODF plugin for Microsoft Office which enables users of Microsoft Office Word, Excel and PowerPoint to read and write ODF documents.
[10640820] |The plugin currently works with Microsoft Office 2003, Microsoft Office XP and Microsoft Office 2000.
[10640830] |Support for Microsoft Office 2007 is only available in combination with Microsoft Office 2007 SP1.
[10640840] |Several software companies (including Microsoft and Novell) are working on an add-in for Microsoft Office that allows reading and writing ODF files.
[10640850] |Currently it works only for Microsoft Word 2007 / XP / 2003.
[10640860] |Microsoft provides a compatibility pack to read and write Office Open XML files with Office 2000, XP and 2003.
[10640870] |The compatibility pack can also be used as a stand-alone converter with Microsoft Office 97.
[10640880] |This might be helpful to convert older Microsoft Office files via Office Open XML to ODF if a direct conversion doesn't work as expected.
[10640890] |The Office compatibility pack, however, does not install for Office 2000 or Office XP on [[Windows 9x]].
[10640900] |Note that some office applications built with Microsoft components may refuse to import OpenOffice data.
[10640910] |[[The Sage Group]]'s Simply Accounting, for example, can import Excel's .xls files, but refuses to accept OpenOffice.org-generated .xls files for the reason that the OOo .xls files are not "genuine Microsoft" .xls files.
[10640920] |== Development ==
[10640930] |=== Overview ===
[10640940] |The OpenOffice.org [[Application Programming Interface|API]] is based on a component technology known as [[Universal Network Objects]] (UNO).
[10640950] |It consists of a wide range of interfaces defined in a [[CORBA]]-like [[interface description language]].
[10640960] |The [[document file format]] used is based on [[XML]] and several export and import filters.
[10640970] |All external formats read by OpenOffice.org are converted back and forth from an internal XML representation.
[10640980] |By using [[data compression|compression]] when saving [[XML]] to disk, files are generally smaller than the equivalent binary Microsoft Office documents.
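The effect of compression on XML can be illustrated with a minimal Python sketch. It does not use any OpenOffice.org code: the function name `compressed_size` and the sample markup are illustrative assumptions, and the sketch only compares raw XML against the same XML stored in a ZIP archive, not against a binary Microsoft Office document.

```python
import io
import zipfile


def compressed_size(xml_text: str) -> int:
    """Return the size in bytes of a ZIP archive holding xml_text as content.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", xml_text)
    return len(buffer.getvalue())


# Hypothetical, highly repetitive markup standing in for a document body.
sample = "<doc>" + "<p>Hello, OpenOffice.org</p>" * 500 + "</doc>"

raw_size = len(sample.encode("utf-8"))
packed_size = compressed_size(sample)
print(f"raw XML: {raw_size} bytes, compressed: {packed_size} bytes")
```

Because XML markup repeats the same tag names many times, it tends to compress well; how much a real document shrinks depends on its content.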
[10640990] |The native file format for storing documents in version 1.0 was used as the basis of the [[OASIS (organization)|OASIS]] OpenDocument file format standard, which has become the default file format in version 2.0.
[10641000] |Development versions of the suite are released every few weeks on the developer zone of the OpenOffice.org website.
[10641010] |The releases are meant for those who wish to test new features or are simply curious about forthcoming changes; they are not suitable for production use.
[10641020] |=== Native desktop integration ===
[10641030] |OpenOffice.org 1.0 was criticized for not having the [[look and feel]] of applications developed natively for the platforms on which it runs.
[10641040] |Starting with version 2.0, OpenOffice.org uses native [[widget toolkit]]s, icons, and font-rendering libraries across a variety of platforms, to better match native applications and provide a smoother experience for the user.
[10641050] |There are projects underway to further improve this integration on both [[GNOME]] and [[KDE]].
[10641060] |This issue has been particularly pronounced on Mac OS X, whose standard user interface looks noticeably different from either Windows or [[X11]]-based desktop environments and requires the use of programming toolkits unfamiliar to most OpenOffice.org developers.
[10641070] |There are two implementations of OpenOffice.org available for OS X:
[10641080] |;OpenOffice.org Mac OS X (X11):
[10641090] |This official implementation requires the installation of [[X11.app]] or [[XDarwin]], and is a close port of the well-tested Unix version.
[10641100] |It is functionally equivalent to the Unix version, and its user interface resembles the [[look and feel]] of that version; for example, the application uses its own [[menu bar]] instead of the OS X menu at the top of the screen.
[10641110] |It also requires system fonts to be converted to X11 format for OpenOffice.org to use them (which can be done during application installation).
[10641120] |;OpenOffice.org Aqua:
[10641130] |After a completed first step using [[Carbon (API)|Carbon]], OpenOffice.org Aqua switched to [[Cocoa (API)|Cocoa]] technology; this [[Aqua (GUI)|Aqua]] version is being developed under the aegis of OpenOffice.org, with a beta version currently available.
[10641140] |Sun Microsystems is collaborating with OOo to further the development of the Aqua version of OpenOffice.org for Mac.
[10641150] |=== Future ===
[10641160] |Currently, a developer preview of OpenOffice.org 3 (OOo-dev 3.0) is available for download.
[10641170] |Among the planned features for OOo 3.0, set to be released by September 2008, are:
[10641180] |* Personal Information Manager ([[Personal Information Manager|PIM]]), probably based on [[Mozilla Thunderbird|Thunderbird]]/[[Lightning (software)|Lightning]]
[10641190] |* PDF import into Draw (to maintain correct layout of the original PDF)
[10641200] |* [[OOXML]] document support for opening documents created in [[Office 2007]]
[10641210] |* Support for [[Mac OS X]] [[Aqua (user interface)|Aqua]] platform
[10641220] |* Extensions, to add third party functionality.
[10641230] |* Presenter screen in Impress with multi-screen support
[10641240] |=== Other projects ===
[10641250] |A number of products are [http://wiki.services.openoffice.org/wiki/DerivedWorks derived from OpenOffice.org].
[10641260] |Among the more well-known ones are Sun StarOffice and NeoOffice.
[10641270] |The OpenOffice.org site also lists a large variety of [http://wiki.services.openoffice.org/wiki/OpenOffice.org_Solutions complementary products] including groupware solutions.
[10641280] |==== NeoOffice ====
[10641290] |[[NeoOffice]] is an independent [[porting|port]] that integrates with [[Mac OS X|OS X]]’s [[Aqua (GUI)|Aqua]] user interface using [[Java platform|Java]], [[Carbon (API)|Carbon]] and (increasingly) [[Cocoa (API)|Cocoa]] toolkits.
[10641300] |NeoOffice adheres fairly closely to OS X UI standards (for example, using native pull-down menus), and has direct access to OS X’s installed fonts and printers.
[10641310] |Its releases lag behind the official OpenOffice.org X11 releases, due to its small development team and the concurrent development of the technology used to port the user interface.
[10641320] |Other projects run alongside the main OpenOffice.org project and are easier to contribute to.
[10641330] |These include documentation, [[internationalisation and localisation]] and the API.
[10641340] |==== OpenGroupware.org ====
[10641350] |[[OpenGroupware.org]] is a set of extension programs to allow the sharing of OpenOffice.org documents, calendars, address books, [[e-mail]]s, [[instant messenger|instant messaging]] and blackboards, and provide access to other [[collaborative software|groupware]] applications.
[10641360] |There is also an effort to create and share assorted document templates and other useful additions at OOExtras.
[10641370] |A set of [[Perl]] extensions is available through the [[CPAN]] in order to allow OpenOffice.org document processing by external programs.
[10641380] |These libraries do not use the OpenOffice.org API.
[10641390] |They directly read or write the OpenOffice.org files using Perl standard file [[codec|compression/decompression]], XML access and [[UTF-8]] encoding modules.
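The same approach (unpacking the compressed package and parsing its XML without the OpenOffice.org API) can be sketched in Python using only the standard library. This is an illustration, not the Perl extensions' actual code: the helper names are hypothetical, and the sample package built here is deliberately simplified (a real OpenDocument file also contains a `mimetype` entry, a manifest and other parts).

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# OpenDocument XML namespaces for the office and text vocabularies.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def build_sample_package(paragraphs):
    """Build a simplified, in-memory ODF-like ZIP package (hypothetical helper).

    Paragraph strings are assumed to contain no XML special characters.
    """
    body = "".join(f"<text:p>{p}</text:p>" for p in paragraphs)
    content = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<office:document-content xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", content.encode("utf-8"))
    return buffer.getvalue()


def read_paragraphs(package_bytes):
    """Decompress content.xml and return the text of each paragraph element."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


package = build_sample_package(["First paragraph", "Second paragraph"])
print(read_paragraphs(package))
```

To read an existing document instead, the bytes of the file could be passed to `read_paragraphs`, assuming the file is an OpenDocument text package with a `content.xml` entry.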
[10641400] |==== Portable ====
[10641410] |A distribution of OpenOffice.org called OpenOffice.org Portable is designed to run the suite from a [[USB flash drive]].
[10641420] |==== OxygenOffice Professional ====
[10641430] |An enhancement of OpenOffice.org (current version: 2.4), providing:
[10641440] |* Possibility to run Visual Basic for Applications (VBA) macros in Calc (for testing)
[10641450] |* Improved Calc HTML export
[10641460] |* Enhanced Access support for Base
[10641470] |* Security fixes
[10641480] |* Enhanced performance
[10641490] |* Enhanced color-palette
[10641500] |* Enhanced help menu, additional User’s Manual, and extended tips for beginners
[10641510] |Optionally it provides, free for personal and professional use:
[10641520] |* More than 3,200 graphics, both clip art and photos.
[10641530] |* Several templates and sample documents
[10641540] |* Over 90 free fonts.
[10641550] |* Additional tools like OOoWikipedia
[10641560] |====Extensions====
[10641570] |Since version 2.0.4, OpenOffice.org has supported extensions in a similar manner to [[Mozilla Firefox]].
[10641580] |Extensions make it easy to add new functionality to an existing OpenOffice.org installation.
[10641590] |The [http://extensions.services.openoffice.org/most_pop_ext OpenOffice.org Extension Repository] already lists more than 80 extensions.
[10641600] |Developers can easily build new extensions for OpenOffice.org, for example by using the [http://wiki.services.openoffice.org/wiki/OpenOffice_NetBeans_Integration OpenOffice.org API Plugin for NetBeans].
[10641610] |==== The OpenOffice.org Bibliographic Project ====
[10641620] |This aims to incorporate powerful [[reference management software]] into the suite.
[10641630] |The new major addition is slated for inclusion in the standard OpenOffice.org release between late 2007 and mid-2008, or possibly later depending upon the availability of programmers.
[10641640] |=== Security ===
[10641650] |OpenOffice.org includes a security team, and as of June 2008 the security organization [[Secunia]] reports no known unpatched security flaws for the software.
[10641660] |[[Kaspersky Lab]] has shown a [[proof of concept]] virus for OpenOffice.org.
[10641670] |This shows OOo viruses are possible, but there is no known virus "in the wild".
[10641680] |In a private meeting of the French Ministry of Defense, macro-related security issues were raised.
[10641690] |OpenOffice.org developers have responded and noted that the supposed vulnerability had not been announced through "well defined procedures" for disclosure and that the ministry had revealed nothing specific.
[10641700] |However, the developers have been in talks with the researcher concerning the supposed vulnerability.
[10641710] |As with Microsoft Word, documents created in OpenOffice can contain [[metadata]] which may include a complete history of what was changed, when and by whom.
[10641720] |== Ownership ==
[10641730] |The project and software are informally referred to as ''OpenOffice'', but project organizers report that this term is a [[trademark]] held by another party, requiring them to adopt ''OpenOffice.org'' as the formal name.
[10641740] |(Due to a similar trademark issue, the [[Brazilian Portuguese]] version of the suite is distributed under the name ''BrOffice.org''.)
[10641750] |Development is managed by staff members of StarOffice.
[10641760] |Some delay and difficulty in implementing external contributions to the core codebase (even those from the project's corporate sponsors) has been noted.
[10641770] |Currently, there are [http://wiki.services.openoffice.org/wiki/DerivedWorks several derived and/or proprietary works based on OOo], with some of them being:
[10641780] |* Sun Microsystems' [[StarOffice]], with various complementary add-ons.
[10641790] |* IBM's [[Lotus Symphony]], with a new interface based on [[Eclipse (software)|Eclipse]] (based on OO.o 1.x).
[10641800] |* OpenOffice.org Novell edition, integrated with [[Novell Evolution|Evolution]] and with an [[OOXML]] filter.
[10641810] |* Beijing [[Redflag]] Chinese 2000's [[RedOffice]], fully localized in Chinese characters.
[10641820] |* Planamesa's [[NeoOffice]] for [[Mac OS X]] with Aqua support via Java.
[10641830] |On [[May 23]], [[2007]], the OpenOffice.org community and Redflag Chinese 2000 Software Co, Ltd. announced a joint development effort focused on integrating the new features that have been added in the RedOffice localization of OpenOffice.org, as well as quality assurance and work on the core applications.
[10641840] |Additionally, Redflag Chinese 2000 made public its commitment to the global OO.o community stating it would "strengthen its support of the development of the world's leading free and open source productivity suite", adding around 50 engineers (that have been working on RedOffice since 2006) to the project.
[10641850] |On [[September 10]], [[2007]], the OO.o community announced that [[IBM]] had joined to support the development of OpenOffice.org.
[10641860] |"IBM will be making initial code contributions that it has been developing as part of its Lotus Notes product, including accessibility enhancements, and will be making ongoing contributions to the feature richness and code quality of OpenOffice.org.
[10641870] |Besides working with the community on the free productivity suite's software, IBM will also leverage OpenOffice.org technology in its products" as has been seen with [[Lotus Symphony]].
[10641880] |Sean Poulley, the vice president of business and strategy in IBM's [[Lotus Software]] division said that IBM plans to take a leadership role in the OpenOffice.org community together with other companies such as Sun Microsystems.
[10641890] |IBM will work within the leadership structure that exists.
[10641900] |On [[October 02]], [[2007]], [[Michael Meeks]] announced a derived OpenOffice.org work (drawing a response from Sun's [[Simon Phipps]] and Mathias Bauer), under the wing of his employer [[Novell]], with the purpose of including new features and fixes that do not get easily integrated in the OOo-build up-stream core.
[10641910] |The work is called Go-OO (http://go-oo.org/), a name under which alternative OO.o software has been available for five years.
[10641920] |The new features are shared with Novell's edition of OOo and include:
[10641930] |* [[Visual Basic for Applications|VBA]] macros support.
[10641940] |* Faster start up time.
[10641950] |* "A [[Linear programming|linear optimization]] solver to optimize a cell value based on arbitrary constraints built into Calc".
[10641960] |* Support for multimedia content in documents, using the [[gstreamer]] multimedia framework.
[10641970] |* Support for importing [[Microsoft Works]] formats, [[WordPerfect]] graphics (WPG format) and T602 files.
[10641980] |[http://wiki.services.openoffice.org/wiki/Contributing_Patches Details about the patch handling including metrics] can be found on the OpenOffice.org site.
[10641990] |== Reactions ==
[10642000] |''Federal Computer Week'' listed OpenOffice.org as one of the "5 stars of open-source products."
[10642010] |In contrast, OpenOffice.org was used in [[2005]] by ''[[The Guardian]]'' newspaper to illustrate what it claims are the limitations of open-source software, although the article does finish by stating that the software may be better than MS Word for books.
[10642020] |=== Market share ===
[10642030] |It is extremely difficult to estimate the market share of OpenOffice.org because it can be freely distributed via download sites including mirrors, peer-to-peer networks, CDs, Linux distros, etc.
[10642040] |Nevertheless, the OpenOffice.org project tries to capture key adoption data in a market share analysis.
[10642050] |Although Microsoft Office retains 95% of the general market as measured by revenue, OpenOffice.org and StarOffice have secured 14% of the large enterprise market as of 2004 and 19% of the small to midsize business market in 2005.
[10642060] |The OpenOffice.org web site reports more than 98 million downloads.
[10642070] |Other large scale users of OpenOffice.org include [[Ministry of Defence (Singapore)|Singapore’s Ministry of Defence]], and [[Bristol]] City Council in the UK.
[10642080] |In [[France]], OpenOffice.org has attracted the attention of both local and national government administrations who wish to rationalize their software procurement, as well as have stable, standard file formats for archival purposes.
[10642090] |It is now the official office suite for the [[French Gendarmerie]].
[10642100] |Several government organizations in India that use Linux, such as [[IIT Bombay]] (a renowned technical institute), the [[Supreme Court of India]] and the [[Allahabad High Court]], rely completely on OpenOffice.org for their administration.
[10642110] |On [[October 4]], [[2005]], Sun and [[Google]] announced a strategic partnership.
[10642120] |As part of this agreement, Sun will add a Google search bar to OpenOffice.org, Sun and Google will engage in joint marketing activities as well as joint research and development, and Google will help distribute OpenOffice.org.
[10642130] |Google is currently distributing StarOffice as part of the [[Google Pack]].
[10642140] |Besides StarOffice, there are still a number of OpenOffice.org derived commercial products.
[10642150] |Most of them are developed under the [[SISSL]] license (which is valid up to OpenOffice.org 2.0 Beta 2).
[10642160] |In general they are targeted at local or niche markets, with proprietary add-ons such as a speech recognition module, automatic database connection, or better [[CJK]] support.
[10642170] |In July 2007 Everex, a division of First International Computer and the 9th largest PC supplier in the U.S., began shipping systems preloaded with OpenOffice.org 2.2 into Wal-Mart and Sam's Club throughout North America.
[10642180] |In September 2007 IBM announced that it would supply and support OpenOffice.org branded as [[Lotus Symphony]], and integrated into Lotus Notes.
[10642190] |IBM also announced 35 developers would be assigned to work on OpenOffice.org, and that it would join the OpenOffice.org foundation.
[10642200] |Commentators noted parallels between IBM's 2000 support of Linux and this announcement.
[10642210] |=== Java controversy ===
[10642220] |In the past OpenOffice.org was criticized for an increasing dependency on the [[Java Runtime Environment]] which was not [[free software]].
[10642230] |That Sun Microsystems is both the creator of Java and the chief supporter of OpenOffice.org drew accusations of ulterior motives for this technology choice.
[10642240] |Version 1 depended on the [[Java Runtime Environment]] (JRE) being present on the user’s computer for some auxiliary functions, but version 2 increased the suite’s use of Java, requiring a JRE.
[10642250] |In response, [[Red Hat]] increased their efforts to improve [[free Java implementations]].
[10642260] |Red Hat’s [[Fedora (Linux distribution)|Fedora Core]] 4 (released on [[June 13]], [[2005]]) included a beta version of OpenOffice.org version 2, running on [[GNU Compiler for Java|GCJ]] and [[GNU Classpath]].
[10642270] |The issue of OpenOffice.org’s use of Java came to the fore in May 2005, when [[Richard Stallman]] appeared to call for a [[fork (software)|fork]] of the application in a posting on the [[Free Software Foundation]] website.
[10642280] |This led to discussions within the OpenOffice.org community and between Sun staff and developers involved in [[GNU Classpath]], a free replacement for Sun’s Java implementation.
[10642290] |Later that year, the OpenOffice.org developers also placed into their development guidelines various requirements to ensure that future versions of OpenOffice.org could be run on free implementations of Java and fixed the issues which previously prevented OpenOffice.org 2.0 from using free software Java implementations.
[10642300] |On [[November 13]], [[2006]], Sun committed to releasing Java under the [[GNU General Public License]] in the near future.
[10642310] |This process would end OpenOffice.org's dependence on [[non-free]] software.
[10642320] |Between November 2006 and May 2007, Sun Microsystems made available most of their Java technologies under the GNU General Public License, in compliance with the specifications of the Java Community Process, thus making almost all of Sun's Java also free software.
[10642330] |The following areas of OpenOffice.org 2.0 depend on the JRE being present:
[10642340] |* The [[media player (application software)|media player]] on Unix-like systems
[10642350] |* All document wizards in Writer
[10642360] |* Accessibility tools
[10642370] |* Report Autopilot
[10642380] |* [[JDBC]] driver support
[10642390] |* [[Hsqldb|HSQL]] database engine, which is used in OpenOffice.org Base
[10642400] |* [[XSLT]] filters
[10642410] |* [[BeanShell]], the [[NetBeans]] scripting language and the Java UNO bridge
[10642420] |* Export filters to the Aportis.doc (.pdb) format for the [[Palm OS]] or [[Pocket Word]] (.psw) format for the [[Pocket PC]]
[10642430] |* Export filter to [[LaTeX]]
[10642440] |* Export filter to [[MediaWiki]]'s [[wikitext]]
[10642450] |A common point of confusion is that [[mail merge]] to generate emails requires the Java API JavaMail in [[StarOffice]]; however, as of version 2.0.1, OpenOffice.org uses a [[Python (programming language)|Python]]-component instead.
[10642460] |=== Complementary software ===
[10642470] |OpenOffice.org provides replacements for MS Office's [[Microsoft Word]], [[Microsoft Excel]], [[Microsoft PowerPoint]], [[Microsoft Access]], [[Equation Editor|Microsoft Equation Editor]] and [[Microsoft Visio]].
[10642480] |To match the functionality of the rest of MS Office, OOo can be complemented with other open source programs such as:
[10642490] |* [[Novell Evolution|Evolution]] or [[Mozilla Thunderbird|Thunderbird]]/[[Lightning (software)|Lightning]] for a PIM like [[Microsoft Outlook]].
[10642500] |* [[OpenProj]] (which seeks integration with OOo, but might be limited due to licensing issues) for [[Microsoft Project]].
[10642510] |* [[Scribus]] for [[Microsoft Publisher]]
[10642520] |* [[O3spaces]] for [[Sharepoint]]
[10642530] |Microsoft also provides Administrative Template Files ("adm files") that allow MS Office to be configured using Windows Group Policy.
[10642540] |Equivalent functionality for OpenOffice.org is provided by [http://openoffice-enterprise.com/ OpenOffice-Enterprise], a commercial product from Open Office Technology, Inc.
[10642550] |=== Issues ===
[10642560] |OpenOffice.org has been criticized for slow start times and extensive CPU and RAM usage compared with competing software such as Microsoft Office.
[10642570] |Tests comparing OpenOffice.org 2.2 with Microsoft Office 2007 found that OpenOffice.org took approximately 2 times the processing time and memory to load itself along with a blank file, and approximately 4.7 times the processing time and 3.9 times the memory to open an extremely large spreadsheet file.
[10642580] |Critics have pointed to excessive code bloat and OpenOffice.org's loading of the [[Java Virtual Machine|Java Runtime Environment]] as possible reasons for the slow speeds and excessive memory usage.
[10642590] |However, since OpenOffice.org 2.2 the performance of OpenOffice.org has been improved dramatically.
[10642600] |Another major challenge is full compatibility with other applications.
[10642610] |Because OpenOffice.org must reverse-engineer proprietary binary formats for which no open specifications are available, slight formatting incompatibilities tend to appear when files are saved in a non-native format.
[10642620] |For example, a complex .doc document formatted under OpenOffice.org is usually not displayed with the correct formatting when opened with Microsoft Office.
[10642630] |== Retail ==
[10642640] |The [[free software license]] under which OpenOffice.org is distributed allows unlimited use of the software for both home and business use, including unlimited redistribution of the software.
[10642650] |Several businesses sell the OpenOffice.org suite on auction websites such as [[eBay]], offering value-added services such as 24/7 technical support, download mirrors, and CD mailing.
[10642660] |However, often the 24/7 support offered is not provided by the company selling the software, but rather by the official OpenOffice.org mailing list.
[10650010] |Parsing
[10650020] |In [[computer science]] and [[linguistics]], '''parsing''', or, more formally, '''syntactic analysis''', is the process of analyzing a sequence of [[Token (parser)|tokens]] to determine grammatical structure with respect to a given (more or less) [[formal grammar]].
[10650030] |A '''parser''' is thus one of the components in an [[interpreter]] or [[compiler]], where it captures the implied hierarchy of the input text and transforms it into a form suitable for further processing (often some kind of [[parse tree]], [[abstract syntax tree]] or other hierarchical structure) and normally checks for syntax errors at the same time.
[10650040] |The parser often uses a separate [[lexical analyser]] to create tokens from the sequence of input characters.
[10650050] |Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as [[Yet Another Compiler Compiler|Yacc]]) from a grammar written in [[Backus-Naur form]].
[10650060] |Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of [[Inflection|inflected]] languages, such as the [[Romance languages|Romance languages]] or [[Latin]].
[10650070] |Parsers can also be constructed as executable specifications of grammars in functional programming languages.
[10650080] |Frost, Hafiz and Callaghan have built on the work of others to construct a set of [[higher-order function]]s (called [[parser combinators]]) which allow top-down parsers with polynomial time and space complexity to be constructed as executable specifications of ambiguous grammars containing left-recursive productions.
[10650090] |The [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] site has more about the algorithms and implementation details.
[10650100] |== Human languages ==
[10650110] |:''Also see [[:Category:Natural language parsing]]''
[10650120] |In some [[machine translation]] and [[natural language processing]] systems, human languages are parsed by computer programs.
[10650130] |Human sentences are not easily parsed by programs, as there is substantial [[syntactic ambiguity|ambiguity]] in the structure of human language.
[10650140] |In order to parse natural language data, researchers must first agree on the [[grammar]] to be used.
[10650150] |The choice of syntax is affected by both [[linguistic]] and computational concerns; for instance some parsing systems use [[lexical functional grammar]], but in general, parsing for grammars of this type is known to be [[NP-complete]].
[10650160] |[[Head-driven phrase structure grammar]] is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn [[Treebank]].
[10650170] |[[Shallow parsing]] aims to find only the boundaries of major constituents such as noun phrases.
[10650180] |Another popular strategy for avoiding linguistic controversy is [[dependency grammar]] parsing.
[10650190] |Most modern parsers are at least partly [[statistics|statistical]]; that is, they rely on a corpus of training data which has already been annotated (parsed by hand).
[10650200] |This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts.
[10650210] |''(See [[machine learning]].)''
[10650220] |Approaches which have been used include straightforward [[PCFG]]s (probabilistic context free grammars), [[maximum entropy]], and [[neural net]]s.
[10650230] |Most of the more successful systems use ''lexical'' statistics (that is, they consider the identities of the words involved, as well as their [[part of speech]]).
[10650240] |However such systems are vulnerable to [[overfitting]] and require some kind of smoothing to be effective.
[10650250] |Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages.
[10650260] |As mentioned earlier some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not [[context-free]], some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the [[CKY algorithm]], usually with some [[heuristic (computer science)|heuristic]] to prune away unlikely analyses to save time.
[10650270] |''(See [[chart parsing]].)''
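As a rough illustration of the context-free pass described above, the following Python sketch is a plain CKY recognizer for a grammar in Chomsky normal form. The toy grammar, the lexicon and the name `cky_recognize` are assumptions made for this example only; a practical natural-language parser would attach probabilities to the rules and prune unlikely chart entries.

```python
from itertools import product

# Toy grammar in Chomsky normal form (invented for this sketch).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "saw": {"V"},
}

def cky_recognize(words, start="S"):
    """Return True if `words` can be derived from `start`."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    # Build longer spans from every split point k of shorter spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]

print(cky_recognize("the dog saw the cat".split()))   # True
print(cky_recognize("dog the saw cat the".split()))   # False
```

The three nested loops over span width, start position and split point give the cubic running time in sentence length that motivates the pruning heuristics mentioned above.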
[10650280] |However, some systems trade accuracy for speed, using, e.g., linear-time versions of the [[Shift-reduce parsing|shift-reduce]] algorithm.
[10650290] |A somewhat recent development has been [[parse reranking]] in which the parser proposes some large number of analyses, and a more complex system selects the best option.
[10650300] |The result of parsing is normally a branching structure, in which each part is divided into its subparts.
[10650310] |== Programming languages ==
[10650320] |The most common use of a parser is as a component of a [[compiler]] or [[interpreter]].
[10650330] |This parses the [[source code]] of a [[computer programming language]] to create some form of internal representation.
[10650340] |Programming languages tend to be specified in terms of a [[context-free grammar]] because fast and efficient parsers can be written for them.
[10650350] |Parsers are written by hand or generated by [[parser generator]]s.
[10650360] |Context-free grammars are limited in the extent to which they can express all of the requirements of a language.
[10650370] |Informally, the reason is that the memory of such a language is limited.
[10650380] |The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced.
[10650390] |More powerful grammars that can express this constraint, however, cannot be parsed efficiently.
[10650400] |Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.
[10650410] |===Overview of process===
[10650420] |[[image:Parser_Flow.gif|right|Flow of data in a typical parser]] The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
[10650430] |The first stage is the token generation, or [[lexical analysis]], by which the input character stream is split into meaningful symbols defined by a grammar of [[regular expression]]s.
[10650440] |For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the context of an arithmetic expression.
[10650450] |The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
[10650460] |The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression.
[10650470] |This is usually done with reference to a [[context-free grammar]] which recursively defines components that can make up an expression and the order in which they must appear.
[10650480] |However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers.
[10650490] |These rules can be formally expressed with [[attribute grammar]]s.
[10650500] |The final phase is [[Semantic analysis (computer science)|semantic parsing]] or analysis, which is working out the implications of the expression just validated and taking the appropriate action.
[10650510] |In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code.
[10650520] |Attribute grammars can also be used to define these actions.
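The three stages described above can be sketched in Python for the calculator input used in the example. This is a minimal illustration rather than a reference implementation: the names `tokenize`, `Parser`, `parse` and `evaluate` are hypothetical, and the grammar (with ^ binding more tightly than *, * more tightly than +, and ^ associating to the right) is an assumption, since the text does not specify operator precedence.

```python
import re

# A token is a run of digits or one of the single-character symbols.
TOKEN_RE = re.compile(r"\s*(\d+|[+*^()])")

def tokenize(text):
    """Lexical analysis: split the character stream into tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """Syntactic analysis by recursive descent.

    expr  -> term ('+' term)*
    term  -> power ('*' power)*
    power -> atom ('^' power)?
    atom  -> NUMBER | '(' expr ')'
    """
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, found {token!r}")
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = ("+", node, self.term())
        return node

    def term(self):
        node = self.power()
        while self.peek() == "*":
            self.take("*")
            node = ("*", node, self.power())
        return node

    def power(self):
        base = self.atom()
        if self.peek() == "^":
            self.take("^")
            return ("^", base, self.power())
        return base

    def atom(self):
        token = self.take()
        if token.isdigit():
            return int(token)
        if token == "(":
            node = self.expr()
            self.take(")")
            return node
        raise SyntaxError(f"unexpected token {token!r}")

def parse(text):
    """Return a parse tree as nested (operator, left, right) tuples."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

def evaluate(node):
    """Semantic action for a calculator: compute the value of the tree."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b if op == "*" else a ** b

print(tokenize("12*(3+4)^2"))         # ['12', '*', '(', '3', '+', '4', ')', '^', '2']
print(parse("12*(3+4)^2"))            # ('*', 12, ('^', ('+', 3, 4), 2))
print(evaluate(parse("12*(3+4)^2")))  # 588
```

A compiler would replace `evaluate` with a function that walks the same tree and emits code instead of computing a value.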
[10650530] |==Types of parsers==
[10650540] |The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar.
[10650550] |This can be done in essentially two ways:
[10650560] |*[[Top-down parsing]] - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for [[parse tree|parse-trees]] using a top-down expansion of the given [[formal grammar]] rules.
[10650570] |Tokens are consumed from left to right.
[10650580] |Inclusive choice is used to accommodate [[ambiguity]] by expanding all alternative right-hand sides of grammar rules.
[10650590] |[[LL parser]]s and [[recursive-descent parser]]s are examples of top-down parsers that cannot accommodate [[left recursion | left recursive]] productions.
[10650600] |Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous [[context-free grammar]]s, a more sophisticated algorithm for top-down parsing has been created by Frost, Hafiz, and Callaghan which accommodates [[ambiguity]] and [[left recursion]] in polynomial time and which generates polynomial-size representations of the potentially exponential number of parse trees.
[10650610] |Their algorithm is able to produce both left-most and right-most derivations of an input w.r.t. a given CFG.
[10650620] |*[[Bottom-up parsing]] - A parser can start with the input and attempt to rewrite it to the start symbol.
[10650630] |Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on.
[10650640] |[[LR parser]]s are examples of bottom-up parsers.
[10650650] |Another term used for this type of parser is Shift-Reduce parsing.
[10650660] |Another important distinction is whether the parser generates a ''leftmost derivation'' or a ''rightmost derivation'' (see [[context-free grammar]]).
[10650670] |LL parsers will generate a leftmost [[derivation]] and LR parsers will generate a rightmost derivation (although usually in reverse).
[10650680] |== Examples of parsers ==
[10650690] |=== Top-down parsers ===
[10650700] |Some of the parsers that use [[top-down parsing]] include:
[10650710] |* [[Recursive descent parser]]
[10650720] |* [[LL parser]] ('''L'''eft-to-right, '''L'''eftmost derivation)
[10650730] |* [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] - eXecutable SpecificAtIons of GrAmmars.
[10650740] |The site contains publications related to a top-down parsing algorithm that supports left-recursion and ambiguity in polynomial time and space.
[10650750] |=== Bottom-up parsers ===
[10650760] |Some of the parsers that use [[bottom-up parsing]] include:
[10650770] |* Precedence parser
[10650780] |** [[Operator-precedence parser]]
[10650790] |** [[Simple precedence parser]]
[10650800] |* BC (bounded context) parsing
[10650810] |* [[LR parser]] ('''L'''eft-to-right, '''R'''ightmost derivation)
[10650820] |** [[SLR parser|Simple LR (SLR) parser]]
[10650830] |** [[LALR parser]]
[10650840] |** [[Canonical LR parser|Canonical LR (LR(1)) parser]]
[10650850] |** [[GLR parser]]
[10650860] |* [[CYK algorithm|CYK parser]]
[10660010] |Lexical category
[10660020] |In [[grammar]], a '''lexical category''' (also '''word class''', '''lexical class''', or in traditional grammar '''part of speech''') is a linguistic category of words (or more precisely ''lexical items''), which is generally defined by the [[syntactic]] or [[morphology (linguistics)|morphological]] behaviour of the lexical item in question.
[10660030] |Common linguistic categories include ''noun'' and ''verb'', among others.
[10660040] |There are [[open class word|open word classes]], which constantly acquire new members, and [[closed class word|closed word classes]], which acquire new members infrequently if at all.
[10660050] |Different languages may have different lexical categories, or they might associate different properties to the same one.
[10660060] |For example, [[Japanese language|Japanese]] has at least three classes of adjectives where English has one; Chinese and Japanese have [[measure word]]s while European languages have nothing resembling them; many languages don't have a distinction between adjectives and adverbs, or adjectives and nouns, etc.
[10660070] |Many linguists argue that the formal distinctions between parts of speech must be made within the framework of a specific language or language family, and should not be carried over to other languages or language families.
[10660080] |==History==
[10660090] |The classification of words into lexical categories is found from the earliest moments in the [[history of linguistics]].
[10660100] |In the ''[[Nirukta]]'', written in the [[5th century BCE|5th]] or [[6th century BCE]], the [[Sanskrit grammarian]] [[Yāska]] defined four main categories of words:
[10660110] |# nāma - [[noun]]s or substantives
[10660120] |# ākhyāta - [[verb]]s
[10660130] |# upasarga - pre-verbs or [[prefix]]es
[10660140] |# nipāta - [[Grammatical particle|particle]]s, invariant words (perhaps [[prepositions]])
[10660150] |These four were grouped into two large classes: [[inflection|inflected]] (nouns and verbs) and uninflected (pre-verbs and particles).
[10660160] |A century or two later, the [[Classical Greece|Greek]] scholar [[Plato]] wrote in the [[Cratylus (dialogue)|''Cratylus'' dialog]] that "... sentences are, I conceive, a combination of verbs [''rhēma''] and nouns [''ónoma'']".
[10660170] |Another class, "conjunctions" (covering [[Grammatical conjunction|conjunction]]s, [[pronoun]]s, and the [[article (grammar)|article]]), was later added by [[Aristotle]].
[10660180] |By the end of the [[2nd century BCE]], the classification scheme had been expanded into eight categories, seen in the ''[[Art of Grammar|Tékhnē grammatiké]]'':
[10660190] |# Noun: a part of speech inflected for case, signifying a concrete or abstract entity
[10660200] |# Verb: a part of speech without case inflection, but inflected for tense, person and number, signifying an activity or process performed or undergone
[10660210] |# Participle: a part of speech sharing the features of the verb and the noun
[10660220] |# Article: a part of speech inflected for case and preposed or postposed to nouns (the relative pronoun is meant by the postposed article)
[10660230] |# Pronoun: a part of speech substitutable for a noun and marked for person
[10660240] |# Preposition: a part of speech placed before other words in composition and in syntax
[10660250] |# Adverb: a part of speech without inflection, in modification of or in addition to a verb
[10660260] |# Conjunction: a part of speech binding together the discourse and filling gaps in its interpretation
[10660270] |The [[Latin grammar]]ian [[Priscian]] ([[floruit|fl.]] [[500 CE]]) modified the above eight-fold system, substituting "[[interjection]]" for "article".
[10660280] |It wasn't until 1767 that the [[adjective]] was taken as a separate class.
[10660290] |Traditional English grammar is patterned after the European tradition above, and is still taught in schools and used in [[dictionaries]].
[10660300] |It names eight parts of speech: [[noun]], [[verb]], [[adjective]], [[adverb]], [[pronoun]], [[preposition]], [[Grammatical conjunction|conjunction]], and [[interjection]] (sometimes called an exclamation).
[10660310] |==Controversies==
[10660320] |Since the Greek grammarians of the 2nd century BCE, parts of speech have been defined by [[morphology (linguistics)|morphological]], [[syntax|syntactic]] and [[semantics|semantic]] criteria.
[10660330] |However, there is currently no generally agreed-upon classification scheme that can apply to all languages, or even a set of criteria upon which such a scheme should be based.
[10660340] |Linguists recognize that the above list of eight word classes is simplified and artificial.
[10660350] |For example, "adverb" is to some extent a catch-all class that includes words with many different functions.
[10660360] |Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.
[10660370] |==Functional classification==
[10660380] |Common ways of delimiting words by function include:
[10660390] |* '''[[Open word classes]]:'''
[10660400] |**[[adjective]]s
[10660410] |**[[adverb]]s
[10660420] |**[[interjection]]s
[10660430] |**[[noun]]s
[10660440] |**[[verb]]s (except [[auxiliary verb]]s)
[10660450] |* '''[[Closed word classes]]:'''
[10660460] |**[[auxiliary verb]]s
[10660470] |**[[clitic]]s
[10660480] |**[[coverb]]s
[10660490] |**[[Grammatical conjunction|conjunction]]s
[10660500] |**[[determiner (class)|Determiner]]s ([[article (grammar)|article]]s, [[quantifier]]s, [[demonstrative adjective]]s, and [[possessive adjective]]s)
[10660510] |**[[grammatical particle|particle]]s
[10660520] |**[[measure word]]s
[10660530] |**[[adposition]]s (prepositions, postpositions, and circumpositions)
[10660540] |**[[preverb]]s
[10660550] |**[[pronoun]]s
[10660560] |**[[Contraction (grammar)|contraction]]s
[10660570] |**[[Names of numbers in English#Cardinal numbers|cardinal numbers]]
[10660580] |==English==
[10660590] |[[English language|English]] frequently does not [[marker (linguistics)|mark]] words as belonging to one part of speech or another.
[10660600] |Words like ''neigh'', ''break'', ''outlaw'', ''laser'', ''microwave'' and ''telephone'' might all be either verb forms or nouns.
[10660610] |Although ''-ly'' is an adverb marker, not all adverbs end in ''-ly'' and not all words ending in ''-ly'' are adverbs.
[10660620] |For instance, ''tomorrow'', ''slow'', ''fast'', ''crosswise'' can all be adverbs, while ''early'', ''friendly'', ''ugly'' are all adjectives (though ''early'' can also function as an adverb).
[10660630] |In certain circumstances, even words with primarily grammatical functions can be used as verbs or nouns, as in "We must look to the ''hows'' and not just the ''whys''" or "Miranda was ''to-ing and fro-ing'' and not paying attention".
[10670010] |Part-of-speech tagging
[10670020] |'''Part-of-speech tagging''' ('''POS tagging''' or '''POST'''), also called '''grammatical tagging''', is the process of marking up each word in a text as corresponding to a particular [[parts of speech|part of speech]], based on both its definition and its context, i.e. its relationship with adjacent and related words in a [[phrase]], [[sentence]], or [[paragraph]].
[10670030] |A simplified form of this is commonly taught to school-age children, in the identification of words as [[noun]]s, [[verb]]s, [[adjective]]s, [[adverb]]s, etc.
[10670040] |Once performed by hand, POS tagging is now done in the context of [[computational linguistics]], using [[algorithms]] which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
[10670050] |==History==
[10670060] |Research on part-of-speech tagging has been closely tied to [[corpus linguistics]].
[10670070] |The first major corpus of English for computer analysis was the [[Brown Corpus]] developed at [[Brown University]] by [[Henry Kucera]] and [[Nelson Francis]], in the mid-1960s.
[10670080] |It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications.
[10670090] |Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
[10670100] |The [[Brown Corpus]] was painstakingly "tagged" with part-of-speech markers over many years.
[10670110] |A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all.
[10670120] |For example, article then noun can occur, but article verb (arguably) cannot.
[10670130] |The program got about 70% correct.
[10670140] |Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases even human speakers might not agree on).
[10670150] |This corpus has been used for innumerable studies of word-frequency and of part-of-speech, and inspired the development of similar "tagged" corpora in many other languages.
[10670160] |Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and [[VOLSUNGA]].
[10670170] |However, as of 2005 it has been superseded by larger corpora such as the 100-million-word [[British National Corpus]].
[10670180] |For some time, part-of-speech tagging was considered an inseparable part of [[natural language processing]], because there are certain cases where the correct part of speech cannot be decided without understanding the [[semantics]] or even the [[pragmatics]] of the context.
[10670190] |This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
[10670200] |In the mid 1980s, researchers in Europe began to use [[hidden Markov model]]s (HMMs) to disambiguate parts of speech, when working to tag the [[Lancaster-Oslo-Bergen Corpus]] of British English.
[10670210] |HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences.
[10670220] |For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
[10670230] |Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal.
[10670240] |The same method can of course be used to benefit from knowledge about following words.
[10670250] |More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger sequences.
[10670260] |So, for example, if you've just seen an article and a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb.
[10670270] |When several ambiguous words occur together, the possibilities multiply.
[10670280] |However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn.
[10670290] |The combination with highest probability is then chosen.
[10670300] |The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
[10670310] |It is worth remembering, as [[Eugene Charniak]] points out in ''Statistical techniques for natural language parsing'' [http://www.cs.brown.edu/people/ec/home.html], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns, will approach 90% accuracy because many words are unambiguous.
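The baseline described above can be sketched in a few lines of Python. The tiny training set, the tag names and the function names `train_baseline` and `tag_baseline` are invented for illustration; they are not taken from Charniak's paper, and a toy corpus of this size says nothing about the accuracy figure quoted.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Record the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, unknown_tag="PROPN"):
    """Tag known words with their commonest tag, unknowns as proper nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]

# Hypothetical hand-tagged training data.
TRAINING = [
    [("the", "ART"), ("can", "NOUN"), ("is", "VERB"), ("full", "ADJ")],
    [("she", "PRON"), ("can", "MODAL"), ("swim", "VERB")],
    [("he", "PRON"), ("can", "MODAL"), ("run", "VERB")],
]

model = train_baseline(TRAINING)
print(tag_baseline(["She", "can", "see", "Miranda"], model))
# [('She', 'PRON'), ('can', 'MODAL'), ('see', 'PROPN'), ('Miranda', 'PROPN')]
```

The example also shows the baseline's weaknesses: the unknown verb "see" is wrongly tagged as a proper noun, and "can" always receives its commonest tag regardless of context.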
[10670320] |CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities.
[10670330] |It sometimes had to resort to backup methods when there were simply too many (the [[Brown Corpus]] contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).
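The enumeration strategy can be sketched as follows. This is only an illustration of the idea of scoring every tag combination, not a reconstruction of CLAWS; all probabilities below are invented, and the names `sequence_probability`, `best_tagging_by_enumeration` and `count_candidates` are hypothetical.

```python
from itertools import product
from math import prod

# Invented values of P(tag | previous tag); "<s>" marks the sentence start.
TRANSITION = {
    ("<s>", "ART"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("ART", "NOUN"): 0.8, ("ART", "VERB"): 0.01, ("ART", "MODAL"): 0.001,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2, ("NOUN", "MODAL"): 0.2,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.1,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
}
# Invented values of P(word | tag) for the tags each word may take.
EMISSION = {
    "the": {"ART": 0.5},
    "can": {"NOUN": 0.01, "VERB": 0.005, "MODAL": 0.3},
    "rusts": {"NOUN": 0.001, "VERB": 0.01},
}

def sequence_probability(words, tags):
    """Multiply together the probabilities of each choice in turn."""
    probability, previous = 1.0, "<s>"
    for word, tag in zip(words, tags):
        probability *= TRANSITION.get((previous, tag), 0.0) * EMISSION[word][tag]
        previous = tag
    return probability

def best_tagging_by_enumeration(words):
    """Score every combination of tags and keep the most probable one."""
    candidates = product(*(EMISSION[w] for w in words))
    return max(candidates, key=lambda tags: sequence_probability(words, tags))

def count_candidates(words):
    """Number of combinations that have to be scored."""
    return prod(len(EMISSION[w]) for w in words)

sentence = ["the", "can", "rusts"]
print(best_tagging_by_enumeration(sentence))  # ('ART', 'NOUN', 'VERB')
print(count_candidates(sentence))             # 6
```

The number of candidates is the product of the number of possible tags for each word, so a run of many ambiguous words quickly becomes too large to enumerate.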
[10670340] |In 1987, [[Steve DeRose]] and [[Ken Church]] independently developed [[dynamic programming]] algorithms to solve the same problem in vastly less time.
[10670350] |Their methods were similar to the [[Viterbi algorithm]] known for some time in other fields.
[10670360] |DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus).
[10670370] |Both methods achieved accuracy over 95%.
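The dynamic-programming idea can be illustrated with a small Viterbi-style tagger over a table of tag pairs. It is a sketch under stated assumptions, not a reconstruction of DeRose's or Church's programs: the probabilities are invented, and the function name `viterbi` is hypothetical.

```python
# Invented values of P(tag | previous tag); "<s>" marks the sentence start.
TRANSITION = {
    ("<s>", "ART"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("ART", "NOUN"): 0.8, ("ART", "VERB"): 0.01, ("ART", "MODAL"): 0.001,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2, ("NOUN", "MODAL"): 0.2,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.1,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
}
# Invented values of P(word | tag) for the tags each word may take.
EMISSION = {
    "the": {"ART": 0.5},
    "can": {"NOUN": 0.01, "VERB": 0.005, "MODAL": 0.3},
    "rusts": {"NOUN": 0.001, "VERB": 0.01},
}

def viterbi(words):
    """Return (tag sequence, probability) of the most probable tagging."""
    # best[tag] = (probability, path) of the best partial tagging ending in tag.
    best = {"<s>": (1.0, [])}
    for word in words:
        new_best = {}
        for tag, emit in EMISSION[word].items():
            # Only the best way of reaching each tag needs to be remembered.
            new_best[tag] = max(
                (prob * TRANSITION.get((prev, tag), 0.0) * emit, path + [tag])
                for prev, (prob, path) in best.items()
            )
        best = new_best
    probability, path = max(best.values())
    return path, probability

tags, probability = viterbi(["the", "can", "rusts"])
print(tags)  # ['ART', 'NOUN', 'VERB']
```

Because only one best path per tag is kept at each word, the work grows linearly with sentence length (times the square of the number of tags per word) instead of multiplying with every additional ambiguous word.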
[10670380] |DeRose's 1990 dissertation at [[Brown University]] included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
[10670390] |These findings were surprisingly disruptive to the field of [[Natural Language Processing]].
[10670400] |The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on.
[10670410] |CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare.
[10670420] |This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well.
[10670430] |Markov Models are now the standard method for part-of-speech assignment.
[10670440] |The methods already discussed involve working from a pre-existing corpus to learn tag probabilities.
[10670450] |It is, however, also possible to [[Bootstrapping (linguistics)|bootstrap]] using "unsupervised" tagging.
[10670460] |Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction.
[10670470] |That is, they observe patterns in word use, and derive part-of-speech categories themselves.
[10670480] |For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones.
[10670490] |With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
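The observation that words of the same category occur in similar contexts can be checked on a toy scale. The sketch below, which assumes a six-sentence invented corpus and uses only the immediately preceding and following word as context, compares words by the cosine similarity of their context counts; real unsupervised taggers use far larger corpora and cluster the resulting vectors iteratively.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """Count, for each word, which words appear directly left and right of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]][("L", padded[i - 1])] += 1
            vectors[padded[i]][("R", padded[i + 1])] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[feature] for feature, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Invented, untagged corpus.
CORPUS = [
    "we eat the apple",
    "we eat a pear",
    "they eat an orange",
    "the child saw a dog",
    "a child saw the cat",
    "they eat the bread",
]

vectors = context_vectors(CORPUS)
print(round(cosine(vectors["the"], vectors["a"]), 3))    # 0.645
print(round(cosine(vectors["the"], vectors["eat"]), 3))  # 0.0
```

In this corpus "the" and "a" share several contexts, while "the" and "eat" share none, so a clustering step would place the two articles in the same induced class.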
[10670500] |Both supervised and unsupervised taggers can be further subdivided into rule-based, stochastic, and neural approaches.
[10670510] |Some current major algorithms for '''part-of-speech tagging''' include the [[Viterbi algorithm]], [[Brill Tagger]], and the [[Baum-Welch algorithm]] (also known as the forward-backward algorithm).
[10670520] |[[Hidden Markov model]] and [[visible Markov model]] taggers can both be implemented using the [[Viterbi algorithm]].
[10680010] |Pattern recognition
[10680020] |'''Pattern recognition''' is a sub-topic of [[machine learning]].
[10680030] |It can be defined as
[10680040] |:"the act of taking in raw data and taking an action based on the [[Category (taxonomy)|category]] of the data".
[10680050] |Most research in pattern recognition is about methods for [[supervised learning]] and [[unsupervised learning]].
[10680060] |Pattern recognition aims to classify [[data]] ([[pattern]]s) based on either ''[[A priori and a posteriori (philosophy)|a priori]]'' knowledge or on [[statistics|statistical]] information extracted from the patterns.
[10680070] |The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate [[space (mathematics)|multidimensional space]].
[10680080] |This is in contrast to '''[[pattern matching]]''', where the pattern is rigidly specified.
[10680090] |==Overview==
[10680100] |A complete pattern recognition system consists of a [[sensor]] that gathers the observations to be classified or described; a [[feature extraction]] mechanism that computes numeric or symbolic information from the observations; and a [[statistical classification|classification]] or description scheme that does the actual job of classifying or describing observations, relying on the extracted features.
[10680110] |The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described.
[10680120] |This set of patterns is termed the [[training set]] and the resulting learning strategy is characterized as [[supervised learning]].
[10680130] |Learning can also be [[unsupervised learning|unsupervised]], in the sense that the system is not given an ''a priori'' labeling of patterns; instead, it establishes the classes itself based on the statistical regularities of the patterns.
[10680140] |The classification or description scheme usually uses one of two approaches: [[statistical classification|statistical]] (or decision-theoretic) or [[syntactic pattern recognition|syntactic]] (or structural).
[10680150] |Statistical pattern recognition is based on statistical characterisations of patterns, assuming that the patterns are generated by a [[probabilistic]] system.
[10680160] |Syntactical (or structural) pattern recognition is based on the structural interrelationships of features.
[10680170] |A wide range of algorithms can be applied for pattern recognition, from very simple [[Naive Bayes classifier|Bayesian classifiers]] to much more powerful [[Artificial neural network|neural networks]].
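As an example from the simple end of this range, the following Python sketch trains a naive Bayes classifier on a labeled training set and uses it to classify new text. The four training messages, the labels and the function names `train_naive_bayes` and `classify` are invented for illustration; splitting a message into lower-case words stands in for the feature extraction step, and add-one smoothing is an assumed choice.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(examples):
    """Supervised learning: count words per label in the training set."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():  # feature extraction
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = log(count / total)  # log prior of the label
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training set of already classified patterns.
TRAINING_SET = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

model = train_naive_bayes(TRAINING_SET)
print(classify("cheap money", model))             # spam
print(classify("agenda for the meeting", model))  # ham
```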
[10680180] |An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).
[10680190] |Pattern recognition is more complex when templates are used to generate variants.
[10680200] |For example, in English, sentences often follow the "N-VP" (noun - verb phrase) pattern, but some knowledge of the English language is required to detect the pattern.
[10680210] |Pattern recognition is studied in many fields, including [[psychology]], [[ethology]], and [[computer science]].
[10680220] |[[Holographic associative memory]] is another type of pattern matching scheme, in which a small target pattern can be searched for in a large set of learned patterns based on cognitive meta-weight.
[10680230] |==Uses==
[10680240] |Within medical science, pattern recognition forms the basis for [[computer-aided diagnosis]] (CAD) systems.
[10680250] |CAD describes a procedure that supports the doctor's interpretations and findings.
[10680260] |Typical applications are automatic [[speech recognition]], [[document classification|classification of text into several categories]] (e.g. spam/non-spam email messages), the [[handwriting recognition|automatic recognition of handwritten postal codes]] on postal envelopes, or the [[facial recognition system|automatic recognition of images]] of human faces.
[10680270] |The last two examples form the subtopic [[image analysis]] of pattern recognition that deals with digital images as input to pattern recognition systems.
[10690010] |Phrase
[10690020] |In [[grammar]], a '''phrase''' is a group of [[word]]s that functions as a single unit in the [[syntax]] of a [[Sentence (linguistics)|sentence]].
[10690030] |For example ''the house at the end of the street'' (example 1) is a phrase.
[10690040] |It acts like a noun.
[10690050] |It contains the phrase ''at the end of the street'' (example 2), a prepositional phrase which acts like an adjective.
[10690060] |Example 2 could be replaced by ''white'', to make the phrase ''the white house''.
[10690070] |Examples 1 and 2 contain the phrase ''the end of the street'' (example 3) which acts like a noun.
[10690080] |It could be replaced by ''the cross-roads'' to give ''the house at the cross-roads''.
[10690090] |Most phrases have a central word which defines the type of phrase.
[10690100] |This word is called the [[head (linguistics)|head]] of the phrase.
[10690110] |In English the head is often the first word of the phrase.
[10690120] |Some phrases, however, can be headless.
[10690130] |For example, ''the rich'' is a noun phrase composed of a determiner and an adjective, but no noun.
[10690140] |Phrases may be classified by the type of head they take:
[10690150] |*[[Prepositional phrase]] (PP) with a [[preposition]] as head (e.g. ''in love'', ''over the rainbow'').
[10690160] |Languages that use [[postposition]]s instead have [[postpositional phrase]]s.
[10690170] |The two types are sometimes referred to collectively as [[adpositional phrase]]s.
[10690180] |*[[Noun phrase]] (NP) with a [[noun]] as head (e.g. ''the black cat'', ''a cat on the mat'')
[10690190] |*[[Verb phrase]] (VP) with a [[verb]] as head (e.g. ''eat cheese'', ''jump up and down'')
[10690200] |*[[Adjectival phrase]] with an [[adjective]] as head (e.g. ''full of toys'')
[10690210] |*[[Adverbial phrase]] with an [[adverb]] as head (e.g. ''very carefully'')
[10690220] |== Formal definition ==
[10690230] |A '''phrase''' is a [[syntax|syntactic]] structure which has syntactic properties derived from its [[head (linguistics)|head]].
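The idea that a phrase takes its type from its head can be sketched as a small data structure. The lexicon, the class name and the labels AdjP and AdvP are assumptions made for this illustration, and the position of the head is supplied by hand, not detected.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon mapping words to parts of speech.
LEXICON = {
    "the": "determiner", "black": "adjective", "cat": "noun",
    "eat": "verb", "cheese": "noun", "in": "preposition", "love": "noun",
    "very": "adverb", "carefully": "adverb",
}

# Phrase label projected by each kind of head (AdjP and AdvP are assumed labels).
PHRASE_TYPE = {
    "noun": "NP", "verb": "VP", "preposition": "PP",
    "adjective": "AdjP", "adverb": "AdvP",
}

@dataclass
class Phrase:
    words: list
    head_index: int  # position of the head, supplied by the analyst

    @property
    def head(self):
        return self.words[self.head_index]

    @property
    def phrase_type(self):
        # The type of the phrase is derived from the head's part of speech.
        return PHRASE_TYPE[LEXICON[self.head]]

print(Phrase(["the", "black", "cat"], 2).phrase_type)
print(Phrase(["in", "love"], 0).phrase_type)
```

Changing the head index changes the type of the whole phrase, which is the sense in which its syntactic properties are derived from the head.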
[10690240] |== Complexity ==
[10690250] |A complex phrase consists of several words, whereas a simple phrase consists of only one word.
[10690260] |This terminology is especially often used with [[verb]] phrases:
[10690270] |* the simple past and simple present are simple verb phrases, which require just one verb
[10690280] |* complex verb phrases have one or two [[grammatical aspect|aspect]]s added, and hence require two or three additional words
[10690290] |"Complex", which is phrase-level, is often confused with "[[compound (linguistics)|compound]]", which is [[word]]-level.
[10690300] |However, there are certain phenomena that formally seem to be phrases but semantically are more like compounds, like "women's magazines", which has the form of a possessive noun phrase, but which refers (just like a compound) to one specific [[lexeme]] (i.e. a magazine for women and not some magazine owned by a woman).
[10690310] |== Semiotic approaches to the concept of "phrase" ==
[10690320] |In more [[semiotic]] approaches to language, such as the more cognitivist versions of [[construction grammar]], a phrasal structure is not only a certain formal combination of word types whose features are inherited from the head.
[10690330] |Here each phrasal structure also expresses some type of [[concept]]ual content, be it specific or abstract.
[10700010] |Portuguese language
[10700020] |'''Portuguese''' (''língua portuguesa'') is a [[Romance language]] that originated in what is now [[Galicia (Spain)]] and [[Portugal|northern Portugal]] from the [[Latin language|Latin]] spoken by [[Romanization (cultural)|romanized]] [[Pre-Roman peoples of the Iberian Peninsula]] (namely the [[Gallaeci]], the [[Lusitanians]], the [[Celtici]] and the [[Conii]]) about 2000 years ago.
[10700030] |It spread worldwide in the 15th and 16th centuries as Portugal established a [[Portuguese Empire|colonial and commercial empire]] (1415–1999) which spanned from [[Brazil]] in the [[Americas]] to [[Goa]] in [[India]] and [[Macau]] in [[China]]. In fact, it was used exclusively on the island of [[Sri Lanka]] as the [[lingua franca]] for almost 350 years.
[10700040] |During that time, many [[Portuguese Creole|creole languages based on Portuguese]] also appeared around the world, especially in [[Africa]], [[Asia]], and the [[Caribbean]].
[10700050] |Today it is one of the world's major languages, [[List of languages by number of native speakers|ranked 6th]] according to number of native speakers (approximately 177 million).
[10700060] |It is the language with the largest number of speakers in [[South America]], spoken by nearly all of Brazil's population, which amounts to over 51% of the continent's population even though it is the only Portuguese-speaking nation in [[the Americas]].
[10700070] |It is also a major lingua franca in Portugal's former colonial possessions in Africa.
[10700080] |It is the official language of ten countries (see the table on the right), also being co-official with [[Spanish language|Spanish]] and [[French language|French]] in [[Equatorial Guinea]], with [[Standard Cantonese|Cantonese]] [[Chinese language|Chinese]] in the Chinese special administrative region of [[Macau]], and with [[Tetum]] in [[East Timor]].
[10700090] |There are sizable communities of Portuguese-speakers in various regions of North America, notably in the [[United States]] ([[New Jersey]], [[New England]] and south [[Florida]]) and in [[Ontario]], [[Canada]].
[10700100] |[[Spain|Spanish]] author [[Miguel de Cervantes]] once called Portuguese "the sweet language", while Brazilian writer [[Olavo Bilac]] poetically described it as ''a última flor do Lácio, inculta e bela'': "the last flower of [[Latium]], wild and beautiful".
[10700110] |==Geographic distribution==
[10700120] |Today, Portuguese is the [[official language]] of [[Angola]], [[Brazil]], [[Cape Verde]], [[Guinea-Bissau]], [[Portugal]], [[São Tomé and Príncipe]] and [[Mozambique]].
[10700130] |It is also one of the official languages of [[Equatorial Guinea]] (with [[Spanish language|Spanish]] and [[French language|French]]), the [[Special Administrative Region of the People's Republic of China|Chinese special administrative region]] of [[Macau]] (with [[Chinese language|Chinese]]), and [[East Timor]], (with [[Tetum]]).
[10700140] |It is a [[First language|native language]] of most of the population in Portugal (100%), Brazil (99%), Angola (60%), and São Tomé and Príncipe (50%), and it is spoken by a [[plurality]] of the population of Mozambique (40%), though only 6.5% are native speakers.
[10700150] |No data is available for Cape Verde, but almost all the population is bilingual, and the monolingual population speaks [[Cape Verdean Creole]].
[10700160] |Small Portuguese-speaking communities subsist in former overseas colonies of Portugal such as Macau, where it is spoken as a first language by 0.6% of the population, and East Timor.
[10700170] |[[Uruguay]] gave Portuguese an equal status to Spanish in its educational system along its northern border with Brazil.
[10700180] |In the rest of the country, it is taught as a compulsory subject beginning in the 6th grade.
[10700190] |It is also spoken by substantial immigrant communities, though not official, in [[Andorra]], [[France]], [[Luxembourg]], [[Jersey]] (with a statistically significant Portuguese-speaking community of approximately 10,000 people), [[Paraguay]], [[Namibia]], [[South Africa]], [[Switzerland]], [[Venezuela]] and in the [[U.S.]] states of [[California]], [[Connecticut]], [[Florida]], [[Massachusetts]], [[New Jersey]], [[New York]] and [[Rhode Island]].
[10700200] |In some parts of India, such as [[Goa]] and [[Daman and Diu]], Portuguese is still spoken.
[10700210] |There are also significant populations of Portuguese speakers in [[Canada]] (mainly concentrated in and around [[Toronto]]), [[Bermuda]] and [[Netherlands Antilles]].
[10700220] |Portuguese is an official language of several international organizations.
[10700230] |The [[Community of Portuguese Language Countries]] (with the Portuguese acronym CPLP) consists of the eight independent countries that have Portuguese as an official language.
[10700240] |It is also an official language of the [[European Union]], [[Mercosul]], the [[Organization of American States]], the [[Organization of Ibero-American States]], the [[Union of South American Nations]], and the [[African Union]] (one of the working languages) and one of the official languages of other organizations.
[10700250] |The Portuguese language is gaining popularity in Africa, Asia, and South America as a second language for study.
[10700260] |Portuguese and Spanish are the fastest-growing European languages, and, according to estimates by UNESCO, Portuguese is the language with the highest potential for growth as an international language in southern Africa and South America.
[10700270] |The Portuguese-speaking African countries are expected to have a combined population of 83 million by 2050.
[10700280] |Since 1991, when Brazil joined the Mercosul economic bloc with other South American nations, such as Argentina, Uruguay, and Paraguay, there has been an increase in interest in the study of Portuguese in those South American countries.
[10700290] |The demographic weight of Brazil in the continent will continue to strengthen the presence of the language in the region.
[10700300] |Although in the early 21st century, after Macau was ceded to China in 1999, the use of Portuguese was in decline in Asia, it is becoming a language of opportunity there, mostly because of East Timor's boost in the number of speakers in the last five years but also because of increased Chinese diplomatic and financial ties with Portuguese-speaking countries.
[10700310] |In July 2007, President Teodoro Obiang Nguema announced his government's decision to make Portuguese [[Equatorial Guinea]]'s third official language, in order to meet the requirements to apply for full membership of the [[Community of Portuguese Language Countries]].
[10700320] |This upgrading from its current Associate Observer status would give Equatorial Guinea access to several professional and academic exchange programs and facilitate the cross-border circulation of citizens.
[10700330] |Its application is currently being assessed by other CPLP members.
[10700340] |In March 1994 the [[Bosque de Portugal]] (Portugal's Woods) was founded in the Brazilian city of [[Curitiba]].
[10700350] |The park houses the Portuguese Language Memorial, which honors the Portuguese immigrants and the countries that adopted the Portuguese language.
[10700360] |Originally there were seven nations represented with pillars, but the independence of [[East Timor]] brought yet another pillar for that nation in 2007.
[10700370] |In March 2006, the [[Museum of the Portuguese Language]], an interactive museum about the Portuguese language, was founded in [[São Paulo]], Brazil, the city with the largest number of Portuguese speakers in the world.
[10700380] |==Dialects==
[10700390] |Portuguese is a [[pluricentric language]] with two main groups of [[dialect]]s, those of [[Brazil]] and those of the [[Old World]].
[10700400] |For historical reasons, the dialects of Africa and Asia are generally closer to those of Portugal than the Brazilian dialects, although in some aspects of their phonetics, especially the pronunciation of unstressed vowels, they resemble [[Brazilian Portuguese]] more than [[European Portuguese]].
[10700410] |They have not been studied as widely as European and Brazilian Portuguese.
[10700420] |Audio samples of some dialects of Portuguese are available below.
[10700430] |There are some differences between the areas but these are the best approximations possible.
[10700440] |For example, the ''caipira'' dialect has some differences from the one of Minas Gerais, but in general it is very close.
[10700450] |A good example of Brazilian Portuguese may be found in the capital city, [[Brasília]], because its population comes from all parts of the country.
[10700460] |'''[[Angola]]'''
[10700470] |# ''Benguelense'' — [[Benguela]] province.
[10700480] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som85.html ''Luandense''] — [[Luanda]] province.
[10700490] |# ''Sulista'' — South of Angola.
[10700500] |'''[[Brazil]]'''
[10700510] |# ''[[Caipira]]'' — States of [[São Paulo (state)|São Paulo]] (countryside; the city of São Paulo and the eastern areas of the state have their own dialect, called ''paulistano''); southern [[Minas Gerais]], northern [[Paraná (state)|Paraná]], [[Goiás]] and [[Mato Grosso do Sul]].
[10700520] |# ''Cearense'' — [[Ceará]].
[10700530] |# ''Baiano'' — [[Bahia]].
[10700540] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som90.html ''Fluminense''] — Variants spoken in the states of [[Rio de Janeiro (state)|Rio de Janeiro]] and [[Espírito Santo]] (excluding the city of Rio de Janeiro and its adjacent metropolitan areas, which have their own dialect, called ''[[carioca]]'').
[10700550] |# ''[[Gaucho|Gaúcho]]'' — [[Rio Grande do Sul]].
[10700560] |(There are many distinct accents in Rio Grande do Sul, mainly due to the heavy influx of European immigrants of diverse origins, who settled several colonies throughout the state.)
[10700570] |# ''[[Mineiro]]'' — [[Minas Gerais]] (not prevalent in the [[Triângulo Mineiro]], southern and southeastern [[Minas Gerais]]).
[10700580] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som91.html ''Nordestino''] — [[Northeast Region, Brazil|northeastern states of Brazil]] ([[Pernambuco]] and [[Rio Grande do Norte]] have a particular way of speaking).
[10700590] |# ''Nortista'' — [[Amazon Basin]] states.
[10700600] |# ''Paulistano'' — Variants spoken around [[São Paulo]] city and the eastern areas of São Paulo state.
[10700610] |# ''Sertanejo'' — States of [[Goiás]] and [[Mato Grosso]] (the city of [[Cuiabá]] has a particular way of speaking).
[10700620] |# ''Sulista'' — Variants spoken in the areas between the northern regions of [[Rio Grande do Sul]] and southern regions of São Paulo state.
[10700630] |(The cities of [[Curitiba]], [[Florianópolis]], and [[Itapetininga]] have fairly distinct accents as well.)
[10700640] |'''[[Portugal]]'''
[10700650] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som69.html ''Açoriano''] (Azorean) — [[Azores]].
[10700660] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som40.html ''Alentejano''] — [[Alentejo]]
[10700670] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som44.html ''Algarvio''] — [[Algarve]] (there is a particular dialect in a small part of western Algarve).
[10700680] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som1.html ''Alto-Minhoto''] — North of [[Braga]] (hinterland).
[10700690] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som49.html ''Baixo-Beirão''; ''Alto-Alentejano''] — Central Portugal (hinterland).
[10700700] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som9.html ''Beirão''] — Central Portugal.
[10700710] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som22.html ''Estremenho''] — Regions of [[Coimbra]] and [[Lisbon]] (the Lisbon dialect has some peculiar features not shared with the one of Coimbra).
[10700720] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som60.html ''Madeirense''] (Madeiran) — [[Madeira]].
[10700730] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som14.html ''Nortenho''] — Regions of Braga and [[Porto]].
[10700740] |# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som6.html ''Transmontano''] — [[Trás-os-Montes e Alto Douro]].
[10700750] |Other countries
[10700760] |* '''[[Cape Verde]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som87.html ''Português cabo-verdiano''] ([[Cape Verdean Portuguese]])
[10700770] |* '''[[Daman and Diu]]''', India — ''Damaense''.
[10700780] |* '''[[East Timor]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som84.html ''Timorense''] ([[East Timorese Portuguese|East Timorese]])
[10700790] |* '''[[Goa]]''', India — ''Goês''.
[10700800] |* '''[[Guinea-Bissau]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som88.html ''Guineense''] ([[Guinean Portuguese]]).
[10700810] |* '''[[Macau]]''', China — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som92.html ''Macaense''] ([[Macanese Portuguese|Macanese]])
[10700820] |* '''[[Mozambique]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som89.html ''Moçambicano''] ([[Mozambican Portuguese|Mozambican]])
[10700830] |* '''[[São Tomé and Príncipe]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som83.html ''Santomense'']
[10700840] |* '''[[Uruguay]]''' — [[Riverense Portuñol language|''Dialectos Portugueses del Uruguay (DPU)'']].
[10700850] |Differences between dialects are mostly of [[accent (linguistics)|accent]] and [[vocabulary]], but between the Brazilian dialects and other dialects, especially in their most colloquial forms, there can also be some grammatical differences.
[10700860] |The [[Portuguese creole|Portuguese-based creole]]s spoken in various parts of Africa, Asia, and the Americas are independent languages which should not be confused with Portuguese itself.
[10700870] |==History==
[10700880] |Arriving in the Iberian Peninsula in 216 BC, the Romans brought with them the [[Latin language]], from which all Romance languages descend.
[10700890] |The language was spread by arriving Roman soldiers, settlers and merchants, who built Roman cities mostly near the settlements of previous civilizations.
[10700900] |Between AD 409 and 711, as the Roman Empire collapsed in Western Europe, the Iberian Peninsula was conquered by Germanic peoples ([[Migration Period]]).
[10700910] |The occupiers, mainly [[Suebi]] and [[Visigoths]], quickly adopted late Roman culture and the [[Vulgar Latin]] dialects of the peninsula.
[10700920] |After the [[Moors|Moorish]] invasion of 711, [[Arabic language|Arabic]] became the administrative language in the conquered regions, but most of the population continued to speak a form of [[Romance languages|Romance]] commonly known as [[Mozarabic]].
[10700930] |The influence exerted by Arabic on the Romance dialects spoken in the Christian kingdoms of the north was small, affecting mainly their lexicon.
[10700940] |The earliest surviving records of a distinctively Portuguese language are administrative documents of the 9th century, still interspersed with many Latin phrases.
[10700950] |Today this phase is known as Proto-Portuguese (between the 9th and the 12th centuries).
[10700960] |In the first period of Old Portuguese — [[Galician-Portuguese]] Period (from the 12th to the 14th century) — the language gradually came into general use.
[10700970] |For some time, it was the language of preference for [[lyric poetry]] in Christian Hispania, much like [[Occitan]] was the language of the [[Occitan literature#Poetry_of_the_troubadours|poetry of the troubadours]].
[10700980] |Portugal was formally recognized as an independent kingdom by the [[Kingdom of Leon]] in 1143, with [[Afonso I of Portugal|Afonso Henriques]] as king.
[10700990] |In 1290, King [[Denis of Portugal|Dinis]] created the first Portuguese university in Lisbon (the ''Estudos Gerais'', later moved to [[Coimbra]]) and decreed that Portuguese, then simply called the "common language", should be known as the Portuguese language and used officially.
[10701000] |In the second period of Old Portuguese, from the 14th to the 16th century, with the [[Age of discovery|Portuguese discoveries]], the language was taken to many regions of [[Asia]], [[Africa]] and the [[Americas]] (nowadays, the great majority of Portuguese speakers live in Brazil, in South America).
[10701010] |By the 16th century it had become a ''[[lingua franca]]'' in Asia and Africa, used not only for colonial administration and trade but also for communication between local officials and Europeans of all nationalities.
[10701020] |Its spread was helped by mixed marriages between Portuguese and local people, and by its association with [[Roman Catholic]] [[missionary]] efforts, which led to the formation of a [[creole language]] called [[Kristang language|Kristang]] in many parts of Asia (from the word ''cristão'', "Christian").
[10701030] |The language continued to be popular in parts of Asia until the 19th century.
[10701040] |Some Portuguese-speaking Christian communities in [[India]], [[Sri Lanka]], [[Malaysia]], and [[Indonesia]] preserved their language even after they were isolated from Portugal.
[10701050] |The end of the Old Portuguese period was marked by the publication of the ''Cancioneiro Geral'' by [[Garcia de Resende]], in 1516.
[10701060] |The early period of Modern Portuguese, which spans from the 16th century to the present day, was characterized by an increase in the number of learned words borrowed from Classical Latin and Classical Greek since the Renaissance, which greatly enriched the lexicon.
[10701070] |===Characterization===
[10701080] |A distinctive feature of Portuguese is that it preserved the stressed vowels of [[Vulgar Latin]], which became diphthongs in other Romance languages; cf. Fr. ''pierre'', Sp. ''piedra'', It. ''pietra'', Port. ''pedra'', from Lat. ''petra''; or Sp. ''fuego'', It. ''fuoco'', Port. ''fogo'', from Lat. ''focum''.
[10701090] |Another characteristic of early Portuguese was the loss of [[:wiktionary:intervocalic|intervocalic]] ''l'' and ''n'', sometimes followed by the merger of the two surrounding vowels, or by the insertion of an [[epenthesis|epenthetic vowel]] between them: cf. Lat. ''salire'', ''tenere'', ''catena'', Sp. ''salir'', ''tener'', ''cadena'', Port. ''sair'', ''ter'', ''cadeia''.
[10701100] |When the [[elision|elided]] consonant was ''n'', it often [[nasalization|nasalized]] the preceding vowel: cf. Lat. ''manum'', ''rana'', ''bonum'', Port. ''mão'', ''rãa'', ''bõo'' (now ''mão'', ''rã'', ''bom'').
[10701110] |This process was the source of most of the nasal diphthongs which are typical of Portuguese.
[10701120] |In particular, the Latin endings ''-anem'', ''-anum'' and ''-onem'' became ''-ão'' in most cases, cf. Lat. ''canem'', ''germanum'', ''rationem'' with Modern Port. ''cão'', ''irmão'', ''razão'', and their plurals ''-anes'', ''-anos'', ''-ones'' normally became ''-ães'', ''-ãos'', ''-ões'', cf. ''cães'', ''irmãos'', ''razões''.
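The ending correspondences listed above can be written down as a small lookup table. This is only a sketch of the regular pattern: it returns the expected Portuguese ending and does not attempt to derive the whole word, since the stems underwent further changes (compare ''germanum'' and ''irmão''). The function and table names are hypothetical.

```python
# Regular correspondences between Latin and Portuguese endings, as listed
# in the text. The text notes they hold "in most cases", not always.
ENDING_MAP = {
    "anem": "ão", "anum": "ão", "onem": "ão",
    "anes": "ães", "anos": "ãos", "ones": "ões",
}

def portuguese_ending(latin_form):
    """Return the Portuguese ending expected for a Latin form, or None."""
    for latin_ending, portuguese in ENDING_MAP.items():
        if latin_form.endswith(latin_ending):
            return portuguese
    return None

for latin in ["canem", "germanum", "rationem", "canes", "germanos", "rationes"]:
    print(latin, "->", "-" + portuguese_ending(latin))
```

Three distinct Latin singular endings collapse into a single Portuguese ending, while the plurals remain distinct, which is why Portuguese nouns in ''-ão'' have three possible plural forms.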
[10701130] |===Movement to make Portuguese an official language of the UN===
[10701140] |There is a growing number of people in the Portuguese-speaking media and on the internet who are presenting the case to the CPLP and other organizations to run a debate in the [[Lusophone]] community with the purpose of bringing forward a petition to make Portuguese an official language of the United Nations.
[10701150] |In October 2005, during the international convention of the [http://www.elosinternacional.com.br/index.htm Elos Club International] that took place in Tavira, Portugal, a petition was written and unanimously approved; its text can be found on the internet under the title ''Petição Para Tornar Oficial o Idioma Português na ONU''.
[10701160] |Romulo Alexandre Soares, president of the Brazil-Portugal Chamber, highlights that the positioning of Brazil in the international arena as one of the emergent powers of the 21st century, the size of its population, and the presence of the language around the world provide legitimacy and justify a petition to make Portuguese an official language of the UN.
[10701170] |==Vocabulary==
[10701180] |Most of the lexicon of Portuguese is derived from Latin.
[10701190] |Nevertheless, because of the [[Moors|Moorish]] occupation of the [[Iberian Peninsula]] during the Middle Ages, and the participation of Portugal in the [[Age of Discovery]], it has adopted loanwords from all over the world.
[10701200] |Very few Portuguese words can be traced to the [[Pre-Roman peoples of the Iberian Peninsula|pre-Roman inhabitants of Portugal]], which included the [[Gallaeci]], [[Lusitanians]], [[Celtici]] and [[Cynetes]].
[10701210] |The [[Phoenicians]] and [[Carthaginians]], briefly present, also left some scarce traces.
[10701220] |Some notable examples are ''abóbora'' "pumpkin" and ''bezerro'' "year-old calf", from the nearby [[Celtiberian language]] (probably through the Celtici); ''cerveja'' "beer", from [[Celtic languages|Celtic]]; ''saco'' "bag", from [[Phoenician language|Phoenician]]; and ''cachorro'' "dog, puppy", from [[Basque language|Basque]].
[10701230] |In the 5th century, the Iberian Peninsula (the [[Ancient Rome|Roman]] [[Hispania]]) was conquered by the [[Germanic peoples|Germanic]] [[Suevi]] and [[Visigoths]].
[10701240] |As they adopted the Roman civilization and language, however, these people contributed only a few words to the lexicon, mostly related to warfare — such as ''espora'' "spur", ''estaca'' "stake", and ''guerra'' "war", from [[Gothic language|Gothic]] ''*spaúra'', ''*stakka'', and ''*wirro'', respectively.
[10701250] |Between the 9th and 15th centuries Portuguese acquired about 1000 words from [[Arabic language|Arabic]] by influence of [[al-Andalus|Moorish Iberia]].
[10701260] |They are often recognizable by the initial Arabic article ''a''(''l'')''-'', and include many common words such as ''aldeia'' "village" from الضيعة ''aldaya'', ''alface'' "lettuce" from الخس ''alkhass'', ''armazém'' "warehouse" from المخزن ''almahazan'', and ''azeite'' "olive oil" from زيت ''azzait''.
[10701270] |From Arabic came also the grammatically peculiar word [[Insha'Allah|''oxalá'']] "hopefully".
[10701280] |The Mozambican currency name [[Mozambican Metical|''metical'']] was derived from the word مطقال ''miṭqāl'', a unit of weight.
[10701290] |The word Mozambique itself is from the Arabic name of sultan Muça Alebique (Musa Alibiki).
[10701300] |The name of the Portuguese town of [[Fátima, Portugal|Fátima]] comes from the name of one of the daughters of the prophet [[Muhammad]].
[10701310] |Starting in the 15th century, the Portuguese maritime explorations led to the introduction of many loanwords from [[Asia]]n languages.
[10701320] |For instance, ''catana'' "cutlass" from Japanese ''katana''; ''corja'' "rabble" from Malay ''kórchchu''; and ''chá'' "tea" from [[Chinese language|Chinese]] [[Tea#The word tea|''chá'']].
[10701330] |From South America came ''batata'' "[[potato]]", from [[Taino]]; ''ananás'' and ''abacaxi'', from [[Tupi-Guarani]] ''naná'' and [[Tupi language|Tupi]] ''ibá cati'', respectively (two species of [[pineapple]]), and ''tucano'' "[[toucan]]" from [[Guarani language|Guarani]] ''tucan''.
[10701340] |See [[List of Brazil state name etymologies]], for some more examples.
[10701350] |From the 16th to the 19th century, because of the role of Portugal as intermediary in the [[Atlantic slave trade]] and the establishment of large Portuguese colonies in Angola, Mozambique, and Brazil, Portuguese acquired several words of African and [[indigenous peoples of Brazil|Amerind]] origin, especially names for most of the animals and plants found in those territories.
[10701360] |While those terms are mostly used in the former colonies, many became current in European Portuguese as well.
[10701370] |From [[Kimbundu language|Kimbundu]], for example, came ''kifumate'' → ''cafuné'' "head caress", ''kusula'' → ''caçula'' "youngest child", ''marimbondo'' "tropical wasp", and ''kubungula'' → ''bungular'' "to dance like a wizard".
[10701380] |Finally, it has received a steady influx of loanwords from other European languages.
[10701390] |For example, ''melena'' "hair lock", ''fiambre'' "wet-cured ham" (in contrast with ''presunto'' "dry-cured ham" from Latin ''prae-exsuctus'' "dehydrated"), and ''castelhano'' "Castilian", from Spanish; ''colchete''/''crochê'' "bracket"/"crochet", ''paletó'' "jacket", ''batom'' "lipstick", and ''filé''/''filete'' "steak"/"slice" respectively, from French ''crochet'', ''paletot'', ''bâton'', ''filet''; ''macarrão'' "pasta", ''piloto'' "pilot", ''carroça'' "carriage", and ''barraca'' "barrack", from Italian ''maccherone'', ''pilota'', ''carrozza'', ''baracca''; and ''bife'' "steak", ''futebol'', ''revólver'', ''estoque'', ''folclore'', from English ''beef'', ''football'', ''revolver'', ''stock'', ''folklore''.
[10701400] |==Classification and related languages==
[10701410] |Portuguese belongs to the [[West Iberian languages|West Iberian]] branch of the [[Romance language]]s, and it has special ties with the following members of this group:
[10701420] |* [[Galician language|Galician]] and the [[Fala language|Fala]], its closest relatives.
[10701430] |See below.
[10701440] |* [[Spanish language|Spanish]], the major language closest to Portuguese.
[10701450] |(See also [[Differences between Spanish and Portuguese]].)
[10701460] |* [[Mirandese language|Mirandese]], another West Iberian language spoken in Portugal.
[10701470] |* [[Judeo-Portuguese]] and [[Ladino language|Judeo-Spanish]], languages spoken by [[Sephardic Jew]]s, which remained close to Portuguese and Spanish.
[10701480] |Despite the obvious lexical and grammatical similarities between Portuguese and other Romance languages, it is not [[mutually intelligible]] with most of them.
[10701490] |Except in the case of Galician, Portuguese speakers will usually need some formal study of basic grammar and vocabulary before attaining a reasonable level of comprehension of those languages, and vice versa.
[10701500] |===Galician and the Fala===
[10701510] |The closest language to Portuguese is Galician, spoken in the autonomous community of Galicia (northwestern Spain).
[10701520] |The two were at one time a single language, known today as [[Galician-Portuguese]], but since the political separation of Portugal from Galicia they have diverged somewhat, especially in pronunciation and vocabulary.
[10701530] |Nevertheless, the core vocabulary and grammar of Galician are still noticeably closer to Portuguese than to Spanish.
[10701540] |In particular, like Portuguese, it uses the future subjunctive, the personal infinitive, and the synthetic pluperfect (see the section on the grammar of Portuguese, below).
[10701550] |Mutual intelligibility (estimated at 85% by R. A. Hall, Jr., 1989) is good between Galicians and northern Portuguese, but poorer between Galicians and speakers from central Portugal.
[10701560] |The Fala language is another descendant of Galician-Portuguese, spoken by a small number of people in the Spanish towns of Valverdi du Fresnu, As Ellas and Sa Martín de Trebellu (autonomous community of [[Extremadura]], near the border with Portugal).
[10701570] |===Influence on other languages===
[10701580] |Many languages have [[loanword|borrowed words]] from Portuguese, such as [[Bahasa Indonesia|Indonesian]], [[Sri Lanka]]n [[Sri Lanka Tamils (native)|Tamil]] and [[Sinhalese language|Sinhalese]] (see [[Sri Lanka Indo-Portuguese language|Sri Lanka Indo-Portuguese]]), [[Malay language|Malay]], [[Bengali language|Bengali]], [[English (language)|English]], [[Hindi]], [[Konkani language|Konkani]], [[Marathi language|Marathi]], [[Tetum language|Tetum]], [[Tsonga language|Xitsonga]], [[Papiamentu]], [[Japanese language|Japanese]], [[Barbadian|Bajan Creole]] (spoken in Barbados), [[Lanc-Patuá]] (spoken in northern Brazil) and [[Sranan Tongo]] (spoken in Suriname).
[10701590] |Portuguese left a strong influence on the ''[[Old Tupi|língua brasílica]]'', a [[Tupi-Guarani|Tupi-Guarani language]] which was the most widely spoken in [[Brazil]] until the 18th century, and on the language spoken around [[Sikka]] in [[Flores|Flores Island]], [[Indonesia]].
[10701600] |In nearby [[Larantuka]], Portuguese is used for prayers in [[Holy Week]] rituals.
[10701610] |The Japanese-Portuguese dictionary ''[[Nippo Jisho]]'' (1603) was the first dictionary of Japanese in a European language, a product of [[Society of Jesus|Jesuit]] missionary activity in [[Japan]].
[10701620] |Building on the work of earlier Portuguese missionaries, the ''Dictionarium Anamiticum, Lusitanum et Latinum'' (Annamite-Portuguese-Latin dictionary) of [[Alexandre de Rhodes]] (1651) introduced the modern [[Vietnamese alphabet|orthography of Vietnamese]], which is based on the orthography of 17th-century Portuguese.
[10701630] |The [[Romanization]] of [[Chinese language|Chinese]] was also influenced by the Portuguese language (among others), particularly regarding [[List of common Chinese surnames|Chinese surnames]]; one example is ''Mei''.
[10701640] |See also [[List of English words of Portuguese origin]], [[Loan words in Indonesian]], [[Japanese words of Portuguese origin]], [[Malay_language#Borrowed_words|Borrowed words in Malay]], [[Sinhala words of Portuguese origin]], [[Loan words in Sri Lankan Tamil#Portuguese|Loan words from Portuguese in Sri Lankan Tamil]].
[10701650] |===Derived languages===
[10701660] |Beginning in the 16th century, the extensive contacts between Portuguese travelers and settlers, African slaves, and local populations led to the appearance of many [[pidgin]]s with varying amounts of Portuguese influence.
[10701670] |As these pidgins became the mother tongue of succeeding generations, they evolved into fully fledged [[creole language]]s, which remained in use in many parts of Asia and Africa until the 18th century.
[10701680] |Some Portuguese-based or Portuguese-influenced creoles are still spoken today, by over 3 million people worldwide, especially people of partial [[Portuguese people|Portuguese]] ancestry.
[10701690] |== Phonology ==
[10701700] |Portuguese has at most 9 oral vowels and 19 consonants, though some varieties of the language have fewer phonemes (Brazilian Portuguese has only 8 oral vowel [[phone]]s).
[10701710] |There are also five nasal vowels, which some linguists regard as allophones of the oral vowels, ten oral [[diphthong]]s, and five nasal diphthongs.
[10701720] |===Vowels===
[10701730] |To the seven vowels of [[Vulgar Latin]], European Portuguese has added two [[Mid-centralized vowel|near central vowels]], one of which tends to be [[elision|elided]] in [[relaxed pronunciation|rapid speech]], like the ''e caduc'' of [[French language|French]] (represented either as {{IPA|/ɯ̽/}}, or {{IPA|/ɨ/}}, or {{IPA|/ə/}}).
[10701740] |The close-mid vowels {{IPA|/e o/}} and the open-mid vowels {{IPA|/ɛ ɔ/}} are four distinct phonemes, and they alternate in various forms of [[apophony]].
[10701750] |Like [[Catalan language|Catalan]], Portuguese uses vowel quality to contrast stressed syllables with unstressed syllables: isolated vowels tend to be [[Vowel#Height|raised]], and in some cases centralized, when unstressed.
[10701760] |Nasal diphthongs occur mostly at the end of words.
[10701770] |===Consonants===
[10701780] |The consonant inventory of Portuguese is fairly conservative.
[10701790] |The medieval affricates {{IPA|/ts/}}, {{IPA|/dz/}}, {{IPA|/tʃ/}}, {{IPA|/dʒ/}} merged with the fricatives {{IPA|/s/}}, {{IPA|/z/}}, {{IPA|/ʃ/}}, {{IPA|/ʒ/}}, respectively, but not with each other, and there have been no other significant changes to the consonant phonemes since then.
[10701800] |However, some notable dialectal variants and [[allophone]]s have appeared, including:
[10701810] |*In many regions of Brazil, {{IPA|/t/}} and {{IPA|/d/}} have the affricate allophones {{IPA|[tʃ]}} and {{IPA|[dʒ]}}, respectively, before {{IPA|/i/}} and {{IPA|/ĩ/}}.
[10701820] |([[Quebec French]] has a similar phenomenon, with alveolar affricates instead of postalveolars.
[10701830] |[[Japanese language|Japanese]] is another example.)
[10701840] |*At the end of a syllable, the phoneme {{IPA|/l/}} has the allophone {{IPA|[u̯]}} in Brazilian Portuguese (''[[L-vocalization#L-vocalization|L-vocalization]]'').
[10701850] |*In many parts of Brazil and Angola, intervocalic {{IPA|/ɲ/}} is pronounced as a [[nasalization|nasalized]] [[palatal approximant]] {{IPA|[j̃]}} which nasalizes the preceding vowel, so that for instance {{IPA|/ˈniɲu/}} is pronounced {{IPA|[ˈnĩj̃u]}}.
[10701860] |*In most of Brazil, the alveolar sibilants {{IPA|/s/}} and {{IPA|/z/}} occur in complementary distribution at the end of syllables, depending on whether the consonant that follows is voiceless or voiced, as in English.
[10701870] |But in most of Portugal and parts of Brazil sibilants are postalveolar at the end of syllables, {{IPA|/ʃ/}} before voiceless consonants, and {{IPA|/ʒ/}} before voiced consonants (in [[Ladino language|Judeo-Spanish]], {{IPA|/s/}} is often replaced with {{IPA|/ʃ/}} at the end of syllables, too).
[10701880] |*There is considerable dialectal variation in the value of the [[Rhotic consonant|rhotic]] phoneme {{IPA|/ʁ/}}.
[10701890] |See [[Guttural R#Portuguese|Guttural R in Portuguese]] for details.
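Two of the allophonic rules listed above (affrication of {{IPA|/t/}} and {{IPA|/d/}} before {{IPA|/i/}} or {{IPA|/ĩ/}}, and the voicing of syllable-final sibilants) can be written as simple context-dependent rewrites. The following is only a toy sketch: the function names, the <code>postalveolar</code> flag and the set of voiceless consonants are assumptions made for illustration, and the code does not attempt syllabification or any of the other variation described in this section.

```python
import unicodedata

# Assumed, simplified set of voiceless consonants; any other consonant
# is treated as voiced for the purposes of this sketch.
VOICELESS = {"p", "t", "k", "f", "s", "ʃ"}


def affricate(consonant, next_vowel):
    """Return the allophone of /t/ or /d/ in varieties that affricate
    them before /i/ and /ĩ/; other inputs are returned unchanged."""
    vowel = unicodedata.normalize("NFC", next_vowel)
    high_front = {"i", unicodedata.normalize("NFC", "ĩ")}
    if vowel in high_front:
        return {"t": "tʃ", "d": "dʒ"}.get(consonant, consonant)
    return consonant


def coda_sibilant(next_consonant, postalveolar):
    """Return the syllable-final sibilant heard before next_consonant.

    postalveolar=False models varieties with alveolar [s]/[z];
    postalveolar=True models varieties with [ʃ]/[ʒ].
    """
    voiceless = next_consonant in VOICELESS
    if postalveolar:
        return "ʃ" if voiceless else "ʒ"
    return "s" if voiceless else "z"
```

For example, <code>affricate("t", "i")</code> gives <code>"tʃ"</code>, while <code>coda_sibilant("m", postalveolar=True)</code> gives <code>"ʒ"</code> because the following consonant is voiced.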
[10701900] |==Grammar==
[10701910] |A particularly interesting aspect of the grammar of Portuguese is the verb.
[10701920] |Morphologically, more verbal inflections from classical Latin have been preserved by Portuguese than by any other major Romance language.
[10701930] |See [[Romance copula#Morphological comparison|Romance copula]] for a detailed comparison.
[10701940] |It also has some innovations not found in other Romance languages (except Galician and the Fala):
[10701950] |* The [[present perfect tense]] has an iterative sense unique among the Romance languages.
[10701960] |It denotes an action or a series of actions which began in the past and are expected to keep repeating in the future.
[10701970] |For instance, the sentence ''Tenho tentado falar com ela'' would be translated to "I have been trying to talk to her", not "I have tried to talk to her".
[10701980] |On the other hand, the correct translation of the question "Have you heard the latest news?" is not ''*Tem ouvido a última notícia?'', but ''Ouviu a última notícia?'', since no repetition is implied.
[10701990] |* The future [[Subjunctive mood|subjunctive]] tense, which was developed by medieval [[West Iberian languages|West Iberian Romance]], but has now fallen into disuse in Spanish, is still used in [[vernacular]] Portuguese.
[10702000] |It appears in dependent clauses that denote a condition which must be fulfilled in the future for the event in the independent clause to occur.
[10702010] |Other languages normally employ the present tense under the same circumstances:
[10702020] |:''Se ''for'' eleito presidente, mudarei a lei.''
[10702030] |:If ''I am'' elected president, I will change the law.
[10702040] |:''Quando ''fores'' mais velho, vais entender.''
[10702050] |:When ''you are'' older, you will understand.
[10702060] |* The personal [[infinitive]]: infinitives can [[inflection|inflect]] according to their subject in [[Grammatical person|person]] and [[Grammatical number|number]], often showing who is expected to perform a certain action; cf. ''É melhor voltares'' "It is better [for you] to go back," ''É melhor voltarmos'' "It is better [for us] to go back."
[10702070] |Perhaps for this reason, infinitive clauses replace subjunctive clauses more often in Portuguese than in other Romance languages.
[10702080] |==Writing system==
[10702090] |Portuguese is written with the [[Latin alphabet]], making use of five [[diacritic]]s to denote stress, vowel height, contraction, nasalization, and other sound changes (acute accent, grave accent, circumflex accent, tilde, and cedilla).
[10702100] |[[Brazilian Portuguese]] also uses the diaeresis mark.
[10702110] |Accented characters and digraphs are not counted as separate letters for [[collation]] purposes.
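The collation rule can be approximated in code by comparing words on their base letters only. The sketch below is a minimal illustration, not a full locale-aware collation algorithm: the function name <code>collation_key</code> and the sample word list are assumptions chosen for the example, and ties between words that differ only in accents are broken arbitrarily by the original spelling.

```python
import unicodedata


def collation_key(word):
    """Sort key that ignores diacritics, so accented letters are
    ordered together with their base letters."""
    # NFD splits "á" into "a" plus a combining acute accent, "ç" into
    # "c" plus a combining cedilla, and so on.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks, keeping only the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Primary key: base letters without case distinctions.
    # Secondary key: the original word, used only to break ties.
    return (base.casefold(), word)


words = ["sinônimo", "sino", "água", "zebra", "chave", "ação", "cedo", "abrir"]
ordered = sorted(words, key=collation_key)
```

With this key, ''água'' sorts among the other words beginning with ''a'' rather than after ''zebra'', which is where a plain code-point sort would place it, and ''chave'' sorts under ''c'' because the digraph ''ch'' is not treated as a letter of its own.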
[10702120] |===Brazilian vs. European spelling===
[10702130] |There are some minor differences between the orthographies of Brazil and other Portuguese-speaking countries.
[10702140] |One of the most pervasive is the use of acute accents in the European/African/Asian orthography in many words such as ''sinónimo'', where the Brazilian orthography has a circumflex accent, ''sinônimo''.
[10702150] |Another important difference is that Brazilian spelling often lacks ''c'' or ''p'' before ''c'', ''ç'', or ''t'', where the European orthography has them; for example, cf. Brazilian ''fato'' with European ''facto'', "fact", or Brazilian ''objeto'' with European ''objecto'', "object".
[10702160] |Some of these spelling differences reflect differences in the pronunciation of the words, but others are merely graphic.
[10702170] |==Examples==
[10702180] |;Excerpt from the Portuguese [[national epic]] ''[[Os Lusíadas]]'', by author [[Luís de Camões]] (I, 33)
[10702190] |{| border="0" cellpadding="2" cellspacing="1" style="font-family: Lucida Sans Unicode" |-