Metadata-Version: 2.1
Name: snowballstemmer
Version: 2.2.0
Summary: This package provides 29 stemmers for 28 languages generated from Snowball algorithms.
Home-page: https://github.com/snowballstem/snowball
Author: Snowball Developers
Author-email: snowball-discuss@lists.tartarus.org
License: BSD-3-Clause
Keywords: stemmer
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: Basque
Classifier: Natural Language :: Catalan
Classifier: Natural Language :: Danish
Classifier: Natural Language :: Dutch
Classifier: Natural Language :: English
Classifier: Natural Language :: Finnish
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Greek
Classifier: Natural Language :: Hindi
Classifier: Natural Language :: Hungarian
Classifier: Natural Language :: Indonesian
Classifier: Natural Language :: Irish
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Lithuanian
Classifier: Natural Language :: Nepali
Classifier: Natural Language :: Norwegian
Classifier: Natural Language :: Portuguese
Classifier: Natural Language :: Romanian
Classifier: Natural Language :: Russian
Classifier: Natural Language :: Serbian
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Swedish
Classifier: Natural Language :: Tamil
Classifier: Natural Language :: Turkish
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/x-rst
License-File: COPYING

Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported. We no longer actively support Python 2 as
the Python developers stopped supporting it at the start of 2020. Snowball
2.1.0 was the last release to officially support Python 2.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*. So searching for *connected*
would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use. We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve. If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.

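
The behaviour described above can be checked with a short script. This is a
minimal sketch that assumes the package is installed; the variable names
``forms`` and ``stems`` are illustrative only.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer("english")

   # The word forms listed above should all share a single stem.
   forms = ["connection", "connections", "connective", "connected", "connecting"]
   stems = set(stemmer.stemWords(forms))
   print(stems)

   # Words that merely share a linguistic root are kept apart.
   print(stemmer.stemWord("awe"), stemmer.stemWord("awful"))
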
How to use the library
----------------------

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   # Look up a stemmer by algorithm name, then stem a list of words.
   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWords("We are the world".split()))

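
The other two entry points mentioned above, ``snowballstemmer.algorithms`` and
``Stemmer.stemWord``, can be tried in the same way. This is a rough sketch; the
variable name ``names`` is illustrative, and the exact list of names depends on
the installed version.

.. code-block:: python

   import snowballstemmer

   # List the algorithm names this installation knows about.
   names = snowballstemmer.algorithms()
   print(sorted(names))

   # stemWord() handles one word at a time; stemWords() takes a list.
   stemmer = snowballstemmer.stemmer("english")
   print(stemmer.stemWord("connections"))
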
Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible with
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

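
One way to see which implementation is in use is sketched below. It assumes
that PyStemmer installs a top-level module named ``Stemmer``; probing with
``importlib.util.find_spec`` and inspecting ``type(stemmer).__module__`` are
illustrative choices here, not an interface that **snowballstemmer** documents.

.. code-block:: python

   import importlib.util

   import snowballstemmer

   # Assumption: PyStemmer is importable as the top-level module "Stemmer".
   pystemmer_importable = importlib.util.find_spec("Stemmer") is not None

   stemmer = snowballstemmer.stemmer("english")

   print("PyStemmer importable:", pystemmer_importable)
   # The defining module hints at whether the C wrapper or pure Python is used.
   print("Stemmer object defined in:", type(stemmer).__module__)

   # Either way, the same methods are available.
   print(stemmer.stemWord("connected"))
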
Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages). It's
not a realistic test of normal use as a real application would do much more
than just stemming. It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0. They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).

The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.

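
To get a rough feel for stemming speed on your own setup, something like the
following can be used. It is only a sketch and not the benchmark described
above: the helper ``time_stemming``, the tiny word list, and the repeat count
are all made up for illustration.

.. code-block:: python

   import time

   import snowballstemmer


   def time_stemming(words, algorithm="english", repeats=100):
       """Return the seconds taken to stem ``words`` ``repeats`` times."""
       stemmer = snowballstemmer.stemmer(algorithm)
       start = time.perf_counter()
       for _ in range(repeats):
           stemmer.stemWords(words)
       return time.perf_counter() - start


   # Hypothetical sample; a real measurement needs a much larger vocabulary.
   sample = "connection connections connective connected connecting".split()
   elapsed = time_stemming(sample)
   print("stemmed %d words in %.4f s" % (len(sample) * 100, elapsed))

Running the same script with and without **PyStemmer** installed, or under
PyPy, gives a quick comparison on your own data.
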
The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "