py-elotl
Python package for Natural Language Processing (NLP), focused on low-resource languages spoken in Mexico.
This is a project of Comunidad Elotl.
Developed by:
- Paul Aguilar @penserbjorne, paul.aguilar.enriquez@hotmail.com
- Robert Pugh @Lguyogiro, robertpugh408@gmail.com
- Diego Barriga @umoqnier, dbarriga@ciencias.unam.mx
Requires Python >= 3.11
- Development status: Beta (see the package classifiers on PyPI)
- pip package: elotl
- GitHub repository: ElotlMX/py-elotl
Installation
Using pip
pip install elotl
From source
git clone https://github.com/ElotlMX/py-elotl.git
cd py-elotl
pip install -e .
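To check that the installation worked, you can try importing the corpus subpackage from the command line. This is only a quick sanity check, not part of the official instructions:
python -c "import elotl.corpus; print('elotl imported correctly')"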
Use
Working with corpora
import elotl.corpus
Listing the available corpora
print("Name\t\tDescription")
list_of_corpus = elotl.corpus.list_of_corpus()
for row in list_of_corpus:
print(row)
Output:
Name Description
['axolotl', 'Is a Spanish-Nahuatl parallel corpus']
['tsunkua', 'Is a Spanish-Otomí parallel corpus']
['kolo', 'Is a Spanish-Mixteco parallel corpus']
Loading a corpus
If a non-existent corpus is requested, a value of 0 is returned.
axolotl = elotl.corpus.load('axolotlr')
if axolotl == 0:
    print("The name entered does not correspond to any corpus")
If an existing corpus is entered, a list is returned.
axolotl = elotl.corpus.load('axolotl')
print(axolotl[0])
[
'Y así, cuando hizo su ofrenda de fuego, se sienta delante de los demás y una persona se queda junto a él.',
'Auh in ye yuhqui in on tlenamacac niman ye ic teixpan on motlalia ce tlacatl itech mocaua.',
'Classical Nahuatl',
'Vida económica de Tenochtitlan',
'nci'
]
Each element of the list is a row with the following fields:
- non_original_language (l1)
- original_language (l2)
- variant
- document_name
- ISO language code (optional)
tsunkua = elotl.corpus.load('tsunkua')
for row in tsunkua:
    print(row[0])  # language 1
    print(row[1])  # language 2
    print(row[2])  # variant
    print(row[3])  # document
Una vez una señora se emborrachó
nándi na ra t'u̱xú bintí
Otomí del Estado de México (ots)
El otomí de toluca, Yolanda Lastra
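When positional indices become hard to read, each row can be paired with the field names listed above. The names used here are just the labels from this README, not identifiers defined by the package:
import elotl.corpus

# Field order as documented above; the ISO language code may be absent in some corpora.
FIELDS = ("non_original_language", "original_language", "variant", "document_name", "iso_lang")

tsunkua = elotl.corpus.load('tsunkua')
for row in tsunkua:
    entry = dict(zip(FIELDS, row))
    print(entry["variant"], "-", entry["document_name"])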
Package structure
The following structure is a reference. As the package grows it will be better documented.
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── dist
├── docs
├── elotl                      Top-level package
│   ├── corpora                Corpus data files
│   ├── corpus                 Subpackage to load the corpora
│   ├── huave                  Huave language subpackage
│   │   └── orthography.py     Module to normalize Huave orthography and phonemes
│   ├── __init__.py            Initializes the package
│   ├── nahuatl                Nahuatl language subpackage
│   │   └── orthography.py     Module to normalize Nahuatl orthography and phonemes
│   ├── otomi                  Otomi language subpackage
│   │   └── orthography.py     Module to normalize Otomi orthography and phonemes
│   ├── __pycache__
│   └── utils                  Subpackage with common functions and files
│       └── fst                Finite State Transducer functions
│           └── att            Static .att files
├── LICENSE
├── Makefile
├── MANIFEST.in
├── pyproject.toml
├── README.md
└── tests
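The orthography.py modules listed above handle orthographic normalization for each language. As a hedged sketch only: the Normalizer class name, the "sep" normalization scheme, and the normalize() method below are assumptions based on the module layout, so check the module docstrings before relying on them:
import elotl.nahuatl.orthography

# Assumption: a Normalizer class that takes a normalization scheme (e.g. "sep")
# and exposes a normalize() method. Verify against the module itself.
normalizer = elotl.nahuatl.orthography.Normalizer("sep")
print(normalizer.normalize("amoxcalli"))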
Development
Requirements
- poetry
- make
Quick build
poetry env use 3.x
poetry shell
make all
Where 3.x is your local Python version. Check Poetry's documentation on managing environments.
Step by step
Build FSTs
Build the FSTs with make.
make fst
Create a virtual environment and activate it.
poetry env use 3.x
poetry shell
Update pip and generate the distribution files.
python -m pip install --upgrade pip
poetry build
Testing the package locally
python -m pip install -e .
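After the local install, a quick smoke test is to list the bundled corpora with the same API shown earlier:
python -c "import elotl.corpus; print(elotl.corpus.list_of_corpus())"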
Send to PyPI
poetry publish
Remember to configure your PyPI credentials first.
License
Mozilla Public License 2.0 (MPL 2.0)