This repository is a mirror of my GitHub project at https://github.com/pavan245/bitext-aligner (see the GitHub repository for more details). Project page: https://pavan245.github.io/bitext-aligner/

# bitext-aligner (Parallel Corpus Creation)

In the contemporary era of data-driven Natural Language Processing (NLP), parallel corpora have been a key resource in addressing the needs of our multilingual society. In this project, we set out to create a parallel corpus that provides an accessible mapping between two language pairs: Russian-English and German-English. The approach, however, can be extended to many other languages.

Data forms the backbone of our corpus, and collecting it was the initial task, which required us to experiment with several file formats. We chose the FictionBook 2.0 (FB2) format because it is plain XML, parsable against a standard XSD schema. Furthermore, the format is designed for fiction, which suited the nature of our data.
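Because FB2 is ordinary XML, the standard-library parser is enough for a first pass at extracting text. A minimal sketch (the namespace URI is the published FB2 2.0 namespace; the helper name is illustrative, not the project's actual parser):

```python
import xml.etree.ElementTree as ET

# FB2 documents live in this XML namespace.
FB2_NS = "{http://www.gribuser.ru/xml/fictionbook/2.0}"

def extract_paragraphs(fb2_text: str) -> list[str]:
    """Return the text of every <p> element in an FB2 document."""
    root = ET.fromstring(fb2_text)
    # itertext() flattens inline markup (<emphasis>, <strong>, ...)
    # so each paragraph comes back as one plain string.
    return ["".join(p.itertext()).strip() for p in root.iter(f"{FB2_NS}p")]
```

A real FB2 file also carries metadata in `<description>` (title, author, language), which is useful for tagging the corpus by language pair.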

To create an efficiently mapped corpus, we developed an aligner that identifies and matches corresponding units of the input texts; in our project, these units are sentences. Tokenizing the text into sentences is the first step of the alignment algorithm, followed by their translation using Google's NMT API. The core of the aligner is the use of Levenshtein distance to measure similarity between the original and translated sentences. Through an iterative process, we then find the best matches within a constrained window and perform the alignment.
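The matching step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the similarity normalization and the greedy per-sentence search within a fixed window are assumptions, and in practice the translated sentences would come from the NMT API rather than being passed in directly.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def align(translated: list[str], target: list[str], window: int = 3):
    """For each machine-translated source sentence, pick the most
    similar target sentence within +/- `window` positions."""
    pairs = []
    for i, src in enumerate(translated):
        lo, hi = max(0, i - window), min(len(target), i + window + 1)
        best = max(range(lo, hi), key=lambda j: similarity(src, target[j]))
        pairs.append((i, best))
    return pairs
```

The window keeps the search local, which both bounds the cost and prevents a short sentence from matching a spurious lookalike far away in the text.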

The output from the aligner is then saved to multiple XML files. Recording each file path in JSON simplified managing the pipeline's stages when loading them into a database. We present our corpus as a web page, so we used XSLT to transform the data from multiple XML files into an HTML file. While the XML/HTML formats make the project readable, they cannot answer queries to retrieve the information a user needs; therefore, we used a SQL database (MySQL) to support querying.
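The serialization step might look like the following sketch. The element names (`alignment`, `link`, `source`, `target`) and the JSON manifest layout are illustrative assumptions, not the project's actual schema:

```python
import json
import xml.etree.ElementTree as ET

def write_alignment_xml(pairs, xml_path, manifest_path):
    """Write aligned (source, target) sentence pairs to XML and
    record the file path in a JSON manifest for later DB ingestion."""
    root = ET.Element("alignment")
    for n, (src, tgt) in enumerate(pairs, 1):
        link = ET.SubElement(root, "link", id=str(n))
        ET.SubElement(link, "source").text = src
        ET.SubElement(link, "target").text = tgt
    ET.ElementTree(root).write(xml_path, encoding="utf-8",
                               xml_declaration=True)
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump({"xml_files": [xml_path]}, f)
```

Keeping the manifest separate from the XML means the database loader only has to read one small JSON file to discover every corpus file produced by a run.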

There are several possible enhancements to our project. Extending the corpus, aligning efficiently at the word level, integrating a local NMT engine, and reducing noise would all broaden its scope considerably. Furthermore, analysis of translation styles, language learning, and paremiology are some of the areas that could make use of our corpus.