English to Vietnamese Noun Translator

Project Summary

This project was part of my LING 508: Computational Techniques for Linguists course in the Master's in Human Language Technology program at the University of Arizona. It is a web-based application that I developed using Python, SQL, and HTML. I implemented it with a microservices architecture, using principles of object-oriented programming and a test-driven development approach.

The code is available on my GitHub here.

I first came up with a few possible use cases, and then I decided on two main functions of this project based on the second use case: grab a word from a database, and add a word to the database.

Get a word from the database

The database allows users to look up information about nouns in English and Vietnamese. To use the database, the user inputs an English noun using the "Get a word from the database" interface, and the app returns the Vietnamese translation with its classifier (if applicable) as a JSON string. For example:

    {
      "noun": {
        "english": "bird",
        "classifier": "con",
        "vietnamese": "chim"
      }
    }
  

Add a word to the database

The user can add an English noun and its Vietnamese translation to the database using the "Add a word to the database" interface. For example, the user could enter "pig" in the English input, "con" in the Classifier input, and "heo" in the Vietnamese input, and the page will respond either that the noun has been added successfully or that it already exists in the database.

Approach

My approach was heavily shaped and guided by how the structure of the assignment was presented to us in LING 508. As this was my first time writing a web-based application, I closely followed the recommendations and examples provided in the course. Many of the techniques and tools I used were new to me. In this course, I learned how to use Flask and SQL, and I learned about the components that can go in a web-based application, such as the service layer, API, database, and HTML interface.

After deciding on the use case (the app takes an English noun as input and returns the original input, the Vietnamese translation, and the appropriate classifier, if applicable), I drew a UML diagram to determine how to organize the classes. In the enums.py file, I defined an enumeration class for classifiers.

In Vietnamese, most nouns are typically used with a classifier, especially when the noun is counted. For example, the Vietnamese translation for "two pigs" would be "hai con heo," and not "hai heo." This is similar to how certain nouns in English, like "money" or "water," would be grammatically incorrect if expressed as "two money" or "two water." Such nouns would instead have to be expressed as something like "two bags of money" or "two bottles of water." Different types of nouns in Vietnamese require different classifiers. For example, animals typically take "con" as their classifier, while many objects take "cái."

In the generators.py file, I defined a Noun class with three main attributes: the English noun, the classifier for the Vietnamese translation, and the Vietnamese translation itself. The Noun class contains a classifier_mapper method built around a dictionary whose keys are the string forms of the classifiers and whose values are the corresponding members of the Classifier class. The main purpose of this method is to make it easier to refer to classifiers elsewhere in the code without having to worry about maintaining the diacritics.
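
As a rough illustration of the two pieces described above, here is a minimal sketch of a classifier enumeration and a Noun class. The attribute names eng, classifier, and viet match those used in the API code further down; the enum members, the exact signature of classifier_mapper, and the use of a dataclass are assumptions made for the example, and the real code in enums.py and generators.py may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Classifier(Enum):
    """Illustrative subset of Vietnamese classifiers (assumed members)."""
    CON = "con"  # typically used for animals
    CAI = "cái"  # typically used for many inanimate objects


@dataclass
class Noun:
    eng: str                          # English noun
    classifier: Optional[Classifier]  # None when no classifier applies
    viet: str                         # Vietnamese translation

    @staticmethod
    def classifier_mapper(text: Optional[str]) -> Optional[Classifier]:
        """Map a plain string, with or without diacritics, to a Classifier."""
        mapping = {
            "con": Classifier.CON,
            "cai": Classifier.CAI,  # diacritic-free spelling
            "cái": Classifier.CAI,
        }
        if text is None:
            return None
        return mapping.get(text.strip().lower())


pig = Noun(eng="pig", classifier=Noun.classifier_mapper("con"), viet="heo")
print(pig.classifier.value)
print(Noun.classifier_mapper("cai").value)
```

With a mapper like this, other parts of the code can pass "cai" and still end up with the correctly accented "cái" stored in one place.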

The tests folder contains tests in test_classes.py, test_services.py, and test_sql.py to ensure that my classes, services, and SQL database, respectively, work properly. Some of these tests may not seem necessary at first, such as the one that tests the ability to remove a noun from the database. After all, the use case is focused on adding nouns to the database and retrieving them from it. However, being able to remove a noun and then verify that it was properly removed made debugging more efficient: I did not have to write new tests, using nouns not yet seen in testing, every time I wanted to check retrieval and storage. Removing the noun allowed me to rerun the same tests on the same noun repeatedly.
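
The store, check, remove cycle described above can be sketched as follows. This is a simplified example, not the actual contents of test_sql.py: FakeRepository is a hypothetical in-memory stand-in for the MySQL repository, and its method names are assumptions.

```python
class FakeRepository:
    """Hypothetical in-memory stand-in for the MySQL repository."""

    def __init__(self):
        self._rows = {}

    def store_noun(self, eng, classifier, viet):
        self._rows[eng] = (classifier, viet)

    def load_noun(self, eng):
        return self._rows.get(eng)

    def remove_noun(self, eng):
        self._rows.pop(eng, None)


# Shared instance, mimicking a database whose contents persist between runs.
REPO = FakeRepository()


def test_store_and_remove_noun():
    # The noun must be absent at the start, or the "store" step proves nothing.
    assert REPO.load_noun("pig") is None
    REPO.store_noun("pig", "con", "heo")
    assert REPO.load_noun("pig") == ("con", "heo")
    # Cleaning up afterwards is what makes the test repeatable on the same noun.
    REPO.remove_noun("pig")
    assert REPO.load_noun("pig") is None
```

A function written this way can be collected by a runner such as pytest, and it passes however many times it is run because each run leaves the store as it found it.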

In the data folder, the database is defined in SQL in init.sql. I developed a MySQL repository with three functions: one to pull nouns from the database, one to store nouns in the database, and one to remove nouns from the database. The first two functions serve the use case, and the last one facilitates testing during development, following the test-driven development approach introduced early in my LING 508 course. There is also a general repository in repository.py that uses the abc module to define an abstract base class, so that one repository implementation can be substituted for another, which increases flexibility.
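
To show how an abstract repository allows implementations to be swapped, here is a minimal sketch. It uses SQLite from the standard library as a stand-in for MySQL so that it runs without a database server. The class names, method names, and table layout are assumptions for illustration and are not taken from repository.py or init.sql.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional, Tuple

Row = Tuple[str, Optional[str], str]  # (english, classifier, vietnamese)


class Repository(ABC):
    """Interface the service layer depends on (method names are illustrative)."""

    @abstractmethod
    def load_noun(self, eng: str) -> Optional[Row]: ...

    @abstractmethod
    def store_noun(self, eng: str, classifier: Optional[str], viet: str) -> None: ...

    @abstractmethod
    def remove_noun(self, eng: str) -> None: ...


class SQLiteRepository(Repository):
    """SQLite stand-in for the MySQL repository (assumed table layout)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS nouns ("
            "english TEXT PRIMARY KEY, classifier TEXT, vietnamese TEXT NOT NULL)"
        )

    def load_noun(self, eng):
        cur = self.conn.execute(
            "SELECT english, classifier, vietnamese FROM nouns WHERE english = ?",
            (eng,),
        )
        return cur.fetchone()  # None when the noun is not stored

    def store_noun(self, eng, classifier, viet):
        self.conn.execute(
            "INSERT OR REPLACE INTO nouns VALUES (?, ?, ?)", (eng, classifier, viet)
        )
        self.conn.commit()

    def remove_noun(self, eng):
        self.conn.execute("DELETE FROM nouns WHERE english = ?", (eng,))
        self.conn.commit()


repo: Repository = SQLiteRepository()
repo.store_noun("bird", "con", "chim")
print(repo.load_noun("bird"))  # ('bird', 'con', 'chim')
repo.remove_noun("bird")
print(repo.load_noun("bird"))  # None
```

Because callers only rely on the Repository interface, a MySQL-backed class or an in-memory fake could be passed in without changing the calling code.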

The service layer in services.py contains the code that addresses the use case. The noun_classifier method checks whether the noun entered by the user is already in the repository. If it is, the method retrieves the noun from the repository. If it is not, the noun is passed to a Translator from the imported translate library to be translated into Vietnamese. To encourage the translator to include the classifier, the word "the" is placed in front of the noun before translation, because English "the" in front of a noun is often translated as the noun's classifier in Vietnamese. The method then stores the result in the repository so that it can be retrieved in the future without being translated again. The add_noun method checks for the presence of the noun in the database to determine whether it needs to be stored. The parse_noun method retrieves the noun from the database and returns an instance of the Noun class I defined in generators.py.
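
The lookup-then-translate flow can be sketched roughly as below. This is a simplified illustration, not the code in services.py. The repository is reduced to a plain dictionary, the translator is replaced by a hypothetical fake_translate function so the example runs offline, and the way the translated phrase is split into classifier and noun is an assumption.

```python
from typing import Callable, Dict, Optional, Tuple

KNOWN_CLASSIFIERS = {"con", "cái"}  # illustrative subset

Entry = Tuple[Optional[str], str]  # (classifier, vietnamese)


def noun_classifier(
    noun: str,
    repo: Dict[str, Entry],
    translate: Callable[[str], str],
) -> Entry:
    """Return (classifier, vietnamese) for an English noun, translating on a miss."""
    if noun in repo:
        return repo[noun]  # already stored, so no translation is needed

    # Prefixing "the" nudges the translator to emit the classifier.
    words = translate("the " + noun).split()
    if len(words) > 1 and words[0].lower() in KNOWN_CLASSIFIERS:
        entry = (words[0].lower(), " ".join(words[1:]))
    else:
        entry = (None, " ".join(words))

    repo[noun] = entry  # store for future lookups
    return entry


calls = []


def fake_translate(text: str) -> str:
    """Stand-in for the real translator; records each call it receives."""
    calls.append(text)
    return {"the bird": "con chim"}.get(text, text)


repo = {}
print(noun_classifier("bird", repo, fake_translate))  # ('con', 'chim')
print(noun_classifier("bird", repo, fake_translate))  # ('con', 'chim')
print(calls)  # ['the bird']
```

The translator is called only once: the second lookup is answered from the stored entry.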

The API layer in app.py uses Flask to handle requests for retrieving data and storing data. For example, the method for the GET request looks like this:

    @app.route("/noun_data/<noun>", methods=["GET"])
    def noun_data(noun: str):
        # <noun> in the route is passed in as the noun argument.
        res = services.noun_classifier(noun)
        app.logger.info("/noun_data - Got noun data: " + noun)
        return jsonify(
            english=res.eng,
            classifier=res.classifier.value if res.classifier else "N/A",
            vietnamese=res.viet,
        )
  

The API can also be called directly, without the UI, by sending a GET request. The endpoint for this example is http://localhost:5000/noun_data/bird. The desired English word must appear at the end of the URL. A complete curl command looks like this:

    curl -i -X GET localhost:5000/noun_data/bird
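
The same request can be made from Python using only the standard library. The sketch below assumes the app is running locally on port 5000, as in the curl example; the helper names noun_data_url and fetch_noun are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # matches the curl example; adjust as needed


def noun_data_url(noun: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL, percent-encoding the noun."""
    return f"{base_url}/noun_data/{quote(noun, safe='')}"


def fetch_noun(noun: str, base_url: str = BASE_URL) -> dict:
    """Call the endpoint and decode the JSON body (the app must be running)."""
    with urlopen(noun_data_url(noun, base_url), timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


print(noun_data_url("bird"))  # http://localhost:5000/noun_data/bird
# fetch_noun("bird") returns the parsed JSON once the Flask app is up.
```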
  

Throughout the files in the GitHub repository, there is additional code for another use case that I ended up not completing for the assignment but hope to finish in the future. The other use case was to parse a pronoun in Vietnamese. The app would take a Vietnamese pronoun as input and return information on the gender, person, number, and context.