Elnu's Blog

I write about things. Sometimes.

The Joy of Data Processing

Posted by Elnu on

For a project I’m working on (I’ll make a post about it once it’s done), I needed a large list of Japanese words, with the requirements being that the words be short and without kanji (only in hiragana). In addition, ideally they should be simple words that the average Japanese learner would know, and must be in a machine-readable format that I can use in JavaScript.

Luckily, I found something that matches these criteria! A JLPT N5 vocabulary list from JLPT Matome that covers 549 words, which should hopefully be more than enough for my needs.

However, the list is in the form of HTML tables, so it’s going to need some work. In addition, the pronunciation is provided only in romaji, so we’ll need to convert it to hiragana.

This is where the joy of data processing and web scraping comes in: refining a data source step-by-step until it’s something you can work with. This post is going to be less of a tutorial and more of a walk-through of a process that I found fun.

Extracting the raw data

The first step is to convert the tables to a CSV file. If you’re unfamiliar with CSV, it’s about the simplest way of storing spreadsheet data in a file. Each line corresponds to a row and holds a comma-separated list of values, one for each column (hence its name, a Comma-Separated Values file).
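
As a rough illustration of that row-and-column layout, here’s a minimal Python sketch that reads one CSV line with the standard csv module. Python is just a choice for illustration, and the names number, word, romaji, and meaning are assumed labels; the source list doesn’t necessarily call its columns that.

```python
import csv
import io

# One line of a CSV file: four values separated by commas.
sample = "1,あげる,ageru,to give\n"

for row in csv.reader(io.StringIO(sample)):
    # Each line becomes one row; each value becomes one column.
    # These four names are assumed labels, not headers from the source.
    number, word, romaji, meaning = row
    print(row)
```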

There are various command-line applications and browser extensions that convert HTML to CSV, but what I ended up using is this website. Using a website for conversions is cringe, I know, but when you’re doing a one-off thing and don’t need to write any scripts to do the job, utility websites are often the easiest way to get the job done.

It’s pretty handy, and it does the URL fetching for you. The only annoyance was that the source was split across 11 pages, so I had to convert each page separately and then put the results together, but after that was all done I had a nice CSV file:

1,あげる,ageru,to give
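
If you’d rather script this step, a rough sketch using only Python’s standard library might look like the following. This is not how the conversion website works and not what was used for this post. The names TableRowParser and tables_to_csv are made up for illustration, the sample HTML is only a guess at the table markup based on the row above, and a real version would first have to download each of the pages.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def tables_to_csv(pages):
    """Turn a list of HTML page strings into one combined CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for html in pages:
        parser = TableRowParser()
        parser.feed(html)
        parser.close()
        writer.writerows(parser.rows)
    return out.getvalue()


# Assumed markup for a single-row page, modelled on the CSV line above.
page_1 = (
    "<table><tr><td>1</td><td>あげる</td>"
    "<td>ageru</td><td>to give</td></tr></table>"
)
print(tables_to_csv([page_1]), end="")
```

Passing several page strings in the list concatenates their rows in order, which covers the "put them together" part of the pagination problem.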

Now, this is more data than we need. I only need the third column with the pronunciations (we’ll convert these into hiragana later), so one can remove the first, second, and fourth columns in spreadsheet software like LibreOffice and then re-export a single-column CSV file with just the romaji. Once that’s done, we have a text file that is just a list of pronunciations, one per line.
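
The spreadsheet step could also be done with a few lines of code. Here’s a minimal sketch, assuming the four-column layout shown earlier; extract_column is a hypothetical helper name, and index 2 is the third column because Python counts from zero.

```python
import csv
import io


def extract_column(csv_text, index):
    """Return the values of one column, skipping rows that are too short."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[index] for row in reader if len(row) > index]


words_csv = "1,あげる,ageru,to give\n"

# Index 2 is the third column, which holds the romaji.
romaji_lines = extract_column(words_csv, 2)
print("\n".join(romaji_lines))
```

In practice you would read the text from the combined CSV file and write the joined result to romaji.csv.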


Converting to hiragana

After a bit of research, I found koozaki/romaji-conv, an npm package that does exactly what I need. There’s a web-based demo that you can use if you want to quickly try it out or do a quick conversion, but it also has a CLI (command-line interface), which is what we’ll use.

Assuming you already have Node.js and npm installed, you can install it with the following command (i is an alias for install, and -g installs the package globally on your system instead of into a particular project’s node_modules):

npm i -g @koozaki/romaji-conv

We can use the following command to run romaji-conv on each line of our romaji file, romaji.csv, and write the resulting hiragana to a new file, hiragana.csv. The -L1 option tells xargs to run the command once per input line. I wasn’t familiar with xargs before, but I found this solution thanks to this Stack Overflow answer.

cat romaji.csv | xargs -L1 romaji-conv > hiragana.csv
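
To make it clearer what xargs -L1 is doing there, here’s a rough Python sketch of the same "run a command once per line" idea. It only approximates xargs, run_per_line is a hypothetical name, and the command used below is a stand-in that upper-cases its argument, so the sketch runs without romaji-conv installed.

```python
import subprocess
import sys


def run_per_line(command, lines):
    """Run `command` once per non-blank line, passing the line's
    whitespace-separated words as extra arguments."""
    outputs = []
    for line in lines:
        args = line.split()
        if not args:
            continue  # blank lines are skipped
        result = subprocess.run(
            [*command, *args], capture_output=True, text=True, check=True
        )
        outputs.append(result.stdout.rstrip("\n"))
    return outputs


# Stand-in command (an assumption for this sketch, not romaji-conv):
# a Python one-liner that prints its first argument in upper case.
stand_in = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
print(run_per_line(stand_in, ["ageru", ""]))
```

Assuming the romaji-conv CLI is on your PATH, replacing stand_in with ["romaji-conv"] and passing the lines of romaji.csv should do roughly what the shell command does.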

Converting into a JSON list

The final step is to convert this simple text file list into a JSON list/array that we can put directly into our JavaScript for use, in a format such as the following:

["あげる", "あさ", "ふうとう", "ふゆ", "ご", ...]

To do this, all we need is a simple find-and-replace in any text editor: first replace each newline \n with ", " (after removing any trailing newline, so the list doesn’t end with an empty entry), and then add the starting [" and closing "].
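
If you’d rather not do the find-and-replace by hand, the same result can be sketched in a few lines of Python. lines_to_json_array is a hypothetical helper name, and the sample text is the handful of words from the example above; ensure_ascii=False keeps the hiragana readable instead of turning it into \u escapes.

```python
import json


def lines_to_json_array(text):
    """Convert newline-separated words into a JSON array string."""
    words = [line.strip() for line in text.splitlines() if line.strip()]
    return json.dumps(words, ensure_ascii=False)


hiragana = "あげる\nあさ\nふうとう\nふゆ\nご\n"
print(lines_to_json_array(hiragana))
# ["あげる", "あさ", "ふうとう", "ふゆ", "ご"]
```

Using a JSON library also takes care of the trailing newline and of escaping any quotes, which the manual approach leaves to you.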

We’re done!

Closing thoughts

The things I’ve done here are not in any way groundbreaking by themselves. I could have done some things more elegantly, and honestly none of this is anything to write home about. However, I wrote this post anyway because I wanted to show the power of step-by-step data processing and manipulation, and how, with a bit of time, one can transform data sources into whatever form is needed.

If you see data in tables or some other form online, don’t give up! Getting it into the form you need isn’t going to be as hard or time-consuming as you think.

If you’re interested in Japanese text manipulation, please do check out Kojiro Ozaki’s romaji-conv on GitHub and give it a star! ⭐ It’s super handy and easy to use, and is painfully underrated at only 8 stars (including mine) at the time of writing.

See you in the next post!