The Joy of Data Processing
Posted by Elnu
Luckily, I found something that matches these criteria! A JLPT N5 vocabulary list from JLPT Matome that covers 549 words, which should hopefully be more than enough for my needs.
However, the list is in the form of HTML tables, so it’s going to need some work. In addition, the pronunciation is provided only in romaji, so we’ll need to convert it to hiragana.
This is where the joy of data processing and web scraping comes in: refining a data source step-by-step until it’s something you can work with. This post is going to be less of a tutorial and more of a walk-through of a process that I found fun.
Extracting the raw data
The first step is to convert the tables to a CSV file. If you’re unfamiliar with CSV, it’s about the simplest way of storing spreadsheet data in a file. Each line holds a comma-separated list of values (hence its name, a Comma-Separated Values file), one for each column, and each line corresponds to a row.
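To make the row-and-column idea concrete, here is a minimal sketch in Python (not something I used for this post) that reads CSV text with the standard csv module. The two sample rows are copied from the vocabulary list; the parse_rows name is just something I made up for illustration.

```python
import csv
import io

# Two sample rows in the same shape as the vocabulary list in this post.
SAMPLE = "1,あげる,ageru,to give\n2,朝,asa,morning\n"


def parse_rows(text):
    """Return each CSV line as a list of column values."""
    return list(csv.reader(io.StringIO(text)))


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Each printed list is one row, and each item in it is one column.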
There are various command-line applications and browser extensions that convert HTML to CSV, but what I ended up using is this website. Using a website for conversions is cringe, I know, but when you’re doing a one-off thing and don’t need to write any scripts to do the job, utility websites are often the easiest way to get the job done.
It’s pretty handy, and it does the URL fetching for you. The only annoyance was that the source was split across 11 pages, so I had to convert each page separately and then stitch the results together, but once that was done I had a nice CSV file:
1,あげる,ageru,to give
2,朝,asa,morning
3,封筒,fuutou,envelope
4,冬,fuyu,winter
5,五,go,five
...
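I stitched the pages together by hand, but if you would rather script it, something like the following sketch should work. It assumes each page was saved as its own file; the names page-1.csv through page-11.csv and vocab.csv are hypothetical, as is the merge_pages helper. It also assumes no value contains a line break, which holds for this list.

```python
from pathlib import Path


def merge_pages(page_paths, out_path):
    """Concatenate per-page CSV files, in the order given, into one file."""
    lines = []
    for path in page_paths:
        text = Path(path).read_text(encoding="utf-8")
        # Keep non-empty lines only, so a blank line at the end of a page
        # does not become an empty row in the merged file.
        lines.extend(line for line in text.splitlines() if line.strip())
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Hypothetical file names: page-1.csv through page-11.csv.
    pages = [f"page-{n}.csv" for n in range(1, 12)]
    if all(Path(p).exists() for p in pages):
        print(merge_pages(pages, "vocab.csv"), "rows written")
```

Building the file names from range(1, 12) keeps the pages in numeric order; sorting the names alphabetically would put page-10 before page-2.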
Now, this is more data than we need. I only need the third column with the pronunciations (we’ll convert these into hiragana later), so we can remove the first, second, and fourth columns in spreadsheet software such as LibreOffice and then re-export a single-column CSV file with just the romaji. Once that’s done, we have a text file that is just a list of pronunciations:
ageru
asa
fuutou
fuyu
go
...
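If you don’t feel like opening a spreadsheet, pulling out one column is also only a few lines of Python. This is a sketch rather than what I actually did: extract_column is a made-up helper, and the index 2 assumes the romaji is the third column, as in the sample above.

```python
import csv
import io


def extract_column(csv_text, index):
    """Return the values found at a zero-based column index, one per row."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip rows that are too short to have the requested column.
    return [row[index] for row in reader if len(row) > index]


sample = "1,あげる,ageru,to give\n2,朝,asa,morning\n3,封筒,fuutou,envelope\n"
romaji = extract_column(sample, 2)
print("\n".join(romaji))
```

Writing the joined result to a file would give the same single-column list as the spreadsheet export.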
Converting to hiragana
After a bit of research, I found koozaki/romaji-conv, an npm package that does exactly what I need. There’s a web-based demo that you can use if you want to quickly try it out or do a quick conversion, but it also has a CLI (command-line interface), which is what we’ll use.
Assuming you already have Node.js and npm installed, you can globally install it with the following command (i is an alias for install, and -g installs the package globally to your system instead of to a particular project’s node_modules directory):
npm i -g @koozaki/romaji-conv
We can use the following command to run romaji-conv on each line of our romaji file, romaji.csv, and push the resulting hiragana version to a new file, hiragana.csv. Previously, I wasn’t familiar with xargs, but I found this solution thanks to this Stack Overflow answer.
cat romaji.csv | xargs -L1 romaji-conv > hiragana.csv
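If xargs is new to you too, the idea behind -L1 is simply "run the command once for every non-blank input line, passing that line’s words as arguments". Here is a rough Python sketch of that loop. It is a simplification (real xargs also handles quoting), run_per_line is a made-up name, and the demo uses a stand-in command that upper-cases its argument so the example runs without romaji-conv installed.

```python
import subprocess
import sys


def run_per_line(command, lines):
    """Run `command` once per non-empty line and collect each run's output."""
    results = []
    for line in lines:
        words = line.split()
        if not words:
            continue  # nothing to pass on blank lines
        done = subprocess.run(
            [*command, *words], capture_output=True, text=True, check=True
        )
        results.append(done.stdout.strip())
    return results


# Stand-in command (an assumption for the demo, not romaji-conv itself):
# a tiny Python one-liner that upper-cases its first argument.
demo = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
print(run_per_line(demo, ["ageru", "asa", "", "fuyu"]))
```

Swapping the stand-in for ["romaji-conv"] should, assuming the CLI is on your PATH, do roughly the same job as the one-liner above.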
Converting into a JSON list
["あげる", "あさ", "ふうとう", "ふゆ", "ご", ...]
To do this, all we need to do is a simple find-and-replace in any text editor, first replacing each newline with ", ", and then finally adding the starting [" and closing "].
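For the record, the same conversion can be done without any manual find-and-replace. The sketch below uses Python’s json module; lines_to_json_list is a made-up helper name, and the sample input is the first few converted words from this post.

```python
import json


def lines_to_json_list(text):
    """Turn newline-separated text into a JSON array of strings."""
    items = [line.strip() for line in text.splitlines() if line.strip()]
    # ensure_ascii=False keeps the kana readable instead of \u escapes.
    return json.dumps(items, ensure_ascii=False)


hiragana = "あげる\nあさ\nふうとう\nふゆ\nご\n"
result = lines_to_json_list(hiragana)
print(result)
```

Letting a JSON library write the list also takes care of escaping, which matters if a value ever contains a quote character.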
The things I’ve done here by themselves are not in any way groundbreaking. I could have done some things more elegantly, and honestly none of this is anything to write home about. However, I wrote this post anyway because I wanted to show the power of step-by-step data processing and manipulation, and how, with a bit of time, one can reshape data sources into whatever form is needed.
If you see data in tables or some other form online, don’t give up! Getting it into the form you need isn’t going to be as hard or time-consuming as you think.
If you’re interested in Japanese text manipulation, please do check out Kojiro Ozaki’s romaji-conv on GitHub and give it a star! ⭐ It’s super handy and easy to use, and is painfully underrated at only 8 stars (including mine) at the time of writing.
See you in the next post!