Simple Sentence Parser for Indonesian Text

Last week I got a job assignment to create a simple Indonesian text summarizer. My lovely wife, Nana, who knows more about text processing than I do, suggested that I build a simple system that selects the most important sentence in the document and returns it as the summary (yes, it is very simple, but I think it is effective enough for this project).

She suggested some algorithms for that, then she warned me:

“In this case, the problem will be in the sentence parser, not in the summarizer. It is not easy to handle academic degrees, direct speech, numbers, etc. Last time, I needed to create a dictionary containing all the academic degrees.”

Wow.

Then, after hours of struggling, this is the method I came up with for an Indonesian sentence parser, which I use in my project. (TL;DR: I don't post my full code here because I doubt my solution is the best; in this post I just tell you what I made and how I did it.)

My first attempt

My first attempt at splitting a paragraph used a regex (for a single delimiter you could also use Python's string split function) to split on specific characters. I split the text wherever I found a dot (“.”), a question mark (“?”), or an exclamation mark (“!”) followed by a whitespace.

import re
newtext = re.split(r"([.!?]) ", text)
Then I found two facts. First, I didn't really need to split on question and exclamation marks: in my data those symbols only appear inside direct speech, which shouldn't be split. Second, requiring a whitespace after those symbols successfully avoids splitting inside numbers (e.g. “6.700 victims”).
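
To make this first attempt concrete, here is a small sketch that runs the same pattern on a made-up sample string. The sample text and the variable names parts and sentences are my own assumptions for illustration, not data or code from the project. Because the pattern requires a whitespace after the punctuation, the dot in “6.700” is left alone.

```python
import re

# Hypothetical sample text; the dot in "6.700" is a thousands separator.
text = "The flood hit 6.700 victims. Help is coming! Are they safe? Yes."

# Split on ".", "!" or "?" only when it is followed by a space.
# The capturing group keeps each punctuation mark as its own list item.
parts = re.split(r"([.!?]) ", text)

# Glue each sentence back together with its punctuation mark.
sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
if parts[-1]:
    sentences.append(parts[-1])  # the last chunk still has its own final dot
```

The capturing group matters here: without it, re.split would throw the punctuation away and you could not rebuild the original sentences.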

Avoiding splits inside direct speech

“Roses are red. Violets are Blue,” said Hugo.
To avoid splitting inside direct speech, I use a boolean variable that tracks whether the current dot is inside brackets or double quotes. If the dot is inside them, it is marked not to be split.
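inside = False
marked = []  # indexes of dots that must not trigger a split
for i, ch in enumerate(text):
    if ch == '"':
        inside = not inside  # toggle on every double quote
    elif inside and ch == ".":
        marked.append(i)  # mark to not split here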

Marking a dot not to be split

How do I mark a dot character so that it is not split? I put a unique symbol (one that I am sure will never appear in my documents) before and after it. Then, when I split using re.split("(\.) ", text), the marked dot is ignored, because it is no longer followed directly by a whitespace:

“Roses are red@#.@# Violets are Blue,” said Hugo.
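
A minimal sketch of this marking step could look like the following. The function name protect_quoted_dots, the MARK constant, and the handling of curly quotes are my own assumptions for illustration; only the “@#” marker and the idea of toggling a boolean come from the description above.

```python
MARK = "@#"  # assumed marker; pick any string that never occurs in your text

def protect_quoted_dots(text, mark=MARK):
    """Wrap every dot that sits inside double quotes with the marker."""
    out = []
    inside = False  # True while we are between an opening and a closing quote
    for ch in text:
        if ch == "“":
            inside = True
        elif ch == "”":
            inside = False
        elif ch == '"':
            inside = not inside  # straight quotes open and close alike
        if inside and ch == ".":
            out.append(mark + "." + mark)
        else:
            out.append(ch)
    return "".join(out)

protected = protect_quoted_dots('"Roses are red. Violets are Blue," said Hugo. He smiled.')
```

Only the dot after “red” gets wrapped here; the dot after “Hugo” is outside the quotes, so it stays a normal split point.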

Handling abbreviations and acronyms

This is the hardest part. Let's say you have these sentences:

Last night we met H. Akbar M. and Rian R. We ate so much last night in Thai Restaurant

Why is it so hard? If you look carefully, the dots after H and M are abbreviation dots, but the dot after R is both an abbreviation dot and the end-of-sentence dot. I can't tell these dots apart! So my solution is: if it is an abbreviation dot, do not split, even when it is at the end of a sentence. This means the method will not split the two sentences above but will return them as one sentence.

Detect abbreviation dot

In this project I created a rule to classify whether a dot is an abbreviation dot or not. The rule works perfectly on my data, and I hope it carries over to other cases:

  1. If the word just before the dot exists in the abbreviation dictionary (yes, I created an abbreviation dictionary like my wife did before), then it is an abbreviation dot (e.g. “Prof.”, “Dr.”, “Hj.”, “dll.”, “Moch.”).
  2. Otherwise, if the word just before the dot is a single character, then it is an abbreviation dot (e.g. “M.”, “H.”).
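
The two rules above can be sketched roughly like this. The function name is_abbreviation_dot, the tiny sample dictionary, and the case-insensitive lookup are my own assumptions; a real dictionary would be much longer.

```python
# Tiny illustrative dictionary, stored in lowercase (assumed sample entries).
ABBREVIATIONS = {"prof", "dr", "hj", "dll", "moch"}

def is_abbreviation_dot(text, dot_index, abbreviations=ABBREVIATIONS):
    """Return True if the dot at dot_index follows an abbreviation."""
    # Walk back from the dot to collect the word right before it.
    start = dot_index
    while start > 0 and text[start - 1].isalpha():
        start -= 1
    word = text[start:dot_index]
    if word.lower() in abbreviations:  # rule 1: dictionary lookup
        return True
    return len(word) == 1              # rule 2: a single character

sample = "Last night we met H. Akbar and Prof. Rian. We ate a lot."
```

With this sample, the dots after “H” and “Prof” count as abbreviation dots, while the dot after “Rian” does not.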

What about organization abbreviations, like the U.S.A.? In my data, they follow the correct convention of not putting dot characters in organization abbreviations.

Final touch

That's it. After you have marked the abbreviation dots and the dots inside brackets or quotes as not to be split, you can freely split your text using the regex function (don't forget to require a whitespace after the dot as a condition). Then remember to remove the unique symbols you used to mark the dots after the split. For now I don't provide fully functional code in this post, but I hope you can implement it yourself ;).
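
As a rough sketch of just this last step, assuming the text has already been marked with “@#” as described earlier (the function name split_marked_text and the sample string are hypothetical):

```python
import re

MARK = "@#"  # assumed marker, the same one used when marking the dots

def split_marked_text(marked_text, mark=MARK):
    """Split on unmarked dots followed by a space, then drop the markers."""
    parts = re.split(r"(\.) ", marked_text)
    # Re-attach each captured dot to the sentence it ended.
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1]:
        sentences.append(parts[-1])
    return [s.replace(mark, "") for s in sentences]

marked = 'Dr@#.@# Akbar met H@#.@# Rian. "Roses are red@#.@# Violets are Blue," said Hugo. We ate a lot.'
result = split_marked_text(marked)
```

Here result contains three sentences, and none of them contains the marker anymore.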

Image from: http://www.robertgillphotography.com

“Hack” Long Table using Tabularx LaTeX

When creating a table in LaTeX I really like using the tabularx package; it makes it easier to control the column widths (which I think is the most important part of creating a table). One drawback, though, is that tabularx doesn't support long tables. That means you can't use it to create a table that continues across pages.

But I found a simple “hack” to create a long table. I call it a “hack” because we don't create a real long table; we create two tables with the same properties and make them look like one long table. This is how we do it:

  • Create two tables with the same properties, and make sure the second table is on the second page. You can use the \newpage command or just put your second table right after the first table.
\begin{table}[H]
\caption{Table 1}
\begin{tabularx}{6.25in}{}
...
\end{tabularx}
\end{table}

\begin{table}[H]
\caption{Table 2}
\begin{tabularx}{6.25in}{}
...
\end{tabularx}
\end{table}
  • Next, make sure your second table's number is the same as the first one's. You can do that by putting \addtocounter before the definition of the second table. It subtracts 1 from the current table counter.
...
\addtocounter{table}{-1}
\begin{table}[H]
\caption{Table 2}
...
  • Lastly, remove the second table from the table index (the list of tables) by putting empty square brackets in the caption
...
\caption[]{Table 2}
...

And done! We have just created a “hack” version of a long table!

Of course, the downside of this long table is that the last rows of the first table can't move automatically into the second table when needed. But at least it looks like a long table 😉 and we don't need to automate everything, right? If you have any suggestions for creating a long table using tabularx, leave a comment below!

Thank You!

image from: https://www.potterybarn.com/

Create custom Ubuntu terminal commands

A few days ago I was invited to the Grand Final of Kode Indonesia in Jakarta, a programming contest held by Kalibrr. Unlike other programming competitions I have participated in, in this one the organizer doesn't provide computers or notebooks for participants, so I had to use my own notebook in the grand final.

So, my first thought while preparing my notebook for the competition was:

let’s create a simple command to compile C++ file!

The default command that I use to compile a C++ file in Ubuntu is:

g++ myfile.cpp -o outfile

and then I run the program using the command:

./outfile

My mission is to combine the two commands above into one simple command. This is how I do it:

  1. Create a script file, let's say customcpp.sh
  2. Put #!/bin/bash on the very first line, then your commands after it. In my case the file contains:
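    #!/bin/bash
    # Name of the C++ source file, without the .cpp extension
    CPPFILE="$1"
    # Compile, and run the program only if compilation succeeded
    g++ "${CPPFILE}.cpp" -o outfile && ./outfile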
  3. Move your file to /usr/local/bin. You can do this from the terminal using the command below. SCRIPTNAME is the name you will use to call this script in the terminal; I use cmps as SCRIPTNAME.
    sudo mv ~/customcpp.sh /usr/local/bin/SCRIPTNAME
  4. Set the correct ownership and permissions, and done!
    sudo chown root: /usr/local/bin/cmps
    sudo chmod 755 /usr/local/bin/cmps

After setting up the custom command, I just need to run cmps mycppfile in the terminal to compile and run the cpp file. It is faster than before 😉

Reference:

https://askubuntu.com/questions/789476/how-to-create-my-own-terminal-commands

Recognizing Arabic Letter Utterance using Convolutional Neural Network

Arabic letters have unique characteristics because of the similarity of the sounds produced when reciting a few of the letters. This paper presents an application of Convolutional Neural Networks (CNN) to speech recognition of Arabic letters. CNNs have shown very good performance in image and speech recognition in the last few years. This study examines several types of CNN models and compares them with some Deep Neural Network (DNN) models on the speech dataset used. As a result, a CNN with one convolution layer and one fully-connected layer obtained an accuracy of up to 83.00%, far better than the traditional DNN, which only reached 79.25%.

Download here: http://ieeexplore.ieee.org/document/8022720/

This is my first published paper. It is not really good or advanced 🙁
but I hope it is useful! 🙂

Learning from Stackoverflow

One of the hard things, in my opinion, about learning Artificial Intelligence/Machine Learning/Data Mining is that just to study, we sometimes need to write fairly long and complex code. Sometimes this is what gets us down before we even start learning.

Well, one learning approach that I have found quite helpful is being active on Stackoverflow. Yes, for us programmers that site is no stranger. Often when we hit an error or a difficulty and start googling for a solution, it is Stackoverflow that gives the answer.

When I suggest being ‘active’, I don't mean just looking up other people's answers; I mean actively looking for questions and giving answers.

1. Looking for Questions

Yes, try looking for questions on the same topic as the material you want to learn. Go through the questions that have already been asked one by one; you may find interesting questions or answers that are actually quite important to ask but that never occurred to us.

By looking at other people's answers we can also broaden our knowledge, finding interesting solutions that show how people solve their problems.

2. Giving Answers

By trying to answer some questions, we learn to understand the material we are studying more deeply. People say that by teaching, our knowledge grows, and it is true: by trying to answer the questions there, our knowledge keeps growing. We learn how to explain and give examples that answer other users' questions.

MCVE

Another tip for being active on Stackoverflow is to first read the rules there, such as how to ask and answer well. Here you will get to know the term MCVE (Minimal, Complete, and Verifiable Example), or how to present the part of our program that produces the error in a minimal way. We learn how to explain the error in our program clearly to other people.

Learning AI

And one of the benefits I have gotten is that by trying to read and answer questions on Stackoverflow about AI/ML/DM, we gain knowledge and get practice at the same time, without having to write too many lines of code 😉 Hope this is useful!