Last week I was assigned to build a simple Indonesian text summarizer. My lovely wife, Nana, knows more about text processing than I do, and she suggested a simple system that selects the most important sentence in the document and returns it as the summary. (Yes, it is very simple, but I think it is effective enough for this project.)
She suggested some algorithms for that, and then she warned me:
“In this case, the problem will be in the sentence parser, not in the summarizer. It is not easy to handle academic degrees, direct speech, numbers, etc. Last time, I had to create a dictionary containing all the academic degrees.”
After hours of struggling, here is the method I came up with for an Indonesian sentence parser, which I use in my project. (TL;DR: I am not posting my full code here because I doubt my solution is the best. In this post I only describe what I built and how I did it.)
My first attempt
My first attempt at splitting a paragraph used a regex (you could also use Python's string split function) to split on specific characters. I split the text wherever I found a dot (“.”), a question mark (“?”), or an exclamation mark (“!”) followed by a whitespace.
import re
newtext = re.split(r"([.!?]) ", text)
Then I found two things. First, I didn't really need to split on question and exclamation marks: in my data those symbols only appear inside direct speech, which shouldn't be split. Second, requiring a whitespace after those symbols successfully avoids splitting inside numbers (e.g. “6.700 victims”).
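To make the first attempt concrete, here is a minimal sketch of it. The function name `naive_split` and the sample text are made up for illustration and are not from the project. Because the pattern uses a capture group, `re.split` returns each punctuation mark as its own list item, so the sketch glues every mark back onto the sentence it ends.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' only when a space follows (hypothetical helper)."""
    # The capture group keeps each punctuation mark as a separate list item.
    parts = re.split(r"([.!?]) ", text)
    sentences = []
    # Items come in (sentence, mark) pairs; glue each mark back on.
    for i in range(0, len(parts) - 1, 2):
        sentences.append(parts[i] + parts[i + 1])
    # The last item is whatever follows the final split point.
    if parts[-1]:
        sentences.append(parts[-1])
    return sentences

sample = "The flood left 6.700 victims. Aid arrived on Monday."
print(naive_split(sample))
```

The dot in “6.700” is followed by a digit rather than a whitespace, so it is left alone.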
Avoiding splits inside direct speech
“Roses are red. Violets are Blue,” said Hugo.
To avoid splitting inside direct speech, I keep a boolean variable that tracks whether the current dot is inside brackets or double quotes. If the dot is inside, it is marked not to be split.
inside = False
for i in range(len(text)):
    # flip the flag at every double quote (straight or curly)
    if text[i] in '"“”':
        inside = not inside
    elif inside and text[i] == ".":
        pass  # mark to not split here
Marking a dot not to be split
How do I mark a dot so that it is not split? I put a unique symbol (one that I am sure never appears in my documents) before and after it, so when I split using re.split(r"(\.) ", text) the dot is ignored, because it is no longer directly followed by a whitespace:
“Roses are red@#.@# Violets are Blue,” said Hugo.
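The marking step could look roughly like the sketch below. The names `MARK` and `protect_quoted_dots` are hypothetical, and it assumes quotes are never nested, since it simply flips a flag at every double quote character (straight or curly).

```python
import re

MARK = "@#"  # assumed never to occur in the documents

def protect_quoted_dots(text, quotes='"“”'):
    """Wrap every dot that sits inside double quotes in MARK (hypothetical helper)."""
    out = []
    inside = False
    for ch in text:
        if ch in quotes:
            inside = not inside  # entering or leaving a quotation
        if inside and ch == ".":
            out.append(MARK + ch + MARK)
        else:
            out.append(ch)
    return "".join(out)

marked = protect_quoted_dots("“Roses are red. Violets are Blue,” said Hugo.")
print(marked)
# The marked dot is followed by MARK, not by a space, so nothing is split here.
print(re.split(r"(\.) ", marked))
```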
Handling abbreviations and acronyms
This is the hardest part. Let's say you have these sentences:
Last night we met H. Akbar M. and Rian R. We ate so much last night in Thai Restaurant
Why is it so hard? If you look carefully, the dots after H and M are abbreviation dots, but the dot after R is both an abbreviation dot and the end-of-sentence dot. I can't tell these dots apart! So my solution is: if it is an abbreviation dot, do not split, even when it is at the end of a sentence. This means the method will not split the two sentences above but will return them as one sentence.
Detect abbreviation dot
For this project, I created a rule to decide whether a dot is an abbreviation dot. This rule works perfectly on my data, and I hope it works in other cases too:
- If the word just before the dot exists in the abbreviation dictionary (yes, I created an abbreviation dictionary, just like my wife did before), then it is an abbreviation dot (e.g. “Prof.”, “Dr.”, “Hj.”, “dll.”, “Moch.”).
- Else, if the word just before the dot is a single character, then it is an abbreviation dot (e.g. “M.”, “H.”).
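The two rules above can be sketched as a small function. The name `is_abbreviation_dot` is hypothetical, the dictionary holds only the examples from this post, and matching it case-insensitively is an assumption rather than something the project necessarily does.

```python
# Sample entries only; a real dictionary would be much larger.
ABBREVIATIONS = {"prof", "dr", "hj", "dll", "moch"}

def is_abbreviation_dot(text, dot_index, abbreviations=ABBREVIATIONS):
    """Return True if the dot at dot_index ends an abbreviation (hypothetical helper)."""
    # Walk back from the dot to collect the letters of the word before it.
    start = dot_index
    while start > 0 and text[start - 1].isalpha():
        start -= 1
    word = text[start:dot_index]
    # Rule 1: the word is in the abbreviation dictionary.
    if word.lower() in abbreviations:
        return True
    # Rule 2: the word is a single character, such as an initial.
    return len(word) == 1

s = "Last night we met H. Akbar M. and Rian R. We ate so much"
print([is_abbreviation_dot(s, i) for i, ch in enumerate(s) if ch == "."])
```

A dot inside a number such as “6.700” has no letters before it, so neither rule fires.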
What about abbreviations of organizations, like the U.S.A.? My data follows the convention of not putting dots in organization abbreviations, so this case does not come up.
That's it. Once you have marked the abbreviation dots and the dots inside brackets or quotes as not to be split, you can freely split your text with the regex (don't forget to require a whitespace after the dot). Then remember to remove the unique symbols you used to mark the dots after the split. For now, I am not providing fully functional code in this post, but I hope you can implement it yourself ;).
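As a starting point, the steps described in this post might be combined as in the sketch below. It is not the code used in the project. The names `MARK`, `ABBREVIATIONS` and `split_sentences` are hypothetical, the dictionary contains only the sample entries mentioned above, and the sketch handles double quotes but not brackets.

```python
import re

MARK = "@#"  # assumed never to occur in the documents
ABBREVIATIONS = {"prof", "dr", "hj", "dll", "moch"}  # sample entries only

def split_sentences(text, abbreviations=ABBREVIATIONS, quotes='"“”'):
    """Mark protected dots, split on the remaining ones, remove the marks."""
    out = []
    word = ""       # letters seen since the last non-letter character
    inside = False  # True while we are inside double quotes
    for ch in text:
        if ch in quotes:
            inside = not inside
        protected = inside or word.lower() in abbreviations or len(word) == 1
        if ch == "." and protected:
            out.append(MARK + ch + MARK)
        else:
            out.append(ch)
        word = word + ch if ch.isalpha() else ""
    marked = "".join(out)
    # Split only on unmarked dots that are followed by a whitespace.
    parts = re.split(r"(\.) ", marked)
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1]:
        sentences.append(parts[-1])
    # Remove the marks again after the split.
    return [s.replace(MARK, "") for s in sentences]

text = ("Last night we met H. Akbar M. and Rian R. We ate so much last night "
        "in Thai Restaurant. “Roses are red. Violets are Blue,” said Hugo. "
        "It cost 6.700 rupiah.")
for sentence in split_sentences(text):
    print(sentence)
```

As explained earlier, the dot after “R” is treated as an abbreviation dot, so the first two sentences stay together.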
Image from: http://www.robertgillphotography.com