I already wrote about the word count challenge, and a couple of weeks later I found another challenge about words. You can read the detailed description, but the quote below covers the essentials.
For the purposes of this problem, we define an English word as a word from the list included here as words.txt. (If you're interested, it's the SCOWL “insane” dictionary with flattened diacritics.) Membership in the dictionary is case-insensitive. You have to write a program that can answer whether a given word is English. This would be easy — you'd just need to look the word up in the dictionary — but there is an important restriction: your program must not be larger than 64 KiB. We don't think it's possible to write a program that fits in 64 KiB and always gives correct answers. But we don't require 100% correctness. For every program we will measure how often it answers correctly, and the most precise solution shall win.
I decided to spend some time on it to stretch my mind. Unfortunately, I wasn't on vacation, so my available time was pretty limited. It was fun anyway, even though plenty of people beat my result, and I found some weak spots in my approach to problem-solving. Here's what I did well and what I did badly.
Step 1 - The naive approach
The naive approach was to try different compression methods. I was sure the size limit had been chosen to rule out a "just compress it" solution, but in my mind it's better to spend a bit of time confirming that compression doesn't work than to spend even more time wondering "maybe I should try it." It also gives a rough estimate of how much data I'd have to shave off to fit everything into 64 KiB.
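For reference, the check itself is trivial: feed words.txt through a few standard compressors and compare the output to the 64 KiB budget. This is an illustrative Python sketch, not the tooling I actually used.

```python
# Rough size check: how small does the raw word list get with standard compressors?
import bz2
import lzma
import zlib

with open("words.txt", "rb") as f:   # the challenge dictionary
    data = f.read()

candidates = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lzma.compress,
}
for name, compress in candidates.items():
    size_kib = len(compress(data)) / 1024
    print(f"{name}: {size_kib:.0f} KiB (budget is 64 KiB)")
```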
Step 2 - Try them all
There are a lot of algorithms for building word classifiers, from the dumbest "return false;" rule to complex machine learning models. I tried the following:
- return false
- Bloom filter
- HyperLogLog
- n-grams
- neural network
- prefix tree
- heuristics
I decided to combine a Bloom filter with heuristics. What I like about the Bloom filter is its understandable behavior and its performance. Performance was crucial because my time was limited and I wanted fast feedback. On the other hand, thanks to the Bloom filter's simplicity, I could see what could be done to increase precision.
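To make the rest of the post concrete, here is a minimal Bloom filter sketch. It is not my contest code; the MD5-based double hashing and the parameters are illustrative choices.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k bit positions per word via double hashing."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word: str):
        digest = hashlib.md5(word.lower().encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1   # odd step avoids degenerate cycles
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, word: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(word))
```

The dictionary is far bigger than anything a 64 KiB filter can hold comfortably, so the per-word bit budget is tiny; that is why the raw filter tops out at around 60% and why all the work in the following steps matters.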
Step 3 - Adjust Bloom filter
The Bloom filter alone gives ~60% precision. The errors are hash collisions: the hashes of a string that is not a word land on the same bits as the hashes of real words. So, to reduce the error count, we need to reduce the chance of collisions. Below is a list of what, in my mind, can be done. Some of these ideas I had from the start; others came up during the process.
- Increase the filter size. The most obvious idea, but the global size limit doesn't allow simply making it bigger. The filter can grow only by reducing the code size or by compressing the serialized filter better.
- Modify the hash function. Different hash functions give different collision counts on the same data set. Also, most hash functions have an initial (seed) value that affects the collision count too.
- Add fewer words to the filter. If heuristics can decide whether a string is a word, the Bloom filter holds less data and therefore a lower chance of collision.
- Unset "bad" bits. Some bits are needed by just one word but cause a lot of collisions. Clearing a bit whose collision chance is higher than the chance of its word actually occurring adds false negatives but improves overall precision (see the sketch after this list).
Step 4 - Things are getting complicated
At first glance the solution seems easy: evaluate the hit/miss probability for each decision and then combine the best ones to get the best overall result. In practice, most decisions affect the others, so it becomes hard to estimate the combined effect of a change.
For example, say we add a rule that strings longer than 13 letters are not words. It increases code size (negative), increases false negatives (-), reduces false positives (+), and reduces the amount of data in the Bloom filter (+). So in many cases it's easier to just try a change and see whether it improves precision.
I made a wrong decision here. I relied mostly on my intuition and tested only the modifications that I thought should be improvements, so I missed the non-obvious ones. Investing some time early on in a system to brute force most of the parameters could have given me a much better result. I realized this only a couple of days before the deadline, so I partially brute forced the "bad" bits and the base value of the hash function, roughly as in the sketch below.
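The brute force itself is just a loop; this sketch assumes a hypothetical `build_filter(seed)` callback that rebuilds the Bloom filter with a given hash base value.

```python
def pick_best_seed(seeds, build_filter, sample_words, sample_non_words):
    """Try every candidate seed and keep the one with the best precision on a held-out sample."""
    def precision(bloom):
        hits = sum(1 for w in sample_words if w in bloom)
        hits += sum(1 for s in sample_non_words if s not in bloom)
        return hits / (len(sample_words) + len(sample_non_words))

    return max(seeds, key=lambda seed: precision(build_filter(seed)))


# e.g. best = pick_best_seed(range(256), build_filter, sample_words, sample_non_words)
```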
Step 5 - Heuristics
Some of the participants wrote up their approaches after the deadline, and a Bloom filter combined with a set of rules was by far the most common one. But even with the same overall idea the results differ significantly, so the heuristics are one of the most critical components.
You also need to revisit the rule set from time to time, because a rule that gave a decent precision boost at the start can make things worse in later stages. For example, a condition with a 24% error rate is great while your overall error rate is 40%, but it becomes a liability once you're down to 20% misses.
You can find the final set of rules in my submission.
Step 6 - A little bit of time and space left
It's the last day before the deadline, and I have a couple hundred bytes of free space. There was no live leaderboard, so I could only guess whether my result was somewhere in the top 10 or even the top 3. It would be a pity to lose by a hundredth of a percent, so I decided to pack in some more data, right up to the 64 KiB limit.
I wrote a program that generates a regex from a list of words and created regular expressions for the two- and three-letter words, then added as much of those expressions as the remaining space allowed. It doesn't give much of an improvement, but why waste space if you have no better use for it?
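The generator was simple; something along these lines (an illustrative reconstruction, not the original code) turns the short words into a compact alternation grouped by first letter.

```python
from collections import defaultdict
import re


def short_word_regex(words, length):
    """Build one regex matching exactly the given words of a fixed length."""
    by_first = defaultdict(list)
    for w in sorted({w.lower() for w in words}):
        if len(w) == length:
            by_first[w[0]].append(re.escape(w[1:]))
    parts = [f"{first}(?:{'|'.join(tails)})" for first, tails in sorted(by_first.items())]
    return re.compile(f"^(?:{'|'.join(parts)})$")


two_letter = short_word_regex(["an", "as", "at", "be", "by"], 2)
print(bool(two_letter.match("at")), bool(two_letter.match("zq")))   # True False
```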
Step 7 - The brilliant solution I missed
By the rules, the testing script calls the test function many times without relaunching the program, so it's possible to gather statistics about the input strings. The set of real words is limited while the set of non-words is practically unlimited, so if the test function receives the same input more than once, it's most likely a word, and precision improves as the testing set grows. It's a beautiful, mind-blowing idea, and I'm a bit sad that I didn't think about the rules hard enough to come up with it myself.
But this approach turned out to be unintended, so the final testing set contains only 1,000,000 words, which is too few for the learning trick to reach decent precision.
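Still, the trick itself fits in a few lines. This is my reconstruction of the idea, not code from any actual submission.

```python
from collections import Counter


class RepeatLearner:
    """Wraps a base classifier: strings seen more than once are assumed to be real words."""

    def __init__(self, base_classifier):
        self.seen = Counter()
        self.base_classifier = base_classifier   # e.g. the Bloom-filter-plus-heuristics test

    def is_word(self, s: str) -> bool:
        self.seen[s] += 1
        if self.seen[s] > 1:   # repeats are far more likely to come from the finite word list
            return True
        return self.base_classifier(s)
```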
Conclusion
I had a great time. Here are some lessons I learned:
- Investment in infrastructure and tools can save a lot of time later.
- Using your head is great, but for every subtask with a limited set of choices, you should consider solving it by brute force.
- Read every single line of the rules while thinking "how can I use this to win?" I was excited when I read the solution described in Step 7.
- Have fun. It lets you get the most out of the task.
I took 11th place out of 312 participants, just 0.05% away from the top 10.
Thanks to the Hola company for an excellent challenge.