The lousy powers behind Hemingway Editor, Grammarly, and Readable.com.
At the core of the most popular online editors are readability tests that were invented 60–80 years ago.
These UX writing and copywriting platforms use Flesch-Kincaid tests, the Gunning Fog Index, or the Automated Readability Index to measure whether the text is easy to read:
- Hemingway Editor
Flesch-Kincaid tests (FK or FKT) and the Gunning Fog Index (GF or GFI): why do they sound like something from particle physics? Why does Hemingway Editor prefer the Automated Readability Index (ARI) to every other readability test?
Most importantly, are these tests even viable in 2023?
Let’s see what I’ve dug up.
My goal is to write in simple, accessible language. This is my mission. I dumb down the text for people who no longer read. This doesn’t mean I think my target audience is stupid. It’s just that they may not have all the time in the world to read something that requires extra effort.
Readability tests help me evaluate if my writing is really for everyone (as law professors say, “for everyone in the Clapham omnibus”). The goal of this article is to check how reliable these tests are.
In content that’s not life chronicles (e.g., the life and death of Marie Antoinette), we should mention as few dates as possible. It’s not a school essay—let’s get to the point.
As I understood from Dr. R. Flesch’s book, readability tests appeared so that newspapers, written by well-educated people in an intricate manner, could maximize their reach. See, for example, two different approaches to one news piece:
Journalists were interested in making their text more readable. The more people understand the text, the more likely they are to buy newspapers, right?
Here’s the history of FKT in 2 sentences according to ChatGPT:
The Flesch-Kincaid tests were developed in the mid-20th century to assess the readability of written material. Rudolf Flesch introduced the first version of his formula in 1948, which was later expanded upon by J. Peter Kincaid in the 1970s.
ARI (from the 1960s) and GFI (from the 1950s) are also pretty old. That’s all you want to know about their history.
There are also similar readability tests like the SMOG index, Fry formula, and the Coleman-Liau index which I may cover sometime in the future.
These readability tests can only be used to evaluate the simplicity of written English text, not speech (even transcribed speech). All tests are fairly similar. You count the words, apply a certain formula, and get your text rated.
So what can you do to improve your writing…? What you really need is a good working knowledge of informal, everyday, practical English.
— R. Flesch
To use FKT, you count the number of words, sentences, and syllables in a sample of your writing. Then you apply a formula:
0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
As a result, you get a grade score that correlates to how many years of education someone would need to understand the text. For example, Ernest Hemingway’s writing is readable for a fifth-grader, so his grade score would be 5. FKT also provides you with a reading ease score.
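The grade formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code any of the platforms run: the function name and the sample counts (100 words, 5 sentences, 140 syllables) are made up for the example.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical sample, not taken from the article: 100 words, 5 sentences, 140 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=140)
print(f"FK grade: {grade:.2f}")  # about 8.73
```

The arithmetic is the easy part. The counts that go in are where the judgment calls hide, especially for syllables.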
Automated Readability Index
To use ARI, you first count the number of characters, words, and sentences in a sample of your writing. Then you apply a formula:
4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
As in FKT, you get a grade score.
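As a rough sketch, the ARI formula looks like this in Python. The function name and the sample counts are assumptions for illustration. Note that "characters" in ARI conventionally means letters and digits, not spaces or punctuation.

```python
def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """ARI computed from raw counts; `characters` should be letters and digits only."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical sample, not taken from the article: 450 letters, 100 words, 5 sentences.
ari = automated_readability_index(characters=450, words=100, sentences=5)
print(f"ARI: {ari:.2f}")
```

Because ARI needs no syllable counting, it is the easiest of the three tests to automate.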
Gunning Fog Index
The index looks at two things: the length of the sentences and the difficulty of the words used. To use the GF, you first count the number of words and sentences in a sample. Then, you count the number of “complex” words (complex = words that have 3+ syllables, excluding proper nouns, compound words, and words that only reach three syllables through common suffixes such as -ing, -ed, or -es). Finally, you apply a formula:
0.4 * ((words/sentences) + 100 * (complex words/words))
The index typically lands between 0 and 20, with a higher number meaning your writing is more difficult to read.
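Here is a minimal Python sketch of the Gunning Fog formula, assuming the complex words have already been counted by hand. The function name and sample counts are hypothetical.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index from raw counts.

    `complex_words` is assumed to be counted already (3+ syllables,
    with the exclusions described above applied by the caller).
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # The 0.4 factor applies to the whole sum, not just the first term.
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical sample, not taken from the article: 100 words, 5 sentences, 10 complex words.
fog = gunning_fog(words=100, sentences=5, complex_words=10)
print(f"Gunning Fog: {fog:.2f}")  # 12.00
```

The outer parentheses matter: applying 0.4 only to the sentence-length term gives a much higher number.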
Let’s check my text with each of the readability tests.
This email with a reply to a user complaint will be our test subject:
Thank you for contacting us! My name is Alex, and I will assist you with your complaint.
We are against all stereotypes, including gender-based ones, which is why there is no discrimination in the game. We are sorry to hear that the event made you feel this way. We value our players’ opinions and will ensure that your message reaches our development team, who do their best to make the game friendly and respectful to everybody.
Feel free to contact us if you have any further feedback or questions.
Enjoy your day!
Checking the text with FK
The challenge was to find a reliable online calculator to check my text using FKT. Each calculator provided a different result.
1. TextCompare.org provided me with a number (72.23) but offered no explanation or conversion into a grade. I assume this is an FK reading ease score, but the calculator doesn’t clarify whether it’s good or bad.
2. Good Calculators’ testing resulted in a detailed analysis but with a different reading ease score (68.2 compared to 72.23 from TextCompare). This platform evaluated my text as suitable for 8th or 9th-grade reading.
3. Charactercalculator.com also provided a detailed analysis of my writing with a reading ease score of 70.54, giving the text a 7th-grade rating.
4. Moving on to the “monster trucks” of readability platforms! Let’s check my FK grade with Readable.com:
It’s different again… FK grade of 5.9 and reading ease of 73.4.
5. Let’s check my FK grade with Grammarly (thanks to my colleague Ed for lending me his Premium for a sec). Grammarly uses FKT, as mentioned in their blog.
Grammarly gives the text a readability score of 74 and a 7th-grade rating.
These fluctuations in FKT results on different platforms make me feel uneasy. The formula is clear, so the results with FKT must be the same. Not like this:
6. Let’s check the text manually with a calculator by the formula above and then double-check it with ChatGPT (I suck at Math).
The text contains 7 sentences, 93 words, and 135 syllables. I’m not counting “Hello” at the beginning as a sentence, as it wouldn’t affect readability, in my opinion.
FK Grade = 0.39 * (93/7) + 11.8 * (135/93) - 15.59
FK Grade ≈ 6.72
Flesch Reading Ease = 206.835 - (1.015 * ASL) - (84.6 * ASW)
ASL: Average Sentence Length (total words in a text divided by the total number of sentences). ASL = 93/7 ≈ 13.286
ASW: Average Syllables per Word (total syllables in a text divided by the total number of words). ASW = 135/93 ≈ 1.4516
206.835 - (13.485) - (122.806) ≈ 70.54 Flesch Reading Ease
With “Hello” counted as an independent sentence: FK Grade ≈ 6th grade, Reading Ease ≈ 72.5. But I think it’s cheating.
My manual results: a reading ease score of about 70.5 and ~7th-grade level.
I think I’m starting to see the problem…
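The manual calculation can be double-checked with a short Python sketch that uses the counts stated above (7 sentences, 93 words, 135 syllables). Rounding ASL and ASW before plugging them in can shift the last digit, so this version keeps full precision. The variable names are just for illustration.

```python
# Counts taken from the manual tally in the article.
WORDS, SENTENCES, SYLLABLES = 93, 7, 135

asl = WORDS / SENTENCES   # average sentence length
asw = SYLLABLES / WORDS   # average syllables per word

fk_grade = 0.39 * asl + 11.8 * asw - 15.59
reading_ease = 206.835 - 1.015 * asl - 84.6 * asw

print(f"ASL = {asl:.3f}, ASW = {asw:.4f}")
print(f"FK grade = {fk_grade:.2f}")          # about 6.72
print(f"Reading ease = {reading_ease:.2f}")  # about 70.54
```

With unrounded intermediates, the reading ease comes out at about 70.54, which happens to match the figure Charactercalculator.com reported.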
Checking the text with ARI
Next, let’s check the email with ARI as it also gives a grade score.
1. Hemingway Editor gave me a 6th-grade rating. They disclose that they use ARI in their FAQ section.
2. Readable.com evaluated my text at 5.4.
3. TextCompare.org says my email has a 6.43-grade rating.
4. Online-Utility.org checked the text at 6.28.
5. StoryToolz gave me 6.4.
6. Let’s check my text manually according to the ARI formula…
The text contains 7 sentences, 93 words, and 524 characters. Again, I’m not counting “Hello” at the beginning as a sentence, as it wouldn’t affect readability, in my opinion… Although I’m starting to suspect it affects the tests.
ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
ARI = 4.71 * (524/93) + 0.5 * (93/7) - 21.43 =…
11.75. Wait. How? How do you convert it to a grade? Let me rummage through the internet some more…
So you’re telling me it’s an 11th-grade score? How? OK, I think I messed up. Or did I? I re-calculated the index 5 times.
I can see the downside of ARI already. It’s freaking c-o-m-p-l-i-c-a-t-e-d.
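A possible explanation for the 11th-grade result, offered as an assumption rather than a confirmed diagnosis: ARI's character count conventionally covers letters and digits only. If the 524 figure includes spaces and punctuation, the score is inflated. The sketch below recounts the email as it is shown above (without the "Hello" greeting), using deliberately naive rules. The function names are made up for the example.

```python
import re

# The test email as it appears in the article (greeting line not included).
EMAIL = (
    "Thank you for contacting us! My name is Alex, and I will assist you with your complaint.\n"
    "We are against all stereotypes, including gender-based ones, which is why there is no "
    "discrimination in the game. We are sorry to hear that the event made you feel this way. "
    "We value our players’ opinions and will ensure that your message reaches our development "
    "team, who do their best to make the game friendly and respectful to everybody.\n"
    "Feel free to contact us if you have any further feedback or questions.\n"
    "Enjoy your day!"
)


def ari(characters: int, words: int, sentences: int) -> float:
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def naive_counts(text: str) -> tuple[int, int, int]:
    """Letters/digits, whitespace-separated words, and runs of . ! ? as sentence ends."""
    letters = sum(ch.isalnum() for ch in text)
    words = len(text.split())
    sentences = len(re.findall(r"[.!?]+", text))
    return letters, words, sentences


letters, words, sentences = naive_counts(EMAIL)
print(f"all characters: {len(EMAIL)}, letters and digits only: {letters}")
print(f"ARI, letters only: {ari(letters, words, sentences):.2f}")
print(f"ARI, article's counts (524, 93, 7): {ari(524, 93, 7):.2f}")
```

With letters only, the score lands near the 6th-grade range the platforms report. That suggests the formula itself is fine and the character count was the culprit.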
The results for ARI are:
Checking the text with GF
Finally, let’s check the email with GFI, the Gunning Fog Index. The naming sounds the coolest so far. I’m prepared for the worst but hope for the best.
There weren’t that many quality calculators to choose from, so I also let ChatGPT-4 calculate the index.
Remember that with GFI, a higher number means your writing is more difficult to read.
1. Readable.com, 8.1/20.
2. ChatGPT-4, 9.24/20. The Chat went crazy when counting words and sentences, so the index is very much incorrect. But it’s a good illustration of how a human being can, too, miss out on some words, ending up with wrong results.
3. Gunning-fog-index.com, 8.755/20.
4. Character Calculator, 8.76/20.
5. TextCompare.org, 9.19/20.
6. Manual results, 7.89.
The text contains 7 sentences, 93 words, and 6 complex words with 3+ syllables. I’m not counting the word “contacting” as you’re supposed to ignore words with common suffixes like -ing.
GFI = 0.4 * ((words/sentences) + 100 * (complex words/words))
GFI = 0.4 * ((93/7) + 100 * (6/93)) = 7.89
The total results for GFI are:
- Every platform that evaluates readability, such as Readable.com, Hemingway, and Grammarly, uses a readability test that’s quite old.
- While these tests employ precise mathematical formulas, the results still differ between platforms. I truly believe they should not.
- Different results are most likely caused by different ways the platforms count words, complex words, sentences, and syllables.
- It’s frustrating that readability-checking platforms, which are built upon these tests, cannot agree on a standard way to count words.
- Suppose readability checkers make minor adjustments or round numbers for simplicity. Can that justify the differences in readability scores and grades? I guess it’s for you to judge.
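To illustrate how counting rules alone can move a score, here is a small Python sketch with two naive syllable heuristics applied to one sentence from the test email. Both heuristics are invented for this example. Nothing here claims to reproduce how any particular platform counts.

```python
import re


def syllables_vowel_groups(word: str) -> int:
    """Heuristic A: every run of vowels (a, e, i, o, u, y) is one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def syllables_drop_silent_e(word: str) -> int:
    """Heuristic B: like A, but ignore a trailing 'e' (except in '-le' endings)."""
    w = word.lower()
    if len(w) > 2 and w.endswith("e") and not w.endswith("le"):
        w = w[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", w)))


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


SENTENCE = "We are sorry to hear that the event made you feel this way."
tokens = re.findall(r"[A-Za-z]+", SENTENCE)

count_a = sum(syllables_vowel_groups(t) for t in tokens)
count_b = sum(syllables_drop_silent_e(t) for t in tokens)
grade_a = fk_grade(len(tokens), 1, count_a)
grade_b = fk_grade(len(tokens), 1, count_b)

print(f"Heuristic A: {count_a} syllables, FK grade {grade_a:.2f}")
print(f"Heuristic B: {count_b} syllables, FK grade {grade_b:.2f}")
```

On this single sentence the two heuristics report 17 and 15 syllables, which moves the FK grade by almost two levels. Both were fed into the exact same formula.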
Here are the results for all three tests on 5 different platforms each + my manual evaluation:
It’s difficult to determine the absolute accuracy for any of these tests. However, we can assess the consistency of the results across different platforms (not taking my manual calculations into account).
For FKT, the grade level results range from 5.9 to 7, a difference of 1.1 grade levels, and the reading ease score ranges from 68.2 to 74, a difference of 5.8 points. For ARI, the grade level results range from 5.4 to 6.43, a difference of 1.03 grade levels. For GFI, the platform scores range from 8.1 (Readable.com) to 9.24 (ChatGPT-4), a difference of 1.14 points on a 0–20 scale; counting my manual 7.89 would widen that to 1.35.
Based on this little consistency analysis, the ARI test has the least amount of fluctuation between platforms. So, I guess you can call ARI the most stable. That’s probably why Hemingway Editor uses it.
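The spread per test can be recomputed from the platform figures quoted earlier in the article. This sketch leaves out the manual results and uses the numeric scores only, so the FK row is the reading ease score (a 0–100 scale) and is not directly comparable to the two grade-style rows. The dictionary layout is just one way to organize it.

```python
# Platform results as quoted in the article (manual calculations excluded).
scores = {
    "Flesch reading ease": [72.23, 68.2, 70.54, 73.4, 74],
    "ARI grade": [6, 5.4, 6.43, 6.28, 6.4],
    "Gunning Fog": [8.1, 9.24, 8.755, 8.76, 9.19],
}


def spread(values: list[float]) -> float:
    """Difference between the highest and lowest reported score."""
    return max(values) - min(values)


for name, values in scores.items():
    print(f"{name}: min {min(values)}, max {max(values)}, spread {spread(values):.2f}")
```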
Am I pissed that all ultra-popular readability platforms provide different results for tests based on precise formulas? Yess.
Am I gonna use those platforms anyway? Yess.
What choice do I have? I need some sort of assessment for my writing, even if it’s slightly different from the original test.
But let’s discuss why I think the problem lies within the tests, not platforms. These tests have many imperfections:
1. Look, Freud had an obsession with sex. FKT, ARI, and GFI obsess over sentence length and word difficulty. If we obsess over numerical scores, we risk overlooking the complexity and uniqueness of the ideas. Marcel Proust would fail all of these tests miserably.
2. These tests don’t account for other factors, such as paragraph structure, coherence, and cohesion.
3. These tests’ formulas don’t work for non-English languages.
4. We’ve become too obsessed with simplicity. I think we need to focus on the value more. Everyone thinks that the shorter the text in the UI, the better. Right? For example, here are my 2 texts from today:
One projects the value of the mirror website (smooth streaming experience), and another is just… short and looks better on mobile. Guess which one was accepted?
5. The tests are unreliable with short texts. People didn’t really have “buttons” back in the 40s. Measuring sentence length doesn’t make sense in these cases.
6. Most importantly, formulas can’t be a substitute for human judgment. It’s really nice and useful when Hemingway highlights my long sentences. But every time I need to think twice before cutting them short.