Testing readability tests: Flesch–Kincaid, ARI, and Gunning Fog

The painting depicts a surgeon, wearing a funnel hat, removing the stone of madness from a patient’s head by trepanation.  An assistant, a monk bearing a tankard, stands nearby. Playing on the double-meaning of the word kei (stone or bulb), the stone appears as a flower bulb, while another flower rests on the table. A woman with a book balanced on her head looks on.
The Cure of Folly by Hieronymus Bosch (c. 1494–c. 1516, PUBLIC DOMAIN)

At the core of the most popular online editors are readability tests that were invented 60–80 years ago.

These UX writing and copywriting platforms use Flesch-Kincaid tests, the Gunning Fog Index, or the Automated Readability Index to measure whether a text is easy to read:

  • Readable.com
  • Hemingway Editor
  • Writer.com
  • Grammarly

Flesch-Kincaid tests (FK or FKT) and the Gunning Fog Index (GF or GFI): why do they sound like something from particle physics? Why does Hemingway Editor prefer the Automated Readability Index (ARI) to every other readability test?

Most importantly, are these tests even viable in 2023?

Let’s see what I’ve dug up.

A yellowish painting of different people sitting in the bus.
The Bus by Frida Kahlo (1929, FAIR USE for non-profit educational purpose)

My goal is to write in simple, accessible language. This is my mission. I dumb down the text for people who no longer read. This doesn’t mean I think my target audience is stupid. It’s just that they may not have all the time in the world to read something that requires extra effort.

Readability tests help me evaluate if my writing is really for everyone (as law professors say, “for everyone in the Clapham omnibus”). The goal of this article is to check how reliable these tests are.

In content that’s not life chronicles (e.g., the life and death of Marie Antoinette), we should mention as few dates as possible. It’s not a school essay—let’s get to the point.

As I understood from Dr. R. Flesch’s book, readability tests appeared so that newspapers, written by well-educated people in an intricate manner, could maximize their reach. See, for example, two different approaches to one news piece:

New York Times: His (Roosevelt’s) extraordinary action in going personally to Annapolis for this purpose will be interpreted everywhere as it was intended...

New York Daily News: Lord Halifax must have felt from Mr. Roosevelt’s manner of receiving him that the United States was saying to Great Britain, in our breezy American idiom, “Pal, the joint is yours.”
NYT and NYDN articles, excerpt from The Art of Readable Writing by Rudolf Flesch

Journalists were interested in making their text more readable. The more people understand the text, the more likely they are to buy newspapers, right?

Here’s the history of FKT in 2 sentences according to ChatGPT:

The Flesch-Kincaid tests were developed in the mid-20th century to assess the readability of written material. Rudolf Flesch introduced the first version of his formula in 1948, which was later expanded upon by J. Peter Kincaid in the 1970s.

ARI (from the 1960s) and GFI (from the 1950s) are also pretty old. That’s all you want to know about their history.

There are also similar readability tests, like the SMOG index, the Fry formula, and the Coleman-Liau index, which I may cover sometime in the future.

These readability tests can only be used to evaluate the simplicity of written English text, not speech (even transcribed speech). All the tests are fairly similar. You count the words, apply a certain formula, and get your text rated.

Flesch-Kincaid test

So what can you do to improve your writing…? What you really need is a good working knowledge of informal, everyday, practical English.
— R. Flesch

To use FKT, you count the number of words, sentences, and syllables in a sample of your writing. Then you apply a formula:

0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59

As a result, you get a grade score that correlates to how many years of education someone would need to understand the text. For example, Ernest Hemingway’s writing is readable for a fifth-grader, so his grade score would be 5. FKT also provides you with a reading ease score.
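Here's a minimal Python sketch of the grade formula above, together with the reading ease formula used later in this article. The function names are hypothetical, and the counts are assumed to have been made beforehand (counting syllables is the hard part, and this sketch doesn't attempt it).

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Made-up sample: 100 words, 5 sentences, 150 syllables
print(f"grade: {fk_grade(100, 5, 150):.2f}")
print(f"reading ease: {flesch_reading_ease(100, 5, 150):.2f}")
```

Both scores come from the same two ratios (words per sentence and syllables per word), so any disagreement about how to count words, sentences, or syllables moves both numbers at once.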

Automated Readability Index

To use ARI, you first count the number of characters, words, and sentences in a sample of your writing. Then you apply a formula:

4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43

As in FKT, you get a grade score.
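The ARI formula can be sketched the same way. One assumption to flag: "characters" is taken here to mean letters and digits only, without spaces or punctuation, which is how the formula is commonly read. The helper names are hypothetical.

```python
def count_ari_characters(text: str) -> int:
    """Count letters and digits only (assumption: no spaces, no punctuation)."""
    return sum(ch.isalnum() for ch in text)


def ari(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index from raw counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Made-up sample: 450 letters/digits, 100 words, 5 sentences
print(f"ARI: {ari(450, 100, 5):.2f}")
print(count_ari_characters("Hi, there 42!"))
```

Since no syllables are involved, ARI is the easiest of the three to automate, but the result depends heavily on what gets counted as a character.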

Gunning Fog Index

The index looks at two things: the length of the sentences and the difficulty of the words used. To use the GF, you first count the number of words and sentences in a sample. Then, you count the number of “complex” words (complex = words that have 3+ syllables, excluding proper nouns, compound words, and syllables added by common suffixes such as -ing). Finally, you apply a formula:

0.4 * ((words/sentences) + 100 * (complex words/words))

The index is a number between 0 and 20, with a higher number meaning your writing is more difficult to read.
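As a rough sketch, the Gunning Fog formula looks like this in Python. The function name is hypothetical, and deciding which words are "complex" is left to whoever does the counting.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: 0.4 times (average sentence length + percentage of complex words)."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Made-up sample: 100 words, 5 sentences, 10 complex words
print(f"GFI: {gunning_fog(100, 5, 10):.2f}")
```

Note that the 0.4 multiplies the whole sum, not just the sentence-length term.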

Let’s check my text with each of the readability tests.

This email with a reply to a user complaint will be our test subject:

Hello,

Thank you for contacting us! My name is Alex, and I will assist you with your complaint.

We are against all stereotypes, including gender-based ones, which is why there is no discrimination in the game. We are sorry to hear that the event made you feel this way. We value our players’ opinions and will ensure that your message reaches our development team, who do their best to make the game friendly and respectful to everybody.

Feel free to contact us if you have any further feedback or questions.

Enjoy your day!

Checking the text with FK

The challenge was to find a reliable online calculator to check my text using FKT. Each calculator provided a different result.

  1. TextCompare.org provided me with a number (72.23) but offered no explanation or conversion into a grade. I assume this is an FK reading ease score, but the calculator doesn’t clarify whether it’s good or bad.
My screenshot via textcompare.org

2. Good Calculators’ testing resulted in a detailed analysis but with a different reading ease score (68.2 compared to 72.23 from TextCompare). This platform evaluated my text as suitable for 8th or 9th-grade reading.

My screenshot via Flesch Kincaid Calculator | Good Calculators

3. Charactercalculator.com also provided a detailed analysis of my writing, with a reading ease score of 70.54, giving the text a 7th-grade rating.

My screenshot via Charactercalculator.com

4. Moving on to the “monster trucks” of readability platforms! Let’s check my FK grade with Readable.com:

My screenshot via readable.com

It’s different again… FK grade of 5.9 and reading ease of 73.4.

5. Let’s check my FK grade with Grammarly (thanks to my colleague Ed for lending me his Premium for a sec). Grammarly uses FKT, as mentioned in their blog.

My colleague’s screenshot via grammarly.com

Grammarly gives the text a readability score of 74 and a 7th-grade rating.

These fluctuations in FKT results on different platforms make me feel uneasy. The formula is clear, so the results with FKT must be the same. Not like this:

  • TextCompare.org: reading ease 72.23 (no grade given)
  • Good Calculators: reading ease 68.2, grade 6.6
  • Charactercalculator: reading ease 70.54, grade 6.72
  • Readable.com: reading ease 73.4, grade 5.9
  • Grammarly: reading ease 74, grade 7
  • Manual results: reading ease 70.6, grade 6.74

6. Let’s check the text manually with a calculator, using the formula above, and then double-check it with ChatGPT (I suck at math).

The text contains 7 sentences, 93 words, and 135 syllables. I’m not counting “Hello” at the beginning as a sentence, as it wouldn’t affect readability, in my opinion.

FK Grade = 0.39 * (93/7) + 11.8 * (135/93) - 15.59

FK Grade ≈ 6.74

Flesch Reading Ease = 206.835 - (1.015 * ASL) - (84.6 * ASW)

Where:

ASL: Average Sentence Length (total words in a text divided by the total number of sentences). ASL = 93/7 = 13.28

ASW: Average Syllables per Word (total syllables in a text divided by the total number of words). ASW = 135/93 = 1.45

206.835 - (13.47) - (122.67) = 70.6 Flesch Reading Ease

With “Hello” counted as an independent sentence: FK Grade ≈ 6th grade, Reading Ease ≈ 72.5. But I think it’s cheating.
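To double-check the arithmetic, here's a small Python sketch that plugs the counts above (93 words, 135 syllables) into both formulas, once with 7 sentences and once with 8 (“Hello” counted as its own sentence). The function names are hypothetical, and nothing is rounded until the final print.

```python
WORDS, SYLLABLES = 93, 135  # counts reported above


def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def reading_ease(words, sentences, syllables):
    """Flesch reading ease."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


for sentences in (7, 8):  # 8 = "Hello" counted as a separate sentence
    grade = fk_grade(WORDS, sentences, SYLLABLES)
    ease = reading_ease(WORDS, sentences, SYLLABLES)
    print(f"{sentences} sentences: grade {grade:.2f}, reading ease {ease:.2f}")
```

Without intermediate rounding, the 7-sentence case comes out at about 6.72 and 70.54 rather than the hand-rounded 6.74 and 70.6, and the 8-sentence case at about 6.07 and 72.23. Those figures line up with what Charactercalculator (70.54, 6.72) and TextCompare (72.23) reported, which may hint at how each of them counts sentences, though that's only a guess.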

My manual results: a reading ease score of 70.6 and a ~7th-grade level (grade 6.74).

I think I’m starting to see the problem…

Checking the text with ARI

Next, let’s check the email with ARI as it also gives a grade score.

  1. Hemingway Editor gave me a 6th-grade rating. They disclose that they use ARI in their FAQ section.
My screenshot via Hemingway Editor

2. Readable.com evaluated my text at 5.4.

My screenshot via Readable.com

3. TextCompare.org says my email has a 6.43-grade rating.

My screenshot via TextCompare.org

4. Online-Utility.org checked the text at 6.28.

My screenshot via Online-Utility.org

5. StoryToolz gave me 6.4.

My screenshot via StoryToolz.com

6. Let’s check my text manually according to the ARI formula…

The text contains 7 sentences, 93 words, and 524 characters. Again, I’m not counting “Hello” at the beginning as a sentence, as it wouldn’t affect readability, in my opinion… Although I’m starting to suspect it affects the tests.

ARI = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
ARI = 4.71 * (524/93) + 0.5 * (93/7) - 21.43 =…

11.7429. Wait. How? How do you convert it to a grade? Let me rummage through the internet some more…

ARI Grades Table via readable.com

So you’re telling me it’s an 11th-grade score? How? OK, I think I messed up. Or did I? I re-calculated the index 5 times.

I can see the downside of ARI already. It’s freaking c-o-m-p-l-i-c-a-t-e-d.

The results for ARI are:

  • Hemingway Editor: 6
  • Readable.com: 5.4
  • TextCompare.org: 6.43
  • Online-Utility.org: 6.28
  • StoryToolz: 6.4
  • Manual results: 11.7429 :(
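One possible explanation for the gap between the manual ARI and the platforms is the character count. ARI is commonly computed on letters and digits only, and a count that includes spaces and punctuation pushes the score up. The Python sketch below runs the test email through both readings. The counting rules (whitespace-separated words, sentences ending in ., ! or ?) are assumptions, not any platform's documented method.

```python
import re

EMAIL = """Hello,

Thank you for contacting us! My name is Alex, and I will assist you with your complaint.

We are against all stereotypes, including gender-based ones, which is why there is no discrimination in the game. We are sorry to hear that the event made you feel this way. We value our players’ opinions and will ensure that your message reaches our development team, who do their best to make the game friendly and respectful to everybody.

Feel free to contact us if you have any further feedback or questions.

Enjoy your day!"""


def ari(characters, words, sentences):
    """Automated Readability Index from raw counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


words = len(EMAIL.split())                           # whitespace-separated tokens
sentences = len(re.findall(r"[.!?]+", EMAIL))        # runs of end punctuation
letters_digits = sum(ch.isalnum() for ch in EMAIL)   # no spaces or punctuation

ari_letters = ari(letters_digits, words, sentences)
ari_524 = ari(524, words, sentences)                 # character count used above

print(f"words={words}, sentences={sentences}, letters/digits={letters_digits}")
print(f"ARI with letters and digits only: {ari_letters:.2f}")
print(f"ARI with 524 characters: {ari_524:.2f}")
```

With letters and digits only, the score lands at roughly 6.4, in the same range as the platforms above. With 524 characters it comes out near 11.75. That suggests, without proving it, that the 524 figure included spaces and punctuation.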

Checking the text with GF

Finally, let’s check the email with GFI, the Gunning Fog Index. The naming sounds the coolest so far. I’m prepared for the worst but hope for the best.

There weren’t that many quality calculators to choose from, so I decided to let ChatGPT-4 calculate the index as well.

Remember that with GFI, a higher number means your writing is more difficult to read.

  1. Readable.com, 8.1/20.
My screenshot via Readable.com

2. ChatGPT-4, 9.24/20. The Chat went crazy when counting words and sentences, so the index is very much incorrect. But it’s a good illustration of how a human being can also miss some words and end up with wrong results.

My screenshot via chat.openai.com

3. Gunning-fog-index.com, 8.755/20.

My screenshot via Gunning-fog-index.com

4. Character Calculator, 8.76/20.

My screenshot via Character Calculator

5. TextCompare.org, 9.19/20.

My screenshot via TextCompare.org

6. Manual results, 7.89.

The text contains 7 sentences, 93 words, and 6 complex words with 3+ syllables. I’m not counting the word “contacting” as you’re supposed to ignore common suffixes like -ing.

GFI = 0.4 * ((words/sentences) + 100 * (complex words/words))
GFI = 0.4 * ((93/7) + 100 * (6/93)) = 7.89
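A quick Python sketch shows how much a single borderline word matters here. It uses the formula from the Gunning Fog Index section (the 0.4 multiplies the whole sum) and the counts above, then repeats the calculation with “contacting” counted as complex. The function name is hypothetical.

```python
WORDS, SENTENCES = 93, 7  # counts reported above


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: 0.4 times (words per sentence + percent complex words)."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


without_contacting = gunning_fog(WORDS, SENTENCES, 6)  # "contacting" ignored
with_contacting = gunning_fog(WORDS, SENTENCES, 7)     # "contacting" counted

print(f"6 complex words: {without_contacting:.2f}")
print(f"7 complex words: {with_contacting:.2f}")
```

The two results are about 7.89 and 8.33, so one disputed word in a text this short shifts the index by more than 0.4 points.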

The total results for GFI are:

Platform and score (0–20):

  • Readable.com: 8.1
  • ChatGPT-4: 9.24
  • Gunning-fog-index.com: 8.755
  • Character Calculator: 8.76
  • TextCompare.org: 9.19
  • Manual results: 7.89

  1. Every platform that evaluates readability, such as Readable.com, Hemingway, and Grammarly, uses a readability test that’s quite old.
  2. While these tests employ precise mathematical formulas, the results still differ between platforms. I truly believe they should not.
  3. Different results are most likely caused by different ways the platforms count words, complex words, sentences, and syllables.
  4. It’s frustrating that readability-checking platforms, which are built upon these tests, cannot agree on a standard way to count words.
  5. Suppose readability-checkers make minor adjustments or round numbers for simplicity. Can it justify the differences in readability scores and grades? I guess it’s for you to judge.
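To illustrate the third point, here's a hedged Python sketch with two made-up sentence-counting rules applied to the opening of the test email. Neither rule is claimed to be what any named platform does. ARI is used because it needs no syllable counts.

```python
import re

SAMPLE = (
    "Hello,\n\n"
    "Thank you for contacting us! My name is Alex, "
    "and I will assist you with your complaint."
)


def ari(characters, words, sentences):
    """Automated Readability Index from raw counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def sentences_by_punctuation(text):
    """Rule A (hypothetical): a sentence ends only at ., ! or ?"""
    return len(re.findall(r"[.!?]+", text))


def sentences_by_punctuation_or_line(text):
    """Rule B (hypothetical): every non-empty line holds at least one sentence."""
    count = 0
    for line in text.splitlines():
        if line.strip():
            count += max(1, len(re.findall(r"[.!?]+", line)))
    return count


words = len(SAMPLE.split())
characters = sum(ch.isalnum() for ch in SAMPLE)

score_a = ari(characters, words, sentences_by_punctuation(SAMPLE))
score_b = ari(characters, words, sentences_by_punctuation_or_line(SAMPLE))

print(f"Rule A: {sentences_by_punctuation(SAMPLE)} sentences, ARI {score_a:.2f}")
print(f"Rule B: {sentences_by_punctuation_or_line(SAMPLE)} sentences, ARI {score_b:.2f}")
```

Same text, same formula, and the two rules still land 1.5 grade levels apart, purely because of how the greeting line is treated.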

Here are the results for all three tests on 5 different platforms each + my manual evaluation:

[Image: summary table of the results for all three tests on 5 platforms each, plus my manual evaluation, as shown in the previous screenshots]

It’s difficult to determine the absolute accuracy for any of these tests. However, we can assess the consistency of the results across different platforms (not taking my manual calculations into account).

For FKT, the grade level results range from 5.9 to 7, which is a difference of 1.1 grade levels. The readability score ranges from 68.2 to 74, a difference of 5.8 points. For ARI, the grade level results range from 5.4 to 6.43, which is a difference of 1.03 grade levels. For GFI, the platform scores range from 8.1 (Readable.com) to 9.24 (ChatGPT-4), a difference of 1.14 points on a 0–20 scale (1.35 if my manual 7.89 is included).

Based on this little consistency analysis, the ARI test has the least amount of fluctuation between platforms. So, I guess you can call ARI the most stable. That’s probably why the Hemingway platform uses it.
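The consistency check can be reproduced with a few lines of Python. The numbers are the platform results quoted in this article, with the manual calculations left out. The variable and function names are hypothetical.

```python
# Platform results quoted in this article (manual calculations left out)
results = {
    "FK grade": [6.6, 6.72, 5.9, 7],
    "FK reading ease": [72.23, 68.2, 70.54, 73.4, 74],
    "ARI grade": [6, 5.4, 6.43, 6.28, 6.4],
    "GFI": [8.1, 9.24, 8.755, 8.76, 9.19],
}


def spread(values):
    """Difference between the highest and the lowest score, rounded to 2 places."""
    return round(max(values) - min(values), 2)


for test_name, values in results.items():
    print(f"{test_name}: {min(values)} to {max(values)}, spread {spread(values)}")
```

On these figures the spreads are 1.1 (FK grade), 5.8 (FK reading ease), 1.03 (ARI), and 1.14 (GFI), so ARI still shows the narrowest range. Keep in mind that reading ease and GFI use different scales than the grade scores, so the spreads are only loosely comparable.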

Am I pissed that all ultra-popular readability platforms provide different results for tests based on precise formulas? Yess.

Am I gonna use those platforms anyway? Yess.

What choice do I have? I need some sort of assessment for my writing, even if it’s slightly different from the original test.

But let’s discuss why I think the problem lies within the tests, not platforms. These tests have many imperfections:

  1. Look, Freud had an obsession with sex. FKT, ARI, and GFI obsess over sentence length and word difficulty. If we obsess over numerical scores, we risk overlooking the complexity and uniqueness of the ideas. Marcel Proust would fail all of these tests miserably.
  2. These tests don’t account for other factors, such as paragraph structure, coherence, and cohesion.
  3. These tests’ formulas don’t work for non-English languages.
  4. We’ve become too obsessed with simplicity. I think we need to focus on the value more. Everyone thinks that the shorter the text in the UI, the better. Right? For example, here are my 2 texts from today:
1st screenshot: “Streaming in {Country_Name}? No VPN required for our mirror site.” 2nd screenshot: “Streaming in {Country_Name}? Use our official mirror site for smooth experience without VPN”.

One projects the value of the mirror website (smooth streaming experience), and another is just… short and looks better on mobile. Guess which one was accepted?

5. The tests are unreliable with short texts. People didn’t really have “buttons” back in the 40s. Measuring sentence length doesn’t make sense in these cases.

6. Most importantly, formulas can’t be a substitute for human judgment. It’s really nice and useful when Hemingway highlights my long sentences. But every time I need to think twice before cutting them short.
