The problem that the authors are trying to tackle is an interesting and difficult one.
I have noticed that when computer scientists apply artificial intelligence / machine learning techniques to natural language, their work would very often have benefited greatly from collaboration with a linguist or other language expert, especially in the design phase of an experiment. This work is a good example of what I mean.
People trained in CS or similar precise fields develop, over time, a tendency to think in terms of "getting the right result" (I say this as one of these people). When dealing with natural language, however, sometimes there simply is no single correct result.
Consider the topic of fluency that the authors work on: is there a rigorous, objective definition of "fluent"? The answer, as any linguist would tell you, is "no". There are idiomatic expressions, grammatical structures, contractions, slang, and so on that vary from city to city within a country, let alone globally. What may sound "fluent" to one native speaker of a language may sound strange to another. It is impossible to evaluate "fluency" objectively in any general sense. In particular, any practicing linguist will be able to give examples, likely off the top of their head, of English sentences that would be rated as "fluent" by someone from one geographical area and "awkward" by someone from another.
Furthermore, using Mechanical Turk to find humans to rate the fluency of a particular sentence makes for an unclean dataset and evaluation benchmark. The linked post says that, in the end, 50 people found via Mechanical Turk rated sentences for fluency; since any one language is used quite differently around the globe, a panel of just 50 raters will produce an unpredictable range of fluency ratings for at least some sentences. Choosing a different 50 people to rate the same sentences would most likely result in different fluency ratings.
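To make that last point concrete, here is a rough sketch of how panel composition alone can move an average fluency score. The numbers are entirely invented for illustration and have nothing to do with the article's actual data:

    import random

    random.seed(0)

    # Hypothetical 1-5 fluency ratings for the SAME sentence from two imagined
    # regional groups of raters who react differently to a regional idiom.
    region_a = [5, 5, 4, 5, 4] * 20   # 100 raters who find the phrasing natural
    region_b = [3, 2, 3, 4, 2] * 20   # 100 raters who find it awkward
    pool = region_a + region_b

    panel_means = []
    for _ in range(1000):
        panel = random.sample(pool, 50)      # draw a fresh 50-person panel
        panel_means.append(sum(panel) / 50)

    print(f"lowest panel mean:  {min(panel_means):.2f}")
    print(f"highest panel mean: {max(panel_means):.2f}")

The gap between those two numbers, for one and the same sentence, is all that "a different 50 people would give different fluency ratings" means.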
I do not mean to detract from the authors' work; this is a difficult problem to tackle, with no clear path to a general solution. However, I am forced to wonder why the authors, who, judging by the biographies linked in the article, seem to have a range of experience, did not comment on the considerations I've mentioned here.
Maybe they went to ten linguists, and all they got as an answer was "there is no objective definition of 'fluent'. You are trying to find a single correct result that doesn't exist!"
Then, armed with the naiveté of thinking that, if there is something like 'fluency', it must be possible to measure it, they just threw a bit of money at the problem. Note that asking a representative group of people is the closest you can get to exactly what you want to measure (apart from asking everyone). It doesn't matter that there's no agreed-upon method to measure the quality of pizza: if I maximise the subjective impression of it, I'll get exactly what I wanted.
Using Mechanical Turkers to rate fluency would arguably be an even more dubious evaluation benchmark if more rigorous and consistent standards for fluent English existed; people regarded as having good writing and editing skills can find better-paying sources of part-time remote work than AMT. Some of the examples of human editing shown in the blog entry certainly don't look fluent to me...