Under the Hood – The Unique Tech Powering Public Editor


By Nick Adams, PhD 

Assessing, labeling, and scoring the credibility of the content within news articles at a scale of hundreds or more per day –– hereafter referred to as “our news annotation job” –– implies that either (A) machines will label the articles, or (B) a large pool of humans will label the articles. This post explains why machine labeling is infeasible with current technology and why approaches using annotation tools other than TagWorks (the Internet-based collaborative annotation technology underlying Public Editor) are also infeasible. It will close with evidence that, in fact, no other tool or project is even attempting the specific labeling of news article content that Public Editor provides.

This blog post will support the following five claims:

(1) Current automated natural language processing technologies are inadequate for our news annotation job. They are not capable of identifying the argumentative fallacies, inferential mistakes, cognitive biases, or issues of language and tone identified by Public Editor, nor are they capable of reliably identifying formal elements of articles labeled by Public Editor, like separate arguments, their evidentiary support, or quotes and paraphrases of journalists’ sources.

(2) Other human-driven content analysis tools are inadequate for our news annotation job. They were designed either for use by experts (and are not technically or operationally scalable to our annotation job) or for online workers completing simple, rote jobs like categorizing text blocks describing products as ‘furniture’ or ‘electronics.’ Only TagWorks (which powers Public Editor) manages content analysis assembly lines capable of producing large, intricately labeled, expert-grade datasets without requiring an unfeasibly large and expensive expert workforce.

(3) No other project (unless it is very secretive) is even attempting to richly label the content within news articles at scale.

(4) Public Editor’s set of credibility indicators is large, intricate, specific, and ideal for measuring content credibility. Developed by a team of PhDs, including a Nobel Laureate, cognitive scientists, epistemologists, journalists, sociologists, and science educators, Public Editor’s credibility indicators have been distilled from centuries of progressively improving scientific methods designed to critically assess the credibility of claims. Because PE’s credibility indicators are applied at such a granular level, they can also be used to assess credibility at multiple higher levels of target content. The PE system labels and scores specific words and phrases. These labels are summed to score articles. Article scores are averaged to produce scores for the journalists and news outlets who publish them. And public figures’ reported statements are aggregated and used to score their credibility.

(5) Public Editor’s output labels are ideal for supervised machine learning approaches aiming to build AI text classifiers that can label news article content automatically. Moreover, they provide valuable additional traction for supervised ML approaches seeking to score the credibility of articles or, more simply (as is the current state of the art), label articles as ‘fake’ or ‘real.’


1. Automated natural language processing techniques (NLP) are not sufficient for our annotation job. 

Computers were built to compute unambiguous numeric, ordinal, and categorical data. String data, i.e. natural language data, have always been the most difficult for computers to process. This section describes a fairly comprehensive set of NLP approaches and explains why they are inadequate for our job. Readers might also note that if sufficient automated methods were ready to deploy, the multiple labs working on this problem since 2016 would already have developed systems using them to label misinformation as Public Editor does.

In general, automated NLP approaches begin by ‘tokenizing’ language into its elemental characters or words. Early and still-used approaches then count and compare tokens across document segments or files (as is typical of ‘bag of words’ approaches) to generate some understanding of each document’s absolute and comparative content. Sentiment analyzers often compare documents’ token counts to dictionaries (i.e. lists) of words/tokens that are indicative of human emotions. This simple ‘dictionary approach’ is the closest that any NLP technique comes to being able to label and score the content within news articles. It is somewhat effective for identifying hate speech indicated by a dictionary of derogatory words, but this method is of little utility for identifying argumentative fallacies, cognitive biases, inferential mistakes, and other issues of language or tone.
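
To make the mechanics concrete, here is a minimal sketch of the dictionary approach in Python. The word lists and example sentence are invented for illustration; production systems use far larger lexicons.

```python
from collections import Counter
import re

# A toy "dictionary approach": count tokens that appear in small, hand-built
# lexicons. Real systems use much larger lexicons; these word lists are
# illustrative only.
DEROGATORY = {"idiot", "moron", "scum"}
POSITIVE = {"great", "honest", "fair"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def dictionary_scores(text):
    counts = Counter(tokenize(text))
    return {
        "derogatory_hits": sum(counts[w] for w in DEROGATORY),
        "positive_hits": sum(counts[w] for w in POSITIVE),
        "total_tokens": sum(counts.values()),
    }

print(dictionary_scores("Only an idiot would believe this, but the hearing was fair."))
```

Counting lexicon hits can flag strong language, but nothing in this procedure can recognize a straw man fallacy or a misused statistic.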

Token counts can also be used in clustering approaches (a form of unsupervised ML), which group together documents composed of similar tokens. Multi-level clustering approaches like topic modeling (Latent Dirichlet Allocation, or LDA) even allow a machine to find clusters of semantic topics (e.g. the weather, or politics) that occur within and across documents, simultaneously modeling the list of tokens that compose each topic and the list of topics that compose each document. These methods are of obvious value to researchers who want to quickly summarize the content of a document set and of the individual documents within it. But the method does not reliably and unambiguously label the topic membership of any particular word in a document, nor does it allow researchers to define topics a priori. For these reasons, the method is not suitable for our annotation job. Moreover, topic models only ‘discover’ higher-order meaning around their tokens (in most applications, individual words). They are incapable of addressing, evaluating, labeling, and scoring complex semantic units like chains of reasoning.
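
For readers who want to see what topic modeling involves in practice, here is a minimal LDA sketch using scikit-learn. The toy corpus is invented, and the point to notice is that the output is lists of co-occurring words, not judgments about reasoning.

```python
# A minimal topic-modeling sketch using scikit-learn's LDA implementation.
# The corpus here is illustrative; real applications use thousands of documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "Heavy rain and wind are forecast for the coast this weekend.",
    "The senate vote on the budget bill was delayed again.",
    "Flooding closed roads after the storm moved inland.",
    "Lawmakers debated the new election rules late into the night.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)               # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top_words = [terms[i] for i in component.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")
# LDA discovers clusters of co-occurring words; it says nothing about whether
# any particular sentence commits a fallacy or misuses evidence.
```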

Document parsers provide another pseudo-automated approach to natural language data that could seem rather applicable to our annotation job. Stanford CoreNLP has collected a number of document parsers able to accomplish rather impressive feats like identifying and labeling parts of speech and named entities (like people or locations), and even resolving co-references, i.e. replacing pronouns like they, he, and she with the names of the people or entities to which they refer. But these parsers are only pseudo-automated. Unlike the unsupervised clustering approaches described above (which, with only a few lines of code, can model language across vast document sets without human input), these parsers are often created through processes requiring a lot of human input, and they are not so broadly applicable.

Grammar parsers (sometimes called part-of-speech or POS taggers) were trained through a supervised machine learning process similar to that pursued by the Public Editor team. Many people spent thousands of hours labeling the part of speech (i.e. noun, verb, subject, object, causal complement, etc.) of each word in a vast set of documents before those labels were used to train the “AI” that is now referred to as a POS tagger. There was no clever algorithm (as in the case of LDA) that learned parts of speech simply through iteration. (Some versions of POS taggers called ‘dependency parsers’ also include model elements representing the dependencies of words within sentences to gain even higher accuracy against POS tagging benchmarks.) Other parsers, like named-entity taggers, are based on humans amassing large dictionaries of people, places, and organizations, then applying a dictionary approach like that used by the sentiment analyzers. (Sometimes named-entity parsers use a combination of dictionary and supervised ML approaches so that computers can learn to identify even novel people by features like capitalized tokens that match other known names.) And finally, some parsers, like coreference resolution parsers, are created via the clever programming of rules. For instance, a simple (and probably poorly performing) coreference resolution parser would find any named entity in a document and then replace the immediately following pronouns with the name of that entity until it finds another named entity. Then it would replace the pronouns following the second named entity with its name, and so on. These rule-based parsers can be built to do narrowly defined tasks based on reliable structures in documents. But there is no parsimonious set of rules for reliably identifying fallacies like the ignoring of selection effects, or psychological bias creeping into an author’s writing, nor for the dozens of other credibility indicators identified and assessed by the PE system.
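
The naive coreference rule described above is simple enough to sketch directly. The example below uses spaCy’s named-entity tagger as a stand-in (an assumed tooling choice; Stanford CoreNLP would serve equally well) and replaces each pronoun with the most recently seen person entity.

```python
# A toy version of the naive rule-based coreference resolver described above:
# replace each pronoun with the most recently seen PERSON entity.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
PRONOUNS = {"he", "she", "they", "him", "her", "them"}

def naive_coref(text):
    doc = nlp(text)
    last_person = None
    out = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            last_person = token.text   # crude: keeps only one token of the name
            out.append(token.text)
        elif token.text.lower() in PRONOUNS and last_person:
            out.append(last_person)    # substitute the most recent person
        else:
            out.append(token.text)
    return " ".join(out)               # crude spacing around punctuation

print(naive_coref("Senator Jones spoke first. She said the bill would fail."))
```

Rules like this work because pronouns and named entities follow fairly reliable patterns; no comparably compact rule set exists for spotting a neglected selection effect or an author’s creeping bias.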

There are other NLP approaches with impressive ambitions that receive a fair amount of attention in the popular scientific press. While they are not applicable to our annotation job, readers may wish to understand why that is the case. Word vectorization models are used to map the relationships of words to other words based on their co-occurrences across very large document sets. These models are rather impressive in their ability to demonstrate multiple words’ relationships to one another, as in the classic vector arithmetic ‘king’ − ‘man’ + ‘woman’ ≈ ‘queen.’ But they can’t be used to evaluate the reasoning chain of entire sentences (or paragraphs) to determine their credibility. Long short-term memory (LSTM) approaches provide an alternative to computing natural language data as counts of tokens within documents (often called the ‘bag of words’ approach). Better mimicking human reading, LSTM approaches attempt to process tokens in sequence, holding onto multiple words (in the model’s short-term memory) before fully computing some representation of the words’ semantics, usually at the end of a sentence. The approach is still in its toddlerhood and could hold promise, especially when tokens are represented as positions in a word vectorization that better contextualizes their semantics. While these two approaches to language –– particularly in combination –– could become valuable, they are not immediately applicable to our, or any, annotation job. They are better understood as feature engineering steps of a machine learning system producing a task-specific NLP-AI application. That is, they could help a supervised text classification model converge faster and perform better against the sort of benchmark proposed by the Public Editor team.
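
The vector-arithmetic demonstration is easy to reproduce with pre-trained embeddings. The sketch below assumes gensim and one of its downloadable GloVe models; it shows word-level relationships and nothing more.

```python
# The classic word-vector analogy, using pre-trained GloVe vectors via gensim.
# (The model name is one of gensim's built-in downloads; it is fetched on first run.)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
# vector('king') - vector('man') + vector('woman') lands near vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The model captures relationships between individual words; nothing here
# evaluates whether a sentence's chain of reasoning is sound.
```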

Readers may be familiar with task-specific NLP applications like content summarization, translation, and generation, or question-answering systems. These are interesting but aim to solve problems separate from those arising with our annotation job. Content summarization tools can quickly read through a document and boil its multiple pages down to one or two paragraphs. They do this by identifying and cutting redundant language (often based on models tracking POS and word-level semantics through a dictionary like WordNet and/or a word vectorization model), and with rule-based programming that prioritizes the first and last sentences of paragraphs, since human writers often use these to convey higher-order meaning. Content translation tools usually operate via dictionary methods that accomplish word-level translations and then rules that reorganize sentence structure to match the output language’s grammar. (More sophisticated systems use probabilistic elements in their models to choose among synonyms, and sometimes translate at the clause level instead of the word level.) Content generation tools gain a lot of attention for their impressive results as well. As an illustration: if these systems are fed the box score (a simple graphical representation of all the plays) of a baseball game, they can generate a colorful, humanistic 500-word news article describing the highlights of the game. These models use dictionaries of clauses and sentences populated by human effort and labeled examples so that they can describe a base hit (a particular play in baseball) through numerous natural language expressions. The resulting articles have the feel of a human-written article in part because they are composed entirely of human-written words, clauses, and sentences. All of these NLP applications produce impressive results. But no one should mistake them as evidence that computers have developed a general understanding of human language allowing them to reason through and with language as we do, much less complete our annotation job.
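
As a rough illustration of the rule-based summarization strategy described above (and only that strategy), here is a toy extractive summarizer. It is a caricature of real systems, which layer POS tagging, WordNet, and word vectors on top of heuristics like these.

```python
# A toy extractive summarizer: keep the first and last sentence of each
# paragraph and drop near-duplicate sentences. Illustrative only.
import re

def summarize(text):
    summary, seen = [], set()
    for para in text.split("\n\n"):
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        picks = sentences if len(sentences) <= 2 else [sentences[0], sentences[-1]]
        for s in picks:
            key = frozenset(s.lower().split())   # crude redundancy check
            if s and key not in seen:
                seen.add(key)
                summary.append(s)
    return " ".join(summary)

print(summarize("First claim here. Supporting detail. Final takeaway.\n\n"
                "Another point. More detail. Closing line."))
```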

Computers’ impressive-seeming but limited understanding of human language is demonstrated by another NLP application (and the last we discuss) that is sometimes confused for a solution to our annotation job: question-answering systems. The most famous of these systems is IBM’s Watson, which famously beat two human champions on the trivia game show “Jeopardy!” These systems are able to intake a natural language trivia question, match the question’s words against a large set of Wikipedia documents to find relevant passages, and then infer which piece of information within those passages is called for by the question. In the case of Watson playing Jeopardy, question-answering systems can also frame their output in the form of a question. Watson’s ability to quickly scan the massive Wikipedia trove and spit out answers allowed it to reliably beat humans, especially because it only answered when it had high confidence in its output. Watson was never able to perfectly perform its simple job of matching text, inferring missing information, and outputting an answer, and the system is no longer state-of-the-art. (Read about the SotA benchmark.) That honor is held by a team from BloomsburyAI, and even its system outputs the wrong answer about a third of the time. While our human intuitions about language processing suggest that someone who knows a lot about a lot of things might be good at identifying nuanced misinformation through a chain of reasoning, machine question-answering systems are performing a much simpler and more narrowly-defined information retrieval task on a more stable set of Wikipedia documents. They have not developed general-purpose semantic reasoning allowing them to identify when an author has confused correlation for causation or exaggerated the relevance of some bit of evidence, nor can they identify the dozens of other credibility issues that crop up in the daily news.

This review of NLP techniques is not intended to rule out the possibility that any of them could ever be usefully applied to the task of identifying and labeling one or more of the credibility indicators identified by the Public Editor system. Simple dictionary approaches are being used by commercial firms like Trust Metrics to identify hate speech and other strong language. But the vast majority of PE’s credibility indicators probably need to be identified by humans for now. Of course, as with POS taggers, question-answering systems, and more, human-applied labels are frequently used to train computers to identify and classify language content. In fact, this supervised machine learning approach has been used to develop some of our best NLP parsers and applications. The question then becomes: how can we humans best apply labels to enough news articles to develop an application/AI capable of parsing (i.e. identifying and labeling) misinformation in the news?
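
To show the general supervised-learning pattern, and only the pattern, here is a minimal sketch in which invented human labels train a text classifier. Neither the examples nor the pipeline reflect Public Editor’s actual data or models.

```python
# A minimal supervised-learning sketch: human-applied labels (invented here)
# train a text classifier. This illustrates the general pattern, not
# Public Editor's actual pipeline or data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Critics say the plan amounts to banning all cars.",            # distorted opponent claim
    "The senator argued the plan would raise maintenance costs.",
    "Opponents, he claims, simply want the city to fail.",
    "The report found costs rose 3% over two years.",
]
labels = [1, 0, 1, 0]   # 1 = annotators flagged a straw man, 0 = no flag

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)                 # learn from the human-applied labels
print(model.predict(["Supporters say critics just hate progress."]))
```

With only a handful of examples the classifier is useless in practice; the open question, addressed below, is how to gather enough high-quality human labels to make this approach work.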


2. Other tools allowing humans to label the content within documents are inadequate for our annotation job. 

Probably not long after humans began writing, people began annotating that writing with margin notes, underlines, and strikethroughs. In the mid-20th century, we developed colored ‘highlighter’ markers and post-it notes –– and by the 1990s, we moved those annotation tools into the personal computer with computer-assisted qualitative data analysis software (CAQDAS) like ATLAS.ti, which has been reverse-engineered and sold under brands including NVivo, MaxQDA, Dedoose, and TagTog. In the 2000s, the online retailer Amazon produced a set of very simple crowd labeling tools that could be operated by Internet-based workers to categorize its burgeoning retail inventory. More recently, researchers’ and corporations’ growing demand for custom NLP parsers has supported the development of a handful of lightweight language tagging tools that teams can use to quickly create dictionaries, or labeled examples to train parsers via supervised ML. These tools have their own strengths and weaknesses, described fully here and more briefly in this section.

CAQDAS tools are well-designed for situations where only experts will be labeling documents. These tools were built for social scientists and humanists who wanted to label the content in their interview transcripts, observational notes, or some smaller collection of documents like the meeting minutes of an organization. Their interfaces presume that the expertise necessary for labeling the documents is in the head of the user. And they presume that the user will be assessing the entire document. These constraints fit low-throughput scenarios –– labeling the transcripts of a hundred interviews or a few dozen focus groups. But they could only aid our annotation job if we were able to recruit several hundred experts in critical thinking and news analysis to label news articles full time, day in and day out, for years. Given the tedium and cognitive difficulty of such a repetitive task, using such tools seems infeasible and very expensive. Anyone qualified probably has other ways they’d prefer to spend their time and earn remuneration.

First-generation crowd labeling tools like those provided by Amazon’s Mechanical Turk (AMT) offer hope that larger workforces could be brought to bear on our news annotation job. These tools were designed to aid Amazon’s efforts to classify the many retail items on their Internet marketplace. Rather than hire full-time employees to spend their days classifying product descriptions and images as ‘furniture,’ ‘electronics,’ etc., Amazon’s engineers established the Mechanical Turk marketplace and some basic annotation interfaces allowing Internet users to take on work classifying their inventory. The tools for classifying the natural language product descriptions were very rudimentary. They did not even allow for word-level annotations like the CAQDAS. Still, researchers were able to show that crowd workers, if organized properly, could output judgments equal or superior to the judgments of experts on tasks as complicated as classifying the political position of an idea simply by looking at a sentence. Follow-on research has provided even more evidence for the wisdom of these crowds. And new companies like Figure Eight, iMerit, Alegion, Lionbridge, and Scale have formed to apply the crowd data labeling approach, mostly to image data. The resulting labels are used in supervised ML approaches to train computer vision applications for use in self-driving vehicles and robotics. But these companies have done little to nothing to update AMT’s original language-classifying tools. Those tools still require extensive project management and data science support, are clunky to use, and were never designed for complex, in-depth analysis of language content.

Managing first-gen crowd labeling tools is not easy, so many researchers and industry data scientists have turned to lightweight language tagging tools like LightTag, DiscoverText, and Prodigy for their quicker language labeling jobs. These tools allow a team to log into an interface that provides them with push-button labeling tools. But the tools were built for tasks like quickly building up a dictionary or a set of simple labeled examples. The interfaces only allow users to label for a couple of categories of information at a time, so applying many dozens of labels would require people to read over the same information dozens of times. Moreover, these tools are not designed to access crowds of workers and find the consensus among them –– the sort of approach shown to produce high-scale, expert-grade judgments from teams of lay people.

Only one language annotation tool has been specifically designed for jobs requiring the in-depth, expert-grade analysis of very large sets of longer documents. It is called TagWorks, and it powers the Public Editor system. Classified as a (currently the only) second-generation crowd labeling tool for language, TagWorks breaks out experts’ analytical processes into an assembly line of smaller annotation tasks made for lay people. TagWorks’ interfaces do not rely on Internet workers to know or be trained with any relevant expertise. The interfaces’ prompts guide the worker/volunteer through the same analytical process an expert uses. For instance, if, after being prompted to find evidence of some bias, the worker highlights text identifying the bias, a follow-up prompt appears directing them to assess the severity of the bias. This adaptive prompt logic ensures that all relevant information is labeled while annotators still experience the quickest path to task completion.

TagWorks is designed to support the wisdom of crowds. Each of the eight separate tasks in the Public Editor assembly line is completed by at least 5 independent volunteers/workers. Then TagWorks’ (transparent) ‘consensus finder’ algorithm measures the agreement of their work and passes their consensus judgment (if there is one) further along the Public Editor data pipeline. TagWorks tracks volunteer/worker performance by measuring their agreement with the consensus over time, offering feedback to help volunteers improve, and barring bad actors. This collaborative annotation system manages task delegation, data shuttling, crowd work validation, edge case adjudication, and data analysis through a GUI, so it does not require the hiring of extra project managers or data scientists. And TagWorks is unique. There is no evidence in relevant trade news, the blogosphere, the relevant channels of Twitter and LinkedIn, or at recent AI conferences like O’Reilly AI in San Jose (September 10-12, 2019) that any other such high-scale, expert-grade annotation technology exists.
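
To give a feel for what a consensus check over highlighted spans might involve, here is a deliberately simplified sketch. It is not TagWorks’ actual consensus-finder algorithm; it merely accepts a label when enough independent annotators highlighted overlapping text.

```python
# An illustrative consensus check over annotators' highlighted spans.
# This is NOT TagWorks' actual algorithm -- just a minimal sketch of the idea:
# accept a span when enough independent annotators highlighted overlapping text.
def spans_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]

def consensus(span_sets, min_agreement=3):
    """span_sets: one list of (start, end) character spans per annotator."""
    agreed = []
    for span in span_sets[0]:
        votes = sum(
            any(spans_overlap(span, other) for other in annotator)
            for annotator in span_sets
        )
        if votes >= min_agreement:
            agreed.append(span)
    return agreed

# Four annotators marked roughly the same passage; one marked something else.
annotations = [
    [(120, 180)], [(118, 176)], [(125, 190)], [(400, 430)], [(119, 181)],
]
print(consensus(annotations))   # -> [(120, 180)], since at least 3 spans overlap
```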


3. There is no evidence that any project other than Public Editor is even attempting to label the content of news articles with a broad, intricate array of credibility indicators. 

At least three documents have been written describing projects and efforts addressing the problem of misinformation, disinformation, fake news, and/or false news. Apart from one or a few projects’ ancillary efforts to automatically identify and label keywords indicative of hate speech, as in the case of Trust Metrics, no project but Public Editor even has goals similar to our news annotation job. Some, like the Certified Content Coalition, The Trust Project, NewsGuard, and Our.News, focus on classifying the trustworthiness of news outlets. Others attempt to give a ‘thumbs up/down,’ a nutrition label, or a score to articles, as do FakerFact, Our.News, and Factmata. And many traditional organizations like Politifact, Snopes, FactCheck.org, Check, Poynter, First Draft News, and others organize expert fact-checkers to identify claims of fact in news articles and then do follow-up research (requiring anywhere from 20 minutes to 20 weeks of effort) to determine the truth-value of the claim. But the only thing even partially resembling our annotation job was an idea for a project called CIVIC, to be organized by the organization that hosts the TED conferences. That effort was imagined as a sort of crowdsourcing of contextual information and debunks that could be applied to memes, images, and other content circulating through social media. It was to be led by Claire Wardle of First Draft News, but she has returned full-time to First Draft News after sufficient funding failed to materialize for CIVIC, and the project’s aspirational website has been taken down.

4. If there is a secret team developing large-scale expert-grade annotation technology for the purposes of completing a news annotation project like Public Editor’s, it will be hard-pressed to develop a set of within-document credibility indicators comparably powerful and comprehensive.

The teams that have formed around the misinformation challenge often include journalists and data scientists, or, as one organization propagating action teams around the world calls itself, “Hacks/Hackers.” Such folks have also contributed positively to the Public Editor team. But the power of Public Editor’s credibility indicators and prompts for identifying and assessing the veracity of news articles’ various claims and uses of evidence derives substantially from its experts in science education. Public Editor’s PhDs, including a Nobel Laureate, have spent thousands of hours teaching the methods and critical reasoning capacities accrued over centuries of scientific practice. These capacities have been translated into the interfaces prompting volunteers’ efforts. They have been carefully organized and scoped so that they neither overwhelm volunteers’ cognitive capacity nor bore them. And they have been iteratively tested and improved with university undergraduates for over two years.

If there is any other team in the misinformation mitigation space able to make progress on such work, it may be a group that called itself the Credibility Coalition (CredCo). With the goal of ensuring that the field of misinformation mitigation was grounded in scientific practices encouraging cooperation, Public Editor helped found the group by writing its mission statement and work plan, leading trainings on credibility indicator operationalization, representing the group at conferences, establishing a credibility indicator data model (which the group handed off to a working group of the W3C), and demonstrating that credibility indicators could be reliably applied by volunteers in work that resulted in CredCo’s only published paper. (It is unclear if CredCo is still operating at this time. Its website has not posted any updates since July 2018, and many of its paid project managers and member organizations, including Public Editor, no longer participate in the organization.)

Operationalizing the critical thinking techniques of science as credibility indicators is not a straightforward affair. Public Editor’s team includes social science methodologists, cognitive scientists, human-computer interaction experts, and science educators best equipped to convert centuries of gradually improving human reason into a set of metrics applicable by the public. As perhaps the sole project team able to create Public Editor’s expert-generated set of indicators, and the only project with access to the annotation tools able to guide the public to apply them, Public Editor stands alone. 

Public Editor’s interfaces guide volunteers to answer questions resulting in labels for over 40 credibility indicators in the following 4 categories, with a few examples listed for each:

REASONING – identifying common reasoning errors: e.g., straw man fallacy, false dilemma, appeal to authority

LANGUAGE – identifying issues of language or tone: e.g., exaggeration, inappropriate metaphor, sarcasm

EVIDENCE – identifying issues with the use of evidence to support claims: e.g., confusing correlation and causation, sample representativeness (cherry-picking of data), adequate control conditions for experiments

PROBABILITY – identifying issues with probabilistic reasoning: e.g., poorly calibrated confidence, systematic errors in data collection, comparison to an appropriate baseline

Public Editor also tracks whether credibility errors are made by an article's author(s) or by quoted sources, and assesses how important the mistaken argument or quote is to the article's main argument. That way, PE can appropriately weight the error when scoring articles (and, by extension, journalists and news outlets) and quoted sources.
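
As a purely hypothetical illustration of how such weighting could work (Public Editor’s actual scoring formula is not described here), consider a penalty that scales with an error’s severity, its centrality to the main argument, and whether the author or a quoted source committed it:

```python
# An illustrative, hypothetical weighting scheme -- NOT Public Editor's actual
# scoring formula. Each flagged error is discounted by who made it and how
# central the passage is to the article's main argument.
ERRORS = [
    # (indicator, severity 0-1, centrality 0-1, made_by_author)
    ("straw man fallacy", 0.8, 0.9, True),
    ("exaggeration", 0.4, 0.3, False),   # in a quoted source, peripheral
]

def article_score(errors, base=100, author_weight=1.0, source_weight=0.5):
    penalty = 0.0
    for _indicator, severity, centrality, by_author in errors:
        weight = author_weight if by_author else source_weight
        penalty += 10 * severity * centrality * weight
    return max(0, base - penalty)

print(article_score(ERRORS))   # 100 - 7.2 - 0.6 = 92.2
```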


5. A Public Editor benchmark dataset will add significant value to ongoing machine learning efforts.

Given the costs associated with human experts labeling documents, and the weakness of machine efforts to identify and label misinformation on their own, many projects combine the efforts of humans and machines through supervised machine learning (ML). These projects, in almost all cases, rely on humans to label an article as ‘fake news’ or ‘real’ –– or they label the article as ‘hateful,’ ‘satirical,’ etc. (Google even has a team of 10,000 people who rate websites, and sometimes their articles, according to 5 variables of credibility.) Except for Public Editor, these projects apply their labels at the level of the article or higher (i.e. the website). This approach certainly holds down costs, since humans are not relied upon to closely label content within the articles. But it is also limited in terms of machine learning traction.

The term ‘machine learning traction’ is probably a new one, and it’s best explained through illustration. When a machine is learning to mimic humans’ application of some label to raw data, its learning process relies on two categories of information: the raw data and the label attached to the raw data. (Some will rightly note that other features of the data, like metadata or additional labels, can be included in the model, but we’ll leave those out for this illustration.) The machine learns by reviewing some raw data, taking a guess as to whether or not it should be labeled, then getting feedback, based on humans’ previous work, about whether it guessed correctly. That feedback raises or lowers the probability that the machine will place the label on similar information in its next try. When the computer repeatedly tries, gains feedback, and improves over thousands of iterations, it ‘learns’ how to apply labels to raw data as the humans would.
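
The guess-feedback-adjust cycle can be made concrete with a bare-bones perceptron over bag-of-words features; this is illustrative only and far simpler than any production model.

```python
# A bare-bones version of the guess -> feedback -> adjust cycle described
# above: a perceptron over bag-of-words features. Illustrative only.
from collections import defaultdict

def featurize(text):
    return set(text.lower().split())

def train(examples, epochs=20):
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:             # label: 1 = flagged, 0 = not
            feats = featurize(text)
            guess = 1 if sum(weights[f] for f in feats) > 0 else 0
            if guess != label:                   # feedback from human labels
                for f in feats:                  # nudge weights toward the truth
                    weights[f] += 1 if label == 1 else -1
    return weights

examples = [("critics just want the city to fail", 1),
            ("the report found costs rose slightly", 0)]
weights = train(examples)
```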

So, let’s suppose that a misinformation research project wanted to go beyond labeling an article as ‘fake’ vs. ‘real.’ Suppose it also wanted to apply labels to articles committing the straw man fallacy –– where an opponent’s argument is simplified so readers will dismiss it. If humans assessed thousands of articles and labeled half of them as containing the straw man fallacy, let’s imagine how the machine would learn to spot such fallacies. Its raw data is all the text in the entire article. And its label is ‘straw man fallacy.’ So it is fed all the article’s text, asked to make a guess, and then given feedback based on humans’ prior labeling work. After thousands of iterations, researchers would likely find that the machine was still quite inaccurate at guessing which articles contained the straw man fallacy. Why? Insufficient traction. Not only does a straw man fallacy often appear rather similar to the unproblematic presentation of an opponent’s argument, it is usually indicated by only a sentence or two of text in an article that may contain several dozen sentences. How is a computer to find straw man fallacies in a large pool of irrelevant sentences, especially when the fallacy is rather subtle?


Public Editor is different. It labels the specific words and phrases committing fallacies, not entire articles. So its thousands of examples of straw man fallacies will provide machines with a higher-traction learning experience. Machines learning from PE data will be tasked with judging only a few sentences of raw data at a time, and the raw data corresponding to ‘straw man fallacy’ labels will not include all the article’s content, just the erroneous words and phrases. Such data will offer numerous constrained examples of authors casting doubt on other people’s claims and arguments –– the sort of high-traction data machines need to dig deeper into article content. The granular labels provided by Public Editor will not only make it possible to train machines capable of granular analysis; they will also provide better traction for machines focusing their labels at the article level. It’s important to note: models taught by supervised ML can only ever perform as well as the human labeling procedure that is ‘supervising’ them. Right now, dozens of ML research teams are improving their models (i.e. AI) against very simple benchmark datasets that label an article as ‘fake’ or ‘real.’ With the greater ML traction provided by Public Editor’s labels and 0-100 article credibility scores, even beginner teams will be able to quickly leapfrog the current state of the art.
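
The difference in traction is easiest to see in the shape of the training data itself. The records below are invented and their field names hypothetical, but they show what an article-level label and a Public Editor-style span-level label each give a learning algorithm to work with.

```python
# Illustrative training records (field names are hypothetical). The article-level
# record forces a model to search an entire article for a subtle cue; the
# span-level record points the model at the offending words directly.
article_level_example = {
    "text": "<all 40+ sentences of the article>",
    "label": "contains_straw_man",
}

span_level_example = {
    "text": "Critics, she said, simply want the city to fail.",
    "label": "straw man fallacy",
    "char_span": (1204, 1253),        # location within the full article
    "attributed_to": "quoted source",
}
```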
