In the late 1990s, I was responsible for the journals at the American Academy of Pediatrics when a paper about Sudden Infant Death Syndrome (SIDS) received a flood of media attention after the publication of a book called “The Death of Innocents.” It turned out the babies featured in this highly cited Pediatrics paper weren’t dying on their own of SIDS, but were being murdered by a deranged parent who was ultimately described as a serial killer. The paper in question — which had been published in the early 1970s — was a consistent citation winner because SIDS was such a concern and mystery in the pediatric community. This paper helped create the mystery. The revelation that the paper described homicide/infanticide and not SIDS caused the paper to be corrected, but not retracted. This was deemed the better approach, so the literature reflected the entire story. It was still a valid study, but now we knew it was a study of murder disguised as SIDS.
Needless to say, the original paper continued to be a citation champ, and the correction only bolstered its citation numbers for a time. However, the reasons why it was cited changed dramatically, and also became unpredictable. This represented my first experience observing how citations are not always “positive intellectual debt” but can be made for a host of different and evolving reasons.
Fast-forward to 2009, when I came across a paper in the BMJ by Steven Greenberg entitled, “How Citation Distortions Create Unfounded Authority: Analysis of a Citation Network.” Greenberg’s use of citation network mapping, the insights generated (concepts like “citation distortion” and “citation expansion”), and a nagging concern based on experience that citations were too “dumb,” all culminated in my effort to create a tool that would add qualitative aspects to citations. I called this tool “SocialCite,” and some readers here might recall it. We got to the point of beta-testing it with some adventurous publishers, but the fatal flaw was that it depended on user engagement to work — we developed it before machine learning was mainstream enough to map citation networks at scale and derive useful meaning from them at a price a small startup could afford.
So, I was excited to learn a few weeks ago that a new group of people had picked up on the same problem, and were able to leverage machine learning to start tackling these deficiencies in qualitative citation categorization. Their timing seems far more promising. The initiative debuted at the STM Meeting a couple of weeks ago, and is just launching this week. It is called Scite.ai (which they refer to as “scite”). Josh Nicholson is the CEO and founder, and he agreed to an email interview, which follows.
Q: Give me a little bit of your background – education, professional life, and so forth.
Nicholson: I received my PhD from Virginia Tech in 2015 studying the role of aneuploidy in chromosome mis-segregation in human cells (scite report here). In addition to my work on cancer, I have long been interested in how science is performed and the people behind the science itself. As an undergrad, I wrote for the health and science section of the student newspaper at UC Santa Cruz, and as a grad student I would often run side analyses looking at the structure of scientific funding and peer review. As I read more and more about scholarly publishing and experienced publishing firsthand as a graduate student, I became frustrated with the process and wanted to improve it beyond just writing papers about it. So, during the last year of my PhD, I launched “The Winnower,” an open access publishing platform that used only open post-publication peer review.
Q: You’ve been involved in a few startups. Can you walk me through the path from “The Winnower” to Authorea to Scite.ai?
Nicholson: When I launched “The Winnower,” I knew virtually nothing about starting a company, but I had friends who had started one, and I figured I must be at least as smart as they were. I lucked into a small angel investment after losing a business competition and jumped headfirst into “trying to fix academic publishing.” I soon realized how difficult it was to attract submissions as a no-name graduate student, and even more difficult to get people to review submitted papers. Being brutally honest with myself about how it was going (I could not convince my own PI to publish there), I switched focus pretty early on to publishing grey literature, which proved to be quite successful because I no longer competed with traditional publishers for submissions. Really, we were one of the first to give DOIs and permanent archival to content that was not traditional peer-reviewed academic papers but which I thought still had tremendous value (blog posts, student essays, how-to’s, reddit science AMAs, etc.). After finishing my PhD, I focused solely on “The Winnower” until it was acquired by another early stage startup with a similar founding story, Authorea. Authorea, a document editor for researchers, was founded by two researchers who were frustrated with how difficult it was for them to collaborate on technical documents. I was part of Authorea for roughly two years until it was acquired by Atypon.
The original idea behind scite was actually first published on the Winnower in 2014. Yuri Lazebnik, my co-founder and long-time informal advisor and collaborator, and I had been discussing the seminal report from Amgen saying they could only validate 6 out of 53 major cancer studies they looked at. We proposed a solution based on our own experiences with evaluating research reports by identifying the studies that actually tested the claim and classifying them as supporting or contradicting. Basically, we would know in our fields that this paper may come from MIT, may be in Cell, Nature, or Science, and may be cited 200+ times, but because we followed the work so closely we would know the reports that re-tested the claim and either supported or contradicted it. However, if either of us moved just outside of our narrow fields, we would have no way of knowing this and would be forced to rely on the proxies of quality nearly everyone uses today to evaluate research — journal name and citation count.
We started by manually going through papers that we knew were bad and reading each of the papers that cited it to see if we, as humans, could tell if the citing papers were supporting, contradicting, or just mentioning the study by the way they cited it. We were encouraged by the fact that we could do this manually but realized that this would never scale as it would take hours if not days to look at hundreds of citations for one paper alone!
Because Yuri and I had no experience in machine learning or software development, I basically talked to anyone who would listen, saying, “Hey, we have this great idea, we want to automate it. Will you join us?” Most responses were, “That’s impossible!” “Give me money,” or “I don’t have the time.” While preaching about the promise of scite, I met Peter Grabitz at OpenCon. Peter was a medical student at the time who was performing citation analyses looking at drug development, and he was enthused about the idea of looking at citations in terms of new classifications. Additionally, we met Sean Rife (through Twitter!), a psychology professor at Murray State who had done similar technical work developing statcheck.io, an application that checks the accuracy of statistics in publications, and who was acutely aware of the reproducibility problem in psychology. Sean not only had the motivation to work with us, he had the technical know-how as well, and said, “Yeah, let’s do it!” With Sean, we were able to build a prototype to prove the idea and soon, after raising private capital, to work on it full time. Milo Mordaunt, who I worked with at Authorea, and whose Classics dissertation at Cambridge essentially text-mined Homer’s “Odyssey” and represented it as a network, joined soon after and became the CTO of scite. So did Patrice Lopez, the creator of GROBID, the top tool for scholarly document conversion from PDF to XML, and a computer scientist with over a decade of experience text mining scientific papers. We’re also fortunate to work with Neves Rodrigues, another former co-worker at Authorea who has done nearly all the product design work, and Camilo Frigeni, who has helped with branding!
Since then, we have developed a production-ready site with over 250 million citation snippets analyzed and attracted lots of publishers who want to start pilots, and we are hoping to expand the team soon. To me, scite can give researchers superpowers, and this is exciting!
Although we have tons to do, we’ve already come a long way and are excited about opening up scite to the wider scientific community! The team is really amazing, and I feel lucky to be working with them every day, even amongst disagreements!
Q: You’ve certainly learned some lessons from the other startups. Have those lessons changed how you’re approaching the development of scite.ai?
Nicholson: People matter! I used to hear this all the time and think it was just a cliché, but people really do matter because there will be a lot of tough decisions and discussions that need to happen and they need to be discussed openly and honestly. Additionally, focus, focus, focus. There are a million things to do, and you need to focus on doing the things that really matter.
Q: What’s your ultimate goal with Scite.ai? Beyond a sustainable company, what would define success in a broader sense? What’s the mission?
Nicholson: scite aims to make science more reliable. Ultimately, we want to align career incentives with good research. We think research is one of the most important human endeavors in the world, so, if we can make it more efficient, we will see better outcomes in all parts of our lives. I used to say that “in order to cure cancer, we must first cure cancer research,” and really I think that is true. The research enterprise needs to find better ways of doing research if we want to make real progress.
Q: Tell me a little about how Scite.ai works.
Nicholson: There are three major parts to scite: 1) document processing, 2) citation classification, and 3) the web app (user-friendly site).
In order to find citation statements, we need to analyze the full text of scientific publications. We work exclusively with XML and PDF versions of the articles. In this process, we match in-text citations to references in the reference list and then match those against the Crossref database. Thus, we’re effectively creating a citation graph from scratch. We can process over 1,000 XML articles a minute and about 650,000 PDFs a day! This is a very difficult part of the process if you consider that most articles are in PDF (and some quite old), and there is a large variation in citation style and reference information. We’re not perfect at doing this, but it is pretty magical, I think — or, as Patrice would say, there are about 10+ machine learning models at work that allow us to do this. We currently have about 250 million citation snippets and are adding more daily.
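To make the matching step concrete, here is a deliberately simplified sketch of linking an author-year in-text citation to an entry in a parsed reference list. The real pipeline described above relies on 10+ machine learning models (GROBID among them); the regex, field names, and sample DOIs below are illustrative assumptions only, not scite’s actual implementation.

```python
import re

# Match in-text citations like "(Smith et al., 2019)" or "(Smith, 2019)".
CITATION_RE = re.compile(r"\(([A-Z][\w-]+)(?:\s+et al\.)?,?\s+(\d{4})\)")

def match_citation(sentence, references):
    """Return the reference entry matching the first in-text citation, or None."""
    m = CITATION_RE.search(sentence)
    if not m:
        return None
    surname, year = m.group(1), int(m.group(2))
    for ref in references:
        if ref["first_author"] == surname and ref["year"] == year:
            return ref
    return None

# Hypothetical parsed reference list (these DOIs are placeholders).
references = [
    {"first_author": "Smith", "year": 2019, "doi": "10.1234/hypothetical.one"},
    {"first_author": "Jones", "year": 2015, "doi": "10.1234/hypothetical.two"},
]
```

A real system must also handle numeric citation styles, multi-author disambiguation, and OCR noise from older PDFs, which is where the machine learning comes in.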
Once we have the snippets and citations extracted, we run our deep learning models to classify the citation snippets as supporting, contradicting, or mentioning. This model is trained on roughly 35,000 citation snippets that were manually annotated, a set that we are constantly increasing. We can classify about 35 million snippets a day.
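The classification step can be sketched with a toy cue-phrase baseline. This is a deliberately simple stand-in for scite’s deep learning model (which is trained on roughly 35,000 annotated snippets); the cue lists here are illustrative assumptions, not scite’s actual features.

```python
# Cue phrases are hypothetical examples of supporting/contradicting language.
SUPPORT_CUES = ("consistent with", "in agreement with", "confirmed", "replicated the")
CONTRADICT_CUES = ("in contrast to", "failed to replicate", "contrary to", "could not reproduce")

def classify_snippet(snippet):
    """Label a citation snippet as supporting, contradicting, or mentioning."""
    text = snippet.lower()
    if any(cue in text for cue in CONTRADICT_CUES):
        return "contradicting"
    if any(cue in text for cue in SUPPORT_CUES):
        return "supporting"
    return "mentioning"

# Example snippets (invented for illustration):
# "Our data are consistent with Smith et al. [3]."      -> supporting
# "We failed to replicate the effect reported in [3]."  -> contradicting
# "Aneuploidy has been studied in many contexts [3]."   -> mentioning
```

A rule-based baseline like this breaks down quickly on hedged or negated language, which is why a model trained on tens of thousands of annotated snippets is needed at scale.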
Most importantly, we need to make this wealth of information actually usable, so we’ve built a user-friendly website that allows anyone to search if a scientific report has been supported or contradicted.
Q: Why did citations seem to present opportunities for you and your team?
Nicholson: We looked at citations because of how influential they are in academia and academic publishing, and we thought that merely counting them was quite “dumb,” as they could be looked at more intelligently. For example, a citation trashing a paper is now counted the same as one supporting it. We could see parallels between what we are doing now and ranking web pages in the early days of the Internet. Before Google introduced PageRank, search engines didn’t take into account the quality of a page’s inbound links; they only looked at the quantity. In science today, we are still in the pre-PageRank era, only counting citations. We think scite will change that and often frame what we’re doing as the “PageRank of science.”
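The PageRank analogy can be illustrated with a toy citation graph: two papers with identical raw citation counts can rank differently once the standing of their citers is taken into account. The graph, damping factor, and implementation below are a minimal textbook sketch for illustration, not scite’s algorithm.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) edges."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, r in rank.items():
            targets = out_links[src] or list(nodes)  # dangling node: spread evenly
            for dst in targets:
                new[dst] += damping * r / len(targets)
        rank = new
    return rank

# Papers A and B are each cited twice, but A's citers (C, D) are themselves
# cited by other papers, while B's citers (E, F) are not.
edges = [
    ("G", "C"), ("H", "C"), ("G", "D"), ("H", "D"),
    ("C", "A"), ("D", "A"),
    ("E", "B"), ("F", "B"),
]
ranks = pagerank(edges)  # ranks["A"] > ranks["B"], despite equal raw counts
```

A plain citation count treats A and B identically; a link-aware measure does not, which is the gap between counting citations and weighing them.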
Q: How long have you been working on the underlying technical approach? Were there any breakthrough moments? Major setbacks?
Nicholson: Patrice has been working on text mining scientific papers for over a decade, and much of what he has learned and built is being directly applied to scite. I think we’ve been quite lucky with the timing of tools available to us. It’s a very difficult technical problem that we’re trying to solve, but with advances in machine learning where they are today, we’ve been able to do it.
Q: One of the concepts you’ve mentioned comes from the legal field, the idea of “Shepardizing.” Can you explain that, and how you see it applying?
Nicholson: Many people mentioned Shepardizing when I would tell them about scite, and I mostly ignored it because the word sounded really bizarre to me, and I had no idea what they were talking about. What I have found is that, effectively, what we are doing has existed in legal citation practice for over one hundred years, and was introduced by Frank Shepard. In short, it is a system that allows lawyers to determine whether a case is reliable to cite or not. What is funny is that Eugene Garfield knew about this, and so did many others for decades; they simply couldn’t do it technically because of the scale and, I would say, the unclear need at the time (you can watch Garfield discuss the concept here).
Also, an interesting aside is that when I first started “The Winnower,” I actually got to speak with Eugene Garfield on the phone, as I wanted to revive his long-running series “Citation Classics” (I revived it under a different name at the Winnower), and he said something to the effect of, “Welcome to this crazy world.” I am pretty happy to have that connection to history and an innovator in this space.
Q: How do you see the future of citations, citation counting, and citation quality moving forward?
Nicholson: I see researchers writing their citation statements more clearly as they become aware that scite exists, classifies them as supporting or contradicting, and that classification affects how others view the cited study. I also see citations being less biased. Currently, we see about 95% of citations simply mention a study, about 5% support it, and less than 1% contradict it. I think that by having these explicit categories we will encourage researchers to publish more contradicting and supporting work, a substantial fraction of which remains unpublished, because novelty is no longer the only thing that matters.
Q: Anything else you’d like to say?
Nicholson: Thanks for the great questions. Please try the system out at scite.ai, and let us know what you think. If you have direct questions or thoughts, I can be reached at firstname.lastname@example.org.
Disclosure: Since this interview first appeared, I have become an advisor for Scite.ai.