Online Indexing of Scholarly Publications: Part 2, What Happens When Finding Everything is So Easy?

The transformation in discovery – and its consequences – was the topic of the opening keynote at the September 2015 ALPSP Annual Meeting. Anurag Acharya – co-founder of Google Scholar – spoke and answered questions for an hour. That’s forever in our sound-bite culture, but the talk was both inspirational – about what we had collectively accomplished – and exciting and challenging – about the directions ahead. Anurag’s talk and the Q&A are online as a video, and as audio in parts one and two.

This post is in two parts: Part One covered Anurag’s presentation of what we have accomplished. The present post, Part Two, covers the consequences. Anurag has agreed to address questions about this post that readers put in the comments.

Here is my take on the key topics from Anurag’s keynote.

In Part One, I highlighted the factors that have transformed scholarly communication over the last 10-15 years:

  • Search is the new browse
  • Full-text indexing of current articles plus significant backfiles, joined with relevance ranking, changed how we looked and what we did.
  • “Articles stand on their own merit”
  • “Bring all researchers to the frontier”
  • “So much more you can actually read”

In Part Two of this post, I cover Anurag’s view of “What Happens When Finding Everything is So Easy?”

Each of the above factors may seem to be incremental, but together they deliver so much impact that even a tradition-bound and well-practiced researcher workflow will change. In fact, while publishing behavior – often determined by a senior member of a research group, and by the senior editors of the journals they publish in – is slow to change, the “hunting behavior” of readers can shift more rapidly, driven by the younger grad students and postdocs.

What are the effects of the transformation in finding and reading? Here Google Scholar has a lot of evidence – based on search and result behavior – to report. While evidence and its interpretation are two different things, the evidence of a behavior shift is, on its own, important for us to be aware of.

What do researchers look for? More queries, more words, more concepts, more areas

Scholar records these changes, per user:

  • growth in the number of articles clicked
    • growth in clicks on both abstracts and full text
      • but abstracts are growing more
  • growth in diversity of areas clicked on

What’s happening here? An iterative-filtering workflow is now common: search – scan titles and snippets – click on a number of abstracts – click on a few full texts – change query – lather, rinse, repeat. I think of this as a kind of hunt-then-gather mode: you hunt, you gather up, you move on to another venue, you repeat. I imagine people determine relevance via the abstract – which loads more quickly and never hits a paywall – then decide whether to store (a PDF) or read.

Scholar has also found that abstracts that have full text links are more likely to be clicked on than those that do not have such links. Perhaps this is because the user is assured that full text is available if it is needed. And/or perhaps because the entries draw your eye.

PDF still wins the popularity contest

While there may be reasons that HTML full text is more “powerful” – especially for researchers who need access to high-resolution figures or supplemental data sets – the PDF still wins the ‘conformance quality’ (1) award: a downloaded PDF ensures you will be able to read the article later. The impermanent nature of access rights – library subscriptions change, off-campus access lapses, a reader changes jobs – leads to a need to store a local, permanent copy. As Anurag said in the Q&A, “Some downloadable form that is permanent will survive.”

The spread of attention

The ease of finding a great variety of items encourages what Anurag called a “spread of attention”. The spreading is on several dimensions: small journals, new journals, non-English journals, old(er) articles, non-articles (preprints, dissertations, working papers, proceedings, technical reports, patents) all get more attention when they are in the same query space with the “formal literature” that is found in the highly-curated databases.

The “article economy” is enabled by many things in our ecosystem, but the scholarly search engine which finds articles – not journals or issues – is key. The early user experience design decision that each full article would have an address and be on one page rather than be atomized across several is another key enabler. The SKU for an article is a URL or a DOI, if you will; an article doesn’t have several DOIs or URLs. (This wasn’t always the case. Some early scholarly-article web sites had each article section on its own page.)

Users want lots of abstracts and only some full text; metrics should recognize this

Literature review is inherently a filtering process, and abstracts are purpose-built to be the distillation of an article. Anurag believes that supplying users with full text when they can’t use it (because the user is in the filtering part of the workflow, not the reading part) is not helpful and slows down the filtering. (There are probably exceptions to this, such as detailed methods searches when the information needed for filtering is not in the abstract.) Similarly, metrics that ignore abstracts miss a lot of the utility a journal provides to its readers. Anurag encourages COUNTER to add abstract views and PDF downloads as required measurements, in addition to the full-text, gold OA, and denial counts currently emphasized.

Abstracts should now be written for a broader, not just a specialized, audience

Research articles are typically written for the authors’ peers: subject-matter experts in their field. And so abstracts are typically written for the same audience. But relevance ranking in a comprehensive search leads searchers from outside that circle of experts to these articles via their abstracts, attracting a larger research readership and giving the authors a wider impact. Abstracts written for one’s peers often contain jargon (like blog posts for publishers…). Abstracts that are accessible to a broader audience, i.e., researchers in related fields, will help.

Anurag noted that Science and Nature have written broad-audience abstracts well for many years. We see journals beginning to attend to this by adding keywords to articles, and by including impact summaries and “take-home messages”. Readers appreciate these, and they expand the audience while also efficiently helping the broader audience contextualize a paper.

Now on to the Q&A, which was a wide-ranging one.

“What about searching books?”

To paraphrase Anurag again: ‘We need a representation of a monograph that functions like the abstract does for a journal article. This can’t be an introduction or preface, and it can’t be the first few pages – these things aren’t a representation of the whole.’

On the difference between book and journal searching

‘Users’ expectations are different for these two. When you search books, you expect an answer; when you search journal articles, you expect a list of things to read. A book represents late-stage work, not the early-stage work of journal articles.’

You can easily see the difference between these two modalities. Do a search in Google for “San Francisco weather” and the answer pops up: the current temperature and conditions, and the forecast. But for the “weather scholar” there is also a list of sites below the forecast that you can go to if you want to study the topic by reading web pages about San Francisco weather.

“Do you use things previously searched and found for ranking in Scholar?”

Anurag: ‘This isn’t as significant as you think. [Scholarly] queries are long and contain discipline-specific terms, unlike in Google web search. Personalization helps when queries are ambiguous. When queries are detailed and specific, as most Scholar queries are, personalization doesn’t add much.’ There were follow-up questions on this theme, suggesting disbelief that Scholar doesn’t use frequency, location, etc. as a ranking signal. It doesn’t, Anurag repeated. He encouraged people who can’t believe it to try the experiment of running the same Scholar queries in different countries, or with different people trying the same query.

“What is the role of the journal in the future?”

To paraphrase Anurag: ‘I have no good answer for that. The journal was a channel of distribution, and this is less important now. It is still important as a channel of recognition. There are three important relevance-ranking signals for a just-published article: Who wrote it? What is it about? Where was it published? The last of these, where it was published, covers many different indications.’

“The goal of Scholar…” 

“…is to make it easier for people solving difficult problems to do more.”


(1) Conformance quality: Quality of conformance is the ability of a product, service, or process to meet its design specifications. Design specifications are an interpretation of what the customer needs.

4 thoughts on “Online Indexing of Scholarly Publications: Part 2, What Happens When Finding Everything is So Easy?”


  2. I use GS to study research communities, which may be somewhat different from most users, or perhaps not. In any case the reason for the repeats is often that in reading the abstracts one finds different search terms related to one’s interest. Sometimes this means broadening the search. For example I once started with “author disambiguation” then discovered that a lot of the relevant literature used “name disambiguation.” But sometimes it means narrowing the search, by discovering search terms that more precisely fit what is sought. Related Articles is also good for this. Is there data on the use of RA? The point is that in addition to discovering content one discovers search terms.

    There is a huge difference between having the search term appear in the full text, where it may be incidental, and having it appear in the title, where it will be central. For example I recently did a search on “fracking” since 2011, without citations or patents. There are about 13,000 full text occurrences but just 700 title occurrences. The latter is the place to look. Any data on this search distinction?

    Actually it is still hard to find everything on a given subject, but I have developed a procedure for doing so. As for finding what one needs, that is probably what the complex search patterns GS is finding are all about. I have long argued that the logic of search (and how to facilitate it) is one of the great scientific challenges of our day. We need to understand these patterns as a form of problem solving. This is cognitive science.


    • Hi David: discovering new search terms is definitely one of the reasons for the iterative process. As are broadening/narrowing of the query based on how well the results match the question the user has in mind. Yet other reasons include realizing that you need to approach the question differently, pivoting based on specific concepts/authors, etc.

      Regarding related articles search, we do know that this is used a lot but we haven’t tried to understand if there are specific cases/patterns that users use it for.

      Very few people try to find everything on a subject. Even the folks who try to be comprehensive have some notion of a usefulness cutoff. Partly because every subject blends into related areas and partly because not everything written on a subject adds new information.



      • Thanks Anurag. Actually the patterns I refer to are the total search behavior, not just RA. I am wondering if it is possible from the pattern to figure out what the user’s need is, at least in some cases. The community that studies human problem solving (from which I come) has long done what I call cognitive time-and-motion studies to identify patterns. If we could recognize them we could offer assistance accordingly.

        For example, in recent weeks I have used GS in several very different ways, including these:
        1. Find candidate reviewers for an article published by F1000Research, a taxonomy of biases.
        2. Find a candidate journal to submit another article to.
        3. Conduct a preliminary survey of recent research on fracking for a book project.
        I am pretty sure that my step by step behavior is different in each case.

        Alternatively, we could offer users a menu of search problems like these and let them tell us what they are doing.

