Solving Indexing, one step at a time

Publishing is on the verge of exciting times. The promise of relatively new technologies like machine learning, artificial intelligence, and natural language processing makes it incredibly tempting to speculate on the new world we’ll soon be living in, including questions about which processes can be automated and whose jobs will be taken over. (We have even done some of the speculating ourselves, here and here.)

While there is certainly a time for thinking carefully about large-scale changes to our industry, I do fear that thinking only in terms of large-scale changes makes us focus on the wrong questions. By constantly thinking in abstractions and generalities, we can inadvertently ignore and fail to value the concrete.

Consider, for example, the state of indexing. As any academic will tell you, indexing can be incredibly helpful for research. By listing major topics and the pages on which they are mentioned, an index first lets readers decide whether a certain resource is what they are looking for, giving them a taste of the topics covered as well as a rough estimate of the extent to which they are addressed. And for research, a well-designed index enables people to home in on precisely the topic they need, since obviously not every resource can be read from scratch each time a paper, a book, or a website entry needs to be written. The need for the index, then, is very real.

In addition, few of the people I have talked to in publishing and in academia think that current indexing procedures work. A recent popular Twitter thread by historian and editor Audra Wolfe raised many of the issues I have been hearing about. She tweeted that professional indexers were essential for any academic who wasn’t knowledgeable about and competent at indexing, because otherwise the result was often “frustrating and unprofessional”.

In response, historian Bodie Ashton pointed out that early career researchers simply cannot hire indexers, and that if he had paid $7 per page for his first book, the indexing fee would have been a whole order of magnitude more than what he would have earned in royalties. Historian of technology Marie Hicks weighed in too, revealing that the turnaround time required by the publisher was too short to be able to hire an indexer. Moreover, they pointed out that it simply seemed unacceptable that anyone would need thousands of dollars to be able to produce an index that was professional.

I agree. This strikes me as a situation ripe for technological intervention: an indispensable job that costs too much and takes far too much time. The biggest obstacle to incorporating technology, however, is that expectations seem to skew too far in two directions. On the one hand, tech optimists seem to think we can come up with an indexing engine that will immediately replace professional indexers, saving authors and publishers both time and money. Unfortunately, the work of indexing is not mechanical in a way that can be captured by a simple algorithm. It depends on skill that takes time to develop, and quite often also on expertise in the discipline that the book belongs to. Unsurprisingly, then, trying to replace human indexers wholesale results in unhappiness all around. Authors report being forced to live with clearly inadequate results or else having to redo the whole job themselves.

On the other hand, some people seem to over-correct and insist that indexing cannot possibly be improved, and that we should simply accept the way things are. This lapse into fatalistic pessimism is sadly understandable. For some time now, there has been a standard story about how things play out: unrealistic expectations about publishing tech lead vendors to advertise abilities they simply cannot deliver, which leads to disappointment all around. As this keeps repeating, publishers of course start to react to tech with instinctive skepticism. But given that there are real problems that need to be addressed, as the original tweets testify, this position isn’t sustainable either.

I believe the way out of this impasse is to recognise that this is, in a very real way, an artificial problem. Talking about tech in abstractions and generalisations only allows us to speak of progress in binary terms, as entirely successful or as entirely failing. Rather than fall for this, we need to stop asking whether a certain task can be automated or performed by AI engines, and instead ask in what ways tech can actually help us, given where we are. Once we do this, we can start noticing that there are already multiple products that can assist with indexing.

Existing keyword extractors may not be perfect, but they can certainly generate a list of suggestions that dramatically cuts back on time, since authors or indexers will only need to remove unnecessary entries, add any that were left out, and tweak existing ones (for example, handling synonyms, or separating two different people with the same name who were accidentally classified as the same person). Statistical information about the frequency of terms can significantly ease indexing by showing the spread of a topic through the entire manuscript. Certain categories of keywords can also be extracted better than others: proper names, for example, are far easier to identify than key concepts. And this is by no means the end of the line. I even predict that, in the coming years, we will see engines intelligent enough to autogenerate keywords based on the kind of reader and subject area.
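To make the idea of a suggestion list concrete, here is a minimal Python sketch of my own. It is not how any particular product works. It assumes the manuscript is already split into a list of page texts, and the function names (`candidate_index`, `proper_name_candidates`, `format_pages`), the tiny stopword list, and the "two or more capitalised words" rule for proper names are all hypothetical choices made for illustration. It shows the three assists mentioned above: candidate terms, their spread across pages, and proper names as the easier category.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list; a real tool would use a fuller one.
STOPWORDS = {"the", "and", "that", "this", "with", "from", "were", "have"}


def candidate_index(pages, min_pages=2):
    """Map each candidate term to the sorted page numbers it appears on.

    Terms seen on fewer than `min_pages` pages are dropped, on the
    assumption that topics worth indexing recur across the manuscript.
    """
    occurrences = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[A-Za-z][A-Za-z'-]+", text):
            term = word.lower()
            if len(term) > 3 and term not in STOPWORDS:
                occurrences[term].add(page_number)
    return {
        term: sorted(found)
        for term, found in occurrences.items()
        if len(found) >= min_pages
    }


def proper_name_candidates(pages):
    """Naive heuristic: treat runs of two or more capitalised words as names."""
    pattern = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")
    occurrences = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for name in pattern.findall(text):
            occurrences[name].add(page_number)
    return {name: sorted(found) for name, found in occurrences.items()}


def format_pages(page_numbers):
    """Collapse consecutive pages into ranges, as a printed index would."""
    ranges = []
    for number in page_numbers:
        if ranges and number == ranges[-1][1] + 1:
            ranges[-1][1] = number  # extend the current run
        else:
            ranges.append([number, number])  # start a new run
    return ", ".join(
        str(start) if start == end else f"{start}-{end}" for start, end in ranges
    )


if __name__ == "__main__":
    sample_pages = [
        "Ada Lovelace wrote about the analytical engine.",
        "The analytical engine was designed by Charles Babbage.",
        "Later, Ada Lovelace published notes on the engine.",
    ]
    suggestions = {
        **candidate_index(sample_pages),
        **proper_name_candidates(sample_pages),
    }
    for term in sorted(suggestions, key=str.lower):
        print(f"{term}: {format_pages(suggestions[term])}")
```

The output is only a draft for a human to prune and correct. The heuristics will happily suggest noise and will merge two people who share a name, which is exactly the kind of tweak that still needs an author or indexer.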

Such a plan is undeniably ambitious, and will require quite a different fundamental attitude towards tech and change. But as one scholar wistfully writes about the task of indexing, an arrangement where publishers can take care of indexing well and quickly would be ideal. This can be made real, but only one step at a time.
