NLG in Practice

How automated content defines news

When it comes to Natural Language Generation (NLG), many feel that human reporters have one clear advantage over their robotic brethren: confronted with raw data, machines are unable to judge accurately what is worthy of being called ‘news’. Or, more specifically, they cannot leave to one side everything that is NOT news.

It’s a viewpoint that has some foundation, albeit one that is neither particularly strong nor stable; it assumes all automation does is go through datasets and regurgitate numbers into sentences and paragraphs rather than cells and rows.

Automation products are actually a lot more complex. While a very simple product takes raw data (single words, numbers, figures, etc.) and turns it into rote sentences with little variation or context, more complex systems, such as Retresco’s textengine pro or our streamlined service, go much further and deeper.

This can be looked at in two distinct ways: design and execution.

When it comes to design, more complex NLG systems are built with the behind-the-scenes ability to assess and analyse data, and then select the sentence pattern that corresponds to it (known as a ‘template’).

A very simple example is generating personnel records for a large company. If a data field recorded gender, it could hold one of three values, e.g. ‘male’, ‘female’, or ‘not specified’. With that data field available, the NLG system could generate sentences beginning with ‘he’, ‘she’, or ‘they’, depending on the value in that field. The piece of code that instructs the system to do that is known as a ‘condition’.
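As a rough illustration, such a condition can be sketched in a few lines of Python. This is a minimal sketch, not the syntax used in any Retresco product: the field names (gender, department, start_year), the function names and the sample sentence are all assumptions made for the example.

```python
# Hypothetical mapping from the value of the gender field to the pronoun
# that opens the sentence. Field values follow the example in the text.
PRONOUNS = {"male": "He", "female": "She", "not specified": "They"}


def opening_pronoun(record: dict) -> str:
    """The 'condition': pick a pronoun based on the record's gender field."""
    gender = str(record.get("gender") or "not specified").strip().lower()
    # Anything unexpected falls back to the neutral pronoun.
    return PRONOUNS.get(gender, "They")


def describe(record: dict) -> str:
    """Fill a simple, invented template using the chosen pronoun."""
    pronoun = opening_pronoun(record)
    # The verb has to agree with the pronoun, so it is conditional as well.
    verb = "have" if pronoun == "They" else "has"
    return (
        f"{pronoun} {verb} worked in {record['department']} "
        f"since {record['start_year']}."
    )
```

Calling describe with a record whose gender field is ‘female’ yields a sentence beginning with ‘She has’, while a missing or unrecognised value falls back to ‘They have’. Even this toy case shows that one condition often drags another along with it, here the verb agreement.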

These ‘conditions’ range from the very simple, as in the example above, to the much more complex, particularly in areas such as sports and finance. With the former, the wealth of information we have has allowed us to create over 6,000 templates for the German-language soccer project, with a further 700 planned and 500 that could be used at a later date. We have so much variation on this project because we recognise that in soccer anything can, and often does, happen, and we need to be prepared for that.

The ‘execution’ side of this is most apparent in our financial reporting products. When we first built our stock market project a couple of years ago, we developed a process to evaluate and grade the changes in price of stocks on an index. If a stock fell in price by, for example, 0.1 per cent, we would record this as a ‘neutral’ change, recognising that its movement had been negligible. The template generated would reflect this, saying something along the lines of: “The price of stock A recorded no real change today, dipping just 0.1 per cent over the course of trading.” If the change had been larger, say a 10 per cent fall, the system would return a template resembling this: “Stock A saw its value plummet 10 per cent today over the course of trading.”
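The grading step described above can be sketched roughly as follows. This is an illustrative sketch only: the thresholds (0.5 and 5 per cent), the grade names and every template other than the two quoted in the text are assumptions, not the values used in the actual product.

```python
def grade_change(pct_change: float,
                 neutral_band: float = 0.5,
                 sharp_band: float = 5.0) -> str:
    """Grade a daily price change. Both thresholds are hypothetical."""
    size = abs(pct_change)
    if size == 0:
        return "unchanged"
    if size < neutral_band:
        return "neutral"
    direction = "fall" if pct_change < 0 else "rise"
    strength = "sharp" if size >= sharp_band else "moderate"
    return f"{strength}_{direction}"


# One template per grade. 'neutral' and 'sharp_fall' follow the sentences
# quoted in the text; the others are invented for completeness.
TEMPLATES = {
    "unchanged": "The price of {name} was unchanged today.",
    "neutral": ("The price of {name} recorded no real change today, "
                "{verb} just {size:g} per cent over the course of trading."),
    "moderate_fall": "{name} fell {size:g} per cent today over the course of trading.",
    "sharp_fall": ("{name} saw its value plummet {size:g} per cent today "
                   "over the course of trading."),
    "moderate_rise": "{name} rose {size:g} per cent today over the course of trading.",
    "sharp_rise": ("{name} saw its value surge {size:g} per cent today "
                   "over the course of trading."),
}


def render(name: str, pct_change: float) -> str:
    """Grade the change, then fill the template that matches the grade."""
    grade = grade_change(pct_change)
    verb = "dipping" if pct_change < 0 else "edging up"
    return TEMPLATES[grade].format(name=name, verb=verb, size=abs(pct_change))
```

With these assumed thresholds, render("stock A", -0.1) selects the neutral template and render("Stock A", -10) selects the ‘plummet’ template. The point of the design is that the number alone never picks the wording; the grade does.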

Those sentences are basic, but with proper data, increasingly complex formulations are returned. At some point, with the correct data in place, this kind of passage could appear: “The stock price of A fell sharply today, dropping 8 per cent over the course of trading. Similar large falls were seen in B and C, which saw respective declines of 6.5 and 7 per cent. The fall in price of A was the largest decrease in its value since 1 January 1980.”

Such a disastrous change in a stock’s price would be in the lede in any newspaper. If all an NLG system did was turn numbers into sentences, that vital information might be buried within a long paragraph, deep within the story and surrounded by other, less important news. What the rtr textengine products can do is, through our technical magic, identify that information as important, put it at the top, and map out which templates are needed to buttress that story.
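One simple way to picture this prioritisation is to give every candidate statement an importance score and sort the story by it. The sketch below is an assumption about how such ranking could work in principle, not a description of how the textengine actually does it; the Fact class, the scoring rule and the record bonus are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    sentence: str      # text already produced from a template
    importance: float  # hypothetical score: higher means more newsworthy


def score_move(pct_change: float, is_record: bool = False) -> float:
    """Toy scoring rule: bigger moves matter more, records matter most."""
    score = abs(pct_change)
    if is_record:
        score += 10.0  # arbitrary bonus, chosen only for illustration
    return score


def order_story(facts: list[Fact]) -> list[str]:
    """Return the sentences with the most important one first (the lede)."""
    ranked = sorted(facts, key=lambda fact: fact.importance, reverse=True)
    return [fact.sentence for fact in ranked]
```

Given a flat stock, a 6.5 per cent fall and a record-breaking 8 per cent fall, order_story places the record-breaking fall first regardless of where it sat in the raw data. The scoring rule itself is where human editorial judgement enters.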

Until we have truly autonomous and intelligent AI, the parameters that govern this are set and implemented by humans, meaning that flesh-and-blood creatures are still vital to the process. And not just at the beginning, but as an ongoing feature, since these systems are continually updated and improved upon. This means that people are still the most important driver in determining the success of a project.


About Retresco

Founded in Berlin in 2008, Retresco has become one of the leading companies in the field of natural language processing (NLP) and machine learning. Retresco develops semantic applications in the areas of content classification and recommendation, as well as highly innovative technology for natural language generation (NLG). Through nearly a decade of deep industry experience, Retresco helps its clients accelerate digital transformation, increase operational efficiencies, and enhance customer engagement.