When it comes to Natural Language Generation (NLG), many feel that one advantage that human reporters have over their robotic brethren is that the latter are unable to accurately define, when confronted with raw data, what is worthy of being called ‘news’. Or, more specifically, they cannot leave to one side everything that is NOT news.

It’s a viewpoint that has some foundation, albeit one that is neither particularly strong nor stable; it assumes all automation does is go through datasets and regurgitate numbers into sentences and paragraphs rather than cells and rows.

Automation products are actually a lot more complex. While a very simplistic product takes raw data (single words, numbers, figures, etc.) and turns it into rote sentences with little variation or context, more complex systems, such as Retresco’s textengine pro or our streamlined service textengine.io, go much further and deeper.

This can be looked at in two distinctly different ways: design and execution.

When it comes to design, more complex NLG systems are built with the behind-the-scenes ability to assess and analyse data, and to find the corresponding sentence (known as a ‘template’).

A very simple version of this could be when generating personnel records for a large company. If there were a data field that recorded gender, it could be filled in one of three ways, e.g. ‘male’, ‘female’, or ‘not specified’. With that data field available, the NLG system could choose to generate sentences beginning with ‘he’, ‘she’, or ‘they’, depending on whatever was in that data field. The piece of code that instructs the system to do that is known as a ‘condition’.
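As a rough illustration of such a condition, here is a minimal Python sketch. It is not Retresco's actual implementation; the field names (`gender`, `department`, `start_year`) and the helper functions are hypothetical and chosen only for this example.

```python
# Hypothetical mapping from the gender data field to a sentence-opening pronoun.
PRONOUNS = {"male": "He", "female": "She", "not specified": "They"}


def opening_pronoun(gender_field: str) -> str:
    """The 'condition': pick a pronoun based on the content of the data field.

    Unknown or unexpected values fall back to 'They' (an assumption of this sketch).
    """
    return PRONOUNS.get(gender_field.strip().lower(), "They")


def personnel_sentence(record: dict) -> str:
    """Fill a simple template for one (hypothetical) personnel record."""
    pronoun = opening_pronoun(record.get("gender", "not specified"))
    # Keep the verb in agreement with the chosen pronoun.
    verb = "have" if pronoun == "They" else "has"
    return f"{pronoun} {verb} worked in {record['department']} since {record['start_year']}."


print(personnel_sentence({"gender": "female", "department": "Finance", "start_year": 2015}))
```

Even in this tiny case the condition has to do more than swap one word: choosing ‘they’ also changes the verb form, which is the kind of detail the templates need to account for.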

These ‘conditions’ range from the very simplistic, as in the example above, to the much more complex, particularly in areas such as sports and finance. With the former, the wealth of information we have has allowed us to create over 6,000 templates, with a further 700 planned and 500 that could be used at a later date, for the German-language soccer project. We have so much variation on this project because we recognise that in soccer, anything can (and often does) happen, and we need to be prepared for that.

The ‘execution’ side of this is most apparent in our financial reporting products. When we first built our stock market project a couple of years ago, we developed a process to evaluate and grade the changes in price of stocks on an index. If the stock fell in price by, for example, 0.1 per cent, we could record this as a ‘neutral’ change, recognising that its movement had been negligible. The template generated would reflect this, saying something along the lines of, “The price of stock A recorded no real change today, dipping just 0.1 per cent over the course of trading.” If the change had been larger, say a 10 per cent fall, the system would return a template resembling this: “Stock A saw its value plummet 10 per cent today over the course of trading.”
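The grading step described above might be sketched as follows. This is an illustration under stated assumptions, not the production logic: the thresholds (`NEUTRAL_BAND`, `SHARP_MOVE`), the grade names other than ‘neutral’, and the wording of the ‘fall’, ‘rise’, and ‘sharp_rise’ templates are all invented for this example. Only the ‘neutral’ and ‘plummet’ sentences come from the text.

```python
# Assumed thresholds, in percentage points; real values would be set by editors.
NEUTRAL_BAND = 0.5   # moves smaller than this count as 'neutral'
SHARP_MOVE = 5.0     # moves at least this large count as 'sharp'

TEMPLATES = {
    "neutral": "The price of stock {name} recorded no real change today, "
               "{verb} just {size} per cent over the course of trading.",
    "fall": "Stock {name} fell {size} per cent today over the course of trading.",
    "sharp_fall": "Stock {name} saw its value plummet {size} per cent today "
                  "over the course of trading.",
    "rise": "Stock {name} rose {size} per cent today over the course of trading.",
    "sharp_rise": "Stock {name} saw its value soar {size} per cent today "
                  "over the course of trading.",
}


def grade_change(pct: float) -> str:
    """Grade a daily percentage change (negative values are falls)."""
    if abs(pct) < NEUTRAL_BAND:
        return "neutral"
    if pct <= -SHARP_MOVE:
        return "sharp_fall"
    if pct >= SHARP_MOVE:
        return "sharp_rise"
    return "fall" if pct < 0 else "rise"


def describe(name: str, pct: float) -> str:
    """Select the template matching the grade and fill in the figures."""
    verb = "dipping" if pct < 0 else "rising"  # only used by the neutral template
    return TEMPLATES[grade_change(pct)].format(
        name=name, verb=verb, size=f"{abs(pct):g}"
    )


print(describe("A", -0.1))
print(describe("A", -10))
```

The point of the sketch is that the number alone does not choose the sentence; the grade does, and the grade depends on thresholds that someone has to set.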

Those sentences are basic, but with proper data, increasingly complex formulations are returned. At some point, with the correct data in place, this kind of passage could appear: “The stock price of A fell sharply today, dropping 8 per cent over the course of trading. Similar large falls were seen in B and C, which saw respective declines of 6.5 and 7 per cent. The fall in price of A was the largest decrease in its value since 1 January 1980.”

Such a disastrous change in a stock’s price would be in the lede in any newspaper. If all an NLG system did was turn numbers into sentences, that vital information might be buried within a long paragraph, deep within the story and surrounded by other, less important news. What the rtr textengine products can do is, through our technical magic, put that information at the top after identifying it as important and mapping out which templates are needed to buttress that story.
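One simple way to picture this ordering step is to give each generated statement a newsworthiness score and sort the story by it. The sketch below is purely illustrative and makes no claim about how the rtr textengine works internally; the `Fact` class, the scoring rule, and the `RECORD_BONUS` value are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed bonus for a historic record, e.g. the largest fall since a given date.
RECORD_BONUS = 100.0


@dataclass
class Fact:
    sentence: str            # the already generated sentence
    pct_change: float        # the day's change in per cent
    is_record: bool = False  # whether the move is a historic extreme


def newsworthiness(fact: Fact) -> float:
    """Hypothetical score: bigger moves matter more, records matter most."""
    score = abs(fact.pct_change)
    if fact.is_record:
        score += RECORD_BONUS
    return score


def order_story(facts: list[Fact]) -> list[str]:
    """Return the sentences with the most newsworthy one first (the lede)."""
    ranked = sorted(facts, key=newsworthiness, reverse=True)
    return [fact.sentence for fact in ranked]


facts = [
    Fact("Stock D recorded no real change today.", -0.1),
    Fact("Stock B fell 6.5 per cent.", -6.5),
    Fact("Stock A fell 8 per cent, its largest decrease since 1 January 1980.", -8.0, is_record=True),
    Fact("Stock C fell 7 per cent.", -7.0),
]

for sentence in order_story(facts):
    print(sentence)
```

With these assumed scores, the record fall in A leads the story, followed by the falls in C and B, and the negligible move in D comes last.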

Until we have truly autonomous and intelligent AI, the parameters that govern this are decided and implemented by humans, meaning that flesh-and-blood creatures are still vital to the process. And not just at the beginning, but as an ongoing feature, since these systems are continually updated and improved upon. This means that people are still the most important driver in determining the success of a project, something I’ve written about here, here, and here.



For more information, please contact:

Pete Carvill (@pete_carvill)
Communications Manager
+49 (0)30 555 781 999



About Retresco

Founded in Berlin in 2008, Retresco has become one of the leading companies in the field of natural language processing (NLP) and machine learning. Retresco develops semantic applications in the areas of content classification and recommendation, as well as highly innovative technology for natural language generation (NLG). Drawing on nearly a decade of deep industry experience, Retresco helps its clients accelerate digital transformation, increase operational efficiencies, and enhance customer engagement.

