Peerset

How Many Ways Can I Say "I Love You"?

One of the biggest problems confronting anyone who wants to understand the meaning of user-generated content (UGC), especially for the purpose of targeted online ads, is synonymy.  There are many ways to express the same idea, and you can always count on the millions of social network users to use them all!  The incredible array of variant forms that express the same underlying idea makes language very rich, but it creates further challenges for machine analysis of the interests and opinions of social network users.  For example, "The Stones", "The Rolling Stones", "Mick Jagger and the guys", "TRS" and many other terms could all refer to The Rolling Stones, while "Lord of the Rings", "LOTR", "Return of the King", "Gandalf is the man!", "Tolkien", "Tollkein" (and innumerable other misspellings) could all refer to The Lord of the Rings.

When you parse this data and run any sort of statistical analysis on it, expressions that denote the same underlying concept should be treated as related to each other and, ideally, normalized to a common form.  However, this is especially difficult when we are dealing with the latest movies, books and bands, because there is no thesaurus that can be used to disambiguate and relate all of these in the right way.

One approach to this problem is to use “fuzzy matching” and pattern matching algorithms that relate entities whose written forms are similar to each other.  These work well when the variants share a good percentage of their characters, as in the case of “The Wedding Crashers” and “Wedding Crasher”, but when nothing in the form of the words shows that they refer to the same thing, the problem becomes much harder.  Unless it were using a thesaurus or ontology of some kind that included millions of references to popular culture, how would a machine know that “Insane Clown Posse Fan” and “Juggalo” are synonyms?  A good way to solve this problem is to go to the source - to use sites whose UGC is more structured to help build a thesaurus that can then be used to digest UGC from less structured contexts.
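To make the two approaches above concrete, here is a minimal Python sketch, not a description of any production system. It uses the standard library's difflib.SequenceMatcher for the similarity score. The THESAURUS and CANONICAL tables, the normalize function and the 0.8 threshold are all hypothetical and chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thesaurus: known variants mapped to a canonical name.
# In practice this would be built from more structured UGC sources.
THESAURUS = {
    "juggalo": "Insane Clown Posse",
    "insane clown posse fan": "Insane Clown Posse",
    "lotr": "The Lord of the Rings",
    "the stones": "The Rolling Stones",
}

# Hypothetical list of canonical entity names to fuzzy-match against.
CANONICAL = [
    "Wedding Crashers",
    "The Lord of the Rings",
    "The Rolling Stones",
    "Insane Clown Posse",
]


def _clean(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, _clean(a), _clean(b)).ratio()


def normalize(term, threshold=0.8):
    """Map a term to a canonical name, or return None if nothing fits."""
    key = _clean(term)
    # 1. Exact thesaurus hit: handles synonyms with no surface similarity.
    if key in THESAURUS:
        return THESAURUS[key]
    # 2. Fuzzy fallback: handles near-identical spellings.
    best = max(CANONICAL, key=lambda name: similarity(term, name))
    return best if similarity(term, best) >= threshold else None


print(normalize("The Wedding Crashers"))   # matched by string similarity
print(normalize("Juggalo"))                # matched only via the thesaurus
print(normalize("Tollkein"))               # too far from every canonical name
```

Note that the misspelling "Tollkein" falls through both steps here: it is not in the toy thesaurus, and it looks nothing like "The Lord of the Rings" as a string. That is exactly the gap a larger thesaurus built from structured sources is meant to close.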

One also runs into challenges due to the highly idiomatic nature of a lot of UGC, such as the many wonderful ways that people have of expressing their hatred for something, for example:  "I enjoy all music except for country.", "heavy metal sucks big time", "I love Paris Hilton – yeah right!!", "I'll chat about anything but philosophy and maybe politics."  A fairly sophisticated sentiment analysis engine is needed to correctly assess the meaning inherent in sentences such as these.  On top of all this, I hardly need to mention the copious use of slang, jargon and acronyms, which complicates matters further…  I'm getting a headache just thinking about this labyrinth of language.
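To show why simple rules only get you part of the way with sentences like these, here is a small Python sketch of a naive pattern-based approach. The NEGATIVE_PATTERNS list and the disliked_topics function are hypothetical and written only for this illustration; a real sentiment analysis engine would need far more than a handful of regular expressions.

```python
import re

# Hypothetical patterns. Each one captures the topic a user excludes or dislikes.
NEGATIVE_PATTERNS = [
    re.compile(r"\bexcept(?: for)?\s+(?P<topic>[^.!?]+)", re.I),
    re.compile(r"\banything but\s+(?P<topic>[^.!?]+)", re.I),
    re.compile(r"^(?P<topic>.+?)\s+sucks\b", re.I),
]

# Splits "philosophy and maybe politics" into separate topics.
TOPIC_SPLIT = re.compile(r"\s*(?:,|\band\b|\bor\b)\s*(?:maybe\s+)?")


def disliked_topics(sentence):
    """Return the lowercased topics that the sentence expresses dislike for."""
    topics = []
    for pattern in NEGATIVE_PATTERNS:
        match = pattern.search(sentence.strip())
        if match:
            parts = TOPIC_SPLIT.split(match.group("topic"))
            topics.extend(p.strip().lower() for p in parts if p.strip())
    return topics


for text in [
    "I enjoy all music except for country.",
    "heavy metal sucks big time",
    "I'll chat about anything but philosophy and maybe politics.",
    "I love Paris Hilton - yeah right!!",
]:
    print(text, "->", disliked_topics(text))
```

The first three sentences are caught by the patterns, but the sarcastic one returns an empty list: nothing in its surface form says "dislike", so a rule-based pass reads it as neutral (or worse, a keyword matcher would read "love" as positive).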

Peerset's Concept Specific Ontology and Interest Correlation Analyzer work together to solve the problems outlined above.  We are very aware of the many pitfalls awaiting those who venture into this quagmire, and we are constantly upgrading our systems with cutting-edge technologies to keep ahead of the curve.  However you choose to tackle this, it is important to get as much meaning as possible out of the text you are looking at in order to maximize the return on your investment in ad targeting technologies.  Social media monetization is a thin-margin game, so squeezing the most out of what you have available will benefit the publisher, the advertiser and the ad network.