Feature engineering is the most important step in designing a modern information retrieval system like Effingo. Independent of the similarity measure and the clustering algorithm the final system will use, the features extracted from the forum content will shape the results like nothing else.
A feature is a piece of information used to compare two contributions in the final system. From the two sets of all features of two contributions, the similarity measure calculates a similarity value. In classical information retrieval, features were simple keywords taken from the two documents to compare. Modern information retrieval systems use more sophisticated feature types. The success or failure of the system depends strongly on choosing the correct feature types.
Effingo's feature types are organised into three categories – local, contextual and structural. The following paragraphs present the feature types, ordered by these three categories.
Local features are taken directly from the raw text of a contribution. They mostly resemble classical IR features.
The raw text is the simplest source for features. Effingo can handle it with classical information retrieval methods: filtering stop words, stemming the remaining words and building a keyword index. To improve on this feature type, one can apply methods for calculating document similarity as described in (Andrei Z. Broder, 2000, 1--10).
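As a rough illustration, this classical pipeline could look like the following minimal sketch. The stop word list and the suffix stemmer here are placeholders; a real system would use a full per-language stop word list and a proper stemmer such as Porter's.

```python
import re
from collections import defaultdict

# Illustrative stop word list; a real system would use a full list per language.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}

def crude_stem(word):
    """Very rough suffix stripping; stands in for a proper stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(text):
    """Tokenize, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(contributions):
    """Inverted keyword index: keyword -> set of contribution ids."""
    index = defaultdict(set)
    for cid, text in contributions.items():
        for kw in keywords(text):
            index[kw].add(cid)
    return index
```

The inverted index then answers the classical question "which contributions mention this keyword?" in one lookup, which is the basis for candidate retrieval before any similarity measure runs.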
However, there is a big problem that applies specifically to user-generated content, so research on it is still in its infancy: forum posts vary greatly in the quality of their spelling and grammar. This is because forum posts are not edited like news articles or books, and everyone can produce them. Many established natural language processing methods such as stemming, stop word removal or even tokenization are hard to apply to such content. Therefore it might be necessary to concentrate more elaborate text processing on high-quality content. It is possible to find such content, as shown for example in (Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne, 2008, 183--194). Effingo can process the remaining content using language-independent methods like N-gram segmentation and hashing.
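A language-independent fallback along these lines might look as follows. The trigram size and the use of Python's built-in `hash` are illustrative choices; any stable hash function would do.

```python
def char_ngrams(text, n=3):
    """Language-independent character n-grams; no tokenization or stemming needed."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity over hashed character n-grams of two posts."""
    ga = {hash(g) for g in char_ngrams(a, n)}
    gb = {hash(g) for g in char_ngrams(b, n)}
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

Because the n-grams ignore word boundaries entirely, the measure degrades gracefully on misspelled or ungrammatical text where tokenization and stemming would fail.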
Another problem applying especially to unstructured social media contributions is their brevity. It is possible – though subject to further research – that the body text of many contributions does not contain enough information to assign them to the correct cluster.
Part-of-speech (PoS) tags assign each word in a sentence to a part of speech such as noun, verb or adjective. They are created by so-called tagger programs. Taggers use machine learning approaches: they are trained on tagged sentences, from which they create an internal model of words, sentence structure and the tags belonging to words.
PoS tags are useful for finding patterns in the sentence structure of a forum contribution. The hypothesis that must be proven is that there are established sentence patterns for stating certain kinds of problems or explanations. For example, the pattern “question word, verb” applies to question sentences like “What is” or “Who are”. Effingo can use this knowledge to find types of forum posts and compare only contributions of the same type. That way the candidate set for comparison is reduced and performance increases.
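Assuming Penn-Treebank-style tags from an external tagger, such a pattern check could be sketched like this. The tag sets and the "question" heuristic are assumptions for illustration, not a validated classifier.

```python
# Penn-Treebank-style tags are assumed (WP = wh-pronoun, WRB = wh-adverb,
# WDT = wh-determiner, VB* = verb forms); the tagged input would come
# from an external tagger.
QUESTION_OPENERS = {"WP", "WRB", "WDT"}

def is_question_pattern(tagged_sentence):
    """Detect the 'question word, verb' opening pattern in a tagged sentence."""
    if len(tagged_sentence) < 2:
        return False
    (_, tag1), (_, tag2) = tagged_sentence[0], tagged_sentence[1]
    return tag1 in QUESTION_OPENERS and tag2.startswith("VB")

def post_type(tagged_sentences):
    """Classify a post as 'question' if any sentence opens with the pattern."""
    return "question" if any(is_question_pattern(s) for s in tagged_sentences) else "other"
```

With post types in hand, the comparison step only needs to pair questions with questions, shrinking the candidate set as described above.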
Unfortunately, PoS tags face the same problem as other language-dependent text features: poor-quality content is hard, if not impossible, to tag. In the best case the tagger recognises this problem – in the worst it assigns wrong tags to unrecognised words. A second problem is that even high-quality content can contain areas that are impossible to tag, such as source code, tables or lists formed of sentence fragments. Before applying the tagger, Effingo would need to find such content and exclude it from tagging. Often such content is annotated by the forum engine and thus easy to find. However, code can occur within a free-text sentence, and lists might be created without using the forum's list feature. Such error sources need to be detected automatically.
Named Entities and Facts
In addition to the usual vocabulary, a forum also contains many domain-dependent terms: product names in a customer support forum, location names in a travel forum or person names in a political forum, for example. Such entities usually have different representations, like abbreviations and acronyms. Named entity recognisers (NER) are able to find such entities and even map different representations to the same concept. Facts, in addition, are relations between entities, like “Windows is a product by Microsoft”. Extracting such entities and facts from contributions enables Effingo to assign a high similarity score to posts that share many of them.
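A simple way to turn shared entities or facts into a score is plain set overlap. The sketch assumes an upstream NER step has already mapped different surface forms onto canonical concepts.

```python
def entity_similarity(entities_a, entities_b):
    """Jaccard overlap of the entity (or fact) sets of two contributions.

    The sets are assumed to come from an upstream NER step that has
    already mapped different surface forms to one canonical concept.
    """
    if not entities_a and not entities_b:
        return 0.0
    return len(entities_a & entities_b) / len(entities_a | entities_b)
```

Facts can be scored the same way by representing each one as a tuple such as `("Windows", "product_of", "Microsoft")` and taking the overlap of the tuple sets.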
NER and fact extraction face the same problems as PoS tagging. The NER system must be trained in advance, it is vulnerable to poor-quality content, and the training phase is required for each new forum, or at least for each new forum domain. Since annotating a large set of examples for learning by hand is tedious work, one cannot expect a forum's operator to do this. In addition, NER systems also require free text without code snippets, tables and lists to work correctly.
Users usually can add links to external resources in their contribution text. Such links provide a reference to some external resource the creator of the thread has requested, or that the answering user thinks is helpful to the discussion. Similar discussions will attract similar resources. Therefore it is reasonable to assume that two contributions pointing to the same resource – or even the same set of resources – share a high degree of similarity.
There are, however, a few problems with link detection. If we are lucky, links are marked explicitly by a pair of tags. We assume this will be the case for most links, since it lets readers click the link directly instead of copying it manually into a browser's address bar. Unmarked links are harder to detect, since tokens like the full stop “.” or the question mark “?” are valid inside a link but often mark the end of a sentence. If a link sits at the end of a sentence, it is hard to distinguish the sentence mark from the link's characters. And even if we can detect most links, it is still difficult to decide whether two links point to the same resource, because a resource may be accessible under different URLs. Link normalisation can be carried out by looking at the actual target of the link, but this approach increases the workload of the Effingo system for each extracted link and requires comparing each link target with every other.
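A purely syntactic sketch of both steps might look like this. The regular expression, the stripped punctuation set and the normalisation rules are heuristics, not a complete solution; in particular the normalisation here drops query strings, which may be too aggressive for some sites.

```python
import re
from urllib.parse import urlsplit

# Explicitly marked links would be taken from anchor tags; this regex only
# approximates detection of unmarked links in free text.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def extract_links(text):
    links = []
    for raw in URL_RE.findall(text):
        # Strip trailing sentence punctuation that would be valid inside
        # a URL but more likely ends the sentence here.
        links.append(raw.rstrip(".,;:!?)"))
    return links

def normalise(url):
    """Syntactic normalisation only (case, trailing slash, dropped query);
    resolving the actual target would need an HTTP request per link."""
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme}://{parts.netloc}{path}"
```

Syntactic normalisation catches the cheap cases; only the remaining ambiguous pairs would then need the expensive target-resolution step.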
Contextual features are all features taken from the context of the contribution. One could also say they are the contribution's meta information.
A contribution's title is similar to the contribution's body in that it is free text as well. However, a title often carries much more significance, since it is a very condensed description of the topic discussed within the contribution. Effingo can handle it similarly to body text and apply local features like “Text”, “PoS Tags” and “Named Entities and Facts”, but weight them differently.
However, using a contribution's title faces the same low-quality content problems as its body. In addition, many lazy users choose a very short, meaningless title like “buh”, “problem” or “need help”. A solution to these problems might be to apply quality filtering and consider only titles above some length threshold.
The second problem is that contribution titles are usually only set at the beginning of a forum thread. Even though it is possible to give each contribution in the thread a different title, few users do. So the further a contribution is from the start of the thread, the higher the probability that its topic has drifted away from the opening contribution. Therefore titles on late contributions need to be handled with care.
At first the publication date does not seem to be a good indicator for calculating the similarity of user-generated content. However, since user-generated content is usually coupled closely to real-time events, it is possible to relate contributions in a given time frame to events occurring at that time. After each release of a new “Ubuntu” version, for example, the forums at ubuntu.com or ubuntuusers.de are flooded with threads about this release. If the new version has some specific bug, multiple discussions about this bug are created. Shortly after the release there is usually a fix or workaround, and discussion about the bug subsides. This suggests two conclusions: a burst of contributions in a certain time frame indicates an important event, and many contributions in that time frame might belong to that event; and contributions from the same time might have a higher similarity. The second point is not certain and is subject to further research. It is also possible that the probability of encountering the same question again increases with the time between two contributions.
A second indicator provided by a contribution's date is its correlation to the first occurrence of a product or event, which can be used for candidate set generation. If Effingo “knows” that the iPhone was first introduced on 9th January 2007, it need not compare contributions about the iPhone to contributions from the year 2000 or earlier. Of course there might be interesting rumours before this date, so candidates should also be chosen from dates shortly before this deadline. This means the candidate suitability value should gradually decrease the further a contribution's date lies before the introduction.
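One possible shape for such a suitability value is an exponential decay before the introduction date. The half-life parameter is purely illustrative; the right falloff would have to be determined empirically.

```python
import math
from datetime import date

def date_suitability(post_date, introduction_date, half_life_days=30):
    """Candidate suitability from a contribution's date relative to a
    product's known introduction date (hypothetical parameterisation).

    Posts on or after the introduction get full suitability; earlier posts
    decay exponentially, so rumours shortly before the date still qualify
    while posts years earlier are effectively excluded.
    """
    days_before = (introduction_date - post_date).days
    if days_before <= 0:
        return 1.0
    return math.exp(-math.log(2) * days_before / half_life_days)
```

With a 30-day half-life, a post a month before the introduction keeps half its suitability, while a post from 2000 compared against the iPhone's 2007 introduction is suppressed almost entirely.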
To really use this feature type, it is necessary to combine it with additional feature types.
Many forums allow users to rate the contributions of other users. Ratings usually take the form of a five-star system, a simple categorisation (“not helpful”, “helpful”, “solution”) or a point system (see slashdot.org). This is a good feature for the final step of labelling a cluster of similar contributions or finding a representative. It can also be used to find the high-quality content proposed above to improve the applicability of local features and the title feature type.
Unfortunately, high quality alone says nothing about the similarity of two contributions.
The context of a forum can drastically change the meaning of certain terms or improve their usefulness for information retrieval. The term Java has no discriminating power in a Java forum. In a .Net forum, on the other hand, it marks a small number of threads that can be grouped together, since they will mostly discuss the differences between the two programming languages.
In addition, if the forum's context describes relations between terms, this opens up the possibility of grouping contributions hierarchically. Posts about components of a bigger concept, for example, could be grouped as a subcluster. If one thread offers a solution to a problem with Ubuntu's Twitter client “Gwibber” and another discusses problems with Ubuntu's new social media integration, and Effingo “knows” that Gwibber is part of Ubuntu, it is able to group both together. The answer to the generic problem might then serve the specific problem without mentioning it, and vice versa.
To really apply such a context data structure, other techniques like named entity and fact recognition are needed to detect the concepts and relations it describes.
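Given such a data structure, the grouping decision itself is simple. This sketch assumes a toy part-of map in place of a real ontology or taxonomy for the forum's domain.

```python
# Toy part-of taxonomy; a real one would come from an ontology or
# taxonomy covering the forum's domain.
PART_OF = {"Gwibber": "Ubuntu", "Nautilus": "Ubuntu", "Ubuntu": None}

def ancestors(concept):
    """Chain of enclosing concepts, e.g. Gwibber -> Ubuntu."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PART_OF.get(concept)
    return chain

def hierarchically_related(concept_a, concept_b):
    """True if one concept is (transitively) a part of the other,
    so posts about them can be grouped as cluster and subcluster."""
    return concept_a in ancestors(concept_b) or concept_b in ancestors(concept_a)
```

A Gwibber thread would thus land in a subcluster of the Ubuntu cluster, while two sibling components stay in separate subclusters.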
Assigning a context to a forum is, however, no easy task. Some data structure like an ontology, a taxonomy or simply a list of important terms is necessary. Such a data structure is hard to assign and even harder to create. It is subject to further research whether there already are usable structures for certain forums. If there are not, an interesting question is whether it is possible to motivate forum users to create one as they use the forum. This works quite well for tagging, but is it possible to adapt tools like Ontofly or Webprotegé so that users of all forums (or a meaningful subset) are able and willing to use them?
Structural features, finally, are all pieces of information describing the forum around the contribution. They characterise how it is embedded into the forum's structure.
In contrast to external links, internal links connect two contributions within the same forum. Usually they are created by some expert linking two threads because he knows there is a similarity between them. This leads to the conclusion that two linked threads contain similar contributions with a higher probability. It is even possible to create a link graph over a forum and analyse tightly coupled subgraphs. At least the questions in such a graph that are close to each other might correlate; the further apart two threads are, the less similarity exists between their contributions.
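The link graph idea can be sketched with a plain breadth-first search; the decay function turning hop distance into a similarity prior is an arbitrary illustrative choice.

```python
from collections import deque

def thread_distance(links, start, goal):
    """BFS hop count between two threads in the internal link graph;
    None if they are not connected. links: thread id -> linked thread ids."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def link_prior(links, a, b):
    """Similarity prior that decays with link distance (illustrative form)."""
    d = thread_distance(links, a, b)
    return 0.0 if d is None else 1.0 / (1 + d)
```

Directly linked threads get a prior of 0.5, threads two hops apart 1/3, and unconnected threads contribute nothing, matching the intuition that similarity fades with graph distance.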
In addition to the problems already described for the detection of external links, internal links are rare, so Effingo cannot depend on them.
Social Graph, Reply Graph
One of the most interesting structural features is a forum's social or reply graph. The users are the nodes of the graph, and there is a directed edge between two users if one answered a contribution of the other. There is already some research showing that this graph can be used to capture the expertise of a forum's users. Each answer a user provides reflects some part of that user's expertise. If several users have overlapping expertise, topics become evident when looking at the intersections of the threads these users answered. In addition, it is possible to look at the left and right context of a thread's development to find out who answers whom. Users who post the last contribution in most threads have a high probability of being seen as experts in their area.
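Under the simplifying assumption that each post answers the one directly before it, the reply graph and a crude expertise signal could be built like this; real reply structure would need the quoting or threading information mentioned above.

```python
from collections import defaultdict

def reply_graph(threads):
    """Directed reply edges with counts, assuming each post answers the
    previous one. threads: list of author sequences in posting order."""
    edges = defaultdict(int)
    for authors in threads:
        for prev, curr in zip(authors, authors[1:]):
            if prev != curr:              # ignore self-replies
                edges[(curr, prev)] += 1  # curr answered prev
    return edges

def answer_counts(edges):
    """How often each user answered others; a crude expertise signal."""
    counts = defaultdict(int)
    for (answerer, _), n in edges.items():
        counts[answerer] += n
    return counts
```

On top of these counts one could then intersect the thread sets of high-count users to surface the topic overlaps described above.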
Unfortunately, I believe that significant patterns will only emerge in long threads.
As each contribution is assigned to some channel of threads, and each channel defines a certain subtopic of the domain discussed in the forum, contributions from the same sub-forum have a higher probability of sharing topics. But since the most interesting similar threads are those one cannot find in the same channel, Effingo should not rely on this feature type absolutely.
This feature type is extensible to the whole web.
Position in Thread
The last interesting structural feature is the position of a contribution inside its thread. Most obviously, the first contribution is a question or statement that is discussed in the following contributions. Since it is usually not helpful to compare apples and oranges, opening contributions should be compared with other opening contributions; at least their probability of being similar is higher. Regarding the general location of two contributions in a thread, I think the distance between two contributions in the same thread influences their probability of being similar. For inter-thread similarity, the position is helpful for filtering out off-topic contributions, since they tend to occur at the end of long threads.
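A possible heuristic prior combining these observations might look like this; the thresholds and the damping factor are illustrative, not tuned.

```python
def position_prior(pos_a, len_a, pos_b, len_b):
    """Heuristic prior for comparing two contributions by thread position
    (positions are 0-based; all parameters here are illustrative).

    Opening posts are matched against other opening posts with full
    weight; otherwise the prior shrinks with the difference in relative
    position, and late posts in long threads are damped as likely
    off topic.
    """
    if pos_a == 0 and pos_b == 0:
        return 1.0
    rel_a = pos_a / max(len_a - 1, 1)
    rel_b = pos_b / max(len_b - 1, 1)
    prior = 1.0 - abs(rel_a - rel_b)
    # Damp the likely off-topic tails of long threads.
    if (len_a > 20 and rel_a > 0.8) or (len_b > 20 and rel_b > 0.8):
        prior *= 0.5
    return prior
```

Such a prior would be multiplied into the content-based similarity, so positional evidence can lower but never fully replace the text comparison.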
Unfortunately, there are forums where multiple threads are mingled and questions occur later in a thread's message stream. Such questions can be detected quite well by existing systems, however, and there is research on disentangling such mingled threads.
This list of feature types might not be complete; it is created from observations of existing forums and partly compiled from related work adapted to the forum domain. In future entries I will show results of detecting similarity relations using some of these features alone and in combination.
Andrei Z. Broder, 2000, 'Identifying and Filtering Near-Duplicate Documents', Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, pp. 1--10.
Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne, 2008, 'Finding High-Quality Content in Social Media', International Conference on Web Search and Web Data Mining, pp. 183--194.