Friday, September 27, 2013

Feature Toggles in JSP with Togglz

Feature toggles are a useful pattern when you are working on several features but want to keep your application in a deployable state. One of the implementations of the pattern available for Java is Togglz. It provides ways to check whether a feature is enabled programmatically, from JSF or JSP pages, or even when wiring Spring beans. I couldn't find a single example of how to use the JSP support, so I created an example project and pushed it to GitHub. In this post I will show you the basics of Togglz and how to use it in JavaServer Pages.

Togglz

Features that you want to make configurable are described with a Java enum that implements the Togglz Feature interface. This is an example with two features that can be enabled or disabled:

import org.togglz.core.Feature;
import org.togglz.core.context.FeatureContext;

public enum ToggledFeature implements Feature {

    TEXT,
    MORE_TEXT;

    public boolean isActive() {
        return FeatureContext.getFeatureManager().isActive(this);
    }
}

This Enum can then be used to check if a feature is enabled in any part of your code:

if (ToggledFeature.TEXT.isActive()) {
    // do something clever
}

A configuration class wires the feature enum to the mechanism that stores the feature state and to a user provider:

import java.io.File;

import org.togglz.core.Feature;
import org.togglz.core.manager.TogglzConfig;
import org.togglz.core.repository.StateRepository;
import org.togglz.core.repository.file.FileBasedStateRepository;
import org.togglz.core.user.UserProvider;
import org.togglz.servlet.user.ServletUserProvider;

public class ToggledFeatureConfiguration implements TogglzConfig {

    public Class<? extends Feature> getFeatureClass() {
        return ToggledFeature.class;
    }

    public StateRepository getStateRepository() {
        return new FileBasedStateRepository(new File("/tmp/features.properties"));
    }

    public UserProvider getUserProvider() {
        return new ServletUserProvider("ADMIN_ROLE");
    }
}

The StateRepository stores the state of your features and is used for enabling and disabling them. We are using a file based one here, but other implementations are available.
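If you don't want the feature state on the file system, you can return a different implementation from getStateRepository(). Here is a minimal sketch using Togglz's in-memory repository; the package name is an assumption based on the version I used, so check it against your togglz-core:

import org.togglz.core.repository.StateRepository;
// assumed package of the in-memory repository shipped with togglz-core
import org.togglz.core.repository.mem.InMemoryStateRepository;

public class InMemoryFeatureConfiguration extends ToggledFeatureConfiguration {

    @Override
    public StateRepository getStateRepository() {
        // feature state only lives inside the JVM, handy for tests or local development
        return new InMemoryStateRepository();
    }
}

You would then point the context-param shown below at this class instead of ToggledFeatureConfiguration.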

To register Togglz in your webapp you can either use the CDI or Spring integration or configure it manually in the web.xml:

<web-app xmlns="http://java.sun.com/xml/ns/javaee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
    version="3.0">

    <context-param>
        <param-name>org.togglz.core.manager.TogglzConfig</param-name>
        <param-value>de.fhopf.togglz.ToggledFeatureConfiguration</param-value>
    </context-param>

    <filter>
        <filter-name>TogglzFilter</filter-name>
        <filter-class>org.togglz.servlet.TogglzFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>TogglzFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

</web-app>

In my example I had to add the filter manually, though with Servlet 3.0 this shouldn't be necessary. I am not sure whether this is caused by the way Gradle runs Jetty or whether it is always the case when doing the configuration via a context-param.

Togglz with Java Server Pages

For the integration of Togglz in JSPs you need to add the togglz-jsp dependency to your project. It contains a tag that can be used to wrap content which is then only rendered when the feature is enabled. A simple example for our ToggledFeature:

<%@ taglib uri="http://togglz.org/taglib" prefix="togglz" %>

This is some text that is always shown.

<togglz:feature name="TEXT">
This is the text of the TEXT feature.
</togglz:feature>

<togglz:feature name="MORE_TEXT">
This is the text of the MORE_TEXT feature.
</togglz:feature>

Both features will be disabled by default so you will only see the first sentence. You can control which features are enabled (even at runtime) in /tmp/features.properties. This is what it looks like when the TEXT feature is enabled:

TEXT=true
MORE_TEXT=false

A Word of Caution

I am only starting to use feature toggles in an application, so I wouldn't call myself experienced. But I have the impression that you need to be really disciplined when using them: old feature toggles that are no longer used should be removed as soon as possible. Unfortunately the compile-time safety that Java gives you when removing a feature from the enum is lost in JSPs; there the feature names are only Strings, so you will have to do some file searches when removing a feature.
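One way to win back at least part of that safety is a test that scans the JSPs for feature names and checks them against the enum. The following is only a rough sketch; the class, the regular expression and the src/main/webapp path are assumptions for this example, and JUnit needs to be on the test classpath:

import static org.junit.Assert.assertTrue;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

public class FeatureNamesInJspsTest {

    // matches name="..." inside <togglz:feature ...> tags
    private static final Pattern FEATURE_TAG =
            Pattern.compile("<togglz:feature\\s+name=\"([^\"]+)\"");

    @Test
    public void allFeatureNamesUsedInJspsExistInTheEnum() throws Exception {
        checkDirectory(new File("src/main/webapp"));
    }

    private void checkDirectory(File dir) throws Exception {
        File[] files = dir.listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                checkDirectory(file);
            } else if (file.getName().endsWith(".jsp")) {
                String content = new String(
                        Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
                Matcher matcher = FEATURE_TAG.matcher(content);
                while (matcher.find()) {
                    assertTrue("Unknown feature " + matcher.group(1) + " in " + file,
                            isKnownFeature(matcher.group(1)));
                }
            }
        }
    }

    private boolean isKnownFeature(String name) {
        try {
            ToggledFeature.valueOf(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}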

Friday, September 20, 2013

Kibana and Elasticsearch: See What Tweets Can Say About a Conference

In my last post I showed how you can index tweets for an event in Elasticsearch and how to do some simple queries on it using its HTTP API. This week I will show how you can use Kibana 3 to visualize the data and make it explorable without having to learn the Elasticsearch API.

Installing Kibana

Kibana 3 is a pure HTML/JS frontend for Elasticsearch that you can use to build dashboards for your data. We are still working with the example data that is indexed using the Twitter River. It consists of tweets for FrOSCon but could be anything, especially data that contains some kind of timestamp, as is the case for tweets. To install Kibana you can just fetch it from the GitHub repository (note: there are now also prepackaged archives available that you can download without cloning the repository):

git clone https://github.com/elasticsearch/kibana.git

You will now have a folder kibana that contains the HTML files as well as all the assets needed. The files need to be served by a webserver, so you can just copy the folder to a directory that, for example, Apache is serving. If you don't have a webserver installed you can simply serve the current directory using Python (with Python 3 the module is called http.server):

python -m SimpleHTTPServer 8080

This will make Kibana available at http://localhost:8080/kibana/src. With the default configuration Elasticsearch needs to be running on the same machine as well; the Elasticsearch URL can be adjusted in Kibana's config.js.

Dashboards

A dashboard in Kibana consists of rows that can contain different panels. Each panel can either display data, control which data is being displayed or both. Panels do not stand on their own; the results that are getting displayed are the same for the whole dashboard. So if you choose something in one panel you will notice that the other panels on the page will also get updated with new values.

When accessing Kibana you are directed to a welcome page from where you can choose between several dashboard templates. As Kibana is often used for logfile analytics there is an existing dashboard that is preconfigured to work with Logstash data. Another generic dashboard can be used to query some data from the index but we'll use the option "Unconfigured Dashboard" which gives some hints on which panels you might want to have.

This will present you with a dashboard that contains some rows and panels already.

Starting from the top it contains these rows:

  • The "Options" row that contains one text panel
  • The "Query" row that contains a text query panel
  • A hidden "Filter" row that contains a text panel and the filter panel. The row can be toggled visible by clicking on the text Filter on the left.
  • The "Graph" row two text panels
  • The large "Table" row with one text panel.

Those panels are already laid out in a way that they can display the widgets that are described in the text. We will now add those to get some data from the event tweets.

Building a Dashboard

The text panels are only there to guide you when adding the widgets you need and can then be removed. To add or remove panels for a row you can click the little gear next to the title of the row. This will open an options menu. For the top row we are choosing a timepicker panel with a default mode of absolute. This gives you the opportunity to choose a start and an end date for your data. The field that contains the timestamp is called "created_at". After saving you can also remove the text panel on the second tab.

If you now open the "Filter" row you will see that a filter is displayed. It is best to keep this row open to see which filters are currently applied. You can remove the text panel in the row.

In the graph section we will add two graph panels in place of the text panels: a pie chart that displays the terms of the tweet texts and a date histogram that shows how many tweets there are at a certain time. For the pie chart we use the field "text" and again exclude some common terms. Note that if you add terms to the excluded terms after the panel has already been created, you need to initiate another query, e.g. by clicking the button in the timepicker. For the date histogram we again choose the timestamp field "created_at".

Finally, in the last row we are adding a table to display the resulting tweet documents. Besides adding the columns "text", "user.screen_name" and "created_at" we can leave the settings as proposed.

We now have a dashboard to play with the data and see the results immediately. Data can be explored using any of the displays: you can click in the pie chart to choose a certain term or select a time range in the date histogram. This makes it really easy to work with the data.

Answering Questions

Now we have a visual representation of all the terms and the time of day people are tweeting most. As you can see, people are tweeting slightly more during the beginning of the day.

You can now check for any relevant terms that you might be interested in. For example, let's see when people tweet about beer. As we have tweets in multiple languages (German, English and people from Cologne) we need to add some variation. We can enter the query

text:bier* OR text:beer* OR text:kölsch
in the query box.

There are only a few tweets about it, but it will come as a total surprise to you that most of the tweets about beer tend to be sent later in the day (I won't go into detail about why there are so many tweets mentioning the terms horse and piss when talking about Kölsch).

Some more surprising facts: there is not a single tweet mentioning Java but a lot of tweets mentioning PHP, especially during the first day. That day seems to have been far more successful for the PHP dev room.

Summary

I hope I could give you some hints on how powerful Kibana can be when it comes to analyzing data, not only log data. If you'd like to read another detailed step-by-step guide on using Kibana to visualize Twitter data, have a look at this article by Laurent Broudoux.

Wednesday, September 11, 2013

Simple Event Analytics with ElasticSearch and the Twitter River

Tweets can say a lot about an event. The hashtags that are used and the times when people tweet can be interesting to see. Some of the questions you might want answers to:

  • Who tweeted the most?
  • What are the dominant keywords/hashtags?
  • When is the time people are tweeting the most?
  • And, most importantly: Is there a correlation between the time and the amount of tweets mentioning coffee or beer?

During this year's FrOSCon I indexed all relevant tweets in ElasticSearch using the Twitter River. In this post I'll show you how you can index tweets in ElasticSearch to have a dataset you can do analytics with. We will see how we can get answers to the first two questions using the ElasticSearch Query DSL. Next week I will show how Kibana can help you to get a visual representation of the data.

Indexing Tweets in ElasticSearch

To run ElasticSearch you need to have a recent version of Java installed. Then you can just download the archive and unpack it. It contains a bin directory with the necessary scripts to start ElasticSearch:

bin/elasticsearch -f

The -f flag makes ElasticSearch start in the foreground, so you can stop it using Ctrl-C. You can check that your installation is working by opening http://localhost:9200 in your browser.

After stopping it again we need to install the ElasticSearch Twitter River that uses the Twitter streaming API to get all the tweets we are interested in.

bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0

Twitter doesn't allow anonymous access to its API anymore, so you need to register for OAuth access at https://dev.twitter.com/apps. Choose a name for your application and generate the key and token; those are needed to configure the plugin via the REST API. In the configuration you pass your OAuth information as well as any keyword you would like to track and the index that should be used to store the data.

curl -XPUT localhost:9200/_river/frosconriver/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "YOUR_KEY",
            "consumer_secret" : "YOUR_SECRET",
            "access_token" : "YOUR_TOKEN",
            "access_token_secret" : "YOUR_TOKEN_SECRET"
        },
        "filter" : {
            "tracks" : "froscon"
        }
    },
    "index" : {
        "index" : "froscon",
        "type" : "tweet",
        "bulk_size" : 1
    }
}
'

The index doesn't need to exist yet; it will be created automatically. I am using a bulk size of 1 as there aren't really many tweets. If you are indexing a lot of data you might consider setting this to a higher value.

After issuing the call you should see some information in the logs that the river is starting and receiving data. You can see how many tweets there are in your index by issuing a count query:

curl 'localhost:9200/froscon/_count?pretty=true'

You can see the basic structure of the documents created by looking at the mapping that is created automatically.

http://localhost:9200/froscon/_mapping?pretty=true

The result is quite long so I am not replicating it here, but it contains all the relevant information you might be interested in, like the user who tweeted, the location of the user, the text, the mentions and any links in it.

Doing Analytics Using the ElasticSearch REST API

Once you have enough tweets indexed you can already do some analytics using the ElasticSearch REST API and the Query DSL. This requires you to have some understanding of the query syntax but you should be able to get started by skimming through the documentation.
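All of the examples below use curl, but since this is plain HTTP you can issue the same requests from any client, including Java. The following minimal sketch uses only the JDK and runs the count request shown above against the froscon index:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TweetCount {

    public static void main(String[] args) throws Exception {
        // same request as: curl 'localhost:9200/froscon/_count?pretty=true'
        URL url = new URL("http://localhost:9200/froscon/_count?pretty=true");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("GET");

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // prints the JSON response containing the number of indexed tweets
                System.out.println(line);
            }
        } finally {
            reader.close();
            connection.disconnect();
        }
    }
}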

Top Tweeters

First, we'd like to see who tweeted the most. This can be done by querying for all documents and faceting on the user name. This will give us the names and counts in the facets section of the response.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "user" : { 
        "terms" : {
          "field" : "user.screen_name"
        } 
      }                            
    }
  }
'

Those are the top tweeters for FrOSCon:

Dominant Keywords

The dominant keywords can also be retrieved using a facet query, this time on the text of the tweets. As there are a lot of German tweets for FrOSCon and the text field is processed using the StandardAnalyzer, which only removes English stopwords, it might be necessary to exclude some terms. You might also want to remove some other common terms that indicate retweets or are parts of URLs.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "keywords" : { 
        "terms" : {
          "field" : "text", 
          "exclude" : ["froscon", "rt", "t.co", "http", "der", "auf", "ich", "my", "die", "und", "wir", "von"] 
        }
      }                            
    }
  }
'

Those are the dominant keywords for FrOSCon:

  • talk (no surprise for a conference)
  • slashme
  • teamix (a company that does very good marketing. Unfortunately in this case this is more because their fluffy tux got stolen. The tweet about it is the most retweeted tweet of the data.)

Summary

Using the Twitter River it is really easy to get some data into ElasticSearch. The Query DSL makes it easy to extract some useful information. Next week we will have a look at Kibana, which doesn't require a deep understanding of the ElasticSearch queries and can visualize our data.

Wednesday, September 4, 2013

Developing with CoreMedia

A while ago I had the chance to attend a training on web development with CoreMedia. It's a quite enterprisey commercial Content Management System that powers large corporate websites like telekom.com as well as news sites like Bild.de (well, you can't hold CoreMedia responsible for the kind of "content" people put into their system). As I have been working with different Java based Content Management Systems over the years, I was really looking forward to learning about a system I had heard really good things about. In this post I'll describe the basic structure of the system as well as what it feels like to develop with it.

System Architecture

As CoreMedia is built to scale to really large sites, the architecture is built around redundant and distributed components. The part of the system the editors are working on is separated from the parts that serve the content to the internet audience. A publication process copies the content from the editorial system to the live system.

The heart of CoreMedia is the Content Server. It stores all the content in a database and makes it retrievable. You rarely access it directly but only via other applications that talk to it in the background via CORBA. Editors traditionally work with CoreMedia using a Java client (formerly called the Editor, now known as the Site Manager); starting with CoreMedia 7 there is also the web based Studio that is used to create and edit content. A preview application can be used to see how the site looks before it is published. Workflows, which are managed by the Workflow Server, can be used to control the processes around editing as well as publication.

The live system consists of several components that are mostly laid out in a redundant way. There is one Master Live Server as well as 0 to n Replication Live Servers that are used for distributing the load as well as for fault tolerance. These servers are accessed from the Content Application Engine (CAE), which contains all the delivery and additional logic for your website. One or more Solr instances provide the search services for your application.

Document Model

The document model for your application describes the content types that are available in the system. CoreMedia provides a blueprint application that contains a generic document model that can be used as a basis for your application, but you are also free to build something completely different. The document model is used throughout the whole system as it describes the way your content is stored. The model is object oriented in nature, with documents that consist of attributes. There are six attribute types, for example String (fixed length strings), XML (variable length strings) and Blob (binary data), that form the basis of all your types. An XML configuration file is used to describe your specific document model. This is an example of an article that contains a title, the text and a list of related articles:

<DocType Name="Article">
  <StringProperty Name="title"/>
  <XmlProperty Grammar="coremedia-richtext-1.0" Name="text"/>
  <LinkListProperty LinkType="Article" Name="related"/>
</DocType>

Content Application Engine

Most of the code you will be writing is the delivery code that is part of the Content Application Engine, either for preview or for the live site. This is a standard Java webapp that is assembled from different Maven based modules. CAE code is heavily based on Spring MVC, with the CoreMedia specific View Dispatcher taking care of rendering the different documents. The document model is made available through so called Contentbeans that can be generated from the document model. Contentbeans access the content on demand and can contain additional business logic, so they are not plain POJOs but rather active objects, similar to Active Record entities in the Rails world.

Our example above would translate to a Contentbean with getters for the title (a java.lang.String), the text (a com.coremedia.xml.Markup) and a getter for a java.util.List that is typed to de.fhopf.Article.
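As an illustration, such a Contentbean could look roughly like the interface below. This is only a sketch: the code that CoreMedia generates extends its own base interfaces and depends on your generator settings, so take it as a picture of the getters rather than real generated code.

import java.util.List;

import com.coremedia.xml.Markup;

// hand-written sketch of the Contentbean for the Article document type above;
// the generated interface additionally extends CoreMedia base interfaces
public interface Article {

    String getTitle();          // StringProperty "title"

    Markup getText();           // XmlProperty "text" using coremedia-richtext-1.0

    List<Article> getRelated(); // LinkListProperty "related"
}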

Rendering of the Contentbeans happens in JSPs that are named after classes or interfaces, with a specific logic to determine which JSP should be used. An object Article that resides in the package de.fhopf would be rendered by the JSP at the path de/fhopf/Article.jsp; if you want to add a special rendering mechanism for List, this would live in java/util/List.jsp. Different renderings of an object can be selected by using a view name: an Article that is rendered as a link would then be found in de/fhopf/Article.link.jsp.

This is done using one of the custom Spring components of CoreMedia, the View Dispatcher, a view resolver that determines the correct view to be invoked for a certain model based on the content element in the model. The JSP that is used can then contain further includes of other elements of the content, be they documents in the CoreMedia sense or one of the available attributes. Those includes are again routed through the View Dispatcher.

Let's see an example for rendering the list of related articles for an article. Say you call the CAE with a certain content id that belongs to an Article. The standard mechanism routes this request to the Article.jsp described above. It might contain the following fragment to include the related articles:

<cm:include self="${self.related}"/>

Note that we do not tell it which JSP to include. CoreMedia automatically figures out that we are including a List, for example a java.util.ArrayList. As there is no JSP available at java/util/ArrayList.jsp, CoreMedia will automatically look for any interfaces that are implemented by that class; in this case it will find java/util/List.jsp. This could then contain the following fragment:

<ul>
<c:forEach items="${self}" var="item">
  <li><cm:include self="${item}" view="link"/></li>
</c:forEach>
</ul>

As the List in our case contains Article implementations, this will then hit the Article.link.jsp that finally renders the link. This is a very flexible approach with a high degree of reusability for the fragments. The List.jsp we are seeing above has no connection to the Article; you can use it for any objects that should be rendered as a list, and the View Dispatcher of CoreMedia takes care of which JSP to include for a certain type.

To minimize the load on the Content Server you can also add caching via configuration settings. Data Views, which are a layer on top of the Contentbeans, are then held in memory and contain prefilled beans that don't need to access the Content Server anymore. This object cache approach is different from the HTML fragment caching a lot of other systems do.

Summary

Though this is only a very short introduction, you should have seen that CoreMedia really is a nice system to work with. The distributed nature not only makes it scalable but also has implications when developing for it: when you are working on the CAE you are only changing code in this component. You can start the more heavyweight Content Server once and afterwards work with the lightweight CAE, which can be run using the Maven Jetty plugin. Restarts don't take a long time, so you have short turnaround times. The JSPs are very cleanly structured and don't need to include scriptlets (I heard that this has been different in earlier versions). As most of the application is built around Spring MVC you can use a lot of knowledge that is already around.
