Donnerstag, 30. Januar 2014

Proxying Solr

The dominant deployment model for Solr is running it as a standalone webapp. You can also use it in embedded mode from Java, but then you miss some of the goodies like a separate JVM (your GC will thank you for it) and you are of course tied to Java.

Most of the time Solr is treated similarly to a database: only custom webapp code can talk to it and it is not exposed to the net. In your webapp you then use one of the client libraries to access Solr and build your queries.

With the rise of JavaScript on the client side, people sometimes get the idea to expose Solr to the web directly. There is no custom webapp layer in between, only the browser talking to Solr.

In that case a proxy needs to sit in front of the Solr server that only allows certain requests. You won't allow any requests that could modify your index or do other harm. This can be done, but you need to be aware of some caveats.
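For illustration, here is a minimal nginx sketch of such a proxy. The port, core name and paths are assumptions; adapt them to your setup:

server {
    listen 80;

    # only forward read-only queries to the select handler
    location /solr/collection1/select {
        limit_except GET {
            deny all;
        }
        proxy_pass http://127.0.0.1:8983;
    }

    # block everything else: update, admin, replication, ...
    location / {
        deny all;
    }
}

Note that a whitelist like this still doesn't protect you from expensive queries sent to the select handler.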

Most of the time putting Solr directly on the web is not an option, but you can do it if you are willing to take some risk. I think that especially the possibility of DOS attacks shouldn't be taken lightly. The more flexibility you want on the query side, the more care needs to be taken to secure the system. If you'd like to do it anyway, see this post on how to use nginx as a proxy to Solr and this list of specialized proxies for Solr. For general instructions on securing your Solr server see the project wiki.

Donnerstag, 23. Januar 2014

Analyze your Maven Project Dependencies with dependency:analyze

When working on a larger Maven project it can happen that you lose track of the dependencies in your project. Over time you add new dependencies, remove code or move code to modules, so some of the dependencies become obsolete. Though I have done lots of Maven projects I have to admit I didn't know until recently that the dependency plugin contains a useful goal for solving this problem: dependency:analyze.

The dependency:analyze mojo can find dependencies that are declared for your project but are not necessary. Additionally it can find dependencies that are used but undeclared, which happens when you directly use transitive dependencies in your code.

Analyzing Dependencies

I am showing an example with the Odftoolkit project. It contains quite a few dependencies and is old enough that some of them are outdated. ODFDOM is the most important module of the project, providing low level access to the Open Document structure from Java code. Running mvn dependency:tree we can see its dependencies at the time of writing:

mvn dependency:tree
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ odfdom-java ---
[INFO] org.apache.odftoolkit:odfdom-java:jar:0.8.10-incubating-SNAPSHOT
[INFO] +- org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile
[INFO] |  \- com.sun:tools:jar:1.7.0:system
[INFO] +- xerces:xercesImpl:jar:2.9.1:compile
[INFO] |  \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] +- junit:junit:jar:4.8.1:test
[INFO] +- org.apache.jena:jena-arq:jar:2.9.4:compile
[INFO] |  +- org.apache.jena:jena-core:jar:2.7.4:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.5:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile
[INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.6.4:compile
[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.1.3:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.6.4:compile
[INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.6.4:compile
[INFO] |  \- log4j:log4j:jar:1.2.16:compile
[INFO] +- org.apache.jena:jena-core:jar:tests:2.7.4:test
[INFO] +- net.rootdev:java-rdfa:jar:0.4.2:compile
[INFO] |  \- org.apache.jena:jena-iri:jar:0.9.1:compile
[INFO] \- commons-validator:commons-validator:jar:1.4.0:compile
[INFO]    +- commons-beanutils:commons-beanutils:jar:1.8.3:compile
[INFO]    +- commons-digester:commons-digester:jar:1.8:compile
[INFO]    \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.877s
[INFO] Finished at: Mon Jan 20 00:41:05 CET 2014
[INFO] Final Memory: 13M/172M
[INFO] ------------------------------------------------------------------------

The project contains some direct dependencies that pull in a lot of transitive dependencies. When running mvn dependency:analyze on the project we see that our dependency declarations don't seem to be correct:

mvn dependency:analyze
[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building ODFDOM 0.8.10-incubating-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[...] 
[INFO] <<< maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java <<<
[INFO] 
[INFO] --- maven-dependency-plugin:2.1:analyze (default-cli) @ odfdom-java ---
[WARNING] Used undeclared dependencies found:
[WARNING]    org.apache.jena:jena-core:jar:2.7.4:compile
[WARNING]    xml-apis:xml-apis:jar:1.3.04:compile
[WARNING] Unused declared dependencies found:
[WARNING]    org.apache.odftoolkit:taglets:jar:0.8.10-incubating-SNAPSHOT:compile
[WARNING]    org.apache.jena:jena-arq:jar:2.9.4:compile
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.769s
[INFO] Finished at: Mon Jan 20 00:43:27 CET 2014
[INFO] Final Memory: 28M/295M
[INFO] ------------------------------------------------------------------------

The second part of the warnings is easier to understand: we have declared some dependencies that we never use, the taglets and jena-arq. When comparing this with the tree output above you will notice that the largest set of transitive dependencies was pulled in by the jena-arq dependency. And we don't even need it.

The first part seems more difficult: there are two dependencies that are used but undeclared. What does that mean? Shouldn't compilation fail if there are any undeclared dependencies? No, it just means that we are directly using a transitive dependency in our code, which we should rather declare ourselves.
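To fix this you declare the dependency in your pom directly, for example for jena-core (the version is taken from the tree above):

<dependency>
    <groupId>org.apache.jena</groupId>
    <artifactId>jena-core</artifactId>
    <version>2.7.4</version>
</dependency>

This also becomes necessary anyway once you remove the unused jena-arq dependency that currently pulls it in.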

Breaking the Build on Dependency Problems

If you want to find problems with your dependencies as early as possible it's best to integrate the check into your build. The dependency:analyze goal we have seen above is meant to be used standalone; for automatic execution there is the analyze-only mojo. It binds to the verify phase by default and can be declared like this:

<plugin>
    <artifactId>maven-dependency-plugin</artifactId>
    <version>2.8</version>
    <executions>
        <execution>
            <id>analyze</id>
            <goals>
                <goal>analyze-only</goal>
            </goals>
            <configuration>
                <failOnWarning>true</failOnWarning>
                <outputXML>true</outputXML>
            </configuration>
        </execution>
    </executions>
</plugin>

Now the build will fail if any problems are found. Conveniently, if an undeclared dependency has been found, it will also output the XML that you can then paste into your pom file.

A final word of caution: the default analyzer works on the bytecode level, so in special cases it might not detect a dependency correctly, e.g. when you are only using constants from a dependency, because the compiler inlines those.
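If the analyzer flags such a dependency you can exclude it from the check instead of removing it. Newer versions of the plugin support an ignore list for this; a sketch, with placeholder coordinates:

<configuration>
    <failOnWarning>true</failOnWarning>
    <!-- placeholder: a dependency that is only used via inlined constants -->
    <ignoredUnusedDeclaredDependencies>
        <ignoredUnusedDeclaredDependency>com.example:constants-only</ignoredUnusedDeclaredDependency>
    </ignoredUnusedDeclaredDependencies>
</configuration>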

Freitag, 17. Januar 2014

Geo-Spatial Features in Solr 4.2

Last week I showed how you can use the classic spatial support in Solr. It uses the LatLonType to index locations that can then be used to query, filter or sort by distance. Starting with Solr 4.2 there is a new module available. It uses the Lucene spatial module, which is more powerful but also needs to be used differently. You can still use the old approach, but in this post I will show you how to use the new features to do the same operations we saw last week.

Indexing Locations

Again we are indexing talks that contain a title and a location. For the new spatial support you need to add a different field type:

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
    distErrPct="0.025"
    maxDistErr="0.000009"
    units="degrees"/>

Contrary to the LatLonType, the SpatialRecursivePrefixTreeFieldType is not a subfield type but stores the data structure itself. The attribute maxDistErr determines the accuracy of the location; in this case it is 0.000009 degrees, which is close to one meter (a degree of latitude spans roughly 111 kilometers) and should be enough for most location searches.

To use the type in our documents of course we also need to add it as a field:

<field name="location" type="location_rpt" indexed="true" stored="true"/>

Now we are indexing some documents with three fields: the path (which is our id), the title of the talk and the location.

curl http://localhost:8082/solr/update/json?commit=true -H 'Content-type:application/json' -d '
[
 {"path" : "1", "title" : "Search Evolution", "location" : "49.487036,8.458001"},
 {"path" : "2", "title" : "Suchen und Finden mit Lucene und Solr", "location" : "49.013787,8.419936"}
]'

Again, the location of the first document is Mannheim, the second Karlsruhe. When looking at the schema browser in the administration backend you can see that the locations are encoded in an ngram-like fashion.

Sorting by Distance

A common use case is to sort the results by distance from a certain location. You can't use the Solr 3 syntax anymore but need to use the geofilt query parser, which maps the distance to the score that you then sort on.

http://localhost:8082/solr/select?q={!geofilt%20score=distance%20sfield=location%20pt=49.487036,8.458001%20d=100}&sort=score asc

As the name implies, the geofilt query parser is originally meant for filtering. You need to pass in a distance that is used for filtering, so when sorting this way you may also affect which results are returned at all. For our example, passing in a distance of 10 kilometers will only yield one result. This is something to be aware of.
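For example, this variation of the query above (a sketch; same parameters, only d changed to 10) would only return the Mannheim talk:

http://localhost:8082/solr/select?q={!geofilt%20score=distance%20sfield=location%20pt=49.487036,8.458001%20d=10}&sort=score asc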

Filtering by Distance

We can use the same approach we saw above to filter our results to only match talks in a given area. We can either use the geofilt query parser (which filters by radius) or the bbox query parser (which filters on the bounding box around that radius). As you can imagine, the query looks similar:

http://localhost:8082/solr/select?q=*:*&fq={!geofilt%20score=distance%20sfield=location%20pt=49.013787,8.419936%20d=10}

This will return all talks in a distance of 10 kilometers from Karlsruhe.
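The bbox variant looks nearly the same (a sketch; it takes the same parameters but matches the enclosing box instead of the circle):

http://localhost:8082/solr/select?q=*:*&fq={!bbox%20sfield=location%20pt=49.013787,8.419936%20d=10}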

Doing Fancy Stuff

Besides the features we have looked at in this post you can also do more advanced stuff. Solr 3 spatial can't have multivalued location fields, which is possible with Solr 4.2. You can now also index lines or polygons that can then be queried and intersected. In this presentation Chris Hostetter uses this feature to determine overlapping time ranges, an interesting use case that you might not think of at first.
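As a small illustration, a multivalued location field only needs the multiValued attribute on the field declaration (a sketch reusing the field type from above; the field name is made up):

<field name="locations" type="location_rpt" indexed="true" stored="true" multiValued="true"/>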

Donnerstag, 9. Januar 2014

Geo-Spatial Features in Solr 3

Solr is mainly known for its full text search capabilities. You index text and can search it in lowercase or stemmed form, depending on your analyzer chain. But besides text Solr can do more: you can use RangeQueries to query numeric fields ("Find all products with a price lower than 2€"), do date arithmetic ("Find all news entries from last week") or do geospatial queries, which we will look at in this post. What I am describing here is the old spatial search support; next week I will show you how to do the same things using recent versions of Solr.

Indexing Locations

Suppose we are indexing talks in Solr that contain a title and a location. We need to add the field type for locations to our schema:

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

LatLonType is a subfield type, which means that it not only creates one field but additional fields as well, one for latitude and one for longitude. The subFieldSuffix attribute determines the names of those fields, which follow the pattern <fieldname>_<i><subFieldSuffix>. If the name of our field is location and we are indexing a latitude/longitude pair, this leads to three fields: location, location_0_coordinate and location_1_coordinate.

To use the type in our schema we need to add one field and one dynamic field definition for the sub fields:

<field name="location" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>

The dynamic field is of type tdouble, so we need to make sure that this type is also available in our schema. The indexed attribute on location is special in this case: it determines whether the subfields for the coordinates are created at all.

Let's index some documents. We are adding three fields, the path (which is our id), the title of the talk and the location.

curl http://localhost:8983/solr/update/json?commit=true -H 'Content-type:application/json' -d '
[
 {"path" : "1", "title" : "Search Evolution", "location" : "49.487036,8.458001"},
 {"path" : "2", "title" : "Suchen und Finden mit Lucene und Solr", "location" : "49.013787,8.419936"}
]'

The location of the first document is Mannheim, the second Karlsruhe. We can see that our documents are indexed and that the location is stored by querying all documents:

curl "http://localhost:8983/solr/select?q=*%3A*&wt=json&indent=true"

Looking at the schema browser we can also see that the two subfields have been created. Each contains the terms for the Trie field.

Sorting by Distance

One use case you might have when indexing locations is to sort the results by distance from a certain location. This can for example be useful for classifieds or rentals to show the nearest results first.

Sorting can be done via the geodist() function. We need to pass in the location that is used as a basis via the pt parameter and the location field to use in the function via the sfield parameter. We can see this in action by sorting twice, once for a location in Durlach near Karlsruhe and once for Heidelberg, which is near Mannheim:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&sfield=location&pt=49.003421,8.483133&sort=geodist%28%29%20asc"
curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&sfield=location&pt=49.399119,8.672479&sort=geodist%28%29%20asc"

Both return the results in the correct order. You can also use the geodist() function to boost results that are closer to your location. See the Solr wiki for details.
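A boost by distance could look like this (a sketch following the recip pattern from the Solr wiki; the recip parameters control how quickly the boost decays, and -g keeps curl from globbing the braces and commas):

curl -g "http://localhost:8983/solr/select?wt=json&indent=true&q={!boost%20b=recip(geodist(),2,200,20)}*:*&sfield=location&pt=49.003421,8.483133"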

Filtering by Distance

Another common use case is to filter the search results to only show results from a certain area, e.g. in a distance of 10 kilometers. This can either be done automatically or via facets.

Filtering is done using the geofilt query parser we already saw the name of above. It accepts the same parameters we have seen before, but for filtering you add it as a filter query. The distance can be passed using the parameter d; the unit defaults to kilometers. Suppose you are in Durlach and only want to see talks that are within a distance of 10 kilometers:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&fq={!geofilt}&pt=49.003421,8.483133&sfield=location&d=10"

This only returns the result in Karlsruhe. Once we decide that we want to see results within a distance of 100 kilometers we again see both results:

curl "http://localhost:8983/solr/select?wt=json&indent=true&q=*:*&fq={!geofilt}&pt=49.003421,8.483133&sfield=location&d=100"

Pretty useful! If you are interested, there is more on the Solr wiki. Next week I will show you how to do the same using the new spatial support in Solr versions starting from 4.2.

Mittwoch, 1. Januar 2014

20 Months of Freelancing

It's now been 20 months that I have been working as a freelancer on my own. With the turn of the year it's time to look back on what happened, and I would like to take the chance to write about what I did, what works and what I would like to do in the future.

Why Freelancing?

During my time at university I started working for a small consulting company that specialized in open source software. I was the first employee and started working part time; I even paused my studies for half a year to work full time with them. The company grew, and in 2006, after finally getting my degree, I joined them full time. I always enjoyed the work and dedicated a lot of my energy and time to it. In 2012, when the company had grown to around 30 employees, I noticed that I needed something else. I had already switched to a 4 day work week in 2011 to have more time for myself, to learn and experiment. Though the company is still great to work with, it just didn't fit me anymore.

After a long time at a company that had partly been home and family it is difficult to just switch to another company. I also wanted to have more control over what kind of projects I am doing, and I wanted some time on my own to write blog posts and give talks at user groups and conferences. I had always spent time on customer projects and often liked it, so it was an obvious decision to go with freelancing.

The Start

Before I quit I decided that I wanted to do more with search technologies. I had worked a lot on content management systems, and search is always a crucial part. Having done several larger projects with Lucene and Solr, and even the first Solr integration in OpenCms, I knew that I had the necessary experience and that I liked it.

I had minimal savings when I quit and no customers so far. Other freelancers are often surprised when I tell this and advise to only quit when you already know who you will be working for next. I guess this was some kind of hubris; I was really determined that I wanted to do freelancing and knew that there were companies who needed my help.

I started freelancing in May and had already arranged to give a talk at our local Java User Group on Lucene and Solr in July. I wanted to have the full month of May for talk preparation and bootstrapping the business, all the things like getting a website, finding an accountant and so on. Unfortunately I didn't find a project until the beginning of July, with a lot of my savings already spent on living costs and necessary items for the business. Be aware that it can take you up to two months from the beginning of a project until you see the first money.

Marketing

The good thing about freelancing: I can call all the activities I like to do marketing and tell myself that those are necessary. The bad thing about it: I don't spend enough time on paid projects.

I spend lots of my unpaid time on learning: blog posts, books and conferences. I got into a quite frequent rhythm with weekly posts, spoke at several user groups and conferences and joined an open source project, the OdfToolkit. A lot of freelancers don't do any of those things and dedicate all their time to customer projects, but those activities are part of the reason I went with freelancing.

The Projects

When talking about freelancing you probably think about sitting in a coffee shop, doing several projects in parallel. For me this is different: lots of Java projects are rather long term and require you to work in a team, which is best done on premise. Though I like the idea of doing more diverse projects, I am also happy to have some stability. Having long term clients prevents some of the context switching involved with multiple projects, and you have to spend less time on sales.

My first project involved working on an online shop for a large retailer built on Hybris, a commercial e-commerce engine. I did a lot of Solr work, and though it was rather stressful, working on product search was really interesting. Also, the people are nice.

Though I started with the intention of doing more search projects, I am currently involved in a large CMS project for a retailer, (re)building parts of their online presence. Search only plays a minor part in it, but I like working with the people, it's a great work atmosphere, and some of the problems they face are really interesting. Before taking the project I had to think hard about whether I wanted to sign this long term contract, but I am glad I did. Fortunately I still have time to do some short term consulting on the side (mostly single days, mostly Solr).

Where Do I Get The Projects From?

When starting out I thought it would be a lot easier to get projects, but customers are not exactly lining up magically to get my services. I try to avoid working with freelance agents, though a lot of Java projects can only be obtained through them. Most of the project inquiries I get directly are from people who know me from organizing the local Java User Group. I didn't start helping the user group for the marketing, but I have to admit, it really paid off.

Besides that I am still working for customers of my old employer. They contact me with interesting projects and though of course they are taking their share I still earn enough for myself.

Most of the inquiries I get from agencies, mostly through my XING profile, are for Hybris and CoreMedia, two of the commercial systems I have worked with. I enjoy working with CoreMedia and could imagine doing projects with Hybris again, but I would be far happier if agencies contacted me for Lucene, Solr or Elasticsearch.

There have been some inquiries from people who found me through my blog, but never something that was really doable (mostly overseas). Speaking at user groups and conferences has led to some contacts but never to a real project so far. So you could say that the marketing activities I spend most of my time on didn't pay off. But getting projects directly is not the only benefit of these activities; they are also important for my learning and growth.

The Future

Freelancing has been exactly the right choice for me. I managed to find projects where I can do my 4 day work week, leaving enough time for blogging, preparing talks and learning. I managed to do weekly blog posts for quite a few months during the year but cut back a bit when it became overwhelming. Starting with the new year I hope I can get back to more frequent posts. Also, I'll be submitting talks to conferences again and hope that I can find more time to work on the OdfToolkit.

I'll be staying with my current client for as long as they need me but I am determined to only do search centric projects afterwards. Also I am planning to do a bit of work in other countries in Europe with a special twist. Watch this blog for the announcement.

When starting with freelancing you have a lot of questions and even simple things can take some time to find out on your own. I will compile a list of resources that helped me and publish those on my blog soon. If you are just starting with freelancing you are of course also welcome to contact me anytime.
