S3 Contest - JGD Evaluation Track
Special track at the 2009 edition of the S3 Contest on Semantic Service Selection based on the Jena Geography Dataset (JGD)
This evaluation track targets the use case of a human developer that is searching for a web service that provides a functionality needed in some application being developed. Currently, a developer will query and browse a registry (like seekda, programmableweb, or xmethods) to identify promising candidate services. Semantic descriptions are expected to make such manual discovery more efficient by improving the filtering and ranking of the services in the registries. It is the aim of this evaluation to test this hypothesis by comparing the performance of different semantic and non-semantic service retrieval approaches.
The main questions that should be investigated are:
- What is the right level of detail to describe services for the given task of retrieval from a registry?
- What is the trade-off between description effort and retrieval precision?
- What is the best pattern to describe services?
- What is the most suitable formalism to do so?
- Which retrieval techniques are good for which retrieval problems? What are the properties that make a specific retrieval problem difficult for some or all techniques? What features of services make their correct and precise retrieval difficult for certain or all approaches?
- How much information needs to be shared between providers of the service descriptions and developers posing service queries to allow for efficient retrieval?
- How much better is semantic retrieval compared to traditional technologies (structural WSDL matchmaking, natural language processing, keyword-based search, ...)? What extra cost is involved in developing the service descriptions?
Detailed Description
Task
The task to perform in this scenario is to rank a list of web services with respect to their relevance to a user query. This task resembles classic IR as evaluated, for instance, by TREC.
A basic example of a query could be a request for a service that, given a US zip code as input, provides current weather information and a weather forecast for the location identified by that zip code.
Given such a query as input, participating frameworks have to return a ranking of a predefined set of services. They will be evaluated with respect to the quality of that ranking, i.e. whether services that are most similar to the envisioned and desired one obtain top ranks in the returned ranking.
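To make the task concrete, the following is a minimal sketch of a naive keyword-overlap ranker, the kind of non-semantic baseline that semantic approaches are compared against in this track. The service names and texts are made up for illustration and are not JGD entries. The scoring function (Jaccard overlap of lower-cased tokens) is an assumption, not a method prescribed by the evaluation.

```python
import re

def tokens(text):
    """Lower-case a text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_services(query, services):
    """Return all service names, ordered by decreasing Jaccard overlap with the query.

    `services` maps a (hypothetical) service name to its natural language description.
    Ties are broken alphabetically so that the ranking is deterministic.
    """
    q = tokens(query)
    scored = []
    for name, description in services.items():
        d = tokens(description)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        scored.append((-score, name))
    return [name for _, name in sorted(scored)]

# Hypothetical toy registry, not actual JGD entries.
services = {
    "ZipWeather": "Returns current weather and a weather forecast for a US zip code",
    "CityDistance": "Computes the distance between two cities",
    "ZipToCity": "Looks up the city for a US zip code",
}
query = "current weather information and forecast for a US zip code"
ranking = rank_services(query, services)
```

Note that the function returns every service in the predefined set, not just the matching ones, since the task asks for a ranking of the whole set.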
Dataset
Data: The Jena Geography Dataset (JGD) used for this scenario consists of about 200 services that have been gathered from seekda.com, xmethods.com, webservicelist.com, programmableweb.com, and geonames.org. They are available at the OPOSSum Portal. All services are from the domain of geography and geocoding. For the planned evaluation it is desirable that each query has a large number of similar (but not identical) services that match it to different degrees. Queries with very few matching services make the evaluation statistically unreliable and unstable. However, we did not want to create fictitious services or fictitious variations of existing real services. Thus, the geocoding/geography domain was chosen because it seems to be the domain with the largest number of publicly available services. In the future the dataset should be expanded to cover other domains, too.
In order to lower the entry barrier to the evaluation, the dataset has been subdivided into inclusive smaller datasets of 200 (full dataset), 150, 100, and 50 services. Participants may choose the dataset they want to employ, i.e. they are encouraged to use the full dataset, but may start with a smaller subset if using the full dataset exceeds the resources that can be devoted to this activity.
Please also note that the dataset is already available in a structured format. This makes it possible to semi-automatically generate templates for semantic descriptions, which may significantly reduce the effort involved in creating them. If you are interested, please do not hesitate to contact Ulrich Küster.
Descriptions: SWS retrieval differs in an important aspect from traditional IR. The latter may use background knowledge (e.g. WordNet) to improve the retrieval, but apart from this general background knowledge, a retrieval algorithm operates directly on the resources to retrieve. In contrast, in the context of SWS retrieval the service itself, maybe represented by a WSDL or the natural language documentation found on the website of a Web API, is not used as the base of retrieval. Instead, a semantic description that is explicitly manually created for the purpose of supporting precise retrieval is used. This has a significant implication. A retrieval error (an irrelevant service is retrieved or a relevant service is omitted) can be attributed to either a deficient matchmaking algorithm, to deficient service descriptions (either due to a lack of expressivity of the employed formalism or to an inappropriate use of the formalism), or to a lack of alignment between the service description and the algorithm used to operate on them.
This evaluation therefore does not preset the semantic descriptions. The dataset of services is not provided as a collection of semantic service descriptions but instead with exactly the information that a human would have when retrieving services manually:
- Services are documented by a short English natural language text, taken from the website of the service provider.
- Inputs and outputs of the service are specified with more or less detailed English natural language documentation of their type and semantics, either taken from the WSDL of the service, from the documentation of the service on the website of its provider, or by trying the service and estimating the semantics of its IOs.
- Where available, the original WSDL of a service is provided. However, the dataset also contains a number of REST-based services which do not necessarily use XML and generally did not have a WSDL originally. For these services we have constructed WSDL descriptions so that WSDL descriptions are available for the full JGD (see the JGD website for further information).
Queries and Reference Relevance Judgments: Service queries are not released together with the dataset (see below). However, the dataset does contain a number of service queries, formulated as a natural language text (specifying the desired functionality) and an imaginary desired service (specifying the desired interface). The dataset also contains full reference relevance judgments obtained and confirmed by at least three human assessors. These judgments are available for different definitions of relevance (more details available in the Evaluation Measures Section).
Evaluation Procedure
Step 1: Annotation of the services
Participating groups will be asked to register the set of services in their system. This will typically involve creating semantic descriptions for those services. However, the evaluation dictates no requirements with respect to these descriptions. A service may be described by a full logic formalization of its semantics, but also simply by a set of keywords. The level of detail to be included in the service descriptions is entirely up to each participant. In the following, the term description will refer to whatever a participant uses to annotate a service to support its retrieval, whether it be a semantic description or a simple tag. Note that, to emulate real-world environments, participants will not know the queries when providing the service descriptions.
While creating the service descriptions, participants are asked to document the information that a requester would need in order to create meaningful service queries for the given service retrieval system. Such information might concern ontologies that have to be commonly used, description templates, fixed business category classifications, a predefined set of usable keywords, or no information at all. We refer to it as description documentation.
After Step 1, the service descriptions as well as the description documentation have to be provided to the evaluation organizers.
Step 2: Annotation of the requests
Once the services are described, the queries will be released and need to be described. To mirror real-world environments, this has to be done by people different from those who described the services. People encoding the service queries must have access only to the general information about a formalism and matchmaking approach and to the description instructions created by the people who encoded the services. They must not have access to the service descriptions themselves when creating the queries. After all, the queries are meant to be used to retrieve the services. Therefore, participating groups are expected to let people different from those who annotated the services annotate the requests, and to document whatever information is shared between the two groups.
After Step 2, the request descriptions as well as binaries of the matchmaking engine (see next step) have to be provided to the evaluation organizers.
Step 3: Retrieval evaluation
Retrieval evaluation will be performed via the SME2 Matchmaking Evaluation environment. Therefore, participants have to implement the Java interface from the S3 Contest Participation Package that is used to plug their retrieval engine into the SME2 evaluation environment. The interface provides methods to register services (created in Step 1) with the engine and to query the engine with service requests (created in Step 2) to retrieve the ranked list of matching services.
Step 4: Result analysis
Results will be provided to the participants in order to analyze and discuss them. Note that the main goal of this evaluation is not to declare a winning technology, but to learn about the tradeoffs of the technologies and the problem space in general. It is hoped that this will contribute to improving existing approaches, but also future evaluations and test collections.
We encourage participants to write papers that analyze the evaluation results (the process, the difficulties encountered, the lessons learned, strategies for technology improvement, etc.), which will likely be published as post-proceedings after the evaluation (details to be worked out).
Evaluation Measures
Correctness
The main evaluation goal is to determine the correctness achieved by the different retrieval systems. Correctness will be measured via the quality of the returned rankings using adapted versions of recall and precision based on cumulative gain. We will perform the evaluation with different (i.e. more or less strict) definitions of relevance to study the effect of the chosen relevance definition on the evaluation results.
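The adapted measures are not spelled out on this page. Purely as an illustration of the underlying idea, the sketch below computes cumulative gain for a ranking with graded relevance, normalizes it by the cumulative gain of an ideal ranking, and derives a stricter binary relevance definition by thresholding. The gain values, the service names, the threshold, and the normalization are assumptions for illustration; this is not claimed to be the measure used in the contest.

```python
def cumulative_gain(ranking, gains, k):
    """Sum the graded relevance (gain) of the first k services in a ranking."""
    return sum(gains.get(service, 0.0) for service in ranking[:k])

def normalized_cumulative_gain(ranking, gains, k):
    """Cumulative gain at rank k divided by the best achievable value at rank k."""
    ideal_cg = sum(sorted(gains.values(), reverse=True)[:k])
    if ideal_cg == 0:
        return 0.0
    return cumulative_gain(ranking, gains, k) / ideal_cg

def binarize(gains, threshold):
    """Derive a binary relevance definition: relevant only if the gain reaches the threshold."""
    return {s: 1.0 if g >= threshold else 0.0 for s, g in gains.items()}

# Hypothetical graded judgments for four services and one returned ranking.
gains = {"A": 3.0, "B": 1.0, "C": 0.0, "D": 2.0}
ranking = ["A", "B", "D", "C"]

# Graded relevance: (3 + 1) / (3 + 2)
graded_ncg = normalized_cumulative_gain(ranking, gains, k=2)
# Stricter binary relevance (gain >= 2 counts as relevant): (1 + 0) / (1 + 1)
strict_ncg = normalized_cumulative_gain(ranking, binarize(gains, 2.0), k=2)
```

The same ranking scores differently under the two relevance definitions, which is the kind of effect the evaluation sets out to study.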
Description Complexity and Effort
We will attempt to analyze the complexity of the descriptions and to estimate the effort of creating them, in order to relate the achieved correctness to the invested effort.
Coupling
The amount and characteristics of the description instructions will be used to analyze the level of coupling among service providers (creating the offer descriptions) and service users (creating the request descriptions).
Runtime Performance
Although it is not the primary focus of this track, runtime performance will also be evaluated in terms of the average response time and the overall time to process all queries.
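A minimal sketch of how these two figures could be collected is shown below. The function `measure_runtime` and the stand-in engine are hypothetical and unrelated to the SME2 environment, which performs the actual measurement.

```python
import time

def measure_runtime(answer_query, queries):
    """Time each call of `answer_query`.

    Returns a pair (average response time, overall time), both in seconds.
    """
    durations = []
    for query in queries:
        start = time.perf_counter()
        answer_query(query)
        durations.append(time.perf_counter() - start)
    total = sum(durations)
    average = total / len(durations) if durations else 0.0
    return average, total

def dummy_engine(query):
    """Stand-in for a retrieval engine; it only sorts the query words."""
    return sorted(query.split())

average, total = measure_runtime(
    dummy_engine, ["weather by zip code", "distance between cities"]
)
```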
Evaluation Execution
This evaluation will be performed at the 3rd S3 Contest on Semantic Service Selection, hosted by the 3rd International Workshop SMR2 on Semantic Matchmaking and Resource Retrieval at the 8th International Semantic Web Conference 2009, Washington DC, USA.
Participation, Feedback, Contact
Participation requires annotating at least 50 service offers and the service requests for use in your retrieval system according to the procedure outlined above, and implementing the IMatchmakerPlugin interface that is part of the S3 Contest Participation Package.
The matchmaker and the annotated services and requests (completion of Step 2) will have to be available by October 6, 2009. If you can provide your matchmaker implementation earlier (even if it is preliminary), this will be greatly appreciated, as it leaves time to address potential problems with getting it running on our machines.
Attendance of the SMR2 workshop at ISWC 2009 which hosts the S3 Contest is encouraged, but not mandatory for participation in the evaluation.
In case you are interested, please contact Ulrich Küster to coordinate your participation, in particular the creation of the service annotations. Please also do not hesitate to contact him if you have any questions about or remarks on the evaluation in general.
Important Notes Regarding Service Annotations
- If you don't mind, it is appreciated if you record how much time it roughly takes you to annotate the services and create the necessary ontologies. (You may optionally base your ontology on the basic geography ontology available here.)
- Please note that we asked the reference judges to ignore any license information when judging the service's relevance (things like login, user name, license key etc.). You may want to delete such inputs from your descriptions if otherwise their existence might result in reduced retrieval performance with respect to the relevance judgments.
- Please keep in mind that whoever is going to annotate the service requests later should not have had access to the service descriptions.
Materials
Further information on the services is available at the Jena Geography Dataset website.
Quicklinks to the collection in the OPOSSum portal: Full dataset, JGD150, JGD100 and JGD50.
An OWL domain ontology based on the Proton top level ontology has been created to reduce the effort for participants. Usage of this ontology is by no means required. The ontology is exclusively meant to serve as a starting point for those that want to use it. Participants are free to use it as is, to change it in any way or to not use it at all and use another ontology or no ontology at all.
WSDL descriptions: For those services in the dataset which are WSDL based, the original WSDL files are attached to the service entries in OPOSSum. We have created derived versions of those by removing the overhead operations, bindings, messages and types (see Jena Geography Dataset) and provided fictitious WSDL descriptions for the REST-based services in order to make entry for participants relying on the SAWSDL standard easier.
Quicklinks to the WSDL descriptions in the OPOSSum portal: Full dataset WSDLs, JGD150 WSDLs, JGD100 WSDLs and JGD50 WSDLs.
Quicklinks to the WSDL descriptions as zip files: Full dataset WSDLs zip, JGD150 WSDLs zip, JGD100 WSDLs zip and JGD50 WSDLs zip.
Future Extensions
Data Characteristics
Evaluation results depend on the data used for the evaluation. The geography/geocoding dataset has some specific characteristics that may influence the results. Most importantly, the contained services are exclusively data services. They provide or manipulate data, but they do not cause real-world effects that involve a real commitment of some kind (like the reservation of a flight or the purchase of an article). For future editions it is planned to extend the dataset with domains showing different characteristics (e.g., containing services that cause lasting real-world effects).

