HYDRO Accessor implementation details
The HYDRO accessor is the component in charge of brokering HIS Central and HIS Server data sources (HIS Central is an aggregator of HIS Server instances).
The main tasks of the HYDRO accessor are:
- to translate each query issued by the GI-cat user (internally based on CSW ISO) to one or more HIS Central/HIS Server requests
- to translate the HIS results (based on the HIS data model) to dataset metadata records based on ISO 19115 (19139 XML implementation) through a data model mapping
The mapping currently in use translates each HIS time series to an ISO dataset. Hence, the base unit of discovery and access is the HIS time series.
Users usually issue queries with a combination of the following constraints:
- temporal interval
- area of interest
A typical query is translated at runtime to a getSeriesCatalogInBoxPaged HIS request. (Harvesting of HIS sources is not feasible, as the number of time series is very high and keeps growing in real time.)
For example, the following is a HIS request corresponding to a user query for time series in the Gulf of Mexico area, related to "dissolved oxygen" and acquired during the last twenty years.
The request is sent to the HIS Central endpoint provided to the accessor in the configuration phase. The response from the HIS server is an array of series matching the request constraints: each series contains spatio-temporal information, the variable and location names, as well as the URL of the HIS server to contact to obtain the data corresponding to that single time series.
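The mapping from one returned series to one dataset record can be sketched as follows. This is only an illustration: the class name HisSeries, its field names and the keys of the resulting record are hypothetical, chosen to mirror the information listed above (spatio-temporal extent, variable and location names, HIS server URL), and do not reproduce the actual HIS or ISO 19139 schemas.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HisSeries:
    """One entry of the series array (field names are hypothetical)."""
    variable_name: str
    location_name: str
    latitude: float
    longitude: float
    begin: date
    end: date
    server_url: str


def to_dataset_record(series: HisSeries) -> dict:
    """Map one HIS time series to one ISO-like dataset record."""
    return {
        "title": f"{series.variable_name} at {series.location_name}",
        # A point location becomes a degenerate bounding box.
        "bbox": {
            "west": series.longitude,
            "east": series.longitude,
            "south": series.latitude,
            "north": series.latitude,
        },
        "temporal_extent": (series.begin.isoformat(), series.end.isoformat()),
        # The HIS server to contact for the data of this single series.
        "online_resource": series.server_url,
    }
```

Because each time series becomes exactly one record, the record keeps the server URL so that a client can go straight to the HIS server holding the data.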
Results can be displayed on a map by a client GUI as in the following screenshot:
Number of Results
Mediating from CSW/ISO to HIS Server can sometimes be more complicated. The CSW/ISO standard requires the response to include the total number of records that matched the query constraints. However, neither HIS Central nor HIS Server offers (so far) a "hits query" that gives back the exact number of results matching the query constraints. Thus the HYDRO accessor is forced to use a (rough) estimate of the total number of records. The original query is split into 16 queries using a 4x4 grid; for each square an estimate of the matching datasets is calculated by sampling a portion of the square (using the getSeriesCatalogInBoxPaged HIS request).
Since the request for the actual records still has to be made (and it might take a long time due to the size of the result set, e.g. millions of entries to be transferred over the internet), the idea is to use only a portion of each square as the query area, in order to limit the number of results (and speed up each answer). From the number of results returned for that portion, the total number of results matching the whole square is estimated. The number of results in the original area is then estimated by summing up the estimates for the 16 squares. Of course, in some cases this number can be quite different from the actual number; we are working to provide a better (and equally quick) way to estimate the number of results soon.
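The estimation described above can be sketched as follows. The text does not say which portion of each square is sampled or how the sample is scaled up, so this sketch assumes a centred sub-box covering a fixed fraction of the square's area and an area-proportional extrapolation. The function count_in_box is a hypothetical stand-in for a getSeriesCatalogInBoxPaged call that returns the number of series found in a box.

```python
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def split_grid(bbox: BBox, n: int = 4) -> List[BBox]:
    """Split a bounding box into an n x n grid of equal squares."""
    west, south, east, north = bbox
    dx = (east - west) / n
    dy = (north - south) / n
    return [
        (west + i * dx, south + j * dy, west + (i + 1) * dx, south + (j + 1) * dy)
        for j in range(n)
        for i in range(n)
    ]


def sample_box(cell: BBox, fraction: float) -> BBox:
    """Return a centred sub-box covering `fraction` of the cell's area (assumption)."""
    west, south, east, north = cell
    scale = fraction ** 0.5  # shrink each side so the area shrinks by `fraction`
    cx, cy = (west + east) / 2, (south + north) / 2
    half_w = (east - west) * scale / 2
    half_h = (north - south) * scale / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def estimate_total(
    bbox: BBox,
    count_in_box: Callable[[BBox], int],  # hypothetical wrapper around the HIS request
    fraction: float = 0.25,
) -> Tuple[List[int], int]:
    """Estimate matches per grid square and for the whole query area."""
    per_cell = []
    for cell in split_grid(bbox):
        sampled = count_in_box(sample_box(cell, fraction))
        # Assume the series are spread evenly over the square.
        per_cell.append(round(sampled / fraction))
    return per_cell, sum(per_cell)
```

The even-density assumption is what makes the estimate rough: a square whose series cluster outside the sampled portion is underestimated, and one whose series cluster inside it is overestimated.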
Due to the high number of matches for each query to these services and the paging approach implemented by GI-cat, a ranking is needed to return the first 10 results (then from 11 to 20 and so on) to users in a meaningful way. Since no ranking is provided by HIS Central or HIS Server (so far), we implemented an "aggregated" record approach for this accessor. Results are returned in metadata records which provide basic information about the estimated number of records matching the user's query. The user's original query is split into 16 queries using a 4x4 grid; for each square an estimate of the matching datasets is calculated (using the method described in the previous section) and this information is stored in the aggregated record together with the estimate of the total number of matches.
This representation is useful to graphically show "where the data is" and provides hints to further refine the initial query. (The query can then iteratively be refined on a smaller and smaller area, until a small result set is obtained and the real single datasets can be individually displayed).
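A minimal sketch of such aggregated records is given below. The class and field names are hypothetical and only mirror the description above: each record carries the estimate for one grid square together with the estimate for the whole query area. The densest_square helper is an added illustration of how a client might choose a sub-area for the next, refined query; it is not described in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


@dataclass(frozen=True)
class AggregatedRecord:
    """Summary of one grid square (hypothetical field names)."""
    bbox: BBox
    estimated_matches: int  # estimate for this square
    estimated_total: int    # estimate for the whole query area


def build_aggregated_records(
    cells: Sequence[BBox], estimates: Sequence[int]
) -> List[AggregatedRecord]:
    """Pair each grid square with its estimate and the overall total."""
    if len(cells) != len(estimates):
        raise ValueError("one estimate per grid square is required")
    total = sum(estimates)
    return [AggregatedRecord(c, e, total) for c, e in zip(cells, estimates)]


def densest_square(records: Sequence[AggregatedRecord]) -> BBox:
    """Return the square with the highest estimate, one plausible refinement target."""
    return max(records, key=lambda r: r.estimated_matches).bbox
```

Issuing a new query on the returned bounding box corresponds to one step of the iterative refinement: the sub-area is split and estimated again until the result set is small enough to list individual datasets.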
A client GUI could graphically display the following aggregated results on a map when too many datasets have matched: (double clicking the aggregated datasets icons on the map will issue a new query on the given sub-area).
In some cases unexpected results may be returned by the HYDRO accessor. Three types of "strange results" are possible:
- Zero results for an area known to have time series: this may happen because of a timeout. As the HYDRO accessor queries the HIS Central service on the fly, the service may sometimes be slow to respond because of network traffic or service congestion. In this case the accessor replies with an empty result list. As a workaround, the query may be repeated, or repeated on a smaller area (in which case the service may be able to reply in time).
- Wrong number of results: this may happen because, to optimize response times, the HYDRO accessor currently only estimates the number of matching time series in the case of aggregated datasets (see last paragraph in the previous section).
- The results remain aggregated: this will happen when a given station has a high number of time series, which cannot be displayed in a non-aggregated style. In this case increasing the zoom will have no effect. As a workaround, the query can be constrained by variable name and temporal extent, to limit the number of results along the non-spatial dimensions.
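The workaround for the first case (zero results caused by a timeout) can be sketched on the client side as follows. Here run_query is a hypothetical callable that sends a query for a bounding box and returns a list of results; the choice of splitting the area into four quadrants is an assumption, since the text only says "a smaller area". Because a timeout and a genuinely empty area both produce an empty list, the sketch cannot tell them apart and simply tries again.

```python
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)


def query_with_workaround(
    bbox: BBox,
    run_query: Callable[[BBox], List[str]],  # hypothetical query function
    retries: int = 1,
) -> List[str]:
    """Repeat an empty query, then fall back to four smaller areas."""
    # First workaround: repeat the same query.
    for _ in range(retries + 1):
        results = run_query(bbox)
        if results:
            return results

    # Second workaround: repeat on smaller areas (four quadrants, an assumption).
    west, south, east, north = bbox
    mid_x, mid_y = (west + east) / 2, (south + north) / 2
    quadrants = [
        (west, south, mid_x, mid_y),
        (mid_x, south, east, mid_y),
        (west, mid_y, mid_x, north),
        (mid_x, mid_y, east, north),
    ]
    merged: List[str] = []
    for quadrant in quadrants:
        merged.extend(run_query(quadrant))
    return merged
```

If the merged list is still empty after the quadrant queries, the area most likely contains no matching time series for the given constraints.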