Introduction

How to access your endpoint
The API access information is normally placed in a configuration file (see the section below). Create a config file, `servicex.yaml`, in YAML format, in the appropriate place for your work, containing the following (shown for the `xaod` backend; use `uproot` as the `type` for the uproot backend):
```yaml
api_endpoints:
  - name: <your-endpoint-name>
    endpoint: <your-endpoint>
    token: <api-token>
    type: xaod
```
All strings are expanded using Python's `os.path.expandvars` method, so `$NAME` and `${NAME}` will expand existing environment variables.
You can list multiple endpoints by repeating the block of dictionary items with a different name for each, as sketched below.
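For example, a sketch of a `servicex.yaml` with two endpoints (the names, URLs, and the `SERVICEX_XAOD_TOKEN` environment variable here are made-up illustrations):

```yaml
api_endpoints:
  - name: xaod-prod
    endpoint: https://xaod.servicex.example.org
    token: ${SERVICEX_XAOD_TOKEN}   # expanded from the environment, as described above
    type: xaod
  - name: uproot-prod
    endpoint: https://uproot.servicex.example.org
    token: <api-token>
    type: uproot
```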
Finally, you can create the objects `ServiceXAdaptor` and `MinioAdaptor` by hand in your code and pass them as arguments to `ServiceXDataset` to inject custom endpoints and credentials, avoiding the configuration system. This is probably only useful for advanced users.
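A rough sketch of this pattern (the import paths and constructor/keyword arguments here are assumptions based on the class names above, not the documented API; consult the `servicex.py` source for the exact signatures):

```python
# Assumed import paths - check the servicex package for the real locations.
from servicex import ServiceXDataset
from servicex.servicex_adaptor import ServiceXAdaptor
from servicex.minio_adaptor import MinioAdaptor

# Assumed constructor arguments: a custom web-service endpoint and a custom
# minio endpoint.
sx_adaptor = ServiceXAdaptor("http://my-servicex.example.org:5000")
minio_adaptor = MinioAdaptor("my-minio.example.org:9000")

# Assumed keyword names for injecting the adaptors, bypassing servicex.yaml.
ds = ServiceXDataset(
    "rucio://mc16a_13TeV:my_dataset",
    servicex_adaptor=sx_adaptor,
    minio_adaptor=minio_adaptor,
)
```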
These config files are used to keep confidential credential information - so that it isn’t accidentally placed in a public repository.
If no endpoint is specified and no config file containing a useful endpoint is found, the library defaults to the developer endpoint, `http://localhost:5000`, for the web-service API. No passwords are used in this case.
Usage
The following lines will return a `pandas.DataFrame` containing all the jet pT's from an ATLAS xAOD file containing Z->ee Monte Carlo:
```python
from servicex import ServiceXDataset

query = "(call ResultTTree (call Select (call SelectMany (call EventDataset (list 'localds:bogus')) (lambda (list e) (call (attr e 'Jets') 'AntiKt4EMTopoJets'))) (lambda (list j) (/ (call (attr j 'pt')) 1000.0))) (list 'JetPt') 'analysis' 'junk.root')"
dataset = "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
ds = ServiceXDataset(dataset, backend_name="xaod")
r = ds.get_data_pandas_df(query)
print(r)
```
And the output in a terminal window from running the above script (takes about 1-2 minutes to complete):
```text
python scripts/run_test.py http://localhost:5000/servicex
            JetPt
entry
0       38.065707
1       31.967096
2        7.881337
3        6.669581
4        5.624053
...           ...
710183  42.926141
710184  30.815709
710185   6.348002
710186   5.472711
710187   5.212714

[11355980 rows x 1 columns]
```
If your query is badly formed or there is another problem with the backend, an exception will be thrown with information about the error.
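For example, a minimal sketch of handling such a failure (catching the base `Exception` here because the specific exception class is not named in this document):

```python
# `ds` is the ServiceXDataset from the example above; the query string here is
# deliberately malformed to illustrate error handling.
try:
    r = ds.get_data_pandas_df("(call NotAValidQuery)")
except Exception as err:
    # The exception message carries details about what went wrong.
    print(f"ServiceX query failed: {err}")
```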
If you'd like to be able to submit multiple queries and have them run on the `ServiceX` backend in parallel, it is best to use the asyncio interface, which has an identical signature but is called `get_data_pandas_df_async`.
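For example, a minimal sketch of running two queries in parallel with the asyncio interface (`dataset`, `query1`, and `query2` stand for strings like those in the Usage example above):

```python
import asyncio

from servicex import ServiceXDataset

async def fetch_all(ds, queries):
    # asyncio.gather submits every query at once, so the ServiceX backend can
    # work on them in parallel; results come back as a list of DataFrames.
    return await asyncio.gather(*(ds.get_data_pandas_df_async(q) for q in queries))

ds = ServiceXDataset(dataset, backend_name="xaod")
frames = asyncio.run(fetch_all(ds, [query1, query2]))
```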
For documentation of `get_data` and `get_data_async`, see the `servicex.py` source file.
The `backend_name` argument tells the library where to look in the `servicex.yaml` configuration file to find an endpoint (URL and authentication information). See above for more information.
How to specify the input data
How you specify the input data, and what data can be ingested, is ultimately defined by the configuration of the `ServiceX` backend you are running against. This `servicex` library supports the following (a code sketch follows the list):
- A Dataset Identifier (DID): for example, `rucio://mc16a_13TeV:my_dataset` or `cernopendata://1507`, both of which are resolved to a list of files (in one case a set of ATLAS data files, and in the other some CMS Run 1 AOD files).
- A single file located at an `http` or `root` endpoint: for example, `root://myfile.root` or `http://myfile.root`. ServiceX must be able to access these files without special permissions.
- A list of files located at `http` or `root` endpoints: for example, `[root://myfile1.root, http://myfile2.root]`. ServiceX must be able to access these files without special permissions.
- [deprecated] A bare DID: an unadorned identifier, routed to the backend's default DID resolver. The default is defined at runtime. This form is deprecated because a backend configuration change can break your code.
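A minimal sketch of these forms in code (the DID and file URLs are the illustrative examples from the list above):

```python
from servicex import ServiceXDataset

# A rucio DID, resolved by the backend to a list of files.
ds_did = ServiceXDataset("rucio://mc16a_13TeV:my_dataset", backend_name="xaod")

# A single file at a root endpoint.
ds_single = ServiceXDataset("root://myfile.root", backend_name="uproot")

# A list of files at http/root endpoints.
ds_list = ServiceXDataset(
    ["root://myfile1.root", "http://myfile2.root"],
    backend_name="uproot",
)
```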
The Local Data Cache
To speed things up - especially when you run the same query multiple times - the `servicex` package will cache query data that comes back from ServiceX. You can control where this is stored with the `cache_path` in the configuration file (see below). By default it is written in the temp directory of your system, under a `servicex_{USER}` directory. The cache is unbounded: it will continuously fill up. You can delete it at any time that you aren't processing data; the data will simply be re-downloaded or re-transformed in `ServiceX`.
There are times when you want the system to ignore the cache while it is running. You can do this by using `ignore_cache()`:
```python
from servicex import ignore_cache

with ignore_cache():
    do_query()
```
If you are using a Jupyter notebook, the `with` statement can't really span cells, so use `ignore_cache().__enter__()` instead. Or you can do something like:
```python
from servicex import ignore_cache

ic = ignore_cache()
ic.__enter__()
...
ic.__exit__(None, None, None)
```
If you wish to disable the cache for a single dataset, use the `ignore_cache` parameter when you create it:
```python
ds = ServiceXDataset(dataset, ignore_cache=True)
```
Finally, you can ignore the cache for a dataset for a short period of time by using the same context manager pattern:
```python
ds = ServiceXDataset(dataset)

with ds.ignore_cache():
    do_query(ds)  # Cache is ignored

do_query(ds)  # Cache is not ignored
```
Analysis And Query Cache
The `servicex` library can write out a local file which maps queries to backend `request-id`'s. This file can then be shared with other people, checked into repositories, etc., to reference the same data in the backend. The advantage is that the backend does not need to re-run the query; the `servicex` library need only download the results again. When a user works on multiple machines or shares analysis code with an analysis team, this is a much more efficient use of resources.
- By default the library looks for a file `servicex_query_cache.json` in the current working directory, or in a parent directory of the current working directory.
- To trigger the creation and updating of a cache file, call the function `update_local_query_cache()`, as in the sketch after this list. If you like, you can pass in a filename/path; by default it will use `servicex_query_cache.json` in the local directory. The file is used both for look-ups and is updated with all subsequent queries. Except under very special cases, it is suggested that one use the filename `servicex_query_cache.json`.
- You can also create the file with the bash command `touch servicex_query_cache.json`, if you are using the default name.
- If that file is present when a query is run, the library will attempt to download the data from the endpoint, only resubmitting the query if the endpoint doesn't know about it. As long as the file `servicex_query_cache.json` is in the current working directory (or above), it will be picked up automatically: there is no need to call `update_local_query_cache()`.
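A minimal sketch of turning the analysis cache on before running queries (assuming `update_local_query_cache` is importable from the package top level, and with `dataset` and `query` as in the Usage example):

```python
from servicex import ServiceXDataset, update_local_query_cache

# Start recording query -> request-id mappings in the default
# servicex_query_cache.json (pass a path to use a different file).
update_local_query_cache()

ds = ServiceXDataset(dataset, backend_name="xaod")
df = ds.get_data_pandas_df(query)  # this query is now recorded in the cache file
```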
The cache search order is as follows:
1. The analysis query cache is searched first.
2. If nothing is found there, the local query cache is used next.
3. If nothing is found there, the query is resubmitted.
Note: eventually the backends will contain automatic cache lookup, and this feature will be much less useful, as the lookup will occur automatically on the backend.
Deleting Files from the Local Data Cache
It is not recommended to alter the cache. The software expects the cache to be in a certain state, and randomly altering it can lead to unexpected behavior.
Besides telling the `servicex` library to ignore the cache in the above ways, you can also delete files from the local cache. The local cache directory is split into sub-directories. The effect of deleting files from each of them is as follows (a sketch follows the list):
- `query_cache` - contains the mapping between the query text (or its hash) and the ServiceX backend's `request-id`. If you delete a file from here, it is as if the query was never made; this is the same as using the ignore methods above.
- `query_cache_status` - contains the last retrieved status from the backend. Deleting a file here will cause the library to refresh the missing status. This file is updated continuously until the query is completed.
- `file_list_cache` - each file contains a JSON list of all the files in the `minio` bucket for a particular request id. Deleting a file from this directory will cause the frontend to re-download the complete list of files (the file in this directory isn't created until all files have been downloaded).
- `data` - contains the files that have been downloaded locally. If you delete a data file from this directory, it will trigger a re-download. Note that if the ServiceX endpoint doesn't know about the original query, or the minio bucket is missing, this will force the transform to be re-run from scratch.
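For example, a sketch of forcing a re-download by clearing the `data` sub-directory (the cache location here assumes the default described above; adjust it if you set `cache_path`):

```python
import getpass
import shutil
import tempfile
from pathlib import Path

# Default cache location: <system temp dir>/servicex_<username>. This is an
# assumption based on the defaults above - change it if you set cache_path.
cache_root = Path(tempfile.gettempdir()) / f"servicex_{getpass.getuser()}"

# Deleting files under data/ triggers a re-download on the next query.
data_dir = cache_root / "data"
if data_dir.exists():
    shutil.rmtree(data_dir)
```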
Configuration
The `servicex` library searches for configuration information in several locations to determine what endpoint it should connect to:
The config file can be called `servicex.yaml`, `servicex.yml`, or `.servicex`; the names are searched in that order, and all files present are used. The search locations are:
- A config file in the current working directory.
- A config file in any working directory above your current working directory.
- A config file in the user's home directory (`$HOME` on Linux and Mac, your profile directory on Windows).
- The `config_defaults.yaml` file distributed with the `servicex` package.
The file can contain an `api_endpoints` block as mentioned earlier. In addition, the following other items can be included:
- `cache_path`: location where queries, data, and a record of queries are written. This should be an absolute path that the person running the library has read/write access to. On Windows, make sure to escape `\`, and it is best to follow standard `yaml` conventions and put the path in quotes, especially if it contains a space. This is a top-level yaml item (don't indent it accidentally!). Defaults to `/tmp/servicex_<username>` (with the temp directory as appropriate for your platform). Examples:
  - Windows: `cache_path: "C:\\Users\\gordo\\Desktop\\cacheme"`
  - Linux: `cache_path: "/home/servicex-cache"`
- `backend_types`: a list of yaml dictionaries that contains some defaults for the backends. By default only `return_data` is there, which is `root` for `xaod` and `parquet` for `uproot`. There is also a `cms_run1_aod` type, which returns `root`. This allows `servicex` to convert to `pandas.DataFrame` or `awkward` if requested by the user (see the sketch below).
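A sketch of what such a block might look like (the exact key names are an assumption based on the description above; the `config_defaults.yaml` file shipped with the package is the authoritative version):

```yaml
backend_types:
  - type: xaod           # assumed key names, for illustration only
    return_data: root
  - type: uproot
    return_data: parquet
  - type: cms_run1_aod
    return_data: root
```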
All strings are expanded using Python's `os.path.expandvars` method, so `$NAME` and `${NAME}` will expand existing environment variables.