Tables¶
Synapse Tables enable storage of tabular data in Synapse in a form that can be queried using a SQL-like query language.
A table has a Schema
and holds a set of rows conforming to that schema.
A Schema
defines a series of Column
of the following types: STRING, DOUBLE, INTEGER, BOOLEAN,
DATE, ENTITYID, FILEHANDLEID, LINK, LARGETEXT, USERID
~~~~~~~
Example
~~~~~~~
Preliminaries:
import synapseclient
from synapseclient import Project, File, Folder
from synapseclient import Schema, Column, Table, Row, RowSet, as_table_columns
syn = synapseclient.Synapse()
syn.login()
project = syn.get('syn123')
First, let’s load some data. Let’s say we had a file, genes.csv:
Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False
To create a Table:
table = build_table('My Favorite Genes', project, "/path/to/genes.csv")
syn.store(table)
build_table()
will set the Table Schema
which defines the columns of the table.
To create a table with a custom Schema
, first create the Schema
:
cols = [
Column(name='Name', columnType='STRING', maximumSize=20),
Column(name='Chromosome', columnType='STRING', maximumSize=20),
Column(name='Start', columnType='INTEGER'),
Column(name='End', columnType='INTEGER'),
Column(name='Strand', columnType='STRING', enumValues=['+', '-'], maximumSize=1),
Column(name='TranscriptionFactor', columnType='BOOLEAN')]
schema = Schema(name='My Favorite Genes', columns=cols, parent=project)
Let’s store that in Synapse:
table = Table(schema, "/path/to/genes.csv")
table = syn.store(table)
The Table()
function takes two arguments, a schema object and data in some form, which can be:
With a bit of luck, we now have a table populated with data. Let’s try to query:
results = syn.tableQuery("select * from %s where Chromosome='1' and Start < 41000 and End > 20000"
% table.schema.id)
for row in results:
print(row)
Pandas¶
Pandas is a popular library for working with tabular data. If you have Pandas installed, the goal is that Synapse Tables will play nice with it.
Create a Synapse Table from a DataFrame:
import pandas as pd
df = pd.read_csv("/path/to/genes.csv", index_col=False)
table = build_table('My Favorite Genes', project, df)
table = syn.store(table)
build_table()
uses pandas DataFrame dtype to set the Table Schema
.
To create a table with a custom Schema
, first create the Schema
:
schema = Schema(name='My Favorite Genes', columns=as_table_columns(df), parent=project)
table = syn.store(Table(schema, df))
Get query results as a DataFrame:
results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
df = results.asDataFrame()
Changing Data¶
Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.
Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file:
table = syn.store(Table(table.schema.id, "/path/to/more_genes.csv"))
To quickly add a few rows, use a list of row data:
new_rows = [["Qux1", "4", 201001, 202001, "+", False],
["Qux2", "4", 203001, 204001, "+", False]]
table = syn.store(Table(schema, new_rows))
Updating rows requires an etag, which identifies the most recent change set plus row IDs and version numbers for each row to be modified. We get those by querying before updating. Minimizing changesets to contain only rows that actually change will make processing faster.
For example, let’s update the names of some of our favorite genes:
results = syn.tableQuery("select * from %s where Chromosome='1'" % table.schema.id)
df = results.asDataFrame()
df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']
Note that we’re propagating the etag from the query results. Without it, we’d get an error saying something about an “Invalid etag”:
table = syn.store(Table(schema, df, etag=results.etag))
The etag is used by the server to prevent concurrent users from making conflicting changes, a technique called optimistic concurrency. In case of a conflict, your update may be rejected. You then have to do another query and try your update again.
Changing Table Structure¶
Adding columns can be done using the methods Schema.addColumn()
or addColumns()
on the
Schema
object:
schema = syn.get("syn000000")
bday_column = syn.store(Column(name='birthday', columnType='DATE'))
schema.addColumn(bday_column)
schema = syn.store(schema)
Renaming or otherwise modifying a column involves removing the column and adding a new column:
cols = syn.getTableColumns(schema)
for col in cols:
if col.name == "birthday":
schema.removeColumn(col)
bday_column2 = syn.store(Column(name='birthday2', columnType='DATE'))
schema.addColumn(bday_column2)
schema = syn.store(schema)
Table attached files¶
Synapse tables support a special column type called ‘File’ which contain a file handle, an identifier of a file stored in Synapse. Here’s an example of how to upload files into Synapse, associate them with a table and read them back later:
# your synapse project
project = syn.get(...)
covers_dir = '/path/to/album/covers/'
# store the table's schema
cols = [
Column(name='artist', columnType='STRING', maximumSize=50),
Column(name='album', columnType='STRING', maximumSize=50),
Column(name='year', columnType='INTEGER'),
Column(name='catalog', columnType='STRING', maximumSize=50),
Column(name='cover', columnType='FILEHANDLEID')]
schema = syn.store(Schema(name='Jazz Albums', columns=cols, parent=project))
# the actual data
data = [["John Coltrane", "Blue Train", 1957, "BLP 1577", "coltraneBlueTrain.jpg"],
["Sonny Rollins", "Vol. 2", 1957, "BLP 1558", "rollinsBN1558.jpg"],
["Sonny Rollins", "Newk's Time", 1958, "BLP 4001", "rollinsBN4001.jpg"],
["Kenny Burrel", "Kenny Burrel", 1956, "BLP 1543", "burrellWarholBN1543.jpg"]]
# upload album covers
for row in data:
file_handle = syn.uploadSynapseManagedFileHandle(os.path.join(covers_dir, row[4]))
row[4] = file_handle['id']
# store the table data
row_reference_set = syn.store(RowSet(columns=cols, schema=schema, rows=[Row(r) for r in data]))
# Later, we'll want to query the table and download our album covers
results = syn.tableQuery("select artist, album, year, catalog, cover from %s where artist = 'Sonny Rollins'" % schema.id)
cover_files = syn.downloadTableColumns(results, ['cover'])
Deleting rows¶
Query for the rows you want to delete and call syn.delete on the results:
results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
a = syn.delete(results)
Deleting the whole table¶
Deleting the schema deletes the whole table and all rows:
syn.delete(schema)
Queries¶
The query language is quite similar to SQL select statements, except that joins are not supported. The documentation for the Synapse API has lots of query examples.
Schema¶
-
class
synapseclient.table.
Schema
(*args, **kwargs) A Schema is an
synapseclient.entity.Entity
that defines a set of columns in a table.- Parameters
name – the name for the Table Schema object
description – User readable description of the schema
columns – a list of
Column
objects or their IDsparent – the project in Synapse to which this table belongs
properties – A map of Synapse properties
annotations – A map of user defined annotations
local_state – Internal use only
Example:
cols = [Column(name='Isotope', columnType='STRING'), Column(name='Atomic Mass', columnType='INTEGER'), Column(name='Halflife', columnType='DOUBLE'), Column(name='Discovered', columnType='DATE')] schema = syn.store(Schema(name='MyTable', columns=cols, parent=project))
-
class
synapseclient.table.
EntityViewSchema
(*args, **kwargs) A EntityViewSchema is a
synapseclient.entity.Entity
that displays all files/projects (depending on user choice) within a given set of scopes- Parameters
name – the name of the Entity View Table object
columns – a list of
Column
objects or their IDs. These are optional.parent – the project in Synapse to which this table belongs
scopes – a list of Projects/Folders or their ids
type – This field is deprecated. Please use includeEntityTypes
includeEntityTypes –
- a list of entity types to include in the view. Supported entity types are:
EntityViewType.FILE, EntityViewType.PROJECT, EntityViewType.TABLE, EntityViewType.FOLDER, EntityViewType.VIEW, EntityViewType.DOCKER
If none is provided, the view will default to include EntityViewType.FILE.
addDefaultViewColumns – If true, adds all default columns (e.g. name, createdOn, modifiedBy etc.) Defaults to True. The default columns will be added after a call to
synapseclient.Synapse.store()
.addAnnotationColumns – If true, adds columns for all annotation keys defined across all Entities in the EntityViewSchema’s scope. Defaults to True. The annotation columns will be added after a call to
synapseclient.Synapse.store()
.ignoredAnnotationColumnNames – A list of strings representing annotation names. When addAnnotationColumns is True, the names in this list will not be automatically added as columns to the EntityViewSchema if they exist in any of the defined scopes.
properties – A map of Synapse properties
annotations – A map of user defined annotations
local_state – Internal use only
- Example::
from synapseclient import EntityViewType
project_or_folder = syn.get(“syn123”) schema = syn.store(EntityViewSchema(name=’MyTable’, parent=project, scopes=[project_or_folder_id, ‘syn123’],
includeEntityTypes=[EntityViewType.FILE]))
-
set_entity_types
(includeEntityTypes) - Parameters
includeEntityTypes –
a list of entity types to include in the view. This list will replace the previous settings. Supported entity types are:
EntityViewType.FILE, EntityViewType.PROJECT, EntityViewType.TABLE, EntityViewType.FOLDER, EntityViewType.VIEW, EntityViewType.DOCKER
Column¶
-
class
synapseclient.table.
Column
(**kwargs)¶ Defines a column to be used in a table
synapseclient.table.Schema
synapseclient.table.EntityViewSchema
.- Variables
id – An immutable ID issued by the platform
- Parameters
columnType (string) – The column type determines the type of data that can be stored in a column. It can be any of: “STRING”, “DOUBLE”, “INTEGER”, “BOOLEAN”, “DATE”, “FILEHANDLEID”, “ENTITYID”, “LINK”, “LARGETEXT”, “USERID”. For more information, please see: https://docs.synapse.org/rest/org/sagebionetworks/repo/model/table/ColumnType.html
maximumSize (integer) – A parameter for columnTypes with a maximum size. For example, ColumnType.STRINGs have a default maximum size of 50 characters, but can be set to a maximumSize of 1 to 1000 characters.
name (string) – The display name of the column
enumValues (array of strings) – Columns type of STRING can be constrained to an enumeration values set on this list.
defaultValue (string) – The default value for this column. Columns of type FILEHANDLEID and ENTITYID are not allowed to have default values.
-
__init__
(**kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
Row¶
-
class
synapseclient.table.
Row
(values, rowId=None, versionNumber=None, etag=None, **kwargs)¶ A row in a Table.
- Parameters
values – A list of values
rowId – The immutable ID issued to a new row
versionNumber – The version number of this row. Each row version is immutable, so when a row is updated a new version is created.
-
__init__
(values, rowId=None, versionNumber=None, etag=None, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
RowSet¶
-
class
synapseclient.table.
RowSet
(columns=None, schema=None, **kwargs)¶ A Synapse object of type org.sagebionetworks.repo.model.table.RowSet.
- Parameters
schema – A
synapseclient.table.Schema
object that will be used to set the tableIdheaders (array of SelectColumns) – The list of SelectColumn objects that describe the fields in each row.
columns – An alternative to ‘headers’, a list of column objects that describe the fields in each row.
tableId (string) – The ID of the TableEntity that owns these rows
rows (array of rows) – The
synapseclient.table.Row
s of this set. The index of each row value aligns with the index of each header.
- Variables
etag – Any RowSet returned from Synapse will contain the current etag of the change set. To update any rows from a RowSet the etag must be provided with the POST.
-
__init__
(columns=None, schema=None, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
Table¶
-
class
synapseclient.table.
TableAbstractBaseClass
(schema, headers=None, etag=None)¶ Abstract base class for Tables based on different data containers.
-
class
RowMetadataTuple
(row_id, row_version, row_etag)¶ -
property
row_etag
¶ Alias for field number 2
-
property
row_id
¶ Alias for field number 0
-
property
row_version
¶ Alias for field number 1
-
property
-
abstract
iter_row_metadata
()¶ Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)
- Returns
a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)
-
class
-
class
synapseclient.table.
RowSetTable
(schema, rowset)¶ A Table object that wraps a RowSet.
-
iter_row_metadata
()¶ Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)
- Returns
a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)
-
-
class
synapseclient.table.
TableQueryResult
(synapse, query, limit=None, offset=None, isConsistent=True)¶ An object to wrap rows returned as a result of a table query. The TableQueryResult object can be used to iterate over results of a query.
Example
results = syn.tableQuery("select * from syn1234") for row in results: print(row)
-
asDataFrame
(rowIdAndVersionInIndex=True)¶ Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version (and row_etag
if it exists)
-
iter_row_metadata
()¶ Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, row_version,None)
- Returns
a generator that gives :py:class::collections.namedtuple with format (row_id, row_version, row_etag)
-
next
()¶ Python 2 iterator
-
-
class
synapseclient.table.
CsvFileTable
(schema, filepath, etag=None, quoteCharacter='"', escapeCharacter='\\', lineEnd='\n', separator=',', header=True, linesToSkip=0, includeRowIdAndRowVersion=None, headers=None)¶ An object to wrap a CSV file that may be stored into a Synapse table or returned as a result of a table query.
-
asDataFrame
(rowIdAndVersionInIndex=True, convert_to_datetime=False)¶ Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version
(and row_etag if it exists)
- Parameters
convert_to_datetime – If set to True, will convert all Synapse DATE columns from UNIX timestamp integers into UTC datetime objects
- Returns
-
classmethod
from_table_query
(synapse, query, quoteCharacter='"', escapeCharacter='\\', lineEnd='\n', separator=',', header=True, includeRowIdAndRowVersion=True, downloadLocation=None)¶ Create a Table object wrapping a CSV file resulting from querying a Synapse table. Mostly for internal use.
-
iter_row_metadata
()¶ Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)
- Returns
a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)
-
setColumnHeaders
(headers)¶ Set the list of
synapseclient.table.SelectColumn
objects that will be used to convert fields to the appropriate data types.Column headers are automatically set when querying.
-
Module level methods¶
-
synapseclient.table.
as_table_columns
(values)¶ Return a list of Synapse table
Column
objects that correspond to the columns in the given values.- Params values
an object that holds the content of the tables - a string holding the path to a CSV file, a filehandle, or StringIO containing valid csv content - a Pandas DataFrame
- Returns
A list of Synapse table
Column
objects
Example:
import pandas as pd df = pd.DataFrame(dict(a=[1, 2, 3], b=["c", "d", "e"])) cols = as_table_columns(df)
-
synapseclient.table.
build_table
(name, parent, values)¶ Build a Table object
- Parameters
name – the name for the Table Schema object
parent – the project in Synapse to which this table belongs
values –
an object that holds the content of the tables - a string holding the path to a CSV file - a Pandas DataFrame
- Returns
a Table object suitable for storing
Example:
path = "/path/to/file.csv" table = build_table("simple_table", "syn123", path) table = syn.store(table) import pandas as pd df = pd.DataFrame(dict(a=[1, 2, 3], b=["c", "d", "e"])) table = build_table("simple_table", "syn123", df) table = syn.store(table)
-
synapseclient.table.
Table
(schema, values, **kwargs)¶ Combine a table schema and a set of values into some type of Table object depending on what type of values are given.
- Parameters
- Returns
a Table object suitable for storing
Usually, the immediate next step after creating a Table object is to store it:
table = syn.store(Table(schema, values))
End users should not need to know the details of these Table subclasses: