Tables¶

Synapse Tables enable storage of tabular data in Synapse in a form that can be queried using a SQL-like query language.

A table has a Schema and holds a set of rows conforming to that schema.

A Schema defines a series of Column of the following types: STRING, DOUBLE, INTEGER, BOOLEAN, DATE, ENTITYID, FILEHANDLEID, LINK, LARGETEXT, USERID ~~~~~~~ Example ~~~~~~~

Preliminaries:

import synapseclient
from synapseclient import Project, File, Folder
from synapseclient import Schema, Column, Table, Row, RowSet, as_table_columns

syn = synapseclient.Synapse()
syn.login()

project = syn.get('syn123')

First, let’s load some data. Let’s say we had a file, genes.csv:

Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False

To create a Table:

table = build_table('My Favorite Genes', project, "/path/to/genes.csv")
syn.store(table)

build_table() will set the Table Schema which defines the columns of the table. To create a table with a custom Schema, first create the Schema:

cols = [
    Column(name='Name', columnType='STRING', maximumSize=20),
    Column(name='Chromosome', columnType='STRING', maximumSize=20),
    Column(name='Start', columnType='INTEGER'),
    Column(name='End', columnType='INTEGER'),
    Column(name='Strand', columnType='STRING', enumValues=['+', '-'], maximumSize=1),
    Column(name='TranscriptionFactor', columnType='BOOLEAN')]

schema = Schema(name='My Favorite Genes', columns=cols, parent=project)

Let’s store that in Synapse:

table = Table(schema, "/path/to/genes.csv")
table = syn.store(table)

The Table() function takes two arguments, a schema object and data in some form, which can be:

a path to a CSV file

a Pandas DataFrame

a RowSet object

a list of lists where each of the inner lists is a row

With a bit of luck, we now have a table populated with data. Let’s try to query:

results = syn.tableQuery("select * from %s where Chromosome='1' and Start < 41000 and End > 20000"
                         % table.schema.id)
for row in results:
    print(row)

Pandas¶

Pandas is a popular library for working with tabular data. If you have Pandas installed, the goal is that Synapse Tables will play nice with it.

Create a Synapse Table from a DataFrame:

import pandas as pd

df = pd.read_csv("/path/to/genes.csv", index_col=False)
table = build_table('My Favorite Genes', project, df)
table = syn.store(table)

build_table() uses pandas DataFrame dtype to set the Table Schema. To create a table with a custom Schema, first create the Schema:

schema = Schema(name='My Favorite Genes', columns=as_table_columns(df), parent=project)
table = syn.store(Table(schema, df))

Get query results as a DataFrame:

results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
df = results.asDataFrame()

Changing Data¶

Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.

Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file:

table = syn.store(Table(table.schema.id, "/path/to/more_genes.csv"))

To quickly add a few rows, use a list of row data:

new_rows = [["Qux1", "4", 201001, 202001, "+", False],
            ["Qux2", "4", 203001, 204001, "+", False]]
table = syn.store(Table(schema, new_rows))

Updating rows requires an etag, which identifies the most recent change set plus row IDs and version numbers for each row to be modified. We get those by querying before updating. Minimizing changesets to contain only rows that actually change will make processing faster.

For example, let’s update the names of some of our favorite genes:

results = syn.tableQuery("select * from %s where Chromosome='1'" % table.schema.id)
df = results.asDataFrame()
df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']

Note that we’re propagating the etag from the query results. Without it, we’d get an error saying something about an “Invalid etag”:

table = syn.store(Table(schema, df, etag=results.etag))

The etag is used by the server to prevent concurrent users from making conflicting changes, a technique called optimistic concurrency. In case of a conflict, your update may be rejected. You then have to do another query and try your update again.

Changing Table Structure¶

Adding columns can be done using the methods Schema.addColumn() or addColumns() on the Schema object:

schema = syn.get("syn000000")
bday_column = syn.store(Column(name='birthday', columnType='DATE'))
schema.addColumn(bday_column)
schema = syn.store(schema)

Renaming or otherwise modifying a column involves removing the column and adding a new column:

cols = syn.getTableColumns(schema)
for col in cols:
    if col.name == "birthday":
        schema.removeColumn(col)
bday_column2 = syn.store(Column(name='birthday2', columnType='DATE'))
schema.addColumn(bday_column2)
schema = syn.store(schema)

Table attached files¶

Synapse tables support a special column type called ‘File’ which contain a file handle, an identifier of a file stored in Synapse. Here’s an example of how to upload files into Synapse, associate them with a table and read them back later:

# your synapse project
project = syn.get(...)

covers_dir = '/path/to/album/covers/'

# store the table's schema
cols = [
    Column(name='artist', columnType='STRING', maximumSize=50),
    Column(name='album', columnType='STRING', maximumSize=50),
    Column(name='year', columnType='INTEGER'),
    Column(name='catalog', columnType='STRING', maximumSize=50),
    Column(name='cover', columnType='FILEHANDLEID')]
schema = syn.store(Schema(name='Jazz Albums', columns=cols, parent=project))

# the actual data
data = [["John Coltrane",  "Blue Train",   1957, "BLP 1577", "coltraneBlueTrain.jpg"],
        ["Sonny Rollins",  "Vol. 2",       1957, "BLP 1558", "rollinsBN1558.jpg"],
        ["Sonny Rollins",  "Newk's Time",  1958, "BLP 4001", "rollinsBN4001.jpg"],
        ["Kenny Burrel",   "Kenny Burrel", 1956, "BLP 1543", "burrellWarholBN1543.jpg"]]

# upload album covers
for row in data:
    file_handle = syn.uploadSynapseManagedFileHandle(os.path.join(covers_dir, row[4]))
    row[4] = file_handle['id']

# store the table data
row_reference_set = syn.store(RowSet(columns=cols, schema=schema, rows=[Row(r) for r in data]))

# Later, we'll want to query the table and download our album covers
results = syn.tableQuery("select artist, album, year, catalog, cover from %s where artist = 'Sonny Rollins'"                              % schema.id)
cover_files = syn.downloadTableColumns(results, ['cover'])

Deleting rows¶

Query for the rows you want to delete and call syn.delete on the results:

results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
a = syn.delete(results)

Deleting the whole table¶

Deleting the schema deletes the whole table and all rows:

syn.delete(schema)

Queries¶

The query language is quite similar to SQL select statements, except that joins are not supported. The documentation for the Synapse API has lots of query examples.

Schema¶

class synapseclient.table.Schema(*args, **kwargs)

A Schema is an synapseclient.entity.Entity that defines a set of columns in a table.

Parameters

name – the name for the Table Schema object
description – User readable description of the schema
columns – a list of Column objects or their IDs
parent – the project in Synapse to which this table belongs
properties – A map of Synapse properties
annotations – A map of user defined annotations
local_state – Internal use only

Example:

cols = [Column(name='Isotope', columnType='STRING'),
        Column(name='Atomic Mass', columnType='INTEGER'),
        Column(name='Halflife', columnType='DOUBLE'),
        Column(name='Discovered', columnType='DATE')]

schema = syn.store(Schema(name='MyTable', columns=cols, parent=project))

class synapseclient.table.EntityViewSchema(*args, **kwargs)

A EntityViewSchema is a synapseclient.entity.Entity that displays all files/projects (depending on user choice) within a given set of scopes

Parameters

name – the name of the Entity View Table object
columns – a list of Column objects or their IDs. These are optional.
parent – the project in Synapse to which this table belongs
scopes – a list of Projects/Folders or their ids
type – This field is deprecated. Please use includeEntityTypes
includeEntityTypes –

a list of entity types to include in the view. Supported entity types are:
EntityViewType.FILE, EntityViewType.PROJECT, EntityViewType.TABLE, EntityViewType.FOLDER, EntityViewType.VIEW, EntityViewType.DOCKER

If none is provided, the view will default to include EntityViewType.FILE.
addDefaultViewColumns – If true, adds all default columns (e.g. name, createdOn, modifiedBy etc.) Defaults to True. The default columns will be added after a call to synapseclient.Synapse.store().
addAnnotationColumns – If true, adds columns for all annotation keys defined across all Entities in the EntityViewSchema’s scope. Defaults to True. The annotation columns will be added after a call to synapseclient.Synapse.store().
ignoredAnnotationColumnNames – A list of strings representing annotation names. When addAnnotationColumns is True, the names in this list will not be automatically added as columns to the EntityViewSchema if they exist in any of the defined scopes.
properties – A map of Synapse properties
annotations – A map of user defined annotations
local_state – Internal use only

Example::

from synapseclient import EntityViewType

project_or_folder = syn.get(“syn123”) schema = syn.store(EntityViewSchema(name=’MyTable’, parent=project, scopes=[project_or_folder_id, ‘syn123’],

includeEntityTypes=[EntityViewType.FILE]))

set_entity_types(includeEntityTypes)

Parameters

includeEntityTypes –

a list of entity types to include in the view. This list will replace the previous settings. Supported entity types are:

EntityViewType.FILE, EntityViewType.PROJECT, EntityViewType.TABLE, EntityViewType.FOLDER, EntityViewType.VIEW, EntityViewType.DOCKER

Column¶

class synapseclient.table.Column(**kwargs)¶

Defines a column to be used in a table synapseclient.table.Schema synapseclient.table.EntityViewSchema.

Variables

id – An immutable ID issued by the platform

Parameters

columnType (string) – The column type determines the type of data that can be stored in a column. It can be any of: “STRING”, “DOUBLE”, “INTEGER”, “BOOLEAN”, “DATE”, “FILEHANDLEID”, “ENTITYID”, “LINK”, “LARGETEXT”, “USERID”. For more information, please see: https://docs.synapse.org/rest/org/sagebionetworks/repo/model/table/ColumnType.html
maximumSize (integer) – A parameter for columnTypes with a maximum size. For example, ColumnType.STRINGs have a default maximum size of 50 characters, but can be set to a maximumSize of 1 to 1000 characters.
name (string) – The display name of the column
enumValues (array of strings) – Columns type of STRING can be constrained to an enumeration values set on this list.
defaultValue (string) – The default value for this column. Columns of type FILEHANDLEID and ENTITYID are not allowed to have default values.

__init__(**kwargs)¶: Initialize self. See help(type(self)) for accurate signature.

Row¶

class synapseclient.table.Row(values, rowId=None, versionNumber=None, etag=None, **kwargs)¶

A row in a Table.

Parameters

values – A list of values
rowId – The immutable ID issued to a new row
versionNumber – The version number of this row. Each row version is immutable, so when a row is updated a new version is created.

__init__(values, rowId=None, versionNumber=None, etag=None, **kwargs)¶: Initialize self. See help(type(self)) for accurate signature.

RowSet¶

class synapseclient.table.RowSet(columns=None, schema=None, **kwargs)¶

A Synapse object of type org.sagebionetworks.repo.model.table.RowSet.

Parameters

schema – A synapseclient.table.Schema object that will be used to set the tableId
headers (array of SelectColumns) – The list of SelectColumn objects that describe the fields in each row.
columns – An alternative to ‘headers’, a list of column objects that describe the fields in each row.
tableId (string) – The ID of the TableEntity that owns these rows
rows (array of rows) – The synapseclient.table.Row s of this set. The index of each row value aligns with the index of each header.

Variables

etag – Any RowSet returned from Synapse will contain the current etag of the change set. To update any rows from a RowSet the etag must be provided with the POST.

__init__(columns=None, schema=None, **kwargs)¶: Initialize self. See help(type(self)) for accurate signature.

Table¶

class synapseclient.table.TableAbstractBaseClass(schema, headers=None, etag=None)¶

Abstract base class for Tables based on different data containers.

class RowMetadataTuple(row_id, row_version, row_etag)¶

property row_etag¶: Alias for field number 2

property row_id¶: Alias for field number 0

property row_version¶: Alias for field number 1

abstract iter_row_metadata()¶

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)

Returns: a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)

class synapseclient.table.RowSetTable(schema, rowset)¶

A Table object that wraps a RowSet.

iter_row_metadata()¶

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)

Returns: a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)

class synapseclient.table.TableQueryResult(synapse, query, limit=None, offset=None, isConsistent=True)¶

An object to wrap rows returned as a result of a table query. The TableQueryResult object can be used to iterate over results of a query.

Example

results = syn.tableQuery("select * from syn1234")
for row in results:
    print(row)

asDataFrame(rowIdAndVersionInIndex=True)¶: Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version (and row_etag

if it exists)

iter_row_metadata()¶

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, row_version,None)

Returns: a generator that gives :py:class::collections.namedtuple with format (row_id, row_version, row_etag)

next()¶: Python 2 iterator

class synapseclient.table.CsvFileTable(schema, filepath, etag=None, quoteCharacter='"', escapeCharacter='\\', lineEnd='\n', separator=',', header=True, linesToSkip=0, includeRowIdAndRowVersion=None, headers=None)¶

An object to wrap a CSV file that may be stored into a Synapse table or returned as a result of a table query.

asDataFrame(rowIdAndVersionInIndex=True, convert_to_datetime=False)¶

Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version

(and row_etag if it exists)

Parameters: convert_to_datetime – If set to True, will convert all Synapse DATE columns from UNIX timestamp integers into UTC datetime objects
Returns

classmethod from_table_query(synapse, query, quoteCharacter='"', escapeCharacter='\\', lineEnd='\n', separator=',', header=True, includeRowIdAndRowVersion=True, downloadLocation=None)¶: Create a Table object wrapping a CSV file resulting from querying a Synapse table. Mostly for internal use.

iter_row_metadata()¶

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)

Returns: a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)

setColumnHeaders(headers)¶

Set the list of synapseclient.table.SelectColumn objects that will be used to convert fields to the appropriate data types.

Column headers are automatically set when querying.

Module level methods¶

synapseclient.table.as_table_columns(values)¶

Return a list of Synapse table Column objects that correspond to the columns in the given values.

Params values: an object that holds the content of the tables - a string holding the path to a CSV file, a filehandle, or StringIO containing valid csv content - a Pandas DataFrame
Returns: A list of Synapse table Column objects

Example:

import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=["c", "d", "e"]))
cols = as_table_columns(df)

synapseclient.table.build_table(name, parent, values)¶

Build a Table object

Parameters

name – the name for the Table Schema object
parent – the project in Synapse to which this table belongs
values –
an object that holds the content of the tables - a string holding the path to a CSV file - a Pandas DataFrame

Returns

a Table object suitable for storing

Example:

path = "/path/to/file.csv"
table = build_table("simple_table", "syn123", path)
table = syn.store(table)

import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=["c", "d", "e"]))
table = build_table("simple_table", "syn123", df)
table = syn.store(table)

synapseclient.table.Table(schema, values, **kwargs)¶

Combine a table schema and a set of values into some type of Table object depending on what type of values are given.

Parameters

schema – a table Schema object
values –
an object that holds the content of the tables - a RowSet - a list of lists (or tuples) where each element is a row - a string holding the path to a CSV file - a Pandas DataFrame - a dict which will be wrapped by a Pandas DataFrame

Returns

a Table object suitable for storing

Usually, the immediate next step after creating a Table object is to store it:

table = syn.store(Table(schema, values))

End users should not need to know the details of these Table subclasses:

TableAbstractBaseClass

RowSetTable

TableQueryResult

CsvFileTable

See also:

Synapse Python Client API

Tables¶

Pandas¶

Changing Data¶

Changing Table Structure¶

Table attached files¶

Deleting rows¶

Deleting the whole table¶

Queries¶

Schema¶

Column¶

Row¶

RowSet¶

Table¶

Module level methods¶

Table of Contents

Previous topic

Next topic

This Page