Tabular data#

Table handles tabular data, storing as columns in a, you guessed it, columns attribute. The latter acts like a dictionary, with the column names as the keys and the column values being numpy.ndarray instances. The table itself is iterable over rows.

Note

Table is immutable at the level of the individual ndarray not being writable.

Loading a csv file#

We load a tab separated data file using the load_table() function. The format is inferred from the filename suffix and you will note, in this case, it’s not actually a csv file.

from cogent3 import load_table

table = load_table("data/stats.tsv")
table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_005500	NonCon	0.0000
NP_055852	NonCon	10933217.7090

5 rows x 3 columns

Note

The known filename suffixes for reading are .csv, .tsv and .pkl or .pickle (Python’s pickle format).

Note

If you invoke the static column types argument, i.e.``load_table(…, static_column_types=True)`` and the column data are not static, those columns will be left as a string type.

Loading from a url#

The cogent3 load functions support loading from a url. We load the above .tsv file directly from GitHub.

from cogent3 import load_table

table = load_table("https://raw.githubusercontent.com/cogent3/cogent3/develop/doc/data/stats.tsv")

Loading delimited specifying the format#

Although unnecessary in this case, it’s possible to override the suffix by specifying the delimiter using the sep argument.

from cogent3 import load_table

table = load_table("data/stats.tsv", sep="\t")
table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_005500	NonCon	0.0000
NP_055852	NonCon	10933217.7090

5 rows x 3 columns

Loading delimited data without a header line#

To create a table from the follow examples, you specify your header and use make_table().

Using `load_delimited()`#

This is just a standard parsing function which does not do any filtering or converting elements to non-string types.

from cogent3.parse.table import load_delimited

header, rows, title, legend = load_delimited("data/CerebellumDukeDNaseSeq.pk", header=False, sep="\t")
rows[:4]

[['chr1',
  '29214',
  '29566',
  'chr1.1',
  '626',
  '.',
  '0.0724',
  '3.9',
  '-1',
  '159'],
 ['chr1',
  '89933',
  '90118',
  'chr1.2',
  '511',
  '.',
  '0.0313',
  '1.59',
  '-1',
  '94'],
 ['chr1',
  '545979',
  '546193',
  'chr1.3',
  '543',
  '.',
  '0.0428',
  '2.23',
  '-1',
  '100'],
 ['chr1',
  '713797',
  '714639',
  'chr1.4',
  '1000',
  '.',
  '0.3215',
  '16.0',
  '-1',
  '380']]

Using `FilteringParser`#

from cogent3.parse.table import FilteringParser

reader = FilteringParser(with_header=False, sep="\t")
rows = list(reader("data/CerebellumDukeDNaseSeq.pk"))
rows[:4]

[['chr1',
  '29214',
  '29566',
  'chr1.1',
  '626',
  '.',
  '0.0724',
  '3.9',
  '-1',
  '159'],
 ['chr1',
  '89933',
  '90118',
  'chr1.2',
  '511',
  '.',
  '0.0313',
  '1.59',
  '-1',
  '94'],
 ['chr1',
  '545979',
  '546193',
  'chr1.3',
  '543',
  '.',
  '0.0428',
  '2.23',
  '-1',
  '100'],
 ['chr1',
  '713797',
  '714639',
  'chr1.4',
  '1000',
  '.',
  '0.3215',
  '16.0',
  '-1',
  '380']]

Selectively loading parts of a big file#

Loading a set number of lines from a file#

The limit argument specifies the number of lines to read.

from cogent3 import load_table

table = load_table("data/stats.tsv", limit=2)
table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_004893	Con	121351.4264

2 rows x 3 columns

Loading only some rows#

If you only want a subset of the contents of a file, use the FilteringParser. This allows skipping certain lines by using a callback function. We illustrate this with stats.tsv, skipping any rows with "Ratio" > 10.

from cogent3.parse.table import FilteringParser

reader = FilteringParser(
    lambda line: float(line[2]) <= 10, with_header=True, sep="\t"
)
table = load_table("data/stats.tsv", reader=reader, digits=1)
table

Locus	Region	Ratio
NP_003077	Con	2.5
NP_005500	NonCon	0.0

2 rows x 3 columns

You can also negate a condition, which is useful if the condition is complex. In this example, it means keep the rows for which Ratio > 10.

reader = FilteringParser(
    lambda line: float(line[2]) <= 10, with_header=True, sep="\t", negate=True
)
table = load_table("data/stats.tsv", reader=reader, digits=1)
table

Locus	Region	Ratio
NP_004893	Con	121351.4
NP_005079	Con	9516595.0
NP_055852	NonCon	10933217.7

3 rows x 3 columns

Loading only some columns#

Specify the columns by their names.

from cogent3.parse.table import FilteringParser

reader = FilteringParser(columns=["Locus", "Ratio"], with_header=True, sep="\t")
table = load_table("data/stats.tsv", reader=reader)
table

Locus	Ratio
NP_003077	2.5386
NP_004893	121351.4264
NP_005079	9516594.9789
NP_005500	0.0000
NP_055852	10933217.7090

5 rows x 2 columns

Or, by their index.

from cogent3.parse.table import FilteringParser

reader = FilteringParser(columns=[0, -1], with_header=True, sep="\t")
table = load_table("data/stats.tsv", reader=reader)
table

Locus	Ratio
NP_003077	2.5386
NP_004893	121351.4264
NP_005079	9516594.9789
NP_005500	0.0000
NP_055852	10933217.7090

5 rows x 2 columns

Note

The negate argument does not affect the columns evaluated.

Load raw data as a list of lists of strings#

We just use FilteringParser.

from cogent3.parse.table import FilteringParser

reader = FilteringParser(with_header=True, sep="\t")
data = list(reader("data/stats.tsv"))

We just display the first two lines.

data[:2]

[['Locus', 'Region', 'Ratio'], ['NP_003077', 'Con', '2.5386013224378985']]

Note

The individual elements are all str.

Make a table from header and rows#

from cogent3 import make_table

header = ["A", "B", "C"]
rows = [range(3), range(3, 6), range(6, 9), range(9, 12)]
table = make_table(header=["A", "B", "C"], data=rows)
table

A	B	C
0	1	2
3	4	5
6	7	8
9	10	11

4 rows x 3 columns

Make a table from a `dict`#

For a dict with key’s as column headers.

from cogent3 import make_table

data = dict(A=[0, 3, 6], B=[1, 4, 7], C=[2, 5, 8])
table = make_table(data=data)
table

A	B	C
0	1	2
3	4	5
6	7	8

3 rows x 3 columns

Specify the column order when creating from a `dict`.#

table = make_table(header=["C", "A", "B"], data=data)
table

C	A	B
2	0	1
5	3	4
8	6	7

3 rows x 3 columns

Create the table with an index#

A Table can be indexed like a dict if you designate a column as the index (and that column has a unique value for every row).

table = load_table("data/stats.tsv", index_name="Locus")
table["NP_055852"]

Locus	Region	Ratio
NP_055852	NonCon	10933217.7090

1 rows x 3 columns

table["NP_055852", "Region"]

np.str_('NonCon')

Note

The index_name argument also applies when using make_table().

Create a table from a `pandas.DataFrame`#

from pandas import DataFrame

from cogent3 import make_table

data = dict(a=[0, 3], b=["a", "c"])
df = DataFrame(data=data)
table = make_table(data_frame=df)
table

a	b
0	a
3	c

2 rows x 2 columns

Create a table from header and rows#

from cogent3 import make_table

table = make_table(header=["a", "b"], data=[[0, "a"], [3, "c"]])
table

a	b
0	a
3	c

2 rows x 2 columns

Create a table from dict#

make_table() is the utility function for creating Table objects from standard python objects.

from cogent3 import make_table

data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data)
table

a	b
0	a
3	c

2 rows x 2 columns

Create a table from a 2D dict#

from cogent3 import make_table

d2D = {
    "edge.parent": {
        "NineBande": "root",
        "edge.1": "root",
        "DogFaced": "root",
        "Human": "edge.0",
    },
    "x": {
        "NineBande": 1.0,
        "edge.1": 1.0,
        "DogFaced": 1.0,
        "Human": 1.0,
    },
    "length": {
        "NineBande": 4.0,
        "edge.1": 4.0,
        "DogFaced": 4.0,
        "Human": 4.0,
    },
}
table = make_table(
    data=d2D,
)
table

edge.parent	x	length
root	1.0000	4.0000
root	1.0000	4.0000
root	1.0000	4.0000
edge.0	1.0000	4.0000

4 rows x 3 columns

Create a table that has complex python objects as elements#

from cogent3 import make_table

table = make_table(
    header=["abcd", "data"],
    data=[[range(1, 6), "0"], ["x", 5.0], ["y", None]],
    missing_data="*",
    digits=1,
)
table

abcd	data
range(1, 6)	0
x	5.0
y	None

3 rows x 2 columns

Create an empty table#

from cogent3 import make_table

table = make_table()
table

0 rows x 0 columns

Adding a new column#

from cogent3 import make_table

table = make_table()
table.columns["a"] = [1, 3, 5]
table.columns["b"] = [2, 4, 6]
table

a	b
1	2
3	4
5	6

3 rows x 2 columns

Add a title and a legend to a table#

This can be done when you create the table.

from cogent3 import make_table

data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data, title="Sample title", legend="a legend")
table

Sample title
a legend
a	b
0	a
3	c

2 rows x 2 columns

It can be done by directly assigning to the corresponding attributes.

data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data)
table.title = "My title"
table

My title
a	b
0	a
3	c

2 rows x 2 columns

Iterating over table rows#

Table is a row oriented object. Iterating on the table returns each row as a new Table instance.

from cogent3 import load_table

table = load_table("data/stats.tsv")
for row in table:
    print(row)
    break

=============================
Locus        Region     Ratio
-----------------------------
NP_003077    Con       2.5386
-----------------------------

The resulting rows can be indexed using their column names.

for row in table:
    print(row["Locus"])

NP_003077
NP_004893
NP_005079
NP_005500
NP_055852

How many rows are there?#

The Table.shape attribute is like that of a numpy array. The first element (Table.shape[0]) is the number of rows.

from cogent3 import make_table

data = dict(a=[0, 3, 5], b=["a", "c", "d"])
table = make_table(data=data)
table.shape[0] == 3

True

How many columns are there?#

Table.shape[1] is the number of columns. Using the table from above.

table.shape[1] == 2

True

Iterating over table columns#

The Table.columns attribute is a Columns instance, an object with dict attributes.

from cogent3 import load_table

table = load_table("data/stats.tsv")
table.columns

Columns('Locus': <U9, 'Region': <U6, 'Ratio': float64)

table.columns["Region"]

array(['Con', 'Con', 'Con', 'NonCon', 'NonCon'], dtype='<U6')

So iteration is the same as for dicts.

for name in table.columns:
    print(name)

Locus
Region
Ratio

Table slicing using column names#

table = load_table("data/stats.tsv")
table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_005500	NonCon	0.0000
NP_055852	NonCon	10933217.7090

5 rows x 3 columns

Slice using the column name.

table[:2, "Region":]

Region	Ratio
Con	2.5386
Con	121351.4264

2 rows x 2 columns

Table slicing using indices#

table = load_table("data/stats.tsv")
table[:2, :1]

Locus
NP_003077
NP_004893

2 rows x 1 columns

Changing displayed numerical precision#

We change the Ratio column to using scientific notation.

from cogent3 import load_table

table = load_table("data/stats.tsv")
table.format_column("Ratio", "%.1e")
table

Locus	Region	Ratio
NP_003077	Con	2.5e+00
NP_004893	Con	1.2e+05
NP_005079	Con	9.5e+06
NP_005500	NonCon	7.4e-08
NP_055852	NonCon	1.1e+07

5 rows x 3 columns

Change digits or column spacing#

This can be done on table loading,

table = load_table("data/stats.tsv", digits=1, space=2)
table

Locus	Region	Ratio
NP_003077	Con	2.5
NP_004893	Con	121351.4
NP_005079	Con	9516595.0
NP_005500	NonCon	0.0
NP_055852	NonCon	10933217.7

5 rows x 3 columns

or, for spacing at least, by modifying the attributes

table.space = "    "
table

Locus	Region	Ratio
NP_003077	Con	2.5
NP_004893	Con	121351.4
NP_005079	Con	9516595.0
NP_005500	NonCon	0.0
NP_055852	NonCon	10933217.7

5 rows x 3 columns

Wrapping tables for display#

Wrapping generates neat looking tables whether or not you index the table rows. We demonstrate here

from cogent3 import make_table

h = ["name", "A/C", "A/G", "A/T", "C/A"]
rows = [["tardigrade", 0.0425, 0.1424, 0.0226, 0.0391]]
wrap_table = make_table(header=h, data=rows, max_width=30)
wrap_table

name	A/C	A/G
tardigrade	0.0425	0.1424

A/T	C/A
0.0226	0.0391

1 rows x 5 columns

wrap_table = make_table(header=h, data=rows, max_width=30, index_name="name")
wrap_table

name	A/C	A/G
tardigrade	0.0425	0.1424

name	A/T	C/A
tardigrade	0.0226	0.0391

1 rows x 5 columns

Display the top of a table using `head()`#

table = make_table(data=dict(a=list(range(10)), b=list(range(10))))
table.head()

a	b
0	0
1	1
2	2
3	3
4	4

Top 5 rows from 10 rows x 2 columns

You change how many rows are displayed.

table.head(2)

a	b
0	0
1	1

Top 2 rows from 10 rows x 2 columns

The table shape is that of the original table.

Display the bottom of a table using `tail()`#

table.tail()

a	b
5	5
6	6
7	7
8	8
9	9

Bottom 5 rows from 10 rows x 2 columns

You change how many rows are displayed.

table.tail(1)

a	b
9	9

Bottom 1 rows from 10 rows x 2 columns

Display random rows from a table#

table.set_repr_policy(random=3)
table

a	b
1	1
4	4
5	5

Random selection of 3 rows from 10 rows x 2 columns

Change the number of rows displayed by `repr()`#

table.set_repr_policy(head=2, tail=3)
table

a	b
0	0
1	1
...	...
7	7
8	8
9	9

Top 2 and bottom 3 rows from 10 rows x 2 columns

Note

The ... indicates the break between the top and bottom rows.

Changing column headings#

The table header is immutable. Changing column headings is done as follows.

table = load_table("data/stats.tsv")
print(table.header)
table = table.with_new_header("Ratio", "Stat")
print(table.header)

('Locus', 'Region', 'Ratio')
('Locus', 'Region', 'Stat')

Adding a new column#

from cogent3 import make_table

table = make_table()
table

0 rows x 0 columns

table.columns["a"] = [1, 3, 5]
table.columns["b"] = [2, 4, 6]
table

a	b
1	2
3	4
5	6

3 rows x 2 columns

Create a new column from existing ones#

This can be used to take a single, or multiple columns and generate a new column of values. Here we’ll take 2 columns and return True/False based on a condition.

table = load_table("data/stats.tsv")
table = table.with_new_column(
    "LargeCon",
    lambda r_v: r_v[0] == "Con" and r_v[1] > 10.0,
    columns=["Region", "Ratio"],
)
table

Locus	Region	Ratio	LargeCon
NP_003077	Con	2.5386	False
NP_004893	Con	121351.4264	True
NP_005079	Con	9516594.9789	True
NP_005500	NonCon	0.0000	False
NP_055852	NonCon	10933217.7090	False

5 rows x 4 columns

Get table data as a numpy array#

table = load_table("data/stats.tsv")
table.array

array([['NP_003077', 'Con', 2.5386013224378985],
       ['NP_004893', 'Con', 121351.42635634111],
       ['NP_005079', 'Con', 9516594.978886133],
       ['NP_005500', 'NonCon', 7.382703020266491e-08],
       ['NP_055852', 'NonCon', 10933217.708952725]], dtype=object)

Get a table column as a list#

Via the Table.to_list() method.

table = load_table("data/stats.tsv")
locus = table.to_list("Locus")
locus

['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852']

Or directly from the column array object.

table.columns["Locus"].tolist()

['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852']

Note

table.columns["Locus"] is a numpy.ndarray, hence the different method call.

Get multiple table columns as a list#

This returns a row oriented list.

table = load_table("data/stats.tsv")
rows = table.to_list(["Region", "Locus"])
rows

[['Con', 'NP_003077'],
 ['Con', 'NP_004893'],
 ['Con', 'NP_005079'],
 ['NonCon', 'NP_005500'],
 ['NonCon', 'NP_055852']]

Note

column name order dictates the element order per row

Get the table as a row oriented `dict`#

Keys in the resulting dict are the row indices, the value is a dict of column name, value pairs.

table = load_table("data/stats.tsv")
table.to_dict()

{0: {'Locus': 'NP_003077', 'Region': 'Con', 'Ratio': 2.5386013224378985},
{'Locus': 'NP_004893', 'Region': 'Con', 'Ratio': 121351.42635634111},
{'Locus': 'NP_005079', 'Region': 'Con', 'Ratio': 9516594.978886133},
{'Locus': 'NP_005500', 'Region': 'NonCon', 'Ratio': 7.382703020266491e-08},
{'Locus': 'NP_055852', 'Region': 'NonCon', 'Ratio': 10933217.708952725}}

Get the table as a column oriented `dict`#

Keys in the resulting dict are the column names, the value is a list.

table = load_table("data/stats.tsv")
table.columns.to_dict()

{'Locus': ['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852'],
 'Region': ['Con', 'Con', 'Con', 'NonCon', 'NonCon'],
 'Ratio': [2.5386013224378985,
  121351.42635634111,
  9516594.978886133,
  7.382703020266491e-08,
  10933217.708952725]}

Get the table as a `pandas.DataFrame`#

table = load_table("data/stats.tsv")
df = table.to_pandas()
df

	Locus	Region	Ratio
0	NP_003077	Con	2.538601e+00
1	NP_004893	Con	1.213514e+05
2	NP_005079	Con	9.516595e+06
3	NP_005500	NonCon	7.382703e-08
4	NP_055852	NonCon	1.093322e+07

You can also specify column(s) are categories

df = table.to_pandas(categories="Region")

Get a table of counts as a contingency table#

If our table consists of counts data, the Table can convert it into a CategoryCount instance that can be used for performing basic contingency table statistical tests, e.g. chisquare, G-test of independence, etc.. To do this, we must specify which column contains the row names using the index_name argument.

table = make_table(data={"Ts": [31, 58], "Tv": [36, 138], "": ["syn", "nsyn"]}, index_name="")
table

	Ts	Tv
syn	31	36
nsyn	58	138

2 rows x 3 columns

contingency = table.to_categorical(["Ts", "Tv"])
contingency

Observed
	Ts	Tv
syn	31	36
nsyn	58	138

Expected
	Ts	Tv
syn	22.6730	44.3270
nsyn	66.3270	129.6730

Residuals
	Ts	Tv
syn	1.7488	-1.2507
nsyn	-1.0225	0.7312

g_test = contingency.G_independence()
g_test

G-test for independence (with Williams correction)
G	df	pvalue
5.973	1	0.0145

Observed
	Ts	Tv
nsyn	58	138
syn	31	36

Expected
	Ts	Tv
nsyn	66.3270	129.6730
syn	22.6730	44.3270

Residuals
	Ts	Tv
nsyn	-1.0225	0.7312
syn	1.7488	-1.2507

Alternatively, you could also specify the index_name of the category column as

table = make_table(data={"Ts": [31, 58], "Tv": [36, 138], "": ["syn", "nsyn"]})
contingency = table.to_categorical(["Ts", "Tv"], index_name="")

Appending tables#

Warning

Only for tables with the same columns.

Can be done without specifying a new column (set the first argument to appended to be None). Here we simply use the same table data.

table1 = load_table("data/stats.tsv")
table2 = load_table("data/stats.tsv")
table = table1.appended(None, table2)
table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_005500	NonCon	0.0000
NP_055852	NonCon	10933217.7090
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_005500	NonCon	0.0000
NP_055852	NonCon	10933217.7090

10 rows x 3 columns

Specifying with a new column. In this case, the value of the table.title becomes the value for the new column.

table1.title = "Data1"
table2.title = "Data2"
table = table1.appended("Data#", table2, title="")
table

Data#	Locus	Region	Ratio
Data1	NP_003077	Con	2.5386
Data1	NP_004893	Con	121351.4264
Data1	NP_005079	Con	9516594.9789
Data1	NP_005500	NonCon	0.0000
Data1	NP_055852	NonCon	10933217.7090
Data2	NP_003077	Con	2.5386
Data2	NP_004893	Con	121351.4264
Data2	NP_005079	Con	9516594.9789
Data2	NP_005500	NonCon	0.0000
Data2	NP_055852	NonCon	10933217.7090

10 rows x 4 columns

Note

We assigned an empty string to title, otherwise the resulting table has the same title attribute as that of table1.

Summing a single column#

table = load_table("data/stats.tsv")
table.summed("Ratio")

np.float64(20571166.652796596)

Because each column is just a numpy.ndarray, this also can be done directly via the array methods.

table.columns["Ratio"].sum()

np.float64(20571166.652796596)

Summing multiple columns or rows - strictly numerical data#

We define a strictly numerical table,

from cogent3 import make_table

all_numeric = make_table(
    header=["A", "B", "C"], data=[range(3), range(3, 6), range(6, 9), range(9, 12)]
)
all_numeric

A	B	C
0	1	2
3	4	5
6	7	8
9	10	11

4 rows x 3 columns

and sum all columns (default condition)

all_numeric.summed()

[np.int64(18), np.int64(22), np.int64(26)]

and all rows

all_numeric.summed(col_sum=False)

[3, 12, 21, 30]

Summing multiple columns or rows with mixed non-numeric/numeric data#

We define a table with mixed data, like a distance matrix.

mixed = make_table(
    header=["A", "B", "C"], data=[["*", 1, 2], [3, "*", 5], [6, 7, "*"]]
)
mixed

A	B	C
*	1	2
3	*	5
6	7	*

3 rows x 3 columns

and sum all columns (default condition), ignoring non-numerical data

mixed.summed(strict=False)

[9, 8, 7]

and all rows

mixed.summed(col_sum=False, strict=False)

[3, 8, 13]

Filtering table rows#

We can do this by providing a reference to an external function

table = load_table("data/stats.tsv")
sub_table = table.filtered(lambda x: x < 10.0, columns="Ratio")
sub_table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_005500	NonCon	0.0000

2 rows x 3 columns

or using valid python syntax within a string, which is executed

sub_table = table.filtered("Ratio < 10.0")
sub_table

Locus	Region	Ratio
NP_003077	Con	2.5386
NP_005500	NonCon	0.0000

2 rows x 3 columns

You can also filter for values in multiple columns

sub_table = table.filtered("Ratio < 10.0 and Region == 'NonCon'")
sub_table

Locus	Region	Ratio
NP_005500	NonCon	0.0000

1 rows x 3 columns

Filtering table columns#

We select only columns that have a sum > 20 from the all_numeric table constructed above.

big_numeric = all_numeric.filtered_by_column(lambda x: sum(x) > 20)
big_numeric

B	C
1	2
4	5
7	8
10	11

4 rows x 2 columns

Standard sorting#

table = load_table("data/stats.tsv")
table.sorted(columns="Ratio")

Locus	Region	Ratio
NP_005500	NonCon	0.0000
NP_003077	Con	2.5386
NP_004893	Con	121351.4264
NP_005079	Con	9516594.9789
NP_055852	NonCon	10933217.7090

5 rows x 3 columns

Reverse sorting#

table.sorted(columns="Ratio", reverse="Ratio")

Locus	Region	Ratio
NP_055852	NonCon	10933217.7090
NP_005079	Con	9516594.9789
NP_004893	Con	121351.4264
NP_003077	Con	2.5386
NP_005500	NonCon	0.0000

5 rows x 3 columns

Sorting involving multiple columns, one reversed#

table.sorted(columns=["Region", "Ratio"], reverse="Ratio")

Locus	Region	Ratio
NP_005079	Con	9516594.9789
NP_004893	Con	121351.4264
NP_003077	Con	2.5386
NP_055852	NonCon	10933217.7090
NP_005500	NonCon	0.0000

5 rows x 3 columns

Getting raw data for a single column#

table = load_table("data/stats.tsv")
raw = table.to_list("Region")
raw

['Con', 'Con', 'Con', 'NonCon', 'NonCon']

Getting raw data for multiple columns#

table = load_table("data/stats.tsv")
raw = table.to_list(["Locus", "Region"])
raw

[['NP_003077', 'Con'],
 ['NP_004893', 'Con'],
 ['NP_005079', 'Con'],
 ['NP_005500', 'NonCon'],
 ['NP_055852', 'NonCon']]

Getting distinct values#

table = load_table("data/stats.tsv")
assert table.distinct_values("Region") == set(["NonCon", "Con"])

Counting occurrences of values#

table = load_table("data/stats.tsv")
assert table.count("Region == 'NonCon' and Ratio > 1") == 1

Counting unique values#

This returns a CategoryCounter, a dict like class.

from cogent3 import make_table

table = make_table(
    data=dict(A=["a", "b", "b", "b", "a"], B=["c", "c", "c", "c", "d"])
)
unique = table.count_unique("A")
type(unique)

cogent3.maths.stats.number.CategoryCounter

unique

CategoryCounter({'a': 2, 'b': 3})

For multiple columns.

unique = table.count_unique(["A", "B"])
unique

CategoryCounter({('a', 'c'): 1, ('b', 'c'): 3, ('a', 'd'): 1})

r = unique.to_table()
r

key	count
('a', 'c')	1
('b', 'c')	3
('a', 'd')	1

3 rows x 2 columns

Joining or merging tables#

We do a standard inner join here for a restricted subset. We must specify the columns that will be used for the join. Here we just use Locus.

rows = [
    ["NP_004893", True],
    ["NP_005079", True],
    ["NP_005500", False],
    ["NP_055852", False],
]
region_type = make_table(header=["Locus", "LargeCon"], data=rows)
stats_table = load_table("data/stats.tsv")
new = stats_table.joined(region_type, columns_self="Locus")
new

Locus	Region	Ratio	right_LargeCon
NP_004893	Con	121351.4264	True
NP_005079	Con	9516594.9789	True
NP_005500	NonCon	0.0000	False
NP_055852	NonCon	10933217.7090	False

4 rows x 4 columns

Note

If the tables have titles, column names are prefixed with those instead of right_.

Note

The joined() method is just a wrapper for the inner_join() and cross_join() (row cartesian product) methods, which you can use directly.

Transpose a table#

from cogent3 import make_table

header = ["#OTU ID", "14SK041", "14SK802"]
rows = [
    [-2920, "332", 294],
    [-1606, "302", 229],
    [-393, 141, 125],
    [-2109, 138, 120],
]
table = make_table(header=header, rows=rows)
table

#OTU ID	14SK041	14SK802
-2920	332	294
-1606	302	229
-393	141	125
-2109	138	120

4 rows x 3 columns

We require a new column heading for the current header data. We also need to specify which existing column will become the header.

tp = table.transposed(new_column_name="sample", select_as_header="#OTU ID")
tp

sample	-2920	-1606	-393	-2109
14SK041	332	302	141	138
14SK802	294	229	125	120

2 rows x 5 columns

Specify markdown as the `str()` format#

Using the method provides finer control over formatting.

from cogent3 import load_table

table = load_table("data/stats.tsv", format_name="md")
print(table)

|     Locus | Region |         Ratio |
|-----------|--------|---------------|
| NP_003077 |    Con |        2.5386 |
| NP_004893 |    Con |   121351.4264 |
| NP_005079 |    Con |  9516594.9789 |
| NP_005500 | NonCon |        0.0000 |
| NP_055852 | NonCon | 10933217.7090 |

Specify latex as the `str()` format#

Using the method provides finer control over formatting.

from cogent3 import load_table

table = load_table("data/stats.tsv", format_name="tex")
print(table)

\begin{table}[htp!]
\centering
\begin{tabular}{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 &    Con &        2.5386 \\
NP_004893 &    Con &   121351.4264 \\
NP_005079 &    Con &  9516594.9789 \\
NP_005500 & NonCon &        0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\end{table}

Get a table as a markdown formatted string#

We use the justify argument to indicate the column justification.

table = load_table("data/stats.tsv")
print(table.to_markdown(justify="ccr"))

|     Locus | Region |         Ratio |
|:---------:|:------:|--------------:|
| NP_003077 |    Con |        2.5386 |
| NP_004893 |    Con |   121351.4264 |
| NP_005079 |    Con |  9516594.9789 |
| NP_005500 | NonCon |        0.0000 |
| NP_055852 | NonCon | 10933217.7090 |

Get a table as a latex formatted string#

table = load_table(
    "data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_latex(justify="ccr", label="tab:table1"))

\begin{table}[htp!]
\centering
\begin{tabular}{ c c r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 &    Con &        2.5386 \\
NP_004893 &    Con &   121351.4264 \\
NP_005079 &    Con &  9516594.9789 \\
NP_005500 & NonCon &        0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\caption{Some stats.}
\label{tab:table1}
\end{table}

Get a table as a restructured text csv-table#

table = load_table(
    "data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_rst(csv_table=True))

.. csv-table:: Some stats.
    :header: "Locus", "Region", "Ratio"

    NP_003077, Con, 2.5386
    NP_004893, Con, 121351.4264
    NP_005079, Con, 9516594.9789
    NP_005500, NonCon, 0.0000
    NP_055852, NonCon, 10933217.7090

Get a table as a restructured text grid table#

table = load_table(
    "data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_rst())

+------------------------------------+
|            Some stats.             |
+-----------+--------+---------------+
|     Locus | Region |         Ratio |
+===========+========+===============+
| NP_003077 |    Con |        2.5386 |
+-----------+--------+---------------+
| NP_004893 |    Con |   121351.4264 |
+-----------+--------+---------------+
| NP_005079 |    Con |  9516594.9789 |
+-----------+--------+---------------+
| NP_005500 | NonCon |        0.0000 |
+-----------+--------+---------------+
| NP_055852 | NonCon | 10933217.7090 |
+-----------+--------+---------------+

Getting a latex format table with `to_string()`#

It is also possible to specify column alignment, table caption and other arguments.

table = load_table("data/stats.tsv")
print(table.to_string(format_name="latex"))

\begin{table}[htp!]
\centering
\begin{tabular}{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 &    Con &        2.5386 \\
NP_004893 &    Con &   121351.4264 \\
NP_005079 &    Con &  9516594.9789 \\
NP_005500 & NonCon &        0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\end{table}

Getting a bedGraph format with `to_string()`#

This format allows display of annotation tracks on genome browsers. A small sample of a bigger table.

bgraph.head()

chrom	start	end	value
1	100	101	1.1230
1	101	102	1.1230
1	102	103	1.1230
1	103	104	1.1230
1	104	105	1.1230

Top 5 rows from 32 rows x 4 columns

Then converted.

print(
    bgraph.to_string(
        format_name="bedgraph",
        name="test track",
        description="test of bedgraph",
        color=(255, 0, 0),
        digits=0,
    )
)

track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1	100	118	1.00
1	118	161	2.00

Getting a table as html#

from cogent3 import load_table

table = load_table("data/stats.tsv")
straight_html = table.to_html()

What formats can be written?#

Appending any of the following to a filename will cause that format to be used for writing.

from cogent3.format.table import known_formats

known_formats

('bedgraph',
 'phylip',
 'rest',
 'rst',
 'markdown',
 'md',
 'latex',
 'tex',
 'html',
 'simple',
 'csv',
 'tsv')

Writing a latex formmated file#

table.write("stats_tab.tex", justify="ccr", label="tab:table1")

Writing delimited formats#

The delimiter can be specified explicitly using the sep argument or implicitly via the file name suffix.

table = load_table("data/stats.tsv")
table.write("stats_tab.txt", sep="\t")

Tabular data#

Loading a csv file#

Loading from a url#

Loading delimited specifying the format#

Loading delimited data without a header line#

Using load_delimited()#

Using FilteringParser#

Selectively loading parts of a big file#

Loading a set number of lines from a file#

Loading only some rows#

Loading only some columns#

Load raw data as a list of lists of strings#

Make a table from header and rows#

Make a table from a dict#

Specify the column order when creating from a dict.#

Create the table with an index#

Create a table from a pandas.DataFrame#

Create a table from header and rows#

Create a table from dict#

Create a table from a 2D dict#

Create a table that has complex python objects as elements#

Create an empty table#

Adding a new column#

Add a title and a legend to a table#

Iterating over table rows#

How many rows are there?#

How many columns are there?#

Iterating over table columns#

Table slicing using column names#

Table slicing using indices#

Changing displayed numerical precision#

Change digits or column spacing#

Wrapping tables for display#

Display the top of a table using head()#

Display the bottom of a table using tail()#

Display random rows from a table#

Change the number of rows displayed by repr()#

Changing column headings#

Adding a new column#

Create a new column from existing ones#

Get table data as a numpy array#

Get a table column as a list#

Get multiple table columns as a list#

Get the table as a row oriented dict#

Get the table as a column oriented dict#

Get the table as a pandas.DataFrame#

Get a table of counts as a contingency table#

Appending tables#

Summing a single column#

Summing multiple columns or rows - strictly numerical data#

Summing multiple columns or rows with mixed non-numeric/numeric data#

Filtering table rows#

Filtering table columns#

Standard sorting#

Reverse sorting#

Sorting involving multiple columns, one reversed#

Getting raw data for a single column#

Getting raw data for multiple columns#

Getting distinct values#

Counting occurrences of values#

Counting unique values#

Joining or merging tables#

Transpose a table#

Specify markdown as the str() format#

Specify latex as the str() format#

Get a table as a markdown formatted string#

Get a table as a latex formatted string#

Get a table as a restructured text csv-table#

Get a table as a restructured text grid table#

Getting a latex format table with to_string()#

Getting a bedGraph format with to_string()#

Getting a table as html#

What formats can be written?#

Writing a latex formmated file#

Writing delimited formats#

Using `load_delimited()`#

Using `FilteringParser`#

Make a table from a `dict`#

Specify the column order when creating from a `dict`.#

Create a table from a `pandas.DataFrame`#

Display the top of a table using `head()`#

Display the bottom of a table using `tail()`#

Change the number of rows displayed by `repr()`#

Get the table as a row oriented `dict`#

Get the table as a column oriented `dict`#

Get the table as a `pandas.DataFrame`#

Specify markdown as the `str()` format#

Specify latex as the `str()` format#

Getting a latex format table with `to_string()`#

Getting a bedGraph format with `to_string()`#