Tabular data#
Table
handles tabular data, storing as columns in a, you guessed it, columns
attribute. The latter acts like a dictionary, with the column names as the keys and the column values being numpy.ndarray
instances. The table itself is iterable over rows.
Note
Table
is immutable at the level of the individual ndarray
not being writable.
Loading a csv file#
We load a tab separated data file using the load_table()
function. The format is inferred from the filename suffix and you will note, in this case, it’s not actually a csv file.
from cogent3 import load_table
table = load_table("data/stats.tsv")
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_005500 | NonCon | 0.0000 |
NP_055852 | NonCon | 10933217.7090 |
5 rows x 3 columns
Note
The known filename suffixes for reading are .csv
, .tsv
and .pkl
or .pickle
(Python’s pickle format).
Note
If you invoke the static column types argument, i.e.``load_table(…, static_column_types=True)`` and the column data are not static, those columns will be left as a string type.
Loading from a url#
The cogent3
load functions support loading from a url. We load the above .tsv
file directly from GitHub.
from cogent3 import load_table
table = load_table("https://raw.githubusercontent.com/cogent3/cogent3/develop/doc/data/stats.tsv")
Loading delimited specifying the format#
Although unnecessary in this case, it’s possible to override the suffix by specifying the delimiter using the sep
argument.
from cogent3 import load_table
table = load_table("data/stats.tsv", sep="\t")
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_005500 | NonCon | 0.0000 |
NP_055852 | NonCon | 10933217.7090 |
5 rows x 3 columns
Loading delimited data without a header line#
To create a table from the follow examples, you specify your header and use make_table()
.
Using load_delimited()
#
This is just a standard parsing function which does not do any filtering or converting elements to non-string types.
from cogent3.parse.table import load_delimited
header, rows, title, legend = load_delimited("data/CerebellumDukeDNaseSeq.pk", header=False, sep="\t")
rows[:4]
[['chr1',
'29214',
'29566',
'chr1.1',
'626',
'.',
'0.0724',
'3.9',
'-1',
'159'],
['chr1',
'89933',
'90118',
'chr1.2',
'511',
'.',
'0.0313',
'1.59',
'-1',
'94'],
['chr1',
'545979',
'546193',
'chr1.3',
'543',
'.',
'0.0428',
'2.23',
'-1',
'100'],
['chr1',
'713797',
'714639',
'chr1.4',
'1000',
'.',
'0.3215',
'16.0',
'-1',
'380']]
Using FilteringParser
#
from cogent3.parse.table import FilteringParser
reader = FilteringParser(with_header=False, sep="\t")
rows = list(reader("data/CerebellumDukeDNaseSeq.pk"))
rows[:4]
[['chr1',
'29214',
'29566',
'chr1.1',
'626',
'.',
'0.0724',
'3.9',
'-1',
'159'],
['chr1',
'89933',
'90118',
'chr1.2',
'511',
'.',
'0.0313',
'1.59',
'-1',
'94'],
['chr1',
'545979',
'546193',
'chr1.3',
'543',
'.',
'0.0428',
'2.23',
'-1',
'100'],
['chr1',
'713797',
'714639',
'chr1.4',
'1000',
'.',
'0.3215',
'16.0',
'-1',
'380']]
Selectively loading parts of a big file#
Loading a set number of lines from a file#
The limit
argument specifies the number of lines to read.
from cogent3 import load_table
table = load_table("data/stats.tsv", limit=2)
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
2 rows x 3 columns
Loading only some rows#
If you only want a subset of the contents of a file, use the FilteringParser
. This allows skipping certain lines by using a callback function. We illustrate this with stats.tsv
, skipping any rows with "Ratio"
> 10.
from cogent3.parse.table import FilteringParser
reader = FilteringParser(
lambda line: float(line[2]) <= 10, with_header=True, sep="\t"
)
table = load_table("data/stats.tsv", reader=reader, digits=1)
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5 |
NP_005500 | NonCon | 0.0 |
2 rows x 3 columns
You can also negate
a condition, which is useful if the condition is complex. In this example, it means keep the rows for which Ratio > 10
.
reader = FilteringParser(
lambda line: float(line[2]) <= 10, with_header=True, sep="\t", negate=True
)
table = load_table("data/stats.tsv", reader=reader, digits=1)
table
Locus | Region | Ratio |
---|---|---|
NP_004893 | Con | 121351.4 |
NP_005079 | Con | 9516595.0 |
NP_055852 | NonCon | 10933217.7 |
3 rows x 3 columns
Loading only some columns#
Specify the columns by their names.
from cogent3.parse.table import FilteringParser
reader = FilteringParser(columns=["Locus", "Ratio"], with_header=True, sep="\t")
table = load_table("data/stats.tsv", reader=reader)
table
Locus | Ratio |
---|---|
NP_003077 | 2.5386 |
NP_004893 | 121351.4264 |
NP_005079 | 9516594.9789 |
NP_005500 | 0.0000 |
NP_055852 | 10933217.7090 |
5 rows x 2 columns
Or, by their index.
from cogent3.parse.table import FilteringParser
reader = FilteringParser(columns=[0, -1], with_header=True, sep="\t")
table = load_table("data/stats.tsv", reader=reader)
table
Locus | Ratio |
---|---|
NP_003077 | 2.5386 |
NP_004893 | 121351.4264 |
NP_005079 | 9516594.9789 |
NP_005500 | 0.0000 |
NP_055852 | 10933217.7090 |
5 rows x 2 columns
Note
The negate
argument does not affect the columns evaluated.
Load raw data as a list of lists of strings#
We just use FilteringParser
.
from cogent3.parse.table import FilteringParser
reader = FilteringParser(with_header=True, sep="\t")
data = list(reader("data/stats.tsv"))
We just display the first two lines.
data[:2]
[['Locus', 'Region', 'Ratio'], ['NP_003077', 'Con', '2.5386013224378985']]
Note
The individual elements are all str
.
Make a table from header and rows#
from cogent3 import make_table
header = ["A", "B", "C"]
rows = [range(3), range(3, 6), range(6, 9), range(9, 12)]
table = make_table(header=["A", "B", "C"], data=rows)
table
A | B | C |
---|---|---|
0 | 1 | 2 |
3 | 4 | 5 |
6 | 7 | 8 |
9 | 10 | 11 |
4 rows x 3 columns
Make a table from a dict
#
For a dict
with key’s as column headers.
from cogent3 import make_table
data = dict(A=[0, 3, 6], B=[1, 4, 7], C=[2, 5, 8])
table = make_table(data=data)
table
A | B | C |
---|---|---|
0 | 1 | 2 |
3 | 4 | 5 |
6 | 7 | 8 |
3 rows x 3 columns
Specify the column order when creating from a dict
.#
table = make_table(header=["C", "A", "B"], data=data)
table
C | A | B |
---|---|---|
2 | 0 | 1 |
5 | 3 | 4 |
8 | 6 | 7 |
3 rows x 3 columns
Create the table with an index#
A Table
can be indexed like a dict if you designate a column as the index (and that column has a unique value for every row).
table = load_table("data/stats.tsv", index_name="Locus")
table["NP_055852"]
Locus | Region | Ratio |
---|---|---|
NP_055852 | NonCon | 10933217.7090 |
1 rows x 3 columns
table["NP_055852", "Region"]
np.str_('NonCon')
Note
The index_name
argument also applies when using make_table()
.
Create a table from a pandas.DataFrame
#
from pandas import DataFrame
from cogent3 import make_table
data = dict(a=[0, 3], b=["a", "c"])
df = DataFrame(data=data)
table = make_table(data_frame=df)
table
a | b |
---|---|
0 | a |
3 | c |
2 rows x 2 columns
Create a table from header and rows#
from cogent3 import make_table
table = make_table(header=["a", "b"], data=[[0, "a"], [3, "c"]])
table
a | b |
---|---|
0 | a |
3 | c |
2 rows x 2 columns
Create a table from dict#
make_table()
is the utility function for creating Table
objects from standard python objects.
from cogent3 import make_table
data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data)
table
a | b |
---|---|
0 | a |
3 | c |
2 rows x 2 columns
Create a table from a 2D dict#
from cogent3 import make_table
d2D = {
"edge.parent": {
"NineBande": "root",
"edge.1": "root",
"DogFaced": "root",
"Human": "edge.0",
},
"x": {
"NineBande": 1.0,
"edge.1": 1.0,
"DogFaced": 1.0,
"Human": 1.0,
},
"length": {
"NineBande": 4.0,
"edge.1": 4.0,
"DogFaced": 4.0,
"Human": 4.0,
},
}
table = make_table(
data=d2D,
)
table
edge.parent | x | length |
---|---|---|
root | 1.0000 | 4.0000 |
root | 1.0000 | 4.0000 |
root | 1.0000 | 4.0000 |
edge.0 | 1.0000 | 4.0000 |
4 rows x 3 columns
Create a table that has complex python objects as elements#
from cogent3 import make_table
table = make_table(
header=["abcd", "data"],
data=[[range(1, 6), "0"], ["x", 5.0], ["y", None]],
missing_data="*",
digits=1,
)
table
abcd | data |
---|---|
range(1, 6) | 0 |
x | 5.0 |
y | None |
3 rows x 2 columns
Create an empty table#
from cogent3 import make_table
table = make_table()
table
0 rows x 0 columns
Adding a new column#
from cogent3 import make_table
table = make_table()
table.columns["a"] = [1, 3, 5]
table.columns["b"] = [2, 4, 6]
table
a | b |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
3 rows x 2 columns
Add a title and a legend to a table#
This can be done when you create the table.
from cogent3 import make_table
data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data, title="Sample title", legend="a legend")
table
a | b |
---|---|
0 | a |
3 | c |
2 rows x 2 columns
It can be done by directly assigning to the corresponding attributes.
data = dict(a=[0, 3], b=["a", "c"])
table = make_table(data=data)
table.title = "My title"
table
a | b |
---|---|
0 | a |
3 | c |
2 rows x 2 columns
Iterating over table rows#
Table
is a row oriented object. Iterating on the table returns each row as a new Table
instance.
from cogent3 import load_table
table = load_table("data/stats.tsv")
for row in table:
print(row)
break
=============================
Locus Region Ratio
-----------------------------
NP_003077 Con 2.5386
-----------------------------
The resulting rows can be indexed using their column names.
for row in table:
print(row["Locus"])
NP_003077
NP_004893
NP_005079
NP_005500
NP_055852
How many rows are there?#
The Table.shape
attribute is like that of a numpy
array
. The first element (Table.shape[0]
) is the number of rows.
from cogent3 import make_table
data = dict(a=[0, 3, 5], b=["a", "c", "d"])
table = make_table(data=data)
table.shape[0] == 3
True
How many columns are there?#
Table.shape[1]
is the number of columns. Using the table from above.
table.shape[1] == 2
True
Iterating over table columns#
The Table.columns
attribute is a Columns
instance, an object with dict
attributes.
from cogent3 import load_table
table = load_table("data/stats.tsv")
table.columns
Columns('Locus': <U9, 'Region': <U6, 'Ratio': float64)
table.columns["Region"]
array(['Con', 'Con', 'Con', 'NonCon', 'NonCon'], dtype='<U6')
So iteration is the same as for dicts.
for name in table.columns:
print(name)
Locus
Region
Ratio
Table slicing using column names#
table = load_table("data/stats.tsv")
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_005500 | NonCon | 0.0000 |
NP_055852 | NonCon | 10933217.7090 |
5 rows x 3 columns
Slice using the column name.
table[:2, "Region":]
Region | Ratio |
---|---|
Con | 2.5386 |
Con | 121351.4264 |
2 rows x 2 columns
Table slicing using indices#
table = load_table("data/stats.tsv")
table[:2, :1]
Locus |
---|
NP_003077 |
NP_004893 |
2 rows x 1 columns
Changing displayed numerical precision#
We change the Ratio
column to using scientific notation.
from cogent3 import load_table
table = load_table("data/stats.tsv")
table.format_column("Ratio", "%.1e")
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5e+00 |
NP_004893 | Con | 1.2e+05 |
NP_005079 | Con | 9.5e+06 |
NP_005500 | NonCon | 7.4e-08 |
NP_055852 | NonCon | 1.1e+07 |
5 rows x 3 columns
Change digits or column spacing#
This can be done on table loading,
table = load_table("data/stats.tsv", digits=1, space=2)
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5 |
NP_004893 | Con | 121351.4 |
NP_005079 | Con | 9516595.0 |
NP_005500 | NonCon | 0.0 |
NP_055852 | NonCon | 10933217.7 |
5 rows x 3 columns
or, for spacing at least, by modifying the attributes
table.space = " "
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5 |
NP_004893 | Con | 121351.4 |
NP_005079 | Con | 9516595.0 |
NP_005500 | NonCon | 0.0 |
NP_055852 | NonCon | 10933217.7 |
5 rows x 3 columns
Wrapping tables for display#
Wrapping generates neat looking tables whether or not you index the table rows. We demonstrate here
from cogent3 import make_table
h = ["name", "A/C", "A/G", "A/T", "C/A"]
rows = [["tardigrade", 0.0425, 0.1424, 0.0226, 0.0391]]
wrap_table = make_table(header=h, data=rows, max_width=30)
wrap_table
name | A/C | A/G |
---|---|---|
tardigrade | 0.0425 | 0.1424 |
A/T | C/A |
---|---|
0.0226 | 0.0391 |
1 rows x 5 columns
wrap_table = make_table(header=h, data=rows, max_width=30, index_name="name")
wrap_table
name | A/C | A/G |
---|---|---|
tardigrade | 0.0425 | 0.1424 |
name | A/T | C/A |
---|---|---|
tardigrade | 0.0226 | 0.0391 |
1 rows x 5 columns
Display the top of a table using head()
#
table = make_table(data=dict(a=list(range(10)), b=list(range(10))))
table.head()
a | b |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
Top 5 rows from 10 rows x 2 columns
You change how many rows are displayed.
table.head(2)
a | b |
---|---|
0 | 0 |
1 | 1 |
Top 2 rows from 10 rows x 2 columns
The table shape is that of the original table.
Display the bottom of a table using tail()
#
table.tail()
a | b |
---|---|
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
Bottom 5 rows from 10 rows x 2 columns
You change how many rows are displayed.
table.tail(1)
a | b |
---|---|
9 | 9 |
Bottom 1 rows from 10 rows x 2 columns
Display random rows from a table#
table.set_repr_policy(random=3)
table
a | b |
---|---|
3 | 3 |
4 | 4 |
7 | 7 |
Random selection of 3 rows from 10 rows x 2 columns
Change the number of rows displayed by repr()
#
table.set_repr_policy(head=2, tail=3)
table
a | b |
---|---|
0 | 0 |
1 | 1 |
... | ... |
7 | 7 |
8 | 8 |
9 | 9 |
Top 2 and bottom 3 rows from 10 rows x 2 columns
Note
The ...
indicates the break between the top and bottom rows.
Changing column headings#
The table header
is immutable. Changing column headings is done as follows.
table = load_table("data/stats.tsv")
print(table.header)
table = table.with_new_header("Ratio", "Stat")
print(table.header)
('Locus', 'Region', 'Ratio')
('Locus', 'Region', 'Stat')
Adding a new column#
from cogent3 import make_table
table = make_table()
table
0 rows x 0 columns
table.columns["a"] = [1, 3, 5]
table.columns["b"] = [2, 4, 6]
table
a | b |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
3 rows x 2 columns
Create a new column from existing ones#
This can be used to take a single, or multiple columns and generate a new column of values. Here we’ll take 2 columns and return True/False based on a condition.
table = load_table("data/stats.tsv")
table = table.with_new_column(
"LargeCon",
lambda r_v: r_v[0] == "Con" and r_v[1] > 10.0,
columns=["Region", "Ratio"],
)
table
Locus | Region | Ratio | LargeCon |
---|---|---|---|
NP_003077 | Con | 2.5386 | False |
NP_004893 | Con | 121351.4264 | True |
NP_005079 | Con | 9516594.9789 | True |
NP_005500 | NonCon | 0.0000 | False |
NP_055852 | NonCon | 10933217.7090 | False |
5 rows x 4 columns
Get table data as a numpy array#
table = load_table("data/stats.tsv")
table.array
array([['NP_003077', 'Con', 2.5386013224378985],
['NP_004893', 'Con', 121351.42635634111],
['NP_005079', 'Con', 9516594.978886133],
['NP_005500', 'NonCon', 7.382703020266491e-08],
['NP_055852', 'NonCon', 10933217.708952725]], dtype=object)
Get a table column as a list#
Via the Table.to_list()
method.
table = load_table("data/stats.tsv")
locus = table.to_list("Locus")
locus
['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852']
Or directly from the column array object.
table.columns["Locus"].tolist()
['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852']
Note
table.columns["Locus"]
is a numpy.ndarray
, hence the different method call.
Get multiple table columns as a list#
This returns a row oriented list.
table = load_table("data/stats.tsv")
rows = table.to_list(["Region", "Locus"])
rows
[['Con', 'NP_003077'],
['Con', 'NP_004893'],
['Con', 'NP_005079'],
['NonCon', 'NP_005500'],
['NonCon', 'NP_055852']]
Note
column name order dictates the element order per row
Get the table as a row oriented dict
#
Keys in the resulting dict are the row indices, the value is a dict of column name, value pairs.
table = load_table("data/stats.tsv")
table.to_dict()
{0: {'Locus': 'NP_003077', 'Region': 'Con', 'Ratio': 2.5386013224378985},
1: {'Locus': 'NP_004893', 'Region': 'Con', 'Ratio': 121351.42635634111},
2: {'Locus': 'NP_005079', 'Region': 'Con', 'Ratio': 9516594.978886133},
3: {'Locus': 'NP_005500', 'Region': 'NonCon', 'Ratio': 7.382703020266491e-08},
4: {'Locus': 'NP_055852', 'Region': 'NonCon', 'Ratio': 10933217.708952725}}
Get the table as a column oriented dict
#
Keys in the resulting dict are the column names, the value is a list.
table = load_table("data/stats.tsv")
table.columns.to_dict()
{'Locus': ['NP_003077', 'NP_004893', 'NP_005079', 'NP_005500', 'NP_055852'],
'Region': ['Con', 'Con', 'Con', 'NonCon', 'NonCon'],
'Ratio': [2.5386013224378985,
121351.42635634111,
9516594.978886133,
7.382703020266491e-08,
10933217.708952725]}
Get the table as a pandas.DataFrame
#
table = load_table("data/stats.tsv")
df = table.to_pandas()
df
Locus | Region | Ratio | |
---|---|---|---|
0 | NP_003077 | Con | 2.538601e+00 |
1 | NP_004893 | Con | 1.213514e+05 |
2 | NP_005079 | Con | 9.516595e+06 |
3 | NP_005500 | NonCon | 7.382703e-08 |
4 | NP_055852 | NonCon | 1.093322e+07 |
You can also specify column(s) are categories
df = table.to_pandas(categories="Region")
Get a table of counts as a contingency table#
If our table consists of counts data, the Table
can convert it into a CategoryCount
instance that can be used for performing basic contingency table statistical tests, e.g. chisquare, G-test of independence, etc.. To do this, we must specify which column contains the row names using the index_name
argument.
table = make_table(data={"Ts": [31, 58], "Tv": [36, 138], "": ["syn", "nsyn"]}, index_name="")
table
Ts | Tv | |
---|---|---|
syn | 31 | 36 |
nsyn | 58 | 138 |
2 rows x 3 columns
contingency = table.to_categorical(["Ts", "Tv"])
contingency
Ts | Tv | |
---|---|---|
syn | 31 | 36 |
nsyn | 58 | 138 |
Ts | Tv | |
---|---|---|
syn | 22.6730 | 44.3270 |
nsyn | 66.3270 | 129.6730 |
Ts | Tv | |
---|---|---|
syn | 1.7488 | -1.2507 |
nsyn | -1.0225 | 0.7312 |
g_test = contingency.G_independence()
g_test
G | df | pvalue |
---|---|---|
5.973 | 1 | 0.0145 |
Ts | Tv | |
---|---|---|
nsyn | 58 | 138 |
syn | 31 | 36 |
Ts | Tv | |
---|---|---|
nsyn | 66.3270 | 129.6730 |
syn | 22.6730 | 44.3270 |
Ts | Tv | |
---|---|---|
nsyn | -1.0225 | 0.7312 |
syn | 1.7488 | -1.2507 |
Alternatively, you could also specify the index_name
of the category column as
table = make_table(data={"Ts": [31, 58], "Tv": [36, 138], "": ["syn", "nsyn"]})
contingency = table.to_categorical(["Ts", "Tv"], index_name="")
Appending tables#
Warning
Only for tables with the same columns.
Can be done without specifying a new column (set the first argument to appended
to be None
). Here we simply use the same table data.
table1 = load_table("data/stats.tsv")
table2 = load_table("data/stats.tsv")
table = table1.appended(None, table2)
table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_005500 | NonCon | 0.0000 |
NP_055852 | NonCon | 10933217.7090 |
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_005500 | NonCon | 0.0000 |
NP_055852 | NonCon | 10933217.7090 |
10 rows x 3 columns
Specifying with a new column. In this case, the value of the table.title
becomes the value for the new column.
table1.title = "Data1"
table2.title = "Data2"
table = table1.appended("Data#", table2, title="")
table
Data# | Locus | Region | Ratio |
---|---|---|---|
Data1 | NP_003077 | Con | 2.5386 |
Data1 | NP_004893 | Con | 121351.4264 |
Data1 | NP_005079 | Con | 9516594.9789 |
Data1 | NP_005500 | NonCon | 0.0000 |
Data1 | NP_055852 | NonCon | 10933217.7090 |
Data2 | NP_003077 | Con | 2.5386 |
Data2 | NP_004893 | Con | 121351.4264 |
Data2 | NP_005079 | Con | 9516594.9789 |
Data2 | NP_005500 | NonCon | 0.0000 |
Data2 | NP_055852 | NonCon | 10933217.7090 |
10 rows x 4 columns
Note
We assigned an empty string to title
, otherwise the resulting table has the same title
attribute as that of table1
.
Summing a single column#
table = load_table("data/stats.tsv")
table.summed("Ratio")
np.float64(20571166.652796596)
Because each column is just a numpy.ndarray
, this also can be done directly via the array methods.
table.columns["Ratio"].sum()
np.float64(20571166.652796596)
Summing multiple columns or rows - strictly numerical data#
We define a strictly numerical table,
from cogent3 import make_table
all_numeric = make_table(
header=["A", "B", "C"], data=[range(3), range(3, 6), range(6, 9), range(9, 12)]
)
all_numeric
A | B | C |
---|---|---|
0 | 1 | 2 |
3 | 4 | 5 |
6 | 7 | 8 |
9 | 10 | 11 |
4 rows x 3 columns
and sum all columns (default condition)
all_numeric.summed()
[np.int64(18), np.int64(22), np.int64(26)]
and all rows
all_numeric.summed(col_sum=False)
[3, 12, 21, 30]
Summing multiple columns or rows with mixed non-numeric/numeric data#
We define a table with mixed data, like a distance matrix.
mixed = make_table(
header=["A", "B", "C"], data=[["*", 1, 2], [3, "*", 5], [6, 7, "*"]]
)
mixed
A | B | C |
---|---|---|
* | 1 | 2 |
3 | * | 5 |
6 | 7 | * |
3 rows x 3 columns
and sum all columns (default condition), ignoring non-numerical data
mixed.summed(strict=False)
[9, 8, 7]
and all rows
mixed.summed(col_sum=False, strict=False)
[3, 8, 13]
Filtering table rows#
We can do this by providing a reference to an external function
table = load_table("data/stats.tsv")
sub_table = table.filtered(lambda x: x < 10.0, columns="Ratio")
sub_table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_005500 | NonCon | 0.0000 |
2 rows x 3 columns
or using valid python syntax within a string, which is executed
sub_table = table.filtered("Ratio < 10.0")
sub_table
Locus | Region | Ratio |
---|---|---|
NP_003077 | Con | 2.5386 |
NP_005500 | NonCon | 0.0000 |
2 rows x 3 columns
You can also filter for values in multiple columns
sub_table = table.filtered("Ratio < 10.0 and Region == 'NonCon'")
sub_table
Locus | Region | Ratio |
---|---|---|
NP_005500 | NonCon | 0.0000 |
1 rows x 3 columns
Filtering table columns#
We select only columns that have a sum > 20 from the all_numeric
table constructed above.
big_numeric = all_numeric.filtered_by_column(lambda x: sum(x) > 20)
big_numeric
B | C |
---|---|
1 | 2 |
4 | 5 |
7 | 8 |
10 | 11 |
4 rows x 2 columns
Standard sorting#
table = load_table("data/stats.tsv")
table.sorted(columns="Ratio")
Locus | Region | Ratio |
---|---|---|
NP_005500 | NonCon | 0.0000 |
NP_003077 | Con | 2.5386 |
NP_004893 | Con | 121351.4264 |
NP_005079 | Con | 9516594.9789 |
NP_055852 | NonCon | 10933217.7090 |
5 rows x 3 columns
Reverse sorting#
table.sorted(columns="Ratio", reverse="Ratio")
Locus | Region | Ratio |
---|---|---|
NP_055852 | NonCon | 10933217.7090 |
NP_005079 | Con | 9516594.9789 |
NP_004893 | Con | 121351.4264 |
NP_003077 | Con | 2.5386 |
NP_005500 | NonCon | 0.0000 |
5 rows x 3 columns
Sorting involving multiple columns, one reversed#
table.sorted(columns=["Region", "Ratio"], reverse="Ratio")
Locus | Region | Ratio |
---|---|---|
NP_005079 | Con | 9516594.9789 |
NP_004893 | Con | 121351.4264 |
NP_003077 | Con | 2.5386 |
NP_055852 | NonCon | 10933217.7090 |
NP_005500 | NonCon | 0.0000 |
5 rows x 3 columns
Getting raw data for a single column#
table = load_table("data/stats.tsv")
raw = table.to_list("Region")
raw
['Con', 'Con', 'Con', 'NonCon', 'NonCon']
Getting raw data for multiple columns#
table = load_table("data/stats.tsv")
raw = table.to_list(["Locus", "Region"])
raw
[['NP_003077', 'Con'],
['NP_004893', 'Con'],
['NP_005079', 'Con'],
['NP_005500', 'NonCon'],
['NP_055852', 'NonCon']]
Getting distinct values#
table = load_table("data/stats.tsv")
assert table.distinct_values("Region") == set(["NonCon", "Con"])
Counting occurrences of values#
table = load_table("data/stats.tsv")
assert table.count("Region == 'NonCon' and Ratio > 1") == 1
Counting unique values#
This returns a CategoryCounter
, a dict like class.
from cogent3 import make_table
table = make_table(
data=dict(A=["a", "b", "b", "b", "a"], B=["c", "c", "c", "c", "d"])
)
unique = table.count_unique("A")
type(unique)
cogent3.maths.stats.number.CategoryCounter
unique
CategoryCounter({'a': 2, 'b': 3})
For multiple columns.
unique = table.count_unique(["A", "B"])
unique
CategoryCounter({('a', 'c'): 1, ('b', 'c'): 3, ('a', 'd'): 1})
r = unique.to_table()
r
key | count |
---|---|
('a', 'c') | 1 |
('b', 'c') | 3 |
('a', 'd') | 1 |
3 rows x 2 columns
Joining or merging tables#
We do a standard inner join here for a restricted subset. We must specify the columns that will be used for the join. Here we just use Locus
.
rows = [
["NP_004893", True],
["NP_005079", True],
["NP_005500", False],
["NP_055852", False],
]
region_type = make_table(header=["Locus", "LargeCon"], data=rows)
stats_table = load_table("data/stats.tsv")
new = stats_table.joined(region_type, columns_self="Locus")
new
Locus | Region | Ratio | right_LargeCon |
---|---|---|---|
NP_004893 | Con | 121351.4264 | True |
NP_005079 | Con | 9516594.9789 | True |
NP_005500 | NonCon | 0.0000 | False |
NP_055852 | NonCon | 10933217.7090 | False |
4 rows x 4 columns
Note
If the tables have titles, column names are prefixed with those instead of right_
.
Note
The joined()
method is just a wrapper for the inner_join()
and cross_join()
(row cartesian product) methods, which you can use directly.
Transpose a table#
from cogent3 import make_table
header = ["#OTU ID", "14SK041", "14SK802"]
rows = [
[-2920, "332", 294],
[-1606, "302", 229],
[-393, 141, 125],
[-2109, 138, 120],
]
table = make_table(header=header, rows=rows)
table
#OTU ID | 14SK041 | 14SK802 |
---|---|---|
-2920 | 332 | 294 |
-1606 | 302 | 229 |
-393 | 141 | 125 |
-2109 | 138 | 120 |
4 rows x 3 columns
We require a new column heading for the current header data. We also need to specify which existing column will become the header.
tp = table.transposed(new_column_name="sample", select_as_header="#OTU ID")
tp
sample | -2920 | -1606 | -393 | -2109 |
---|---|---|---|---|
14SK041 | 332 | 302 | 141 | 138 |
14SK802 | 294 | 229 | 125 | 120 |
2 rows x 5 columns
Specify markdown as the str()
format#
Using the method provides finer control over formatting.
from cogent3 import load_table
table = load_table("data/stats.tsv", format="md")
print(table)
| Locus | Region | Ratio |
|-----------|--------|---------------|
| NP_003077 | Con | 2.5386 |
| NP_004893 | Con | 121351.4264 |
| NP_005079 | Con | 9516594.9789 |
| NP_005500 | NonCon | 0.0000 |
| NP_055852 | NonCon | 10933217.7090 |
Specify latex as the str()
format#
Using the method provides finer control over formatting.
from cogent3 import load_table
table = load_table("data/stats.tsv", format="tex")
print(table)
\begin{table}[htp!]
\centering
\begin{tabular}{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 & Con & 2.5386 \\
NP_004893 & Con & 121351.4264 \\
NP_005079 & Con & 9516594.9789 \\
NP_005500 & NonCon & 0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\end{table}
Get a table as a markdown formatted string#
We use the justify
argument to indicate the column justification.
table = load_table("data/stats.tsv")
print(table.to_markdown(justify="ccr"))
| Locus | Region | Ratio |
|:---------:|:------:|--------------:|
| NP_003077 | Con | 2.5386 |
| NP_004893 | Con | 121351.4264 |
| NP_005079 | Con | 9516594.9789 |
| NP_005500 | NonCon | 0.0000 |
| NP_055852 | NonCon | 10933217.7090 |
Get a table as a latex formatted string#
table = load_table(
"data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_latex(justify="ccr", label="tab:table1"))
\begin{table}[htp!]
\centering
\begin{tabular}{ c c r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 & Con & 2.5386 \\
NP_004893 & Con & 121351.4264 \\
NP_005079 & Con & 9516594.9789 \\
NP_005500 & NonCon & 0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\caption{Some stats.}
\label{tab:table1}
\end{table}
Get a table as a restructured text csv-table#
table = load_table(
"data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_rst(csv_table=True))
.. csv-table:: Some stats.
:header: "Locus", "Region", "Ratio"
NP_003077, Con, 2.5386
NP_004893, Con, 121351.4264
NP_005079, Con, 9516594.9789
NP_005500, NonCon, 0.0000
NP_055852, NonCon, 10933217.7090
Get a table as a restructured text grid table#
table = load_table(
"data/stats.tsv", title="Some stats.", legend="Derived from something."
)
print(table.to_rst())
+------------------------------------+
| Some stats. |
+-----------+--------+---------------+
| Locus | Region | Ratio |
+===========+========+===============+
| NP_003077 | Con | 2.5386 |
+-----------+--------+---------------+
| NP_004893 | Con | 121351.4264 |
+-----------+--------+---------------+
| NP_005079 | Con | 9516594.9789 |
+-----------+--------+---------------+
| NP_005500 | NonCon | 0.0000 |
+-----------+--------+---------------+
| NP_055852 | NonCon | 10933217.7090 |
+-----------+--------+---------------+
Getting a latex format table with to_string()
#
It is also possible to specify column alignment, table caption and other arguments.
table = load_table("data/stats.tsv")
print(table.to_string(format="latex"))
\begin{table}[htp!]
\centering
\begin{tabular}{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 & Con & 2.5386 \\
NP_004893 & Con & 121351.4264 \\
NP_005079 & Con & 9516594.9789 \\
NP_005500 & NonCon & 0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\end{table}
Getting a bedGraph format with to_string()
#
This format allows display of annotation tracks on genome browsers. A small sample of a bigger table.
bgraph.head()
chrom | start | end | value |
---|---|---|---|
1 | 100 | 101 | 1.1230 |
1 | 101 | 102 | 1.1230 |
1 | 102 | 103 | 1.1230 |
1 | 103 | 104 | 1.1230 |
1 | 104 | 105 | 1.1230 |
Top 5 rows from 32 rows x 4 columns
Then converted.
print(
bgraph.to_string(
format="bedgraph",
name="test track",
description="test of bedgraph",
color=(255, 0, 0),
digits=0,
)
)
track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1 100 118 1.00
1 118 161 2.00
Getting a table as html#
from cogent3 import load_table
table = load_table("data/stats.tsv")
straight_html = table.to_html()
What formats can be written?#
Appending any of the following to a filename will cause that format to be used for writing.
from cogent3.format.table import known_formats
known_formats
('bedgraph',
'phylip',
'rest',
'rst',
'markdown',
'md',
'latex',
'tex',
'html',
'simple',
'csv',
'tsv')
Writing a latex formmated file#
table.write("stats_tab.tex", justify="ccr", label="tab:table1")
Writing delimited formats#
The delimiter can be specified explicitly using the sep
argument or implicitly via the file name suffix.
table = load_table("data/stats.tsv")
table.write("stats_tab.txt", sep="\t")