Exploring tabular data#
When working with data in tables, the ability of quickly getting an overview about the data is key.
import pandas as pd
Loading CSV files from disk#
To ensure compatility beween different software for processing tabular data the CSV file format is commonly used. We can open those files using pandas.read_csv.
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
391 rows × 14 columns
Viewing the data#
Viewing data can be tricky, especially when working with large tables.
data.head(10) # top 10 rows
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
6 | 669.0 | 697.164 | 72.863 | 539.0 | 957.0 | 471.696 | 26.253 | 471.694 | 26.197 | 36.656 | 23.237 | 124.340 | 100 | A |
7 | 5.0 | 658.600 | 49.161 | 607.0 | 710.0 | 28.300 | 8.100 | 28.284 | 8.103 | 3.144 | 2.025 | 161.565 | 100 | A |
8 | 7.0 | 677.571 | 49.899 | 596.0 | 768.0 | 415.357 | 8.786 | 415.360 | 8.804 | 4.110 | 2.168 | 112.500 | 100 | A |
9 | 14.0 | 691.071 | 63.873 | 586.0 | 808.0 | 493.286 | 9.000 | 493.295 | 9.016 | 5.120 | 3.481 | 38.802 | 100 | C |
10 | 39.0 | 763.615 | 88.786 | 623.0 | 1016.0 | 157.526 | 12.731 | 157.592 | 12.757 | 8.815 | 5.633 | 46.437 | 100 | C |
data.tail(10) # bottom 10 rows
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
382 | 45.0 | 734.356 | 68.637 | 575.0 | 867.0 | 171.500 | 494.789 | 171.492 | 494.739 | 14.630 | 3.916 | 95.698 | 100 | B |
383 | 94.0 | 746.617 | 85.198 | 550.0 | 1021.0 | 194.032 | 498.223 | 194.014 | 498.239 | 17.295 | 6.920 | 52.720 | 100 | B |
384 | 35.0 | 776.257 | 74.746 | 611.0 | 961.0 | 268.957 | 493.586 | 268.977 | NaN | NaN | 5.990 | 111.193 | 100 | A |
385 | 35.0 | 739.286 | NaN | 593.0 | 928.0 | 291.871 | 493.843 | 291.871 | 493.806 | NaN | 5.352 | 79.368 | 100 | A |
386 | 14.0 | 736.143 | 81.533 | 646.0 | 902.0 | 315.000 | 493.000 | 314.989 | 493.003 | NaN | 3.676 | 45.000 | 100 | A |
387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
388 | 17.0 | 742.706 | 69.624 | 620.0 | 884.0 | 420.500 | 496.382 | 420.513 | NaN | NaN | 3.663 | 49.457 | 100 | A |
389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
Overview descriptive statistics#
To get a glimpse of the range of values which exist in the given table, we can ask the DateFrame to describe itself using DataFrame.describe()
. It will display count, mean, standard deviation and other descriptive statistics for each column in our table.
data.describe()
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 389.000000 | 386.000000 | 388.000000 | 388.000000 | 388.000000 | 389.000000 | 388.000000 | 388.000000 | 386.000000 | 383.000000 | 388.000000 | 390.000000 | 391.0 |
mean | 107.164524 | 743.455565 | 76.575309 | 610.414948 | 962.922680 | 256.419859 | 254.384088 | 256.183338 | 253.353005 | 12.481016 | 9.500662 | 86.598441 | 100.0 |
std | 241.037082 | 42.252140 | 31.844864 | 57.156709 | 244.897224 | 152.261694 | 155.080074 | 152.380388 | 154.426250 | 11.979176 | 49.714280 | 60.593686 | 0.0 |
min | 1.000000 | 587.000000 | 0.000000 | 516.000000 | 587.000000 | 3.978000 | 4.722000 | 4.012000 | 4.697000 | 1.128000 | 1.128000 | 0.000000 | 100.0 |
25% | 15.000000 | 717.060750 | 63.861000 | 570.750000 | 847.750000 | 127.142000 | 102.875250 | 126.923250 | 103.813750 | 5.098000 | 3.637250 | 34.517250 | 100.0 |
50% | 44.000000 | 741.077500 | 74.727000 | 599.000000 | 917.500000 | 243.300000 | 271.490000 | 242.288000 | 271.272000 | 9.374000 | 5.886000 | 89.703500 | 100.0 |
75% | 116.000000 | 767.260750 | 86.826500 | 633.250000 | 1014.500000 | 400.167000 | 395.058250 | 400.363500 | 393.800750 | 16.283000 | 9.017250 | 134.617250 | 100.0 |
max | 2755.000000 | 912.938000 | 377.767000 | 877.000000 | 3880.000000 | 508.214000 | 503.022000 | 508.169000 | 502.979000 | 144.475000 | 981.000000 | 568.000000 | 100.0 |
Sorting in tables#
In many cases, we are interested in table rows that contain the maximum value, e.g. in the area
column we can find the largest object:
data.sort_values(by = "Area", ascending=False)
Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
190 | 2755.0 | 859.928 | 235.458 | 539.0 | 3880.0 | 108.710 | 302.158 | 110.999 | 300.247 | 144.475 | 24.280 | 39.318 | 100 | C |
81 | 2295.0 | 765.239 | 96.545 | 558.0 | 1431.0 | 375.003 | 134.888 | 374.982 | 135.359 | 65.769 | 44.429 | 127.247 | 100 | B |
209 | 1821.0 | 847.761 | 122.074 | 600.0 | 1510.0 | 287.795 | 321.115 | 288.074 | 321.824 | 55.879 | 41.492 | 112.124 | 100 | A |
252 | 1528.0 | 763.777 | 83.183 | 572.0 | 1172.0 | 191.969 | 385.944 | 192.487 | 385.697 | 63.150 | 30.808 | 34.424 | 100 | B |
265 | 1252.0 | 793.371 | 117.139 | 579.0 | 1668.0 | 262.071 | 394.497 | 262.268 | 394.326 | 60.154 | 26.500 | 50.147 | 100 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 1.0 | 587.000 | 0.000 | 587.0 | 587.0 | 399.500 | 117.500 | 399.500 | 117.500 | 1.128 | 1.128 | 0.000 | 100 | A |
310 | 1.0 | 866.000 | 0.000 | 866.0 | 866.0 | 343.500 | 408.500 | 343.500 | 408.500 | 1.128 | 1.128 | 0.000 | 100 | A |
219 | 1.0 | 763.000 | 0.000 | 763.0 | 763.0 | 411.500 | 296.500 | 411.500 | 296.500 | 1.128 | 1.128 | 0.000 | 100 | A |
3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
391 rows × 14 columns