Pandas Made Simple: Analyze Real-World Data Like a Pro

#pandas

Pandas Python Made Simple

Pandas for Data Analysis

Table of Contents

Why Pandas ?

If you work with data in Python, chances are you’ve heard of Pandas, the powerful library that makes data analysis easy. Whether you're a data scientist, analyst, or just getting started, Pandas is a must-know tool.

In present world, python is a go to programming language for Data Science, AI & ML. For those technologies, Data's are like oxygen. Without data, we will not be able to think of those technologies.

Data's are scattered in different formats and types. Handling those with a single tool is what pandas tries to achieve.

Easy and efficient data handling.
Works with different file formats like CSV, JSON, EXCEl, SQL DataBase etc.
Integrates easily with other tools like Matplotlib, Scikit, Numpy etc.
Beginner friendly.

Time line

time line

Pandas

Installation

ryeuvpipconda

$ rye init --virtual
$ rye add pandas

$ uv init .
$ uv pip install pandas

$ pip install pandas

$ conda install pandas

Launch in Binder | No configuration or downloads required

No need to worry about configuring your system with libraries and data to follow along with this article.

You can run it online using binder.

Part-1 :
Part-2 :

Dependencies

---
config:
  sankey:
    showValues: false
---

sankey-beta

%% source,target,value
Numpy, required, 30
python-dateutil, required, 30
pytz, required, 30
tzdata, required, 30



Visualization, matplotlib,  10
Visualization, Jinja2,  10
Visualization, tabulate,  10
optional, Visualization, 30

Performance, numexpr,  10
Performance, bottleneck,  10
Performance, numba,  10
optional, Performance, 30

Computation, Scipy,  10
Computation, xarray,  10
optional, Computation, 20

Excel Files, xlrd, 10
Excel Files, xlsxwriter, 10
Excel Files, openpyxl, 10
Excel Files, pyxlsb, 10
Excel Files, python-calamine, 10
optional, Excel Files , 50

BeautifulSoup4, HTML Files, 10
html5lib, HTML Files, 10
lxml, HTML Files, 10
HTML Files, optional, 30

lxml, XML, 10
XML, optional, 10

SQLAlchemy, PostgresSQL, 10
psycopg2, PostgresSQL, 10
adbc-driver-postgresql, PostgresSQL, 10
PostgresSQL, DataBase, 30

SQLAlchemy,MySQL,10
pymysql,MySQL,10
MySQL, DataBase, 20

SQLAlchemy, SQLite,10
adbc-driver-sqlite, SQLite, 10
SQLite, DataBase, 20

DataBase, optional, 70

required, pandas , 120
pandas, optional, 120

Required Dependencies

The list of packages gets installed along with Pandas for its operations.

kanban
required[Required Dependency]
  numpy[Numpy]@{ticket: Version, assigned: '1.224' }
  dateutil[python-dateutil]@{ticket: Version, assigned: '2.8.2' }
  pytz[pytz]@{ticket: Version, assigned: '2020.1' }
  tzdata[tzdatal]@{ticket: Version, assigned: '2022.7' }

Optional Dependencies

It has many optional dependencies, to improvise the performance, visualization, accessing particular API's or methods.

Performance

While working with large files, it is advised to install performance dependencies for pandas using

kanban
  [Performance]
    [numexpr]
    [bottleneck]
    [numba]

ryeuvpipconda

$ rye add pandas[performance]

$ uv pip install pandas[performance]

$ pip install pandas[performance]

$ conda install pandas[performance]

Visualization : Plotting & Formatting

For plotting graph using pandas api, optional dependency matplotlib can be installed along with pandas using pip-extra [plot] for visualization and for markdown and DataFrame styles using output-formatting.

kanban
  [Plot]
    [matplotlib]

  [Output-formatting]
    [Jinja2]
    [tabulate]

ryeuvpipconda

$ rye add pandas[plot, output-formatting]

$ uv pip install pandas[plot, output-formatting]

$ pip install pandas[plot, output-formatting]

$ conda install pandas[plot, output-formatting]

Computation

For N-dimensional data and statistical functions

kanban
  [Computation]
    [Scipy]
    [xarray]

ryeuvpipconda

$ rye add pandas[computation]

$ uv pip install pandas[computation]

$ pip install pandas[computation]

$ conda install pandas[computation]

Excel files

To work with excel files, it necessary to install optional dependencies along with pandas.

kanban
  [Excel Files]
    [xlrd]@{ticket: reading Excel}
    [xlsxwriter]@{ticket: Writing Excel}
    [openpyxl]@{ticket: read/write xlsx files}
    [pyxlsb]@{ticket: reading xlsb files}
    [python-calamine]@{ticket: Reading for xls/xlsx/xlsb/ods files}

  [HTML Files]
    [BeautifulSoup4]@{ticket: requires lxml or html5lib}
    [html5lib]
    [lxml]

  [XML]
    [lxml]

ryeuvpipconda

$ rye add pandas[excel]

$ uv pip install pandas[excel]

$ pip install pandas[excel]

$ conda install pandas[excel]

HTML & XML file

ryeuvpipconda

$ rye add pandas[html]

$ uv pip install pandas[html]

$ pip install pandas[html]

$ conda install pandas[html]

SQL databases

To access data from database files, it necessary to install optional dependencies along with pandas.

kanban
  [PostgresSQL]
    [SQLAlchemy]@{assigned: 'postgresql'}
    [psycopg2]@{assigned: 'postgresql'}
    [adbc-driver-postgresql]@{assigned: 'postgresql'}

  [MySQL]
    [SQLAlchemy]@{assigned: 'mysql'}
    [pymysql]@{assigned: 'mysql'}

  [SQLite]
    [SQLAlchemy]@{assigned: 'sql-other'}
    [adbc-driver-sqlite]@{assigned: 'sql-other'}

ryeuvpipconda

$ rye add pandas[postgresql, mysql, sql-other]

$ uv pip install pandas[postgresql, mysql, sql-other]

$ pip install pandas[postgresql, mysql, sql-other]

$ conda install pandas[postgresql, mysql, sql-other]

Info

All optional dependencies can be installed with

ryeuvpipconda

$ rye add pandas[all]

$ uv pip install pandas[all]

$ pip install pandas[all]

$ conda install pandas[all]

Pandas is part of major Python distributions like Anaconda, ActiveState Python, WinPython etc.
Install from GitHub repository. Pandas : Click here
For detailed information on Pandas installation, check out the official link

Data Structure

Series
DataFrame

Creating Pandas DataStructure

Pandas data structure comprises of rows and columns. Column represents the series data and Rows are represented by Index.

A single column data structure is known as Series.
Multi-column data structure is known as DataFrame.

Pandas DataFrame can be created using python objects, flat files, databases, API requests etc. In this article will go through python objects and CSV file with some examples. The complete list of supported sources can be found here.

Series

pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool | lib.NoDefault' = <no_default>,
)

pd.Series creates one-dimensional ndarray with axis labels including time series.

Creating Series using python objects.
list.
tuple.
dictionary.

Using Python List

import pandas as pd

even = [2, 4, 6, 8, 10, 12]

series_data = pd.Series(even)
print(series_data)

OUTPUT: output

From the output we can draw out.

Index of the series data is same as the index of the list data.
dtype indicates the data type of the series.

We can name the series or column.

series_data.name = "even"
print(series_data.info)

OUTPUT: output

Using Python Dictionary

import pandas as pd

student = {'name':"Kishore", 
            'age':20, 
            'place':"Bengaluru", 
            'email':"[email protected]",}
student_data = pd.Series(student)
print(student_data)

OUTPUT:

Index of the series data is same as the key of the dictionary data.
dtype indicates the data type of the series.
The object supports both integer and label based indexing

DataFrame

pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
)

DataFrame is a two-dimensional, size-mutable, tabular data. Similar to pandas Series, DataFrame can be created using python objects.

Using Python List

import pandas as pd

students = [
    ['Kishore', 'Bengaluru'], # use list or tuple
    ['Vikram',  'Hyderabad'],
    ['Vinay',  'Chennai'],
    ['Adithya',  'Mumbai'],
    ['Kumar',  'Kochi'],
]

students_df = pd.DataFrame(students)
print(students_df)

OUTPUT: output

0,1,2,3,4 indicates the index
0 and 1 column indicates the Series.
More than one Series data makes it a DataFrame.

Using Python Dictionary

import pandas as pd

students = {
    "name": ['Kishore', 'Vikram', 'Vinay', 'Adithya', 'Kumar'],
    "place": ['Bengaluru', 'Hyderabad', 'Chennai', 'Mumbai', 'Kochi']
}

students_df = pd.DataFrame(students)
print(students_df)

OUTPUT: output

0,1,2,3,4 indicates the index
name and place column indicates the Series.
More than one Series data makes it a DataFrame.
Data structure also contains labelled axes (rows and columns)

Reading CSV files

For this article, will use the Major Commercial Crops of India downloaded from the RBI Database

To follow along with this article download the crop data.

CSV Crop Data

Launch in Binder | No configuration or downloads required

No need to worry about configuring your system with libraries and data.

You can run it online using binder.

Part-1 :
Part-2 :

pd.read_csv() method is used to read Comma-Separated Values (CSV) file into DataFrame. Reading CSV files doesn't require optional dependencies to be installed (openpyxl required to work with xlsx files).

import pandas as pd

crop_info = pd.read_csv('./data/clean_data_crop.csv')

print(crop_info)

OUTPUT: output

Info

Working with other file formats like Excel, JSON, DataBase etc will be updated shortly in a new blog, link for the same will be listed over in this section.

Inspecting

Number of rows and columns

crop_info.shape

OUTPUT: output

Column names

crop_info.columns

OUTPUT: output

Data Types

crop_info.dtypes

OUTPUT:

Data overview can be seen with head() and tail() method.

crop_info.head()

OUTPUT: output

crop_info.tail()

OUTPUT: output

head() method outputs top 5 rows content
tail() method outputs last 5 row content

Information on DataFrames.

crop_info.info()

OUTPUT: output

Extracting Subsets

Selecting a particular Column.

Particular column can be selected using . notation or using index method [Series Name | Column Name].

crop_info.State
# or
# crop_info['State']

OUTPUT: output

Multiple Column Selection.

Multiple Column can be selected, by passing it as a list of column or series names

crop_info[['State','Year', 'Rice']]

OUTPUT: output

Selecting Row Data's.

crop_info[11:21]

OUTPUT: output

Indexing.

iloc[] by their position
loc[] by their name

Accessing rows using index number

crop_info.iloc[10:21]

OUTPUT: output

Accessing rows and columns using index number

pd.df[row_slicing , column_slicing ]
```
crop_info.iloc[30:35, 0:4]
```

OUTPUT: output

Accessing rows and columns using index names.

crop_info.iloc[30:35, 'State':'Wheat']

OUTPUT: output

Filtering

Data can be filtered based on true condition Using Boolean Mask. Boolean mask are created using a comparator operators, and using that mask, we can filter out the required data.

Boolean Mask using comparison operator.

crop_info['State'] == 'Karnataka'

OUTPUT: output

Using Boolean Mask filtering data that contains values "Karnataka".

crop_info[crop_info['State'] == 'Karnataka']

OUTPUT: output output output

Using Boolean Mask filtering data with rice value greater than 15000 and column data is limited to State, Year, Rice.

crop_info[crop_info['Rice']>15000][['State', 'Year', 'Rice']]

OUTPUT: output

Summary Statistics

value_counts()

crop_info.State.value_counts()

OUTPUT: output

unique()

crop_info.State.unique()

OUTPUT: output

nunique()

crop_info.State.nunique()

OUTPUT: output

describe()

crop_info.State.describe()

OUTPUT: output

crop_info.State.describe(include='all')

OUTPUT: output

Data Cleaning

Will be Updated Shortly

Plotting

Will be Updated Shortly