Pandas Made Simple: Analyze Real-World Data Like a Pro

Pandas for Data Analysis
Why Pandas?
If you work with data in Python, chances are you’ve heard of Pandas, the powerful library that makes data analysis easy. Whether you're a data scientist, analyst, or just getting started, Pandas is a must-know tool.
Today, Python is the go-to programming language for Data Science, AI, and ML, and for those technologies data is like oxygen: without data, they cannot exist.
Data is scattered across different formats and types; handling all of them with a single tool is what Pandas tries to achieve.
- Easy and efficient data handling.
- Works with different file formats like CSV, JSON, Excel, SQL databases, etc.
- Integrates easily with other tools like Matplotlib, scikit-learn, NumPy, etc.
- Beginner friendly.
Timeline

Pandas
Installation
Launch in Binder | No configuration or downloads required
No need to worry about configuring your system with libraries and data to follow along with this article.
You can run it online using binder.
Dependencies
---
config:
sankey:
showValues: false
---
sankey-beta
%% source,target,value
Numpy, required, 30
python-dateutil, required, 30
pytz, required, 30
tzdata, required, 30
Visualization, matplotlib, 10
Visualization, Jinja2, 10
Visualization, tabulate, 10
optional, Visualization, 30
Performance, numexpr, 10
Performance, bottleneck, 10
Performance, numba, 10
optional, Performance, 30
Computation, Scipy, 10
Computation, xarray, 10
optional, Computation, 20
Excel Files, xlrd, 10
Excel Files, xlsxwriter, 10
Excel Files, openpyxl, 10
Excel Files, pyxlsb, 10
Excel Files, python-calamine, 10
optional, Excel Files, 50
HTML Files, BeautifulSoup4, 10
HTML Files, html5lib, 10
HTML Files, lxml, 10
optional, HTML Files, 30
XML, lxml, 10
optional, XML, 10
PostgreSQL, SQLAlchemy, 10
PostgreSQL, psycopg2, 10
PostgreSQL, adbc-driver-postgresql, 10
DataBase, PostgreSQL, 30
MySQL, SQLAlchemy, 10
MySQL, pymysql, 10
DataBase, MySQL, 20
SQLite, SQLAlchemy, 10
SQLite, adbc-driver-sqlite, 10
DataBase, SQLite, 20
optional, DataBase, 70
required, pandas, 120
pandas, optional, 120
Required Dependencies
The following packages are installed along with Pandas for its operation.
kanban
  required[Required Dependency]
    numpy[Numpy]@{ticket: Version, assigned: '1.22.4'}
    dateutil[python-dateutil]@{ticket: Version, assigned: '2.8.2'}
    pytz[pytz]@{ticket: Version, assigned: '2020.1'}
    tzdata[tzdata]@{ticket: Version, assigned: '2022.7'}
Optional Dependencies
Pandas has many optional dependencies that improve performance or visualization, or unlock particular APIs and methods.
Performance
While working with large files, it is advised to install the performance dependencies for pandas using the pip extra `pip install "pandas[performance]"`.
kanban
  [Performance]
    [numexpr]
    [bottleneck]
    [numba]
Visualization : Plotting & Formatting
To plot graphs using the pandas API, the optional dependency matplotlib can be installed along with pandas using the pip extra `[plot]`; for Markdown output and DataFrame styling, use the `[output-formatting]` extra.
kanban
  [Plot]
    [matplotlib]
  [Output-formatting]
    [Jinja2]
    [tabulate]
Computation
For N-dimensional data and statistical functions
kanban
  [Computation]
    [Scipy]
    [xarray]
Excel files
To work with Excel files, it is necessary to install optional dependencies along with pandas.
kanban
  [Excel Files]
    [xlrd]@{ticket: 'reading Excel'}
    [xlsxwriter]@{ticket: 'writing Excel'}
    [openpyxl]@{ticket: 'read/write xlsx files'}
    [pyxlsb]@{ticket: 'reading xlsb files'}
    [python-calamine]@{ticket: 'reading xls/xlsx/xlsb/ods files'}
HTML & XML files
kanban
  [HTML Files]
    [BeautifulSoup4]@{ticket: 'requires lxml or html5lib'}
    [html5lib]
    [lxml]
  [XML]
    [lxml]
SQL databases
To access data from databases, it is necessary to install optional dependencies along with pandas.
kanban
  [PostgreSQL]
    [SQLAlchemy]@{assigned: 'postgresql'}
    [psycopg2]@{assigned: 'postgresql'}
    [adbc-driver-postgresql]@{assigned: 'postgresql'}
  [MySQL]
    [SQLAlchemy]@{assigned: 'mysql'}
    [pymysql]@{assigned: 'mysql'}
  [SQLite]
    [SQLAlchemy]@{assigned: 'sql-other'}
    [adbc-driver-sqlite]@{assigned: 'sql-other'}
Info
All optional dependencies can be installed with `pip install "pandas[all]"`.
- Pandas is included in major Python distributions such as Anaconda, ActiveState Python, and WinPython.
- Pandas can also be installed from its GitHub repository.
- For detailed information on installing Pandas, check the official installation guide.
Data Structure
- Series
- DataFrame
Creating Pandas Data Structures
A Pandas data structure comprises rows and columns. Columns represent the Series data, and rows are represented by the Index.
- A single column data structure is known as Series.
- Multi-column data structure is known as DataFrame.

Pandas DataFrames can be created from Python objects, flat files, databases, API requests, etc. In this article we will go through Python objects and CSV files with some examples. The complete list of supported sources can be found here.
Series
pd.Series(
data=None,
index=None,
dtype: 'Dtype | None' = None,
name=None,
copy: 'bool | None' = None,
fastpath: 'bool | lib.NoDefault' = <no_default>,
)
pd.Series creates a one-dimensional ndarray with axis labels (including time series).
- Creating a Series from Python objects:
- list
- tuple
- dictionary
Using Python List
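The original snippet is not shown here; a minimal sketch of creating a Series from a list (the values are illustrative) might look like:

```python
import pandas as pd

# A Series built from a list gets a default integer index (0, 1, 2, ...)
marks = [85, 90, 78]
marks_series = pd.Series(marks)
print(marks_series)
```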
OUTPUT: 
From the output we can observe:
- The index of the Series is the same as the index of the list.
- `dtype` indicates the data type of the Series.
We can also name the Series (i.e., the column).
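A sketch of naming a Series via the `name` parameter (the values here are illustrative):

```python
import pandas as pd

# The name parameter labels the Series; it becomes the column
# name if the Series is later placed into a DataFrame
marks_series = pd.Series([85, 90, 78], name='marks')
print(marks_series)
```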
OUTPUT: 
Using Python Dictionary
import pandas as pd

student = {
    'name': "Kishore",
    'age': 20,
    'place': "Bengaluru",
    'email': "[email protected]",
}
student_data = pd.Series(student)
print(student_data)
- The index of the Series is the same as the keys of the dictionary.
- `dtype` indicates the data type of the Series.
- The object supports both integer- and label-based indexing.
DataFrame
pd.DataFrame(
data=None,
index: 'Axes | None' = None,
columns: 'Axes | None' = None,
dtype: 'Dtype | None' = None,
copy: 'bool | None' = None,
)
DataFrame is a two-dimensional, size-mutable, tabular data structure. Like a Series, a DataFrame can be created from Python objects.
Using Python List
import pandas as pd

students = [
    ['Kishore', 'Bengaluru'],  # each row can be a list or tuple
    ['Vikram', 'Hyderabad'],
    ['Vinay', 'Chennai'],
    ['Adithya', 'Mumbai'],
    ['Kumar', 'Kochi'],
]
students_df = pd.DataFrame(students)
print(students_df)
OUTPUT: 
- `0, 1, 2, 3, 4` indicates the index.
- Columns `0` and `1` each represent a Series.
- More than one Series makes it a DataFrame.
Using Python Dictionary
import pandas as pd

students = {
    "name": ['Kishore', 'Vikram', 'Vinay', 'Adithya', 'Kumar'],
    "place": ['Bengaluru', 'Hyderabad', 'Chennai', 'Mumbai', 'Kochi'],
}
students_df = pd.DataFrame(students)
print(students_df)
OUTPUT: 
- `0, 1, 2, 3, 4` indicates the index.
- The `name` and `place` columns each represent a Series.
- More than one Series makes it a DataFrame.
- The data structure also contains labelled axes (rows and columns).
Reading CSV files
For this article, we will use the Major Commercial Crops of India dataset downloaded from the RBI Database.
To follow along with this article, download the crop data.
Launch in Binder | No configuration or downloads required
No need to worry about configuring your system with libraries and data.
You can run it online using binder.
The pd.read_csv() method reads a Comma-Separated Values (CSV) file into a DataFrame. Reading CSV files doesn't require any optional dependency to be installed (whereas, for example, openpyxl is required to work with xlsx files).
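A minimal sketch of reading crop data with pd.read_csv. The column names here are illustrative assumptions, and io.StringIO stands in for the path to the downloaded CSV file:

```python
import io
import pandas as pd

# io.StringIO stands in for the downloaded crop CSV's file path;
# in practice you would pass the path string directly to read_csv
csv_text = io.StringIO(
    "State,Year,Rice\n"
    "Karnataka,2020,15500\n"
    "Kerala,2020,9800\n"
)
crops_df = pd.read_csv(csv_text)
print(crops_df)
```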
OUTPUT: 
Info
Working with other file formats like Excel, JSON, DataBase etc will be updated shortly in a new blog, link for the same will be listed over in this section.
Inspecting
Number of rows and columns
OUTPUT: 
Column names
OUTPUT: 
Data Types
A data overview can be seen with the head() and tail() methods.
OUTPUT: 
OUTPUT: 
- `head()` outputs the top 5 rows.
- `tail()` outputs the last 5 rows.
Information on DataFrames.
OUTPUT: 
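The inspection calls above can be sketched together on a small stand-in DataFrame (the real crop data would replace it):

```python
import pandas as pd

# Illustrative stand-in for the crop DataFrame
df = pd.DataFrame({
    "State": ["Karnataka", "Kerala", "Punjab"],
    "Rice": [15500, 9800, 20000],
})

print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column names
print(df.dtypes)   # per-column data types
print(df.head())   # first 5 rows (fewer if the frame is smaller)
print(df.tail())   # last 5 rows
df.info()          # index, columns, dtypes, non-null counts, memory usage
```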
Extracting Subsets
Selecting a particular column.
- A particular column can be selected using `.` (dot) notation or using the index operator `df['column name']`.
OUTPUT: 
Multiple column selection.
- Multiple columns can be selected by passing a list of column (Series) names.
OUTPUT: 
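A sketch of both selection styles on an illustrative DataFrame (column names assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Karnataka", "Kerala"],
    "Year": [2020, 2020],
    "Rice": [15500, 9800],
})

# Single column: dot notation or the index operator -- both return a Series
rice = df.Rice
rice_too = df["Rice"]

# Multiple columns: pass a list of names -- this returns a DataFrame
subset = df[["State", "Rice"]]
print(subset)
```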
Selecting row data.
OUTPUT: 
Indexing.
- `iloc[]` selects rows by position.
- `loc[]` selects rows by label (name).
Accessing rows using index number
OUTPUT: 
Accessing rows and columns using index number
df.iloc[row_slice, column_slice]
OUTPUT: 
Accessing rows and columns using index names.
OUTPUT: 
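The position- and label-based access above might look like this on a small stand-in frame (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [2019, 2020, 2021], "Rice": [14000, 15500, 16000]},
    index=["Karnataka", "Kerala", "Punjab"],
)

# iloc: position-based -- rows 0..1, all columns
print(df.iloc[0:2, :])

# loc: label-based -- row by index name, column by column name
print(df.loc["Kerala", "Rice"])
```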
Filtering
Data can be filtered on a condition using a Boolean mask. Boolean masks are created with comparison operators, and indexing with the mask filters out the required data.
Boolean Mask using comparison operator.
OUTPUT: 
Using a Boolean mask to filter the rows whose value is "Karnataka".
OUTPUT:

Using a Boolean mask to filter rows with a Rice value greater than 15000, limiting the columns to State, Year, and Rice.
OUTPUT: 
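A minimal sketch of the masking pattern, with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Karnataka", "Kerala", "Karnataka"],
    "Year": [2019, 2020, 2021],
    "Rice": [14000, 9800, 16000],
})

# A comparison produces a Boolean mask (a Series of True/False)
mask = df["Rice"] > 15000

# Indexing with the mask keeps only the True rows;
# the column list limits the result to State, Year, Rice
filtered = df.loc[mask, ["State", "Year", "Rice"]]
print(filtered)
```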
Summary Statistics
value_counts()
OUTPUT: 
unique()
OUTPUT: 
nunique()
OUTPUT: 
describe()
OUTPUT: 
OUTPUT: 
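The four summary methods above can be sketched together on a small stand-in frame (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["Karnataka", "Kerala", "Karnataka"],
    "Rice": [14000, 9800, 16000],
})

print(df["State"].value_counts())  # frequency of each distinct value
print(df["State"].unique())        # distinct values as an array
print(df["State"].nunique())       # number of distinct values
print(df["Rice"].describe())       # count, mean, std, min, quartiles, max
```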
Data Cleaning
Will be Updated Shortly
Plotting
Will be Updated Shortly