Introduction to Pandas for Data Analysis

Page 1: What is Pandas?

Pandas is a Python library for data manipulation and analysis. It provides high-performance data structures and tools for loading, cleaning, transforming, and summarizing data in Python.

The core data structure in Pandas is the DataFrame, a two-dimensional, table-like object with labeled rows and columns. Each column can hold a different type of data, such as numbers, strings, or dates. DataFrames can be created from various sources, such as CSV files, Excel spreadsheets, and databases.
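As a minimal sketch of loading a DataFrame from a CSV source, the example below reads a small CSV document with pd.read_csv. The CSV text, the column names, and the use of io.StringIO in place of a file on disk are assumptions made so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so no file is needed.
csv_text = """Name,Age,Joined
John,30,2021-03-15
Jane,25,2022-07-01
"""

# parse_dates asks pandas to convert the 'Joined' column to datetime values.
people = pd.read_csv(io.StringIO(csv_text), parse_dates=["Joined"])

# Each column keeps its own data type.
print(people.dtypes)
```

With a real file, the first argument to pd.read_csv would be the file path instead of the StringIO object.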

Pandas offers a wide range of functions for data manipulation, including:

  • Filtering and selecting data
  • Sorting and indexing data
  • Adding and removing rows and columns
  • Transforming and aggregating data
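To illustrate adding and removing columns and rows from the list above, here is a short sketch. The sample data and the AgeNextYear column are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Jane", "Peter"], "Age": [30, 25, 35]})

# Add a column derived from an existing one.
df["AgeNextYear"] = df["Age"] + 1

# Remove a column. drop returns a new DataFrame and leaves df unchanged.
without_age = df.drop(columns=["Age"])

# Remove the row whose index label is 1 (Jane).
without_jane = df.drop(index=1)

print(without_age)
print(without_jane)
```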

In addition to data manipulation, Pandas also provides functions for data analysis, such as:

  • Descriptive statistics
  • Data visualization (plotting methods built on Matplotlib)
  • Time series analysis
  • Data cleaning and preparation
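Two of the items above, data cleaning and descriptive statistics, can be sketched together. The sample data, including the missing age, is hypothetical. Dropping incomplete rows is only one possible cleaning strategy.

```python
import pandas as pd

ages = pd.DataFrame(
    {"Name": ["John", "Jane", "Peter", "Maria"], "Age": [30, 25, None, 35]}
)

# Data cleaning: drop rows where Age is missing.
cleaned = ages.dropna(subset=["Age"])

# Descriptive statistics: count, mean, std, min, quartiles, and max.
summary = cleaned["Age"].describe()
print(summary)
```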

Page 2: Getting Started with Pandas

To start using Pandas, you first need to install it. You can do this using the following command in your terminal:

pip install pandas

Once you have Pandas installed, you can import it into your Python script using the following code:

import pandas as pd

The pd alias is a common convention used to refer to Pandas in Python code.

Creating a DataFrame

You can create a DataFrame from a list of dictionaries:

data = [
    {'Name': 'John', 'Age': 30, 'City': 'New York'},
    {'Name': 'Jane', 'Age': 25, 'City': 'London'},
    {'Name': 'Peter', 'Age': 35, 'City': 'Paris'},
]
df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   30  New York
1   Jane   25    London
2  Peter   35     Paris
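A DataFrame can also be built from a dictionary that maps each column name to a list of values. The sketch below builds the same table both ways and checks that the results match. The variable names are illustrative.

```python
import pandas as pd

# One dictionary per row.
rows = [
    {"Name": "John", "Age": 30, "City": "New York"},
    {"Name": "Jane", "Age": 25, "City": "London"},
    {"Name": "Peter", "Age": 35, "City": "Paris"},
]
df_from_rows = pd.DataFrame(rows)

# One list per column.
columns = {
    "Name": ["John", "Jane", "Peter"],
    "Age": [30, 25, 35],
    "City": ["New York", "London", "Paris"],
}
df_from_columns = pd.DataFrame(columns)

# equals compares shape, data types, and values.
print(df_from_rows.equals(df_from_columns))
```

The row-oriented form tends to suit records that arrive one at a time, while the column-oriented form is convenient when the data is already grouped by field.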

Accessing Data

You can access data in a DataFrame by column name, by row position with iloc, or by row and column label with loc:

print(df['Name'])         # Accessing a column
print(df.iloc[0])         # Accessing the first row
print(df.loc[0, 'Name'])  # Accessing a specific cell
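With the default integer index, iloc[0] and loc[0] return the same row, which can hide the difference between the two. The sketch below uses hypothetical string labels ('a', 'b', 'c') to show that iloc selects by position while loc selects by label.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Jane", "Peter"], "Age": [30, 25, 35]},
    index=["a", "b", "c"],  # hypothetical row labels
)

first_by_position = df.iloc[0]  # row at position 0, whatever its label
row_by_label = df.loc["b"]      # row whose label is 'b'
cell = df.loc["c", "Age"]       # row label 'c', column 'Age'

print(first_by_position["Name"])
print(row_by_label["Name"])
print(cell)
```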

Page 3: Data Manipulation with Pandas

Filtering Data

You can filter data in a DataFrame using boolean indexing:

filtered_df = df[df['Age'] > 30]
print(filtered_df)
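Boolean indexing also accepts combined conditions. The following sketch rebuilds the sample data and filters on two columns at once. The particular conditions are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Peter"],
        "Age": [30, 25, 35],
        "City": ["New York", "London", "Paris"],
    }
)

# Wrap each condition in parentheses; use & for "and" and | for "or".
# Python's own `and` / `or` keywords do not work element-wise here.
result = df[(df["Age"] >= 30) & (df["City"] != "Paris")]
print(result)
```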

Sorting Data

You can sort data in a DataFrame by one or more columns:

sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)
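Sorting by more than one column works by passing lists to by and ascending. The sketch below uses a hypothetical data set with repeated cities so that the second sort key has a visible effect.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Peter", "Anna"],
        "City": ["Paris", "London", "Paris", "London"],
        "Age": [30, 25, 35, 28],
    }
)

# Sort by City (A to Z), then by Age (oldest first) within each city.
sorted_df = df.sort_values(by=["City", "Age"], ascending=[True, False])
print(sorted_df)
```

Note that sort_values returns a new DataFrame and keeps the original index labels. Calling reset_index(drop=True) on the result renumbers the rows if that is preferred.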

Aggregating Data

You can aggregate data in a DataFrame using built-in functions:

print(df.groupby('City')['Age'].mean())

This code calculates the average age for each city in the DataFrame. Because each city appears only once in the sample data, each average equals that person's age.
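Grouping becomes more informative when a group holds several rows. The sketch below uses a hypothetical data set with two people per city and computes several statistics at once with agg.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["John", "Jane", "Peter", "Anna"],
        "City": ["Paris", "London", "Paris", "London"],
        "Age": [30, 25, 35, 28],
    }
)

# One row per city, one column per statistic.
stats = df.groupby("City")["Age"].agg(["mean", "min", "max", "count"])
print(stats)
```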