02 March 2022

Data analysis in Python: Part 2

 


This article is the second part of Data analysis in Python: Part 1. I will be using the same dataset, with a couple of changes made to demonstrate a few more functionalities. The list of functions is given below.

1. Reloading the same data

2. Removing a column

3. Using logical operators

4. Indexing

5. Frequency count

6. Counting unique values

7. Finding any text that contains a specific word/phrase

8. Finding missing values/NaN

9. Making a simple line graph

In the previous article, a CSV file was used; this time, an Excel file containing the same data is used.

1. Reloading the same data

# import pandas under its conventional alias instead of a wildcard import
import pandas as pd

# read the Excel file into a dataframe
data = pd.read_excel('/content/dimen.xlsx')
data

Fig.1

One extra column, "Publication Date (Online)", has been added to the dataframe.

2. Removing a column

The drop() function is used to remove the extra column.

# drop the column along axis=1 (columns)
data = data.drop(['Publication Date (online)'], axis=1)
data.head(5)

After executing the above code, the column we do not want to analyze further is removed from the dataframe.
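As a side note, the same result can be obtained with the columns keyword, which avoids remembering which axis is which; a minimal equivalent sketch, assuming the same column name:

data = data.drop(columns=['Publication Date (online)']) ## equivalent to axis=1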

3. Using logical operators

Next, we will use logical operators [1] to filter rows based on True/False values. This helps to answer specific queries.

For instance, if we want to investigate how many articles were published after 2020, we can execute the following code.

data[data['PubYear']>2020]

Fig.2

As we can see, the dataframe (Fig.2) shows a total of 66 articles published after 2020, i.e. in the year 2021. In the same manner, we can build further queries.

data[(data['Times cited']>10) & (data['Recent citations']>5)] ## AND operator

This code returns the articles that have "Times cited" greater than 10 and "Recent citations" greater than 5; the & operator combines the two conditions into a single query.

data[(data['Times cited']>10) & (data['Open Access']=='Closed')]

The above code will return those articles that have "Times cited" greater than 10 and "Open Access" type "Closed".

In this way, we can combine multiple conditions to find what we need, as shown in the sketch below.
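For completeness, here is a minimal sketch of the OR operator (|), which keeps a row if either condition holds; it reuses the same column names from above.

data[(data['Times cited']>10) | (data['Open Access']=='Closed')] ## OR operator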

4. Indexing and counting

Indexing is one of the most important parts of a dataframe, and the Index object has many attributes [2].

If we want to check how many articles match a certain query (see Fig.2), we can execute the following code.

# filter the rows, then count them via the index length
newdata = data[data['PubYear'] > 2020]
total = newdata.index
len(total)

After creating the new dataframe (newdata), len() returns 66, which is the same as Fig.2. The only difference is that here we get the row count as a number rather than a displayed dataframe.
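As a shorthand sketch, len() can also be applied to the dataframe directly, since the length of a dataframe is its number of rows; newdata.shape[0] gives the same figure.

len(newdata) ## returns 66, same as above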

5. Frequency count

Frequency count is another vital functionality of a dataframe. We can find the frequency of each value in a column.

We will use the value_counts() function to extract the frequencies.

data['Open Access'].value_counts()

It will return something like this:

Closed            243
All OA; Gold      172
All OA; Hybrid     74
All OA; Green       8
All OA; Bronze      3
Name: Open Access, dtype: int64

If we want to find how many times a value occurs in a column (e.g. the "Open Access" column), we can use the above code. It shows that "Closed" occurs 243 times in the "Open Access" column of the dataframe.
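If relative frequencies are more useful than raw counts, value_counts() also accepts a normalize=True argument; a minimal sketch on the same column:

data['Open Access'].value_counts(normalize=True) ## proportions instead of counts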

6. Counting unique values in a dataframe

Next, we can examine the unique values of the dataframe. For that, we need to use the following code.

data.nunique()

Title               480
PubYear              18
Open Access           5
Authors             459
Times cited          15
Recent citations     12
FCR                  86
dtype: int64


The nunique() function shows the number of unique values in each column of the dataframe. For example, we can see there are 18 unique years in the "PubYear" column.
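To see the distinct values themselves rather than just their count, the unique() method can be called on a single column; a minimal sketch on "PubYear":

data['PubYear'].unique() ## array of the 18 distinct publication years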

7. Finding any text that contains a specific word/phrase

We can find any text that contains a specific term or phrase. For example, suppose we want to find the titles containing the string "ben", i.e. words such as Bengal or Bengali. For that, we can use the str.contains() function, which matches all text containing the specified string.

data[data['Title'].str.contains('ben', case=False)]

Fig.3

As a result, it has returned all the titles that contain the string "ben". Setting the case argument to False makes the match case-insensitive; otherwise, only titles containing the exact lowercase text "ben" would be returned.
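As a further sketch, str.contains() also treats the pattern as a regular expression by default and accepts an na argument for missing titles; assuming we want to match "Bengal" or "Bengali" explicitly:

data[data['Title'].str.contains('bengal|bengali', case=False, na=False)] ## regex alternation; NaN titles count as non-matches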


8. Finding missing values/NaN

We may have missing values in a dataframe, so we need to find out where they are. This helps us analyze the data adequately.

data.isna().sum()

It will show the following result.

Title                0
PubYear              0
Open Access          0
Authors              0
Times cited          0
Recent citations     0
FCR                 67
dtype: int64

Only the FCR column has missing values, 67 in total. If needed, we can fill them with zeros.
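For instance, a minimal sketch of filling those missing values with zeros using fillna():

data['FCR'] = data['FCR'].fillna(0) ## replace NaN with 0 in the FCR column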

9. Making a simple line graph

%matplotlib inline
data['Open Access'].value_counts().plot(grid=True)

First, we need the %matplotlib inline magic command, which sets the backend for Matplotlib [3], a Python visualization library. Fig.4 shows the resulting line graph.

Fig.4

However, this is just an example; more visualization practices can be found in the Matplotlib documentation [3]. Fig.4 displays the frequency count of the unique values of the "Open Access" column.
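Since "Open Access" is a categorical column, a bar chart may read better than a line graph; a minimal sketch reusing the same plot() call with kind='bar':

data['Open Access'].value_counts().plot(kind='bar', grid=True)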

Voila! Basic data analysis has been performed, although much more can be done. Those techniques will be shown in a future article.

Note: This article is only for educational purposes.

References

[1] https://www.sciencedirect.com/topics/engineering/logical-operator

[2] https://pandas.pydata.org/docs/reference/api/pandas.Index.html#

[3] https://matplotlib.org/
