This article follows on from Data analysis in Python: Part 1 and uses the same dataset. A couple of changes have been made to the data to demonstrate a few more features. The topics covered are listed below.
1. Reloading the same data
2. Removing a column
3. Using logical operators
4. Indexing and counting
5. Frequency count
6. Counting unique values
7. Finding any text that contains a specific word/phrase
8. Finding missing values/NaN
9. Making a simple line graph
In the previous article, a CSV file was used; this time, an Excel file containing the same data is used.
1. Reloading the same data
from pandas import *
data = read_excel('/content/dimen.xlsx')
data
| Fig.1 |
One extra column, Publication Date (Online), has been added to the dataframe.
2. Removing a column
The drop() function is used to remove the extra column.
data=data.drop(['Publication Date (online)'],axis=1)
data.head(5)
Executing the code above removes the extra column. The same approach works for any column that we do not want to analyze further.
3. Using logical operators
Next, we will use logical operators [1], which work on True/False values, to select the rows that answer a particular query.
For instance, if we want to investigate how many articles were published after 2020, we can filter the dataframe on
the publication year.
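The filter described above can be sketched as follows. This is only an illustration on a toy dataframe: the column name 'PubYear' and the sample values are assumptions, because the original column name is not shown in the text.

```python
import pandas as pd

# Toy stand-in for the article's dataset. The column name 'PubYear'
# and all values are assumptions made for illustration only.
data = pd.DataFrame({
    "Title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "PubYear": [2019, 2020, 2021, 2021],
})

# The comparison produces a True/False mask; indexing with the mask
# keeps only the rows where it is True.
after_2020 = data[data["PubYear"] > 2020]
print(after_2020)
```

On the real dataset, the same pattern should work once 'PubYear' is replaced with the actual name of the year column.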
| Fig.2 |
As the dataframe in Fig.2 shows, a total of 66 articles were published after 2020, i.e. in 2021. Other queries can be built in the same manner.
data[(data['Times cited']>10) & (data['Recent citations']>5)] ## AND operator
This code returns the articles that have "Times cited" more than 10 and "Recent citations" more than 5. It combines two conditions in a single query.
In the same way, we can return the articles that have "Times cited" more than 10 and "Open Access" type "Closed".
In this way, multiple conditions can be combined to find what we need.
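The second query described above ("Times cited" more than 10 and "Open Access" equal to "Closed") might look like the sketch below. The dataframe here is a small made-up sample, not the article's data.

```python
import pandas as pd

# Made-up sample rows for illustration only.
data = pd.DataFrame({
    "Times cited": [3, 12, 25, 40],
    "Open Access": ["Closed", "Closed", "All OA; Gold", "Closed"],
})

# Each condition sits in its own parentheses because & binds more
# tightly than the comparison operators > and ==.
cited_and_closed = data[(data["Times cited"] > 10) & (data["Open Access"] == "Closed")]
print(cited_and_closed)
```

Swapping & for | would give an OR query instead, returning rows that satisfy either condition.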
4. Indexing and Counting
Indexing is one of the most important parts of a dataframe, and the index object has many attributes [2].
If we want to check how many articles match a certain query (see Fig.2), we can store the result in a new dataframe (newData) and count its rows.
This returns 66, which is the same as in Fig.2. The only difference is that it returns just the number of rows rather than the rows themselves.
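A minimal sketch of this counting step is given below. The name newData comes from the text; the column name 'PubYear' and the toy values are assumptions, and counting via len() of the index is one of several equivalent options.

```python
import pandas as pd

# Toy data; 'PubYear' is an assumed column name.
data = pd.DataFrame({"PubYear": [2019, 2020, 2021, 2021, 2021]})

# Keep the filtered rows in a new dataframe, then count them
# through the length of its index.
newData = data[data["PubYear"] > 2020]
row_count = len(newData.index)
print(row_count)
```

newData.shape[0] gives the same number, since the first element of shape is the row count.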
5. Frequency count
Frequency count is another vital functionality of a dataframe: it gives the frequency of each value in a column.
We will use the value_counts() function to extract the frequencies.
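As a sketch, value_counts() can be called on the "Open Access" column like this. The rows below are invented for illustration, so the counts differ from those of the real dataset.

```python
import pandas as pd

# Invented sample rows; the real dataset has many more.
data = pd.DataFrame({
    "Open Access": [
        "Closed", "Closed", "All OA; Gold",
        "Closed", "All OA; Gold", "All OA; Hybrid",
    ],
})

# value_counts() returns a Series: the index holds the distinct values
# and the data holds how often each occurs, most frequent first.
oa_counts = data["Open Access"].value_counts()
print(oa_counts)
```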
It will return something like this:
Closed            243
All OA; Gold      172
All OA; Hybrid     74
All OA; Green       8
All OA; Bronze      3
Name: Open Access, dtype: int64
If we want to find how many times a value occurs in a column (e.g. the column named "Open Access"), we can use value_counts(). The output shows that "Closed" occurs 243 times in the "Open Access" column of the dataframe.
6. Counting Unique Values in dataframe
Next, we can examine the unique values in the dataframe and count them.
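The original code for this step is not shown, so the following is only a sketch of how unique values are commonly inspected in pandas, using unique() and nunique() on a toy dataframe. The 'PubYear' column and all values are assumptions.

```python
import pandas as pd

# Toy data for illustration only; 'PubYear' is an assumed column name.
data = pd.DataFrame({
    "Open Access": ["Closed", "All OA; Gold", "Closed", "All OA; Hybrid"],
    "PubYear": [2020, 2021, 2021, 2021],
})

# Distinct values of one column, in order of first appearance.
unique_types = data["Open Access"].unique()

# Number of distinct values in one column.
n_unique = data["Open Access"].nunique()

# Number of distinct values in every column of the dataframe.
per_column = data.nunique()
print(unique_types, n_unique, per_column, sep="\n")
```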
7. Finding any text that contains a specific word/phrase

Fig.3
As a result, all the titles that contain the string "ben" are returned (Fig.3). The case option has to be set to
False; otherwise, the search is case-sensitive and returns only the titles that contain the lowercase text "ben".
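The search described above can be sketched with str.contains(). The column name 'Title' and the sample titles are assumptions for illustration. Passing na=False is an addition not mentioned in the text; it makes rows with a missing title count as non-matches.

```python
import pandas as pd

# Made-up titles; 'Title' is an assumed column name.
data = pd.DataFrame({
    "Title": [
        "Benchmarking solar cells",
        "Carbon capture methods",
        "Tribenzyl compounds",
        None,
    ],
})

# case=False ignores upper/lower case, so "Ben" and "ben" both match.
matches = data[data["Title"].str.contains("ben", case=False, na=False)]

# With the default case=True, only the lowercase "ben" matches.
case_sensitive = data[data["Title"].str.contains("ben", na=False)]
print(matches)
```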
8. Finding missing values/NaN
A dataframe may have missing values (NaN), so we need to find them. Knowing where they are helps us analyze the data properly.
In this dataset, only the FCR column has missing values, 67 in total.
We can then fill them with zeros.
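Since the original code is not shown, here is one common way to count missing values per column and fill them with zeros. It is a sketch on invented numbers; only the column name FCR comes from the text.

```python
import pandas as pd

# Invented values for illustration only.
data = pd.DataFrame({
    "Times cited": [3, 12, 25, 40],
    "FCR": [1.2, float("nan"), 0.5, float("nan")],
})

# isna() marks each missing cell as True; sum() then counts the
# True values column by column.
missing = data.isna().sum()
print(missing)

# Replace missing FCR values with 0, leaving other columns untouched.
filled = data.fillna({"FCR": 0})
print(filled)
```

Whether zero is a sensible replacement depends on the analysis, because it changes averages computed from that column.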
9. Making a simple line graph
First, we need to run %matplotlib inline, a magic function that sets the backend of Matplotlib [3], a Python
visualization library, so that plots are displayed inside the notebook. Fig.4 shows a line graph.
| Fig.4 |
However, this is just an example; more visualization practices can be found in the content. Fig.4 displays the frequency
count of the unique values of the column "Open Access".
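A line graph like the one described could be produced roughly as follows. This is a sketch on invented rows, and the axis labels are additions of mine. In a Jupyter or Colab notebook, %matplotlib inline would be run in a cell beforehand.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented sample rows for illustration only.
data = pd.DataFrame({
    "Open Access": [
        "Closed", "Closed", "All OA; Gold",
        "Closed", "All OA; Gold", "All OA; Hybrid",
    ],
})

# Frequency of each Open Access type, most frequent first.
counts = data["Open Access"].value_counts()

# Series.plot draws with Matplotlib and returns the Axes object.
ax = counts.plot(kind="line")
ax.set_xlabel("Open Access type")
ax.set_ylabel("Number of articles")
```

Outside a notebook, calling plt.show() afterwards opens the figure window.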
Voila! A basic data analysis has been performed, although much more can be done. Further techniques will be shown in
their own articles.
Note: This article is only for educational purposes.
References
[1] https://www.sciencedirect.com/topics/engineering/logical-operator
[2] https://pandas.pydata.org/docs/reference/api/pandas.Index.html#
[3] https://matplotlib.org/