This article demonstrates basic data analysis in Python and is the first part of a series.
Python [1] is a high-level, general-purpose programming language that is now used in almost every field, including library and information science (LIS). An in-depth discussion of Python and its applications is beyond the scope of this article. There are a few prerequisites for using Python for data analysis:
1. Fundamental knowledge of Python (e.g., data types, variables, library import, arithmetic operators, etc.)
2. Familiarity with the Jupyter Notebook
The following operations will be demonstrated:
1. Importing and loading the dataset
2. Describing the dataset
3. Viewing a particular column
4. Calculating min, max, median, sum, and mean
5. Sorting values
6. Adding a new column to the existing data frame
I am using Google Colab, a web-based Jupyter notebook environment. A dataset (n=500) containing bibliographic data of articles published in the DESIDOC Journal of Library and Information Technology was exported from Dimensions [2].
1. Importing and loading the dataset
Those who use Jupyter Notebook [3] instead need to change the path from which the dataset is imported.
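The loading step itself can be done with the pandas library [4]. Below is a minimal sketch, assuming the export has been saved as 'dimensions_export.csv' (a placeholder name; the actual file name depends on the export):

import pandas as pd  # pandas [4] provides the data frame used throughout

# Load the Dimensions CSV export into a data frame named "data".
# In Colab the file must first be uploaded to the session; Jupyter users
# should change this path to wherever the export is saved locally.
data = pd.read_csv('dimensions_export.csv')

data.head()  # preview the first five rows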
2. Describing the dataset
Next, we will check the basic description of the dataset using describe().
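Assuming the data frame is named data as above, the call is simply:

data.describe()  # count, mean, std, min, quartiles, and max for the numeric columns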
| Fig.2 |
3. Viewing a particular column
Since there is a column named "Title", we will view the first 10 article titles using head():
data.Title.head(10)
The output looks like:
0    A cross sectional study of retraction notices ...
1    Agriculture Journals Covered by Directory of O...
2    Use of Open Educational Resources and Print Ed...
3    Best Practices of Social Media in Academic Lib...
4    Predatory Publishers using Spamming Strategies...
5    Electronic Information Resource Optimisation i...
6    Use of ResearchGate by the Research Scholars o...
7    India, Open Access, the Law of Karma and the G...
8    Reusing Data Technical and Ethical Challenges
9    Global Research Studies on Electronic Journals...
Name: Title, dtype: object
4. Calculating min, max, median, sum, and mean
Next, we will create a new data frame for "Times cited" to calculate the min, max, median, mean, and sum. We express our new data frame as tc.
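One way to create it, assuming the column in the export is named "Times cited" exactly as it appears below:

tc = data['Times cited']  # select the citation counts as a single column
tc.head(10)               # preview the first 10 values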
0 3
1 0
2 0
3 2
4 2
5 2
6 3
7 2
8 3
9 0
Name: Times cited, dtype: int64
(Note: First column is the index column)
tc.min()
Output:
0
tc.max()
Output:
117
tc.median()
Output:
1
tc.mean()
Output:
1.788
tc.sum()
Output:
894
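As a side note, the same five statistics can be obtained in a single call with agg(); this is only an optional alternative to the separate calls above:

tc.agg(['min', 'max', 'median', 'mean', 'sum'])  # all five statistics at once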
5. Sorting values
Further, if we want to sort the "Times cited" column in descending order, we can use the code below.
data.sort_values('Times cited', ascending=False).head(5)
The output will be:
| Fig.3 |
This gives the top five articles with the most citations over the years.
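If only the titles and citation counts of those top five articles are needed, the sort can be combined with a column selection; a small sketch, assuming the column names used above:

data.sort_values('Times cited', ascending=False)[['Title', 'Times cited']].head(5)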
6. Adding a new column to the existing data frame
If we want to get each article's share of the total citations in "Times cited", we can use the following code. We express newdf as a new data frame holding that column, and one more calculated data frame, perc, is created from it.
newdf = data['Times cited']
perc = newdf / tc.sum()
perc.head(5)
The output will be:
0    0.003356
1    0.000000
2    0.000000
3    0.002237
4    0.002237
Name: Times cited, dtype: float64
Next, if we want to add this part to the existing dataframe (data), we could use the following code.
data['Percentage of citation'] = perc
data
| Fig.4 |
As we can see, the new column has been added after running the lines above.
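Note that perc, and therefore the new column, holds proportions such as 0.003356 rather than true percentages. If actual percentages are preferred, the values can be scaled before being added, for example:

data['Percentage of citation'] = (perc * 100).round(3)  # e.g. 0.003356 becomes 0.336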
This is how we can use code to analyze our data. In the next article, more functions will be demonstrated.
This article is only for educational purposes.
READ ALSO: Basic data types: Google Sheets, R, and Python
References
[1] Python (programming language), Wikipedia: https://en.wikipedia.org/wiki/Python_(programming_language)
[2] Dimensions: https://www.dimensions.ai/
[3] Project Jupyter: https://jupyter.org/
[4] pandas (software), Wikipedia: https://en.wikipedia.org/wiki/Pandas_(software)