This article demonstrates basic data analysis in Python and is the first part of a series.
Python [1] is a high-level, general-purpose programming language that is now used in almost every field, including library and information science (LIS). An in-depth discussion of Python and its applications is beyond the scope of this article. There are a few prerequisites for using Python for data analysis:
1. Fundamental knowledge of Python (e.g., data types, variables, library import, arithmetic operators, etc.)
2. Familiarity with the Jupyter Notebook
The following operations will be demonstrated:
1. Importing and loading the dataset
2. Describing the dataset
3. Viewing a particular column
4. Calculating min, max, median, sum, and mean
5. Sorting values
6. Adding a new column to the existing data frame
I am using Google Colab, a web-based Jupyter notebook environment. A dataset (n=500) containing bibliographic data of articles published in the DESIDOC Journal of Library and Information Technology was exported from Dimensions [2].
1. Importing and loading the dataset
Those who use Jupyter Notebook [3] instead need to change the path from which the dataset is imported.
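The loading step itself can be done with the pandas library [4]. Below is a minimal sketch, assuming the export has been saved as 'dimensions_export.csv' (a placeholder name; the actual file name depends on the export):

import pandas as pd  # pandas [4] provides the data frame used throughout

# Load the Dimensions CSV export into a data frame named "data".
# In Colab the file must first be uploaded to the session; Jupyter users
# should change this path to wherever the export is saved locally.
data = pd.read_csv('dimensions_export.csv')

data.head()  # preview the first five rows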
2. Describing the dataset
Next, we will check the basic description of the dataset using describe().
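Assuming the data frame is named data as above, the call is simply:

data.describe()  # count, mean, std, min, quartiles, and max for the numeric columns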
| Fig.2 |
3. Viewing a particular column
Since there is a column named "Title", we will view the first 10 article titles using head():
data.Title.head(10)
The output looks like:
0    A cross sectional study of retraction notices ...
1    Agriculture Journals Covered by Directory of O...
2    Use of Open Educational Resources and Print Ed...
3    Best Practices of Social Media in Academic Lib...
4    Predatory Publishers using Spamming Strategies...
5    Electronic Information Resource Optimisation i...
6    Use of ResearchGate by the Research Scholars o...
7    India, Open Access, the Law of Karma and the G...
8    Reusing Data Technical and Ethical Challenges
9    Global Research Studies on Electronic Journals...
Name: Title, dtype: object
4. Calculating min, max, median, sum, and mean
Next, we will create a new data frame for "Times cited" to calculate the min, max, median, mean, and sum. We express our new data frame as tc.
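One way to create it, assuming the column in the export is named "Times cited" exactly as it appears below:

tc = data['Times cited']  # select the citation counts as a single column
tc.head(10)               # preview the first 10 values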
0 3
1 0
2 0
3 2
4 2
5 2
6 3
7 2
8 3
9 0
Name: Times cited, dtype: int64
(Note: First column is the index column)
tc.min()
Output:
0
tc.max()
Output:
117
tc.median()
Output:
1
tc.mean()
Output:
1.788
tc.sum()
Output:
894
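As a side note, the same five statistics can be obtained in a single call with agg(); this is only an optional alternative to the separate calls above:

tc.agg(['min', 'max', 'median', 'mean', 'sum'])  # all five statistics at once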
5. Sorting values
Further, if we want to sort the "Times cited" column in descending order, we can use the code below.
data.sort_values('Times cited', ascending=False).head(5)
The output will be:
| Fig.3 |
This gives the top five articles with the most citations over the years.
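If only the titles and citation counts of those top five articles are needed, the sort can be combined with a column selection; a small sketch, assuming the column names used above:

data.sort_values('Times cited', ascending=False)[['Title', 'Times cited']].head(5)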
6. Adding a new column to the existing data frame
If we want to get each article's share of the total citations in "Times cited", we can use the following code. We express newdf as a new data frame holding that column, and one more calculated data frame, perc, is created from it.
newdf = data['Times cited']
perc = newdf / tc.sum()
perc.head(5)
The output will be:
0    0.003356
1    0.000000
2    0.000000
3    0.002237
4    0.002237
Name: Times cited, dtype: float64
Next, if we want to add this part to the existing dataframe (data), we could use the following code.
data['Percentage of citation'] = perc
data
| Fig.4 |
As we can see, the new column has been added after running the lines above.
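Note that perc, and therefore the new column, holds proportions such as 0.003356 rather than true percentages. If actual percentages are preferred, the values can be scaled before being added, for example:

data['Percentage of citation'] = (perc * 100).round(3)  # e.g. 0.003356 becomes 0.336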
This is how we can use code to analyze our data. In the next article, more functions will be demonstrated.
This article is only for educational purposes.
READ ALSO: Basic data types: Google Sheets, R, and Python
References
[1] Python (programming language), Wikipedia: https://en.wikipedia.org/wiki/Python_(programming_language)
[2] Dimensions: https://www.dimensions.ai/
[3] Project Jupyter: https://jupyter.org/
[4] pandas (software), Wikipedia: https://en.wikipedia.org/wiki/Pandas_(software)