student performance dataset

In the post-COVID-19 era, the adoption of e-learning has gained momentum. Student Performance Database. Date: 2017-7-1. Higher Education Students Performance Evaluation Dataset. The distribution of the performance scores by group is shown as a boxplot. The authors found that active learning increased student exam scores by almost half a standard deviation. To see information about categorical features, pass the include parameter to the describe() method and set it to ['O']. To examine whether engagement improved performance, scores on the competition-related questions, normalized by total exam score (as computed in the performance section), are examined in relation to the frequency of submissions during the competition. Because a competition has an independent, clear performance metric and a dynamic leaderboard, students can see how their model predictions compare with the models produced by other students. Student performance will be categorized as Fail, Fair, Good, or Excellent; you define these categories yourself. import pandas as pd import numpy as np import matplotlib. The focus is on the difference in median between the groups. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. The new column should contain 1 when the value of famsize in the given row is GT3 and 0 when it is LE3. Table 3 shows the results of permutation testing of the median difference between the groups. Besides, data analysis and visualization can be done as standalone tasks if there is no need to dig deeper into the data. The data contains features such as the meal type given to the student, test preparation level, parental level of education, and the students' performance in Math, Reading, and Writing. Each scatter plot shows the relationship between two of the specified columns.
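The two pandas steps mentioned above (summarizing categorical columns with describe() and deriving a 0/1 column from famsize) can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made table, not the original tutorial's code: the rows and the column name famsize_bin are assumptions made for the example.

```python
import pandas as pd

# Tiny hand-made stand-in for the student data; the values are illustrative only.
# The text columns are created with dtype=object so that include=["O"] selects
# them regardless of the pandas version's default string dtype.
df = pd.DataFrame({
    "school": pd.Series(["GP", "GP", "MS", "GP"], dtype=object),
    "famsize": pd.Series(["GT3", "LE3", "GT3", "GT3"], dtype=object),
    "G3": [11, 14, 9, 16],
})

# Summary (count, unique, top, freq) of the object-typed columns only.
categorical_summary = df.describe(include=["O"])
print(categorical_summary)

# Hypothetical new feature: 1 where famsize is "GT3", 0 where it is "LE3".
df["famsize_bin"] = (df["famsize"] == "GT3").astype(int)
print(df[["famsize", "famsize_bin"]])
```

Comparing against "GT3" and casting the boolean result to int avoids an explicit loop and leaves the original famsize column untouched.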
to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
P. Cortez and A. Silva. The third row simply prints out the results. However, it may have a negative influence if constructed poorly. For the purpose of evaluation and benchmarking, an anonymized students' academic performance dataset, called IITR-APE, was created and will be released in the public domain.
(2015) ran a competition assessing anatomical knowledge as part of an undergraduate anatomy course. (2020) Student Performance Classification Using Artificial Intelligence Techniques. Scores for the relevant questions were summed and converted into a percentage of the possible score. During the work, we used the Matplotlib and Seaborn packages. We will demonstrate how to load data into AWS S3 and then bring it into Python through Dremio. (2014) examined 158 studies published in about 50 STEM educational journals. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. Student Academic Performance Prediction using Supervised Learning. Abstract: Predict student performance in secondary education (high school). The more free time a student has, the lower the performance he/she demonstrates. In addition, performance in the competition, as measured by accuracy or error, is also examined in relation to the number of submissions. The data need to be split into training and testing sets. Automatic student performance prediction is a crucial job due to the large volume of data in educational databases. On the heatmap, you can see the correlation not only of each variable with the target variable, but also between the variables themselves. The magnitude of the effect of different approaches, though, varies. The boxplots suggest that, on the regression question, the students who participated in the challenge performed better relative to their total exam performance than those who did not.
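The split into training and testing sets mentioned above could look something like the following sketch. It uses only pandas (sample() and drop()) rather than any particular machine learning library; the toy rows, the 80/20 ratio, and the random_state value are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative rows only; G3 plays the role of the prediction target.
df = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4, 1, 2, 3, 2, 1],
    "absences": [6, 4, 10, 2, 0, 8, 3, 1, 5, 12],
    "G3": [8, 11, 9, 14, 17, 7, 12, 15, 10, 6],
})

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
test = df.sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# Separate the features from the target in each part.
X_train, y_train = train.drop(columns="G3"), train["G3"]
X_test, y_test = test.drop(columns="G3"), test["G3"]
print(len(train), len(test))
```

Dropping the sampled index from the full frame guarantees that no row appears in both parts.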
We will use popular Python libraries for the visualization, namely Matplotlib and Seaborn. About halfway through the competition, students might be allowed to form teams, to learn how averaging models can boost performance. The Student Performance data was obtained from a survey of students in secondary-school math courses. The dataset was created by collecting student feedback from American International University-Bangladesh and then labelled by undergraduate . Similarly, the results show that students who did the regression challenge performed better on these exam questions. Observations are recorded for the Maths, Reading, and Writing scores, the scores-versus-gender plots, race/ethnicity, parental level of education, test preparation course status, race/ethnicity versus parental level of education, and the lunch field. filterwarnings("ignore") On the other hand, the predictive accuracy improved with the number of submissions for the regression competitions. A sample submission file needs to be provided. However, you can understand the gist of this type of visualization. Let's look at the distributions of all numeric columns in our dataset using Matplotlib. It is well known for its competitions (e.g., Rhodes 2011), some of which come with rich monetary prizes (e.g., Howard 2013). The final dataset contains more than 2,000,000 student feedback instances related to teacher performance. In any case, a good data scientist should know how to analyze and visualize data. We have learned about many factors that affect a student's performance.
The data set contains 12,411 observations, each representing a student, with 44 variables. The code below is used to import the port_final and mat_final tables into Python as pandas dataframes. We can analyze the correlation and then visualize it using Seaborn. It can be required as a standalone task, as well as a preparatory step during the machine learning process. We recommend providing your own data for the class challenge. For example, we would expect a student with a 70% exam mark to score 70% on each of the questions in the exam, if she has a similar level of knowledge on all the exam topics. Also, the more alcohol a student drinks on weekends or workdays, the lower his/her final grade. It helps to understand which features may be useful, which are redundant, and which new features can be created artificially. The sample() method returns N random rows from the dataframe. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details). An exception is, of course, an academic discussion between the teaching team and the students that is motivated by the competition, for example, a discussion about different models, their advantages, and their limitations. Quadrants one and three include students that underperform or outperform on both types of questions, respectively.
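The expectation described above (a student with a 70% exam mark should score about 70% on each question) can be made concrete by dividing a question's percentage score by the total exam percentage. The sketch below is one plausible way to compute such a normalized score; the function name relative_score and the sample numbers are made up for illustration and are not taken from the study.

```python
def relative_score(question_pct, exam_pct):
    """Ratio of a question's percentage score to the total exam percentage.

    A value above 1 means the student did better on this question than
    their overall exam mark would suggest; below 1 means worse.
    """
    if exam_pct <= 0:
        raise ValueError("exam_pct must be positive")
    return question_pct / exam_pct


# Hypothetical students: (name, score on the question in %, total exam mark in %).
students = [("A", 84.0, 70.0), ("B", 56.0, 70.0), ("C", 70.0, 70.0)]

for name, question_pct, exam_pct in students:
    ratio = relative_score(question_pct, exam_pct)
    if ratio > 1:
        verdict = "better than expected"
    elif ratio < 1:
        verdict = "worse than expected"
    else:
        verdict = "as expected"
    print(name, round(ratio, 2), verdict)
```

Working with a ratio rather than a raw difference keeps students with very different overall marks on a comparable scale.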
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g.
That's why we will do some things with the data immediately in Dremio, before putting it into Python's hands. Winners are typically expected to share their code, and occasionally newly emerged algorithms are introduced to the broad community, for example, deep neural networks (Hinton and Dahl 2012) and XGBoost (Chen and Guestrin 2016). Taking part in the data competition contributed a lot to my engagement with the subject. But these dataframes have an identical structure, and if you want, you can do the same operations with the Mathematics dataframe and compare the results. In this Data Science Project we will evaluate the performance of a student using machine learning techniques and Python. The datasets used in our competitions can be shared with other instructors by request. A competition, like any other active learning method that is used for assessment, has its advantages and disadvantages. It's time to wrap up. Lucio Daza, Sr. Director of Technical Product Marketing.
Moreover, it can serve as an input for predicting students' academic performance within the module for educational data mining and learning analytics. Students who participated in the Kaggle challenge for classification scored higher on the classification problem than those that did the regression competition. Students formed their own teams of 2-4 members to compete. The purpose is to predict students' end-of-term performances using ML techniques. Kaggle does not allow you to download participants' email addresses; all you see is their Kaggle name. This data is based on population demographics. Parts b and c were in the top 10 for discrimination and part a was at rank 13. Now, we use the hist() method on the df_num dataframe to build a graph. In the parameters of the hist() method, we have specified the size of the plot, the size of labels, and the number of bins. We acknowledge that the differences in the engagement levels may not necessarily be a result of participation in the competition, but it is still an interesting aspect. Carpio Cañada et al. The 141 undergraduate (ST-UG) students were used for comparison when examining the performance of the postgraduate students. Area: E-learning, Education, Predictive models, Educational Data Mining. 270 of the parents answered the survey and 210 did not; 292 of the parents are satisfied with the school and 188 are not. Nowadays, these tasks are still present. One can expect that, on average, a student's success rate for each question will be about the same as their success rate in the total exam. The graph for fathers' jobs is shown below. The boxplot shows the median value and the lower and upper quartiles of the data. This work is one of few quantitative analyses of the influence of data competitions on students' performance.
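Group comparisons like the ones above (for example, students who took part in one competition versus those who took part in the other) are the kind of question a permutation test on the difference in medians can address. The sketch below shows the general technique using only the standard library; it is not the study's own analysis code, and the scores, the number of permutations, and the seed are invented for illustration.

```python
import random
from statistics import median


def permutation_test_median(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns the observed difference (median of A minus median of B) and an
    approximate p-value obtained by repeatedly shuffling the group labels.
    """
    rng = random.Random(seed)
    observed = median(group_a) - median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of scores to the two groups
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # The +1 terms count the observed arrangement itself, so p is never 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up normalized scores for two groups of students.
participants = [78, 82, 85, 88, 90, 91, 94, 95]
non_participants = [55, 58, 60, 62, 65, 66, 70, 72]

observed_diff, p = permutation_test_median(participants, non_participants, 2000)
print("median difference:", observed_diff)
print("approximate p-value:", round(p, 4))
```

A permutation test makes no assumption about the shape of the score distribution, which suits small classes where normality is hard to justify.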
Actually, before the machine learning era, all data science was about the interpretation and visualization of data with different tools and making conclusions about the nature of data. In our case, this visualization may not be as useful as it could be. The collection phase of the entire dataset includes . Surprisingly, fewer students felt that the Kaggle challenge might help with exam performance (Q4). Further in this tutorial, we will work only with the Portuguese dataframe, in order not to overload the text. Using Data Mining to Predict Secondary School Student Performance. Click on the arrow near the name of each column to open the context menu. Her success rate on the regression question will be higher than 70%. There are also learning competitions (Agarwal 2018), designed to help novices hone their data mining skills. Another improvement could be asking the ST-UG students that did not take part in the competition about their level of engagement and comparing the answers with those of the ST-PG students.
Student ID
1- Student Age (1: 18-21, 2: 22-25, 3: above 26)
2- Sex (1: female, 2: male)
3- Graduated high-school type: (1: private, 2: state, 3: other)
4- Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full)
5- Additional work: (1: Yes, 2: No)
6- Regular artistic or sports activity: (1: Yes, 2: No)
7- Do you have a partner: (1: Yes, 2: No)
8- Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410)
9- Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other)
10- Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other)
11- Mother's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
12- Father's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
13- Number of sisters/brothers (if available): (1: 1, 2: 2, 3: 3, 4: 4, 5: 5 or above)
14- Parental status: (1: married, 2: divorced, 3: died - one of them or both)
15- Mother's occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other)
16- Father's occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other)
17- Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)
18- Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
19- Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often)
20- Attendance to the seminars/conferences related to the department: (1: Yes, 2: No)
21- Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral)
22- Attendance to classes (1: always, 2: sometimes, 3: never)
23- Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable)
24- Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)
25- Taking notes in classes: (1: never, 2: sometimes, 3: always)
26- Listening in classes: (1: never, 2: sometimes, 3: always)
27- Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always)
28- Flip-classroom: (1: not useful, 2: useful, 3: not applicable)
29- Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
30- Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
31- Course ID
32- OUTPUT Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA)
Yılmaz N., Sekeroglu B. Kaggle (The Kaggle Team 2018) is a platform for predictive modeling and analytics competitions where participants compete to produce the best predictive model for a given dataset.
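Because the OUTPUT Grade attribute listed above is stored as an integer code (0: Fail through 7: AA), it can help to translate the codes back into letter grades before reporting results. The following is a small sketch of that lookup; the names GRADE_LABELS and decode_grade are hypothetical helpers introduced for this example, not part of the dataset.

```python
# Mapping taken from the OUTPUT Grade description (0: Fail ... 7: AA).
GRADE_LABELS = {
    0: "Fail",
    1: "DD",
    2: "DC",
    3: "CC",
    4: "CB",
    5: "BB",
    6: "BA",
    7: "AA",
}


def decode_grade(code):
    """Translate a numeric grade code into its letter label."""
    try:
        return GRADE_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown grade code: {code!r}") from None


# Example codes as they might appear in the grade column.
codes = [0, 3, 7]
print([decode_grade(code) for code in codes])
```

Raising an error for an unknown code, rather than returning a default, makes data-entry problems visible early.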
Student Academic Performance Analysis | Kaggle. In awarding course points to student effort, we typically align it to performance. In Dremio, everything you do is reflected in SQL code. These statistics are consistent with historic scores for the class, in that the undergraduates tend to have a wider range than the postgraduates but generally quite similar averages. It allows a better understanding of data, its distribution, purity, features, etc. When ready, press the button. Moreover, future investigation is required to understand the influence of the different aspects of data competition implementation on the magnitude of the performance improvement. Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez. Performance is plotted against type of question, separately for the competition they completed. ibrahus/Students-Performance-in-Exams - GitHub. It also provides all the scores from all past submissions (under Raw Data on Public Leaderboard). One of these functions is pairplot(). Record the student names in Kaggle to match with your class records. Several papers recently addressed the prediction of students' performance employing machine learning techniques. The purpose of this study is to examine the relationships among affective characteristics-related variables at the student level, the aggregated school-level variables, and mathematics performance by using the Programme for International Student Assessment (PISA) 2012 dataset. The first row of the code below uses the corr() method to calculate correlations between the different columns and the final_target feature. Secondarily, the competitions enhanced interest and engagement in the course. Data Analysis on Student's Performance Dataset from Kaggle.
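The correlation step described above, using corr() against the final_target feature, might be written along these lines. This is a reconstruction under assumptions rather than the tutorial's original code: the toy rows are invented, and restricting the frame to numeric columns with select_dtypes() is a choice made here so that text columns do not interfere.

```python
import pandas as pd

# Illustrative rows; final_target stands in for the final grade column.
df = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4, 1],
    "failures": [2, 0, 1, 0, 0, 3],
    "school": ["GP", "GP", "MS", "MS", "GP", "MS"],
    "final_target": [8, 12, 10, 15, 18, 5],
})

# Correlation of every numeric column with final_target.
correlations = df.select_dtypes(include="number").corr()["final_target"]
# Drop the trivial self-correlation and sort from most positive to most negative.
correlations = correlations.drop("final_target").sort_values(ascending=False)
print(correlations)
```

Sorting the result makes it easy to read off which features move with the target and which move against it.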
Points outside the whiskers represent outliers. Table 1. Data Set Characteristics: Multivariate. The Seaborn package has the distplot() method for this purpose (deprecated in newer Seaborn releases in favor of histplot() and displot()). This dataset can be used to develop and evaluate ABSA models for teacher performance evaluation. Data were compiled by class members, who monitored and extracted information from their emails over a period of a week and manually tagged them as spam or ham. Also, visualization is recommended to present the results of the machine learning work to different stakeholders. When creating SQL queries, we used the full paths to tables (name_of_the_space.name_of_the_dataframe). The exam questions can be seen in the Online Supplementary files for ST and CSDM, respectively. Most of our categorical columns are binary. Now we are going to build visualizations with Matplotlib and Seaborn. To do this, we extract only those rows which contain the value U in the address column. From the output above, we can say that there are more students from urban areas than from rural areas. Prior and post testing of students might improve the experimental design. The data set also includes a school attendance feature: students are classified into two categories based on their absence days, with 191 students exceeding 7 absence days and 289 students under 7.
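Extracting the rows where address equals U, as described above, is a one-line boolean filter in pandas. The sketch below illustrates it on a handful of invented rows; the variable names urban and rural are assumptions for this example.

```python
import pandas as pd

# Illustrative rows; "U" marks an urban home address and "R" a rural one.
df = pd.DataFrame({
    "address": ["U", "U", "R", "U", "R"],
    "G3": [12, 15, 9, 11, 13],
})

# Keep only the rows whose address column equals "U".
urban = df[df["address"] == "U"]
rural = df[df["address"] == "R"]
print(len(urban), "urban students,", len(rural), "rural students")

# value_counts() gives the same comparison without building the subsets.
print(df["address"].value_counts())
```

When only the counts matter, value_counts() is the shorter route; the filtered frames are useful when the urban rows themselves are needed for further analysis.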
