
Anastasia-front/data-science


TASKS

intro_data_science


PART 1

  1. Create a one-dimensional array (vector) with the first 10 natural numbers and print its values.
  2. Create a two-dimensional array (matrix) of size 3x3, fill it with zeros, and print its values.
  3. Create a 5x5 array, fill it with random integers in the range from 1 to 10, and print its values.
  4. Create a 4x4 array, fill it with random floating-point numbers in the range from 0 to 1, and print its values.
  5. Create two one-dimensional arrays of size 5, fill them with random integers in the range from 1 to 10, and perform element-wise addition, subtraction, and multiplication.
  6. Create two vectors of size 7, fill them with arbitrary numbers, and find their dot product.
  7. Create two matrices of size 2x2 and 2x3, fill them with random integers in the range from 1 to 10, and multiply them together.
  8. Create a 3x3 matrix, fill it with random integers in the range from 1 to 10, and find its inverse matrix.
  9. Create a 4x4 matrix, fill it with random floating-point numbers in the range from 0 to 1, and transpose it.
  10. Create a 3x4 matrix and a vector of size 4, fill them with random integers in the range from 1 to 10, and multiply the matrix by the vector.
  11. Create a 2x3 matrix and a vector of size 3, fill them with random floating-point numbers in the range from 0 to 1, and multiply the matrix by the vector.
  12. Create two matrices of size 2x2, fill them with random integers in the range from 1 to 10, and perform element-wise multiplication.
  13. Create two matrices of size 2x2, fill them with random integers in the range from 1 to 10, and find their product.
  14. Create a 5x5 matrix, fill it with random integers in the range from 1 to 100, and find the sum of its elements.
  15. Create two matrices of size 4x4, fill them with random integers in the range from 1 to 10, and find their difference.
  16. Create a 3x3 matrix, fill it with random floating-point numbers in the range from 0 to 1, and find a column vector containing the sum of elements of each row of the matrix.
  17. Create a 3x4 matrix with arbitrary integers and create a matrix with the squares of these numbers.
  18. Create a vector of size 4, fill it with random integers in the range from 1 to 50, and find a vector with the square roots of these numbers.
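A minimal sketch of how the first three tasks above could look in NumPy (the variable names are illustrative):

import numpy as np

# Task 1: vector with the first 10 natural numbers
vector = np.arange(1, 11)
print(vector)

# Task 2: 3x3 matrix filled with zeros
matrix = np.zeros((3, 3))
print(matrix)

# Task 3: 5x5 matrix of random integers from 1 to 10 (high bound is exclusive)
rng = np.random.default_rng()
random_matrix = rng.integers(1, 11, size=(5, 5))
print(random_matrix)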

PART 2 (additional, optional)

  1. Replace all odd numbers in the array with -1: arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).
  2. Create and reshape a 1D array into a 2D array with 2 rows.
  3. Create two 2D arrays a and b, and vertically stack them.
  4. Generate a pattern without hard coding: Input: a = np.array([1,2,3]), Output: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]).
  5. Find common elements between arrays a and b: a = np.array([1,2,3,2,3,4,3,4,5,6]), b = np.array([7,2,10,2,7,4,9,4,9,8]).
  6. Find the indices of the first 5 maximum values in the array a: np.random.seed(100), a = np.random.uniform(1,50, 20).
  7. Remove all NaN values from a one-dimensional array: a=np.array([1,2,3,np.nan,5,6,7,np.nan]).
  8. Calculate the Euclidean distance between two arrays a and b: a = np.array([1,2,3,4,5]), b = np.array([4,5,6,7,8]).
  9. Find the index of the 5th occurrence of the number 1 in the array x: x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2]).
  10. Identify repeated entries (from the 2nd occurrence onwards) in the given array and mark them as True. The first occurrence should be False: np.random.seed(100), a = np.random.randint(0, 5, 10).
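For illustration, two of the optional tasks above (task 1, masking odd numbers, and task 7, dropping NaN values) can be solved with boolean indexing:

import numpy as np

# Task 1: replace all odd numbers with -1 using a boolean mask
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr % 2 == 1] = -1

# Task 7: keep only the non-NaN elements
a = np.array([1, 2, 3, np.nan, 5, 6, 7, np.nan])
a = a[~np.isnan(a)]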

Getting_to_know_Pandas

Read the data from the table [Birth Rate in the Regions of Ukraine (1950—2019)](https://uk.wikipedia.org/wiki/%D0%9D%D0%B0%D1%81%D0%B5%D0%BB%D0%B5%D0%BD%D0%BD%D1%8F_%D0%A3%D0%BA%D1%80%D0%B0%D1%97%D0%BD%D0%B8)
  1. Display the first rows of the table using the head method.
  2. Determine the number of rows and columns in the DataFrame (use the shape attribute).
  3. Replace the "-" values in the table with NaN values.
  4. Determine the types of all columns using dataframe.dtypes.
  5. Replace non-numeric column types with numeric ones. Hint: these are columns where the "-" symbol was found.
  6. Calculate the proportion of missing values in each column (use the isnull and sum methods).
  7. Remove the data for the entire country, the last row of the table.
  8. Replace missing values in columns with the mean of the respective columns (use the fillna method).
  9. Get a list of regions where the birth rate in 2019 was higher than the national average.
  10. In which region was the highest birth rate in 2014?
  11. Build a bar chart of birth rates by region in 2019.
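A possible starting point for these tasks, assuming pandas can parse the table directly from the Wikipedia page (the table index 0 is an assumption and may need adjusting):

import numpy as np
import pandas as pd

# read_html returns a list of all tables on the page;
# the position of the birth-rate table is an assumption
url = "https://uk.wikipedia.org/wiki/Населення_України"
df = pd.read_html(url)[0]

df.head()
df.shape
df = df.replace("-", np.nan)
df.isnull().sum() / len(df)                  # share of missing values per column
df = df.fillna(df.mean(numeric_only=True))   # fill NaN with column means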

File_Analysis

Conduct an analysis of the file 2017_jun_final.csv. The file contains the results of a survey of developers in June 2017.
  1. Read the file 2017_jun_final.csv using the read_csv method.
  2. Read the obtained table using the head method.
  3. Determine the size of the table using the shape method.
  4. Determine the data types of all columns using the dataframe.dtypes.
  5. Calculate the proportion of missing values in each column (use the isnull and sum methods).
    1. Remove all columns with missing values except the "Programming Language" column.
    2. Calculate again the proportion of missing values in each column and make sure that only the "Programming Language" column remains.
  6. Remove all rows with missing values from the original table using the dropna method.
  7. Determine the new size of the table using the shape method.
  8. Create a new table python_data, which will only contain rows with specialists who indicated Python as their programming language.
  9. Determine the size of the python_data table using the shape method.
  10. Using the groupby method, perform grouping by the "Position" column.
  11. Create a new DataFrame where for the grouped data by the "Position" column, perform data aggregation using the agg method and find the minimum and maximum values in the "Monthly Salary" column.
  12. Create a function fill_avg_salary, which will return the average monthly salary. Use it for the apply method and create a new column "avg".
  13. Create descriptive statistics using the describe method for the new column.
  14. Save the obtained table to a CSV file.
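A sketch of steps 8–14, assuming the column names match the task description (in the real file they may differ):

import pandas as pd

df = pd.read_csv("2017_jun_final.csv")
python_data = df[df["Programming Language"] == "Python"]   # column name assumed

# min/max monthly salary per position
stats = (python_data
         .groupby("Position")
         .agg(min_salary=("Monthly Salary", "min"),
              max_salary=("Monthly Salary", "max")))

# one common reading of step 12: the average of the min and max per group
def fill_avg_salary(row):
    return (row["min_salary"] + row["max_salary"]) / 2

stats["avg"] = stats.apply(fill_avg_salary, axis=1)
stats["avg"].describe()
stats.to_csv("python_salary_stats.csv")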

Analyze_the_dataset_from_Kaggle.com

Utilize data from the Top-50 bestselling books on Amazon for 11 years (from 2009 to 2019). The dataset is publicly available on Kaggle.com. Download the CSV file from the link and move it to the same directory as your working notebook for convenience. Then proceed to the tasks.

For this part of the task, you will need to not only write the code but also answer the accompanying questions. Wherever you see the bold text "Answer:", insert the question into the file and provide the answer to it.


PART 1: Initial data
DESCRIPTION: Prepare table

  1. Read the csv file (use the read_csv function)
  2. Output the first five rows (use the head function)
  3. Display the dimensions of the dataset (use the shape attribute)
  • Question How many books does the dataset store?

7 variables (columns) are available for each of the books. Let's take a closer look at them:

  • Name - the name of the book
  • Author - the author
  • User Rating - rating (on a 5-point scale)
  • Reviews - number of reviews
  • Price - price (in dollars as of 2020)
  • Year - the year when the book entered the Top-50 rating
  • Genre - the genre

To simplify further work, let's tweak the variable names a little. As you can see, all the names start with a capital letter, and one even contains a space. This is highly undesirable and can be quite inconvenient. Let's change the case to lowercase and replace the space with an underscore (snake_case). Here a useful DataFrame attribute comes in handy: columns (you can simply assign a list of new names to this attribute):

df.columns = ['name', 'author', 'user_rating', 'reviews', 'price', 'year', 'genre']

PART 2: Initial data exploration
DESCRIPTION: Check Missing Values and Unique Genres

  1. Check if all rows have enough data: output the number of missing values (na) in each of the columns (use the isna and sum functions).
  • Question Are there any missing values in any of the variables? (Yes/No)
  2. Check what unique values are in the "genre" column (use the unique function).
  • Question What are the unique genres?
  3. Determine the maximum, minimum, mean, and median prices (use the max, min, mean, and median functions).
  • Question What is the maximum price?
  • Question What is the minimum price?
  • Question What is the mean price?
  • Question What is the median price?
  4. Now, look at the distribution of prices: create a histogram (use kind='hist').
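Assuming df is the DataFrame with the renamed columns from Part 1, the checks could look like:

df.isna().sum()                 # missing values per column
df["genre"].unique()            # unique genres
df["price"].max()
df["price"].min()
df["price"].mean()
df["price"].median()
df["price"].plot(kind="hist")   # price distribution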

PART 3: Data search and sorting
DESCRIPTION: Analysis of Book Ratings, Reviews, and Prices

  • Question What is the highest rating in the dataset?
  • Question How many books have this rating?
  • Question Which book has the most reviews?
  • Question Among the books that made it to the Top 50 in 2015, which one is the most expensive (you can use an intermediate dataframe)?
  • Question How many Fiction genre books made it to the Top 50 in 2010 (use &)?
  • Question How many books with a rating of 4.9 made it to the rating in 2010 and 2011 (use | or the isin function)?
    • Finally, let's sort all books that made it to the rating in 2015 and cost less than $8 in ascending order of price (use the sort_values function).
  • Question What is the last book in the sorted list?
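Possible filters for these questions, again assuming the renamed df from Part 1:

# Fiction books that made the Top 50 in 2010
df[(df["genre"] == "Fiction") & (df["year"] == 2010)]

# books rated 4.9 that made the rating in 2010 or 2011
df[(df["user_rating"] == 4.9) & (df["year"].isin([2010, 2011]))]

# books from 2015 cheaper than $8, in ascending order of price
df[(df["year"] == 2015) & (df["price"] < 8)].sort_values("price")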

PART 4: Data aggregation and table merging
DESCRIPTION: Aggregate Book Prices by Genre and Author

  1. First, let's look at the maximum and minimum prices for each genre (use the groupby and agg functions; for the minimum and maximum values, use min and max). Do not take all columns, select only those you need.
  • Question What is the maximum price for the Fiction genre?
  • Question What is the minimum price for the Fiction genre?
  • Question What is the maximum price for the Non Fiction genre?
  • Question What is the minimum price for the Non Fiction genre?
  2. Now, create a new dataframe that will contain the number of books for each author (use the groupby and agg functions; for counting, use count). Do not take all columns, select only those you need.
  • Question What is the dimension of the resulting table?
  • Question Which author has the most books?
  • Question How many books does this author have?
  3. Now create a second dataframe that will contain the average rating for each author (use the groupby and agg functions; for calculating the average value, use mean). Do not take all columns, select only those you need.
  • Question Which author has the minimum average rating?
  • Question What is the average rating for this author?
  4. Merge the last two dataframes so that for each author, you can see the number of books and the average rating (use the concat function with the axis parameter). Save the result in a variable.
  5. Sort the dataframe in ascending order of the number of books and the rating (use the sort_values function).
  • Question Which author is first on the list?
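One way to express the aggregations and the merge (the names of the intermediate frames are illustrative):

import pandas as pd

price_by_genre = df.groupby("genre").agg(min_price=("price", "min"),
                                         max_price=("price", "max"))

books_per_author = df.groupby("author").agg(book_count=("name", "count"))
rating_per_author = df.groupby("author").agg(avg_rating=("user_rating", "mean"))

# side-by-side merge on the shared author index
authors = pd.concat([books_per_author, rating_per_author], axis=1)
authors = authors.sort_values(["book_count", "avg_rating"])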

PART 5: Visualization
DESCRIPTION: Visualize Book Data Trends

  1. For each of the previous tasks, add 3–5 plots of different types of your choice. Style the plots so that each graph in each task is distinct and not similar to the others. You can use both matplotlib and seaborn.
  2. Don't forget to add the directive %matplotlib inline to the Jupyter file so that the plots are built inside the document.
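For example, two differently styled plots over the df from the previous parts (purely illustrative):

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(data=df, x="genre", y="price")   # average price per genre
plt.show()

df.groupby("year")["reviews"].sum().plot(kind="line", style="--o")
plt.ylabel("total reviews")
plt.show()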

linear_regression

Read the data from the table [Housing](https://drive.google.com/file/d/1-rAa4XT4_fI0dOBlMNuE6a7jB0wln_Qo/view)

This homework assignment will be entirely related to linear regression and its implementation. So let's break our homework into several parts:

  1. write a linear regression hypothesis function in vector form;
  2. create a function to calculate the loss function in vector form;
  3. implement one step of gradient descent;
  4. find the best parameters w for a dataset using the functions you wrote that predict the price of a house depending on the area, number of bathrooms, and number of bedrooms;
  5. find the same parameters using an analytical solution;
  6. use LinearRegression from the scikit-learn library to check the predicted values and compare the results.
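A hedged sketch of the vectorized functions from steps 1–3 and 5, where X is assumed to already contain a column of ones for the intercept:

import numpy as np

def hypothesis(X, w):
    # step 1: h(X) = Xw
    return X @ w

def loss_function(X, y, w):
    # step 2: mean squared error J(w) = (1/2m) * ||Xw - y||^2
    m = len(y)
    return np.sum((hypothesis(X, w) - y) ** 2) / (2 * m)

def gradient_step(X, y, w, learning_rate):
    # step 3: w := w - alpha * (1/m) * X^T (Xw - y)
    m = len(y)
    gradient = X.T @ (hypothesis(X, w) - y) / m
    return w - learning_rate * gradient

def analytical_solution(X, y):
    # step 5: normal equation w = (X^T X)^(-1) X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y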

Classification_and_evaluation_of_model_performance


This time you need to complete the tasks from this notebook. To solve the proposed tasks, you also need to download a dataset with bike rental data.

other_algorithms_and_tutoring


Using the accelerometer data from a mobile phone, you need to classify what activity a person is doing: walking, standing, running, or climbing stairs. You can find the dataset here.

Use SVM algorithms and a random forest from the scikit-learn library. You can use the raw accelerometer readings as features, but to improve the results of the algorithms, you can first prepare the dataset and calculate time-domain features. These features are described in more detail in this article.

Compare the results of both algorithms on different features and different models with each other. Use the classification_report method for comparison.
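A sketch of both ideas: simple time-domain features computed over fixed windows, and a classification_report comparison of the two classifiers. The accelerometer column names and the prepared X, y are assumptions about the dataset:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def time_domain_features(window: pd.DataFrame) -> dict:
    # a few common time-domain statistics per accelerometer axis
    features = {}
    for axis in ("accelerometer_X", "accelerometer_Y", "accelerometer_Z"):
        features[f"{axis}_mean"] = window[axis].mean()
        features[f"{axis}_std"] = window[axis].std()
        features[f"{axis}_min"] = window[axis].min()
        features[f"{axis}_max"] = window[axis].max()
    return features

# X, y: feature matrix and activity labels built from the windows (assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
for model in (SVC(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))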

learning_without_a_teacher


Task 1

In this task, you need to download this dataset. Here you will find 2 files - a two-dimensional dataset and an mnist dataset. For each of them, apply the K-means algorithm for clustering. Use the elbow method to find the optimal number of clusters.
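A sketch of the elbow method, assuming X holds one of the loaded datasets (2D points or flattened mnist images):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")   # the "elbow" hints at the optimal k
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()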

Task 2

Visualize the result of clustering. For the case of the mnist dataset, you will also need to use the PCA algorithm to reduce the dimensionality of your data to a 2-dimensional version.
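For the mnist case, PCA can reduce the data to two dimensions before plotting the clusters (a sketch; X and the cluster count are assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_2d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
plt.show()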

Recommender_systems


Take the movielens dataset and build a matrix factorization model. In the Surprise library, it is called SVD. Select the best parameters using cross-validation, then experiment with the other algorithms (SVD++, NMF) and choose the one that performs best.

You can find tips on how to build this model in the documentation for this library.
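A sketch of a cross-validated parameter search with Surprise (the grid values are illustrative):

from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")   # the MovieLens 100k dataset

param_grid = {"n_factors": [50, 100],
              "lr_all": [0.002, 0.005],
              "reg_all": [0.02, 0.05]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"], gs.best_params["rmse"])
# repeat with SVDpp and NMF and compare the scores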

Deep_learning

  1. Fill in the gaps in the code.

  2. Train the neural network.

  3. Build the necessary graphs.

  4. Find the network losses.

  5. Test the network on test data.

  6. Get quality metrics for each class of trained model using https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html.

  7. Draw conclusions.
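For step 6, a sketch assuming a trained Keras-style model and the test data from the notebook:

import numpy as np
from sklearn.metrics import classification_report

# model, X_test, y_test come from the notebook's earlier steps
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred))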

selection_of_hyperparameters_HM


As a homework assignment, you are asked to create a neural network using Keras mechanisms that will classify products from the fashion_mnist dataset (https://www.tensorflow.org/datasets/catalog/fashion_mnist).

You are to propose your own network architecture. The accuracy of the most naive but adequate neural network is approximately 91%. The accuracy of your model should be no lower than this. To achieve such values, you will need to experiment with network hyperparameters:

  1. number of layers;
  2. number of neurons;
  3. activation functions;
  4. number of epochs;
  5. the size of the batch;
  6. optimizer selection;
  7. different regularization techniques, etc.

Use the techniques you have learned to identify neural network training problems and then experiment.
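A possible starting point, with every hyperparameter below a candidate for the experiments in the list above (one architecture among many, not a prescription):

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.3),                      # one regularization option
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=20, batch_size=64, validation_split=0.1)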

Convolutional_neural_networks


Part 1

As a homework assignment, you are asked to create a neural network using Keras mechanisms that will classify products from the fashion_mnist dataset.

Unlike the previous assignment, you are asked to create a convolutional neural network. Choose a network architecture and train it on the data from the fashion_mnist dataset. Try to achieve the highest possible classification accuracy by manipulating the network parameters. Compare the accuracy of the resulting convolutional network with the accuracy of the multilayer network from the previous task. Draw your conclusions.
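One possible convolutional architecture to start from (a sketch, not a prescription):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])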

Part 2

In this part, we will work with the fashion_mnist dataset again.

Unlike the previous assignment, you are asked to create a convolutional neural network using VGG16 as a convolutional basis.

Train the resulting network on the data from the fashion_mnist dataset. Try to achieve the highest possible classification accuracy by manipulating the network parameters. During training, use feature extraction and fine-tuning.

Compare the accuracy of the resulting convolutional network with the accuracy of the multilayer network from the previous task. Draw conclusions.
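A sketch of the feature-extraction setup; note that VGG16 expects 3-channel inputs of at least 32×32, so the 28×28 grayscale images must be resized and stacked to three channels first:

from tensorflow import keras

conv_base = keras.applications.VGG16(weights="imagenet",
                                     include_top=False,
                                     input_shape=(48, 48, 3))
conv_base.trainable = False   # feature extraction: freeze the base;
                              # unfreeze the top blocks later for fine-tuning

model = keras.Sequential([
    conv_base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])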

Recurrent_neural_networks

As a homework assignment, you are asked to create a recurrent neural network using Keras mechanisms that will classify reviews from the imdb dataset.

Unlike the example in Module 9, we will use a recurrent neural network. Experiment with the structure of the network: RNN, LSTM, bidirectional, and deep variants.

Compare the results and draw conclusions.
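A sketch of one variant (a bidirectional LSTM); swapping the recurrent layer for SimpleRNN, or stacking several recurrent layers, gives the other configurations to compare:

from tensorflow import keras

max_features, maxlen = 10000, 500
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = keras.Sequential([
    keras.layers.Embedding(max_features, 32),
    keras.layers.Bidirectional(keras.layers.LSTM(32)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])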

NL__tasks

Make a summary of the following text using the NLP libraries nltk and SpaCy:

The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP's emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope into orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock with the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains "the largest, fastest, winged hypersonic aircraft in history," having regularly flown at twenty-five times the speed of sound.

Hint

First of all, we need to import the necessary libraries. For SpaCy, this can be done using the command:

import spacy

Note that NLTK may require additional data to be loaded, such as a list of stop words or tokenizers.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# download the tokenizer models and the stop-word list if missing
nltk.download('punkt')
nltk.download('stopwords')

Before you can start working with SpaCy, you need to download the required language model. For example, for English, we can load the "en_core_web_sm" model:

nlp = spacy.load('en_core_web_sm')

Preparation of the text

Before you start creating a text summary, you need to prepare the text. This includes removing unnecessary characters, tokenization (breaking the text into individual words or sentences), removing stop words (words that do not carry significant information), and, if necessary, other text processing such as stemming or lemmatization.

Text to process

text = "This is an example sentence for tokenization and lemmatization."

Tokenization

doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

NLTK also provides advanced features for text processing. Using NLTK methods like word_tokenize, sent_tokenize, or stopwords, we can get the tokenized words and sentences, as well as a list of stop words.

tokens = word_tokenize(text)
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

And let's not forget about punctuation

from string import punctuation

punctuation = punctuation + '\n'

It is also possible to calculate the frequency of appearance of certain words in the text (but it is worth remembering that this should be done after excluding all punctuation marks)

# count word frequencies, skipping stop words and punctuation
word_frequencies = {}
for word in doc:
    if word.text.lower() not in stop_words:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequencies.keys():
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1

When we have the prepared text and have used SpaCy or NLTK to extract the necessary information, we can create a text summary. This can be done, for example, by selecting the most important sentences from the text, taking into account their weight or the frequency of certain words; one possible scoring step is sketched below.
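A possible way to build the sentence scores used later in this section, continuing the SpaCy example above (normalizing by the most frequent word is a common convention, not a requirement):

# normalize word frequencies by the most frequent word
max_frequency = max(word_frequencies.values())
word_frequencies = {word: freq / max_frequency
                    for word, freq in word_frequencies.items()}

# score each sentence by the frequencies of the words it contains
sentence_tokens = list(doc.sents)
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies:
            sentence_scores[sent] = (sentence_scores.get(sent, 0)
                                     + word_frequencies[word.text.lower()])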

The heapq library

The heapq library is part of the Python standard library and provides functionality for working with the heap data structure. One of the objects it exports, nlargest, is a function that allows you to find the largest elements of an iterable object.

from heapq import nlargest

The nlargest(n, iterable, key=None) function accepts three arguments:

  • n is the number of largest elements you want to retrieve
  • iterable is the iterable object from which you want to select the largest elements
  • key (optional) is a function that defines the key by which elements are compared (for example, key=str.lower)

The nlargest function returns a list of the n largest elements from an iterable. These items are sorted in descending order. If n is greater than the length of the iterable, the function returns the entire iterable in sorted order.

So, from heapq import nlargest allows you to use the nlargest function to find the largest elements of an arbitrary iterable object.

select_length = int(len(sentence_tokens))
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary

In this case, the nlargest function is used to find the select_length highest-scoring elements of the sentence_scores dictionary. The dictionary keys represent sentences, and the values represent their scores (weights). The key argument is given as sentence_scores.get, which means the dictionary's get method is used to retrieve the score of each sentence, and that score serves as the comparison criterion. So, the summary variable will contain a list of the select_length best sentences from the sentence_scores dictionary in descending order of score.

A web application that allows you to upload images for classification

You will need to build a web application to visualize your neural network using Streamlit or Dash. As a homework assignment, you can continue the homework for the Convolutional Neural Networks module.

Assignment.

Create a web application that allows you to upload images for classification using your trained neural network. Display the input image on the web page, the loss and accuracy graphs for the model, and the classification results (probabilities for each class and the predicted class) in a convenient format. Add an interface to choose between the two models (the convolutional neural network from Part 1 and the VGG16-based model from Part 2).

Required functionality:

  • An interface to upload an image that the user wants to classify.

  • Use of the trained model to predict the class of the uploaded image.
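A minimal Streamlit sketch of this functionality; the model file names and the 28×28 grayscale preprocessing are assumptions that depend on how the models from the previous module were saved:

import numpy as np
import streamlit as st
from PIL import Image
from tensorflow import keras

# hypothetical file names for the two saved models
MODELS = {"CNN (Part 1)": "cnn_model.h5",
          "VGG16-based (Part 2)": "vgg16_model.h5"}

choice = st.selectbox("Model", list(MODELS))
model = keras.models.load_model(MODELS[choice])

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
if uploaded is not None:
    image = Image.open(uploaded).convert("L").resize((28, 28))
    st.image(image, caption="Input image")
    x = np.array(image).reshape(1, 28, 28, 1) / 255.0
    probs = model.predict(x)[0]
    st.bar_chart(probs)   # probability for each class
    st.write("Predicted class:", int(np.argmax(probs)))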
