Assignment 2

Data and Algorithms:

Cities and Earthquakes

an exercise in geographic data analysis

Last Modified: 5th October 2021 (BB)

Question 1: World Cities

In this coursework exercise, you will download data from the provided link and read it in as a CSV file using the Pandas data analysis package for Python. The data we will use contains a variety of information about cities from around the world.

Coding Techniques for Working with DataFrames

To complete these tasks you will need to access and filter a DataFrame. The DataFrame data structure has many convenient features for extracting and ordering information. Although conceptually it can be thought of as a computational representation of a table, it is quite a complex data structure and takes a while to master. The following questions can be done with only a small but powerful set of DataFrame operations, and the following examples of typical forms of programming with DataFrames should be useful for coding your answers.

Loading data from a CSV file into a DataFrame

DataFrames are specifically designed to handle data organised in a tabular format. Hence, as we would expect, since CSV is the standard format for tabular data, it is very easy to create a DataFrame by loading data from a CSV file.

Getting the Data for this Question

Download the data file worldcities.csv from the module's data repository, and put the file in the same directory as this Jupyter notebook file. Then, by running the following cell, we can set the global variable WC_DF to a DataFrame containing the information from worldcities.csv.
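A minimal sketch of the loading step. So that it runs without the real file, it reads from a small in-memory stand-in; the rows and population numbers are illustrative placeholders, not values from worldcities.csv, and the column names are assumed.

```python
import io
import pandas as pd

# Hypothetical stand-in for the file on disk (placeholder rows and numbers).
# With the real file in place you would write:
#     WC_DF = pd.read_csv("worldcities.csv")
sample_file = io.StringIO(
    "city,city_ascii,country,population\n"
    "São Paulo,Sao Paulo,Brazil,600\n"
    "Kunduz,Kunduz,Afghanistan,500\n"
    "Ambon,Ambon,Indonesia,300\n"
)

# read_csv accepts a file path or any file-like object.
WC_DF = pd.read_csv(sample_file)
print(WC_DF.shape)  # (number of rows, number of columns)
```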

Be sure to keep the same variable name WC_DF for this global variable, otherwise most of the following code will not work and you will break the autograder.

Checking the contents of a DataFrame

Pandas provides the following useful methods that enable you to quickly check the contents of a DataFrame:

Note that the head() and describe() methods are actually operations that return a new DataFrame object. If this value is returned by the last line of a cell it will be displayed as a table, but if it is generated elsewhere in the code you will not see any output unless you use the display function from the IPython.display module.
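As a hedged illustration of head() and describe(), the sketch below uses a small hand-made stand-in for WC_DF (placeholder rows, assumed column names) and prints the results explicitly, which works outside a notebook as well.

```python
import pandas as pd

# Hypothetical stand-in for WC_DF; the population numbers are placeholders.
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz", "Baghlan"],
    "country": ["Brazil", "Indonesia", "Afghanistan", "Afghanistan"],
    "population": [600, 300, 500, 200],
})

first_rows = WC_DF.head(3)   # a new DataFrame holding the first 3 rows (default is 5)
summary = WC_DF.describe()   # count, mean, std, min, quartiles, max of numeric columns

print(first_rows)
print(summary)
```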

Accessing DataFrame columns and rows

Each column of a DataFrame is a list-like object called a Series. Elements and slices of a Series can be accessed in a similar fashion to those of a list. The following illustrates how to get the Series containing the first 5 elements of the city column of WC_DF:
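A sketch of this slice, assuming a small stand-in for WC_DF whose rows are placeholders rather than real dataset contents:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder rows).
WC_DF = pd.DataFrame({
    "city": ["São Paulo", "Ambon", "Kunduz", "Baghlan", "Denov", "Pisa"],
    "country": ["Brazil", "Indonesia", "Afghanistan", "Afghanistan",
                "Uzbekistan", "Italy"],
})

# Select the 'city' column (a Series), then slice it like a list.
top_5_cities = WC_DF["city"][:5]
print(top_5_cities)
```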

In the above output, the left hand column of the displayed value of top_5_cities shows the index label of each element. One of the differences between a Series and an ordinary list is that, whereas a list always has integers for its index labels, a Series can have different kinds of values for these. For instance (though there is no reason to do this for the current assignment) we could set the index values to alphabetic letters, as follows:
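One way to relabel the index with letters, sketched on a stand-in Series (the city list is a placeholder):

```python
import pandas as pd

# Hypothetical stand-in for the top_5_cities Series.
top_5_cities = pd.Series(["São Paulo", "Ambon", "Kunduz", "Baghlan", "Denov"])

# Replace the default integer labels 0..4 with letters.
top_5_cities.index = ["a", "b", "c", "d", "e"]
print(top_5_cities)
print(top_5_cities["b"])  # elements can now be looked up by letter
```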

You can also use .values to return an array of the column values without the index:
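A small sketch using a numeric column, since that is where arrays are most useful; the stand-in DataFrame and its numbers are placeholders:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder numbers).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz"],
    "population": [600, 300, 500],
})

populations = WC_DF["population"].values  # a NumPy array, with no index labels
print(populations)

# Arithmetic on an array applies to every element at once.
doubled = populations * 2
print(doubled)
```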

An array is also a list-like data structure. It does not have an index. The main difference between a list and an array is that the array is optimised for storing large amounts of information and for efficiently applying numerical and other operations to all of its elements. Hence, arrays are usually preferred to lists when handling large amounts of information, or when storing numerical vectors.

You can also easily find the column names of the DataFrame using .columns, for example:
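A sketch on a stand-in DataFrame (the column names here are assumed for illustration):

```python
import pandas as pd

# Hypothetical stand-in for WC_DF with a single placeholder row.
WC_DF = pd.DataFrame({
    "city": ["Ambon"],
    "city_ascii": ["Ambon"],
    "country": ["Indonesia"],
    "population": [300],
})

column_names = WC_DF.columns  # an Index object holding the column labels
print(column_names)
```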

Note: The Index returned here is yet another type of list-like object. It is similar to an array, except that it is used for indexing a Series or DataFrame. You do not usually need to create or deal with Index objects directly, since this is done automatically when you create and manipulate DataFrames. So you will normally only see one when you want to look at the columns or rows of a DataFrame. But what you should be aware of, when dealing with DataFrames, is that the word index can refer to several different types of thing.

In many cases you can treat Series, array and Index objects like lists, and if you want to change them to an ordinary list you can just use the list function, as in the following:
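A sketch of the conversion, again on a placeholder stand-in for WC_DF:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder rows).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz"],
    "population": [600, 300, 500],
})

column_list = list(WC_DF.columns)            # Index  -> list
city_list = list(WC_DF["city_ascii"])        # Series -> list
pop_list = list(WC_DF["population"].values)  # array  -> list
print(column_list, city_list, pop_list)
```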

We can refer to rows of a DataFrame either by the expression DF.loc[label], where label is the index label of the row we want, or by DF.iloc[n], where n is an int giving the position of the row in the DataFrame. In the case of WC_DF, the labels are integers that match the row positions, so we would get the same result using either. You could test this. You could also see the difference if you try finding an element of the top_5_cities Series defined above, after its index labels have been replaced by letters. In this case you could access elements either by letters, using loc, or by ints, using iloc.
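The difference between the two can be sketched as follows, using a placeholder stand-in for WC_DF and a relabelled copy of it:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder rows).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz"],
    "population": [600, 300, 500],
})

# With the default integer labels, label 1 and position 1 are the same row.
by_label = WC_DF.loc[1]
by_position = WC_DF.iloc[1]

# After relabelling with letters, loc needs a letter and iloc still needs an int.
relabelled = WC_DF.copy()
relabelled.index = ["a", "b", "c"]
row_b = relabelled.loc["b"]
row_1 = relabelled.iloc[1]
print(row_b)
```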

Iterating through the rows of a DataFrame

A convenient way of going through the rows of a DataFrame to perform some operation is by using the iterrows method in a for loop. This enables you to get both the index label and the row itself, for each successive row of the DataFrame. The following code is a simple example:
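A sketch of such a loop over a placeholder stand-in for WC_DF:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder rows).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz"],
    "country": ["Brazil", "Indonesia", "Afghanistan"],
})

summaries = []
# iterrows yields (index label, row) pairs; each row is a Series
# whose elements are accessed by column name.
for label, row in WC_DF.iterrows():
    summaries.append(f"{label}: {row['city_ascii']} ({row['country']})")

print(summaries)
```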

Sorting the rows of a DataFrame

It is easy, and often very useful, to sort the DataFrame by column values using .sort_values, for example:
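A sketch of sorting by one column, on a placeholder stand-in for WC_DF:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder numbers).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz"],
    "population": [600, 300, 500],
})

# sort_values returns a new DataFrame; WC_DF itself is left unchanged.
# Rows keep their original index labels, so the index is no longer in order.
by_population = WC_DF.sort_values("population", ascending=False)
print(by_population)
```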

Note on encodings of the city name

There are two columns that hold the city name. The first column name is 'city' and the second is 'city_ascii'. There are various ways in which textual information can be encoded into bytes. These days Unicode characters encoded using UTF-8 are pretty standard. But the older ASCII code, which uses a single byte per character, is still commonly used. Unicode provides a huge variety of text characters and other symbols, whereas ASCII is quite limited (mainly to characters and symbols found in standard English). But ASCII is simpler and in some ways easier to deal with than UTF-8. In the following questions you will be asked to use the ASCII version of the city name (from the city_ascii column). This is mainly just to make you aware that there are different encodings of text strings, but it will also prevent certain problems that could occur in the Autograder if different people used different encodings.

Filtering DataFrames

By filtering we mean keeping some parts that we want and throwing away others. Typically, we look for rows that match some condition, and the filter condition is often some constraint involving the values for that row in one or more columns. Pandas DataFrames can be filtered according to the values of a column by using a boolean expression, for example:
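A sketch of a simple filter, on a placeholder stand-in for WC_DF:

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder numbers).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz", "Baghlan"],
    "country": ["Brazil", "Indonesia", "Afghanistan", "Afghanistan"],
    "population": [600, 300, 500, 200],
})

# Keep only the rows whose 'country' value is 'Afghanistan'.
afghan_cities = WC_DF[WC_DF["country"] == "Afghanistan"]
print(afghan_cities)
```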

This way of filtering is a very powerful and useful aspect of DataFrames. However, the syntax of the filter operation is rather unusual and a bit difficult to understand.

What is happening can be explained by these steps in the way a filter expression is evaluated:
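The evaluation can be made explicit by splitting a filter expression into its parts. The sketch below does this on a placeholder stand-in for WC_DF; the intermediate variable names are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for WC_DF (placeholder numbers).
WC_DF = pd.DataFrame({
    "city_ascii": ["Sao Paulo", "Ambon", "Kunduz", "Baghlan"],
    "country": ["Brazil", "Indonesia", "Afghanistan", "Afghanistan"],
    "population": [600, 300, 500, 200],
})

country_column = WC_DF["country"]        # 1. select a column (a Series)
mask = country_column == "Afghanistan"   # 2. compare every element: a Series of True/False
selected = WC_DF[mask]                   # 3. keep the rows where the mask is True

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
combined = WC_DF[(WC_DF["country"] == "Afghanistan") & (WC_DF["population"] >= 300)]
print(mask)
print(combined)
```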

You do not necessarily need to follow all of that precise description of filtering, but it will be extremely helpful if you are able to construct filtering operations similar to the above example. You will see another example below, in relation to the earthquake data you will be processing.

Overview of Question 1 tasks

This question requires you to write functions to carry out the following specific tasks.

Full details for each task are explained below.

Question 1a

As we have seen, the names of some cities include accents or symbols that are not represented in the standard ASCII character set. Write a function non_ascii_cities, which returns a set containing all the cities that occur in the worldcities.csv dataset whose name cannot be properly represented using only ASCII characters.

Your function should not have any arguments. You should assume that the global variable WC_DF has already been initialised by running the first code cell in this notebook (see above).

Question 1b

One issue that you will discover if you investigate the worldcities.csv data is that there are many different cities that have the same name. You may find it interesting to discover which are the most common city names. But for this question you need to write a function num_cities_occurring_n_times(n), such that:

Question 1c

Write a function that returns a dictionary (a dict object), whose keys are all the country name strings that occur in the worldcities data and whose values are ints giving the number of cities of that country that are included in the dataset.

Question 1d

Write a function largest_cities_dataframe that takes an int argument n and uses the pandas DataFrame WC_DF to return a new DataFrame containing n rows corresponding to the n largest cities in terms of population size, in order of decreasing population size.

You should return a dataframe such that it has the same columns as the WC_DF and each row has the same values as a corresponding row of WC_DF. It does not matter if the row indexes are the same. (They may or may not be the same depending on the specific way that you create the new DataFrame.)

NOTE: In answering 1d you may assume that no two cities have exactly the same population, which is almost but not quite certain when dealing with large numbers like this. But, of course, when dealing with quantities where multiple data records could have the same value, we need to be careful, because this may not be the case. For example, if we are interested in what equipment students own, we might think it would be informative to find 'the top 10 students owning the most laptops'. In this case there could be: 1 student with 3 laptops, 23 students with 2 laptops, 160 with 1 laptop and 3 who do not own a laptop. In such a case it is not meaningful to pick the 'top 10' in terms of laptop ownership. A similar problem could potentially occur with the earthquake data that we will look at later, because the earthquake magnitudes are only recorded to 1 decimal place.

Question 1e

Define a function big_cities_in_country(country, population) that takes as arguments a string corresponding to the name of a country and an integer, which refers to a population number. The function should return a list of the form

[("city1", pop1), ("city2", pop2),... ],

where each pair ("cityN", popN) is a tuple consisting of the name, in ASCII form, of a city in the given country, followed by an int, which is the population of that city (according to the worldcities data). The list should include all and only those cities in the country whose population is greater than or equal to the given population argument. The list should be ordered so that the ("cityN", popN) items occur in increasing order of the population size popN.

Question 1f

Create a function that, given a country name, returns an int which is the total population of people living in all the cities of that country, as given in WC_DF.

Hints:

Question 2: Earthquakes - Web Access and Pandas DataFrames

In this coursework exercise, you will learn how to download live information from the web and process it using the Pandas data analysis package for Python.

The data we will use as an example is from the United States Geological Survey (USGS), which provides a wide range of geographic and geological information and data. We shall be using their data relating to seismological events (i.e. Earthquakes) from around the world, which is published in the form of continually updated CSV files. Information about these feeds can be found here. The URL for the particular feed we shall be using is given below.

Questions Overview

Question 2a: Read in data file

Read earthquake data from the USGS live feed CSV all_day.csv into a Pandas DataFrame. The data can be obtained directly from http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv and read into a Pandas DataFrame.

Note: For this question you do not need to download and save the file all_day.csv. It should be loaded directly from the web feed. However, while testing, if you have no internet connection or a bad connection you could download a copy of the file. But remember to put it back to downloading the current one before you submit. Note also that all_day.csv is a live file, which lists quakes recorded during the past 24 hours, and is updated every minute, so of course, you will not always get the same file or the same results. More information about this and other earthquake feeds provided by USGS can be found here.

You can use the following cell to test if you have read the quake data into QUAKE_DF

Note:

The columns containing latitude and longitude values are labelled differently in the worldcities.csv and the earthquake data from USGS. This is a minor but very typical form of incompatibility between data formats that you will often need to deal with when working with real data.

More examples of useful pandas functions

Here we show you some more pandas functions that you may find useful in this exercise.

As we have seen, versatile filtering and sorting capabilities are provided by pandas. To get more understanding of these, you should look at tutorials of using Pandas DataFrames. But the following example illustrates how you can find and display quakes whose depth is greater than or equal to a given threshold:
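A sketch of such a depth filter. It uses a small hand-made stand-in for QUAKE_DF with placeholder rows; the column names 'place', 'depth' and 'mag' are assumed to match those of the USGS feed.

```python
import pandas as pd

# Hypothetical stand-in for QUAKE_DF (placeholder rows, assumed column names).
QUAKE_DF = pd.DataFrame({
    "place": ["quake A", "quake B", "quake C"],
    "depth": [10.0, 35.5, 120.0],  # km
    "mag": [1.2, 4.6, 2.9],
})

depth_threshold = 30

try:
    # Keep only quakes at least as deep as the threshold.
    deep_quakes = QUAKE_DF[QUAKE_DF["depth"] >= depth_threshold]
    print(deep_quakes)
except NameError:
    # Reached only if QUAKE_DF has not been defined yet.
    print("QUAKE_DF has not been set yet")
```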

Note: The QUAKE_DF global variable needs to be set before these examples will work, so I am using a try/except construct to avoid getting an error.

You can also find max and min values in a column. For example:
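A sketch on a placeholder stand-in for QUAKE_DF (the column names are assumed), which also shows one way to recover the row that holds the maximum:

```python
import pandas as pd

# Hypothetical stand-in for QUAKE_DF (placeholder rows, assumed column names).
QUAKE_DF = pd.DataFrame({
    "place": ["quake A", "quake B", "quake C"],
    "mag": [1.2, 4.6, 2.9],
})

strongest = QUAKE_DF["mag"].max()
weakest = QUAKE_DF["mag"].min()

# idxmax gives the index label of the largest value; loc then fetches that row.
strongest_row = QUAKE_DF.loc[QUAKE_DF["mag"].idxmax()]
print(strongest, weakest, strongest_row["place"])
```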

Question 2b: Find Powerful Quakes

Write a function powerful_quakes that takes a numerical argument and returns a DataFrame including all the quakes in QUAKE_DF that have a magnitude greater than or equal to the given argument.

Question 2c: Find n+ most powerful earthquakes

Produce a DataFrame whose rows represent the n (or maybe more) most powerful quakes in descending order of magnitude. The returned DataFrame should show at least n quakes and may sometimes show more, since we do not want to leave out any quake that is equally powerful as the last quake listed in the DataFrame. More specifically, we want the function to return a DataFrame that:

Note:

The above definition of the requirements is clear and precise. Though you may ask for help and advice regarding implementation, you will not be given help with understanding the specification.

Distance between locations on the Earth's surface

Clearly, when dealing with data pertaining to locations in space, the distance between such locations is often of great significance when interpreting or extracting further information from the data.

To help answer the following questions you are provided with the function haversine_distance, which implements the Haversine formula to find the surface distance in kilometres between two locations that are specified in terms of latitude and longitude values. When finding distances between points on the surface of the Earth we need to use this formula, rather than the simpler Pythagorean distance formula, because the Earth's surface is (approximately) a sphere rather than a flat plane.
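For reference, the Haversine formula itself can be sketched as below. This is a generic version, not necessarily identical to the haversine_distance function provided with the assignment: the function name, the argument order and the Earth radius of 6371 km are assumptions made here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_distance_sketch(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)

    # Central angle times radius gives the distance along the surface.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way round the equator.
print(haversine_distance_sketch(0, 0, 0, 90))
```

Under the spherical assumption, two points a quarter of the way round the equator should be a quarter of the circumference apart, which is a convenient sanity check.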

Question 2d: Sort quakes by distance from a given location

Write a function quake_distance_from_loc_dataframe(loc) satisfying the following requirements:

Note:

You will need to do some research to find out how to create a new column and set its values.

Question 2e: Identifying Endangered Cities

The idea of this question is to identify possible emergency situations by finding cities that are likely to suffer from the effects of an earthquake.

Effect of an earthquake at a distance from its epicenter

The effect of an earthquake on a city or person will depend on their distance from the source of the quake. It will also depend on many other factors, and even the dependence on distance to source is very complex. However, after a bit of background research, Brandon has come up with a simple formula which hopefully at least gives a very crude estimate of the relative effect of a quake with a particular magnitude and depth on a surface location at a known surface distance from the quake's epicenter. The calculated effective magnitude of an earthquake will be less than the source magnitude. For instance, a magnitude 9 quake at a depth of 100km (which is likely to be extremely destructive) would have an effective magnitude of 5 at its epicenter (directly above the source) and 3.585 at a point on the Earth's surface 500km away from the epicenter.

The epicenter of an earthquake is the point on the earth that is directly above its source. Thus the effective magnitude at the epicenter is just the effective magnitude of the quake at surface distance zero:
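The assignment's own formula is not shown in this text. As a hedged reconstruction only, the sketch below is one formula that reproduces both worked values quoted above (5 at the epicenter and 3.585 at 500km for a magnitude 9 quake at 100km depth); the provided function may differ, and the function names here are hypothetical.

```python
import math


def effective_magnitude_sketch(magnitude, depth_km, surface_distance_km):
    """Assumed form: magnitude minus 2*log10 of the straight-line distance to the source."""
    # Straight-line distance from the quake source to the surface point,
    # treating depth and surface distance as the sides of a right triangle.
    source_distance_km = math.hypot(depth_km, surface_distance_km)
    return magnitude - 2 * math.log10(source_distance_km)


def epicenter_magnitude_sketch(magnitude, depth_km):
    """Effective magnitude at surface distance zero, directly above the source."""
    return effective_magnitude_sketch(magnitude, depth_km, 0)


print(epicenter_magnitude_sketch(9, 100))
print(round(effective_magnitude_sketch(9, 100, 500), 3))
```

Note that this assumed form is undefined when depth and surface distance are both zero, since the logarithm of zero does not exist.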

Note: For any given quake, its effective_magnitude at any point on Earth is always less than or equal to its epicenter_magnitude.

Specification of the endangered_cities function

Now we get to the specification of the function.

Write a function endangered_cities(minimum_population, minimum_effective_magnitude) that takes two numerical arguments, an int (minimum_population) and a float (minimum_effective_magnitude), and returns a list specifying all those cities listed in WC_DF such that:

Example Output:
 in [165]    %time        
             endangered_cities(200000, 0.5)

 out[165]    CPU times: user 1min 58s, sys: 768 µs, total: 1min 58s
             Wall time: 1min 58s
             [('Baghlan', 'Afghanistan', (36.1393, 68.6993)),
              ('Kunduz', 'Afghanistan', (36.728, 68.8725)),
              ('Mazar-e Sharif', 'Afghanistan', (36.7, 67.1)),
              ('Ambon', 'Indonesia', (-3.7167, 128.2)),
              ('Denov', 'Uzbekistan', (38.2772, 67.8872))]
Notes:

Optional Exercises

Having got this far, you may find it interesting and informative to do some more processing of the city and earthquake information. Since the previous exercises were designed so they can be quickly and reliably assessed by the autograding software, they involve coding particular functions with very specific requirements. But the following exercises are more open ended and give suggestions for interactive and visual use of the city and earthquake data.

To ensure that your assignment submission works with the autograding software when submitted, it is recommended that you now save this file and make a new copy with a different name, such as Earthquakes_optional.ipynb. Then use the new file to continue with the optional exercises.

Constructing a city risk status alert DataFrame

A government or other organisation may want to monitor a certain list of cities with regard to whether they may be at risk of earthquake damage. To answer this question you should create a function that uses the endangered_cities function you have defined above to create such a DataFrame.

Your function city_risk_alert should return a pandas DataFrame that includes the status of 'ENDANGERED' or 'SAFE' for a certain city. The dataframe should also contain the city name, country and status for each city input. You could also extend this to add more columns showing things like the distance and magnitude of the nearest earthquake. And you could perhaps make it so any endangered cities were put at the top of the list.

For example:

display( city_risk_alert( ['Rome', 'Milan', 'Pisa'] ) )

might give the following output:

city     country    status
Pisa     Italy      ENDANGERED
Rome     Italy      SAFE
Milan    Italy      SAFE

Visualisation Exercise: display endangered cities on a map

The code below creates a Map object using the ipyleaflet module and uses this to display powerful quakes on the map. If you have coded the powerful_quakes function for Question 2b above, the code in the cell below the map should draw the detected powerful quakes onto the map at their correct locations.

To install the ipyleaflet module use pip3 install ipyleaflet. If the map does not display after installation, be sure to restart the kernel, and close and reopen this file. We provide the draw_circle_on_map function, which adds a circle at a specified location on the map, where the location is defined by its longitude and latitude.

More Ideas for Graphical Display

It would be nice to also see the endangered cities on the map. For an ambitious exercise you could see if you can draw lines on the map running from the locations of powerful earthquakes to the cities that are endangered by them.