Distances
The CEPII has built and made available two datasets providing geographical variables useful for empirical economic research. A common use of these files is the estimation by trade economists of gravity equations describing bilateral patterns of trade flows. Other datasets providing geographical and distance data have been proposed, notably those developed by Jon Haveman, Vernon Henderson and Andrew Rose. We try to improve upon the existing sets of variables in terms of geographical coverage, measurement and the number of variables provided. Covariates such as bilateral distance, contiguity, or colonial historical links have also been used in fields other than international trade: in the study of bilateral flows of foreign direct investment, for instance, but also by researchers interested in explaining migration patterns, international flows of tourists, telephone traffic, etc. Even outside economics, researchers in several social sciences use these types of variables. Political scientists, for instance, use distance and contiguity (among other determinants) to explain why some pairs of countries have a higher probability than others of going to war.

The first of these datasets incorporates geographical variables for 225 countries in the world, including the geographical coordinates of their capital cities, the languages spoken in the country under different definitions, a variable indicating whether the country is landlocked, etc. The second dataset is dyadic, in the sense that it includes variables valid for pairs of countries. Distance is the most common example of such a variable, and the file includes different measures of bilateral distances (in kilometers) available for most countries across the world (again 225 countries in the current version of the database).

The geo_cepii file (geo_cepii.xls & geo_cepii.dta) provides data on countries and their main city or agglomeration. The country-level variables include three identification codes for the country under the ISO classification and the country’s area in square kilometers, which is used in particular to calculate its internal distance. Variables indicating whether the country is landlocked and which continent it is part of are also included.
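To illustrate how an area can be turned into an internal distance, here is a minimal sketch. It assumes the disk-based approximation 0.67 × sqrt(area / π), which is common in the gravity literature; the text above does not state the exact formula used for the files, so the constant and the function name `internal_distance_km` are assumptions made for illustration only.

```python
import math


def internal_distance_km(area_km2, scale=0.67):
    """Approximate a country's internal distance from its area.

    Assumption: the country is treated as a disk, and the average distance
    between two points is approximated by scale * sqrt(area / pi).
    The default scale of 0.67 is an assumed convention, not taken from the text.
    """
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return scale * math.sqrt(area_km2 / math.pi)


# Illustrative area in square kilometers (not a real country).
example_area_km2 = 10_000.0
print(round(internal_distance_km(example_area_km2), 1))
```

The `scale` parameter is exposed so that a different convention can be substituted without touching the rest of the calculation.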

There are several language variables that can be used to create different indexes of language proximity or dummy variables for common language in dyadic applications like gravity equations. The sources for all language information are the web site www.ethnologue.org and the CIA World Factbook. For each country, we report the official languages (up to three), as well as the languages spoken by at least 20% of the population and the languages spoken by between 9 and 20% of the population (up to four languages in each of those cases). Colonial linkage variables are also often used by economists to proxy for similarities in cultural, political or legal institutions. Our dataset provides several variables (based on the CIA World Factbook and on the Correlates of War Project run by political scientists, available at cow2.la.psu.edu) that identify, for each country, up to 4 long-term and up to 3 short-term colonizers over the whole history of the country.

Distance calculation requires information on the geographical coordinates of at least one city in each country. The simplest measure of geodesic distance considers only the main city of the country, reported here with its English and French names, latitude and longitude. In most cases, the main city is the capital of the country. However, for 13 out of the 225 countries, we considered that the capital was not populated enough to represent the “economic center” of the country. For these countries, we provide distance data calculated for both the capital city and the economic center. Consequently, there are 238 (225+13) observations in the geo_cepii.xls file. Also included is a variable providing the number of cities for each country (available in the www.world-gazetteer.com dataset) used to calculate our weighted distances described in the next section.

The dist_cepii file (dist_cepii.xls or dist_cepii.dta) provides the bilateral data: the different distance measures and dummy variables indicating whether the two countries are contiguous, share a common language, have had a common colonizer after 1945, have ever had a colonial link, have had a colonial relationship after 1945, or are currently in a colonial relationship. There are two common-language dummies: the first is based on whether the two countries share a common official language, and the second is set to one if a language is spoken by at least 9% of the population in both countries. Giving a precise definition of a colonial relationship is obviously a difficult task. Colonization is here a fairly general term that we use to describe a relationship between two countries, independently of their level of development, in which one has governed the other over a long period of time and has contributed to the current state of its institutions.

There are two kinds of distance measures: simple distances, for which only one city per country is needed to calculate international distances; and weighted distances, for which we need data on the principal cities in each country. The simple distances are calculated following the great circle formula, which uses the latitudes and longitudes of either the most important city of each country (in terms of population) or its official capital. These two simple distance variables incorporate internal distances based on the areas provided in the geo_cepii.xls file. The two weighted distance measures use city-level data to assess the geographic distribution of population inside each nation. The idea is to calculate the distance between two countries based on bilateral distances between the largest cities of those two countries, those inter-city distances being weighted by the share of the city in the overall country’s population. This procedure can be used in a totally consistent way for both internal and international distances. We use latitudes, longitudes and population data of the main agglomerations of all countries available in www.world-gazetteer.com. The distance formula used is a generalized mean of city-to-city bilateral distances developed by Head and Mayer (2002), which takes the arithmetic mean and the harmonic mean as special cases. We provide the two variables corresponding to those cases.
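The two kinds of distance described above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the files: the great-circle distance is computed with the haversine form on a sphere of radius 6371 km (an assumed value), city weights are normalized by the total population of the cities listed for each country, and the generalized mean is written with an exponent `theta`, where `theta = 1` gives the arithmetic mean and `theta = -1` the harmonic mean. Function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def weighted_distance_km(cities_i, cities_j, theta):
    """Population-weighted generalized mean of city-to-city distances.

    Each city is a (latitude, longitude, population) tuple.
    theta = 1 gives the arithmetic mean, theta = -1 the harmonic mean.
    City pairs at zero distance are not handled when theta is negative.
    """
    if theta == 0:
        raise ValueError("theta must be non-zero")
    pop_i = sum(pop for _, _, pop in cities_i)
    pop_j = sum(pop for _, _, pop in cities_j)
    total = 0.0
    for lat_k, lon_k, pop_k in cities_i:
        for lat_l, lon_l, pop_l in cities_j:
            d_kl = great_circle_km(lat_k, lon_k, lat_l, lon_l)
            total += (pop_k / pop_i) * (pop_l / pop_j) * d_kl ** theta
    return total ** (1.0 / theta)


# Hypothetical city lists: (latitude, longitude, population).
country_i = [(10.0, 0.0, 3_000_000), (12.0, 2.0, 1_000_000)]
country_j = [(40.0, 20.0, 2_000_000), (45.0, 25.0, 2_000_000)]

print(weighted_distance_km(country_i, country_j, theta=1))   # arithmetic case
print(weighted_distance_km(country_i, country_j, theta=-1))  # harmonic case
```

When each country has a single city, both weighted measures collapse to the simple great-circle distance between those two cities. With several cities, the harmonic case gives more weight to short city-to-city distances and therefore never exceeds the arithmetic case.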

Explanatory note.pdf
   