Geolocation remains an opt-in feature on Twitter, which has rendered it relatively unpopular with users -- most statistics suggest between 1% and 3% of users actually choose to provide their exact coordinates. This enormous limitation therefore makes such data largely ungeneralizable to Twitter users more broadly (see, for example, here and here) but that has not diminished the appeal of the idea of getting to understand which users are tweeting what from where.
My workshop on Making Social Media Matter at Boston University recently took up this task while fully understanding that what we were producing was, at best, only a fraction of all activity on any given topic. Since I have not been able to find a decent tutorial on how to do this elsewhere, I am posting this one here for the workshop attendees and for others interested in doing similar work. Please feel free to comment or email me with feedback.
First, we need some geolocative data. For this example, I'll use a sample Excel dataset (in .csv format) of 1,791 tweets about immigration that I've collected and exported from the BU-TCAT. If you would like to access the TCAT and millions of tweets on a variety of topics, please contact me. Normally, I can add custom search terms upon request for certain use cases (like dissertations or similar). If you don't find something that suits your interests, just let me know.
Back to the tutorial:
To begin, we will create our nodes file for importing into Gephi. To do so, start by deleting all columns from the datafile except:
‘from_user_name’ ‘text’ ‘lang’ ‘to_user_name’ ‘location’ ‘lat’ and ‘lng’
This next step is important, because doing it improperly will break your nodes file. Here we have to rename
‘from_user_name’ to ‘id’ and then copy that column and name the copy ‘label’
Save this file with a name of your choice. Here, we will save it as ‘imm_nodes.csv’
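If you would rather not do the column edits by hand in Excel, the same nodes file can be produced with a short script. This is only a minimal sketch using Python's standard csv module. The function names and the input filename "tweets.csv" are placeholders of mine, not part of TCAT or Gephi, and it assumes your export has the column names listed above.

```python
import csv

# Columns the tutorial keeps; from_user_name becomes both 'id' and 'label'.
KEEP = ["text", "lang", "to_user_name", "location", "lat", "lng"]
FIELDS = ["id", "label"] + KEEP

def to_node_row(row):
    """Keep the tutorial's columns, rename from_user_name to id, copy it to label."""
    node = {"id": row["from_user_name"], "label": row["from_user_name"]}
    for col in KEEP:
        node[col] = row.get(col) or ""  # blank if the column is missing or empty
    return node

def write_nodes(in_path, out_path):
    """Read the exported tweets and write a Gephi-ready nodes file."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for row in csv.DictReader(src):
            writer.writerow(to_node_row(row))

# Example call (placeholder input name):
# write_nodes("tweets.csv", "imm_nodes.csv")
```

Note that the script writes one row per tweet, just as the manual approach does, so a user who tweeted several times will appear several times.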
At this point, we have finished making our data suitable to import into Gephi to visualize our geolocative tweets. That is good. Before going further, make sure you have installed both the 'Map of Countries' and 'GeoLayout' plugins in order to see where in the world your tweets are actually coming from. Both are free and available under the Tools --> Plugins --> Available Plugins menu of Gephi.
Once installed, reopen Gephi as necessary and start by going to
File --> New Project
Then run the 'Map of Countries' using 'Layout' in the Overview tab of Gephi. Once done, you should see an empty general map of the world, like this:
Now, click on the Data Laboratory tab in Gephi and follow these steps:
Import Spreadsheet --> imm_nodes.csv (import as ‘Nodes table’)
Make sure that 'lat' and 'lng' are identified as 'double' values, not 'string', and be sure that ‘Force nodes to be created as new ones’ is checked.
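If Gephi keeps reading 'lat' or 'lng' as 'string', a stray non-numeric value in one of those columns is a likely suspect. The sketch below is a pre-import check of my own, not a Gephi feature. It lists the rows whose coordinates do not parse as decimal numbers or fall outside the valid ranges (latitude -90 to 90, longitude -180 to 180).

```python
import csv

def bad_coordinate_rows(rows):
    """Return (row_number, lat, lng) for rows with missing or out-of-range coordinates."""
    problems = []
    for i, row in enumerate(rows, start=2):  # row 1 of the file is the header
        lat, lng = row.get("lat"), row.get("lng")
        try:
            ok = -90 <= float(lat) <= 90 and -180 <= float(lng) <= 180
        except (TypeError, ValueError):
            ok = False  # blank, missing, or non-numeric value
        if not ok:
            problems.append((i, lat, lng))
    return problems

# Example call:
# with open("imm_nodes.csv", newline="", encoding="utf-8") as f:
#     print(bad_coordinate_rows(csv.DictReader(f)))
```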
Once you have imported your nodes, go back to the Overview tab in Gephi. You should see a box of nodes more or less hovering over the Atlantic Ocean, Africa, or Europe. Not to worry.
Run the 'Geo Layout' spatialization using the layout menu. Be sure here to set
'Latitude' as 'lat' and 'Longitude' as 'lng'
Once run, all the nodes should have a proper geolocative home, as below.
Of course, at this point, it is clear a few things are missing, namely edges and color. While we can deal easily enough with color, we will have to save the adding of edges for the next tutorial.
To add color, in this case by language of tweets, in the Overview tab of Gephi, go to
Filters --> Attributes --> Partition --> background_map (Node)
Select 'null' and Filter. This will allow the nodes to have color added without adding color to the nodes of the background map. To add color to the nodes, in the Overview tab of Gephi, go to Partition in the upper left of your screen (not under the Filters menu), make sure 'Nodes' is highlighted, and select
Partition --> Refresh --> lang
Once you click Apply, you can see the language that users (nodes) identified in their profile, and this gives some sense of not only where but in which language users are tweeting about immigration around the world.
Go back and turn off the background map Filter and you should see something like this.
That is it -- welcome to the wonderful world of geolocation :)
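As a quick cross-check on the language partition used above, you can count the 'lang' values in the nodes file yourself and compare them with what Gephi shows. This is just a sketch with Python's standard library. The function name is mine, and blank values are grouped under a label I made up, "(blank)".

```python
import csv
from collections import Counter

def language_counts(rows):
    """Count rows per 'lang' value, most common first."""
    counts = Counter((row.get("lang") or "(blank)") for row in rows)
    return counts.most_common()

# Example call:
# with open("imm_nodes.csv", newline="", encoding="utf-8") as f:
#     for lang, n in language_counts(csv.DictReader(f)):
#         print(lang, n)
```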
For the next phases, let’s start by making an edge file.
Delete every column from the original imm-20150307-20150422--------geoTweets--.csv file except:
‘from_user_name’ and ‘to_user_name’
Once you have done that, rename ‘from_user_name’ to ‘source’ and ‘to_user_name’ to ‘target’
Save this file as ‘imm_edges.csv’ or something similar.
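The edge file can be scripted in the same way as the nodes file. Again, this is a minimal sketch and the function names are placeholders of mine. The skip_empty_targets option is my own addition, on the assumption that tweets not addressed to anyone have a blank ‘to_user_name’. Leave it off to match the manual steps exactly.

```python
import csv

def edge_rows(rows, skip_empty_targets=False):
    """Map from_user_name/to_user_name to Gephi's source/target columns."""
    edges = []
    for row in rows:
        target = row.get("to_user_name") or ""
        if skip_empty_targets and not target:
            continue  # optional: drop tweets that are not addressed to anyone
        edges.append({"source": row["from_user_name"], "target": target})
    return edges

def write_edges(in_path, out_path, skip_empty_targets=False):
    """Read the exported tweets and write a Gephi-ready edges file."""
    with open(in_path, newline="", encoding="utf-8") as src:
        edges = edge_rows(csv.DictReader(src), skip_empty_targets)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["source", "target"])
        writer.writeheader()
        writer.writerows(edges)

# Example call (placeholder input name):
# write_edges("tweets.csv", "imm_edges.csv")
```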
Go back to the Data Laboratory and import the spreadsheet ‘imm_edges.csv’ as an ‘Edges table’
Check ‘Create missing nodes’
Go back to the Overview and see what you created!
Great, right? Except some nodes are geolocated in the ocean. Why? Those users did not turn on geolocation, or if they did, we don’t have coordinates because they were only mentioned in the dataset and didn't tweet themselves.
Let’s put them in Antarctica by running the Geo Layout process again.
If you don't like having nodes falsely settled around the South Pole, just do not check ‘Create missing nodes’ and you will see far fewer edges, but in their proper locations.
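To see ahead of time which users will end up without coordinates, you can compare the two files: any ‘target’ that never appears as an ‘id’ in the nodes file was only mentioned and has no location of its own. The sketch below is my own helper, not part of Gephi, and it assumes usernames are spelled with identical capitalization in both files.

```python
def users_without_coordinates(node_ids, edges):
    """Return edge targets that never appear as a node id, sorted alphabetically."""
    known = set(node_ids)
    missing = {e["target"] for e in edges if e["target"] and e["target"] not in known}
    return sorted(missing)

# Example with made-up usernames: 'dave' is mentioned but never tweets.
example_edges = [
    {"source": "alice", "target": "bob"},
    {"source": "bob", "target": "dave"},
]
print(users_without_coordinates(["alice", "bob"], example_edges))
```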
Future posts will address making graphs dynamic and interactive for the web.
Let me know about any questions or issues @jgroshek or jgroshek '@' bu.edu
Hope it is helpful! Thanks!