Searching for Data
Aside from connecting to the filter stream, every way of collecting Twitter data is a variation on searching the full archive.
Specifying the Query
Search queries are specified in an event query file according to Twitter’s search operator syntax rules.
See an example of a search query file here.
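As a rough illustration of the operator syntax, the sketch below assembles a query string in Python. The helper name `build_query`, its parameters, and the 1024-character default limit are assumptions made for illustration and are not part of this package; check Twitter's documentation for the limit that applies to your access level, and the example query file for the exact file layout.

```python
def build_query(keywords, lang=None, exclude_retweets=True, max_len=1024):
    """Combine keywords with OR and append common Twitter search operators.

    Hypothetical helper: the name, arguments, and max_len default are assumptions.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Quote multi-word phrases so they are matched exactly.
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    # Parentheses group the OR clause; a space between clauses means AND.
    parts = ["(" + " OR ".join(terms) + ")"]
    if lang:
        parts.append(f"lang:{lang}")
    if exclude_retweets:
        parts.append("-is:retweet")  # a leading minus negates an operator
    query = " ".join(parts)
    if len(query) > max_len:
        raise ValueError(f"query is {len(query)} characters; limit is {max_len}")
    return query


print(build_query(["wildfire", "forest fire"], lang="en"))
# (wildfire OR "forest fire") lang:en -is:retweet
```

The resulting string is what would go into the event query file as a single query.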
Using the Full Archive Search
Once the event query file is ready, search for tweets with:
python -m twitter.search event_name
The search can be cancelled at any time with CTRL+C.
Counting
Twitter’s API allows you to request the total number of tweets that will be returned by a given query without actually retrieving the tweets themselves. This is helpful for estimating the size of a query and staying under the API’s monthly quota. It is also convenient if only time series data is needed and not the full tweet data.
To get the counts, use the get_counts flag:
python -m twitter.search event_name --get_counts
This accesses Twitter’s count endpoint and writes time series count data to JSON files in the output directory. If you have more than one query, there will be one file per query, numbered in the order in which the queries appear in the input query file. You can set the granularity of the time series data to minute, hour, or day:
python -m twitter.search event_name --get_counts -granularity day
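To turn the time series into a single total, the per-bucket counts can be summed. The sketch below assumes the JSON files keep the shape of the Twitter API v2 counts response (a `data` list of buckets, each with `start`, `end`, and `tweet_count`); that layout and the helper name `total_tweet_count` are assumptions, so inspect one of your output files before relying on it. The sample numbers are made up.

```python
import json


def total_tweet_count(payload):
    """Sum tweet_count over all time buckets in one counts payload."""
    return sum(bucket["tweet_count"] for bucket in payload.get("data", []))


# Small inline sample in the assumed shape (made-up numbers).
sample = json.loads("""
{"data": [
  {"start": "2021-01-01T00:00:00.000Z", "end": "2021-01-02T00:00:00.000Z", "tweet_count": 120},
  {"start": "2021-01-02T00:00:00.000Z", "end": "2021-01-03T00:00:00.000Z", "tweet_count": 80}
]}
""")

print(total_tweet_count(sample))  # 200
```

For a real file, replace the inline sample with `json.load` on the count file written for the query.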
Search Parameters
The search has several optional parameters. These are specified as flags on the standard search command, for example:
python -m twitter.search event_name --get_convos --backfill -n_days_back 30
See the documentation on updates and backfills, and conversation, quote, and timeline searches for additional example usage.
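When several events need the same options, the command can be assembled programmatically. The sketch below only builds the argument list from flags shown in this section; `build_search_command` is a hypothetical helper, and actually running the result (for example with `subprocess.run`) assumes the package is installed and configured.

```python
import shlex


def build_search_command(event_name, *flags):
    """Return the argv list for a search on event_name with extra flags."""
    return ["python", "-m", "twitter.search", event_name, *flags]


cmd = build_search_command(
    "event_name", "--get_convos", "--backfill", "-n_days_back", "30"
)
# Render the list as a single shell-style string for logging or review.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if an event name ever contains spaces.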
Parameter | Description |
---|---|
config_f | The configuration file to use if not using the default |
max_results_per_page | The maximum number of tweets to return per page of search. Defaults to the maximum of 500. The minimum is 10 |
get_counts | Whether to count the number of tweets that will be returned by the queries, in place of actually searching for the tweets |
granularity | If counting, the granularity of the time series data, either “minute”, “hour”, or “day”. Defaults to “hour” |
get_convos | Whether to get the conversations for the event using the conversation IDs from a search/stream, or for a list of input conversation IDs. If True, |
get_quotes | Whether to get tweets that quote those from a search. Note, this is not needed for tweets that were collected from a stream. If True, |
get_quotes_of_quotes | If a quote search has been run before, whether to get quote tweets of those quote tweets. This command can be run repeatedly, but becomes increasingly less efficient because all quote tweets, not just those from the prior run of |
get_timelines | Whether to get user timelines for a search/stream, or for a list of input user handles or IDs. If True, |
full_timelines | Whether to retrieve the full timelines of users. Defaults to False |
user_ids_f | Filename of a newline delimited text file of user IDs or handles for collecting user timelines |
convo_ids_f | Filename of a newline delimited text file of conversation IDs for collecting reply threads |
update | Whether to update the dataset with tweets that have occurred since the last tweet time in the search/stream. If True, |
backfill | Whether to update the dataset with tweets that occurred before the earliest tweet time in the search/stream. If True, |
start_time | The start time of the search. Overrides any start time set in the event query file. Use |
end_time | The end time of the search. Overrides any end time set in the event query file. Use |
n_days_back | How many days back to start the search relative to |
n_days_after | How many days after to end the search relative to |
append / overwrite | Whether to append JSON tweets to an existing file for the event. By default, tweets are appended |
write_count_files / no_count_files | Whether to write JSON time series count data when counting. Defaults to True if running a standard counting search. Defaults to False if running a timeline, conversation, or quote search (because they will produce many count files, one per entity being searched) |
verbose / quiet | Whether to print information/updates to the console while running the stream. By default, information is printed |
update_interval | How often to print updates of the number of tweets collected, in minutes |
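To relate a tweet count to API usage, the number of result pages follows from max_results_per_page. The sketch below assumes one request per page and uses the bounds of 10 and 500 given for that parameter; `estimate_pages` is a hypothetical helper, not part of the package. Treat the result as a lower bound, since a page may return fewer tweets than the maximum.

```python
import math


def estimate_pages(total_tweets, max_results_per_page=500):
    """Estimate how many result pages a search of total_tweets would need."""
    if not 10 <= max_results_per_page <= 500:
        raise ValueError("max_results_per_page must be between 10 and 500")
    if total_tweets < 0:
        raise ValueError("total_tweets must be non-negative")
    # Round up: a partially filled final page still costs one request.
    return math.ceil(total_tweets / max_results_per_page)


print(estimate_pages(12_345))       # 25
print(estimate_pages(12_345, 100))  # 124
```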