Searching for Data

Aside from connecting to the filter stream, every way of collecting Twitter data is a search of the full archive.

Specifying the Query

Search queries are specified in an event query file according to Twitter’s search operator syntax rules.

See an example of a search query file here.

Counting

Twitter’s API allows you to request the total number of tweets that will be returned by a given query without actually retrieving the tweets themselves. This is helpful for estimating the size of a query and staying under the API’s monthly quota. It is also convenient if only time series data is needed and not the full tweet data.

To get the counts, simply use the get_counts flag:

python -m twitter.search event_name --get_counts

This accesses Twitter’s count endpoint and returns time series count data in JSON files in the output directory. If you have more than one query, then there will be one file per query, numbered in the same order that they appear in the input query file. You can set the granularity of the time series data to be minute, hour, or day:

python -m twitter.search event_name --get_counts -granularity day
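
As a rough illustration of using counts to size a query, the sketch below totals a time series and estimates how many result pages a full search would need at minimum. It assumes records shaped like the `data` array returned by Twitter's v2 counts endpoint (`start`, `end`, `tweet_count`); the JSON files written by this tool may be laid out differently. The helper names `total_tweets` and `estimate_pages` and the sample values are hypothetical.

```python
import math

# Sample records shaped like the `data` array of the v2 counts endpoint.
# The values are made up, and the tool's own output files may differ.
sample_counts = [
    {"start": "2021-03-05T00:00:00.000Z", "end": "2021-03-06T00:00:00.000Z", "tweet_count": 1200},
    {"start": "2021-03-06T00:00:00.000Z", "end": "2021-03-07T00:00:00.000Z", "tweet_count": 450},
    {"start": "2021-03-07T00:00:00.000Z", "end": "2021-03-08T00:00:00.000Z", "tweet_count": 75},
]


def total_tweets(counts):
    """Sum the per-bucket counts of a time series."""
    return sum(bucket["tweet_count"] for bucket in counts)


def estimate_pages(total, max_results_per_page=500):
    """Lower bound on the number of result pages a search would need."""
    if not 10 <= max_results_per_page <= 500:
        raise ValueError("max_results_per_page must be between 10 and 500")
    return math.ceil(total / max_results_per_page)


total = total_tweets(sample_counts)
print(f"{total} tweets, at least {estimate_pages(total)} pages")
```

Comparing the total against the remaining monthly quota before running the full search is the main use; the page estimate is only a lower bound because pages are not guaranteed to be full.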

Search Parameters

The search has several optional parameters. These are specified as flags on the standard search command, for example:

python -m twitter.search event_name --get_convos --backfill -n_days_back 30

See the documentation on updates and backfills, and conversation, quote, and timeline searches for additional example usage.
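
When searches are launched from a script, it can help to assemble the command as an argument list. The sketch below is one way to do that. It assumes the pattern seen in the examples above (a double dash for boolean flags, a single dash for parameters that take a value), which is inferred from those examples rather than from the tool's argument parser. The helper `build_search_command` is hypothetical and not part of the package.

```python
import shlex


def build_search_command(event_name, flags=(), params=None):
    """Assemble the argument list for a search run.

    Boolean flags get a double dash and valued parameters a single dash,
    mirroring the documented examples (an assumption, not a verified rule).
    """
    cmd = ["python", "-m", "twitter.search", event_name]
    cmd += [f"--{flag}" for flag in flags]
    for name, value in (params or {}).items():
        cmd += [f"-{name}", str(value)]
    return cmd


cmd = build_search_command(
    "event_name",
    flags=["get_convos", "backfill"],
    params={"n_days_back": 30},
)
# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the search without going through a shell.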

Parameter

Description

config_f

The configuration file to use if not using the default

max_results_per_page

The maximum number of tweets to return per page of search results. Defaults to 500, the largest value allowed. The smallest value allowed is 10

get_counts

Whether to count the number of tweets that will be returned by the queries, in place of actually searching for the tweets

granularity

If counting, the granularity of the time series data, either “minute”, “hour”, or “day”. Defaults to “hour”

get_convos

Whether to get the conversations for the event using the conversation IDs from a search/stream, or for a list of input conversation IDs. If True, start_time defaults to first_time and end_time defaults to last_time

get_quotes

Whether to get tweets that quote those from a search. Note, this is not needed for tweets that were collected from a stream. If True, start_time defaults to first_time and end_time defaults to last_time

get_quotes_of_quotes

If a quote search has been run before, whether to get quote tweets of those quote tweets. This command can be run repeatedly, but it becomes less efficient each time because all quote tweets, not just those from the prior run of get_quotes_of_quotes, will be searched

get_timelines

Whether to get user timelines for a search/stream, or for a list of input user handles or IDs. If True, start_time defaults to first_time with n_days_back=14, and end_time defaults to last_time. Note that n_days_back is still overridden to 14 whenever start_time is not set manually as a parameter

full_timelines

Whether to retrieve the full timelines of users. Defaults to False

user_ids_f

Filename of a newline delimited text file of user IDs or handles for collecting user timelines

convo_ids_f

Filename of a newline delimited text file of conversation IDs for collecting reply threads

update

Whether to update the dataset with tweets that have occurred since the last tweet time in the search/stream. If True, start_time defaults to last_time and end_time defaults to now. If updating conversations or timelines, start_time is set dynamically based on the latest tweet from each conversation or user. The parameters end_time and n_days_after can be used together to specify an end time other than the present. Cannot be combined with a backfill, and quote searches cannot be updated

backfill

Whether to update the dataset with tweets that occurred before the earliest tweet time in the search/stream. If True, start_time defaults to the beginning of the day (UTC) of the first tweet available for the event, and end_time defaults to first_time. If backfilling conversations or timelines, end_time is set dynamically based on the latest tweet from each conversation or user. The parameters start_time and n_days_back can be used together to specify a start time other than the first day of the event. Cannot be combined with an update, and quote searches cannot be backfilled

start_time

The start time of the search. Overrides any start time set in the event query file. Use first_time to use the earliest tweet time recorded for an event, and last_time to use the latest

end_time

The end time of the search. Overrides any end time set in the event query file. Use last_time to use the latest tweet time recorded for an event, and first_time to use the earliest. Use now to use the current date

n_days_back

How many days before start_time to begin the search. Note that this only applies when start_time is passed manually as a parameter, not when it is set in the event query file. Defaults to 0

n_days_after

How many days after end_time to end the search. Note that this only applies when end_time is passed manually as a parameter, not when it is set in the event query file. Defaults to 0

append / overwrite

Whether to append JSON tweets to an existing file for the event or to overwrite that file. By default, tweets are appended

write_count_files / no_count_files

Whether to write JSON time series count data when counting. Defaults to True if running a standard counting search. Defaults to False if running a timeline, conversation, or quote search (because they will produce many count files, one per entity being searched)

verbose / quiet

Whether to print information/updates to the console while running the search. By default, information is printed

update_interval

How often to print updates of the number of tweets collected, in minutes
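
To make the time-window parameters concrete, the sketch below shows how n_days_back and n_days_after widen a window around start_time and end_time, and how the "beginning of the day (UTC)" default used by backfills can be computed. It is an illustration of the behavior described in the table, not the package's own code; the function names `resolve_window` and `start_of_utc_day` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def resolve_window(start_time, end_time, n_days_back=0, n_days_after=0):
    """Widen a search window by whole days on either side."""
    start = start_time - timedelta(days=n_days_back)
    end = end_time + timedelta(days=n_days_after)
    if start >= end:
        raise ValueError("the start of the window must precede its end")
    return start, end


def start_of_utc_day(moment):
    """Return midnight (UTC) of the day on which `moment` falls."""
    as_utc = moment.astimezone(timezone.utc)
    return as_utc.replace(hour=0, minute=0, second=0, microsecond=0)


# Made-up first and last tweet times for an event.
first_time = datetime(2021, 3, 5, 14, 30, tzinfo=timezone.utc)
last_time = datetime(2021, 3, 9, 8, 0, tzinfo=timezone.utc)

# Equivalent in spirit to passing start_time first_time with n_days_back 30.
start, end = resolve_window(first_time, last_time, n_days_back=30)
print(start.isoformat(), end.isoformat())

# The default backfill start described above: midnight UTC of the first tweet's day.
print(start_of_utc_day(first_time).isoformat())
```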