Data Cleaning


r/wallstreetbets Posts

Cleaning Process

Reddit Cleaning


Challenges Faced

  1. Alignment: Converting the ‘date’ column to the ‘datetime64’ dtype so that it is compatible with, and aligns precisely against, the GameStop (GME) stock price data for time series analysis.

  2. Aggregation: Aggregating posts by date with the groupby method requires careful handling so that the daily summaries, such as total posts, average scores, or comments per day, are accurate.

  3. Data Frame: Transforming the raw subreddit data into a cleaned, structured data frame ready for analysis involved several preprocessing steps, including handling missing values, removing duplicates, and standardising data formats.
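
The three challenges above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names (`id`, `date`, `score`, `num_comments`), the sample rows, and the helper `clean_posts` are all assumptions made for the example.

```python
import pandas as pd

# Made-up sample rows; the column names are assumptions, not the real schema.
raw_posts = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3", "a4"],
    "date": [
        "2021-01-27 09:15:00",
        "2021-01-27 14:30:00",
        "2021-01-27 14:30:00",  # duplicate of post a2
        "2021-01-28 10:00:00",
        None,                   # missing timestamp
    ],
    "score": [120, 80, 80, 300, 15],
    "num_comments": [40, 10, 10, 95, 2],
})


def clean_posts(posts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: deduplicate, fix dates, and aggregate per day."""
    # Remove duplicate posts, keeping the first occurrence of each id.
    cleaned = posts.drop_duplicates(subset="id").copy()
    # Convert to a datetime64 dtype; unparseable or missing values become NaT.
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    # Drop rows without a usable date, then truncate timestamps to midnight
    # so that every post from the same day shares one key.
    cleaned = cleaned.dropna(subset=["date"])
    cleaned["date"] = cleaned["date"].dt.normalize()
    # One row per day with the summary statistics mentioned above.
    return (
        cleaned.groupby("date")
        .agg(
            total_posts=("id", "count"),
            avg_score=("score", "mean"),
            total_comments=("num_comments", "sum"),
        )
        .reset_index()
    )


daily_posts = clean_posts(raw_posts)
print(daily_posts)
```

Normalising the timestamps before grouping is one possible design choice: it keeps the ‘date’ column as datetime64, so it can later be matched directly against daily stock price dates.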


GameStop Stock Prices

Cleaning Process

GME Cleaning


Challenges Faced

The process of transforming the raw API data into a cleaned data frame involved multiple sub-tasks:

  • Ensuring data consistency and accuracy after type conversion.
  • Handling any missing or anomalous data points that could skew analysis.
  • Structuring the data in a way that would align with analytical goals.
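
As a rough sketch of these sub-tasks, the snippet below assumes the API returns every field as a string, which is common but not confirmed by this report. The column names (`date`, `close`, `volume`), the sample values, and the helper `clean_prices` are hypothetical.

```python
import pandas as pd

# Made-up payload: all values arrive as strings and one price is unusable.
raw_prices = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-06"],
    "close": ["10.50", "12.00", "n/a", "11.25"],
    "volume": ["1000", "1500", "1200", "1200"],
})


def clean_prices(prices: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert types, drop bad rows, order by date."""
    cleaned = prices.copy()
    # Type conversion; values that cannot be parsed become NaT or NaN.
    cleaned["date"] = pd.to_datetime(cleaned["date"], errors="coerce")
    for column in ["close", "volume"]:
        cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    # Remove rows with missing or anomalous entries that failed conversion.
    cleaned = cleaned.dropna(subset=["date", "close", "volume"])
    # Keep one row per trading day and sort chronologically.
    cleaned = cleaned.drop_duplicates(subset="date", keep="last")
    return cleaned.sort_values("date").reset_index(drop=True)


prices = clean_prices(raw_prices)
print(prices.dtypes)
print(prices)
```

Using `errors="coerce"` turns unparseable entries into missing values, so consistency after type conversion can be checked in one place rather than by inspecting strings.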


Merging Data Frames

Once the data was cleaned, the GME stock price data frame was merged with the r/wallstreetbets post data frame.

Merged Data Frames

Rows containing ‘NaN’ indicate days on which the stock market was closed; these rows were subsequently dropped.
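
The merge and the removal of non-trading days might look like the sketch below. It assumes a left join from the post data onto the price data on a shared ‘date’ column; the join type, the column names, and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Made-up daily post counts covering a Friday, a weekend, and a Monday.
daily_posts = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-08", "2021-01-09", "2021-01-10", "2021-01-11"]
    ),
    "total_posts": [5, 3, 4, 7],
})

# Made-up prices: rows exist only for days on which the market was open.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-08", "2021-01-11"]),
    "close": [10.0, 11.0],
})

# A left join keeps every post date; days without a price row get NaN.
merged = daily_posts.merge(prices, on="date", how="left")

# Drop the rows where no price is available (market closed).
trading_days = merged.dropna(subset=["close"]).reset_index(drop=True)
print(trading_days)
```

Both ‘date’ columns need the same datetime64 dtype before merging; otherwise matching days may fail to join.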
