In July 2024, I worked on my Master’s dissertation, ‘Social Media Analysis of the Cryptocurrency Market.’ The project involved crawling Bitcoin price data from cryptocurrency data aggregators such as CoinMarketCap and CryptoCompare. This price data was correlated with sentiment data from top social media sites including Twitter, Telegram, Reddit, and BitcoinTalk.
The correlation between both datasets showed a relationship between the price of cryptocurrencies and social media sentiment, and most importantly, the level of impact each social media platform had on Bitcoin and altcoins. The best part of this project was building a predictive model that can forecast the future price of these virtual assets.
Now, if you’ve got a similar project where you need the Bitcoin historical price data as a downloadable CSV, I’ll show you the steps that enabled me to get mine. I’ll also cover some data cleaning steps, such as the removal of stop words, institutional comments, and links. You’ll also get to know the machine learning techniques I used to model the data.
How to Gather Bitcoin Price Data
There are several cryptocurrency price aggregators you can use to get a cryptocurrency price dataset. Some of these platforms include:
- CryptoCompare
- CoinMarketCap
- CoinGecko
Let’s take a look at one of these platforms:
CryptoCompare
For my dissertation, I used CryptoCompare to get the Bitcoin price dataset for free. At the time, there was no need to use an API, which made the data collection process straightforward. Today, however, you’ll need to set up an API key to crawl the data.
The dataset was classified by year, month, and date, which made it easy to correlate with sentiments collected around the same time. This means I could check whether comments made on social media within a given timeframe impacted Bitcoin’s price. I could also develop a model that takes data from a given market season, such as a bull or bear season, to predict price.
For instance, suppose you’re trying to predict when the crypto alt season will begin in 2025. You can use historic data from November 2020 to forecast it. The machine learning model you choose may need technical indicators such as the relative strength index (RSI), moving average convergence divergence (MACD), and the stochastic oscillator to better predict the future price of these assets.
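Here’s a minimal sketch of how you might compute two of these indicators with pandas, assuming a DataFrame with a ‘Price’ column like the one built later in this article:
import pandas as pd

def add_indicators(df, price_col='Price'):
    # 14-day RSI: compares the average gain to the average loss
    delta = df[price_col].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = -delta.clip(upper=0).rolling(14).mean()
    df['RSI'] = 100 - 100 / (1 + gain / loss)
    # MACD: 12-day EMA minus 26-day EMA, plus a 9-day signal line
    ema_fast = df[price_col].ewm(span=12, adjust=False).mean()
    ema_slow = df[price_col].ewm(span=26, adjust=False).mean()
    df['MACD'] = ema_fast - ema_slow
    df['MACD_Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()
    return df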
Apart from date as a feature in this dataset, other features were price, 24-hour volume, and market capitalization. The dataset is shown in Figure 1.
Code to Crawl Bitcoin Price Data from CryptoCompare
So let’s say you’ve chosen to use CryptoCompare. You’d need code that can automatically crawl the data based on the date range you want. Code like the following can help you begin:
import requests
import pandas as pd
from datetime import datetime

def fetch_crypto_data(symbol, limit=400, api_key='YOUR_API_KEY'):
    """
    Fetch historical daily cryptocurrency data.
    """
    url = 'https://min-api.cryptocompare.com/data/v2/histoday'
    parameters = {
        'fsym': symbol,
        'tsym': 'USD',
        'limit': limit,
        'api_key': api_key
    }
    try:
        response = requests.get(url, params=parameters)
        response.raise_for_status()  # Raise HTTPError for bad responses
        data = response.json()
        if data.get('Response') == 'Success':
            return data['Data']['Data']
        else:
            print(f"Error fetching data for {symbol}: {data.get('Message', 'Unknown error')}")
            return []
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return []

# List of cryptocurrencies to fetch data for
crypto_symbols = ['BTC']

# Data storage
market_data = []

# Fetch and process data for each cryptocurrency
for symbol in crypto_symbols:
    historical_data = fetch_crypto_data(symbol)
    if historical_data:
        for day_data in historical_data:
            date = datetime.utcfromtimestamp(day_data['time']).strftime('%Y-%m-%d')
            price = day_data['close']
            volume_24h = day_data['volumeto']
            # Market cap omitted, as 'volumeto' is in USD
            market_data.append({
                'Symbol': symbol,
                'Price': price,
                'Volume_24h': volume_24h,
                'Date': date
            })

# Convert collected data to a DataFrame
if market_data:
    market_df = pd.DataFrame(market_data)
    # Print and save the DataFrame
    print(market_df.head())
    market_df.to_csv('crypto_market_data.csv', index=False)
else:
    print("No market data collected.")
The code above will crawl Bitcoin’s price data for the past 400 days. You can adjust this timeframe by changing the limit value. Adjustments can also be made to the code to crawl other cryptocurrencies such as Ethereum, Ripple, Solana, Shiba Inu, and Dogecoin.
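For example, to fetch several coins in one run, you could widen the symbol list (the tickers below are assumptions; confirm they exist on CryptoCompare):
# Fetch several cryptocurrencies by their CryptoCompare tickers
crypto_symbols = ['BTC', 'ETH', 'XRP', 'SOL', 'SHIB', 'DOGE']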
Here’s a step-by-step process for setting up an account on the platform to get an API key that will work with the code:
Get the API key:
- Go to https://min-api.cryptocompare.com/ to get an API key.
- Register an account.
- Replace ‘YOUR_API_KEY’ in the code below with the key you got from the platform.
- Run the code on Google Colab or any other Python environment you’ve got.
import requests
import pandas as pd
from datetime import datetime

def fetch_crypto_data(symbol, limit=400, api_key=None):
    url = 'https://min-api.cryptocompare.com/data/v2/histoday'
    parameters = {
        'fsym': symbol,
        'tsym': 'USD',
        'limit': limit,
        'api_key': api_key
    }
    # Make the request to CryptoCompare
    response = requests.get(url, params=parameters)
    # Handle potential issues with the response
    if response.status_code != 200:
        print(f"Error fetching data for {symbol}: {response.status_code}")
        return None
    data = response.json()
    if data.get('Response') != 'Success':
        print(f"No data for {symbol}: {data.get('Message')}")
        return None
    return data

# List of cryptocurrencies to fetch
crypto_symbols = ['BTC']
api_key = 'YOUR_API_KEY'  # Replace this with your actual API key

# Storage for market data
market_data = []

# Fetch data for each cryptocurrency
for symbol in crypto_symbols:
    data = fetch_crypto_data(symbol, limit=400, api_key=api_key)
    if data:
        for day_data in data['Data']['Data']:
            date = datetime.utcfromtimestamp(day_data['time']).strftime('%Y-%m-%d')
            price = day_data['close']
            volume_24h = day_data['volumeto']
            market_cap = day_data['close'] * day_data['volumeto']  # Rough approximation; a true market cap needs circulating supply
            market_data.append({
                'Symbol': symbol,
                'Price': price,
                'Volume_24h': volume_24h,
                'Market_Cap': market_cap,
                'Date': date
            })

# Convert the collected data into a DataFrame
market_df = pd.DataFrame(market_data)

# Display the DataFrame and save it to a CSV file
print(market_df)
market_df.to_csv('crypto_market_data.csv', index=False)
How to Crawl Sentiment Data
There are many social media platforms that can help you build thousands of data instances for your project. Given how broad they are, you’d have to narrow down based on audience type, age, and location. Consider the following platforms.
1. Twitter
Twitter’s data is updated by the minute, so you can find real-time sentiment data to pair with the price data. This real-time data is ideal for predicting short-term price movements. So let’s say you’re building a trading bot for scalpers; then you’d rely greatly on data from Twitter.
2. Telegram
Telegram also has real-time data from groups and channels. There has also been heightened interest in this platform due to airdrops like Hamster Kombat, BLUM, Tapswap, MemeFi, Dogs, and others. Telegram also has a younger demographic, so you may want to see what this class of users has to say. In line with that, there are more speculators than experienced traders on Telegram.
3. BitcoinTalk
BitcoinTalk has historic context with cryptocurrencies. Ethereum’s launch was first announced on BitcoinTalk, and Satoshi Nakamoto, Bitcoin’s creator, also founded the forum. You’d find trading analysis from experienced traders on BitcoinTalk. Thus, you can pair the sentiment of experienced traders with that of speculators to predict the next market direction.
Code to Crawl Social Media Data
The code below will enable you to crawl sentiment data from Telegram, Reddit, and BitcoinTalk. You can adjust the code based on how many comments or sentiments you need from each platform.
You’ll also have to fill in the fields where the API ID, username, and phone number are required. Telegram, for instance, will give you an API ID associated with your phone number, which you can use to crawl sentiments from the platform. BitcoinTalk, on the other hand, only requires links to the web pages, which are handed to an HTML parser to crawl the content.
!pip install asyncpraw nest_asyncio
!pip install --upgrade telethon

import asyncio
import asyncpraw
import nest_asyncio
import pandas as pd
from datetime import datetime
import requests
from bs4 import BeautifulSoup
from telethon import TelegramClient
from telethon.errors import PhoneCodeInvalidError, SessionPasswordNeededError
from telethon.tl.functions.messages import GetHistoryRequest
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Download NLTK data files
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()

# Define a pattern to remove emojis and special characters
emoji_pattern = re.compile(
    "["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    u"\U00002500-\U00002BEF"  # box drawing and CJK symbols
    u"\U00002702-\U000027B0"  # dingbats
    u"\U0001F926-\U0001F937"
    u"\U00010000-\U0010ffff"
    u"\u2640-\u2642"
    u"\u2600-\u2B55"
    u"\u200d"
    u"\u23cf"
    u"\u23e9"
    u"\u231a"
    u"\ufe0f"  # variation selector
    u"\u3030"
    "]+", flags=re.UNICODE
)

def is_relevant(text):
    # Define patterns to identify non-relevant data
    ad_patterns = [
        r'http\S+',  # URLs
        r'www\S+',
        r'https\S+',
        r'join our telegram',
        r'follow us on',
        r'click here',
        r'visit our website',
        r'\b(buy|sell)\b',  # common advertising keywords
        r'\b(bot|automated)\b'  # bot-related keywords
    ]
    # Check for institutional or bot-like comments
    if any(re.search(pattern, text, re.IGNORECASE) for pattern in ad_patterns):
        return False
    return True

def preprocess_text(text):
    if not is_relevant(text):
        return None  # Skip non-relevant data
    # Remove emojis
    text = emoji_pattern.sub(r'', text)
    # Remove URLs, mentions, and non-alphabetical characters
    text = re.sub(r'http\S+|www\S+|https\S+|\@\w+|[^a-zA-Z\s]', '', text, flags=re.MULTILINE)
    # Lowercase the text
    text = text.lower()
    # Tokenize the text
    word_tokens = word_tokenize(text)
    # Remove stopwords
    filtered_text = [word for word in word_tokens if word not in stop_words]
    # Lemmatize the text
    lemmatized_text = [lemmatizer.lemmatize(word) for word in filtered_text]
    return ' '.join(lemmatized_text)

def preprocess_dataset(dataset):
    # Return one entry per input (None for filtered-out texts) so the
    # results stay aligned with the metadata they are zipped with later
    return [preprocess_text(text) for text in dataset]

# Reddit Data Collection
CLIENT_ID = 'YOUR_CLIENT_ID'
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'
USERNAME = 'USERNAME'
PASSWORD = 'PASSWORD'
USER_AGENT = 'USERAGENT'

async def fetch_reddit_posts(subreddit_name, limit=500):
    reddit = asyncpraw.Reddit(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        username=USERNAME,
        password=PASSWORD,
        user_agent=USER_AGENT
    )
    posts = []
    subreddit = await reddit.subreddit(subreddit_name)
    async for post in subreddit.hot(limit=limit):
        posts.append({'title': post.title, 'created': post.created_utc})
    await reddit.close()
    return posts

# BitcoinTalk Data Collection
def fetch_bitcointalk_posts(thread_url, num_pages=6):
    posts = []
    base_url = thread_url.rsplit('.', 1)[0]  # strip the trailing '.0' before adding the page offset
    for page in range(num_pages):
        url = f"{base_url}.{page * 20}"  # BitcoinTalk pagination: 20 posts per page
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for post in soup.find_all('div', class_='post'):
            post_content = post.get_text(separator=" ", strip=True)
            # Try to find the post date
            post_date_element = post.find('div', class_='smalltext')
            if post_date_element:
                post_date = post_date_element.text
            else:
                # Handle the case where the date is not found
                post_date = 'Unknown'
            posts.append({'content': post_content, 'created': post_date})
    return posts

# Apply the nest_asyncio patch
nest_asyncio.apply()

api_id = 'YOUR API ID'
api_hash = 'YOUR API HASH'
phone_number = 'YOUR NUMBER'

# Initialize the Telegram client
client = TelegramClient('session_name', api_id, api_hash)

async def fetch_telegram_messages(channel_username, limit=500):
    print("Connecting to Telegram...")
    await client.connect()
    if not await client.is_user_authorized():
        print("Sending code request...")
        await client.send_code_request(phone_number)
        code = input('Enter the code you received: ')
        try:
            await client.sign_in(phone_number, code)
        except PhoneCodeInvalidError:
            print("Invalid code entered. Please try again.")
            return []
        except SessionPasswordNeededError:
            password = input('Two-step verification enabled. Please enter your password: ')
            await client.sign_in(password=password)
    try:
        print(f"Fetching messages from {channel_username}...")
        channel = await client.get_entity(channel_username)
        history = await client(GetHistoryRequest(
            peer=channel,
            limit=limit,
            offset_date=None,
            offset_id=0,
            max_id=0,
            min_id=0,
            add_offset=0,
            hash=0
        ))
        messages = [{'message': message.message, 'created': message.date} for message in history.messages if message.message]
        return messages
    except Exception as e:
        print(f"An error occurred: {e}")
        return []
    finally:
        await client.disconnect()

# Main Function to Collect and Preprocess Data
async def main():
    # Reddit
    subreddits = ['CryptoCurrency', 'Bitcoin', 'CryptoMoonShots', 'Ethereum', 'SHIBArmy', 'algotrading']
    reddit_tasks = [fetch_reddit_posts(sub, limit=50) for sub in subreddits]
    all_reddit_posts = await asyncio.gather(*reddit_tasks)
    reddit_data = []
    for sub, posts in zip(subreddits, all_reddit_posts):
        for post in posts:
            reddit_data.append({'Message': post['title'], 'Source': 'Reddit', 'Created': datetime.utcfromtimestamp(post['created']).isoformat()})

    # BitcoinTalk
    thread_urls = [
        'https://bitcointalk.org/index.php?topic=5490782.0',
    ]
    bitcoin_data = []
    for url in thread_urls:
        posts = fetch_bitcointalk_posts(url, num_pages=1)
        for post in posts:
            bitcoin_data.append({'Message': post['content'], 'Source': 'BitcoinTalk', 'Created': post['created']})

    # Telegram
    channel_usernames = ['CRYPTO_jokker', 'CRYPTO_insidderr', 'blumcrypto', 'Crytpo_Woolf', 'bittonapp', 'whalepumpgroup', 'cryptobull']
    telegram_data = []
    for username in channel_usernames:
        messages = await fetch_telegram_messages(username, limit=50000)
        for message in messages:
            telegram_data.append({'Message': message['message'], 'Source': 'Telegram', 'Created': message['created'].isoformat()})

    # Combine all data
    all_data = reddit_data + bitcoin_data + telegram_data

    # Preprocess the combined data
    preprocessed_messages = preprocess_dataset([data['Message'] for data in all_data])

    # Combine preprocessed data with metadata, dropping filtered-out messages
    combined_data = []
    current_date = datetime.now().isoformat()
    for data, preprocessed_message in zip(all_data, preprocessed_messages):
        if preprocessed_message:
            combined_data.append({
                'Message': preprocessed_message,
                'Source': data['Source'],
                'Date_Crawled': current_date,
                'Date_Created': data['Created']
            })

    # Create a DataFrame from the combined data
    df = pd.DataFrame(combined_data)

    # Add an empty column for sentiment labels
    df['Sentiment'] = ''

    # Save the DataFrame to a CSV file
    df.to_csv('sampled_data_for_annotation.csv', index=False)
    print(df)
    print("Sampled data saved to 'sampled_data_for_annotation.csv' for manual annotation.")

# Run the main function to collect and preprocess the data
asyncio.run(main())
Data Processing Techniques
The code provided for crawling social media data also applies several data processing techniques:
1. Stopword Removal
Stop words such as “and,” “the,” and “is” are removed from the crawled data, leaving more meaningful words behind. The stopwords set from the NLTK library is used for this purpose.
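As a quick illustration, here’s the same NLTK stopword filtering applied to a sample message:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("the price of bitcoin is going to the moon")
print([w for w in tokens if w not in stop_words])
# ['price', 'bitcoin', 'going', 'moon']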
2. Lemmatization
Lemmatization reduces words to their base form while considering the word’s context. For example, “running” is converted to “run.” The NLTK WordNetLemmatizer is employed for this task.
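One caveat: WordNetLemmatizer treats every word as a noun by default, so converting “running” to “run” requires passing the verb part-of-speech tag, as this small demo shows:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("prices"))            # 'price' (noun is the default)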
3. Emoji Removal
A pattern is defined to remove emojis and special characters from the text, since they may not convey the true meaning behind the sentiment. I used a regular expression that targets the Unicode ranges corresponding to emojis and special symbols.
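Applied to a sample message, the pattern strips the emojis while leaving the words intact:
# Assumes the emoji_pattern compiled in the crawler code above
print(emoji_pattern.sub('', 'BTC to the moon 🚀🔥'))
# 'BTC to the moon '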
4. URLs and Non-Alphabetical Characters Removal
A regular expression pattern is also used to remove irrelevant data such as URLs, ad-related data, and bot-generated content; a short demo follows the list below. Based on the code, the pattern includes:
- r'http\S+' (removes URLs)
- r'www\S+'
- r'https\S+'
- r'join our telegram'
- r'follow us on'
- r'click here'
- r'visit our website'
- r'\b(buy|sell)\b' (removes common advertising keywords)
- r'\b(bot|automated)\b' (removes bot-related keywords)
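For example, the is_relevant function from the crawler code classifies these messages as follows:
print(is_relevant("join our telegram for daily signals"))   # False (ad phrase)
print(is_relevant("Time to buy the dip"))                   # False ('buy' keyword)
print(is_relevant("BTC looks oversold on the daily chart")) # True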
5. Handling Institutional Comments
The regular expression pattern is also used in a function to ensure the data only includes individual investor sentiments. This is because institutional investors such as hedge funds, banks, etc. have large portfolios that can cause swift market movements.
But while institutional investors have large portfolios, they are fewer compared to daily traders and speculators. Thus, depending on the purpose of your analysis, it might be better to focus on individual investors, since they are more numerous and their sentiments are more representative of what the majority of the market thinks.
Data Augmentation for Social Media Sentiment
Ideally, you should either use the same number of data instances from each platform, or determine the ratio based on how impactful a platform is in your analysis. For instance, you may want a greater percentage of data from Telegram if you’re building a predictive model for scalping, while more data from Reddit and BitcoinTalk may be ideal for medium-term and long-term analysis.
In either case, if you have an underrepresented class, which might keep your model from learning certain class labels well, you could consider creating synthetic data for that class. Here, augmentation can be used to create new data instances using the structure of the old data. Synonyms may be used for the new data instances.
Here’s code that’ll help you get started by splitting the combined dataset by platform before augmenting:
import pandas as pd
input_csv = "REPLACE WITH YOUR FILE'S LOCATION"
df = pd.read_csv(input_csv)
# Split the dataframe based on the 'source' column
reddit_df = df[df['Source'] == 'Reddit']
bitcointalk_df = df[df['Source'] == 'BitcoinTalk']
telegram_df = df[df['Source'] == 'Telegram']
# Save the resulting dataframes to separate CSV files
reddit_csv = 'reddit.csv'
bitcointalk_csv = 'bitcointalk.csv'
telegram_csv = 'telegram.csv'
reddit_df.to_csv(reddit_csv, index=False)
bitcointalk_df.to_csv(bitcointalk_csv, index=False)
telegram_df.to_csv(telegram_csv, index=False)
print(f"Files have been saved as {reddit_csv}, {bitcointalk_csv}, and {telegram_csv}")
Machine Learning Techniques for Crypto Data
There are several machine learning techniques you can use to predict crypto prices with the dataset you now have.
These techniques include:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machine
Logistic Regression
This is a simple modeling technique you can use to classify the sentiment you get from social media pages.
Given that this is a classification task where sentiments fall into classes such as positive, negative, neutral, very positive, and very negative, it might help.
Ideally, logistic regression is used for binary classification, but it can also be used for multiclass classification. Hence, if you need a simple place to start, you might choose this approach.
The next step would be to encode the labels into numbers before correlating with the price data you crawled.
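Here’s a minimal sketch of that flow with scikit-learn, assuming a manually annotated CSV with the Message and Sentiment columns produced earlier:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

df = pd.read_csv('sampled_data_for_annotation.csv').dropna(subset=['Message', 'Sentiment'])

# Encode the text as TF-IDF features and the class labels as integers
X = TfidfVectorizer(max_features=5000).fit_transform(df['Message'])
y = LabelEncoder().fit_transform(df['Sentiment'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Multiclass logistic regression over the sentiment classes
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))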
Ethical Review Considerations
An ethical review is important if you’re crawling social media data. It will ensure you adhere to data privacy and security standards while handling comments from these platforms. Some considerations include:
1. Data Storage
You’d have to store the sentiment data in a safe medium, which could be an online or offline resource. You may also have to delete the data after using it for its intended purpose.
2. Data Privacy
The details of the crawled data should not be part of your publications; discuss only the results of the analysis.
Combine Sentiment Scores with Market Data
Aside from building a predictive model, you may want to carry out a correlation analysis between the Bitcoin price data and the social media data. The first step is averaging sentiment scores.
Averaging Sentiment Scores
Sentiment scores can be calculated by averaging the sentiments of all messages for a particular date. A positive score (+1) and a negative score (-1) on the same date would result in an average score of 0.
For example:
- Positive Sentiment: +1
- Negative Sentiment: -1
Average: (+1 + (-1)) / 2 = 0
You can then combine the calculated sentiment scores with the market data (price, volume, etc.) from CryptoCompare to form one dataset, as shown in Figure 3.
However, I needed to filter the price data to ensure its dates correspond with those available in the new sentiment data. This filtering is shown in Figure 4.
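In pandas, the daily averaging and the date filtering can both be expressed as a group-by followed by an inner merge. This is a sketch using the CSV names from earlier; the numeric Sentiment_Score column is assumed to come from your annotation step:
import pandas as pd

sentiment_df = pd.read_csv('sampled_data_for_annotation.csv', parse_dates=['Date_Created'])
price_df = pd.read_csv('crypto_market_data.csv', parse_dates=['Date'])

# Average the numeric sentiment scores per calendar day
sentiment_df['Date'] = sentiment_df['Date_Created'].dt.normalize()
daily_sentiment = sentiment_df.groupby('Date', as_index=False)['Sentiment_Score'].mean()

# An inner merge keeps only the dates present in both datasets,
# which also filters the price data to the sentiment date range
combined = price_df.merge(daily_sentiment, on='Date', how='inner')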
Correlation Analysis
You can perform time-series analysis to correlate sentiment scores with market prices.
Two initial correlation analyses can be performed:
- A correlation between the sentiment scores and Bitcoin’s price only.
- A correlation between the sentiment scores and the average price of the ten selected cryptocurrencies, to understand the overall market sentiment.
Pearson correlation can be used to calculate the correlation between social media sentiment and crypto price. The result may show a weak linear relationship between social media sentiment and market prices.
This correlation can be checked with different dataset sizes. In my findings, I discovered that the correlation was weaker with smaller datasets and slightly stronger with larger ones.
For instance, the correlation was around 0.0004 when 500 data instances covering 30 days were used. However, the correlation score was closer to 0.0012 when the dataset was increased to 2,000 instances spanning three months.
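With the merged DataFrame from the previous step, the Pearson coefficient is a one-liner, since pandas’ corr method defaults to Pearson:
# Pearson correlation between daily sentiment and Bitcoin's price
correlation = combined['Sentiment_Score'].corr(combined['Price'])
print(f"Pearson correlation: {correlation:.4f}")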
Conclusion
You now know how to get the Bitcoin price dataset, as well as that of other cryptocurrencies. These steps will also help you get started with crawling social media data for related projects. Encountered issues? Let me know in the comment section.