Install the required packages and load the libraries.

library(Hmisc)        # describe()
library(funModeling)  # plot_num(), profiling_num()
library(tidyverse)    # ggplot2, dplyr, tidyr
library(DT)           # datatable()



Source, Target, and Weight Data Frame

The data represent a summary of a Telegram network that consists of eight channels, namely “cyber_frontZ,” “ruserbia,” “orly_rs,” “balkan_spy,” “russkeydomserbia,” “Prigozhin_hat,” “rtbalkan_ru,” and “narodnapatrola.”

The “source” column identifies the channel, while the “unique_domains” and “total_URLs” columns represent the number of distinct domains and URLs shared in each channel, respectively.

stats <- data.frame(ch_stats)  # ch_stats: the imported "ch_stats.csv" data
datatable(stats)               # interactive table (DT package)

The data suggest that cyber_frontZ is by far the most active channel, with the highest number of unique domains (1610) and total URLs (9106). At the other end, the narodnapatrola and russkeydomserbia channels shared the fewest URLs (196 and 194, respectively), drawn from 25 and 77 unique domains.

Across the whole network, 2095 unique domains and 17339 URLs were shared. The cyber_frontZ channel alone therefore accounts for more than half of the URLs (9106, about 53%) and roughly three quarters of the unique domains (1610, about 77%).
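
As a quick check on these shares, the per-channel counts can be summed and compared with the network-wide totals. The sketch below is a minimal illustration that assumes the eight values shown in the glimpse() output in this document, with cyber_frontZ as the first element; the object names are hypothetical. The per-channel sums come out larger than the network-wide totals quoted above, which would be expected if those totals count each domain or URL only once across channels. That reading is an assumption, not something the source states.

```r
# Per-channel counts copied from the glimpse() output (cyber_frontZ is first).
unique_domains <- c(1610, 745, 354, 55, 97, 63, 25, 77)
total_URLs     <- c(9106, 3904, 2366, 1269, 445, 209, 196, 194)

# Network-wide totals as quoted in the text.
reported_domains <- 2095
reported_urls    <- 17339

# Sums of the per-channel counts. Anything shared in several channels
# is counted once per channel here.
sum_domains <- sum(unique_domains)   # 3026
sum_urls    <- sum(total_URLs)       # 17689

# cyber_frontZ share of the reported network-wide totals, in percent.
share_domains <- round(100 * unique_domains[1] / reported_domains, 1)  # 76.8
share_urls    <- round(100 * total_URLs[1] / reported_urls, 1)         # 52.5
```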



Summary Statistics for Core Channels

The summary statistics show that the mean number of unique domains (378.2) and total URLs (2211.1) per channel is far above the corresponding minimum (25 and 194, respectively). The gap between the average channel and the least active channel indicates that sharing volume varies widely across the network.

The median number of unique domains (87) and total URLs (857) is well below the corresponding mean, which indicates that both distributions are positively skewed: a few very active channels pull the mean upward. The positive skewness values reported for both variables in the numeric profiling output support this.
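
The mean-versus-median comparison and the sign of the skewness can be reproduced in a few lines. This is a minimal sketch that assumes the per-channel values from the glimpse() output; the helper skewness() is hypothetical and uses the plain moment-based formula, so its value may differ slightly from a profiling tool that applies a small-sample correction.

```r
# Per-channel counts copied from the glimpse() output.
unique_domains <- c(1610, 745, 354, 55, 97, 63, 25, 77)
total_URLs     <- c(9106, 3904, 2366, 1269, 445, 209, 196, 194)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# A mean above the median is the usual sign of a right (positive) skew.
mean(unique_domains) > median(unique_domains)   # TRUE (378.25 vs 87)
mean(total_URLs) > median(total_URLs)           # TRUE (2211.125 vs 857)

# Both skewness values are positive.
skew_domains <- skewness(unique_domains)
skew_urls    <- skewness(total_URLs)
```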

The interquartile ranges (IQR) for unique domains (390.75) and total URLs (2544.75) are large relative to the medians, which further points to high variability. The maximum values, 1610 unique domains and 9106 total URLs, both belong to cyber_frontZ and lie far above the third quartiles (451.8 and 2750.5).
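
The quartiles and IQR values follow from base R's default quantile method (type 7), which interpolates linearly between neighboring order statistics. A minimal sketch, again assuming the per-channel values from the glimpse() output:

```r
# Per-channel counts copied from the glimpse() output.
unique_domains <- c(1610, 745, 354, 55, 97, 63, 25, 77)
total_URLs     <- c(9106, 3904, 2366, 1269, 445, 209, 196, 194)

# First and third quartiles (default quantile type 7).
q_domains <- quantile(unique_domains, c(0.25, 0.75))   # 61.00, 451.75
q_urls    <- quantile(total_URLs, c(0.25, 0.75))       # 205.75, 2750.50

# IQR is the distance between the third and first quartiles.
iqr_domains <- IQR(unique_domains)   # 390.75
iqr_urls    <- IQR(total_URLs)       # 2544.75
```

summary() prints the same quartiles rounded to fewer digits, which is why 205.75 appears as 205.8 and 451.75 as 451.8 in the summary output.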

Overall, the skewness and spread measures complement the basic summary statistics and give a fuller picture of how unevenly unique domains and URLs are distributed across the Telegram network.



Basic Exploratory Data Analysis

Basic exploratory data analysis (EDA) is an initial, informal examination of a dataset to understand its main characteristics. It is usually the first step in an analysis and helps identify potential problems or interesting patterns in the data.

Basic EDA typically involves looking at the distribution of the variables in the dataset, checking for missing values or outliers, identifying patterns or relationships between variables, and summarizing key statistics such as means, medians, and standard deviations.
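
Two of these checks, missing values and outliers, can be sketched in base R. The example assumes the numeric columns from the glimpse() output; df_num and high_outliers() are hypothetical names, and the 1.5 x IQR fence is a common rule of thumb chosen here for illustration, not a threshold taken from the original analysis.

```r
# Numeric columns copied from the glimpse() output.
df_num <- data.frame(
  unique_domains = c(1610, 745, 354, 55, 97, 63, 25, 77),
  total_URLs     = c(9106, 3904, 2366, 1269, 445, 209, 196, 194)
)

# Number of missing values in each column.
n_missing <- colSums(is.na(df_num))

# Values above the upper fence Q3 + 1.5 * IQR.
high_outliers <- function(x) {
  upper_fence <- unname(quantile(x, 0.75)) + 1.5 * IQR(x)
  x[x > upper_fence]
}

# One vector of flagged values per column.
outliers <- lapply(df_num, high_outliers)

n_missing
outliers
```

With these values no column has missing entries, and the only value flagged in each column is the maximum (1610 unique domains and 9106 total URLs).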

Convert data to data frame.

df <- data.frame(ch_stats)


Plot function.

plot_function <- function(df) {
  # Histogram of total URLs per channel, drawn with ggplot2.
  # The column is referenced unquoted: aes(x = "total_URLs") would map
  # a constant string instead of the column values.
  ggplot(data = df) +
    geom_histogram(aes(x = total_URLs), bins = 10)
}
plot_function(df)


Glimpse function.

glimpse_function <- function(df) {
  # Print the data structure: row and column counts, column types,
  # and the first few values of each column (dplyr::glimpse).
  glimpse(df)
}
glimpse_function(df)
Rows: 8
Columns: 3
$ source         <chr> "cyber_frontZ", "ruserbia", "orly_rs", "rtbalkan_ru", "balkan_spy", "…
$ unique_domains <dbl> 1610, 745, 354, 55, 97, 63, 25, 77
$ total_URLs     <dbl> 9106, 3904, 2366, 1269, 445, 209, 196, 194


Summary function.

summary_function <- function(df) {
  # Print summary statistics for each column (base R summary()).
  summary(df)
}
summary_function(df)
    source          unique_domains     total_URLs    
 Length:8           Min.   :  25.0   Min.   : 194.0  
 Class :character   1st Qu.:  61.0   1st Qu.: 205.8  
 Mode  :character   Median :  87.0   Median : 857.0  
                    Mean   : 378.2   Mean   :2211.1  
                    3rd Qu.: 451.8   3rd Qu.:2750.5  
                    Max.   :1610.0   Max.   :9106.0  


Profiling function.

profiling_num_function <- function(df) {
  # Numeric profiling report built with dplyr and tidyr: one row per
  # numeric variable with its mean, sd, min, max, median, and NA count.
  # Uses the df argument rather than the global stats object.
  df %>%
    select_if(is.numeric) %>%
    gather(key = "variable", value = "value") %>%
    group_by(variable) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              sd = sd(value, na.rm = TRUE),
              min = min(value, na.rm = TRUE),
              max = max(value, na.rm = TRUE),
              median = median(value, na.rm = TRUE),
              n_missing = sum(is.na(value)))
}
profiling_num_function(df)

Plot the Core Channel Data

The code uses the ggplot2 package to create two histograms, one for each of the “unique_domains” and “total_URLs” variables from “ch_stats.csv”, which is now held in the “df” data frame.

For the first histogram, the code maps the “unique_domains” variable to the x-axis; the y-axis shows the frequency, that is, the number of channels whose value falls in each bin. The “geom_histogram” function draws the histogram, and the “labs” function adds the x- and y-axis labels.

For the second histogram, the code is similar, except that it specifies that the x-axis should show the “total_URLs” variable, and that the plot should be labeled accordingly.

Overall, the code creates two histograms of the data contained in the “df” data frame, providing a visual representation of the distribution of unique domains and total URLs shared in the Telegram network.

ggplot(df, aes(x = unique_domains)) + geom_histogram() + labs(x = "Unique Domains", y = "Frequency")

ggplot(df, aes(x = total_URLs)) + geom_histogram() + labs(x = "Total URLs", y = "Frequency")
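
Because both variables are strongly right-skewed, a histogram on a linear axis tends to crowd most channels into the lowest bins. One possible variation, shown only as a sketch, is a log10 x-axis. The data frame df_urls, the object name p, and the bin count of 5 are assumptions chosen for illustration, not settings from the original analysis.

```r
library(ggplot2)

# total_URLs values copied from the glimpse() output.
df_urls <- data.frame(
  total_URLs = c(9106, 3904, 2366, 1269, 445, 209, 196, 194)
)

# Same histogram as above, but with a log10-scaled x-axis so that
# the low-volume channels are spread over several bins.
p <- ggplot(df_urls, aes(x = total_URLs)) +
  geom_histogram(bins = 5) +
  scale_x_log10() +
  labs(x = "Total URLs (log10 scale)", y = "Frequency")

print(p)
```

A log scale requires strictly positive values, which holds here because every channel shared at least 194 URLs.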



Describe Core Channels

Grouping and Percentages for Edges

df <- data.frame(ch_stats)
describe(df)



The “describe(df)” call, from the Hmisc package, summarizes the distribution of values in each column of the df data frame.



Write file to save results.

prof_num <- profiling_num(df)   # compute the profiling table once
print(prof_num)
write.csv(prof_num, file = "prof_num_ch_stats.csv", row.names = FALSE)



Combined

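basic_eda <- function(data) {
  plot(data)                  # scatterplot matrix of all columns
  plot_num(data)              # histograms of the numeric columns (funModeling)
  glimpse(data)               # structure and column types
  print(summary(data))        # print() is needed inside a function
  print(profiling_num(data))  # numeric profiling table (funModeling)
}

basic_eda(stats)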
Rows: 8
Columns: 3
$ source         <chr> "cyber_frontZ", "ruserbia", "orly_rs", "rtbalkan_ru", "balkan_spy", "…
$ unique_domains <dbl> 1610, 745, 354, 55, 97, 63, 25, 77
$ total_URLs     <dbl> 9106, 3904, 2366, 1269, 445, 209, 196, 194

