Install the required packages and load the libraries.
library(Hmisc)        # describe()
library(funModeling)  # profiling_num(), plot_num()
library(tidyverse)    # ggplot2, dplyr, tidyr
library(DT)           # datatable()
Source, Target, and Weight Data Frame
The data represent a summary of a Telegram network that consists of
eight channels, namely “cyber_frontZ,” “ruserbia,” “orly_rs,”
“balkan_spy,” “russkeydomserbia,” “Prigozhin_hat,” “rtbalkan_ru,” and
“narodnapatrola.”
The “source” column identifies the channel, while the
“unique_domains” and “total_URLs” columns represent the number of
distinct domains and URLs shared in each channel, respectively.
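# ch_stats is assumed to be loaded already (the text later refers to ch_stats.csv)
stats <- data.frame(ch_stats)
datatable(stats)  # interactive table from the DT package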
The data show that cyber_frontZ is the most active channel, with the
highest number of unique domains (1610) and total URLs (9106) shared.
In contrast, the narodnapatrola and russkeydomserbia channels shared the
fewest URLs, with 25 and 77 unique domains and 196 and 194 total URLs,
respectively. The lowest unique-domain count is 25; two other channels
(55 and 63 unique domains) also fall below 77.
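The total number of unique domains and total URLs shared across all
channels is reported as 2095 and 17339, respectively. On these figures,
cyber_frontZ alone accounts for about three quarters of the unique
domains and just over half of the URLs in the network. (The per-channel
counts in the table sum to 3026 and 17689; the lower network totals
presumably reflect domains and URLs that appear in more than one
channel.)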
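The shares behind this statement can be checked with a little arithmetic. The following is a minimal sketch that uses only the figures quoted above; the variable names are illustrative and not part of the original analysis.

```r
# Figures quoted in the text above
cyber_domains <- 1610   # unique domains shared by cyber_frontZ
cyber_urls    <- 9106   # total URLs shared by cyber_frontZ
all_domains   <- 2095   # unique domains across the network (as reported)
all_urls      <- 17339  # total URLs across the network (as reported)

# Share of the network totals attributable to cyber_frontZ, in percent
domain_share <- round(100 * cyber_domains / all_domains, 1)
url_share    <- round(100 * cyber_urls / all_urls, 1)

cat("Share of unique domains:", domain_share, "%\n")
cat("Share of total URLs:", url_share, "%\n")
```

Both shares exceed 50%, which is what "the majority" requires.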
Summary Statistics for Core Channels
The summary statistics show that the mean number of unique domains
(378.2) and total URLs (2211.1) per channel is far above the minimum
values (25 and 194, respectively). A gap this large between the mean and
the minimum indicates that sharing activity varies widely across the
channels in the network.
The medians for unique domains (87) and total URLs (857) are well below
the means, which indicates that both distributions are positively
(right) skewed: a few very active channels pull the mean upward. The
positive skewness values reported for both variables in the profiling
output support this.
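The comparison of mean and median can be reproduced directly from the per-channel counts shown in the glimpse output in this document. The sketch below also computes a moment-based skewness; the helper skewness_moment is a hypothetical name introduced here, and packages that apply small-sample corrections may report slightly different values.

```r
# Per-channel counts as shown in the glimpse output in this document
counts <- list(
  unique_domains = c(1610, 745, 354, 55, 97, 63, 25, 77),
  total_URLs     = c(9106, 3904, 2366, 1269, 445, 209, 196, 194)
)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2 (illustrative helper)
skewness_moment <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

for (name in names(counts)) {
  x <- counts[[name]]
  cat(name, ": mean =", mean(x), ", median =", median(x),
      ", skewness =", round(skewness_moment(x), 2), "\n")
}
```

A mean above the median together with a positive skewness value is the pattern the paragraph above describes.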
The interquartile ranges (IQR) for unique domains (390.75) and total
URLs (2544.75) are large relative to the medians, which again points to
high variability in the data. The maximum values, 1610 unique domains
and 9106 total URLs, both belong to cyber_frontZ and show that a small
number of channels account for much of the sharing.
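The IQR values can be verified with base R. This is a small sketch using the per-channel counts from the glimpse output in this document; quantile() is called with its default method, which is also what summary() uses.

```r
# Per-channel counts as shown in the glimpse output in this document
unique_domains <- c(1610, 745, 354, 55, 97, 63, 25, 77)
total_URLs     <- c(9106, 3904, 2366, 1269, 445, 209, 196, 194)

# First and third quartiles (default quantile method)
q_domains <- quantile(unique_domains, c(0.25, 0.75))
q_urls    <- quantile(total_URLs, c(0.25, 0.75))

# IQR = third quartile minus first quartile
iqr_domains <- IQR(unique_domains)  # 390.75
iqr_urls    <- IQR(total_URLs)      # 2544.75

print(q_domains)
print(q_urls)
cat("IQR unique_domains:", iqr_domains, "\n")
cat("IQR total_URLs:", iqr_urls, "\n")
```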
Taken together, the quartiles, IQR, and skewness add context to the
means and medians and give a fuller picture of how unique domains and
URLs are distributed across the Telegram network.
Basic Exploratory Data Analysis
Basic exploratory data analysis (EDA) is an initial, informal
examination of the data to understand its main characteristics. It is
often the first step in an analysis and helps identify potential
problems or interesting patterns in the data.
Basic EDA typically involves looking at the distribution of the
variables in the dataset, checking for missing values or outliers,
identifying patterns or relationships between variables, and summarizing
key statistics such as means, medians, and standard deviations.
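The checks listed above can be sketched in a few lines of base R. The example assumes a numeric-only stand-in data frame, df_num, built from the counts in the glimpse output in this document; find_outliers is a hypothetical helper, and the 1.5 * IQR rule it applies is one common convention rather than the method used in the original analysis.

```r
# Numeric stand-in for the channel data (values from the glimpse output)
df_num <- data.frame(
  unique_domains = c(1610, 745, 354, 55, 97, 63, 25, 77),
  total_URLs     = c(9106, 3904, 2366, 1269, 445, 209, 196, 194)
)

# 1. Missing values per column
missing_per_column <- colSums(is.na(df_num))

# 2. Outliers by the 1.5 * IQR rule (assumed convention)
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}
outliers <- lapply(df_num, find_outliers)

# 3. Relationship between the two variables (Pearson correlation)
domain_url_cor <- cor(df_num$unique_domains, df_num$total_URLs)

# 4. Key statistics per column
key_stats <- sapply(df_num, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})

print(missing_per_column)
print(outliers)
print(domain_url_cor)
print(key_stats)
```

On this data the rule flags one value per column, 1610 unique domains and 9106 total URLs, both from cyber_frontZ.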
Convert data to data frame.
df <- data.frame(ch_stats)
Plot function.
plot_function <- function(df) {
  # Histogram of total URLs per channel (ggplot2).
  # The column is referenced unquoted; a quoted name would be treated
  # as a constant string rather than a variable.
  ggplot(data = df) +
    geom_histogram(aes(x = total_URLs), bins = 10)
}
plot_function(df)
Glimpse function.
glimpse_function <- function(df) {
  # Print the data structure: dimensions, column types, first values (dplyr)
  glimpse(df)
}
glimpse_function(df)
Rows: 8
Columns: 3
$ source <chr> "cyber_frontZ", "ruserbia", "orly_rs", "rtbalkan_ru", "balkan_spy", "…
$ unique_domains <dbl> 1610, 745, 354, 55, 97, 63, 25, 77
$ total_URLs <dbl> 9106, 3904, 2366, 1269, 445, 209, 196, 194
Summary function.
summary_function <- function(df) {
  # Print a per-column summary of the data values (base R)
  summary(df)
}
summary_function(df)
    source          unique_domains     total_URLs
 Length:8           Min.   :  25.0   Min.   : 194.0
 Class :character   1st Qu.:  61.0   1st Qu.: 205.8
 Mode  :character   Median :  87.0   Median : 857.0
                    Mean   : 378.2   Mean   :2211.1
                    3rd Qu.: 451.8   3rd Qu.:2750.5
                    Max.   :1610.0   Max.   :9106.0
Profiling function.
profiling_num_function <- function(df) {
  # Numeric variable profile built with dplyr and tidyr.
  # Operates on the df argument rather than a global object.
  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
    group_by(variable) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              sd = sd(value, na.rm = TRUE),
              min = min(value, na.rm = TRUE),
              max = max(value, na.rm = TRUE),
              median = median(value, na.rm = TRUE),
              n_missing = sum(is.na(value)))
}
profiling_num_function(df)
Plot the Core Channel Data
The code uses the ggplot2 package to create two histograms, one for
“unique_domains” and one for “total_URLs”, from the ch_stats.csv data,
now held in the “df” data frame.
For the first histogram, the code maps “unique_domains” to the x-axis;
the y-axis shows how many channels fall into each bin. The
“geom_histogram” function draws the histogram, and because no bins
argument is given it falls back to its default of 30 bins. Finally, the
“labs” function sets the x and y axis labels.
The second histogram is built the same way, except that the x-axis
shows “total_URLs” and the labels change accordingly.
Together, the two plots give a visual representation of the
distribution of unique domains and total URLs shared in the Telegram
network.
ggplot(df, aes(x = unique_domains)) +
  geom_histogram() +
  labs(x = "Unique Domains", y = "Frequency")

ggplot(df, aes(x = total_URLs)) +
  geom_histogram() +
  labs(x = "Total URLs", y = "Frequency")
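The frequencies that a histogram draws can also be inspected as numbers. The sketch below uses base R's hist() with plot = FALSE; the break points (steps of 2000) are an illustrative choice and not the bins ggplot2 selects by default.

```r
# Total URLs per channel, from the glimpse output in this document
total_URLs <- c(9106, 3904, 2366, 1269, 445, 209, 196, 194)

# Illustrative bin edges: 0, 2000, ..., 10000
breaks <- seq(0, 10000, by = 2000)

# Compute the bin counts without drawing the plot
h <- hist(total_URLs, breaks = breaks, plot = FALSE)

# One row per bin: lower edge, upper edge, number of channels
bin_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```

With these bins, five of the eight channels fall below 2000 URLs and a single channel sits in the top bin, which is the right-skewed shape the histograms show.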

Describe Core Channels
Grouping and Percentages for Edges
df <- data.frame(ch_stats)
describe(df)
profiling_num_function(df)
The “describe(df)” call, from Hmisc, summarizes the distribution of
values for each column in the df data frame; “profiling_num_function(df)”
adds the numeric profile defined earlier.
Write file to save results.
prof_num <- profiling_num(df)  # funModeling numeric profile, computed once
print(prof_num)
write.csv(prof_num, file = "prof_num_ch_stats.csv")
Combined
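df <- data.frame(stats)
basic_eda <- function(data) {
  plot(data)                  # scatterplot matrix of all columns
  plot_num(data)              # histograms of numeric columns (funModeling)
  glimpse(data)
  print(summary(data))        # print() is needed inside a function
  print(profiling_num(data))
}
basic_eda(df)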
Rows: 8
Columns: 3
$ source <chr> "cyber_frontZ", "ruserbia", "orly_rs", "rtbalkan_ru", "balkan_spy", "…
$ unique_domains <dbl> 1610, 745, 354, 55, 97, 63, 25, 77
$ total_URLs <dbl> 9106, 3904, 2366, 1269, 445, 209, 196, 194


