Setup and Requirements

Install the required packages (Hmisc, funModeling, tidyverse, and DT), then load them.

library(Hmisc)        # describe()
library(funModeling)  # freq()
library(tidyverse)    # dplyr verbs and ggplot2
library(DT)           # datatable()
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ───────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ purrr   1.0.1     ✔ forcats 1.0.0
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::src()       masks Hmisc::src()
✖ dplyr::summarize() masks Hmisc::summarize()



Source, Target, and Weight Data Frame


Summary Statistics for Edges

    source          unique_domains     total_URLs    
 Length:8           Min.   :  25.0   Min.   : 194.0  
 Class :character   1st Qu.:  61.0   1st Qu.: 205.8  
 Mode  :character   Median :  87.0   Median : 857.0  
                    Mean   : 378.2   Mean   :2211.1  
                    3rd Qu.: 451.8   3rd Qu.:2750.5  
                    Max.   :1610.0   Max.   :9106.0  


Basic EDA for Edges

Basic exploratory data analysis (EDA) is an initial, informal examination of data to understand its main characteristics. It is often the first step in a data analysis process and can help identify potential problems or interesting patterns in the data.

Basic EDA typically involves looking at the distribution of the variables in the dataset, checking for missing values or outliers, identifying patterns or relationships between variables, and summarizing key statistics such as means, medians, and standard deviations.
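
The checks listed above can be sketched in base R on a small toy edge list. This is only an illustration, not the analysis run in this notebook: `toy_edges` and its values are made up (only the column names mirror the edges data), and the 1.5 × IQR rule used to flag outliers is one common convention, assumed here for the example.

```r
# Toy edge list; the column names mirror the edges data, the values are invented.
toy_edges <- data.frame(
  source = c("a", "a", "b", "b", "c"),
  target = c("x", "y", "x", "z", "x"),
  weight = c(1, 2, 1, 40, NA),
  stringsAsFactors = FALSE
)

# 1. Missing values per column.
missing_counts <- colSums(is.na(toy_edges))

# 2. Key statistics for the numeric variable, ignoring missing values.
weight_stats <- c(
  mean   = mean(toy_edges$weight, na.rm = TRUE),
  median = median(toy_edges$weight, na.rm = TRUE),
  sd     = sd(toy_edges$weight, na.rm = TRUE)
)

# 3. Distribution of a categorical variable.
target_counts <- table(toy_edges$target)

# 4. Flag unusually large weights with the 1.5 * IQR rule (an assumed convention).
q <- quantile(toy_edges$weight, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
outliers <- toy_edges[!is.na(toy_edges$weight) & toy_edges$weight > upper_fence, ]

print(missing_counts)
print(round(weight_stats, 2))
print(target_counts)
print(outliers)
```

On skewed data such as edge weights, the median and the outlier check are usually more informative than the mean, since a few very large values can dominate the average.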



Frequencies for Edge Variables

Warning: Skipping plot for variable 'target' (more than 100 categories)

[1] "Variables processed: target, weight"



Describe Edges

edges 

 3  Variables      3026  Observations
-----------------------------------------------------------------------------------------------------------------------------------------------------
source 
       n  missing distinct 
    3026        0        8 

lowest : balkan_spy       cyber_frontZ     narodnapatrola   orly_rs          Prigozhin_hat   
highest: orly_rs          Prigozhin_hat    rtbalkan_ru      ruserbia         russkeydomserbia
                                                                                                                                                  
Value            balkan_spy     cyber_frontZ   narodnapatrola          orly_rs    Prigozhin_hat      rtbalkan_ru         ruserbia russkeydomserbia
Frequency                97             1610               25              354               63               55              745               77
Proportion            0.032            0.532            0.008            0.117            0.021            0.018            0.246            0.025
-----------------------------------------------------------------------------------------------------------------------------------------------------
target 
       n  missing distinct 
    3026        0     2472 

lowest : /c/1732054517/3452 /c/1732054517/4875 /c/1732054517/5727 /c/1732054517/9009 @GRIP_SoundLab    
highest: Ztdk3FDbbhA1Nzdi   zubovskiy4         zvezdalive         zvezdalive.ru      zvezdanews        
-----------------------------------------------------------------------------------------------------------------------------------------------------
weight 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
    3026        0       71    0.714    5.846    9.215        1        1        1        1        2        5        9 

lowest :    1    2    3    4    5, highest:  587  720  755 1179 2664
-----------------------------------------------------------------------------------------------------------------------------------------------------



Grouping and Percentages for Edges

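# Total weight per target, each target's share of the overall weight (in percent),
# sorted from the heaviest target down.
df_edges <- data.frame(edges) %>%
              group_by(target) %>%
              summarise(sum = sum(weight)) %>%
              mutate(percent = sum / sum(sum) * 100) %>%
              arrange(desc(sum))

# Interactive table; datatable() comes from the DT package.
datatable(df_edges)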



write.csv(df_edges, file = 'perc_target.csv')