I don’t know under what context you might be looking at this html, but hello I’m Tim. I’m writing this preface so that no matter where you’re coming from I can explain what you’re about to see.
This is a data analysis I completed on some survey data I collected from a Facebook group known as Alex G 666posting. Alex G is a prolific indie artist who has been releasing music since at least 2010. His popularity has continued to grow with each album release, but what makes him such an interesting musician is the sheer size of his catalog: including his leaked unreleased tracks, Alex has 200+ songs.
I joined the group back in January of 2019 and have come to really enjoy the community that exists there. In the group a member, Zion, asked how old people were when they first started listening to Alex G and how old they were now. I asked if it was okay, collected that data, and made a figure for the group showing the distribution of ages in the group. I then asked if it was okay to make a survey for the group to analyze more data. I got the okay, specifically from Isabelle, and began working. Many individuals helped with the creation of the survey, as you can see in my thanks at the bottom, and the final result was 38 columns of data from 211 individuals. Some questions are Alex G related, some not. But I hope you find something interesting in this little project I made. I’ve been working on it on and off in my free time since late April, finishing it up in late September. Either way, it’s a labor of love to give back to the great community I found online. The group is full of cool people and I’m glad to be a part of it (I also think this project helped me become a mod admin so that was nice).
At the time of me writing this, 666posting is on the cusp of reaching 3,000 members. That’s great, but we need to talk about sample size, because that means we only have ~7% of the group accounted for (211 of roughly 3,000). Not only that, but as this is an optional survey, one must be aware of the type of people who would take the time to fill out said survey. This is not an unbiased slice of the group, but I do like to think this includes the core group of active individuals. So with that in mind, let’s get into the analysis.
tidyverse: Used for general exploratory analysis, primarily used dplyr within it
ztable: Used to make various tables
UpSetR: Used to make Upset plots
wordcloud: Used to make word clouds (surprise!)
rworldmap: The map of the world!
mapproj: The map of the US
wesanderson: Wes Anderson Color Palettes
maps: More maps
viridis: Used for colorblind-friendly palettes
rstatix: Stat tests
EnvStats: Used to get the N on ggplots for groupings
ggpubr: Arranging the plots
spotifyr: Getting that spotify data!
scales: Percent axis
reshape2: Data melting
ggforce: Sina Plots
set.seed(666)
This was just to keep the consistency of the word clouds, it used to be 123, but uh, Alex G frequently references 666.
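For anyone curious why a seed keeps the word clouds consistent, here is a minimal base-R sketch. The `sample()` call is only a stand-in (my assumption) for the random placement a word cloud does; the point is that resetting the seed replays the same random draws.

```r
# Resetting the seed makes random draws repeatable.
set.seed(666)
first_draw <- sample(1:100, 5)

set.seed(666)
second_draw <- sample(1:100, 5)

# Same seed, same draws.
identical(first_draw, second_draw)
```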
So the first demographics we’ll take a quick peek at are age and race.
First, we’ll look at age in a histogram colored on “what generation do you identify with”. Get used to the age variable, as it is one of our few numerical values, so I’m going to be plotting it a lot. Anyways, overall age is right-skewed, with only one real outlier at 42. Alex G is about 26-27 at the time of me writing this and has been writing music since his early teenage years, so the demographics of teens and 20-somethings is unsurprising.
As a straight white dude (who we will see is the average Alex G listener), I can’t really speak to why or why not Alex is popular or not popular with the other demographics, and frankly I’m not sure it is my place to speculate. As a result, for race, gender, trans identity, and polyamory I will let you draw your own conclusions. I just don’t feel I as an individual should be speaking didactically about such a subject.
This is the plot that inspired the project. How? Quick storytime:
Zion, a member of 666posting asked individuals “What age were you when you discovered Alex G and what age are you now?”. Everyone started answering the question and I found it interesting. Eventually, it had 200+ individuals and I came up with an idea of making two overlapping histograms, one distribution for each respective question. I asked for permission and then began the first analysis of 666posting. As you’ll see I didn’t stop there with that data, as I thought of other things to do with it, but this eventually led to me asking if I could run a survey which leads to this whole R markdown. So really Zion inspired me to do this, so if you’re reading this, thanks dude!
Another plot I made, using the aforementioned “year learning of Alex G” data, is a diagram of his fan base growth over time. I did this by overlaying the “first year people learned of Alex G” data with his album releases (the albums are colored on the album art). However, unlike my initial version of this plot, which was based on people just commenting on a Facebook post, this data is a little… odd.
Compare to the original figure.
And compared to this figure our new one is definitely different. First of all, we have much more of a normal distribution besides the person… who first heard Alex G in ’07. Frankly, I’m a little dubious and would be interested in hearing that story. The individual did not identify which state they live in, so I thought my worries would be assuaged by hearing they lived in PA, but alas, I have no idea. The other interesting aspect is our first non ’07 fan comes in around 2011, after Race, the first complete Alex G album. I would expect a larger boost with Race, and the previous plot I made did have individuals listening around the time of Race, admittedly few, but more than none. It is important to note Alex G was making music with his band the Skin Cells during the early years as well, which could influence when people heard his solo stuff.
Also compared to the previous plot we have a lot more 2018-20 fans, but I think that’s primarily because the original data was pulled from late March/early April 2020, and this current survey has been rolling open the entire year, so that is not unexpected. But on the topic of the actual data, the data is for the most part normal, which is more indicative of the sample we’re pulling from. I assumed the number of fans should just continue increasing with each album as his fame has only grown. Maybe there is a critical limit of indie fans that he might’ve reached? Who knows! I personally believe this is due to the fact that people trickle into these Facebook groups, so while more people might like Alex now, they might not be in the group (or willing to answer a silly survey). It is interesting that 2015 and the release of Beach Music is his largest jump in popularity, as it is also the same year he signed with Domino Records who probably helped promote his work to a wider audience.
This long.
It’s the previous plot inverted, what can I say.
I just like cumulative distribution plots, it’s all the same data. If you don’t know how cumulative distribution plots work, basically the line shows over the years the growth of Alex’s fanbase as if counting up to the total, so you can see his percentage gain each year to his current 100% of the fanbase.
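If the idea is easier to see in numbers, here is a rough base-R sketch of what goes into a cumulative plot like this. The `first_year` values are made up for illustration and are not the survey answers.

```r
# Hypothetical "year first heard Alex G" answers (not the survey data).
first_year <- c(2011, 2013, 2014, 2015, 2015, 2015, 2016, 2017, 2017, 2019)

# Count new fans per year, then accumulate and convert to a share of the total.
new_fans <- table(first_year)
cum_share <- cumsum(new_fans) / length(first_year)

# Each value is the fraction of today's fanbase that had arrived by that year.
round(cum_share, 2)
```

The last value is always 1 (the current 100% of the fanbase), and the year-to-year jump is that year's percentage gain.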
So Alex G, for those not in the know, lives in Philly. But, like most successful artists, people want to see him perform in other cities and he tours pretty regularly. So I thought it would be interesting to see where Alex G fans are all over the world!
A simple plot of the whole world colored on fans per 100,000 individuals in each of those countries.
HOLD UP, WHERE IS ALL THE RED? You may say, to which I respond, “GIVE ME A SECOND I’LL GET TO IT WE GOTTA ZOOM IN EVERYWHERE FIRST TO FIND IT”
So with all that said, the only interesting question I could think of is comparing this data to Alex G’s touring data so… tada! I took this data from setlist.fm, they actually have interesting concert statistics, I suggest checking them out. I actually had a project idea where I was gonna scrape their data and make an average Alex G setlist for each year and generate a naive Bayes model/shiny app allowing people to put in their perfect setlist and return the probability of that concert happening. But they already did the first thing, and I ended up doing this entire analysis instead of the naive Bayes thing, that’s life. Back to what I was talking about:
Alex G Concerts Around the World from setlist.fm
I put the setlist data into a csv file so I could use it in R, and I decided to plot the number of concerts in a country versus the number of fans. But you can see when I first saw the data….
America ruined the graph. So I decided to log scale it and make it more informative with shapes and colors.
Ireland had 2 concerts and no fans which is why you can’t see it :(
Number of concerts per fan linear model with America
##
## Call:
## lm(formula = concerts ~ fans, data = joined_country_concerts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1377 -0.0644 1.6368 2.0591 6.7277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.25582 1.42528 -0.881 0.393
## fans 1.52811 0.04239 36.050 3.29e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.297 on 14 degrees of freedom
## Multiple R-squared: 0.9893, Adjusted R-squared: 0.9886
## F-statistic: 1300 on 1 and 14 DF, p-value: 3.287e-15
Number of concerts per fan linear model without America
##
## Call:
## lm(formula = concerts ~ fans, data = no_america_concerts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8957 -1.0379 -0.0379 2.4785 9.1371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0489 1.6272 0.645 0.53041
## fans 0.9891 0.2444 4.047 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.674 on 13 degrees of freedom
## Multiple R-squared: 0.5575, Adjusted R-squared: 0.5235
## F-statistic: 16.38 on 1 and 13 DF, p-value: 0.001384
Basically, the number of concerts in a country increases by about 1.5 for each additional fan (survey respondent) in said country, but that is driven by the outlier known as the US. If we remove the US, the result changes to an increase of about 0.99 concerts per fan, and the R-squared drops from 0.99 to 0.56.
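To show how a single large point can drive a fit like this, here is a small sketch with invented numbers. The `toy` data frame and its country labels are hypothetical; only the `concerts ~ fans` formula mirrors the models above.

```r
# Hypothetical fans-vs-concerts data; the last row mimics one huge outlier.
toy <- data.frame(
  country  = c("A", "B", "C", "D", "E", "Big"),
  fans     = c(1, 2, 3, 5, 8, 120),
  concerts = c(2, 1, 5, 4, 9, 185)
)

fit_all <- lm(concerts ~ fans, data = toy)
fit_no_outlier <- lm(concerts ~ fans, data = subset(toy, country != "Big"))

# The slope is the expected change in concerts per one additional fan.
coef(fit_all)["fans"]
coef(fit_no_outlier)["fans"]

# Compare how much variance each fit explains.
summary(fit_all)$r.squared
summary(fit_no_outlier)$r.squared
```

In this toy case the fit with the outlier has the higher R-squared, simply because one far-away point anchors the line.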
So now let us look at the Alex G Fans around the US.
I used the below vignette to guide this section https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html
I got the census data from census.gov particularly the file named: NST-EST2019-01: Table 1. Annual Estimates of the Resident Population for the United States...
I’m not gonna lie, I was lazy and cleaned up the data in Excel, but basically I filtered to only the 2019 data. I used this data primarily to control for population in the coloration of the maps below. I don’t really have much to say on them so enjoy!
So the analysis I decided to do here was ask “Where is Alex’s fan base the largest (when controlling for population size) in the United States?” I pulled this region data from kaggle and will use it to test our hypothesis. “Our hypothesis?” you say? Yes, obviously it is the East Coast which has the most representation, if it’s not, all hope will be lost.
(P.S. I’ve learned of the state.region data in R, but womp womp here we R, let’s just go with it)
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: summarised_region$fans out of summarised_region$pop
## X-squared = 3.4198, df = 3, p-value = 0.3313
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 4.390522e-07 5.001536e-07 3.264839e-07 4.199483e-07
Well there doesn’t seem to be a clear winner… I’ll just tell myself if we included Southern East Coast states it would’ve won.
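For reference, a test of this shape can be sketched with base R’s `prop.test()`. The counts and populations below are invented placeholders, not the survey or census figures.

```r
# Hypothetical respondent counts and populations for four regions.
region_counts <- data.frame(
  region = c("Northeast", "Midwest", "South", "West"),
  fans   = c(25, 30, 40, 33),
  pop    = c(56e6, 68e6, 125e6, 78e6)
)

# Tests the null that all four regions share one fans-per-person proportion.
region_test <- prop.test(x = region_counts$fans, n = region_counts$pop)

region_test$estimate   # per-region proportions
region_test$p.value    # large values mean no detectable regional difference
```

With four groups the test has 3 degrees of freedom, which matches the `df = 3` in the output above.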
So one of the first graphs I generated with the survey data, because I thought it would be a fun one to do, is looking at the political leanings in the group. While politics has a lot more depth than can be represented by my figure, I think having two axes, left to right and libertarian to authoritarian, allows a fair amount of depth in itself. The question was phrased as follows:
“Where do you stand on the political compass? This website will calculate it, but you can do it by personal feel as well by looking at the image! https://www.politicalcompass.org/ (center at 5 and round if you get a score from the website, site ~10 minutes)”
With the below image as a guide. The image isn’t completely accurate if you ask me, but from a quick Google search, it does a good enough job.
The reason that I said this is spicy, is because, let’s be honest, we live in an incredibly politically charged time. And when I posted this, I completely understood people not being comfortable with individuals having, uh, “strong fascist leanings” so it caused a bit of a row (Never a good look when your post has more comments than likes). That said, I still think the data is pretty interesting so let’s look at it.
As I asked for answers in integers, the data is technically a discrete ordinal value, so this is one of the better ways to represent it. The size of the circle represents the number of individuals who selected said option. We will have more readable versions of this data later on in case you’re interested in exact numbers.
Second political compass scatterplot with a jitter and the shape of the points relating to gender. In case you’re wondering why some points are off of the figure that is due to the jitter, I apologize for that.
The same scatterplot as above, but this time the dots are colored on the age of the individuals. I had to remove the individual over 40 because that threw the scale off even when I used a log scale, so my apologies to them.
Let us see if we can break this data up into the different quadrants, including wiggle room in the middle for centrists. Again, it’s a scale of 0-10, so I’m gonna have 4-6 as centrist.
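One way this binning could look in base R is sketched below. The exact rule is my assumption: someone is a centrist only when both scores fall in 4-6, low scores mean left and libertarian, and a lone score of exactly 5 is arbitrarily sent to the right or authoritarian side. The `compass` data and column names are invented.

```r
# Hypothetical compass answers on the 0-10 integer scale (5 is the center).
compass <- data.frame(
  left_right = c(1, 5, 9, 2, 8, 4, 6),
  lib_auth   = c(2, 5, 8, 9, 1, 3, 7)
)

# Assumed rule: both scores in 4-6 means centrist; otherwise pick a quadrant
# by which side of the center each score falls on (ties at 5 go right/auth).
classify <- function(lr, la) {
  if (lr >= 4 && lr <= 6 && la >= 4 && la <= 6) return("Centrist")
  economic <- if (lr < 5) "Left" else "Right"
  social   <- if (la < 5) "Libertarian" else "Authoritarian"
  paste(social, economic)
}

compass$quadrant <- mapply(classify, compass$left_right, compass$lib_auth)
table(compass$quadrant)
```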
While this plot is interesting, I don’t think there is a good way to test these groups because some of them have very few individuals, however there is another way we can look at this.
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.2 0.414 56.1 6.70e-124
## 2 left_right 0.0375 0.139 0.270 7.88e- 1
Well that’s not significant, how about being libertarian or authoritarian?
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.5 0.467 50.3 4.33e-115
## 2 lib_auth -0.0644 0.124 -0.519 6.04e- 1
Well, I guess I can say, specifically in this 666posting cohort, your age doesn’t seem to influence your political leanings. Sorry, that wasn’t interesting, but again there are some groupings with very few samples so that didn’t help either and I didn’t remove the outliers.
Another question asked is if you play instruments or not, and if so which ones?
Preparing the data and making a melted version of the data.frame
## [1] "Guitar" "Banjo" "Voice" "Accordion"
## [5] "Bass" "Violin/Viola" "Percussion/Drums" "Piano/Keyboard"
## [9] "Brass" "Music Software" "Cello" "Mandolin"
## [13] "Sampler" "Woodwind" "Ukulele" "omnichord"
## [17] "Chinese Flute" "clarinet" "synthesiser"
Generate general summaries to be used in the first plot
Plot of instruments played in 666posting
Okay, here I’m about to go on a visualization tirade. So I’m sure you’re used to Venn diagrams, but have probably not heard of Upset plots. So Venn diagrams are the classic way to represent count crossover between multiple groups, but when groups are large this generally reduces to circles with numbers inside of them. This doesn’t do the scale of the differences proper justice, and people, in general, have a better time understanding the magnitude of difference when there is a visualization. And people are especially good at comparing sizes of adjacent bars. As a result, a new plot has been formed called Upset plots.
I will explain how to read them now. An Upset plot is comprised of two bar plots that relate to the central image with balls and lines. The barplot on the “y-axis” of the center image is the total count for how many people play each instrument. If we look at the center image, you can see each row is labeled by an instrument. Each column, however, has a different set of filled dots, one possible dot per instrument. When more than one dot in a column is filled, that shows an overlap between those instruments, and the number of people with that overlap is what is counted in the “x-axis” barplot above. So one plot gives you the total number of people who play a certain instrument, while the other plot gives you how many people share the same overlap of instruments. I hope this makes sense, but it will be readily apparent once you see the plots.
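The two bar plots boil down to two kinds of counts, which can be sketched in base R without any plotting. The `instr` table is a made-up 0/1 example, not the survey data, and this only computes the numbers an Upset plot would draw.

```r
# Hypothetical 0/1 answers: one row per person, one column per instrument.
instr <- data.frame(
  Guitar = c(1, 1, 1, 0, 1, 0),
  Bass   = c(0, 1, 1, 0, 0, 1),
  Piano  = c(0, 0, 1, 1, 0, 1)
)

# Side bar plot: total players per instrument.
set_sizes <- colSums(instr)

# Top bar plot: people per exact combination of instruments.
combo <- apply(instr, 1, function(row) {
  paste(names(instr)[row == 1], collapse = " + ")
})
intersection_sizes <- sort(table(combo), decreasing = TRUE)

set_sizes
intersection_sizes
```

Every person lands in exactly one combination, so the top bars always add up to the number of respondents, while the side bars can add up to more.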
Anyways, the way to read this diagram is as follows: The total count of each instrument is in the bottom left corner, where you can see that we have 121 guitarists who answered the survey followed by ~70 keyboard/piano players. Then if you look to the right of that graph you’ll see dots and lines. If two dots are filled, that means that the bar above them represents the intersection of individuals who play both those instruments. As more dots are filled, that represents a greater intersection. So you can see that there are 27 solo guitarists, while there are 12 people who play all of the instruments in this graph.
As I said, I couldn’t fit in all the data because of the limitations so sorry to all the unique instrument players, but at least you got a shout out in the first graph you Chinese flute-playing god.
Fun fact: there are 2 people who play the same instruments as Alex, including the guitar, bass, drums, banjo, voice, mandolin, music software, and keyboard (to my knowledge).
One last quick question about our instrumentalists, are you in a band?
Frankly, I expected “No” to be the dominant answer, but I am surprised by how good a fight “Yes” put up. Maybe that’s my personal bias of knowing more musicians who aren’t in bands.
This is basically the same analysis as above, but this time we’re looking at the rest of the world of The Arts!
Plot of art mediums in 666posting
So… word clouds. Technically speaking, word clouds are never a good way to represent data in a meaningful way. While the size of the letters increases with the number of votes, one can still be tricked into thinking, in this case, that long band names have more votes, which isn’t true. As a result, you’ll see that I ended up coloring the data a little bit to help clarify this visual discrepancy.
If they have more than one fan, the name is in orange.
Alex has 80 fans by the way.
Alex has 33 individuals who consider him their second favorite artist.
First, we gotta do some MASSIVE data cleaning, I really should not have left this open for people to put whatever they want, look at all the unique comments, as of writing there are 50. So we gotta find a way to generalize these. That said, some are very interesting.
## [1] "Through a friend"
## [2] "Spotify/Apple Suggested Music"
## [3] "I know him personally"
## [4] "DIY show "
## [5] "Youtube"
## [6] "Press"
## [7] "Tumblr"
## [8] "College Radio"
## [9] "Reddit"
## [10] "Compilation Albums"
## [11] "I read an article from Pitchfork about the release of DSU"
## [12] "Suggested by another artist I listen to"
## [13] "Pitchfork"
## [14] "SoundCloud"
## [15] "Toured Together"
## [16] "Crush at the time"
## [17] "another FB group"
## [18] ""
## [19] "4chan"
## [20] "Soundcloud"
## [21] "saw his name on a festival lineup"
## [22] "Soundcloud recommendation"
## [23] "Flaked on Netflix"
## [24] "A girl in twitter who made playlists"
## [25] "Record store clerk"
## [26] "through music blogs like NME or pitchfork"
## [27] "Radio"
## [28] "Nintendo 64 cover on a Ztapes compilation"
## [29] "pitchfork's rocket review, sorry!"
## [30] "8tracks.com"
## [31] "Wikipedia"
## [32] "Festival Lineup"
## [33] "Music journalists/festivals"
## [34] "Saw him perform at a festival"
## [35] "Music blog"
## [36] "Facebook"
## [37] "He performed at a festival I attended"
## [38] "lofi record labels (specifically birdtapes and orchid tapes)"
## [39] "From a vine"
## [40] "Through facebook friends"
## [41] "tumblr"
## [42] "Through my brother (maybe this comes under friend?)"
## [43] "420 Love Songs Compilation (wasnt sure if you meant Alex G compilations sorry!)"
## [44] "Through the Spotify playlists of a YouTuber I like"
## [45] "i really can’t remember "
## [46] "radio podcast"
## [47] "at a concert (Living Bread, brooklyn, may 2013)"
## [48] "was the support at a show i was at "
## [49] "Flake"
## [50] "GTA V lol"
If you’re viewing the version without the code you might not know what is going on here, but basically I am using keywords in all of the non-standard options to fit them into more standard categories. For example, if the answer includes a website, I detect a distinctive substring in that character string (“umblr” for Tumblr) and then rename that answer to “Other Website”. This is a more interesting section to look at via the coded version of this file.
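For readers of the no-code version, the general idea can be sketched in a few lines of base R. This is a stand-in, not the code actually used here: the `rules` patterns and the `recode_answer()` helper are hypothetical, and only the “umblr” trick and the category names come from the text and output around it.

```r
# A few of the free-text answers listed above.
answers <- c("Tumblr", "tumblr", "Reddit", "College Radio",
             "radio podcast", "Saw him perform at a festival",
             "Through a friend")

# Hypothetical keyword -> category rules; the first matching rule wins.
# Dropping the first letter ("umblr", "adio") sidesteps capitalization.
rules <- c(
  "umblr|eddit" = "Other Website",
  "adio"        = "Radio",
  "festival"    = "Live Music"
)

recode_answer <- function(x) {
  for (pattern in names(rules)) {
    if (grepl(pattern, x)) return(unname(rules[pattern]))
  }
  x  # leave answers that match no rule untouched
}

cleaned <- vapply(answers, recode_answer, character(1), USE.NAMES = FALSE)
unique(cleaned)
```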
## [1] "Through a friend" "Spotify/Apple Suggested Music"
## [3] "I know him personally" "Live Music"
## [5] "Other Website" "Journalism"
## [7] "Radio" "Compilations/Label"
## [9] "Other media" "Online Personality"
Making the plot now simplified including a table of how we listen to our music
Now we’re getting into that good Alex G data, favorite albums, favorite songs, and more!
A barplot of favorite released Alex G albums
A barplot of favorite Alex G Fan compilations, also for those of you that don’t listen to his unreleased stuff, you should give it a listen.
Overall it seems that people don’t listen to them, but as I said, I suggest it. The winner of them is by a large margin Monsterhead, which makes sense to me. Out of all his compilations, I feel it is the most stacked with well known unreleased tracks (Nintendo 64, Uh, Written in Blood, etc.)
A word cloud of everyone’s favorite Alex G songs from the survey, colored by what album they’re on, with size based on the number of people that selected a given song as their favorite. So as you can see, the top two favorite songs of those who answered the survey are Snot and Gnaw, and tbh, I’m SNOT surprised. Ugh, anyways the thing I actually love about this image is, because of how Alex names his songs, there are cool phrases that are generated. Personally I love “Screwy People, I Wait For You”, it’s just fun sticking his song titles together.
A bar plot of the top 10 favorite Alex G songs, just to ascribe hard numbers to the data, which is the issue with word clouds as said before.
Looks like Gnaw is our winner!
A quick question I wanted to ask is: “Is it more likely that your favorite song is on your favorite album?” Which is what I did below. I made sure to only include favorite songs on released albums, as with unreleased material stuff gets dicey. I ran a chi-square, which has the null hypothesis “There should be no correlation between one’s favorite song and their favorite album” meaning that the count should be split between “Yes, my favorite song is on my favorite album” and “No, my favorite song is not on my favorite album” and it turns out….
##
## FALSE TRUE
## 112 64
##
## Chi-squared test for given probabilities
##
## data: table(songs_dat_test$fav_song_fav_album)
## X-squared = 13.091, df = 1, p-value = 0.0002967
It is more likely that you will not have your favorite song in your favorite album, statistically speaking. Frankly if you compare favorite songs to favorite albums I’m just gonna blame Gnaw for this one.
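The test above can be re-run from the two printed counts alone. This sketch assumes the same even-split null described in the text, which is what `chisq.test()` uses when no `p` argument is given.

```r
# Counts reported above: favorite song not on / on the favorite album.
on_fav_album <- c("FALSE" = 112, "TRUE" = 64)

# With no `p` argument, chisq.test() compares the counts to an even split.
gof <- chisq.test(on_fav_album)

gof$expected   # 88 and 88 under the null
gof$statistic  # about 13.09, matching the output above
gof$p.value    # about 0.0003
```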
I would first like to thank a friend of mine, Stephanie Y., who looked over this markdown for me and suggested this package.
You might be wondering, “what is spotifyr?”, and to put it simply, it is an R package that allows me to effectively use Spotify’s API to get data about Alex G and his music. While I could do a lot with this data, what I’m most interested in is the different “qualities” that Spotify attributes to various songs. This is done using machine learning on the songs to determine their “valence” (sad to happy; 0 to 1), danceability (0 to 1), energy (0 to 1), etc. I’m going to make tables showing the overall top and bottom songs for each of these categories, and honestly, I’m dubious as hell about these. For example, they have a quality known as “liveness” which determines if the song is recorded live or not. The number 1 “liveness” song is Brick, meanwhile Sugar House - LIVE is third! We also have loudness in decibels, which goes from Clouds (thanks Luke) to Brick. But if you want to know more click here, it’s interesting anyways.
note: All scales are from 0-1 except loudness which is decibels.
note 2: PUT RACE ON SPOTIFY PLEASE
Generating tables with the top 5 and bottom 5 tracks for a select handful of Spotify qualities.
Now that you’ve probably drawn your own opinions about the validity of this data (reminder: these are the top 5 and bottom 5 songs per song quality) let’s look over Alex’s albums and then we’ll see how this intersects with our music.
Technically not the best way to represent the data but…..
note: Remember, the value is in decibels, the closer to 0 the louder it is
I find the overall increase of volume over his career interesting for two reasons. First, Alex mixed and mastered his songs until Beach Music, which was engineered by someone at Domino Records, and the Domino albums seem to have a more consistent volume level. Second, it is known that over time music has only gotten louder, and it seems to be true for Alex as well. Actually, let us test this real quick.
##
## Call:
## lm(formula = loudness ~ album_release_year, data = alex_g)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.057 -1.770 0.153 2.055 5.287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -640.2495 215.2093 -2.975 0.00373 **
## album_release_year 0.3136 0.1069 2.935 0.00421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.997 on 93 degrees of freedom
## Multiple R-squared: 0.08476, Adjusted R-squared: 0.07492
## F-statistic: 8.613 on 1 and 93 DF, p-value: 0.004205
So, to an extent, yes the theory is correct, his music seems to get louder as time passes, and at least Rules and Trick had a relatively larger distribution of volume levels compared to his later work.
Is there a secret formula to Alex G’s top hits? Maybe we can use this data to figure it out! By intersecting the different Spotify qualities and 666posting members’ favorite songs, we can see if the distribution of favorites is different from Alex’s discography. Remember though, this will be excluding non-officially released tracks.
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$danceability and filtered_alex_g$danceability
## D = 0.15808, p-value = 0.09332
## alternative hypothesis: two-sided
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$energy and filtered_alex_g$energy
## D = 0.077578, p-value = 0.8542
## alternative hypothesis: two-sided
##
## Two-sample Kolmogorov-Smirnov test
##
## data: filtered_songs$valence and filtered_alex_g$valence
## D = 0.085427, p-value = 0.7621
## alternative hypothesis: two-sided
Well these are interesting distributions. Alex has a somewhat normal curve for danceability, but there seems to be a bimodal distribution amongst fan favorite songs. As a result, it does lean more towards danceability preference. With energy people prefer lower energy, with fan preference taking a dip around 0.8. Lastly with valence, which I will remind you is basically happiness, there is a low peak around .30 of sad peeps with some happy nerds around .75, breaking through the discography distribution. Statistically speaking, using Kolmogorov-Smirnov tests, which test whether two samples come from the same distribution, we see that there is no significant difference, but that bimodal distribution from danceability trends towards significance.
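As a rough illustration of what the Kolmogorov-Smirnov test is doing, here is a sketch on simulated data. Both vectors are random stand-ins that I made up (a single-humped “discography” and a bimodal “favorites” sample); none of this is the real Spotify data.

```r
set.seed(666)

# Simulated stand-ins for a 0-1 Spotify feature (not the real track data).
discography <- rbeta(95, 5, 5)                      # one hump near 0.5
favorites   <- c(rbeta(40, 8, 3), rbeta(40, 3, 8))  # two humps, bimodal

# ks.test() compares the two empirical distribution functions;
# D is the largest vertical gap between them.
ks_result <- ks.test(favorites, discography)

ks_result$statistic
ks_result$p.value
```

Because D only tracks the largest gap between the two cumulative curves, a bimodal sample centered on the same place as a unimodal one can still come out non-significant at modest sample sizes.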
Myers-Briggs is the zeitgeist when it comes to personality tests. For more information on them check out this website: https://www.16personalities.com/free-personality-test . While there are sixteen personalities, they are not evenly distributed in the population: according to data from the official MBTI types, some make up over 10% of the population, others closer to 2%. These types can also be broken down further into their more discrete types based on the letters, Extrovert vs Introvert (E v I) for example. I have two figures below showing both of these distributions in the general population. We will then compare this to the Alex G data.
Not really sure what the best question to ask here is besides the distribution and how the distribution differs from the normal world. Data pulled from here
General population plots Myers-Briggs Type
Run the Chi-square analysis against general population probabilities
The group, compared to the general population, well, they look NOTHING alike. While the population data is relatively outdated (cough cough, it hasn’t been updated since 2002), our distribution is not close to it at all; below I will have the outputs side by side. A good portion of our members come from the rarer types, and specifically, 18.23% of participating members are the rarest type, INFJ, which makes up ~2% of the normal population. Crazy. This, of course, carries over to the discrete types as well. The starkest difference is that the Intuitives considerably outnumber the Sensors in this group (I don’t know what that means, but if you want to explain it in the comments feel free to), which is the opposite of the general population. It’s wacky, folks.
Note: The reason the P-values are the same is that I have the chi-square set to simulate p-values (simulate.p.value), as the distribution of values results in a warning otherwise; this isn’t the best dataset for this test.
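One plausible reading of the identical p-values, sketched with invented numbers: when `chisq.test()` simulates the p-value, the smallest value it can report is 1 / (B + 1), so any sufficiently extreme result bottoms out at the same number. The counts and population shares below are hypothetical, not the MBTI data.

```r
set.seed(666)

# Hypothetical counts for four types and made-up population shares.
observed  <- c(INFJ = 12, INFP = 9, ISTJ = 2, ESTJ = 1)
pop_share <- c(INFJ = 0.05, INFP = 0.15, ISTJ = 0.40, ESTJ = 0.40)

# Small expected counts make the chi-square approximation shaky, so the
# p-value is estimated by Monte Carlo simulation instead.
mc_test <- chisq.test(observed, p = pop_share,
                      simulate.p.value = TRUE, B = 2000)

mc_test$p.value
1 / (2000 + 1)  # the floor for any simulated p-value with B = 2000
```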
Gonna be honest, I am underqualified to say anything about what zodiac signs mean, so I’m going to leave these results up to your interpretation. That said, one thing I do know is these signs break up into the four classic elements, fire, water, air, and earth, which are included in my analysis. To generate the population data I pulled CDC birth records from 2000-2014 as an estimate. There appears to be a relatively uniform distribution of signs, and similarly elements, in the general population.
Pulling public data to make a fair general population example and then preparing it
Making both colors and general population plots for zodiac signs
Also zodiac elements
In our group however, this aforementioned balanced distribution doesn’t hold true. Capricorns make up around 3% of our group while Geminis make up ~11%. That said, this is not a significant change in the distribution according to a chi-square analysis. The same is true for elements: the change nears significance, but does not cross the arbitrary (thanks Fisher) 0.05 threshold, though overall there are fewer Earth and Water signs in this group.
Basic Alex G zodiac sign plot
Plotting the zodiac elements of the Alex G data
I’m also not great with Harry Potter info, but this was relatively interesting because there is some insight to be gained. I had to eyeball the general population percentages from the article below because they didn’t give the true numbers. The main difference between the general population and 666posting is that we have more Slytherins. What makes this even more interesting is that, according to the Time article, Slytherins make up a larger proportion of the younger population, and I’d say this group leans to the younger side (see the ages above). So I’d argue this cohort doesn’t stray from the general population distribution once you account for age. But we don’t have enough older people to properly adjust for age, so just take my word for it, K? Great!
Data taken from https://drive.google.com/file/d/0B8PCmhQmtcDKLXlzSGtnZ0hKbjQ/view. Data taken from Time Magazine; they don’t have the actual percentages posted, but they do have a much larger sample size.
Re-plot the Time data
Now that we have the general population, prepare the 666posting data
Chi-Square for houses
##
## Chi-squared test for given probabilities
##
## data: house_dat$count
## X-squared = 26.732, df = 3, p-value = 6.699e-06
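Since the reference percentages were eyeballed from a chart, they may not add up to exactly 100. One way to handle that in a goodness-of-fit test is the rescale.p argument of chisq.test, sketched below. Both the counts and the percentages are invented for illustration; they are not the survey counts or the Time figures, and this is not necessarily how the test above was run:

```r
# Hypothetical house counts from a survey (NOT the 666posting data).
house_counts <- c(Gryffindor = 30, Hufflepuff = 45, Ravenclaw = 50, Slytherin = 40)

# Hypothetical percentages read off a bar chart; eyeballed values
# rarely sum to exactly 100.
eyeballed_pct <- c(Gryffindor = 27, Hufflepuff = 28, Ravenclaw = 30, Slytherin = 14)
print(sum(eyeballed_pct))  # 99

# rescale.p = TRUE normalises the reference values so they sum to 1.
house_test <- chisq.test(house_counts, p = eyeballed_pct, rescale.p = TRUE)
print(house_test)

# Pearson residuals show which house contributes most to the statistic.
print(round(house_test$residuals, 2))
```

Looking at the residuals is a quick way to back up a statement like "the only thing of note is Slytherin" with a number per house.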
Plotting Alex G Hogwarts Houses
As I said above, the only thing of note is the difference in Slytherin, which I attribute to Slytherins generally being younger and this group being a younger sample than the overall population.
My last question, and the one I found most interesting, is: are personality types related to one another? E.g., are certain MBTI types enriched in a given zodiac sign? Long story short: kinda, kinda not. I made a bunch of heatmap tables using an R package called ztable and calculated chi-squares, and while there were a few hits, I assure you none would stand up to any multiple-testing correction, so I don’t have any larger statement to make here. I feel like we are underpowered to test this hypothesis, as not everyone answered all three survey questions, and even if they had, we still might need a larger N.
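To make the multiple-testing point concrete, here is a minimal sketch using base R’s p.adjust. The p-values are invented for illustration and are not the ones from these tables:

```r
# Hypothetical raw p-values from a batch of chi-square tests
# (NOT the values from this analysis).
raw_p <- c(sign_mbti = 0.031, element_mbti = 0.048, sign_house = 0.20,
           element_house = 0.62, mbti_house = 0.012)

adjusted <- data.frame(
  raw        = raw_p,
  bonferroni = p.adjust(raw_p, method = "bonferroni"),
  holm       = p.adjust(raw_p, method = "holm"),
  bh         = p.adjust(raw_p, method = "BH")
)
print(round(adjusted, 3))

# How many "hits" survive at 0.05 under each method?
print(colSums(adjusted < 0.05))
```

In this made-up batch three tests look significant on their own, and none survives any of the three corrections, which is the pattern described above.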
Note: I’m not going to echo these, as the code is repetitive.
Using Ztables from this link
Ztable signs vs MBTI
MBTI | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENFJ | 2 | 2 | 1 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 1 |
Ztable Zodiac Elements vs MBTI
Element | ENFJ | ENFP | ENTJ | ENTP | ESFP | INFJ | INFP | INTJ | INTP | ISFJ | ISFP | ISTP |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Air | 4 | 9 | 0 | 1 | 1 | 7 | 16 | 4 | 9 | 0 | 1 | 2 |
Ztable for Zodiac Signs and HP Houses
House | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gryffindor | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 4 | 0 | 1 | 2 |
Ztable for Zodiac Elements and HP Houses
House | Air | Earth | Fire | Water |
---|---|---|---|---|
Gryffindor | 6 | 4 | 7 | 2 |
MBTI vs Houses
MBTI | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
---|---|---|---|---|
ENFJ | 3 | 1 | 3 | 1 |
Prepare data
Sign and IE
I/E | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
E | 4 | 7 | 8 | 2 | 5 | 5 | 6 | 2 | 6 | 4 | 3 | 4 |
Sign and SN
S/N | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
N | 16 | 14 | 11 | 9 | 17 | 15 | 17 | 14 | 21 | 10 | 9 | 18 |
Sign and FT
F/T | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
F | 15 | 14 | 9 | 5 | 11 | 15 | 12 | 8 | 19 | 7 | 10 | 11 |
Sign and JP
J/P | Aquarius | Aries | Cancer | Capricorn | Gemini | Leo | Libra | Pisces | Sagittarius | Scorpio | Taurus | Virgo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
J | 3 | 3 | 3 | 2 | 5 | 6 | 7 | 5 | 10 | 7 | 2 | 8 |
Element EI
E/I | Air | Earth | Fire | Water |
---|---|---|---|---|
E | 15 | 9 | 18 | 14 |
Element SN
S/N | Air | Earth | Fire | Water |
---|---|---|---|---|
N | 50 | 36 | 50 | 35 |
Element FT
F/T | Air | Earth | Fire | Water |
---|---|---|---|---|
F | 38 | 26 | 48 | 24 |
Element JP
J/P | Air | Earth | Fire | Water |
---|---|---|---|---|
J | 15 | 12 | 19 | 15 |
Prepare data
House IE
I/E | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
---|---|---|---|---|
E | 9 | 7 | 13 | 9 |
House SN
S/N | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
---|---|---|---|---|
N | 18 | 33 | 36 | 24 |
House FT
F/T | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
---|---|---|---|---|
F | 14 | 33 | 27 | 18 |
House JP
J/P | Gryffindor | Hufflepuff | Ravenclaw | Slytherin |
---|---|---|---|---|
J | 8 | 14 | 13 | 7 |
Let us first make some barplots of both drugs, followed by a z-table to see the distribution of reported substance use. I will also run a chi-square test (because the function already does that) to see whether weed use and alcohol use are independent of one another or associated in some way. I am also ordering the factors to make the tables and the graphs that follow more readable.
Bringing in data about weed use around the world and in the states
Now let’s look at that z-table
Alcohol use | I have never smoked weed | I don't smoke weed anymore | Once a month | Once a week | Multiple times per week | Every day | Multiple times a day |
---|---|---|---|---|---|---|---|
I never had alcohol | 5 | 2 | 0 | 0 | 1 | 0 | 1 |
So it appears, according to the chi-square, that our data might be trending towards significance, but there is nothing to be concluded here. Overall, it appears most people used to smoke weed but don’t anymore, and most people seem to drink socially.
Now let us see if there are any interesting distributions with age
Above is an interesting way to view the distribution, but to see differences there are better plots such as the box-plot below:
These distributions don’t seem that different from one another. As a test, I’ll run a linear model to see if these drinking groups are a good predictor of age. That said, I’m not really going to check any assumptions (don’t do this). I will remove the one older individual because they are an outlier, but I don’t expect there to be a difference.
## # A tibble: 7 x 5
## alc_use variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 I never had alcohol age 9 21.8 5.12
## 2 I don't drink anymore age 15 23.7 4.17
## 3 Once a month/socially age 75 22.3 2.66
## 4 Once a week age 42 23.5 2.79
## 5 Multiple times per week age 57 24.3 3.51
## 6 Every day age 9 24.7 1.94
## 7 Multiple times a day age 2 26.5 0.707
##
## Call:
## lm(formula = center_age ~ alc_use, data = no_out_drugs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2667 -2.2667 -0.2667 2.3333 10.2667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5045 1.0528 -1.429 0.1545
## alc_useI don't drink anymore 1.9556 1.3317 1.468 0.1435
## alc_useOnce a month/socially 0.4889 1.1142 0.439 0.6613
## alc_useOnce a week 1.6984 1.1602 1.464 0.1448
## alc_useMultiple times per week 2.4854 1.1329 2.194 0.0294 *
## alc_useEvery day 2.8889 1.4889 1.940 0.0537 .
## alc_useMultiple times a day 4.7222 2.4691 1.913 0.0572 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.159 on 202 degrees of freedom
## Multiple R-squared: 0.08829, Adjusted R-squared: 0.06121
## F-statistic: 3.26 on 6 and 202 DF, p-value: 0.004415
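A quick note on reading this output: with a factor predictor, lm uses the first level ("I never had alcohol") as the reference, so the intercept is that group’s mean and every other coefficient is a difference from it. For example, "Multiple times per week" at 2.49 lines up with the group means in the tibble (24.3 versus 21.8). The sketch below shows the same thing on a tiny made-up dataset. It assumes center_age is simply age minus the overall mean, which the output suggests but does not state:

```r
# Hypothetical ages for three drinking groups (NOT the survey data).
toy <- data.frame(
  age = c(20, 22, 24, 21, 23, 25, 24, 26, 28),
  alc_use = factor(rep(c("Never", "Socially", "Weekly"), each = 3),
                   levels = c("Never", "Socially", "Weekly"))
)
# Assumption: centre age on its overall mean, mirroring the center_age name.
toy$center_age <- toy$age - mean(toy$age)

fit <- lm(center_age ~ alc_use, data = toy)
print(coef(fit))

# The intercept is the reference group's mean centred age; each other
# coefficient is that group's mean minus the reference group's mean.
group_means <- tapply(toy$age, toy$alc_use, mean)
print(group_means - group_means["Never"])
```

So each starred row is a comparison against the never-drinkers only, not against every other group.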
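Nothing dramatic here. The model as a whole is significant (F = 3.26, p = 0.004), but it explains less than 9% of the variance in age, and relative to the “never had alcohol” group only “Multiple times per week” crosses 0.05 (p = 0.029). Social drinking trends towards younger individuals, which is interesting.
And now I’m going to reveal all the filthy filthy lawbreakers in this group. I’m talking under-age drinking. (I’m being sarcastic, but I think it’s an interesting question.)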
Disgusting…
Alright, on to the completely legal topic of weed usage
Again, the better way to view the data
Let us test whether age differs significantly across the weed-use groups.
## # A tibble: 7 x 5
## weed_use variable n mean sd
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 I have never smoked weed age 19 22 3.40
## 2 I don't smoke weed anymore age 82 23.4 3.46
## 3 Once a month age 35 23.4 3.09
## 4 Once a week age 7 25.1 2.91
## 5 Multiple times per week age 26 22.4 2.37
## 6 Every day age 15 21.9 2.15
## 7 Multiple times a day age 25 25.0 3.35
##
## Call:
## lm(formula = center_age ~ weed_use, data = no_out_drugs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3537 -2.3537 -0.4231 1.9600 9.6463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.28230 0.72731 -1.763 0.07940 .
## weed_useI don't smoke weed anymore 1.35366 0.80719 1.677 0.09509 .
## weed_useOnce a month 1.40000 0.90341 1.550 0.12278
## weed_useOnce a week 3.14286 1.40171 2.242 0.02604 *
## weed_useMultiple times per week 0.42308 0.95684 0.442 0.65885
## weed_useEvery day -0.06667 1.09500 -0.061 0.95151
## weed_useMultiple times a day 3.04000 0.96489 3.151 0.00188 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.17 on 202 degrees of freedom
## Multiple R-squared: 0.08148, Adjusted R-squared: 0.0542
## F-statistic: 2.987 on 6 and 202 DF, p-value: 0.008121
Interestingly, weed usage shows that some of the heavier-use groups skew older: compared with people who have never smoked, the “Multiple times a day” group is about three years older on average (p = 0.002). I feel this is somewhat driven by outliers, but “Once a week” usage, which has no outliers, was also significant (p = 0.026). Part of me wonders how much of this has to do with financial and living situations. Perhaps older individuals smoke more because they have both the financial ability and the space (free from those who might look down on it) to do so. I’m not sure; it would’ve been pretty weird if I had asked “What is your yearly income?” and “Do you live with an authority figure?”.
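If outliers are the worry, one option (a swapped-in technique, not what was run above) is a rank-based Kruskal-Wallis test, which only cares about the ordering of ages. A minimal sketch on made-up data with one deliberately extreme age:

```r
# Hypothetical ages by weed-use group (NOT the survey data),
# with one extreme age of 60 in the last group.
toy <- data.frame(
  age = c(19, 20, 21, 22, 23,  21, 22, 23, 24, 25,  22, 23, 24, 25, 60),
  weed_use = factor(rep(c("Never", "Monthly", "Daily"), each = 5),
                    levels = c("Never", "Monthly", "Daily"))
)

# The linear-model F-test uses the raw ages, so the 60-year-old
# has a lot of pull on the result.
print(anova(lm(age ~ weed_use, data = toy)))

# Kruskal-Wallis uses ranks, so 60 counts only as "the largest value".
kw <- kruskal.test(age ~ weed_use, data = toy)
print(kw)

# Replacing the extreme age with 26 keeps every rank the same,
# so the Kruskal-Wallis statistic does not change.
toy2 <- transform(toy, age = ifelse(age == 60, 26, age))
kw2 <- kruskal.test(age ~ weed_use, data = toy2)
print(all.equal(kw$statistic, kw2$statistic))
```

If a group difference shows up in both the linear model and the rank-based test, it is less likely to be the work of one or two extreme ages.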
Legal status | Actively smokes weed | Doesn't actively smoke weed |
---|---|---|
not legal | 22 | 25 |
Making a histogram showing the general results. As you can see there are some outliers, but I have theories on why these outliers exist.
My theory is they live near Philly, and by using the state data I’ve collected I can determine that two of the largest outliers indeed do live near Philly (defined by living in Pennsylvania, New Jersey or Delaware). That said, the individual who has seen Alex 40 times lives in Missouri.
Fun fact: According to setlist.fm Alex has played 284 concerts
But another theory I had is that maybe living near cities in general will increase the probability of an individual going to more concerts
That “once a year if ever” dude with 40 Alex G concerts must be 40 then…. something is telling me that might’ve been a mistake.
I decided to use some linear models to check if any of these variables were good predictors
## Estimate Std. Error t value
## (Intercept) 1.744968e-13 3.740517 4.665046e-14
## living_areaIn the outskirts of a city -1.999617e-01 1.313950 -1.521837e-01
## living_areaIn the suburbs 1.120113e-01 1.256890 8.911790e-02
## living_areaIn a city 1.381653e+00 1.198747 1.152581e+00
## `Lives Near Philly?`Yes 3.707999e+00 1.062184 3.490920e+00
## concertsOnce a year if ever 2.093536e+00 3.979742 5.260482e-01
## concertsEvery 3-4 months 9.088468e-01 3.916566 2.320520e-01
## concertsEvery month or every other month 1.736156e+00 3.951054 4.394159e-01
## concertsEvery two weeks 6.820543e-02 4.059121 1.680301e-02
## concertsWhenever I can 1.572409e+00 3.932301 3.998699e-01
## Pr(>|t|)
## (Intercept) 1.0000000000
## living_areaIn the outskirts of a city 0.8791954668
## living_areaIn the suburbs 0.9290773914
## living_areaIn a city 0.2504583279
## `Lives Near Philly?`Yes 0.0005922191
## concertsOnce a year if ever 0.5994375140
## concertsEvery 3-4 months 0.8167349029
## concertsEvery month or every other month 0.6608346103
## concertsEvery two weeks 0.9866105225
## concertsWhenever I can 0.6896790533
It seems that the best predictor in this case is living near Philly! But that’s not an honest analysis, because this result is likely driven by outliers, so we’re going to toss them.
## Estimate Std. Error t value
## (Intercept) -1.407823e-15 1.7205786 -8.182264e-16
## living_areaIn the outskirts of a city 3.009316e-01 0.6048521 4.975292e-01
## living_areaIn the suburbs 3.222993e-01 0.5786990 5.569377e-01
## living_areaIn a city 8.150397e-01 0.5520281 1.476446e+00
## `Lives Near Philly?`Yes 1.220924e+00 0.5274623 2.314714e+00
## concertsOnce a year if ever 5.119795e-01 1.8319731 2.794689e-01
## concertsEvery 3-4 months 1.265932e+00 1.8016165 7.026646e-01
## concertsEvery month or every other month 1.698775e+00 1.8178097 9.345178e-01
## concertsEvery two weeks 8.652422e-01 1.8676406 4.632809e-01
## concertsWhenever I can 1.397884e+00 1.8090075 7.727353e-01
## Pr(>|t|)
## (Intercept) 1.00000000
## living_areaIn the outskirts of a city 0.61937358
## living_areaIn the suburbs 0.57820542
## living_areaIn a city 0.14142904
## `Lives Near Philly?`Yes 0.02166441
## concertsOnce a year if ever 0.78017970
## concertsEvery 3-4 months 0.48309847
## concertsEvery month or every other month 0.35118676
## concertsEvery two weeks 0.64367706
## concertsWhenever I can 0.44060990
So it still appears that the biggest determinant of how many Alex G concerts you’ve seen is whether you live near Philly, which makes sense because, before Alex was touring, most of his shows were in that area (and people could’ve counted Skin Cells concerts as Alex G concerts). That said, once the outliers are removed, living near Philly is no longer as strong a predictor (the estimate drops from about 3.7 to about 1.2, and the p-value rises from 0.0006 to 0.022), but it is still the strongest one.
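Here is a small sketch of the with-and-without-outliers comparison. The data are made up, and the cutoff (anything above Q3 + 1.5 × IQR) is an assumed rule for illustration; the write-up doesn’t say how the outliers were chosen:

```r
# Hypothetical concert counts (NOT the survey data).
toy <- data.frame(
  n_concerts  = c(0, 1, 1, 2, 0, 1, 3, 2, 4, 40),
  near_philly = factor(c("No", "No", "No", "No", "No", "No",
                         "Yes", "Yes", "Yes", "Yes"))
)

# Assumed outlier rule: anything above Q3 + 1.5 * IQR.
cutoff  <- quantile(toy$n_concerts, 0.75) + 1.5 * IQR(toy$n_concerts)
trimmed <- toy[toy$n_concerts <= cutoff, ]

fit_all     <- lm(n_concerts ~ near_philly, data = toy)
fit_trimmed <- lm(n_concerts ~ near_philly, data = trimmed)

# Compare the "near Philly" effect with and without the extreme value.
print(rbind(all = coef(fit_all), trimmed = coef(fit_trimmed)))
```

As in the real output, the effect shrinks a lot once the extreme value is gone but keeps the same sign.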
I didn’t know where this section fit best, so I just put it here. I didn’t really have any hypotheses here either, but I hope you enjoy the data!
Crime | Drama | Fantasy | Historical Fiction | Horror | I don't read books | Mystery | Nonfiction | Poetry | Romance | Satire | Sci-Fi | Suspense/Thriller | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Action | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
Another question we asked was what type of college major people had, or whether they attended college at all, which we look at below.
Upset plot for different majors to see what double majors we have
We have 23 people with 2 majors, 4 people with three majors, and 1 with four majors.
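Counts like these can be pulled out of a multi-select question with a bit of string splitting. The sketch below uses made-up answers and assumes the form exported them as a single ", "-separated string; both the data and that format are assumptions:

```r
# Hypothetical multi-select answers (NOT the survey data); the ", "
# separator is an assumption about how the form exported them.
majors <- c("STEM", "Arts, Humanities", "STEM, Business, Arts",
            "Humanities", "Arts, STEM")

# Split each answer and count how many fields each person listed.
n_majors <- lengths(strsplit(majors, ", ", fixed = TRUE))
print(table(n_majors))

# Tally each combination among people with exactly two majors,
# sorting within a pair so the order of selection doesn't matter.
doubles <- strsplit(majors[n_majors == 2], ", ", fixed = TRUE)
pair_labels <- vapply(doubles,
                      function(m) paste(sort(m), collapse = " + "),
                      character(1))
print(table(pair_labels))
```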
So around 2020-05-08 09:57:30, a white heterosexual/straight male who is 23 years old decided to fill out the survey. They live in the United States, specifically in California, and when asked if they prefer multiple romantic partners they generally say no. Their favorite musician is Alex G; when you ask them their second favorite musician they’ll emphasize how much they like Alex G, before balking and saying Modest Mouse. They have a lot of opinions about Alex G: their favorite song is Snot, their favorite music video is Gretel, and their favorite album is Trick. That said, when asked what their favorite fan compilation is, they reply “i don’t listen to them”, which is probably because they listen to most of their music through Spotify, where it’s somewhat difficult to get non-official tracks. Their favorite song is definitely Snot.
Nevertheless, they’re still a big Alex G fan: they started listening to him when they were 19 years old and have been to 2 concerts so far. When asked what they like to do in their spare time, they have a couple of hobbies. They like to play the guitar, but when asked if they’re in a band, they’d say no. They also use writing to express themselves, and when it comes to concerts they go “whenever i can”. They are studying/have studied in the STEM field and live in a city. When it comes to weed they would say “i used to smoke weed, but now i don’t”, and with alcohol they’d say they drink “once a month/socially”.
When it comes to their political leanings, on the scale of 1 to 10 from left politics to right, and from libertarian to authoritarian, they fall around 2 and 3 respectively. Their personality, if you asked them arbitrarily their MBTI, astrological sign, and Harry Potter House they would also tell you, respectively, INFP, Gemini, and “i don’t know”. They find all of these questions you’re asking them kinda weird, almost like this format that I decided to explain the most average Alex G fan didn’t pan out the way I wanted, but they know everyone is just trying their best.
In the order Facebook decided:
Special shoutout to Isabelle for allowing and promoting the survey.
spotifyr
Graham aka Grumpus for talking to me about Alex G back when I knew diddly squat besides Mary. Thank you for introducing me to one of my favorite artists of all time.