[00:35] *** JH88 has quit IRC (JH88)
[00:55] *** BlueMax has joined #archiveteam-bs
[01:19] So after combining all Mixtape scrapes so far, I'm at 30249 URLs. Or in other words, a bit over 0.5 % of all content was linked on Reddit, 4chan, Bing, GitHub, Something Awful, or Wykop.
[01:20] At most. A good number of those links are probably dead, too.
[01:29] *** ndiddy has joined #archiveteam-bs
[01:42] *** a_spook_ has quit IRC (Quit: Connection closed for inactivity)
[02:06] *** decay has quit IRC (Quit: leaving)
[02:13] JAA: It was still up on January 20th apparently. https://webcache.googleusercontent.com/search?q=cache:https://mastodon.rocks/@Glenn
[02:14] *** decay has joined #archiveteam-bs
[02:14] and 23rd https://webcache.googleusercontent.com/search?q=cache:https://mastodon.rocks/@UBports/with_replies?max_id=101419312194725348
[02:21] hook54321: Thanks, added to the page.
[02:40] *** qw3rty118 has joined #archiveteam-bs
[02:42] *** qw3rty117 has quit IRC (Read error: Operation timed out)
[03:01] Yay, unit fix spam incoming. :-)
[03:02] Never seen that before: "10 edits not shown"
[03:10] Just ran a little test on the Mixtape URL list using a random sample of 1 % of the URLs (302). Only 3 URLs produced a 404, surprisingly few. The total size of those URLs was 2.7 GiB, so I expect a total size of around 270 GiB from those 30k URLs. That, on the other hand, is surprisingly large; 0.5 % of the files make up 1.85 % of the total size, apparently. Or the numbers are wrong, idk.
[03:10] Unless someone tells me not to, I'll throw these 30k into ArchiveBot later (10 hours or so).
[03:11] There were also a number of failed requests among those 302, and I didn't retry them in my quick test, so the total might be slightly larger.
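(Editor's note: the extrapolation in the 03:10 message can be checked with a few lines of Python. This is only a sketch of the arithmetic; the three input figures come from the log, while the function and variable names are made up for illustration. The ratio at the end assumes the 0.5 % and 1.85 % figures are shares of the file count and of the total size, respectively.)

```python
def extrapolate_total(sample_total_gib, sample_count, population_count):
    """Scale the summed size of a random sample up to the full URL list."""
    return sample_total_gib * population_count / sample_count


urls_total = 30249    # combined Mixtape URL list (from the log)
sample_count = 302    # 1 % random sample (from the log)
sample_gib = 2.7      # total size of the sampled URLs (from the log)

estimate_gib = extrapolate_total(sample_gib, sample_count, urls_total)
print(f"estimated size of all listed URLs: {estimate_gib:.0f} GiB")

# If 0.5 % of the files hold 1.85 % of the bytes, the linked files are
# larger than the average file by the ratio of the two shares.
count_share = 0.005
size_share = 0.0185
size_ratio = size_share / count_share
print(f"linked files vs. average file size: about {size_ratio:.1f}x")
```

Failed requests in the sample are not accounted for here, so, as noted at 03:11, the real figure could be somewhat higher.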
[03:45] *** wp494 has quit IRC (Read error: Operation timed out)
[03:46] *** wp494 has joined #archiveteam-bs
[04:37] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[04:43] *** odemgi_ has joined #archiveteam-bs
[04:45] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[04:52] *** odemg has quit IRC (Ping timeout: 615 seconds)
[04:58] *** odemg has joined #archiveteam-bs
[05:03] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[05:24] *** ndiddy has quit IRC ()
[05:43] *** Exairnous has quit IRC (Ping timeout: 252 seconds)
[06:09] *** martle has joined #archiveteam-bs
[06:11] *** martle has quit IRC (Client Quit)
[06:11] *** martle has joined #archiveteam-bs
[06:12] *** martle has quit IRC (Client Quit)
[06:13] *** martle has joined #archiveteam-bs
[06:14] *** dhyan_nat has joined #archiveteam-bs
[06:17] *** martle has quit IRC (Remote host closed the connection)
[06:18] *** martle has joined #archiveteam-bs
[06:20] *** martle has quit IRC (Client Quit)
[07:16] who runs the archiveteam docker hub account? i'd like to request access to it for maintaining individual non-warrior worker code scripts that people can run instead of having to maintain, build and rebuild a dockerfile every time the worker code changes
[07:19] *** Mateon1 has joined #archiveteam-bs
[07:23] Re: Mixtape There's a reason for this. A lot of the URLs we found are from anime communities, which means they're snippets of anime series or other videos
[07:23] Meanwhile, a lot of private users used it for uploading images, which are often smaller
[07:24] So, yes, a 1.85%/0.5% discrepancy doesn't sound unreasonable.
[07:30] *** wyatt8740 has joined #archiveteam-bs
[07:49] *** wyatt8740 has quit IRC (Ping timeout: 360 seconds)
[08:02] *** wyatt8740 has joined #archiveteam-bs
[08:33] *** HashbangI has quit IRC (Read error: Operation timed out)
[09:00] *** killsushi has quit IRC (Quit: Leaving)
[09:31] *** Despatche has quit IRC (Read error: Operation timed out)
[09:49] JAA: I limited purplebot to show only the five most recent edits. With the rise of bots on our wiki it might be necessary to move those edit notifications to a different channel though…
[10:25] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[11:08] asie: Right, makes sense.
[11:10] PurpleSym: I see. Yeah, that makes sense. It's just the first time that I see this. Even the bot edits usually don't happen this quickly. It was only this time because the size units changed, so the bot simply regenerated the page from cached data with a few things changed and saved it again immediately.
[11:13] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:30] *** dhyan_nat has joined #archiveteam-bs
[11:35] *** qw3rty119 has joined #archiveteam-bs
[11:35] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[12:17] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[12:17] *** VerifiedJ has joined #archiveteam-bs
[12:45] *** wp494 has quit IRC (Read error: Operation timed out)
[12:45] *** wp494 has joined #archiveteam-bs
[13:02] *** Pixi` has quit IRC (Quit: Pixi`)
[13:11] *** Pixi has joined #archiveteam-bs
[13:29] *** HashbangI has joined #archiveteam-bs
[14:22] *** icedice has quit IRC (Read error: Connection reset by peer)
[14:22] *** icedice has joined #archiveteam-bs
[14:45] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[15:18] *** Stiletto has joined #archiveteam-bs
[15:20] *** ats has quit IRC (Read error: Operation timed out)
[15:32] *** ats has joined #archiveteam-bs
[17:06] *** PhrackD has quit IRC (Read error: Operation timed out)
[17:07] *** PhrackD has joined #archiveteam-bs
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Aign.com
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Aneogaf.com
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Agamespot.com
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Aresetera.com
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Agamefaqs.com
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Amyanimelist.net
[17:38] https://www.google.com/search?q=my.mixtape.moe+site%3Aanime-sharing.com
[17:39] https://www.google.com/search?q=my.mixtape.moe+site%3Amangadex.org
[17:39] https://www.google.com/search?q=my.mixtape.moe+site%3Aanidb.net
[17:39] ^ Game, anime, and manga forums with Mixtape.moe links
[17:41] Seems this Russian Reddit-like site also has some Mixtape.moe links: https://www.google.com/search?q=my.mixtape.moe+site%3Apikabu.ru
[17:42] Also, I found a list of chans that probably has some Mixtape.moe links:
[17:42] https://encyclopediadramatica.rs/Chans#.2Achans:_Futaba-style_imageboards
[17:42] https://encyclopediadramatica.rs/List_of_*chan_boards
[17:43] *** Jopik has quit IRC (Remote host closed the connection)
[17:44] *** Jopik has joined #archiveteam-bs
[17:44] ign.com, neogaf.com, gamespot.com, resetera.com, gamefaqs.com, myanimelist.net, and anime-sharing.com all have enough links to justify collecting them, at least
[17:45] From 4690 Google results for IGN down to 230 Google results for Anime-Sharing
[17:47] Almost forgot Steam Community, which has 10 100 Google results: https://www.google.com/search?q=my.mixtape.moe+site%3Asteamcommunity.com
[17:50] Hmm,
[17:51] I forgot to put my.mixtape.moe in quotation marks
[17:53] Steam Community then has 2070 Google results and the other forums probably even fewer :/
[17:58] https://www.google.com/search?q=%22my.mixtape.moe%22+site%3Aleagueoflegends.com
[17:58] https://www.google.com/search?q=my.mixtape.moe+site%3Aubi.com
[17:59] https://www.google.com/search?q=%22my.mixtape.moe%22+site%3Abattle.net
[18:00] steamcommunity.com: 2070 Google results
[18:00] gamespot.com: 689 Google results
[18:00] neogaf.com: 297 Google results
[18:01] ign.com: 113 Google results
[18:01] leagueoflegends.com: 203 Google results
[18:01] gamefaqs.com: 187 Google results
[18:02] resetera.com: 182 Google results
[18:02] myanimelist.net: 151 Google results
[18:03] battle.net: 63 Google results
[18:04] The rest of the game, anime, and manga forums have few enough links that just copy-pasting them manually is faster than crawling them
[18:15] *** dhyan_nat has joined #archiveteam-bs
[18:20] *** ndiddy has joined #archiveteam-bs
[18:26] *** icedice has quit IRC (Quit: Leaving)
[19:25] *** C4K3_ has quit IRC (Quit: leaving)
[19:30] *** arkiver has quit IRC (Connection closed)
[19:31] *** arkiver has joined #archiveteam-bs
[19:32] *** svchfoo1 sets mode: +o arkiver
[19:32] *** svchfoo3 sets mode: +o arkiver
[20:03] *** Exairnous has joined #archiveteam-bs
[20:19] *** Despatche has joined #archiveteam-bs
[20:19] *** Despatche has quit IRC (Read error: Connection reset by peer)
[20:31] *** icedice has joined #archiveteam-bs
[20:48] *** Despatche has joined #archiveteam-bs
[21:03] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[21:05] *** Odd0002_ has joined #archiveteam-bs
[21:09] *** Odd0002 has quit IRC (Ping timeout: 615 seconds)
[21:09] *** Odd0002_ is now known as Odd0002
[21:28] *** killsushi has joined #archiveteam-bs
[21:44] *** wp494 has quit IRC (Read error: Operation timed out)
[21:44] *** dhyan_nat has joined #archiveteam-bs
[21:44] *** wp494 has joined #archiveteam-bs
[21:58] JAA: How are things going with the Mixtape.moe archival project? Has Drybones responded yet? Which 4chan archive sites and other sites have been crawled for links?
[22:07] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:07] icedice: I started a test job for 1k URLs a few hours ago which is still running. I haven't been in contact with Drybones.
[22:09] I've scraped archived.moe and desuarchive.org. Fusl gave me a lot of additional URLs from Majestic-12 earlier. Still need to combine everything again to get a total URL count though.
[22:09] Ok
[22:11] Seems like Bing doesn't give as many results as Google for all of those forums, but at least those results are scrapeable (unlike Google). Hopefully the missing results are mostly duplicate results.
[22:11] https://www.bing.com/search?q=%22my.mixtape.moe%22+site%3Asteamcommunity.com
[22:12] 1280 results on Bing vs 2070 results on Google
[22:12] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:13] !ig e1r8qtljbscvouvladfh2bzg0 ^https?://static\.xx\.fbcdn\.net/rsrc\.php/
[22:14] Wrong channel lol
[22:14] icedice: That sounds like a lot of effort though. Just doing that search on Bing isn't enough, also need to grab each page and extract the links afterwards.
[22:15] Isn't it possible to make a script that extracts my.mixtape.moe links from the description of the search results pages?
[22:16] * descriptions
[22:16] Hmm, yeah, unless there are multiple links on the same page.
[22:17] Well, feel free to try it. My Bing scraper's on GitHub.
[22:18] Multiple links on the same search result page or multiple links in the same linked search result?
[22:18] The latter.
[22:18] I can't code :/
[22:24] Looks like I'm at 80k URLs now, thanks to Fusl's export.
[22:24] 80040 to be precise
[22:31] *** BlueMax has joined #archiveteam-bs
[22:52] JAA: Is it possible to configure JDownloader in a way that it outputs captured mixtape.moe URLs into a .txt file and then just crawl search pages with JDownloader?
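(Editor's note: the idea from 22:15, pulling my.mixtape.moe links out of the text of search result descriptions, could look roughly like the sketch below. It is not JAA's Bing scraper. The link pattern `my.mixtape.moe/<name>.<ext>`, the function name, and the sample snippet are assumptions made for illustration. As pointed out at 22:16, a description only shows part of a page, so any further links on the linked page itself would still be missed.)

```python
import re

# Assumption: file links look like my.mixtape.moe/<name>.<ext>.
# The scheme is optional because snippets often drop it.
LINK_RE = re.compile(
    r"(?:https?://)?my\.mixtape\.moe/[A-Za-z0-9_-]+\.[A-Za-z0-9]+"
)


def extract_links(text):
    """Return unique my.mixtape.moe links found in text, in order of appearance."""
    seen = set()
    links = []
    for match in LINK_RE.finditer(text):
        link = match.group(0)
        # Normalise to https:// so the same file is not counted twice.
        if link.startswith("http://"):
            link = "https://" + link[len("http://"):]
        elif not link.startswith("https://"):
            link = "https://" + link
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Made-up snippet text standing in for a search result description.
snippet = (
    "check https://my.mixtape.moe/abcdef.webm and my.mixtape.moe/ghijkl.png, "
    "also http://my.mixtape.moe/abcdef.webm"
)
for link in extract_links(snippet):
    print(link)
```

The output could be written to a plain .txt file, one URL per line, which is the form the question at 22:52 asks for.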
[23:01] *** qw3rty111 has joined #archiveteam-bs
[23:01] JAA: I think you need to redo the Bing crawl: https://www.bing.com/search?q=site%3Amixtape.moe+-site%3Amy.mixtape.moe
[23:01] ^ I forgot how to wildcard the track#, but this shows it pretty well anyway
[23:02] Apparently there's also spit.mixtape.moe
[23:05] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[23:05] *** VerifiedJ has joined #archiveteam-bs
[23:06] *** qw3rty119 has quit IRC (Ping timeout: 600 seconds)
[23:06] *** icedice has quit IRC (Read error: Connection reset by peer)
[23:08] *** icedice has joined #archiveteam-bs
[23:09] The track#.mixtape.moe links should then be converted into my.mixtape.moe links so that the my.mixtape.moe links people follow still redirect properly
[23:09] Anyway, that's 16 800 Bing results for direct links
[23:10] icedice: I have no idea. I haven't used JDownloader in... 10 years?
[23:10] 8810 of them are spit links: https://www.bing.com/search?q=site%3Aspit.mixtape.moe
[23:11] Most of the rest are probably track# links
[23:11] The problem is, you can't actually get all of those results.
[23:11] Ah :/
[23:11] I scrape the first 1000 pages, but those contain lots of duplicates etc.
[23:12] Did you go for just my.mixtape.moe or for mixtape.moe when you crawled Bing?
[23:12] At some point, it's just not worth continuing because while Bing's more scraper-friendly than others, you still need a 2-second delay between requests to avoid getting banned.
[23:12] Ok
[23:12] I scraped site:mixtape.moe.
[23:12] Hmm
[23:16] If you did site:my.mixtape.moe, site:spit.mixtape.moe, site:track3.mixtape.moe, site:track4.mixtape.moe, site:track5.mixtape.moe, site:track6.mixtape.moe, and site:track9.mixtape.moe separately you should be able to get more links
[23:16] Yeah, testing that now.
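(Editor's note: a rough sketch of the two suggestions above, namely one `site:` query per subdomain and rewriting track# hosts to my.mixtape.moe. The subdomain list and the 2-second delay are taken from the log; the function names are invented, and the assumption that a file keeps the same path on a track# host and on my.mixtape.moe is not confirmed anywhere in the discussion. No requests are sent here.)

```python
import re
import time
from urllib.parse import urlencode, urlsplit, urlunsplit

# Subdomains named in the 23:16 message.
SUBDOMAINS = ["my", "spit", "track3", "track4", "track5", "track6", "track9"]
DELAY_SECONDS = 2  # delay between Bing requests mentioned at 23:12

TRACK_HOST_RE = re.compile(r"^track\d+\.mixtape\.moe$")


def bing_query_urls(subdomains=SUBDOMAINS):
    """Build one Bing search URL per subdomain instead of a single site:mixtape.moe query."""
    return [
        "https://www.bing.com/search?" + urlencode({"q": f"site:{sub}.mixtape.moe"})
        for sub in subdomains
    ]


def to_my_host(url):
    """Rewrite a track#.mixtape.moe URL to my.mixtape.moe, keeping the path (assumed identical)."""
    parts = urlsplit(url)
    if TRACK_HOST_RE.match(parts.netloc):
        parts = parts._replace(netloc="my.mixtape.moe")
    return urlunsplit(parts)


def paced(urls, delay=DELAY_SECONDS):
    """Yield URLs one at a time, sleeping between them to stay under the rate limit."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)
        yield url


for query_url in bing_query_urls():
    print(query_url)
```

A scraper would iterate over `paced(bing_query_urls())`, fetch each result page, and pass any track# links it finds through `to_my_host` before deduplicating against the existing list.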
[23:16] godane: thanks for the effort, and sorry the tape didn't work perfectly
[23:17] nice
[23:18] oh i was in #archiveteam channel talking to you
[23:18] i thought it was -bs
[23:19] dashcloud: don't worry about
[23:19] icedice: Hmm interesting, I didn't get any my.mixtape.moe results with my mixtape.moe search scrape.
[23:19] it
[23:19] dashcloud: some tapes that are factory sealed and i couldn't get a picture out of them
[23:20] *** Medowar has joined #archiveteam-bs
[23:20] icedice: But yeah, the first 10 pages of site:mixtape.moe all contain the same three results... :-|
[23:20] one Entertainment Weekly tape i got from Jason didn't work at al
[23:20] *all
[23:28] Oh nooo
[23:28] In what way
[23:33] SketchCow: i think i told you about that but it was over a year ago
[23:33] i think i digitize the paper that came with it
[23:40] SketchCow: here it is: https://archive.org/details/vhscovers-misc-jason-scott-box1-20181203
[23:40] and Entertainment Weekly page is the first one
[23:50] Hmm
[23:51] I think I could try installing a clipboard manager some day and go through all of those Google results for anime, manga, and game forums with mixtape.moe links