#archiveteam-bs 2019-03-24,Sun

Time Nickname Message
00:35 JH88 has quit IRC (JH88)
00:55 BlueMax has joined #archiveteam-bs
01:19 JAA So after combining all Mixtape scrapes so far, I'm at 30249 URLs. Or in other words, a bit over 0.5 % of all content was linked on Reddit, 4chan, Bing, GitHub, Something Awful, or Wykop.
01:20 JAA At most. A good number of those links are probably dead, too.
01:29 ndiddy has joined #archiveteam-bs
01:42 a_spook_ has quit IRC (Quit: Connection closed for inactivity)
02:06 decay has quit IRC (Quit: leaving)
02:13 hook54321 JAA: It was still up on January 20th apparently. https://webcache.googleusercontent.com/search?q=cache:https://mastodon.rocks/@Glenn
02:14 decay has joined #archiveteam-bs
02:14 hook54321 and 23rd https://webcache.googleusercontent.com/search?q=cache:https://mastodon.rocks/@UBports/with_replies?max_id=101419312194725348
02:21 JAA hook54321: Thanks, added to the page.
02:40 qw3rty118 has joined #archiveteam-bs
02:42 qw3rty117 has quit IRC (Read error: Operation timed out)
03:01 JAA Yay, unit fix spam incoming. :-)
03:02 JAA Never seen that before: "10 edits not shown"
03:10 JAA Just ran a little test on the Mixtape URL list using a random sample of 1 % of the URLs (302). Only 3 URLs produced a 404, surprisingly few. The total size of those URLs was 2.7 GiB, so I expect a total size of around 270 GiB from those 30k URLs. That, on the other hand, is surprisingly large; 0.5 % of the content makes up 1.85 % of the total size, apparently. Or the numbers are wrong, idk.
03:10 JAA Unless someone tells me not to, I'll throw these 30k into ArchiveBot later (10 hours or so).
03:11 JAA There were also a number of failed requests in those 302, and in my quick test, I didn't retry those, so it might be slightly larger.
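
The extrapolation JAA describes can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the channel (302 sampled URLs, 2.7 GiB, 30249 URLs, 0.5 % of the content, 1.85 % of the size); the function and variable names are made up for illustration.

```python
def extrapolate_size(sample_size_gib: float, sample_count: int, total_count: int) -> float:
    """Scale the size measured on a random sample up to the full URL list."""
    return sample_size_gib * total_count / sample_count


def relative_size(share_of_size_pct: float, share_of_items_pct: float) -> float:
    """How much larger than average the linked files are, given both shares in percent."""
    return share_of_size_pct / share_of_items_pct


# Figures quoted in the log: 302 sampled URLs totalling 2.7 GiB, 30249 URLs overall.
estimated_gib = extrapolate_size(2.7, 302, 30249)
# 0.5 % of the content accounting for 1.85 % of the total size.
size_factor = relative_size(1.85, 0.5)

print(f"estimated total: {estimated_gib:.0f} GiB")
print(f"linked files are about {size_factor:.1f}x the average file size")
```

The estimate ignores the failed requests that were not retried, so, as noted in the channel, the real total may be somewhat larger.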
03:45 wp494 has quit IRC (Read error: Operation timed out)
03:46 wp494 has joined #archiveteam-bs
04:37 Mateon1 has quit IRC (Ping timeout: 255 seconds)
04:43 odemgi_ has joined #archiveteam-bs
04:45 odemgi has quit IRC (Ping timeout: 252 seconds)
04:52 odemg has quit IRC (Ping timeout: 615 seconds)
04:58 odemg has joined #archiveteam-bs
05:03 wyatt8740 has quit IRC (Read error: Operation timed out)
05:24 ndiddy has quit IRC ()
05:43 Exairnous has quit IRC (Ping timeout: 252 seconds)
06:09 martle has joined #archiveteam-bs
06:11 martle has quit IRC (Client Quit)
06:11 martle has joined #archiveteam-bs
06:12 martle has quit IRC (Client Quit)
06:13 martle has joined #archiveteam-bs
06:14 dhyan_nat has joined #archiveteam-bs
06:17 martle has quit IRC (Remote host closed the connection)
06:18 martle has joined #archiveteam-bs
06:20 martle has quit IRC (Client Quit)
07:16 Fusl_ who runs the archiveteam docker hub account? i'd like to request access to it for maintaining individual non-warrior worker code scripts that people can run instead of having to maintain, build and rebuild a dockerfile every time the worker code changes
07:19 Mateon1 has joined #archiveteam-bs
07:23 asie Re: Mixtape There's a reason for this. A lot of the URLs we found are from anime communities, which means they're snippets of anime series or other video
07:23 asie Meanwhile, a lot of private users used it for uploading images, which are often smaller
07:24 asie So, yes, a 1.85%/0.5% discrepancy doesn't sound unreasonable.
07:30 wyatt8740 has joined #archiveteam-bs
07:49 wyatt8740 has quit IRC (Ping timeout: 360 seconds)
08:02 wyatt8740 has joined #archiveteam-bs
08:33 HashbangI has quit IRC (Read error: Operation timed out)
09:00 killsushi has quit IRC (Quit: Leaving)
09:31 Despatche has quit IRC (Read error: Operation timed out)
09:49 PurpleSym JAA: I limited purplebot to show only the five most recent edits. With the rise of bots on our wiki it might be necessary to move those edit notifications to a different channel though…
10:25 dhyan_nat has quit IRC (Read error: Operation timed out)
11:08 JAA asie: Right, makes sense.
11:10 JAA PurpleSym: I see. Yeah, that makes sense. It's just the first time I've seen this. Even the bot edits usually don't happen this quickly. It was only this time because the size units changed, so the bot simply regenerated the page from cached data with a few things changed and saved it again immediately.
11:13 BlueMax has quit IRC (Read error: Connection reset by peer)
11:30 dhyan_nat has joined #archiveteam-bs
11:35 qw3rty119 has joined #archiveteam-bs
11:35 qw3rty118 has quit IRC (Read error: Operation timed out)
12:17 VerifiedJ has quit IRC (Read error: Connection reset by peer)
12:17 VerifiedJ has joined #archiveteam-bs
12:45 wp494 has quit IRC (Read error: Operation timed out)
12:45 wp494 has joined #archiveteam-bs
13:02 Pixi` has quit IRC (Quit: Pixi`)
13:11 Pixi has joined #archiveteam-bs
13:29 HashbangI has joined #archiveteam-bs
14:22 icedice has quit IRC (Read error: Connection reset by peer)
14:22 icedice has joined #archiveteam-bs
14:45 dhyan_nat has quit IRC (Read error: Operation timed out)
15:18 Stiletto has joined #archiveteam-bs
15:20 ats has quit IRC (Read error: Operation timed out)
15:32 ats has joined #archiveteam-bs
17:06 PhrackD has quit IRC (Read error: Operation timed out)
17:07 PhrackD has joined #archiveteam-bs
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aign.com
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aneogaf.com
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Agamespot.com
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aresetera.com
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Agamefaqs.com
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Amyanimelist.net
17:38 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aanime-sharing.com
17:39 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Amangadex.org
17:39 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aanidb.net
17:39 icedice ^ Game, anime, and manga forums with Mixtape.moe links
17:41 icedice Seems this Russian Reddit-like site also has some Mixtape.moe links: https://www.google.com/search?q=my.mixtape.moe+site%3Apikabu.ru
17:42 icedice Also, I found a list of chans that probably have some Mixtape.moe links:
17:42 icedice https://encyclopediadramatica.rs/Chans#.2Achans:_Futaba-style_imageboards
17:42 icedice https://encyclopediadramatica.rs/List_of_*chan_boards
17:43 Jopik has quit IRC (Remote host closed the connection)
17:44 Jopik has joined #archiveteam-bs
17:44 icedice ign.com, neogaf.com, gamespot.com, resetera.com, gamefaqs.com, myanimelist.net, and anime-sharing.com all have enough links to justify collecting them, at least
17:45 icedice From 4690 Google results for IGN down to 230 Google results for Anime-Sharing
17:47 icedice Almost forgot Steam Community, which has 10 100 Google results: https://www.google.com/search?q=my.mixtape.moe+site%3Asteamcommunity.com
17:50 icedice Hmm,
17:51 icedice I forgot to put my.mixtape.moe in quotation marks
17:53 icedice Steam Community then has 2070 Google results and the other forums probably even fewer :/
17:58 icedice https://www.google.com/search?q=%22my.mixtape.moe%22+site%3Aleagueoflegends.com
17:58 icedice https://www.google.com/search?q=my.mixtape.moe+site%3Aubi.com
17:59 icedice https://www.google.com/search?q=%22my.mixtape.moe%22+site%3Abattle.net
18:00 icedice steamcommunity.com: 2070 Google results
18:00 icedice gamespot.com: 689 Google results
18:00 icedice neogaf.com: 297 Google results
18:01 icedice ign.com: 113 Google results
18:01 icedice leagueoflegends.com: 203 Google results
18:01 icedice gamefaqs.com: 187 Google results
18:02 icedice resetera.com: 182 Google results
18:02 icedice myanimelist.net: 151 Google results
18:03 icedice battle.net: 63 Google results
18:04 icedice The rest of the game, anime, and manga forums have few enough links that just copy-pasting them manually is faster than crawling them
18:15 dhyan_nat has joined #archiveteam-bs
18:20 ndiddy has joined #archiveteam-bs
18:26 icedice has quit IRC (Quit: Leaving)
19:25 C4K3_ has quit IRC (Quit: leaving)
19:30 arkiver has quit IRC (Connection closed)
19:31 arkiver has joined #archiveteam-bs
19:32 svchfoo1 sets mode: +o arkiver
19:32 svchfoo3 sets mode: +o arkiver
20:03 Exairnous has joined #archiveteam-bs
20:19 Despatche has joined #archiveteam-bs
20:19 Despatche has quit IRC (Read error: Connection reset by peer)
20:31 icedice has joined #archiveteam-bs
20:48 Despatche has joined #archiveteam-bs
21:03 dhyan_nat has quit IRC (Read error: Operation timed out)
21:05 Odd0002_ has joined #archiveteam-bs
21:09 Odd0002 has quit IRC (Ping timeout: 615 seconds)
21:09 Odd0002_ is now known as Odd0002
21:28 killsushi has joined #archiveteam-bs
21:44 wp494 has quit IRC (Read error: Operation timed out)
21:44 dhyan_nat has joined #archiveteam-bs
21:44 wp494 has joined #archiveteam-bs
21:58 icedice JAA: How are things going with the Mixtape.moe archiving project? Has Drybones responded yet? Which 4chan archive sites and other sites have been crawled for links?
22:07 dhyan_nat has quit IRC (Read error: Operation timed out)
22:07 JAA icedice: I started a test job for 1k URLs a few hours ago which is still running. I haven't been in contact with Drybones.
22:09 JAA I've scraped archived.moe and desuarchive.org. Fusl gave me a lot of additional URLs from Majestic-12 earlier. Still need to combine everything again to get a total URL count though.
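
Combining the scrapes into one deduplicated list to get a total count could look roughly like the sketch below. The helper name and the sample data are hypothetical; they only stand in for the real outputs of the archived.moe, desuarchive.org, and Majestic-12 sources mentioned above.

```python
def merge_url_lists(*url_lists):
    """Combine several scraped URL lists, dropping duplicates and blank lines."""
    merged = set()
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url:
                merged.add(url)
    return sorted(merged)


# Hypothetical sample data standing in for the real scrape outputs.
archived_moe = ["https://my.mixtape.moe/aaaaaa.png", "https://my.mixtape.moe/bbbbbb.webm"]
desuarchive = ["https://my.mixtape.moe/bbbbbb.webm\n", ""]
majestic = ["https://my.mixtape.moe/cccccc.mp3"]

combined = merge_url_lists(archived_moe, desuarchive, majestic)
print(len(combined), "unique URLs")
```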
22:09 icedice Ok
22:11 icedice Seems like Bing doesn't give as many results as Google for all of those forums, but at least those results are scrapeable (unlike Google). Hopefully the missing results are mostly duplicate results.
22:11 icedice https://www.bing.com/search?q=%22my.mixtape.moe%22+site%3Asteamcommunity.com
22:12 icedice 1280 results on Bing vs 2070 results on Google
22:12 wyatt8740 has quit IRC (Read error: Operation timed out)
22:13 JAA !ig e1r8qtljbscvouvladfh2bzg0 ^https?://static\.xx\.fbcdn\.net/rsrc\.php/
22:14 icedice Wrong channel lol
22:14 JAA icedice: That sounds like a lot of effort though. Just doing that search on Bing isn't enough, also need to grab each page and extract the links afterwards.
22:15 icedice Isn't it possible to make a script that extracts my.mixtape.moe links from the description of the search results pages?
22:16 icedice * descriptions
22:16 JAA Hmm, yeah, unless there are multiple links on the same page.
22:17 JAA Well, feel free to try it. My Bing scraper's on GitHub.
22:18 icedice Multiple links on the same search result page or multiple links in the same linked search result?
22:18 JAA The latter.
22:18 icedice I can't code :/
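
The kind of script icedice asks about might look roughly like this. It is a minimal sketch, not JAA's Bing scraper: the regular expression, the function name, and the example file names are assumptions, and it only pulls links out of text it is handed, so it has the limitation JAA points out (a snippet shows at most part of a page, and further links on the same page are missed).

```python
import re

# Assumed link shape: http(s)://<optional subdomains>.mixtape.moe/<path>
MIXTAPE_LINK = re.compile(
    r"https?://(?:[a-z0-9-]+\.)*mixtape\.moe/[^\s\"'<>]+",
    re.IGNORECASE,
)


def extract_mixtape_links(text):
    """Return the unique mixtape.moe links found in a piece of text, sorted."""
    links = set()
    for match in MIXTAPE_LINK.finditer(text):
        # Drop punctuation that tends to trail a URL inside prose.
        links.add(match.group(0).rstrip(".,;:!?)"))
    return sorted(links)


# Hypothetical search result description with made-up file names.
snippet = (
    "Check https://my.mixtape.moe/abcdef.webm, "
    "mirror at http://track5.mixtape.moe/ghijkl.png."
)
for link in extract_mixtape_links(snippet):
    print(link)
```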
22:24 JAA Looks like I'm at 80k URLs now, thanks to Fusl's export.
22:24 JAA 80040 to be precise
22:31 BlueMax has joined #archiveteam-bs
22:52 icedice JAA: Is it possible to configure JDownloader in a way that it outputs captured mixtape.moe URLs into a .txt file and then just crawl search pages with JDownloader?
23:01 qw3rty111 has joined #archiveteam-bs
23:01 icedice JAA: I think you need to redo the Bing crawl: https://www.bing.com/search?q=site%3Amixtape.moe+-site%3Amy.mixtape.moe
23:01 icedice ^ I forgot how to wildcard the track#, but this shows it pretty well anyway
23:02 icedice Apparently there's also spit.mixtape.moe
23:05 VerifiedJ has quit IRC (Read error: Connection reset by peer)
23:05 VerifiedJ has joined #archiveteam-bs
23:06 qw3rty119 has quit IRC (Ping timeout: 600 seconds)
23:06 icedice has quit IRC (Read error: Connection reset by peer)
23:08 icedice has joined #archiveteam-bs
23:09 icedice The track#.mixtape.moe links should then be converted into my.mixtape.moe links so that the my.mixtape.moe links people follow still redirect properly
23:09 icedice Anyway, that's 16 800 Bing results for direct links
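
The conversion icedice suggests could be sketched as a simple host rewrite. This assumes, without confirmation from the log, that a file keeps the same path on a track#.mixtape.moe host as under my.mixtape.moe; the function name and example file names are made up.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hosts of the form track<number>.mixtape.moe
TRACK_HOST = re.compile(r"^track\d+\.mixtape\.moe$")


def to_my_mixtape(url):
    """Rewrite a track#.mixtape.moe URL to my.mixtape.moe; leave other URLs unchanged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if TRACK_HOST.match(host):
        # Assumption: path and query stay identical, only the host differs.
        parts = parts._replace(netloc="my.mixtape.moe")
    return urlunsplit(parts)


print(to_my_mixtape("https://track5.mixtape.moe/abcdef.webm"))
print(to_my_mixtape("https://spit.mixtape.moe/view/abcdef"))
```

Links on other hosts, such as spit.mixtape.moe, pass through untouched.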
23:09 icedice Anyway, that's 16 800 Bing results for direct links
23:10 JAA icedice: I have no idea. I haven't used JDownloader in... 10 years?
23:10 icedice 8810 of them are spit links: https://www.bing.com/search?q=site%3Aspit.mixtape.moe
23:11 icedice Most of the rest are probably track# links
23:11 JAA The problem is, you can't actually get all of those results.
23:11 icedice Ah :/
23:11 JAA I scrape the first 1000 pages, but those contain lots of duplicates etc.
23:12 icedice Did you go for just my.mixtape.moe or for mixtape.moe when you crawled Bing?
23:12 JAA At some point, it's just not worth continuing because while Bing's more scraper-friendly than others, you still need a 2-second delay between requests to avoid getting banned.
23:12 icedice Ok
23:12 JAA I scraped site:mixtape.moe.
23:12 icedice Hmm
23:16 icedice If you did site:my.mixtape.moe, site:spit.mixtape.moe, site:track3.mixtape.moe, site:track4.mixtape.moe, site:track5.mixtape.moe, site:track6.mixtape.moe, and site:track9.mixtape.moe separately you should be able to get more links
23:16 JAA Yeah, testing that now.
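
The per-subdomain searches, combined with the 2-second delay JAA mentions, could be sketched as follows. This is not JAA's scraper: the helper names are hypothetical, the subdomain list is the one icedice gives, and `fetch` is a caller-supplied function so that nothing here depends on how the pages are actually downloaded.

```python
import time
from urllib.parse import urlencode

# Subdomains listed by icedice in the channel.
SUBDOMAINS = ["my", "spit", "track3", "track4", "track5", "track6", "track9"]


def bing_search_urls(subdomains=SUBDOMAINS):
    """Build one Bing `site:` search URL per mixtape.moe subdomain."""
    urls = []
    for sub in subdomains:
        query = urlencode({"q": f"site:{sub}.mixtape.moe"})
        urls.append(f"https://www.bing.com/search?{query}")
    return urls


def fetch_throttled(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)
        results.append(fetch(url))
    return results


for url in bing_search_urls():
    print(url)
```

Passing `sleep` as a parameter keeps the delay easy to swap out when trying the helper without real requests.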
23:16 dashcloud godane: thanks for the effort, and sorry the tape didn't work perfectly
23:17 icedice nice
23:18 godane oh i was in #archiveteam channel talking to you
23:18 godane i thought it was -bs
23:19 godane dashcloud: don't worry about
23:19 JAA icedice: Hmm interesting, I didn't get any my.mixtape.moe results with my mixtape.moe search scrape.
23:19 godane it
23:19 godane dashcloud: some tapes are factory sealed and i still couldn't get a picture out of them
23:20 Medowar has joined #archiveteam-bs
23:20 JAA icedice: But yeah, the first 10 pages of site:mixtape.moe all contain the same three results... :-|
23:20 godane one Entertainment Weekly tape i got from Jason didn't work at al
23:20 godane *all
23:28 SketchCow Oh nooo
23:28 SketchCow In what way
23:33 godane SketchCow: i think i told you about that but it was over a year ago
23:33 godane i think i digitized the paper that came with it
23:40 godane SketchCow: here it is: https://archive.org/details/vhscovers-misc-jason-scott-box1-20181203
23:40 godane and the Entertainment Weekly page is the first one
23:50 icedice Hmm
23:51 icedice I think I could try installing a clipboard manager some day and go through all of those Google results for anime, manga, and game forums with mixtape.moe links