[00:17] *** Dj-Wawa has joined #archiveteam-bs
[00:26] *** Exairnous has joined #archiveteam-bs
[00:40] *** wp494 has quit IRC (Ping timeout: 364 seconds)
[00:44] *** wp494 has joined #archiveteam-bs
[00:50] Cat Say No! : https://www.youtube.com/watch?v=cMESRatAG04
[00:59] *** killsushi has quit IRC (Quit: Leaving)
[01:00] *** Stilett0 has joined #archiveteam-bs
[01:03] *** Stiletto has quit IRC (Ping timeout: 492 seconds)
[01:22] *** Odd0002 has quit IRC (Read error: Operation timed out)
[01:23] *** VADemon has quit IRC (Read error: Connection reset by peer)
[01:32] *** Odd0002 has joined #archiveteam-bs
[01:43] *** Odd0002 has quit IRC (Read error: Operation timed out)
[01:43] *** Odd0002 has joined #archiveteam-bs
[04:11] chfoo: I have submitted pull request #113 against seesaw-kit to fix the "rsync missing folder" issue
[04:11] That seems to be plaguing the Google Plus grab in particular
[04:13] * MrRadar AFK
[04:21] nice, i'll take a look at it soon
[04:24] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[04:27] *** qw3rty117 has joined #archiveteam-bs
[04:33] *** qw3rty116 has quit IRC (Read error: Operation timed out)
[04:37] *** SimpBrain has quit IRC (Remote host closed the connection)
[04:38] *** SimpBrain has joined #archiveteam-bs
[04:44] *** odemgi has joined #archiveteam-bs
[04:46] *** odemgi_ has quit IRC (Ping timeout: 252 seconds)
[04:52] *** odemg has quit IRC (Ping timeout: 615 seconds)
[04:53] *** ndiddy has quit IRC ()
[04:59] *** odemg has joined #archiveteam-bs
[05:28] *** dhyan_nat has joined #archiveteam-bs
[05:37] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
[06:05] *** S1mpbrain has joined #archiveteam-bs
[06:06] *** SimpBrain has quit IRC (Remote host closed the connection)
[06:07] *** d5f4a3622 has joined #archiveteam-bs
[06:43] *** MrRadar_ has joined #archiveteam-bs
[06:45] *** MrRadar has quit IRC (Read error: Operation timed out)
[07:06] *** Mateon1 has joined #archiveteam-bs
[07:07] *** Despatche has quit IRC (Read error: Operation timed out)
[07:57] *** Odd0002_ has joined #archiveteam-bs
[07:59] *** Odd0002 has quit IRC (Read error: Operation timed out)
[07:59] *** Odd0002_ is now known as Odd0002
[08:05] *** Exairnous has quit IRC (Ping timeout: 265 seconds)
[08:22] lol oh man the paradisebay comments
[08:22] > So basically all the money we've spent on this game is gone lmao wow
[08:23] yeah, you spent money on digital accessories
[08:23] *in a third party's world
[08:24] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[08:54] *** S1mpbrain has quit IRC (Remote host closed the connection)
[08:54] *** SimpBrain has joined #archiveteam-bs
[09:05] schbirid: please
[09:06] They spent money on flipping bits
[09:43] *** wp494 has quit IRC (Ping timeout: 364 seconds)
[09:45] *** wp494 has joined #archiveteam-bs
[09:45] *** halt has quit IRC (hub.efnet.us efnet.deic.eu)
[10:04] *** kode54 has quit IRC (Quit: ZNC 1.7.2 - https://znc.in)
[10:10] *** kode54 has joined #archiveteam-bs
[10:43] *** Dj-Wawa has joined #archiveteam-bs
[11:16] *** dhyan_nat has joined #archiveteam-bs
[11:33] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[11:33] *** VerifiedJ has joined #archiveteam-bs
[12:02] *** MrRadar_ is now known as MrRadar
[12:04] *** BlueMax has quit IRC (Quit: Leaving)
[12:06] *** Stilett0 has quit IRC ()
[12:28] *** SimpBrain has quit IRC (Read error: Operation timed out)
[12:35] *** SimpBrain has joined #archiveteam-bs
[12:37] *** MrRadar2 sets mode: +o MrRadar
[12:43] *** netsound has quit IRC (Leaving)
[12:58] *** Hani has quit IRC (Read error: Operation timed out)
[13:03] *** Dj-Wawa has quit IRC (Quit: Connection closed for inactivity)
[13:04] *** Hani has joined #archiveteam-bs
[13:21] *** S1mpbrain has joined #archiveteam-bs
[13:21] *** SimpBrain has quit IRC (Remote host closed the connection)
[13:28] So I wrote a little tool to search Reddit using the Pushshift API and extract URLs from the results. It found about 5300 mixtape.moe links. (Still needs some manual cleanup because parsing Markdown is hard.)
[13:32] *** ats has quit IRC (Ping timeout: 252 seconds)
[13:35] JAA: Well done! Can you do the same with archived.moe? There seems to be a lot of mixtape.moe links there as well.
[13:35] Yep, that's the plan. Should be easier there, actually.
[13:38] Nice
[13:38] Are we going to do search engine crawls as well?
[13:46] *** lindalap has joined #archiveteam-bs
[13:47] I'm scraping Bing.
[13:48] Mixtape.moe stats: "14,594 GBs of files", "5,846,150 uploaded files", "68,350 files uploaded this month (since the 1st)", etc.
[13:48] shutting down in a week
[13:49] > Mixtape serves over 450 Terabytes of files to over 7,300,000 unique visitors per month. 65% of our traffic is webm video files, 12% gif, 10% are other images (jpg, png).
[13:49] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[13:50] Ok, nice
[13:50] lindalap: Yep, working on it already.
[13:50] Searx.me could be usable if you get a custom URL working, since it covers all other search engines when configured properly
[13:50] It'll be impossible to get everything because there's no index and bruteforce is infeasible.
[13:51] icedice: I won't stop you from doing that. :-)
[13:51] I know Drybones and could ask for a file list
[13:53] Is it possible to grab the mixtape.moe links on Twitter as well?
[13:53] Oh. Well, if they're willing to do that, it would be nice.
[13:53] JAA: I can look into it after I wake up
[13:54] 450 TB x $2000 = $900,000
[13:55] I think 450 TB was the monthly bandwidth use.
lindalap said they currently have 14.5 TB of files
[13:55] Ah, that's much more manageable
[13:56] Yeah, also it's a Pomf.se clone and we've done that before
[13:56] I didn't sleep last night, so my brain is mush now
[13:56] Would be a Warrior job, but the code is practically the same if possible
[13:56] So I'm going to take a nap now
[13:57] Drybones @ Rizon IRC, if you need to contact the admin of Mixtape.moe btw. I already sent an email and IRC query.
[13:58] #pomfret on Rizon IRC, but I'm not there currently (looks like they enabled an identification requirement)
[13:58] I asked lesderid for an invite to that channel
[13:58] https://twitter.com/drybones_5 on Twitter
[13:58] (he tweets too much)
[14:00] *** swebb has quit IRC (Read error: Operation timed out)
[14:01] *** swebb has joined #archiveteam-bs
[14:01] https://drybones.me/ btw
[14:03] > We will not publicly share our user data and we will not sell the domain, so don't bother asking.
[14:03] Doesn't seem likely, then
[14:03] despite the files technically being publicly available (to those knowing the URL, which is semi-bruteforceable)
[14:13] I was going to mention a 3.1 TB archive of probably the biggest collection of console games, but it looks like there was already a grab on 2018-08-13. :)
[14:14] s/biggest/extensive/
[14:17] Not bruteforceable in this limited amount of time, probably.
[14:17] Yeah, it's not.
[14:17] There are several servers/subdomains, [a-z]{6} codes, and you need to know the file extension.
[14:17] I proposed to Drybones having a copy at the IA with a file "locked" status as a compromise, we'll see...
[14:17] Which is about an order of magnitude too much to bruteforce.
[14:19] "I was going to mention a 3.1 TB archive" They even mentioned Archive Team as a reason for operating. :)
[14:19] Currently 20 TB in total
[14:19] *** S1mpbrain has quit IRC (Remote host closed the connection)
[14:21] Oh, actually all three recursive attempts have been aborted at some point.
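As an aside on the keyspace estimate above: a quick back-of-the-envelope check shows why six-letter IDs combined with unknown subdomains and extensions are out of reach in a week. The subdomain and extension counts here are assumptions for illustration, not figures from the discussion.

```python
# Back-of-the-envelope check of the [a-z]{6} keyspace mentioned above.
# subdomains and extensions are assumed values, not confirmed numbers.
ids = 26 ** 6                      # six lowercase letters per file ID
subdomains = 10                    # assumed number of serving subdomains
extensions = 20                    # assumed number of common file extensions
urls_to_try = ids * subdomains * extensions
print(f"{ids:,} IDs -> {urls_to_try:,} candidate URLs")
```

Even at thousands of requests per second, tens of billions of candidate URLs cannot be enumerated before the shutdown, which matches the "order of magnitude too much" remark.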
[14:21] JAA knows about the site, though I can send a query to remind them what it is (don't want to publicly mention it here)
[14:23] Is it possible to crawl Twitter for Mixtape.moe links as well?
[14:26] archived.moe scrape running now.
[14:27] Nice
[14:28] I'd guess Reddit is a big one for Mixtape.moe links too.
[14:28] Particularly anime subreddits.
[14:28] That was the first one we did
[14:28] Nice.
[14:28] Or JAA handled it, to be specific
[14:29] Why is Bing more scrape-friendly than other search engines, btw?
[14:30] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[14:31] *** VerifiedJ has joined #archiveteam-bs
[14:31] Yup, 5320 links from Reddit.
[14:31] Bing scrape's done and found 223 URLs.
[14:32] I'm searching my last scrape of Something Awful (only circa 2016) and I'll probably have a few hundred more to contribute as well when that finishes
[14:32] Because Bing wouldn't get any users if it weren't for scrapers? Idk...
[14:32] 8chan probably has some links
[14:32] I'd be a bit concerned with what some of those links might contain, though...
[14:33] Hmm
[14:33] Discord probably has links
[14:33] No centralized way of searching it, though
[14:34] So someone would have to join all of the major anime and manga Discord servers and put together a file list
[14:34] Not sure what the point is, if Discord supports attaching files anyway.
[14:34] discord only allows ~8mb for non-nitro users
[14:34] mixtape allows ~100mb
[14:35] If you could, email Drybones and ask for help with archiving for IA... the more requests the better? Come up with a compromise, maybe.
[14:38] *** ats has joined #archiveteam-bs
[14:42] Any file that has over a certain number of downloads, maybe?
[14:42] I mean, if a file has 100 downloads I'd hardly consider it private
[14:43] * downloads or streams
[14:43] Not that the server logs would know the difference
[14:43] Oh
[14:43] archived.moe scrape done, 5644 URLs found.
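The Reddit and forum scrapes above all reduce to the same step: pulling mixtape.moe URLs out of free-form (often Markdown) text. A minimal sketch of that extraction, not the actual tool used, might look like this; the regex and cleanup choices are assumptions:

```python
import re

# Rough URL extraction as described above. Markdown link syntax means a
# naive match would swallow trailing ')' or ']', hence the excluded
# characters in the final class. This is a sketch, not JAA's tool.
MIXTAPE_RE = re.compile(r'https?://(?:[a-z0-9.]+\.)?mixtape\.moe/[^\s<>"\')\]]+')

def extract_mixtape_urls(text: str) -> list:
    """Return all mixtape.moe URLs found in a blob of comment text."""
    return MIXTAPE_RE.findall(text)

# The comment bodies themselves would come from e.g. the Pushshift search
# endpoint (as it existed in 2019), paging backwards on created_utc:
#   https://api.pushshift.io/reddit/search/comment/?q=mixtape.moe&size=100
```

This is also why "parsing Markdown is hard" needs manual cleanup afterwards: escaped characters and bare URLs split across formatting still slip through a regex.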
[14:43] They might not store that info
[14:44] So then they would not know how many times a file has been accessed
[14:44] 3 days of logs, IIRC
[14:49] Well, that could still get us some additional files if Drybones thinks it's reasonable
[14:51] Previously, Drybones hasn't been as cooperative as other Pomf clones, from my experience. For example, he refused many times to take my advice to register a copyright agent at the USCO to avoid DMCA liability. That's the kind of operator to deal with here...
[14:53] In other things to discuss: Can someone !a http://kartat.kapsi.fi/ for me, please? Maps from the Finnish government, mirrored from https://www.maanmittauslaitos.fi/kartat-ja-paikkatieto .
[14:53] Lack of IA coverage for files.
[14:54] Actually, most of it seems to be there at IA with a few missing.
[14:55] That thing's huge.
[14:55] There are several datasets with hundreds of GB and a few of multiple TB.
[14:56] I can get a selective list of what's missing from IA.
[14:59] Getting that list now.
[15:00] I've got 712 unique mixtape.moe URLs from my Something Awful crawl
[15:01] There are few duplicates between my three scrapes, by the way. Less than 100 out of over 11k discovered.
[15:02] https://2by2.info/blip/mixtape.moe.sorted.txt
[15:08] Re: kartat.kapsi.fi: Some indexes are archived, but the files are not. A bit moot of me trying to handpick datasets based on their last indexed time at IA...
[15:11] Except for aerial photography ("ortoilmakuva"), it's not "too big"
[15:13] Some links are duplicates with the same target URI, btw.
[15:14] It might be best to grab that as files with rsync and upload it as items. Especially considering it's only a mirror, not the original site (according to what you said above) and probably not at immediate risk. But in any case, it's large enough that it's probably best to check with IA first.
[15:15] There's also "Laserkeilausaineisto" (lidar data) with 1.7 TB and "Korkeusmalli" (elevation model) with over 700 GB.
[15:15] 1415 GB seems to cover the other 271 GB for Laserkeilausaineisto.
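Checking which kartat.kapsi.fi files lack IA coverage, as mentioned above, can be done against the Wayback CDX API. This sketch only builds the query URL (the file list and the actual fetching are left out); the parameter choices are standard CDX options, but treating "no rows returned" as "missing" is my assumption about the workflow, not a description of how the list was actually produced.

```python
from urllib.parse import urlencode

# Build a Wayback CDX API query for one file URL. A file counts as
# "missing" from IA if fetching this query returns no capture rows.
def cdx_query(file_url: str) -> str:
    params = urlencode({
        "url": file_url,
        "output": "json",
        "fl": "original,timestamp,statuscode",
        "limit": "1",           # one capture is enough to prove coverage
    })
    return "http://web.archive.org/cdx/search/cdx?" + params
```

Looping this over a directory listing and keeping the URLs with empty responses yields exactly the kind of "selective list of what's missing" discussed above.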
[15:15] So it's only ~1.4 TB
[15:16] Likewise, aerial photography is 3839 GB, not 12.4 TB
[15:16] at least so it seems
[15:19] Ah yeah, duplicate entries and entries for subdirectories, great...
[15:21] *** HashbangI has quit IRC (Remote host closed the connection)
[15:25] I copy-pasted Mixtape.moe links from a few Discord servers I'm in: https://pastebin.com/raw/fywM0VPC
[15:25] ^ What was the recommended command for archiving a paste?
[15:25] Something with < or >
[15:25] *** JH88 has quit IRC (Read error: Connection reset by peer)
[15:26] I believe !ao < $URL
[15:26] *** HashbangI has joined #archiveteam-bs
[15:27] which requires +o or +v on the channel, I believe.
[15:28] I believe there's also an undocumented !a < $URL, which requires +o
[15:28] Thanks
[15:28] *** JH88 has joined #archiveteam-bs
[15:30] I wonder if https://github.com/tsudoko/pullcord could be modified so that it outputs lists of URLs on the Discord servers it's run on
[15:30] Because copy-pasting them one by one is a pain in the ass
[15:37] icedice: Please just give links to me instead of throwing them into ArchiveBot already. I'll combine everything, ensure there are no duplicates, and grab them.
[15:39] Well, unless someone tells me not to for whatever reason.
[15:40] *** ats has quit IRC (Read error: Operation timed out)
[15:47] Sorry, I already ran the archiving job and it already finished
[15:47] The links are here: https://pastebin.com/raw/fywM0VPC
[15:56] JAA: Do you think any of these other 4chan archivers might have earlier archived posts than archived.moe?
[15:56] https://www.archiveteam.org/index.php?title=4chan#List_of_Fuuka_Archivers_by_board
[15:57] Possibly, but I have no idea.
[15:58] yuki.la isn't on that list, btw
[15:59] 2008-02-02 – today
[15:59] but it's not Fuuka-based, afaik
[15:59] but yuki.la is also a bit delayed in displaying the most recent results (a day or two)
[16:01] btw, yuki.la hasn't been AB'd at all
[16:02] My friend also hosts a private BASC (?) based archive of 4chan boards.
[16:02] It's not public.
[16:02] large datasets
[16:03] JAA: I guess contacting mixtape.moe yielded no results?
[16:03] https://desuarchive.org/_/search/text/my.mixtape.moe/
[16:03] "Returning only first 5000 of 22156 results found."
[16:03] It's 11:03 AM or earlier in the US, give it time
[16:03] I'm going to say that's a yes
[16:03] make sure to filter by board
[16:03] Actually, Drybones tweeted one hour ago
[16:04] *** ats has joined #archiveteam-bs
[16:05] https://fireden.net/ and https://warosu.org/ probably also have links
[16:06] 2560 results on rbt.asia / archive.rebeccablacktech.com
[16:06] If we get those three, we will have covered all active 4chan archivers that archive boards related to anime/manga/gaming/Japanese culture
[16:07] Ah, yeah, Rebecca Black Tech seems to have some relevant boards as well
[16:08] Scraping desuarchive.org now, but it'll be incomplete since even the per-board searches produce more than 5k results, at least for /a/.
[16:09] Yeah, https://yuki.la/ looks good as well
[16:09] Seems like they have most of /a/ back to 2008 archived
[16:12] fwiw desuarchive also allows setting date limits, maybe that's a solution for /a/
[16:13] JAA: Isn't it possible to just go to page number X, archive that, go to the next one, and then the next one, and so on?
[16:13] working on scraping github repos for mixtape links
[16:14] asie: Oh, how does that work?
[16:14] icedice: That's what I'm doing, but it only returns 5000 results in total, i.e. there is no page 201 and up.
[16:14] Ah, ok
[16:15] JAA: i'm not sure yet, but i did an initial search on github and there are 7000 (probably non-unique) mixtape URLs across git repos
[16:15] i'm looking at the github search api now
[16:15] asie: I mean the date ranges on desuarchive.
[16:15] oh
[16:15] And I assume that's a general feature of FoolFuuka?
[16:15] "Date Start/Date End" on the menu; in the URL they seem to be e.g. ".../start/2016-01-01/"
[16:16] also i think so? i haven't analyzed it that deeply
[16:16] sorry
[16:16] and yes, it does still display the first "5000" for a given date range, so i presume you can use this to circumvent the limit
[16:17] just set a date range ending at the earliest post scraped and keep going... someone'd have to try it
[16:18] Thanks, will try that in a bit.
[16:23] We could try getting some Mixtape links from 8chan by running "my.mixtape.moe site:8ch.net/v/", "my.mixtape.moe site:8ch.net/a/", "my.mixtape.moe site:8ch.net/co/", "my.mixtape.moe site:8ch.net/g/", and other Japanese or tech related boards in DuckDuckGo or Bing
[16:24] minus the " "
[16:38] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[16:38] *** VerifiedJ has joined #archiveteam-bs
[17:07] https://old.reddit.com/user/Farnsworth_The_Dog/ may get banned for reinstating r/WPD discussion, a sub which was banned and had two other sub moderators suspended for 3 days
[17:07] Also "Only reason I'm still here is for r/WatchRedditDie and our site now, otherwise I would have nuked this account."
[17:07] oops, forgot #shreddit
[17:13] asie: I did some digging in the FoolFuuka source, and it looks like the "start" and "end" search options have existed since version 2.0.1, which was released over 4 years ago. :-)
[17:14] I'll rewrite my scraper to use those.
[17:14] Well, the "end" one.
[17:29] JAA: https://paste.asie.pl/raw/pFdt github mixtape.moe scrape
[17:30] wait no
[17:30] two corrupt URLs crept in; https://paste.asie.pl/raw/FNFt is better
[17:34] I'm shallow-grabbing that saidit.net sub I attempted in #archivebot manually.
[17:34] with the over18 cookie
[17:35] Downloaded: 89 files, 1.3M in 0.3s (3.82 MB/s)
[17:42] asie: Thanks!
[17:46] Rescraping archived.moe and desuarchive.org now.
[17:48] On another note, apparently some my.mixtape.moe URLs redirect to track# servers instead. I don't have any example handy right now, though.
[17:50] It's true
[17:50] I *think* it's all of them, actually.
[17:50] The IDs should be unique across all of Mixtape, but we can check that once we have a large set of URLs
[17:51] Nope, I just tested a few links and got a direct download.
[17:52] Interesting.
[18:01] *** ndiddy has joined #archiveteam-bs
[18:02] *** peanut has joined #archiveteam-bs
[18:04] *** killsushi has joined #archiveteam-bs
[18:04] It depends on what kind of file you're downloading, iirc
[18:04] zip --> direct download
[18:04] image or video file --> track# URL
[18:05] I think it was like that, at least
[18:05] Yeah, I remember now
[18:06] For the zip file, I just needed to run the my.mixtape.moe URL through the Wayback Machine
[18:06] The ones I tested were WEBM and MP4 videos.
[18:06] While a TIFF file redirected to track# and then I had to archive the track# URL
[18:14] Has https://old.reddit.com/r/WatchRedditDie been archived?
[18:14] If not, it should be
[18:25] I just realized that we can get thousands of links from the major game, anime, and manga forums
[18:26] I'll put together a list when I get back
[18:27] out of curiosity, should there be a channel/article for mixtape or is it too early?
[18:27] err, not article, wiki page
[18:32] *** netsound has joined #archiveteam-bs
[18:34] JAA: https://paste.asie.pl/raw/voD5 very small assortment of mixtape URLs from the Polish Reddit-like "wykop.pl"; they have no real search facility for the relevant areas so i had to improvise
[18:43] *** wp494 has quit IRC (Read error: Operation timed out)
[18:43] *** wp494 has joined #archiveteam-bs
[18:44] also, a suggestion to !a http://sbnc.khobbits.co.uk/log/logs/ - it's a comprehensive archive of 2011-2017 IRC logs of the most important channels of the Minecraft modding scene; the bot has been down for over a year so the site might go down at some point
[18:45] and besides, most recently everyone moved to Discord anyhow...
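Returning to the FoolFuuka 5000-result cap discussed earlier: the workaround was to scrape pages until the cap, note the oldest post date seen, then restart the search with an "end" date filter at that point. A sketch of the URL construction, with the path shape inferred from the ".../start/2016-01-01/" example in the discussion (verify against the live site before relying on it):

```python
# Sketch of FoolFuuka search pagination with date windows, as discussed
# above. The /end/<date>/ path segment is assumed from the /start/ example.
BASE = "https://desuarchive.org/_/search/text/my.mixtape.moe"

def search_url(page: int, end_date: str = None) -> str:
    """Build a search results URL; end_date like '2016-01-01' caps results
    to posts before that date, sidestepping the 5000-result limit."""
    url = BASE
    if end_date:
        url += f"/end/{end_date}"
    return f"{url}/page/{page}/"
```

The scraper loop would call `search_url(1)` through `search_url(200)`, record the earliest post date in those results, then continue with `search_url(1, earliest_date)` and repeat until a window comes back empty.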
[18:45] *** qwebirc96 has joined #archiveteam-bs
[18:46] (not just the modding scene; #minecraft itself and key channels for, say, vanilla server administrators are present too)
[19:08] *** qwebirc96 has quit IRC (Ping timeout: 260 seconds)
[19:22] *** dhyan_nat has joined #archiveteam-bs
[19:45] *** t3 has quit IRC (Quit: Connection closed for inactivity)
[19:52] *** peanut has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[19:53] *** Exairnous has joined #archiveteam-bs
[19:56] *** Despatche has joined #archiveteam-bs
[20:00] *** VerifiedJ has quit IRC (Read error: Connection reset by peer)
[20:00] *** VerifiedJ has joined #archiveteam-bs
[20:31] *** t3 has joined #archiveteam-bs
[20:42] asie: Thanks. And I've thrown the Minecraft logs into ArchiveBot.
[21:01] *** Despatche has quit IRC (Quit: Read error: Connection reset by deer)
[21:02] *** Despatche has joined #archiveteam-bs
[21:12] *** delirein has joined #archiveteam-bs
[21:13] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[22:18] archived.moe and desuarchive.org reprocessed, 21711 URLs from there.
[22:21] *** BlueMax has joined #archiveteam-bs
[22:38] *** ndiddy has quit IRC ()
[22:41] *** Soni has joined #archiveteam-bs
[22:45] *** a_spook_ has joined #archiveteam-bs
[22:54] JAA: might be silly/redundant/not a lot, but I found a searchable thing to add to the mixtape.moe pile: http://index.commoncrawl.org/CC-MAIN-2019-09-index?url=*.mixtape.moe&output=json http://index.commoncrawl.org/
[23:17] *** HashbangI has quit IRC (net_error)
[23:18] *** HashbangI has joined #archiveteam-bs
[23:27] *** BlueMax has quit IRC (Quit: Leaving)
[23:28] Drybones, the owner of Mixtape, posted on /r/Datahoarder, by the way, and someone asked them if they're willing to archive everything onto IA/WBM: https://old.reddit.com/r/DataHoarder/comments/b4b7km/mixtapemoe_shutting_down/
[23:30] s/posted/commented/
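The Common Crawl index query mentioned above (http://index.commoncrawl.org/CC-MAIN-2019-09-index?url=*.mixtape.moe&output=json) returns one JSON object per line, each describing a captured page. A minimal sketch of turning that response into a URL list for the pile; the field name "url" is the standard one in CDX-style index records:

```python
import json

# Parse a Common Crawl index response (newline-delimited JSON) into a
# list of captured URLs, as would be fed into the mixtape.moe URL pile.
def parse_cc_index(body: str) -> list:
    urls = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue                 # skip blank lines between records
        record = json.loads(line)
        urls.append(record["url"])
    return urls
```

Each monthly crawl (CC-MAIN-YYYY-WW) is a separate index, so a thorough sweep would repeat the query across several crawl IDs and deduplicate the combined result against the other scrapes.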