[00:09] *** archodg has joined #archiveteam-bs
[00:11] *** wp494 has quit IRC (Read error: Operation timed out)
[00:11] *** wp494 has joined #archiveteam-bs
[00:47] *** Stilett0- has joined #archiveteam-bs
[00:56] *** RichardG has quit IRC (Ping timeout: 268 seconds)
[01:24] *** Aoede has quit IRC (Ping timeout: 252 seconds)
[01:37] *** Aoede has joined #archiveteam-bs
[01:49] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[01:50] *** archodg has quit IRC (Remote host closed the connection)
[02:30] *** Petri152 has quit IRC (ZNC - http://znc.in)
[03:13] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:26] *** odemg has joined #archiveteam-bs
[04:48] *** ndiddy has quit IRC ()
[04:57] *** BlueMax has quit IRC (Leaving)
[05:55] *** BlueMax has joined #archiveteam-bs
[06:02] *** Specular has joined #archiveteam-bs
[06:43] *** Kaz has quit IRC (Ping timeout: 268 seconds)
[06:43] *** dxrt_ has quit IRC (Ping timeout: 268 seconds)
[06:43] *** Kaz has joined #archiveteam-bs
[06:45] *** Gfy has quit IRC (Ping timeout: 268 seconds)
[06:45] *** Rai-chan has quit IRC (Ping timeout: 268 seconds)
[06:45] *** Gfy_ has joined #archiveteam-bs
[06:48] *** Kaz has quit IRC (Ping timeout: 268 seconds)
[06:48] *** nightpool has quit IRC (Ping timeout: 268 seconds)
[06:48] *** closure_ has quit IRC (Ping timeout: 268 seconds)
[06:48] *** nightpool has joined #archiveteam-bs
[06:48] *** zino has quit IRC (Ping timeout: 268 seconds)
[06:48] *** closure has joined #archiveteam-bs
[06:49] *** zino has joined #archiveteam-bs
[06:51] *** dxrt_ has joined #archiveteam-bs
[06:51] *** Rai-chan has joined #archiveteam-bs
[06:52] *** Kaz has joined #archiveteam-bs
[07:21] *** Pixi has quit IRC (Quit: Pixi)
[07:21] *** Pixi has joined #archiveteam-bs
[07:24] *** Aoede has quit IRC (Ping timeout: 252 seconds)
[07:34] *** Aoede has joined #archiveteam-bs
[08:07] *** Aoede has quit IRC (Ping timeout: 252 seconds)
[08:13] *** Aoede has joined #archiveteam-bs
[09:00] JAA: thanks
for that. Think it is prudent to stop the AB crawl?
[09:01] Actually, I'm not sure what was grabbed as part of that project exactly.
[09:02] pipeline.py only has a reference to "halo3file" items, which were/are also hosted on the same domain but are not the forums: https://github.com/ArchiveTeam/halo-grab/blob/93195fe56d6b5ec22c89f4586699deac1f28602e/pipeline.py#L199-L207
[09:03] I think we should let the ArchiveBot job run in any case.
[09:03] yeah
[09:03] my job is going pretty quick, but I think neither is quick enough
[09:04] I might try a crawl for each subforum
[09:04] Ew, that forum software doesn't have thread IDs. :-|
[09:09] Post IDs are the forum IDs
[09:10] s/forum/thread/, but yeah, that makes it ugly... If it had thread IDs, you could easily iterate over all threads to archive everything. With post IDs, that's way too inefficient.
[09:10] I've checked a few old threads in the Wayback Machine, and it doesn't look like they were archived.
[09:10] Hm
[09:11] iterate over the post IDs?
[09:12] Then you have to get each page
[09:12] Yeah, but there are some 75 million post IDs or so.
[09:13] Just wish we got more notice than "Oh yeah, in less than a week we're deleting the forums"
[09:13] Well, it's better than "we're shutting it down tomorrow" or a plain "kthxbye".
[09:14] But yeah, more time would be nice.
[09:14] Looks like they're using Cloudflare.
[09:16] I'm managing a solid 30-100 Mbit down
[09:22] JAA: Is there any way to see the URLs a job grabbed after it's finished?
[09:23] eientei95: You can look at the log file, which is inside the job's -meta.warc.gz.
[09:23] Ah
[09:23] But only when it's uploaded to IA, obviously.
[09:24] No way to check before it's uploaded?
[09:24] Nope
[09:24] Darn
[09:29] *** adinbied has joined #archiveteam-bs
[09:30] Posted this in the #aohell channel as well, but I figured I'd post it here too. Not sure who's currently in charge of the AOL project, but I think the last server has gone offline.
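The "trying to reach americaonline.aol.com" diagnosis above is easy to reproduce with a quick resolution check. A minimal sketch; `resolves` is a hypothetical helper name, not anything from the AOL project:

```python
# Quick check of whether a hostname still resolves, as used to diagnose
# the dead americaonline.aol.com endpoint the AOL client tries to reach.
import socket

def resolves(hostname):
    """True if `hostname` currently resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

# Example (network): resolves("americaonline.aol.com")
# returns False once the subdomain's DNS record is gone.
```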
I just spent the last day getting an OS X 10.4 virtual machine working so that I could get AOL_For_Mac_OS_X.dmg to work, only to find out that it won't connect. After some network digging, the AOL client is trying to reach americaonline.aol
[09:30] .com - which does not currently exist as a valid subdomain.
[09:31] just on the Halo backup, since I noticed it in the ArchiveBot list: what was missing? I see it mentions 'grab to get the topics'. The crawl back in 2014-2015 missed these?
[09:31] Specular: Yes, apparently. I've checked a few old threads in the Wayback Machine, and it doesn't look like they were archived.
[09:31] I gave that explanation because I first tried /forum/, but case sensitivity missed the topics, which are under /Forum/
[09:32] well, glad someone picked up on it in time
[09:32] *** schbirid has joined #archiveteam-bs
[10:16] *** DragonMon has joined #archiveteam-bs
[10:17] *** ta9le has joined #archiveteam-bs
[10:22] *** DragonMon has quit IRC (Quit: Leaving)
[10:59] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:46] *** lindalap has quit IRC (Ping timeout: 260 seconds)
[12:58] *** lindalap has joined #archiveteam-bs
[13:00] *** Stilett0- has quit IRC (Read error: Connection reset by peer)
[13:19] *** lindalap has quit IRC (Ping timeout: 260 seconds)
[13:34] *** lindalap has joined #archiveteam-bs
[13:44] *** lindalap has quit IRC (Ping timeout: 260 seconds)
[13:57] *** lindalap has joined #archiveteam-bs
[14:31] *** jschwart has joined #archiveteam-bs
[15:11] *** archodg has joined #archiveteam-bs
[15:12] arkiver, warrior or archivebot? https://old.reddit.com/r/Archiveteam/comments/8ta23j/bungie_scrubbing_legacy_halo_forums/?st=jirjlsi3&sh=816d4bb5
[15:55] already in the archivebot
[15:55] But we're not sure we'll have the time
[16:19] btw it seems (some/all?)
of these URLs haven't been saved (at least judging by the version available on IA): http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=IconicRyan
[16:21] it's the last string that's dynamic. A Google search returns a bunch of names, but I'm not sure if there's some list on the site itself someplace.
[16:23] *** schbirid has quit IRC (Remote host closed the connection)
[16:49] edit: correction, the above URL should be http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=IconicRyan
[17:57] *** Sk2d has joined #archiveteam-bs
[17:59] *** Sk1d has quit IRC (Read error: Operation timed out)
[17:59] *** Sk2d is now known as Sk1d
[18:09] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[18:09] *** Mateon1 has joined #archiveteam-bs
[18:18] *** plue has quit IRC (Ping timeout: 260 seconds)
[18:25] *** Mateon1 has quit IRC (Remote host closed the connection)
[18:28] *** Mateon1 has joined #archiveteam-bs
[19:23] *** Aoede has quit IRC (Ping timeout: 252 seconds)
[19:29] *** Aoede has joined #archiveteam-bs
[19:58] *** Stilett0- has joined #archiveteam-bs
[20:46] *** Stilett0- has quit IRC ()
[21:02] *** Stilett0- has joined #archiveteam-bs
[21:04] *** Stilett0- has quit IRC (Read error: Connection reset by peer)
[21:07] *** Specular has quit IRC (Quit: Leaving)
[21:17] *** fie has joined #archiveteam-bs
[21:21] arkiver: Can we do a quick project for PureVolume MP3s? Should be very straightforward; I described it in here a few weeks ago, around 2018-06-06 21:45 UTC. They'll shut down in a week (maybe; they've extended the deadline twice already).
[21:22] sure
[21:23] A full project for the website would probably be much more work. I'll try to grab it myself instead (and we already got a significant part through ArchiveBot as well).
[21:23] s/probably/certainly/
[21:24] what info do we have about purevolume?
[21:24] What kind of info do you mean?
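The "judging by the version available on IA" check above can be scripted: the `player` parameter is the only dynamic part of the stats URL, and the Wayback Machine's CDX API reports whether a given URL has captures. A minimal sketch, assuming gamertags are collected elsewhere; the helper names are mine, not anything from the grab scripts:

```python
# Sketch: build Halo 2 stats URLs for known gamertags and ask the Wayback
# Machine CDX API whether each one has already been captured.
import json
import urllib.parse
import urllib.request

STATS_URL = "http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def stats_url(player):
    """Stats page for one gamertag; `player` is the only dynamic part."""
    return STATS_URL + "?" + urllib.parse.urlencode({"player": player})

def cdx_query(target):
    """CDX API query URL listing captures of `target` (JSON output)."""
    params = urllib.parse.urlencode(
        {"url": target, "output": "json", "limit": "1"})
    return CDX_ENDPOINT + "?" + params

def is_archived(target):
    """True if the Wayback Machine holds at least one capture of `target`."""
    with urllib.request.urlopen(cdx_query(target)) as resp:
        body = resp.read()
    # No captures can come back as an empty body; row 0 is the header.
    rows = json.loads(body) if body.strip() else []
    return len(rows) > 1

# Example (network): is_archived(stats_url("IconicRyan"))
```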
[21:25] just didn't see a wiki page; reading the IRC logs now
[21:25] *** Stilett0- has joined #archiveteam-bs
[21:26] *** Stilett0- has quit IRC (Read error: Connection reset by peer)
[21:26] apparently I already announced there would be a project :P
[21:26] Yeah, we talked about it before.
[21:27] *** Stilett0 has joined #archiveteam-bs
[21:27] are you talking about only the http://www.purevolume.com/download.php?id= URLs?
[21:28] Yep
[21:29] Unless you want to do the full thing, but that's fairly complex and would probably take a while to test etc. (Plus both the warrior VM and the tracker aren't exactly in good shape at the moment.)
[21:29] The downloads should be really easy to do, just grab IDs up to 3.6 million or whatever.
[21:30] I seem to be getting 'ERROR 93' on downloads
[21:30] from your message on 2018-06-06: "Someone want to grab PureVolume's MP3 downloads? URLs are http://www.purevolume.com/download.php?id=$id where the ID goes up to at least 3665510. Many IDs are invalid/not available for download, meaning the server returns HTTP 200 with the content "Unauthorized Download Attempt". Valid downloads are HTTP 302 redirects to /downloads//.mp3. Examples of valid IDs: 120066 1864706 2615487 3153795."
[21:31] Hmm, yeah, me too.
[21:33] the notice really says June 30th, 2018
[21:33] arkiver: Here's a working link: http://www.purevolume.com/download.php?id=3190128
[21:33] From http://www.purevolume.com/new/thestartingline
[21:34] So not everything's broken yet.
[21:35] Five more examples on http://www.purevolume.com/new/papercitiesband
[21:35] yeah, so this is only with the downloadable MP3 links
[21:36] could be my browser, but I'm not able to play music through their player
[21:36] I don't remember where I took those IDs from in my message on the 6th, but I'm sure they were downloadable then.
[21:37] For the record, the website's also available at http://g.purevolumecdn.com/. Just in case only www.purevolume.com disappears on the 30th.
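The server behaviour described above (HTTP 302 to the MP3 for valid IDs, HTTP 200 with "Unauthorized Download Attempt" for dead ones) maps to a simple response classifier. A sketch under those assumptions, not the actual project code:

```python
# Sketch of a checker for PureVolume download IDs, following the behaviour
# described in the log: valid IDs 302-redirect to the MP3, invalid IDs
# answer HTTP 200 with the body "Unauthorized Download Attempt".
import urllib.request

DOWNLOAD_URL = "http://www.purevolume.com/download.php?id={}"

def classify(status, body):
    """Pure helper: decide from status code and body whether an ID is live."""
    if status == 302:
        return "valid"
    if status == 200 and b"Unauthorized Download Attempt" in body:
        return "invalid"
    return "unknown"

class KeepRedirect(urllib.request.HTTPRedirectHandler):
    """Hand back the 302 response instead of following it, so the
    Location header (the MP3 URL) can be recorded."""
    def http_error_302(self, req, fp, code, msg, headers):
        return fp

def check_id(song_id):
    """Network helper (illustrative): classify one download ID."""
    opener = urllib.request.build_opener(KeepRedirect)
    with opener.open(DOWNLOAD_URL.format(song_id)) as resp:
        return classify(resp.status, resp.read())
```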
[21:38] *** plue has joined #archiveteam-bs
[21:39] Looks like the "ERROR 93" thing is not new: here's someone complaining about it on the site's Facebook page in February 2017: https://www.facebook.com/PureVolumedotcom/posts/10154968022569344
[21:39] I could quickly set up a project now for those IDs
[21:39] (could also help you set one up if you want)
[21:42] Please do. I'd love to learn how all of that works, but I don't really have time for that right now. Need to get some other grabs running ASAP and should already be in bed.
[21:42] ok
[21:42] that's fine, we'll get to it another time
[21:43] this one should be running in a bit if nothing strange pops up
[21:43] :-)
[21:44] uhh
[21:45] oh nvm, you already have a list of IDs
[21:47] Well, not really, but it's a fairly small space of just a few million IDs.
[21:47] The website grab is trickier though.
[21:52] not too tricky
[21:52] I think we can split this up into pages found on http://www.purevolume.com/albums
[21:53] hmm, or not
[21:53] You can only get 9 pages from that.
[21:53] Same for the lists of artists etc.
[21:53] yeah, just noticed
[21:54] The only way is to do it recursively, i.e. discover new artists through who the listeners are following, and new fans from the artist pages.
[21:54] Which is of course possible but trickier with the warrior.
[21:55] we can use the player, the embeds
[21:55] http://www.purevolume.com/_iframe/audio_button_player.php?songId=3190128 then shows all the info
[21:55] going through all IDs should give us everything
[21:56] Hmm, neat.
[21:57] arkiver: Well, if you can get such a project set up quickly, that's great. But if we can't get this started this weekend, it's probably better to just grab what we can easily get (i.e. the downloads and maybe that player embed while we're at it).
[22:00] *** apache2 has quit IRC (Remote host closed the connection)
[22:00] *** apache2 has joined #archiveteam-bs
[23:02] Looks like my script for the Bungie forums works.
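The "going through all IDs" plan for the player embed fits the usual tracker model of handing out ID ranges as work items. A sketch; 3665510 is only the highest ID mentioned in the log, not a confirmed ceiling, and the chunking helpers are illustrative, not the real tracker code:

```python
# Sketch: enumerate player-embed URLs over the whole ID space in fixed-size
# chunks, the way a tracker would hand them out as work items.
EMBED_URL = "http://www.purevolume.com/_iframe/audio_button_player.php?songId={}"
MAX_ID = 3665510  # highest ID mentioned in the log; the real ceiling is unknown

def chunked_ids(start, stop, size):
    """Yield (first, last) inclusive ID ranges of at most `size` IDs each."""
    for lo in range(start, stop + 1, size):
        yield lo, min(lo + size - 1, stop)

def urls_for_chunk(lo, hi):
    """All embed URLs for one work item."""
    return [EMBED_URL.format(i) for i in range(lo, hi + 1)]

# Example: chunked_ids(1, MAX_ID, 10000) yields ~367 work items of 10k IDs.
```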
I'll start it tomorrow when I have time to actually watch it and make sure everything's fine.
[23:12] Initial test with only one process and 50 concurrent connections looks good: 60 forum pages (4.5k requests, 2 MB tx, 500 MB rx) per minute. If that holds for higher concurrencies, I'll saturate the network way before my CPU or disk.
[23:13] I'm grabbing only forum and thread pages. No imagery or external links or anything else. I'm also producing a list of members so I can retrieve the profile pages and potentially groups later.
[23:13] Not sure if that part was archived in the earlier project a few years ago.
[23:14] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:14] To be continued tomorrow.
[23:17] On another note, there's a presidential election in Turkey tomorrow/today (Sunday). Might be a good idea to throw related stuff into ArchiveBot, e.g. campaign websites. I haven't had much luck finding anything, presumably because it's all in Turkish.
[23:44] *** BlueMax has joined #archiveteam-bs
[23:57] JAA: are you doing this with wget-lua or wpull? or is it a custom script?
[23:58] as in custom WARC creation
[23:59] I'm going to try to get the purevolume stuff started this week
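As a back-of-the-envelope check of the 23:12 figures (my arithmetic, not from the log): 500 MB received per minute works out to roughly 67 Mbit/s, which supports the "saturate the network before CPU or disk" expectation:

```python
# Back-of-the-envelope conversion of the quoted crawl rate:
# 500 MB received per minute across 50 concurrent connections.
RX_MB_PER_MIN = 500
CONNECTIONS = 50

mb_per_s = RX_MB_PER_MIN / 60   # ~8.3 MB/s sustained
mbit_per_s = mb_per_s * 8       # ~67 Mbit/s total

print(f"{mbit_per_s:.0f} Mbit/s total, "
      f"{mbit_per_s / CONNECTIONS:.2f} Mbit/s per connection")
# prints "67 Mbit/s total, 1.33 Mbit/s per connection"
```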