#archiveteam-bs 2018-06-23,Sat

Time Nickname Message
00:09 🔗 archodg has joined #archiveteam-bs
00:11 🔗 wp494 has quit IRC (Read error: Operation timed out)
00:11 🔗 wp494 has joined #archiveteam-bs
00:47 🔗 Stilett0- has joined #archiveteam-bs
00:56 🔗 RichardG has quit IRC (Ping timeout: 268 seconds)
01:24 🔗 Aoede has quit IRC (Ping timeout: 252 seconds)
01:37 🔗 Aoede has joined #archiveteam-bs
01:49 🔗 ta9le has quit IRC (Quit: Connection closed for inactivity)
01:50 🔗 archodg has quit IRC (Remote host closed the connection)
02:30 🔗 Petri152 has quit IRC (ZNC - http://znc.in)
03:13 🔗 odemg has quit IRC (Ping timeout: 260 seconds)
03:26 🔗 odemg has joined #archiveteam-bs
04:48 🔗 ndiddy has quit IRC ()
04:57 🔗 BlueMax has quit IRC (Leaving)
05:55 🔗 BlueMax has joined #archiveteam-bs
06:02 🔗 Specular has joined #archiveteam-bs
06:43 🔗 Kaz has quit IRC (Ping timeout: 268 seconds)
06:43 🔗 dxrt_ has quit IRC (Ping timeout: 268 seconds)
06:43 🔗 Kaz has joined #archiveteam-bs
06:45 🔗 Gfy has quit IRC (Ping timeout: 268 seconds)
06:45 🔗 Rai-chan has quit IRC (Ping timeout: 268 seconds)
06:45 🔗 Gfy_ has joined #archiveteam-bs
06:48 🔗 Kaz has quit IRC (Ping timeout: 268 seconds)
06:48 🔗 nightpool has quit IRC (Ping timeout: 268 seconds)
06:48 🔗 closure_ has quit IRC (Ping timeout: 268 seconds)
06:48 🔗 nightpool has joined #archiveteam-bs
06:48 🔗 zino has quit IRC (Ping timeout: 268 seconds)
06:48 🔗 closure has joined #archiveteam-bs
06:49 🔗 zino has joined #archiveteam-bs
06:51 🔗 dxrt_ has joined #archiveteam-bs
06:51 🔗 Rai-chan has joined #archiveteam-bs
06:52 🔗 Kaz has joined #archiveteam-bs
07:21 🔗 Pixi has quit IRC (Quit: Pixi)
07:21 🔗 Pixi has joined #archiveteam-bs
07:24 🔗 Aoede has quit IRC (Ping timeout: 252 seconds)
07:34 🔗 Aoede has joined #archiveteam-bs
08:07 🔗 Aoede has quit IRC (Ping timeout: 252 seconds)
08:13 🔗 Aoede has joined #archiveteam-bs
09:00 🔗 HCross JAA: thanks for that. Think it is prudent to stop the AB crawl?
09:01 🔗 JAA Actually, I'm not sure what was grabbed as part of that project exactly.
09:02 🔗 JAA pipeline.py only has a reference to "halo3file" items which were/are also hosted on the same domain but are not the forums: https://github.com/ArchiveTeam/halo-grab/blob/93195fe56d6b5ec22c89f4586699deac1f28602e/pipeline.py#L199-L207
09:03 🔗 JAA I think we should let the ArchiveBot job run in any case.
09:03 🔗 HCross yeah
09:03 🔗 HCross my job is going pretty quick, but I think neither are quick enough
09:04 🔗 HCross I might try a crawl for each subforum
09:04 🔗 JAA Ew, that forum software doesn't have thread IDs. :-|
09:09 🔗 eientei95 Post IDs are the forum IDs
09:10 🔗 JAA s/forum/thread/, but yeah, that makes it ugly... If it had thread IDs, you could easily iterate over all threads to archive everything. With post IDs, that's way too inefficient.
09:10 🔗 JAA I've checked a few old threads in the Wayback Machine, and it doesn't look like they were archived.
09:10 🔗 eientei95 Hm
09:11 🔗 HCross iterate over the post ID?
09:12 🔗 eientei95 Then you have to get each page
09:12 🔗 JAA Yeah, but there are some 75 million post IDs or so.
09:13 🔗 eientei95 Just wish we got more notice than "Oh yeah, in less than a week we're deleting the forums"
09:13 🔗 JAA Well, it's better than "we're shutting it down tomorrow" or a plain "kthxbye".
09:14 🔗 JAA But yeah, more time would be nice.
09:14 🔗 JAA Looks like they're using CloudFlare.
09:16 🔗 HCross im managing a solid 30-100Mbit down
09:22 🔗 eientei95 JAA: Is there any way to see the URLs a job grabbed after it's finished?
09:23 🔗 JAA eientei95: You can look at the log file, which is inside the job's -meta.warc.gz.
09:23 🔗 eientei95 Ah
09:23 🔗 JAA But only when it's uploaded to IA, obviously.
09:24 🔗 eientei95 No way to check before it's uploaded?
09:24 🔗 JAA Nope
09:24 🔗 eientei95 Darn
09:29 🔗 adinbied has joined #archiveteam-bs
09:30 🔗 adinbied Posted this in the #aohell channel as well, but I figured I'd post here as well. Not sure who's currently in charge of the AOL project - but I think the last server has gone offline. I just spent the last day getting a OS X 10.4 Virtual Machine working, so that I could get the AOL_For_Mac_OS_X.dmg to work - only to find out that it won't connect. After some network digging, the AOL Client is trying to reach americaonline.aol.com - which does not currently exist as a valid subdomain.
09:31 🔗 Specular just on the Halo backup since I noticed in the ArchiveBot list, what was missing? I see it mentions 'grab to get the topics'. The crawl back in 2014-2015 missed these?
09:31 🔗 JAA Specular: Yes, apparently. I've checked a few old threads in the Wayback Machine, and it doesn't look like they were archived.
09:31 🔗 eientei95 I gave that explanation because I first tried /forum/ but case-sensitivity missed the topics which are /Forum/
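The case-sensitivity problem eientei95 describes can be illustrated with a small sketch. This is not ArchiveBot's actual filter syntax; `is_forum_url` and the example.org URLs are hypothetical placeholders, and the only point is that a case-insensitive match on the path catches both /forum/ and /Forum/.

```python
import re
from urllib.parse import urlsplit

# Hypothetical filter: accept any URL whose path starts with /forum/,
# ignoring letter case, so /Forum/ and /FORUM/ match as well.
FORUM_PATH = re.compile(r"^/forum/", re.IGNORECASE)


def is_forum_url(url: str) -> bool:
    """Return True if the URL's path begins with /forum/ in any letter case."""
    return FORUM_PATH.match(urlsplit(url).path) is not None


if __name__ == "__main__":
    # Placeholder URLs, not real pages on the site being archived.
    for candidate in (
        "http://example.org/forum/topic",
        "http://example.org/Forum/topic",
        "http://example.org/Stats/player",
    ):
        print(candidate, is_forum_url(candidate))
```

A case-sensitive prefix check on "/forum/" would reject the second URL, which is the behaviour that missed the topics here.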
09:32 🔗 Specular well, glad someone picked up on it in time
09:32 🔗 schbirid has joined #archiveteam-bs
10:16 🔗 DragonMon has joined #archiveteam-bs
10:17 🔗 ta9le has joined #archiveteam-bs
10:22 🔗 DragonMon has quit IRC (Quit: Leaving)
10:59 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
12:46 🔗 lindalap has quit IRC (Ping timeout: 260 seconds)
12:58 🔗 lindalap has joined #archiveteam-bs
13:00 🔗 Stilett0- has quit IRC (Read error: Connection reset by peer)
13:19 🔗 lindalap has quit IRC (Ping timeout: 260 seconds)
13:34 🔗 lindalap has joined #archiveteam-bs
13:44 🔗 lindalap has quit IRC (Ping timeout: 260 seconds)
13:57 🔗 lindalap has joined #archiveteam-bs
14:31 🔗 jschwart has joined #archiveteam-bs
15:11 🔗 archodg has joined #archiveteam-bs
15:12 🔗 archodg arkiver, warrior or archivebot? https://old.reddit.com/r/Archiveteam/comments/8ta23j/bungie_scrubbing_legacy_halo_forums/?st=jirjlsi3&sh=816d4bb5
15:55 🔗 Igloo already in the archivebot
15:55 🔗 Igloo But we're not sure we'll have the time
16:19 🔗 Specular btw seems (some/all?) these URLs haven't been saved (at least judging by the version available on IA: http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=IconicRyan
16:21 🔗 Specular it's the last string that's dynamic. A Google search returns a bunch of names but not sure if there's some list on the site itself someplace.
16:23 🔗 schbirid has quit IRC (Remote host closed the connection)
16:49 🔗 Specular edit: correction the above URL should be http://halo.bungie.net/Stats/PlayerStatsHalo2.aspx?player=IconicRyan
17:57 🔗 Sk2d has joined #archiveteam-bs
17:59 🔗 Sk1d has quit IRC (Read error: Operation timed out)
17:59 🔗 Sk2d is now known as Sk1d
18:09 🔗 Mateon1 has quit IRC (Ping timeout: 260 seconds)
18:09 🔗 Mateon1 has joined #archiveteam-bs
18:18 🔗 plue has quit IRC (Ping timeout: 260 seconds)
18:25 🔗 Mateon1 has quit IRC (Remote host closed the connection)
18:28 🔗 Mateon1 has joined #archiveteam-bs
19:23 🔗 Aoede has quit IRC (Ping timeout: 252 seconds)
19:29 🔗 Aoede has joined #archiveteam-bs
19:58 🔗 Stilett0- has joined #archiveteam-bs
20:46 🔗 Stilett0- has quit IRC ()
21:02 🔗 Stilett0- has joined #archiveteam-bs
21:04 🔗 Stilett0- has quit IRC (Read error: Connection reset by peer)
21:07 🔗 Specular has quit IRC (Quit: Leaving)
21:17 🔗 fie has joined #archiveteam-bs
21:21 🔗 JAA arkiver: Can we do a quick project for PureVolume MP3s? Should be very straightforward; I described it in here a few weeks ago around 2018-06-06 21:45 UTC. They'll shut down in a week (maybe; they extended the deadline twice already).
21:22 🔗 arkiver sure
21:23 🔗 JAA A full project for the website would probably be much more work. I'll try to grab it myself instead (and we already got a significant part through ArchiveBot as well).
21:23 🔗 JAA s/probably/certainly/
21:24 🔗 arkiver what info do we have about purevolume?
21:24 🔗 JAA What kind of info do you mean?
21:25 🔗 arkiver just didn't see a wikipage, reading IRC logs now
21:25 🔗 Stilett0- has joined #archiveteam-bs
21:26 🔗 Stilett0- has quit IRC (Read error: Connection reset by peer)
21:26 🔗 arkiver apparently I already announced there would be a project :P
21:26 🔗 JAA Yeah, we talked about it before.
21:27 🔗 Stilett0 has joined #archiveteam-bs
21:27 🔗 arkiver are you talking about only the http://www.purevolume.com/download.php?id= URLs?
21:28 🔗 JAA Yep
21:29 🔗 JAA Unless you want to do the full thing, but that's fairly complex and would probably take a while to test etc. (Plus both the warrior VM and the tracker aren't exactly in good shape at the moment.)
21:29 🔗 JAA The downloads should be really easy to do, just grab IDs up to 3.6 million or whatever.
21:30 🔗 arkiver I seem to be getting 'ERROR 93' on downloads
21:30 🔗 arkiver from your message on 2018-06-06 <JAA> Someone want to grab PureVolume's MP3 downloads? URLs are http://www.purevolume.com/download.php?id=$id where the ID goes up to at least 3665510. Many IDs are invalid/not available for download, meaning the server returns HTTP 200 with the content "Unauthorized Download Attempt". Valid downloads are HTTP 302 redirects to /downloads/<ID>/<name>.mp3. Examples of valid IDs: 120066 1864706 2615487 3153795.
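The response behaviour quoted above (HTTP 302 to /downloads/<ID>/<name>.mp3 for valid IDs, HTTP 200 with "Unauthorized Download Attempt" otherwise) could be checked with something like the following. It is a minimal sketch and not the project's actual code: `download_url` and `classify` are hypothetical helpers, no network request is made, and the caller is assumed to pass in the status, headers, and body of a response fetched without following redirects.

```python
from typing import Mapping
from urllib.parse import urlsplit

BASE = "http://www.purevolume.com/download.php?id={}"
INVALID_BODY = "Unauthorized Download Attempt"


def download_url(song_id: int) -> str:
    """Build the download URL for one numeric ID."""
    return BASE.format(song_id)


def classify(status: int, headers: Mapping[str, str], body: str) -> str:
    """Return 'valid', 'invalid', or 'unknown' for one response.

    Assumes the header mapping uses the key 'Location' exactly.
    """
    # Accept both relative and absolute redirect targets.
    path = urlsplit(headers.get("Location", "")).path
    if status == 302 and path.startswith("/downloads/") and path.endswith(".mp3"):
        return "valid"
    if status == 200 and INVALID_BODY in body:
        return "invalid"
    # Anything else (for example the 'ERROR 93' page) needs a manual look.
    return "unknown"


if __name__ == "__main__":
    print(download_url(120066))
    # 'song.mp3' is a made-up file name for illustration.
    print(classify(302, {"Location": "/downloads/120066/song.mp3"}, ""))
    print(classify(200, {}, "Unauthorized Download Attempt"))
```

Keeping an explicit "unknown" bucket means unexpected responses are recorded for review rather than silently counted as invalid IDs.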
21:31 🔗 JAA Hmm, yeah, me too.
21:33 🔗 arkiver notice really says June 30th, 2018
21:33 🔗 JAA arkiver: Here's a working link: http://www.purevolume.com/download.php?id=3190128
21:33 🔗 JAA From http://www.purevolume.com/new/thestartingline
21:34 🔗 JAA So not everything's broken yet.
21:35 🔗 JAA Five more examples on http://www.purevolume.com/new/papercitiesband
21:35 🔗 arkiver yeah, so this is only with the downloadable MP3 links
21:36 🔗 arkiver could be my browser, but I'm not able to play music through their player
21:36 🔗 JAA I don't remember where I took those IDs from in my message on the 6th, but I'm sure they were downloadable then.
21:37 🔗 JAA For the record, the website's also available on http://g.purevolumecdn.com/. Just in case only www.purevolume.com disappears on the 30th.
21:38 🔗 plue has joined #archiveteam-bs
21:39 🔗 JAA Looks like the "ERROR 93" thing is not new: here's someone complaining about it on the site's Facebook page in February 2017: https://www.facebook.com/PureVolumedotcom/posts/10154968022569344
21:39 🔗 arkiver I could quickly setup a project now for those IDs
21:39 🔗 arkiver (could also help you setting on up if you want)
21:39 🔗 arkiver one*
21:42 🔗 JAA Please do. I'd love to learn how all of that works, but I don't really have time for that right now. Need to get some other grabs running ASAP and should already be in bed.
21:42 🔗 arkiver ok
21:42 🔗 arkiver that's fine, we'll get to it another time
21:43 🔗 arkiver this one should be running in a bit if nothing strange pops up
21:43 🔗 JAA :-)
21:44 🔗 Kaz uhh
21:45 🔗 Kaz oh nvm you already have a list of IDs
21:47 🔗 JAA Well, not really, but it's a fairly small space of just a few million IDs.
21:47 🔗 JAA The website grab is trickier though.
21:52 🔗 arkiver not too tricky
21:52 🔗 arkiver I think we can split this up in pages found on http://www.purevolume.com/albums
21:53 🔗 arkiver hmm or not
21:53 🔗 JAA You can only get 9 pages from that.
21:53 🔗 JAA Same for the lists of artists etc.
21:53 🔗 arkiver yeah just noticed
21:54 🔗 JAA The only way is to do it recursively, i.e. discover new artists from who listeners are following and new fans from the artist pages.
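The recursive approach JAA describes amounts to a breadth-first traversal of the artist/fan graph. The sketch below shows the idea on a toy in-memory graph; `discover`, `LINKS`, and the profile names are all hypothetical, and in a real grab the `neighbours` callback would have to fetch and parse profile pages.

```python
from collections import deque
from typing import Callable, Iterable, Set


def discover(seeds: Iterable[str],
             neighbours: Callable[[str], Iterable[str]]) -> Set[str]:
    """Breadth-first discovery of every profile reachable from the seeds."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        profile = queue.popleft()
        for other in neighbours(profile):
            if other not in seen:
                # Record each profile once so cycles do not loop forever.
                seen.add(other)
                queue.append(other)
    return seen


# Toy graph standing in for "artist -> fans" and "fan -> followed artists".
LINKS = {
    "artist_a": ["fan_1", "fan_2"],
    "fan_1": ["artist_a", "artist_b"],
    "fan_2": [],
    "artist_b": ["fan_3"],
    "fan_3": [],
}

found = discover(["artist_a"], lambda p: LINKS.get(p, []))
print(sorted(found))  # ['artist_a', 'artist_b', 'fan_1', 'fan_2', 'fan_3']
```

Profiles that nobody links to are never reached this way, which is one reason iterating over numeric IDs is attractive where it is possible.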
21:54 🔗 JAA Which is of course possible but trickier with the warrior.
21:55 🔗 arkiver we can use the player, the embeds
21:55 🔗 arkiver <iframe src="http://www.purevolume.com/_iframe/audio_button_player.php?songId=3190128" frameborder="0" width="100%" height="80"></iframe>
21:55 🔗 arkiver http://www.purevolume.com/_iframe/audio_button_player.php?songId=3190128 then shows all info
21:55 🔗 arkiver going through all IDs should give us everything
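Going through all IDs for the player embed could be split into work units roughly as follows. This is only a sketch: the upper bound 3665510 is the "at least" figure quoted earlier in the log, and the batch size of 1000 and the helper names `player_url` and `batches` are assumptions, not how the tracker actually defines items.

```python
from typing import Iterator

PLAYER = "http://www.purevolume.com/_iframe/audio_button_player.php?songId={}"
MAX_ID = 3665510    # "at least" this high, per the 2018-06-06 message
BATCH_SIZE = 1000   # hypothetical work-unit size


def player_url(song_id: int) -> str:
    """Build the embed player URL for one song ID."""
    return PLAYER.format(song_id)


def batches(first_id: int, last_id: int, size: int) -> Iterator[range]:
    """Yield consecutive ID ranges covering first_id..last_id inclusive."""
    for start in range(first_id, last_id + 1, size):
        yield range(start, min(start + size, last_id + 1))


if __name__ == "__main__":
    units = list(batches(1, MAX_ID, BATCH_SIZE))
    print(len(units), "work units")
    print(player_url(units[0][0]))
    print(player_url(units[-1][-1]))
```

Fixed-size ranges keep each unit small enough to retry cheaply if a worker fails partway through.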
21:56 🔗 JAA Hmm, neat.
21:57 🔗 JAA arkiver: Well, if you can get such a project set up quickly, that's great. But if we can't get this started this weekend, it's probably better to just grab what we can easily get (i.e. downloads and maybe that player embed while we're at it).
22:00 🔗 apache2 has quit IRC (Remote host closed the connection)
22:00 🔗 apache2 has joined #archiveteam-bs
23:02 🔗 JAA Looks like my script for the Bungie forums works. I'll start it tomorrow when I have time to actually watch it and make sure everything's fine.
23:12 🔗 JAA Initial test with only one process and 50 concurrent connections looks good. 60 forum pages (4.5k requests, 2 MB tx, 500 MB rx) per minute. If that holds for higher concurrencies, I'll saturate the network way before my CPU or disk.
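As a back-of-the-envelope check, the test figures above can be turned into a bandwidth estimate and compared with the roughly 75 million post IDs mentioned earlier in the log. The sketch assumes 1 MB = 10^6 bytes and one request per post ID, neither of which the log states.

```python
REQUESTS_PER_MIN = 4_500   # from the single-process test
RX_MB_PER_MIN = 500        # received data per minute in that test
POST_IDS = 75_000_000      # "some 75 million post IDs or so"

# Receive rate in Mbit/s (assuming 1 MB = 10**6 bytes).
rx_mbit_per_s = RX_MB_PER_MIN * 8 / 60

# Average response size in kB per request.
kb_per_request = RX_MB_PER_MIN * 1000 / REQUESTS_PER_MIN

# Days needed to request every post ID once at the tested rate.
days_for_all_post_ids = POST_IDS / REQUESTS_PER_MIN / 60 / 24

print(f"{rx_mbit_per_s:.1f} Mbit/s received")
print(f"{kb_per_request:.0f} kB per request on average")
print(f"{days_for_all_post_ids:.1f} days to iterate all post IDs")
```

That works out to about 67 Mbit/s, about 111 kB per request, and about 11.6 days for a full post-ID sweep at this rate, which is longer than the "less than a week" notice and consistent with iterating over post IDs being called too inefficient.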
23:13 🔗 JAA I'm grabbing only forum and thread pages. No imagery or external links or anything else. I'm also producing a list of members so I can retrieve the profile pages and potentially groups later.
23:13 🔗 JAA Not sure if that part was archived in the earlier project a few years ago.
23:14 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
23:14 🔗 JAA To be continued tomorrow.
23:17 🔗 JAA On another note, there's a presidential election in Turkey tomorrow/today (Sunday). Might be a good idea to throw related stuff into ArchiveBot, e.g. campaign websites. I haven't had much luck finding anything, presumably because it's all in Turkish.
23:44 🔗 BlueMax has joined #archiveteam-bs
23:57 🔗 arkiver JAA: are you doing this with wget-lua or wpull? or is it a custom script?
23:58 🔗 arkiver as in custom WARC creation
23:59 🔗 arkiver I'm going to try to get the purevolume stuff started this week
