[00:00] Hi hello
[00:02] So, Halo games.
[00:03] How much disk space are we talking?
[00:05] SketchCow: Between 8.6 and 12.5 TiB for all Halo 2 games, extrapolating from the numbers above and the total count of ~800 million games in the database.
[00:05] This is only the HTML page with some statistics and doesn't include the viewer thingy on the website, I think.
[00:06] Put it somewhere other than archive.org. Put some subset on archive.org, like, 50 GB.
[00:06] There's also the Bungie Pro Video service, which is hosting numerous video renders of game recordings or whatever. Not sure if we can even archive that at all though.
[00:07] Get some hard drives. Put it on them.
[00:07] I've grabbed the oldest million games and about half a million of the most recent ones. That's about 21 GiB.
[00:08] I might grab another million of random IDs in between or something like that.
[00:08] Astrid's right that it's not a great use of time, and additionally, anything over a million samples is quite enough unless you're insane.
[00:08] I realize in the future we'll have petabyte drives and I'll seem like a moron
[00:08] But really, at some point, let a specialized game archive take that over
[00:09] You'll be a hero, callooh callay
[00:09] Also, bear in mind you're talking to someone collaborating with folks to save thousands of WinAmp skins
[00:09] AND I just wrote a fucking screenshotter for winamp skins baby
[00:09] Yeah, and also this "1.6 GiB per 100k games" is a massive waste of storage as well. It could be stored much, much more compactly in a database instead of those HTML pages.
[00:10] Like I said, take a little sample, 50 GB, whatever you can shove in. Make it one object, describe heavily.
[00:10] Yep, will do.
[00:10] The lizard race can muse about it in 2402
[00:10] https://archive.org/details/winampskins?and%5B%5D=identifier%3Awinampskin_2*&sin=
[00:10] Behold
[00:12] I'm also setting up a grab for the user profile pages of people who posted in the forums.
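The 8.6–12.5 TiB range quoted above follows directly from the per-page cost mentioned later in the log ("1.6 GiB per 100k games") scaled up to ~800 million games. A quick back-of-the-envelope check of the upper bound (the 1.6 GiB figure is from the chat; treating it as a flat average is my assumption):

```python
# Sanity-check the disk-space extrapolation quoted in the log.
# Known from the chat: ~800 million games total, ~1.6 GiB per 100k game pages.
TOTAL_GAMES = 800_000_000
GIB_PER_100K = 1.6  # observed average for the grabbed Halo 2 samples

total_gib = TOTAL_GAMES / 100_000 * GIB_PER_100K
total_tib = total_gib / 1024
print(f"~{total_tib:.1f} TiB")  # matches the 12.5 TiB upper bound
```

The 8.6 TiB lower bound would correspond to a smaller per-page average (roughly 1.1 GiB per 100k), presumably from a different sample.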
There are also "groups" (~ clans?) on the site, and I might grab those as well if there aren't too many and there's still time.
[00:12] It appears that there are slightly over 12 million accounts on the site, and a bit under 300k of them have posted in the forums.
[00:23] SketchCow: "unless you're insane"? Aren't we all at least *slightly* deranged for being here? ;)
[00:26] I'm gonna hit the bed. Keep being awesome, JAA
[00:26] :)
[00:26] I'll try. :-)
[00:26] Good night.
[00:26] nn
[00:28] project paused again then
[00:28] or we can start it and just get a little bit
[00:28] all items are in, so it will be random packs of 4000 IDs
[00:35] *** Darkstar has quit IRC (Ping timeout: 1212 seconds)
[00:40] *** Darkstar has joined #archiveteam-bs
[01:30] *** Darkstar has quit IRC (Ping timeout: 633 seconds)
[01:41] *** Lord_Nigh has quit IRC (Ping timeout: 268 seconds)
[01:41] *** Darkstar has joined #archiveteam-bs
[02:05] *** ta9le has quit IRC (Quit: Connection closed for inactivity)
[02:22] *** Darkstar has quit IRC (Ping timeout: 480 seconds)
[02:39] *** Darkstar has joined #archiveteam-bs
[03:02] *** K4k has quit IRC (Read error: Connection reset by peer)
[03:05] *** archodg_ has joined #archiveteam-bs
[03:08] *** Darkstar has quit IRC (Ping timeout: 268 seconds)
[03:08] *** archodg has quit IRC (Ping timeout: 252 seconds)
[03:08] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:18] *** Darkstar has joined #archiveteam-bs
[03:21] *** odemg has joined #archiveteam-bs
[03:34] *** Lord_Nigh has joined #archiveteam-bs
[03:53] *** Lord_Nigh has quit IRC (Ping timeout: 268 seconds)
[03:59] *** Lord_Nigh has joined #archiveteam-bs
[03:59] *** Darkstar has quit IRC (Ping timeout: 260 seconds)
[04:09] *** Darkstar has joined #archiveteam-bs
[04:51] *** Darkstar has quit IRC (Ping timeout: 480 seconds)
[04:55] *** Darkstar has joined #archiveteam-bs
[04:55] SketchCow: https://archive.org/details/disney-adventures-v7i4
[04:55] SketchCow:
https://archive.org/details/disney-adventures-v9i11
[04:56] v6i7 issue is also being uploaded now
[05:07] *** fenn has quit IRC (Ping timeout: 260 seconds)
[05:15] *** fenn has joined #archiveteam-bs
[05:43] *** Darkstar has quit IRC (Ping timeout: 480 seconds)
[05:43] SketchCow: https://archive.org/details/disney-adventures-v6i7
[05:45] *** Darkstar has joined #archiveteam-bs
[05:46] *** swebb has quit IRC (Read error: Operation timed out)
[06:28] *** Darkstar has quit IRC (Ping timeout: 480 seconds)
[06:44] *** Darkstar has joined #archiveteam-bs
[07:13] *** schbirid has joined #archiveteam-bs
[07:56] *** schbirid has quit IRC (Quit: Leaving)
[08:04] *** svchfoo3 has quit IRC (Read error: Operation timed out)
[08:04] *** svchfoo3 has joined #archiveteam-bs
[08:05] *** svchfoo1 sets mode: +o svchfoo3
[08:10] *** Mateon1 has quit IRC (Ping timeout: 255 seconds)
[08:10] *** Mateon1 has joined #archiveteam-bs
[08:33] *** Aoede has quit IRC (Quit: ZNC - https://znc.in)
[08:35] *** Aoede has joined #archiveteam-bs
[09:14] *** Aoede has quit IRC (Ping timeout: 252 seconds)
[09:17] *** Aoede has joined #archiveteam-bs
[09:17] My Bungie profile grab finished around 04:30 UTC and discovered some 15k groups. I'll look into those now.
[09:21] I think I'll just throw these into ArchiveBot.
[09:28] Done, job 2lefnzv589c6pyid0ik8lyd6i
[09:34] JAA: you grabbed the latest 500k, not 1M games?
[09:35] Muad-Dib: 1M oldest and 1M newest Halo 2 games
[09:35] ah, ok
[09:35] I'll set up another million randomly scattered through the remaining 801 million IDs, I think.
[09:36] 27.3 GiB for those 2 million games, by the way.
[09:37] the last 1M contains all games from 2010, which somehow sounds like a nice thing to have
[09:37] Indeed :-)
[09:39] It's cool to see the release hype in the distribution of the number of games played: less than 1% of the total number of games in the last 4 months, compared to 5% (~43 million) within the first 2 months
[09:40] anything but surprising, but still cool to see
[09:42] those random clusters were also a good idea, btw
[09:42] s/were/are/
[09:42] I have a list of 731499 player names ("XBL gamer tag"), by the way, which could be used to grab the profile pages for the individual players. This is combined from the forum posters and the ones extracted from those 2 million games.
[09:43] Muad-Dib: You think I should do clusters rather than just individual random games? If so, what cluster size do you think would be best?
[09:45] Oh, I thought they were clustered for some reason, unclustered would also be just fine, I guess
[10:03] *** Flashfire has joined #archiveteam-bs
[10:04] https://shop.velocityfrequentflyer.com/ Wanna grab as much as possible before a redesign?
[10:06] I can host some if we grab more of the Bungie stuff, I have an education Google Drive
[10:06] JAA I can help a bit with Bungie
[10:07] Storage isn't really the issue I think. Time is. They're taking it offline in about 7 hours.
[10:08] Then I say we gung-ho it and grab as much as possible. But then again, that's my answer to a lot of things
[10:08] Have a look at the logs of this channel. There was a lot of discussion about it yesterday.
[10:08] Also can someone please queue the Velocity Frequent Flyer store in ArchiveBot as I don't have voice and don't think I'm allowed to lol
[10:08] Only looked over today's, will look now
[10:13] Wow ok
[10:13] Started another grab for 2 million random Halo 2 games from the ID range 1000001 to 802138049.
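The grab just mentioned draws 2 million random IDs from the range 1000001–802138049; later in the log JAA notes the IDs were generated with Python's random.randint, which is uniform over an inclusive range. A minimal sketch of that sampling (the set-based dedup loop is my assumption, since randint alone can return the same ID twice):

```python
import random

def random_game_ids(n, lo=1_000_001, hi=802_138_049):
    """Draw n distinct game IDs uniformly from [lo, hi] (both ends inclusive)."""
    ids = set()
    while len(ids) < n:
        ids.add(random.randint(lo, hi))  # uniform over the inclusive range
    return sorted(ids)

sample = random_game_ids(5)
```

With n = 2 million against an ~800-million-ID range, collisions are rare enough that the retry loop costs almost nothing.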
[10:13] JAA: I think we mostly just want to make sure the games are evenly distributed
[10:13] oh, you just started
[10:14] why not send it to ArchiveBot and let it grab as much as possible before the deadline?
[10:14] or is that just stupidity?
[10:14] Muad-Dib: It should be pretty evenly distributed. I used Python's random.randint to generate the IDs, which is supposed to use a uniform distribution.
[10:15] Flashfire: I already did a small batch there, but it's slow
[10:15] Flashfire: ArchiveBot is slooooooow. I'm doing ~20k requests per minute.
[10:15] JAA's waay faster ;)
[10:15] that
[10:15] lol
[10:15] Ok also Muad, if you could queue that job in ArchiveBot, I think it's a legit grab to do
[10:15] ArchiveBot does have the !yahoo command, but I think it should be renamed !yolo
[10:16] lol
[10:16] Yeah, but even that doesn't necessarily help. It's the HTML parsing which slows everything down.
[10:16] I mean, Yahoo is owned by, I think it's Verizon now, so there is that
[10:16] yeah, parsing's the hurdle
[10:16] I'm not doing any parsing, just string manipulation.
[10:17] Which is ugly but works fine for well-constrained projects (i.e. known and stable HTML structure).
[10:18] Muad-Dib: Do you know how long your ArchiveBot job for the games took? (And how many IDs was that?)
[10:18] I am sad, I don't think anyone will run my Aussie job on ArchiveBot
[10:19] JAA: they did more or less one request per worker per second
[10:20] Ah, so something like 100 times slower than my grab.
[10:20] *** ta9le has joined #archiveteam-bs
[10:30] JAA: and they were 6066-10000 and 803000000-803138049
[10:35] If you have time for any more, let me know
[10:38] 500k of the 2M random IDs are done. I love how fast this is.
[10:38] Well if it's that fast, why not grab more?
[10:38] I might. We'll see.
[10:38] JAA: Is it fast enough for !yahoo?
[10:39] eientei95: This isn't ArchiveBot, it's a custom script which is about 100 times faster than ArchiveBot could ever be.
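The custom script's speed comes from skipping HTML parsing entirely and doing plain string manipulation, which, as noted above, is only safe when the page structure is known and stable. A sketch of the general technique (the marker strings and the example field are invented for illustration, not Bungie's actual markup):

```python
def extract_between(page, start_marker, end_marker):
    """Return the text between two fixed markers, or None if either is missing.

    This is the string-slicing shortcut: no DOM, no parser, just find().
    It breaks silently if the site changes its markup, which is why it only
    suits short-lived, well-constrained grabs.
    """
    i = page.find(start_marker)
    if i == -1:
        return None
    i += len(start_marker)
    j = page.find(end_marker, i)
    if j == -1:
        return None
    return page[i:j]

# Hypothetical snippet of a stats page:
html = '<span id="playlist">Team Slayer</span>'
extract_between(html, '<span id="playlist">', '</span>')  # -> 'Team Slayer'
```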
[10:39] Oo, nice
[10:39] something that could be incorporated into ArchiveBot?
[10:40] Nope, the code is specific to the individual site I'm grabbing.
[10:40] ah ok
[10:40] It may be possible to use it in the warrior though, but it needs a lot more polishing for that.
[10:41] Am I the only one who finds it oddly therapeutic to watch ArchiveBot tick over via the web interface?
[10:41] What specific things does it not do in order to be faster than ArchiveBot?
[10:43] eientei95: HTML parsing is the most important one.
[10:43] Hm
[10:44] I'm also using a different network stack than wpull, and the DB is lighter as well (at the cost of some duplication in the archives if I were to grab resources that are shared between different pages).
[10:44] I GTG
[10:44] bye
[10:44] *** Flashfire has quit IRC (Quit: Bye)
[10:44] And I can run multiple instances of the same thing against one DB, which wpull currently doesn't support.
[10:45] So parallelisation across CPU cores is possible.
[10:45] Nice
[10:45] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:34] *** Flashfire has joined #archiveteam-bs
[11:34] How are we going with Bungie? JAA
[11:39] *** Flashfire has quit IRC (Ping timeout: 260 seconds)
[12:11] The 2 million random Halo 2 game IDs are almost done.
[12:18] JAA: it seems like the full year 2009 is about 6.4 million games, would this be desirable?
[12:19] also: nice about the 2M
[12:23] Not sure. Jason suggested ~50 GB total yesterday, which is what I have now with those 4 million.
[12:24] Also, this is only Halo 2 so far. We should try to get some coverage of the other games as well.
[12:27] Can someone try to figure out what the maximum IDs for the other games are?
I haven't tried at all, but here are some high IDs I've seen: Halo 3 http://halo.bungie.net/Stats/GameStatsHalo3.aspx?gameid=1910788443 and ODST http://halo.bungie.net/Stats/ODSTg.aspx?gameid=111197785
[12:29] Reach: http://halo.bungie.net/Stats/Reach/GameStats.aspx?gameid=972979787
[12:34] JAA: good point about the other games, I'll try to take a look between study breaks
[12:37] I discovered something: http://halo.bungie.net/api/odst/ODSTService.svc and http://halo.bungie.net/api/reach/reachapisoap.svc
[12:38] Presumably, there's something for Halo 2 and 3 as well, but I haven't found it yet.
[12:40] Highest I just found for Halo 3 is http://halo.bungie.net/Stats/GameStatsHalo3.aspx?gameid=1917736471
[12:41] JAA: oh BOY http://halo.bungie.net/api/odst/ODSTService.svc?singleWsdl
[12:41] Yep :-)
[12:42] Too bad I really can't stomach any more C# after last semester :")
[12:47] How does one call this API?
[12:48] Not sure, I've tried a few things but those didn't work. I've never used WSDL before either.
[12:56] I barely know anything about web APIs :/
[12:57] about working with*
[12:57] Urgh, it's SOAP. JAA: Do you have a valid game ID for that API?
[12:58] PurpleSym: 1000000 (one million) should work.
[12:59] The ArchiveBot job for the groups finished, by the way.
[13:00] Nope, internal server error with SOAPpy.
[13:11] Could that be related to the URLs in the WSDL being broken? They point to www.bungie.net instead of halo.bungie.net.
[13:17] Yeah, could be. Changing the URLs, I get SOAPpy.Types.faultType: the sender and the receiver. Check that sender and receiver have the same contract and the same binding (including security requirements, e.g. Message, Transport, None).>
[14:20] *** swebb has joined #archiveteam-bs
[14:32] JAA: you still want max gameids for Reach and ODST?
[14:33] Yes, please.
[14:33] *** antomati_ has quit IRC (Ping timeout: 268 seconds)
[14:33] I just started a grab of the 500k first, 500k random, and 500k last games for Halo 3.
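For context on the SOAP attempts above: a WSDL-described service like ODSTService.svc expects an XML envelope POSTed to its endpoint. Nobody in the log got a successful call, so the following only sketches the shape of a SOAP 1.1 request, with a made-up operation name and service namespace (the real ones would have to come from the `?singleWsdl` document); and as noted in the chat, a working client would also have needed the endpoint rewritten from www.bungie.net to halo.bungie.net.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace and operation name, for illustration only;
# the actual values would be read from the WSDL.
SVC_NS = "http://www.bungie.net/api/odst"

def build_envelope(gameid):
    """Build a minimal SOAP 1.1 envelope for a hypothetical GetGameDetails call."""
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}GetGameDetails")
    ET.SubElement(op, f"{{{SVC_NS}}}gameId").text = str(gameid)
    return ET.tostring(env, encoding="unicode")

print(build_envelope(1000000))
```

The ContractFilter-mismatch fault quoted above is the kind of error such a service returns when the request's SOAPAction or body element doesn't match what the binding expects, which is consistent with guessing at the contract instead of generating a client from the WSDL.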
Not enough time to grab more, unfortunately.
[14:33] I can probably do the same for Reach and ODST if I start it soon.
[14:34] (T minus 2 hours 25 minutes 50 seconds)
[14:35] ODST: IDs 1 to 132830697
[14:37] JAA: I'm at a TB and a bit of the forums
[14:40] HCross: That's a recursive grab-site run, right? With offsite links?
[14:41] No offsite links
[14:47] Sounds good. I should have all content, but no images, stylesheets, etc. and also not all possible views (e.g. &viewreplies=1, which doesn't actually seem to do anything).
[14:47] So browsability of my archives would be... limited.
[14:47] JAA: For your Reach link, http://halo.bungie.net/Stats/Reach/GameStats.aspx?gameid=974412276 seems to be the highest
[14:48] *** antomatic has joined #archiveteam-bs
[15:07] Muad-Dib: Thanks. I've started crawls (500k first/random/last each) for ODST and Reach as well.
[15:38] *** jschwart has joined #archiveteam-bs
[15:40] *** antomatic has quit IRC (Ping timeout: 252 seconds)
[15:42] *** antomatic has joined #archiveteam-bs
[16:24] JAA: great!
[16:45] JAA: if my maths is right... we have 15 minutes left
[16:49] HCross: Yep, that's correct.
[16:59] less than 1 minute
[17:03] *** Darkstar has quit IRC (Ping timeout: 246 seconds)
[17:04] The forums are still online at least.
[17:15] *** Stilett0 has joined #archiveteam-bs
[17:15] JAA: Aaaaaaaand... it's gone http://halo.bungie.net/forums/default.aspx
[17:23] *** Darkstar has joined #archiveteam-bs
[17:26] *** m007a83_ has joined #archiveteam-bs
[17:26] Games seem to be still there, at least the Halo 3 ones.
[17:27] Then again, I don't think they were planning on deleting them entirely anyway.
[17:29] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[17:39] *** DragonMon has quit IRC (Remote host closed the connection)
[18:14] *** Darkstar has quit IRC (Ping timeout: 506 seconds)
[18:22] ODST game pages throw an error currently, but I'm not sure whether that was also the case during my grab (which finished half an hour ago or so).
[18:22] *** schbirid has joined #archiveteam-bs
[18:23] 3 and Reach seem to be fine.
[18:23] And my grabs of those are still running.
[18:25] *** Darkstar has joined #archiveteam-bs
[18:32] *** K4k has joined #archiveteam-bs
[18:58] *** Darkstar has quit IRC (Ping timeout: 246 seconds)
[19:08] *** Darkstar has joined #archiveteam-bs
[19:22] *** Stilett0 has quit IRC (Read error: Operation timed out)
[19:29] *** verifiedj has joined #archiveteam-bs
[19:32] JAA: no, they were going to delete the detailed information about those games, along with the forums
[19:45] Looks like they've purged the ODST games entirely. http://halo.bungie.net/Stats/ODSTg.aspx?gameid=1000000 worked before, for example.
[19:45] Muad-Dib: ^
[19:46] ODST seems to be gone from #1 up
[19:48] okay, that's more severe than they announced earlier
[19:52] *** Darkstar has quit IRC (Ping timeout: 246 seconds)
[20:01] *** archodg_ has quit IRC (Remote host closed the connection)
[20:06] *** archodg_ has joined #archiveteam-bs
[20:08] *** Darkstar has joined #archiveteam-bs
[20:37] Someone look at 4y59ewu7fohzirjmoplp5j0bn please
[20:38] JAA: starting my IA upload for the forums now
[20:50] *** verifiedj has quit IRC (http://www.mibbit.com ajax IRC Client)
[21:23] *** Stilett0 has joined #archiveteam-bs
[21:26] I'll probably have to run some deduplication on my archives first to avoid the thousands of copies of the pinned topics.
[21:30] arkiver: What's the status on PureVolume? We have two days left until the shutdown.
[21:41] There will be general elections in Mexico this weekend. Might be a good idea to compile a list of campaign websites etc.
and throw them into ArchiveBot.
[21:49] *** Jens has quit IRC (Remote host closed the connection)
[21:49] *** Jens has joined #archiveteam-bs
[22:03] *** jschwart has quit IRC (Quit: Konversation terminated!)
[22:20] agree
[22:24] *** Flashfire has joined #archiveteam-bs
[22:25] JAA great job with what you grabbed from Bungie
[22:25] Just wanted to hop on and say that before I go to school
[22:25] *** Flashfire has left
[22:32] *** schbirid has quit IRC (Remote host closed the connection)
[22:35] *** Stilett0 has quit IRC (Read error: Operation timed out)
[22:40] Why isn't ArchiveBot able to grab complete Tumblr sites anymore?
[22:48] *** flashfire has joined #archiveteam-bs
[22:50] I can do a list of Mexican campaign sites if someone is keeping an eye on whatever URLs I dump in the ArchiveBot chat
[22:50] JAA astrid arkiver
[22:51] sure, or I can give you voice to do it yourself :)
[22:51] Lol, I am not sure Purple is happy with me still, but sure, if you want, I will queue a few
[22:51] eh, go for it
[22:52] So long as I only grab them lol
[22:53] if you do "!a http://whatever --no-offsite-links" then that'll keep the size down ... also it'll omit external pages, so you lose context
[22:53] ok
[22:53] not sure if good or not :P
[22:53] i'd say just !a http://whatever
[22:53] if they go on too long we can cut them off
[22:53] might want to add --explain "mexican election"
[22:54] ok
[23:01] *** m007a83_ is now known as m007a83
[23:02] *** m007a83 has quit IRC (Quit: Leaving)
[23:02] *** m007a83 has joined #archiveteam-bs
[23:18] JAA I took a look at PureVolume, it seems stuck
[23:18] 97mda7ux34dixidqjmgo0g7d1 isn't budging
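Earlier in the log, JAA mentions deduplicating the forum archives to avoid thousands of copies of the pinned topics. A common way to do this in web archives is payload-digest deduplication, as in WARC revisit records: keep the first response with a given content hash and replace later identical captures with a reference to it. A stdlib-only sketch of that idea (not JAA's actual tooling, and the record tuples are a simplification of real WARC records):

```python
import hashlib

def dedupe(records):
    """Keep the first record per payload digest; mark later copies as revisits.

    records is an iterable of (url, payload_bytes) pairs; the result is a list
    of (url, kind, refers_to) triples, where kind is "response" for the first
    capture of a payload and "revisit" for later identical ones.
    """
    seen = {}  # payload digest -> URL of first capture
    out = []
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            out.append((url, "revisit", seen[digest]))
        else:
            seen[digest] = url
            out.append((url, "response", None))
    return out

records = [
    ("forums/topic1", b"<html>pinned</html>"),
    ("forums/topic2", b"<html>pinned</html>"),  # duplicate pinned topic
    ("forums/topic3", b"<html>unique</html>"),
]
```

A revisit record stores only the digest and a pointer to the original capture, so a pinned topic crawled thousands of times costs its full size just once.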