[00:14] Restarting to try and make a full backup of my laptop. Wish me luck...
[00:15] *** JesseW has left
[00:15] anyone know when this is happening or if they've started a small beta test of it yet? http://gizmodo.com/the-wayback-machine-is-getting-a-search-engine-1739099940
[00:16] it's going to be at least a year
[00:18] what do they have to do to get it working?
[00:27] anyone have the rest of the geekfu action grip podcast? i got what was in the podcast core sample on fos, but i'm pretty sure that's not all of it
[00:43] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[00:52] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[00:53] *** BlueMaxim has joined #archiveteam-bs
[01:13] joepie91: for stuff like Python, virtual environments are very helpful when you've got multiple applications. I imagine Ruby is similar
[01:14] if you're installing all your dependencies globally to the system you're gonna have a bad time
[01:15] esp. for py3 stuff since venv comes bundled natively now, makes deployment instructions fairly nice
[01:37] *** Snoo26423 has joined #archiveteam-bs
[01:55] *** RichardG has joined #archiveteam-bs
[03:00] *** Snoo26423 has quit IRC (Read error: Operation timed out)
[03:03] *** Snoo26423 has joined #archiveteam-bs
[03:26] *** toad2 has joined #archiveteam-bs
[03:28] *** toad1 has quit IRC (Read error: Operation timed out)
[03:58] *** toad1 has joined #archiveteam-bs
[03:59] is there a way to set files as non-public on archive.org?
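Note on the venv remarks at 01:13-01:15: a minimal sketch of per-application isolation using the `venv` module bundled with Python 3. The helper name `make_env` and the directory name `app-env` are made up for illustration; only `venv.create` comes from the standard library.

```python
import tempfile
import venv
from pathlib import Path


def make_env(parent: Path, name: str = "app-env") -> Path:
    """Create an isolated environment under `parent` and return its path.

    `make_env` and the default name are hypothetical, not from the log.
    """
    env_dir = parent / name
    # with_pip=False keeps the sketch fast; pass True to bootstrap pip
    # into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = make_env(Path(tmp))
        # pyvenv.cfg marks the directory as a virtual environment.
        print((env / "pyvenv.cfg").read_text())
```

Each application would get its own directory like this, so its dependencies stay out of the system-wide site-packages.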
[04:00] *** toad2 has quit IRC (Read error: Operation timed out)
[04:04] *** toad2 has joined #archiveteam-bs
[04:07] *** toad1 has quit IRC (Read error: Operation timed out)
[04:09] Greets from Westminster, MD
[04:11] *** bwn__ has quit IRC (Read error: Operation timed out)
[04:16] *** toad1 has joined #archiveteam-bs
[04:18] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:18] *** toad2 has quit IRC (Read error: Operation timed out)
[04:24] *** Sk1d has joined #archiveteam-bs
[04:37] *** toad2 has joined #archiveteam-bs
[04:39] *** toad1 has quit IRC (Read error: Operation timed out)
[05:09] *** toad1 has joined #archiveteam-bs
[05:10] *** toad2 has quit IRC (Read error: Operation timed out)
[05:18] is there a way to submit a list of urls to be archived on the wayback machine?
[05:48] *** toad2 has joined #archiveteam-bs
[05:49] *** toad1 has quit IRC (Read error: Operation timed out)
[05:57] *** toad1 has joined #archiveteam-bs
[06:00] *** toad2 has quit IRC (Read error: Operation timed out)
[06:02] *** toad2 has joined #archiveteam-bs
[06:05] Hey, I'm clueless how to use megawarc for this https://archive.org/details/archiveteam_gamemaker&tab=collection
[06:05] *** toad3 has joined #archiveteam-bs
[06:05] I see that for IA you need to split your warcs up
[06:05] But how do you put them back together? I see this https://github.com/alard/megawarc but have no clue how to use it
[06:05] *** toad1 has quit IRC (Read error: Operation timed out)
[06:06] Do I need to get all the json files from all those items in that collection and put it in one file or something to use the above tool and aggghhh
[06:08] *** toad2 has quit IRC (Read error: Operation timed out)
[06:13] Honno: what are you trying to do
[06:14] yipdw: make a warc composed of all those warcs
[06:15] Honno: use cat
[06:15] yipdw: what's that sorry?
[06:15] if your only goal is concatenation it's faster than extract/compress
[06:16] cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
[06:16] is there like, a linux command where I can literally just write a concat command with all the file names?
[06:16] cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
[06:16] so 1991-03 of tagesschau is getting uploaded
[06:17] is there a way to submit a list of urls to be archived on the wayback machine?
[06:17] not officially; use web.archive.org's save page thing or you can send stuff in via archivebot
[06:18] Honno: the JSON file accompanying each megawarc item is there to make it possible to split the megawarc back into its source files
[06:18] so if you're splitting, yes, you want that
[06:18] but warc.gz files produced by megawarc are individually gzipped WARC records so concatenation is fine
[06:19] this applies only to the WARC output; catting warc.gz with tarballs may not do what you want
[06:19] fortunately most tarballs created in megawarced warrior output are empty tarball
[06:19] s
[06:20] yipdw: sooo, no need for the JSON files if I'm going to concat right?
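Note on the `cat` advice at 06:15-06:18: a small sketch of why plain concatenation works for individually gzipped records. It joins two gzip members byte-for-byte, as `cat` would, and reads the result back as one stream. The helper name `concat_gzip_members` and the stand-in record bytes are made up for illustration; real WARC records are not being parsed here.

```python
import gzip
from typing import Iterable


def concat_gzip_members(parts: Iterable[bytes]) -> bytes:
    """Join already-gzipped blobs byte-for-byte, like `cat a.gz b.gz > big.gz`.

    Hypothetical helper; nothing is decompressed or recompressed.
    """
    return b"".join(parts)


if __name__ == "__main__":
    # Two stand-in "records", each compressed on its own.
    rec1 = gzip.compress(b"WARC/1.0 record one\r\n")
    rec2 = gzip.compress(b"WARC/1.0 record two\r\n")
    big = concat_gzip_members([rec1, rec2])
    # A gzip reader treats consecutive members as one continuous stream.
    print(gzip.decompress(big))
```

Because no recompression happens, this is only a copy of bytes, which matches the point that concatenation is faster than extract/compress.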
[06:20] if all you want to do is make a gigantic warc then no you don't need the JSON files
[06:20] I'm wondering why you want a gigantic warc, but that's a second question
[06:21] yipdw: it's because components of one archive rely on things in the archive archives, for general browsing
[06:21] download all the warcs and load them up into pywb
[06:21] it'll find them
[06:21] wayback has similar functionality
[06:22] wayback seems ridiculously hard to set up
[06:22] then try pywb, it's easier
[06:22] or webarchiveplayer, which is pywb with a nicer interface
[06:22] I'm a complete noob by the way heh, I don't do programming or anything
[06:22] yeah I tried webarchiveplayer, that doesn't seem to have the feature of using all things tho
[06:22] also takes ridiculously long to load
[06:23] *** Microguru has joined #archiveteam-bs
[06:23] you're throwing hundreds of gigabytes of data
[06:23] it's going to take a while no matter
[06:23] yeah heh, ugh
[06:24] in any case webarchiveplayer should support multiple WARCs fine
[06:24] I don't remember if it uses the cdx files
[06:24] or if it must reconstruct them
[06:24] you may have better luck downloading the WARC and CDX files, and dumping them in the same place
[06:25] cdx huh, need to check what that is
[06:25] WARC index
[06:25] if webarchiveplayer can use the indexes you can avoid a costly reindexing
[06:25] oh
[06:25] I know pywb uses indexes to speed up retrieval, I just can't remember whether or not it will use the ones generated at IA
[06:26] well thanks yipdw, the ultimate goal is to web scrape data and extract all the game downloads from the site, but it seems there's a lot I need to learn about first
[06:28] you may want to ask ikreymer for more tips
[06:28] he pops in here occasionally
[06:28] heh, another thing yipdw, the game downloads don't show up in the index of webarchiveplayer
[06:28] being the author of pywb I suspect he'll know more about it than me
[06:28] ah right haha
[06:28] I don't know what that's from
[06:28] all the downloads have a weird download link see, it's a query ie games/220702-karoshi-factory-remake-gmk/send_download?code=1ed32eb417091bed7fffe9e99269867ba01b54da
[06:29] from games/220702/download
[06:29] the site was pretty weird
[06:29] I can't easily download the game files then?
[06:30] I don't know, I didn't participate in that one
[06:30] arkiver probably knows more about the quirks of that site
[06:31] mhmk I'll see if they know
[06:32] yipdw, where do I see who organized these crawls sorry?
[06:32] I see the tracker lists folk, but that's people who contributed their computers right on the warrior
[06:32] oh I guess it was chfoo
[06:32] https://github.com/ArchiveTeam/gamemaker-sandbox-grab
[06:33] yeah chfoo made the archive team wiki page about the project
[06:33] also helped me out earlier so I spose that's the person I want heh
[06:34] I'll be off, thanks for your help
[06:34] Really need to learn this stuff, want to make a clean archive of the games from this old site
[06:36] np
[06:43] *** toad1 has joined #archiveteam-bs
[06:44] *** toad3 has quit IRC (Read error: Operation timed out)
[07:09] is there a way to set files as non-public on archive.org?
[07:10] *** JesseW has joined #archiveteam-bs
[07:11] bsmith093: Finished all but Naruto (which is 18G uncompressed) -- now working on that.
[07:12] Currently up to 105G compressed, as opposed to the originals' 108G. So it will likely be bigger, but probably not very.
[07:12] probably about 2GB bigger.
[07:13] hook54321: not as a normal user; IA staffers can do various things, though.
[07:36] *** bwn has joined #archiveteam-bs
[07:58] *** VADemon has quit IRC (Quit: left4dead)
[08:01] *** metalcamp has joined #archiveteam-bs
[08:12] *** JesseW has left
[08:16] Frogging: "virtual environments" is the recommendation everybody automatically makes for Python and Ruby but 1) they are a hack that really shouldn't be necessary to begin with and 2) they don't actually fully solve the problem
[08:16] they isolate dependencies on a per-application basis
[08:16] but it doesn't magically allow for nested / differently versioned dependencies *within* a project
[08:17] so the dep model remains broken
[08:17] (and frankly, virtual environments are typically an utter mess to integrate with service/daemon managers and such)
[08:26] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:30] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[08:34] *** fie has joined #archiveteam-bs
[08:36] *** fie__ has quit IRC (Ping timeout: 244 seconds)
[08:55] *** lytv has joined #archiveteam-bs
[08:59] *** fie_ has joined #archiveteam-bs
[09:00] *** vtyl has quit IRC (Read error: Operation timed out)
[09:00] *** fie has quit IRC (Read error: Operation timed out)
[09:37] *** fie__ has joined #archiveteam-bs
[09:38] *** fie_ has quit IRC (Read error: Operation timed out)
[09:42] SketchCow: all of 2012 kpfa is uploaded
[09:42] i'm uploading 2013-01 now
[09:44] *** metalcamp has joined #archiveteam-bs
[09:45] *** fie_ has joined #archiveteam-bs
[09:46] *** fie__ has quit IRC (Read error: Operation timed out)
[09:49] *** fie__ has joined #archiveteam-bs
[09:49] *** fie__ has quit IRC (Client Quit)
[09:53] *** fie_ has quit IRC (Ping timeout: 370 seconds)
[09:55] *** metalcamp has quit IRC (Quit: Bye)
[10:06] *** metalcamp has joined #archiveteam-bs
[10:16] morning all
[10:33] Just read a blog post about 500px.com raising their cut for every sold picture from 30% to 70% ("to help the further growth of 500px"), one of the founders is the same as livejournal. Maybe we should do a sanity grab?
[10:48] Of 500px? Of LiveJournal?
[10:50] Well the sanity grab of livejournal is already in the disco phase. So I mean it might be good to check up on 500px as well if it's feasible to do a sanity check
[10:51] What the fuck is a disco phase
[10:53] Oh, discovery phase
[10:53] discovery
[11:08] BEARS > BEES
[12:02] i'm up to 1991-03-31 of tagesschau evening news
[12:02] NOTE: there is no 1991-03-26 episode on their site
[12:27] i think uploads to IA are getting stuck
[12:34] godane, ditto. Newsgrabber is getting stuck
[12:47] *** acridAxid has quit IRC (marauder)
[12:49] *** acridAxid has joined #archiveteam-bs
[12:57] *** alfie has quit IRC (Quit: Seeeya! - ZNC 1.6.3+deb1+jessie0)
[12:57] *** alfie has joined #archiveteam-bs
[13:38] *** schbirid has joined #archiveteam-bs
[14:07] *** chazchaz has quit IRC (Read error: Operation timed out)
[14:08] *** Honno has quit IRC (Read error: Connection reset by peer)
[14:14] *** Coderjoe has quit IRC (Ping timeout: 260 seconds)
[14:16] *** hook54321 has quit IRC (Ping timeout: 268 seconds)
[14:17] *** chazchaz has joined #archiveteam-bs
[14:39] ersi: The most fabulous phase of course :p
[14:41] it depends, it's either the discovery phase or the "angry person yelling" phase
[14:51] *** Coderjoe has joined #archiveteam-bs
[15:03] *** Honno has joined #archiveteam-bs
[15:11] *** vitzli has joined #archiveteam-bs
[16:13] *** closure has quit IRC (ZNC - 1.6.0 - http://znc.in)
[17:05] *** RichardG has quit IRC (Read error: Operation timed out)
[17:06] *** RichardG has joined #archiveteam-bs
[17:16] *** closure has joined #archiveteam-bs
[17:31] *** vitzli has quit IRC (Leaving)
[17:47] *** dxrt- has quit IRC (Ping timeout: 633 seconds)
[17:47] soooooooooo what craziness is Jason up to atm
[17:47] i'm watching on twitter
[17:51] Smiley: just moving the manuals from one place to another, AFAIK
[17:54] Smiley: http://pastebin.com/3meEDnQ5 that is a bit of an overview of what's going on
[17:55] tl;dr: SketchCow and friends rescued a shitload of manuals, and now they're just moving the manuals into a consolidated space for money savings sake.
[18:01] oh these the one from that shop which closed?
[18:04] If it wasn't for the other-side-of-the-world problem, I'd be there
[18:04] *** bsmith093 has quit IRC (Ping timeout: 258 seconds)
[18:05] nod
[18:05] money i don't have right now, time,... not really
[18:05] but i might have been able to help at least a bit
[18:05] hopefully moving on thursday \o/
[18:07] Jason needs some stuff to move in the UK :P
[18:17] *** DopefishJ has joined #archiveteam-bs
[18:17] *** swebb sets mode: +o DopefishJ
[18:18] *** bwn has quit IRC (Ping timeout: 246 seconds)
[18:19] *** DFJustin has quit IRC (Ping timeout: 274 seconds)
[18:48] *** bwn has joined #archiveteam-bs
[18:48] *** bsmith093 has joined #archiveteam-bs
[18:54] *** Smiley has quit IRC (Remote host closed the connection)
[18:56] *** schbirid has quit IRC (Quit: Leaving)
[19:23] HCross: have you signed up on the archivecorps mailing list? there may be some moving jobs there. :-)
[19:25] I haven't
[19:34] signup is here: http://archive.us7.list-manage.com/subscribe?u=30ffefa96d1767cc661f2e3ce&id=3b19db5cef
[19:39] Done
[19:49] *** tomwsmf-a has joined #archiveteam-bs
[19:54] *** DopefishJ is now known as DFJustin
[20:07] Honno: did you see the wiki page? i updated instructions on how to access it in wayback machine if that helps
[20:07] chfoo, yeah I did, thanks for that, will do more into explaining how to get the warcs going offline
[20:08] just got it all downloaded and running myself
[20:12] so much confusion in #archiveteam...
[20:13] *** tomwsmf-a has quit IRC (Ping timeout: 258 seconds)
[20:16] JW_work: i was about to say... linebreaks aren't fuckin punctuation :P
[20:39] *** luckcolor has joined #archiveteam-bs
[20:39] *** luckcolor has left
[20:46] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[20:51] *** BlueMaxim has joined #archiveteam-bs
[20:51] *** JetBalsa has joined #archiveteam-bs
[20:52] *** Tom__ has joined #archiveteam-bs
[20:52] oi Tom__, so what's your question
[20:54] So the thing is the archive team crawled a social network site. it has 519 collections. I want to find a specific profile, otherwise I need to download 519 collections which is a lot of TB
[20:54] hm yeah
[20:55] you could download the .cdx files that go with, those are basically an index of urls
[20:55] Yes, is there software to open it specifically?
[20:57] not much that you might find useful
[20:57] I mean what is the best software to open the .cdx.idx files? I can open it with notepad, but it's not good with spacing and aligning.
[20:57] but they're just plain text files so you can just use grep
[20:57] if you find a url in a cdx then that means it is available in the matching warc file
[20:59] Ok, thank you. I will download the files and start searching.
[21:05] *** Tom__ has quit IRC (Quit: Page closed)
[21:10] *** luckcolor has joined #archiveteam-bs
[21:10] *** luckcolor has left
[21:26] 519 collections, is it hyves?
[21:32] Tom__: I had a list around, uploaded it here: https://archive.org/details/warcindex-usernames.7z
[21:40] added this list to the wiki for others searching an archive containing their own or a specific username on hyves
[22:17] *** Honno has quit IRC (Ping timeout: 492 seconds)
[22:25] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[22:52] *** hook54321 has joined #archiveteam-bs
[22:57] *** bauruine has quit IRC (Ping timeout: 260 seconds)
[23:14] *** bauruine has joined #archiveteam-bs
[23:22] *** hook54321 has quit IRC (Ping timeout: 268 seconds)
[23:44] *** RichardG has quit IRC (Read error: Connection reset by peer)
[23:49] *** RichardG has joined #archiveteam-bs
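Note on the 20:55-20:57 advice to grep the .cdx files: a rough Python equivalent of that plain-text search, for anyone without grep at hand. The function name `find_in_cdx`, the sample index lines, and the example.com URLs are all invented for illustration. The sketch assumes the common layout where a header line starting with " CDX" names the fields and the last field on each row is the WARC filename; real indexes may differ.

```python
from typing import Iterable, Iterator


def find_in_cdx(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield CDX index lines containing `needle`, skipping the header line.

    Hypothetical helper: a plain substring match, much like `grep -F`.
    """
    for line in lines:
        # Assumed header form, e.g. " CDX N b a m s k r M S V g".
        if line.lstrip().startswith("CDX "):
            continue
        if needle in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Made-up sample rows; a real run would iterate over an open .cdx file.
    sample = [
        " CDX N b a m s k r M S V g\n",
        "example.com/profile/alice 20131101000000 http://example.com/profile/alice"
        " text/html 200 AAAA - - 1234 5678 site-00001.warc.gz\n",
        "example.com/profile/bob 20131101000001 http://example.com/profile/bob"
        " text/html 200 BBBB - - 2345 9012 site-00002.warc.gz\n",
    ]
    for hit in find_in_cdx(sample, "/profile/alice"):
        # Under the assumed layout the last field names the matching WARC.
        print(hit.split()[-1], hit)
```

A hit tells you which single WARC to download instead of the whole set of collections.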