#archiveteam-bs 2016-03-28,Mon


Time Nickname Message
00:14 🔗 JesseW Restarting to try and make a full backup of my laptop. Wish me luck...
00:15 🔗 JesseW has left
00:15 🔗 hook54321 anyone know when this is happening or if they've started a small beta test of it yet? http://gizmodo.com/the-wayback-machine-is-getting-a-search-engine-1739099940
00:16 🔗 yipdw it's going to be at least a year
00:18 🔗 hook54321 what do they have to do to get it working?
00:27 🔗 bsmith093 anyone have the rest of the geekfu action grip podcast? i got what was in the podcast core sample on fos, but i'm pretty sure thats not all of it
00:43 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
00:52 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
00:53 🔗 BlueMaxim has joined #archiveteam-bs
01:13 🔗 Frogging joepie91: for stuff like Python, virtual environments are very helpful when you've got multiple applications. I imagine Ruby is similar
01:14 🔗 Frogging if you're installing all your dependencies globally to the system you're gonna have a bad time
01:15 🔗 dan- esp. for py3 stuff since venv comes bundled natively now, makes deployment instructions fairly nice
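A minimal sketch of the per-application isolation Frogging and dan- describe, using the standard-library `venv` module that ships with Python 3. The `make_env` helper and the `.venv` directory name are illustrative assumptions, not anything from the discussion; `with_pip=False` only keeps the sketch fast and offline, and a real deployment would likely want pip installed.

```python
import tempfile
import venv
from pathlib import Path


def make_env(app_dir):
    """Create an isolated environment in <app_dir>/.venv and return its path."""
    env_dir = Path(app_dir) / ".venv"  # hypothetical directory name
    # with_pip=False skips bootstrapping pip; use with_pip=True for real use.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as app_dir:
        env_dir = make_env(app_dir)
        # pyvenv.cfg marks the directory as a virtual environment.
        print((env_dir / "pyvenv.cfg").exists())
```

Each application gets its own directory of installed packages this way, which is the isolation being discussed; it does not allow two versions of one dependency inside a single project.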
01:37 🔗 Snoo26423 has joined #archiveteam-bs
01:55 🔗 RichardG has joined #archiveteam-bs
03:00 🔗 Snoo26423 has quit IRC (Read error: Operation timed out)
03:03 🔗 Snoo26423 has joined #archiveteam-bs
03:26 🔗 toad2 has joined #archiveteam-bs
03:28 🔗 toad1 has quit IRC (Read error: Operation timed out)
03:58 🔗 toad1 has joined #archiveteam-bs
03:59 🔗 hook54321 is there a way to set files as non-public on archive.org?
04:00 🔗 toad2 has quit IRC (Read error: Operation timed out)
04:04 🔗 toad2 has joined #archiveteam-bs
04:07 🔗 toad1 has quit IRC (Read error: Operation timed out)
04:09 🔗 SketchCow Greets from Westminster, MD
04:11 🔗 bwn__ has quit IRC (Read error: Operation timed out)
04:16 🔗 toad1 has joined #archiveteam-bs
04:18 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:18 🔗 toad2 has quit IRC (Read error: Operation timed out)
04:24 🔗 Sk1d has joined #archiveteam-bs
04:37 🔗 toad2 has joined #archiveteam-bs
04:39 🔗 toad1 has quit IRC (Read error: Operation timed out)
05:09 🔗 toad1 has joined #archiveteam-bs
05:10 🔗 toad2 has quit IRC (Read error: Operation timed out)
05:18 🔗 hook54321 is there a way to submit a list of urls to be archived on the wayback machine?
05:48 🔗 toad2 has joined #archiveteam-bs
05:49 🔗 toad1 has quit IRC (Read error: Operation timed out)
05:57 🔗 toad1 has joined #archiveteam-bs
06:00 🔗 toad2 has quit IRC (Read error: Operation timed out)
06:02 🔗 toad2 has joined #archiveteam-bs
06:05 🔗 Honno Hey, I'm clueless how to use megawarc for this https://archive.org/details/archiveteam_gamemaker&tab=collection
06:05 🔗 toad3 has joined #archiveteam-bs
06:05 🔗 Honno I see that for IA you need to split your warcs up
06:05 🔗 Honno But how do you put them back together? I see this https://github.com/alard/megawarc but have no clue how to use it
06:05 🔗 toad1 has quit IRC (Read error: Operation timed out)
06:06 🔗 Honno Do I need to get all the json files from all those items in that collection and put it in one file or something to use the above tool and aggghhh
06:08 🔗 toad2 has quit IRC (Read error: Operation timed out)
06:13 🔗 yipdw Honno: what are you trying to do
06:14 🔗 Honno yipdw: make a warc composed of all those warcs
06:15 🔗 yipdw Honno: use cat
06:15 🔗 Honno yipdw: whats that sorry?
06:15 🔗 yipdw if your only goal is concatenation it's faster than extract/compress
06:16 🔗 yipdw cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
06:16 🔗 Honno is there like, a linux command where I can literally just write a concat command with all the file names?
06:16 🔗 yipdw cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
06:16 🔗 godane so 1991-03 of tagesschau is getting uploaded
06:17 🔗 hook54321 is there a way to submit a list of urls to be archived on the wayback machine?
06:17 🔗 yipdw not officially; use web.archive.org's save page thing or you can send stuff in via archivebot
06:18 🔗 yipdw Honno: the JSON file accompanying each megawarc item is there to make it possible to split the megawarc back into its source files
06:18 🔗 yipdw so if you're splitting, yes, you want that
06:18 🔗 yipdw but warc.gz files produced by megawarc are individually gzipped WARC records so concatenation is fine
06:19 🔗 yipdw this applies only to the WARC output; catting warc.gz with tarballs may not do what you want
06:19 🔗 yipdw fortunately most tarballs created in megawarced warrior output are empty tarball
06:19 🔗 yipdw s
06:20 🔗 Honno yipdw: sooo, no need for the JSON files if I'm going to concat right?
06:20 🔗 yipdw if all you want to do is make a gigantic warc then no you don't need the JSON files
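A rough Python equivalent of the `cat` approach yipdw suggests, as a sketch only. It relies on the fact that a gzip file may hold several members back to back, so byte-level concatenation of gzipped files still decompresses as one stream. The `concat_files` and `demo` names are hypothetical, and the payloads below are placeholder bytes, not real WARC records.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_files(sources, dest):
    """Append each source file's raw bytes to dest, like `cat a b > dest`."""
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as part:
                shutil.copyfileobj(part, out)


def demo():
    """Build two tiny gzip files, concatenate them, and read the result back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        parts = []
        # Placeholder payloads standing in for gzipped WARC records.
        for i, payload in enumerate([b"record one\n", b"record two\n"]):
            path = tmp / f"part{i}.gz"
            with gzip.open(path, "wb") as fh:
                fh.write(payload)
            parts.append(path)
        big = tmp / "big.gz"
        concat_files(parts, big)
        # gzip.open reads through every member of a multi-member file.
        with gzip.open(big, "rb") as fh:
            return fh.read()


if __name__ == "__main__":
    print(demo())
```

No decompression happens during the concatenation itself, which is why this is faster than extracting and recompressing.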
06:20 🔗 yipdw I'm wondering why you want a gigantic warc, but that's a second question
06:21 🔗 Honno yipdw: it's because components of one archive rely on things in the archive archives, for general browsing
06:21 🔗 yipdw download all the warcs and load them up into pywb
06:21 🔗 yipdw it'll find them
06:21 🔗 yipdw wayback has similar functionality
06:22 🔗 Honno wayback seems ridiculously hard to set up
06:22 🔗 yipdw then try pywb, it's easier
06:22 🔗 yipdw or webarchiveplayer, which is pywb with a nicer interface
06:22 🔗 Honno I'm a complete noob by the way heh, I don't do programming or anything
06:22 🔗 Honno yeah I tried webarchiveplayer, that doesn't seem to have the feature of using all things tho
06:22 🔗 Honno also takes ridiculously long to load
06:23 🔗 Microguru has joined #archiveteam-bs
06:23 🔗 yipdw you're throwing hundreds of gigabytes of data
06:23 🔗 yipdw it's going to take a while no matter
06:23 🔗 Honno yeah heh, ugh
06:24 🔗 yipdw in any case webarchiveplayer should support multiple WARCs fine
06:24 🔗 yipdw I don't remember if it uses the cdx files
06:24 🔗 yipdw or if it must reconstruct them
06:24 🔗 yipdw you may have better luck downloading the WARC and CDX files, and dumping them in the same place
06:25 🔗 Honno cdx huh, need to check what that is
06:25 🔗 yipdw WARC index
06:25 🔗 yipdw if webarchiveplayer can use the indexes you can avoid a costly reindexing
06:25 🔗 Honno oh
06:25 🔗 yipdw I know pywb uses indexes to speed up retrieval, I just can't remember whether or not it will use the ones generated at IA
06:26 🔗 Honno well thanks yipdw, the ultimate goal is to web scrape data and extract all the game downloads from the site, but it seems theres a lot I need to learn about first
06:28 🔗 yipdw you may want to ask ikreymer for more tips
06:28 🔗 yipdw he pops in here occasionally
06:28 🔗 Honno heh, another thing yipdw, the game downloads don't show up in the index of webarchiveplayer
06:28 🔗 yipdw being the author of pywb I suspect he'll know more about it than me
06:28 🔗 Honno ah right haha
06:28 🔗 yipdw I don't know what that's from
06:28 🔗 Honno all the downloads have a weird download link see, it's a query ie games/220702-karoshi-factory-remake-gmk/send_download?code=1ed32eb417091bed7fffe9e99269867ba01b54da
06:29 🔗 Honno from games/220702/download
06:29 🔗 Honno the site was pretty weird
06:29 🔗 Honno I can't easily download the game files then?
06:30 🔗 yipdw I don't know, I didn't participate in that one
06:30 🔗 yipdw arkiver probably knows more about the quirks of that site
06:31 🔗 Honno mhmk I'll see if they know
06:32 🔗 Honno yipdw, where do I see who organized these crawls sorry?
06:32 🔗 Honno I see the tracker lists folk, but thats people who contributed their computers right on the warrior
06:32 🔗 yipdw oh I guess it was chfoo
06:32 🔗 yipdw https://github.com/ArchiveTeam/gamemaker-sandbox-grab
06:33 🔗 Honno yeah chfoo made the archive team wiki page about the project
06:33 🔗 Honno also helped me out earlier so I spose thats the person I want heh
06:34 🔗 Honno I'll be off, thanks for your help
06:34 🔗 Honno Really need to learn this stuff, want to make a clean archive of the games from this old site
06:36 🔗 yipdw np
06:43 🔗 toad1 has joined #archiveteam-bs
06:44 🔗 toad3 has quit IRC (Read error: Operation timed out)
07:09 🔗 hook54321 is there a way to set files as non-public on archive.org?
07:10 🔗 JesseW has joined #archiveteam-bs
07:11 🔗 JesseW bsmith093: Finished all but Naruto (which is 18G uncompressed) -- now working on that.
07:12 🔗 JesseW Currently up to 105G compressed, as opposed to the originals 108G. So it will likely be bigger, but probably not very.
07:12 🔗 JesseW probably about 2GB bigger.
07:13 🔗 JesseW hook54321: not as a normal user; IA staffers can do various things, though.
07:36 🔗 bwn has joined #archiveteam-bs
07:58 🔗 VADemon has quit IRC (Quit: left4dead)
08:01 🔗 metalcamp has joined #archiveteam-bs
08:12 🔗 JesseW has left
08:16 🔗 joepie91 Frogging: "virtual environments" is the recommendation everybody automatically makes for Python and Ruby but 1) they are a hack that really shouldn't be necessary to begin with and 2) they don't actually fully solve the problem
08:16 🔗 joepie91 they isolate dependencies on a per-application basis
08:16 🔗 joepie91 but it doesn't magically allow for nested / differently versioned dependencies *within* a project
08:17 🔗 joepie91 so the dep model remains broken
08:17 🔗 joepie91 (and frankly, virtual environments are typically an utter mess to integrate with service/daemon managers and such)
08:26 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
08:30 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
08:34 🔗 fie has joined #archiveteam-bs
08:36 🔗 fie__ has quit IRC (Ping timeout: 244 seconds)
08:55 🔗 lytv has joined #archiveteam-bs
08:59 🔗 fie_ has joined #archiveteam-bs
09:00 🔗 vtyl has quit IRC (Read error: Operation timed out)
09:00 🔗 fie has quit IRC (Read error: Operation timed out)
09:37 🔗 fie__ has joined #archiveteam-bs
09:38 🔗 fie_ has quit IRC (Read error: Operation timed out)
09:42 🔗 godane SketchCow: all of 2012 kpfa is uploaded
09:42 🔗 godane i'm uploading 2013-01 now
09:44 🔗 metalcamp has joined #archiveteam-bs
09:45 🔗 fie_ has joined #archiveteam-bs
09:46 🔗 fie__ has quit IRC (Read error: Operation timed out)
09:49 🔗 fie__ has joined #archiveteam-bs
09:49 🔗 fie__ has quit IRC (Client Quit)
09:53 🔗 fie_ has quit IRC (Ping timeout: 370 seconds)
09:55 🔗 metalcamp has quit IRC (Quit: Bye)
10:06 🔗 metalcamp has joined #archiveteam-bs
10:16 🔗 alfie morning all
10:33 🔗 BnA-Rob1n Just read a blog post about 500px.com raising their cut for every sold picture from 30% to 70% ("to help the further growth of 500px"), one of the founders is the same as livejournal. Maybe we should do a sanity grab?
10:48 🔗 ersi Of 500px? Of LiveJournal?
10:50 🔗 BnA-Rob1n Well the sanity grab of livejournal is already in the disco phase. So I mean it might be good to check up on 500px as well if it's feasible to do a sanity check
10:51 🔗 ersi What the fuck is a disco phase
10:53 🔗 ersi Oh, discovery phase
10:53 🔗 HCross discovery
11:08 🔗 alfie BEARS > BEES
12:02 🔗 godane i'm up to 1991-03-31 of tagesschau evening news
12:02 🔗 godane NOTE: there is no 1991-03-26 episode on their site
12:27 🔗 godane i think uploads to IA are getting stuck
12:34 🔗 HCross godane, ditto. Newsgrabber is getting stuck
12:47 🔗 acridAxid has quit IRC (marauder)
12:49 🔗 acridAxid has joined #archiveteam-bs
12:57 🔗 alfie has quit IRC (Quit: Seeeya! - ZNC 1.6.3+deb1+jessie0)
12:57 🔗 alfie has joined #archiveteam-bs
13:38 🔗 schbirid has joined #archiveteam-bs
14:07 🔗 chazchaz has quit IRC (Read error: Operation timed out)
14:08 🔗 Honno has quit IRC (Read error: Connection reset by peer)
14:14 🔗 Coderjoe has quit IRC (Ping timeout: 260 seconds)
14:16 🔗 hook54321 has quit IRC (Ping timeout: 268 seconds)
14:17 🔗 chazchaz has joined #archiveteam-bs
14:39 🔗 Frogging ersi: The most fabulous phase of course :p
14:41 🔗 HCross it depends, its either the discovery phase or the "angry person yelling" phase
14:51 🔗 Coderjoe has joined #archiveteam-bs
15:03 🔗 Honno has joined #archiveteam-bs
15:11 🔗 vitzli has joined #archiveteam-bs
16:13 🔗 closure has quit IRC (ZNC - 1.6.0 - http://znc.in)
17:05 🔗 RichardG has quit IRC (Read error: Operation timed out)
17:06 🔗 RichardG has joined #archiveteam-bs
17:16 🔗 closure has joined #archiveteam-bs
17:31 🔗 vitzli has quit IRC (Leaving)
17:47 🔗 dxrt- has quit IRC (Ping timeout: 633 seconds)
17:47 🔗 Smiley soooooooooo what craziness is Jason upto atm
17:47 🔗 Smiley i'm watching on twitter
17:51 🔗 JW_work Smiley: just moving the manuals from one place to another, AFAIK
17:54 🔗 phuzion Smiley: http://pastebin.com/3meEDnQ5 that is a bit of an overview of what's going on
17:55 🔗 phuzion tl;dr: SketchCow and friends rescued a shitload of manuals, and now they're just moving the manuals into a consolidated space for money savings sake.
18:01 🔗 Smiley oh these the one from that shop which closed?
18:04 🔗 HCross If it wasnt for the other-side-of-the-world problem, id be there
18:04 🔗 bsmith093 has quit IRC (Ping timeout: 258 seconds)
18:05 🔗 Smiley nod
18:05 🔗 Smiley money i don't have right now, time,... not really
18:05 🔗 Smiley but i might have been able to help at least a bit
18:05 🔗 Smiley hopefully moving on thursday \o/
18:07 🔗 HCross Jason needs some stuff to move in the UK :P
18:17 🔗 DopefishJ has joined #archiveteam-bs
18:17 🔗 swebb sets mode: +o DopefishJ
18:18 🔗 bwn has quit IRC (Ping timeout: 246 seconds)
18:19 🔗 DFJustin has quit IRC (Ping timeout: 274 seconds)
18:48 🔗 bwn has joined #archiveteam-bs
18:48 🔗 bsmith093 has joined #archiveteam-bs
18:54 🔗 Smiley has quit IRC (Remote host closed the connection)
18:56 🔗 schbirid has quit IRC (Quit: Leaving)
19:23 🔗 JW_work HCross: have you signed up on the archivecorps mailing list? there may be some moving jobs there. :-)
19:25 🔗 HCross2 I havent
19:34 🔗 BnA-Rob1n signup is here: http://archive.us7.list-manage.com/subscribe?u=30ffefa96d1767cc661f2e3ce&id=3b19db5cef
19:39 🔗 HCross2 Done
19:49 🔗 tomwsmf-a has joined #archiveteam-bs
19:54 🔗 DopefishJ is now known as DFJustin
20:07 🔗 chfoo Honno: did you see the wiki page? i updated instructions on how to access it in wayback machine if that helps
20:07 🔗 Honno chfoo, yeah I did, thanks for that, will do more into explaining how to get the warcs going offline
20:08 🔗 Honno just got it all downloaded and running myself
20:12 🔗 JW_work so much confusion in #archiveteam...
20:13 🔗 tomwsmf-a has quit IRC (Ping timeout: 258 seconds)
20:16 🔗 alfie JW_work: i was about to say... linebreaks aren't fuckin punctuation :P
20:39 🔗 luckcolor has joined #archiveteam-bs
20:39 🔗 luckcolor has left
20:46 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
20:51 🔗 BlueMaxim has joined #archiveteam-bs
20:51 🔗 JetBalsa has joined #archiveteam-bs
20:52 🔗 Tom__ has joined #archiveteam-bs
20:52 🔗 xmc oi Tom__, so what's your question
20:54 🔗 Tom__ So the thing is the archive team crawled a social network site. it has 519 collections. I want to find a specific profile, otherwise I need to download 519 collections which is a lot of TB
20:54 🔗 xmc hm yeah
20:55 🔗 xmc you could download the .cdx files that go with, those are basically an index of urls
20:55 🔗 Tom__ Yes, is there software to open it specifically?
20:57 🔗 xmc not much that you might find useful
20:57 🔗 Tom__ I mean what is the best software to open the .cdx.idx files? I can open it with notepad, but its not good with spacing and aligning.
20:57 🔗 xmc but they're just plain text files so you can just use grep
20:57 🔗 xmc if you find a url in a cdx then that means it is available in the matching warc file
20:59 🔗 Tom__ Ok, thank you. I will download the files and start searching.
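For anyone without grep to hand, the search xmc describes can be sketched in Python. This treats each CDX line as plain text and does a substring match, since the exact column layout can vary between files. The `grep_cdx` name is hypothetical, and the sample URLs and file names in the usage note are made up for illustration.

```python
def grep_cdx(lines, needle):
    """Yield CDX lines containing needle, skipping the ' CDX ...' header line."""
    for line in lines:
        # The first line of a CDX file names the columns; it is not a capture.
        if line.lstrip().startswith("CDX "):
            continue
        if needle in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Made-up sample data; real lines come from a downloaded .cdx file.
    sample = [
        " CDX N b a m s k r M S V g\n",
        "com,example)/alice 20160328000000 http://example.com/alice "
        "text/html 200 AAAA - - 123 456 part-0001.warc.gz\n",
        "com,example)/bob 20160328000001 http://example.com/bob "
        "text/html 200 BBBB - - 789 1011 part-0002.warc.gz\n",
    ]
    for hit in grep_cdx(sample, "/alice"):
        print(hit)
```

To search a real index, pass an open file instead of the sample list, for example `open(path, encoding="utf-8", errors="replace")`. A matching line means the URL should be in the WARC that goes with that CDX.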
21:05 🔗 Tom__ has quit IRC (Quit: Page closed)
21:10 🔗 luckcolor has joined #archiveteam-bs
21:10 🔗 luckcolor has left
21:26 🔗 BnA-Rob1n 519 collections, is it hyves?
21:32 🔗 BnA-Rob1n Tom__: I had a list around, uploaded it here: https://archive.org/details/warcindex-usernames.7z
21:40 🔗 BnA-Rob1n added this list to the wiki for others searching an archive containing their own or a specific username on hyves
22:17 🔗 Honno has quit IRC (Ping timeout: 492 seconds)
22:25 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
22:52 🔗 hook54321 has joined #archiveteam-bs
22:57 🔗 bauruine has quit IRC (Ping timeout: 260 seconds)
23:14 🔗 bauruine has joined #archiveteam-bs
23:22 🔗 hook54321 has quit IRC (Ping timeout: 268 seconds)
23:44 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:49 🔗 RichardG has joined #archiveteam-bs
