#archiveteam-bs 2016-03-28,Mon

Time	Nickname	Message
00:14 ^🔗	JesseW	Restarting to try and make a full backup of my laptop. Wish me luck...
00:15 ^🔗		JesseW has left
00:15 ^🔗	hook54321	anyone know when this is happening or if they've started a small beta test of it yet? http://gizmodo.com/the-wayback-machine-is-getting-a-search-engine-1739099940
00:16 ^🔗	yipdw	it's going to be at least a year
00:18 ^🔗	hook54321	what do they have to do to get it working?
00:27 ^🔗	bsmith093	anyone have the rest of the geekfu action grip podcast? i got what was in the podcast core sample on fos, but i'm pretty sure thats not all of it
00:43 ^🔗		tomwsmf-a has quit IRC (Read error: Operation timed out)
00:52 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
00:53 ^🔗		BlueMaxim has joined #archiveteam-bs
01:13 ^🔗	Frogging	joepie91: for stuff like Python, virtual environments are very helpful when you've got multiple applications. I imagine Ruby is similar
01:14 ^🔗	Frogging	if you're installing all your dependencies globally to the system you're gonna have a bad time
01:15 ^🔗	dan-	esp. for py3 stuff since venv comes bundled natively now, makes deployment instructions fairly nice
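A minimal sketch of the bundled venv module that dan- mentions, assuming Python 3.3 or later. The directory name demo-env is hypothetical, and with_pip=False is chosen here only to keep the example quick and offline; nobody in the channel recommended it.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical location for a per-application environment.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"

# Create an isolated environment. with_pip=False skips bootstrapping pip
# (an assumption made to keep this sketch fast and network-free).
venv.create(env_dir, with_pip=False)

# Environments created by venv carry a pyvenv.cfg marker file at their root.
cfg_path = env_dir / "pyvenv.cfg"
print(cfg_path.exists())
```

From a shell, the usual equivalent is `python3 -m venv demo-env`, after which dependencies installed inside the environment stay out of the system-wide site-packages.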
01:37 ^🔗		Snoo26423 has joined #archiveteam-bs
01:55 ^🔗		RichardG has joined #archiveteam-bs
03:00 ^🔗		Snoo26423 has quit IRC (Read error: Operation timed out)
03:03 ^🔗		Snoo26423 has joined #archiveteam-bs
03:26 ^🔗		toad2 has joined #archiveteam-bs
03:28 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
03:58 ^🔗		toad1 has joined #archiveteam-bs
03:59 ^🔗	hook54321	is there a way to set files as non-public on archive.org?
04:00 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
04:04 ^🔗		toad2 has joined #archiveteam-bs
04:07 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
04:09 ^🔗	SketchCow	Greets from Westminster, MD
04:11 ^🔗		bwn__ has quit IRC (Read error: Operation timed out)
04:16 ^🔗		toad1 has joined #archiveteam-bs
04:18 ^🔗		Sk1d has quit IRC (Ping timeout: 194 seconds)
04:18 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
04:24 ^🔗		Sk1d has joined #archiveteam-bs
04:37 ^🔗		toad2 has joined #archiveteam-bs
04:39 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
05:09 ^🔗		toad1 has joined #archiveteam-bs
05:10 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
05:18 ^🔗	hook54321	is there a way to submit a list of urls to be archived on the wayback machine?
05:48 ^🔗		toad2 has joined #archiveteam-bs
05:49 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
05:57 ^🔗		toad1 has joined #archiveteam-bs
06:00 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
06:02 ^🔗		toad2 has joined #archiveteam-bs
06:05 ^🔗	Honno	Hey, I'm clueless how to use megawarc for this https://archive.org/details/archiveteam_gamemaker&tab=collection
06:05 ^🔗		toad3 has joined #archiveteam-bs
06:05 ^🔗	Honno	I see that for IA you need to split your warcs up
06:05 ^🔗	Honno	But how do you put them back together? I see this https://github.com/alard/megawarc but have no clue how to use it
06:05 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
06:06 ^🔗	Honno	Do I need to get all the json files from all those items in that collection and put it in one file or something to use the above tool and aggghhh
06:08 ^🔗		toad2 has quit IRC (Read error: Operation timed out)
06:13 ^🔗	yipdw	Honno: what are you trying to do
06:14 ^🔗	Honno	yipdw: make a warc composed of all those warcs
06:15 ^🔗	yipdw	Honno: use cat
06:15 ^🔗	Honno	yipdw: whats that sorry?
06:15 ^🔗	yipdw	if your only goal is concatenation it's faster than extract/compress
06:16 ^🔗	yipdw	cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
06:16 ^🔗	Honno	is there like, a linux command where I can literally just write a concat command with all the file names?
06:16 ^🔗	yipdw	cat warc1.warc.gz warc2.warc.gz ... warcn.warc.gz > big.warc.gz
06:16 ^🔗	godane	so 1991-03 of tagesschau is getting uploaded
06:17 ^🔗	hook54321	is there a way to submit a list of urls to be archived on the wayback machine?
06:17 ^🔗	yipdw	not officially; use web.archive.org's save page thing or you can send stuff in via archivebot
06:18 ^🔗	yipdw	Honno: the JSON file accompanying each megawarc item is there to make it possible to split the megawarc back into its source files
06:18 ^🔗	yipdw	so if you're splitting, yes, you want that
06:18 ^🔗	yipdw	but warc.gz files produced by megawarc are individually gzipped WARC records so concatenation is fine
06:19 ^🔗	yipdw	this applies only to the WARC output; catting warc.gz with tarballs may not do what you want
06:19 ^🔗	yipdw	fortunately most tarballs created in megawarced warrior output are empty tarball
06:19 ^🔗	yipdw	s
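The reason plain cat works, as yipdw describes, is that a gzip stream may consist of several members written back to back. A small Python sketch of the same idea follows; the file names and payloads are made up for illustration and are not real WARC records.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_files(parts, dest):
    """Byte-for-byte concatenation, like `cat parts... > dest`."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Build two tiny gzip files standing in for warc1.warc.gz and warc2.warc.gz.
workdir = Path(tempfile.mkdtemp())
part_paths = []
for i, payload in enumerate([b"record one\n", b"record two\n"]):
    part = workdir / f"part{i}.warc.gz"
    with gzip.open(part, "wb") as fh:
        fh.write(payload)
    part_paths.append(part)

# Concatenate without decompressing or recompressing anything.
big = workdir / "big.warc.gz"
concat_files(part_paths, big)

# Python's gzip module reads all members of a multi-member file in order.
with gzip.open(big, "rb") as fh:
    combined = fh.read()
print(combined)
```

No data is decompressed during the concatenation itself, which is why this is faster than extracting and recompressing.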
06:20 ^🔗	Honno	yipdw: sooo, no need for the JSON files if I'm going to concat right?
06:20 ^🔗	yipdw	if all you want to do is make a gigantic warc then no you don't need the JSON files
06:20 ^🔗	yipdw	I'm wondering why you want a gigantic warc, but that's a second question
06:21 ^🔗	Honno	yipdw: it's because components of one archive rely on things in the other archives, for general browsing
06:21 ^🔗	yipdw	download all the warcs and load them up into pywb
06:21 ^🔗	yipdw	it'll find them
06:21 ^🔗	yipdw	wayback has similar functionality
06:22 ^🔗	Honno	wayback seems ridiculously hard to set up
06:22 ^🔗	yipdw	then try pywb, it's easier
06:22 ^🔗	yipdw	or webarchiveplayer, which is pywb with a nicer interface
06:22 ^🔗	Honno	I'm a complete noob by the way heh, I don't do programming or anything
06:22 ^🔗	Honno	yeah I tried webarchiveplayer, that doesn't seem to have the feature of using all things tho
06:22 ^🔗	Honno	also takes ridiculously long to load
06:23 ^🔗		Microguru has joined #archiveteam-bs
06:23 ^🔗	yipdw	you're throwing hundreds of gigabytes of data
06:23 ^🔗	yipdw	it's going to take a while no matter what
06:23 ^🔗	Honno	yeah heh, ugh
06:24 ^🔗	yipdw	in any case webarchiveplayer should support multiple WARCs fine
06:24 ^🔗	yipdw	I don't remember if it uses the cdx files
06:24 ^🔗	yipdw	or if it must reconstruct them
06:24 ^🔗	yipdw	you may have better luck downloading the WARC and CDX files, and dumping them in the same place
06:25 ^🔗	Honno	cdx huh, need to check what that is
06:25 ^🔗	yipdw	WARC index
06:25 ^🔗	yipdw	if webarchiveplayer can use the indexes you can avoid a costly reindexing
06:25 ^🔗	Honno	oh
06:25 ^🔗	yipdw	I know pywb uses indexes to speed up retrieval, I just can't remember whether or not it will use the ones generated at IA
06:26 ^🔗	Honno	well thanks yipdw, the ultimate goal is to web scrape data and extract all the game downloads from the site, but it seems theres a lot I need to learn about first
06:28 ^🔗	yipdw	you may want to ask ikreymer for more tips
06:28 ^🔗	yipdw	he pops in here occasionally
06:28 ^🔗	Honno	heh, another thing yipdw, the game downloads don't show up in the index of webarchiveplayer
06:28 ^🔗	yipdw	being the author of pywb I suspect he'll know more about it than me
06:28 ^🔗	Honno	ah right haha
06:28 ^🔗	yipdw	I don't know what that's from
06:28 ^🔗	Honno	all the downloads have a weird download link see, it's a query ie games/220702-karoshi-factory-remake-gmk/send_download?code=1ed32eb417091bed7fffe9e99269867ba01b54da
06:29 ^🔗	Honno	from games/220702/download
06:29 ^🔗	Honno	the site was pretty weird
06:29 ^🔗	Honno	I can't easily download the game files then?
06:30 ^🔗	yipdw	I don't know, I didn't participate in that one
06:30 ^🔗	yipdw	arkiver probably knows more about the quirks of that site
06:31 ^🔗	Honno	mhmk I'll see if they know
06:32 ^🔗	Honno	yipdw, where do I see who organized these crawls sorry?
06:32 ^🔗	Honno	I see the tracker lists folk, but thats people who contributed their computers right on the warrior
06:32 ^🔗	yipdw	oh I guess it was chfoo
06:32 ^🔗	yipdw	https://github.com/ArchiveTeam/gamemaker-sandbox-grab
06:33 ^🔗	Honno	yeah chfoo made the archive team wiki page about the project
06:33 ^🔗	Honno	also helped me out earlier so I spose thats the person I want heh
06:34 ^🔗	Honno	I'll be off, thanks for your help
06:34 ^🔗	Honno	Really need to learn this stuff, want to make a clean archive of the games from this old site
06:36 ^🔗	yipdw	np
06:43 ^🔗		toad1 has joined #archiveteam-bs
06:44 ^🔗		toad3 has quit IRC (Read error: Operation timed out)
07:09 ^🔗	hook54321	is there a way to set files as non-public on archive.org?
07:10 ^🔗		JesseW has joined #archiveteam-bs
07:11 ^🔗	JesseW	bsmith093: Finished all but Naruto (which is 18G uncompressed) -- now working on that.
07:12 ^🔗	JesseW	Currently up to 105G compressed, as opposed to the original's 108G. So it will likely be bigger, but probably not very.
07:12 ^🔗	JesseW	probably about 2GB bigger.
07:13 ^🔗	JesseW	hook54321: not as a normal user; IA staffers can do various things, though.
07:36 ^🔗		bwn has joined #archiveteam-bs
07:58 ^🔗		VADemon has quit IRC (Quit: left4dead)
08:01 ^🔗		metalcamp has joined #archiveteam-bs
08:12 ^🔗		JesseW has left
08:16 ^🔗	joepie91	Frogging: "virtual environments" is the recommendation everybody automatically makes for Python and Ruby but 1) they are a hack that really shouldn't be necessary to begin with and 2) they don't actually fully solve the problem
08:16 ^🔗	joepie91	they isolate dependencies on a per-application basis
08:16 ^🔗	joepie91	but it doesn't magically allow for nested / differently versioned dependencies within a project
08:17 ^🔗	joepie91	so the dep model remains broken
08:17 ^🔗	joepie91	(and frankly, virtual environments are typically an utter mess to integrate with service/daemon managers and such)
08:26 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
08:30 ^🔗		metalcamp has quit IRC (Ping timeout: 244 seconds)
08:34 ^🔗		fie has joined #archiveteam-bs
08:36 ^🔗		fie__ has quit IRC (Ping timeout: 244 seconds)
08:55 ^🔗		lytv has joined #archiveteam-bs
08:59 ^🔗		fie_ has joined #archiveteam-bs
09:00 ^🔗		vtyl has quit IRC (Read error: Operation timed out)
09:00 ^🔗		fie has quit IRC (Read error: Operation timed out)
09:37 ^🔗		fie__ has joined #archiveteam-bs
09:38 ^🔗		fie_ has quit IRC (Read error: Operation timed out)
09:42 ^🔗	godane	SketchCow: all of 2012 kpfa is uploaded
09:42 ^🔗	godane	i'm uploading 2013-01 now
09:44 ^🔗		metalcamp has joined #archiveteam-bs
09:45 ^🔗		fie_ has joined #archiveteam-bs
09:46 ^🔗		fie__ has quit IRC (Read error: Operation timed out)
09:49 ^🔗		fie__ has joined #archiveteam-bs
09:49 ^🔗		fie__ has quit IRC (Client Quit)
09:53 ^🔗		fie_ has quit IRC (Ping timeout: 370 seconds)
09:55 ^🔗		metalcamp has quit IRC (Quit: Bye)
10:06 ^🔗		metalcamp has joined #archiveteam-bs
10:16 ^🔗	alfie	morning all
10:33 ^🔗	BnA-Rob1n	Just read a blog post about 500px.com raising their cut for every sold picture from 30% to 70% ("to help the further growth of 500px"), one of the founders is the same as livejournal. Maybe we should do a sanity grab?
10:48 ^🔗	ersi	Of 500px? Of LiveJournal?
10:50 ^🔗	BnA-Rob1n	Well the sanity grab of livejournal is already in the disco phase. So I mean it might be good to check up on 500px as well if it's feasible to do a sanity check
10:51 ^🔗	ersi	What the fuck is a disco phase
10:53 ^🔗	ersi	Oh, discovery phase
10:53 ^🔗	HCross	discovery
11:08 ^🔗	alfie	BEARS > BEES
12:02 ^🔗	godane	i'm up to 1991-03-31 of tagesschau evening news
12:02 ^🔗	godane	NOTE: there is no 1991-03-26 episode on their site
12:27 ^🔗	godane	i think uploads to IA are getting stuck
12:34 ^🔗	HCross	godane, ditto. Newsgrabber is getting stuck
12:47 ^🔗		acridAxid has quit IRC (marauder)
12:49 ^🔗		acridAxid has joined #archiveteam-bs
12:57 ^🔗		alfie has quit IRC (Quit: Seeeya! - ZNC 1.6.3+deb1+jessie0)
12:57 ^🔗		alfie has joined #archiveteam-bs
13:38 ^🔗		schbirid has joined #archiveteam-bs
14:07 ^🔗		chazchaz has quit IRC (Read error: Operation timed out)
14:08 ^🔗		Honno has quit IRC (Read error: Connection reset by peer)
14:14 ^🔗		Coderjoe has quit IRC (Ping timeout: 260 seconds)
14:16 ^🔗		hook54321 has quit IRC (Ping timeout: 268 seconds)
14:17 ^🔗		chazchaz has joined #archiveteam-bs
14:39 ^🔗	Frogging	ersi: The most fabulous phase of course :p
14:41 ^🔗	HCross	it depends, its either the discovery phase or the "angry person yelling" phase
14:51 ^🔗		Coderjoe has joined #archiveteam-bs
15:03 ^🔗		Honno has joined #archiveteam-bs
15:11 ^🔗		vitzli has joined #archiveteam-bs
16:13 ^🔗		closure has quit IRC (ZNC - 1.6.0 - http://znc.in)
17:05 ^🔗		RichardG has quit IRC (Read error: Operation timed out)
17:06 ^🔗		RichardG has joined #archiveteam-bs
17:16 ^🔗		closure has joined #archiveteam-bs
17:31 ^🔗		vitzli has quit IRC (Leaving)
17:47 ^🔗		dxrt- has quit IRC (Ping timeout: 633 seconds)
17:47 ^🔗	Smiley	soooooooooo what craziness is Jason up to atm
17:47 ^🔗	Smiley	i'm watching on twitter
17:51 ^🔗	JW_work	Smiley: just moving the manuals from one place to another, AFAIK
17:54 ^🔗	phuzion	Smiley: http://pastebin.com/3meEDnQ5 that is a bit of an overview of what's going on
17:55 ^🔗	phuzion	tl;dr: SketchCow and friends rescued a shitload of manuals, and now they're just moving the manuals into a consolidated space for money savings sake.
18:01 ^🔗	Smiley	oh these the one from that shop which closed?
18:04 ^🔗	HCross	If it wasn't for the other-side-of-the-world problem, I'd be there
18:04 ^🔗		bsmith093 has quit IRC (Ping timeout: 258 seconds)
18:05 ^🔗	Smiley	nod
18:05 ^🔗	Smiley	money i don't have right now, time,... not really
18:05 ^🔗	Smiley	but i might have been able to help at least a bit
18:05 ^🔗	Smiley	hopefully moving on thursday \o/
18:07 ^🔗	HCross	Jason needs some stuff to move in the UK :P
18:17 ^🔗		DopefishJ has joined #archiveteam-bs
18:17 ^🔗		swebb sets mode: +o DopefishJ
18:18 ^🔗		bwn has quit IRC (Ping timeout: 246 seconds)
18:19 ^🔗		DFJustin has quit IRC (Ping timeout: 274 seconds)
18:48 ^🔗		bwn has joined #archiveteam-bs
18:48 ^🔗		bsmith093 has joined #archiveteam-bs
18:54 ^🔗		Smiley has quit IRC (Remote host closed the connection)
18:56 ^🔗		schbirid has quit IRC (Quit: Leaving)
19:23 ^🔗	JW_work	HCross: have you signed up on the archivecorps mailing list? there may be some moving jobs there. :-)
19:25 ^🔗	HCross2	I havent
19:34 ^🔗	BnA-Rob1n	signup is here: http://archive.us7.list-manage.com/subscribe?u=30ffefa96d1767cc661f2e3ce&id=3b19db5cef
19:39 ^🔗	HCross2	Done
19:49 ^🔗		tomwsmf-a has joined #archiveteam-bs
19:54 ^🔗		DopefishJ is now known as DFJustin
20:07 ^🔗	chfoo	Honno: did you see the wiki page? i updated instructions on how to access it in wayback machine if that helps
20:07 ^🔗	Honno	chfoo, yeah I did, thanks for that, will do more into explaining how to get the warcs going offline
20:08 ^🔗	Honno	just got it all downloaded and running myself
20:12 ^🔗	JW_work	so much confusion in #archiveteam...
20:13 ^🔗		tomwsmf-a has quit IRC (Ping timeout: 258 seconds)
20:16 ^🔗	alfie	JW_work: i was about to say... linebreaks aren't fuckin punctuation :P
20:39 ^🔗		luckcolor has joined #archiveteam-bs
20:39 ^🔗		luckcolor has left
20:46 ^🔗		metalcamp has quit IRC (Ping timeout: 244 seconds)
20:51 ^🔗		BlueMaxim has joined #archiveteam-bs
20:51 ^🔗		JetBalsa has joined #archiveteam-bs
20:52 ^🔗		Tom__ has joined #archiveteam-bs
20:52 ^🔗	xmc	oi Tom__, so what's your question
20:54 ^🔗	Tom__	So the thing is the archive team crawled a social network site. it has 519 collections. I want to find a specific profile, otherwise I need to download 519 collections which is a lot of TB
20:54 ^🔗	xmc	hm yeah
20:55 ^🔗	xmc	you could download the .cdx files that go with, those are basically an index of urls
20:55 ^🔗	Tom__	Yes, is there software to open it specifically?
20:57 ^🔗	xmc	not much that you might find useful
20:57 ^🔗	Tom__	I mean what is the best software to open the .cdx.idx files? I can open it with notepad, but its not good with spacing and aligning.
20:57 ^🔗	xmc	but they're just plain text files so you can just use grep
20:57 ^🔗	xmc	if you find a url in a cdx then that means it is available in the matching warc file
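For anyone without grep at hand, a rough Python equivalent of xmc's suggestion might look like the sketch below. The function names grep_cdx and open_text, the sample index lines, and the example.com URLs are all hypothetical; gzip handling is an assumption in case the index files are downloaded compressed.

```python
import gzip
import tempfile
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def grep_cdx(paths, needle):
    """Yield (file, line) for every index line containing the needle."""
    for path in paths:
        with open_text(path) as fh:
            for line in fh:
                if needle in line:
                    yield str(path), line.rstrip("\n")


# Made-up index lines, only to show the search; real CDX files come
# alongside the WARC files in each item.
workdir = Path(tempfile.mkdtemp())
cdx_path = workdir / "example-00001.cdx"
cdx_path.write_text(
    "com,example)/profile/alice 20131101000000 http://example.com/profile/alice "
    "text/html 200 AAAA - - 100 0 example-00001.warc.gz\n"
    "com,example)/profile/bob 20131101000001 http://example.com/profile/bob "
    "text/html 200 BBBB - - 100 100 example-00001.warc.gz\n",
    encoding="utf-8",
)

matches = list(grep_cdx([cdx_path], "/profile/alice"))
for source, line in matches:
    print(source, line)
```

Each hit names the index file it came from, which tells you which matching WARC file is worth downloading.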
20:59 ^🔗	Tom__	Ok, thank you. I will download the files and start searching.
21:05 ^🔗		Tom__ has quit IRC (Quit: Page closed)
21:10 ^🔗		luckcolor has joined #archiveteam-bs
21:10 ^🔗		luckcolor has left
21:26 ^🔗	BnA-Rob1n	519 collections, is it hyves?
21:32 ^🔗	BnA-Rob1n	Tom__: I had a list around, uploaded it here: https://archive.org/details/warcindex-usernames.7z
21:40 ^🔗	BnA-Rob1n	added this list to the wiki for others searching an archive containing their own or a specific username on hyves
22:17 ^🔗		Honno has quit IRC (Ping timeout: 492 seconds)
22:25 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
22:52 ^🔗		hook54321 has joined #archiveteam-bs
22:57 ^🔗		bauruine has quit IRC (Ping timeout: 260 seconds)
23:14 ^🔗		bauruine has joined #archiveteam-bs
23:22 ^🔗		hook54321 has quit IRC (Ping timeout: 268 seconds)
23:44 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
23:49 ^🔗		RichardG has joined #archiveteam-bs
