00:40 <joepie91> hmm
00:40 <joepie91> I want to try my hand at setting up a seesaw script
00:40 <joepie91> but for the tracker I'd need to have an IA collection...
00:40 <joepie91> is that strictly necessary or can I put it elsewhere for now?
00:41 <omf_> you can stick it in opensource as a texts mediatype
00:41 <omf_> you also need an upload target
00:42 <joepie91> :P
00:42 <joepie91> omf_: as in, rsync target? that'd be my OVH box
00:43 * joepie91 is setting up the megawarc factory
00:43 <omf_> on that box you have to setup rsync and a megawarc factory if needed
00:43 <joepie91> yes, I'm following the instructions for that atm
00:43 <joepie91> hence my question about collections
00:43 <joepie91> "Bother or ask politely someone about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on EFNet."
00:44 <joepie91> but 'opensource' as target collection would do also?
00:45 <omf_> opensource is an open collection anyone can stick things in
00:45 <ATZ0> what's the frequency, kenneths?
00:45 <omf_> To make sure it is kept track of I set the keywords "archiveteam webcrawl"
00:45 <joepie91> omf_: no idea how to do that in the config.sh
00:45 <omf_> and you can add the project name as a keyword as well
00:47 <joepie91> omf_: it is sufficient to have the same prefix for each item name? I can't figure out where to set keywords :P
00:47 <joepie91> is it *
00:48 <omf_> you modify the curl command in upload-one to add the proper s3 headers
00:49 <omf_> like this
00:49 <omf_> --header 'x-archive-meta-subject:archiveteam;webcrawl' \
00:51 <joepie91> I see, thanks
00:52 <omf_> here is the one we used for zapd. It is nothing special but you get the idea --> http://paste.archivingyoursh.it/velajudige.pl
00:54 <omf_> Each s3 header matches an option on the web interface for creating/editing an item
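For reference, a fuller sketch of the upload-one curl call omf_ describes, against the Internet Archive S3 API; the item name, file name, and credentials below are placeholders, not the actual zapd script:

    # sketch of an upload-one style curl call against the IA S3 API;
    # item name, file and credentials are placeholders
    curl --location \
         --header 'x-amz-auto-make-bucket:1' \
         --header 'x-archive-meta-collection:opensource' \
         --header 'x-archive-meta-mediatype:texts' \
         --header 'x-archive-meta-subject:archiveteam;webcrawl' \
         --header "authorization: LOW ${S3_ACCESS_KEY}:${S3_SECRET_KEY}" \
         --upload-file myproject-000001.warc.gz \
         http://s3.us.archive.org/archiveteam_myproject_000001/myproject-000001.warc.gz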
01:08 <joepie91> omf_: halp!
01:08 <joepie91> 'ruby' was not found, cannot install rubygems unless ruby is present (Do you have an RVM ruby installed & selected?)
01:08 <joepie91> when running `rvm rubygems current`
01:08 <joepie91> after the rvm install 2.0
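This error usually means the freshly installed ruby is not selected in the current (non-login) shell; a likely fix, assuming a standard rvm install in ~/.rvm:

    # load rvm into a non-login shell, then select the new ruby
    source ~/.rvm/scripts/rvm
    rvm use 2.0 --default
    rvm rubygems current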
01:09 * joepie91 knows 0 about ruby
01:11 <joepie91> wtf ruby
07:13 <Lord_Nigh> nintendo says fullscreenmario.com is illegal, takedown may happen at some point (ref: http://www.washingtonpost.com/blogs/the-switch/wp/2013/10/17/nintendo-says-this-amazing-super-mario-site-is-illegal-heres-why-it-shouldnt-be/ )
07:13 <Lord_Nigh> fullscreenmario's engine code is at https://github.com/Diogenesthecynic/FullScreenMario.git
07:52 <tephra> Lord_Nigh: I cloned the repo yesterday :)
07:53 <Lord_Nigh> I got it today, they fixed a bug on one of the levels
07:55 <tephra> nioce
08:00 <godane> i'm grabbing all the imaginingamerica videos
08:01 <godane> will archive.org take a zip file of videos and display the videos?
08:32 <godane> looks like this video disappeared: http://www.youtube.com/watch?v=hT_rY-Lk8nc
08:33 <godane> it goes from close to 2 hours to just 1 second now
14:38 <joepie91> righ
14:38 <joepie91> right *
14:38 <joepie91> I'm giving up on setting up the archiveteam tracker for now :|
14:39 <joepie91> it's a disaster to set up... it expects upstart to exist according to the guide (which it doesn't on Debian, and installing it would break sysvinit), I have issues with rvm and the absence of a login shell, and so on
14:39 <joepie91> D:
14:41 <omf_> joepie91, you were trying to install the universal tracker, why?
14:41 <joepie91> omf_: because that's what the guide says?
14:42 <joepie91> http://www.archiveteam.org/index.php?title=Tracker_Setup
14:42 <joepie91> I want to get a tracker / project running
14:42 <joepie91> but I've spent some 4 hours on this now
14:42 <joepie91> and I still don't have a working setup
14:42 <omf_> We already have a tracker instance, it is tracker.archiveteam.org
14:43 <joepie91> which I don't have any form of admin access to, nor do I expect it to be appreciated to use it while testing stuff
14:44 <omf_> We all test shit using that instance, we just don't put the projects in the projects.json
14:44 <omf_> All you need to know about the tracker is that it takes a list of items to send out; beyond that, nothing else is needed to start up a new project
14:46 <omf_> bam new instance --> http://tracker.archiveteam.org/isoprey
14:46 <omf_> now let me create you an account
14:53 <omf_> You add items using the Queues page
14:53 <omf_> Claims page is for monitoring
15:33 <yipdw> joepie91: you don't need upstart
15:58 <joepie91> unrelated note; if scraping a site, you'll want to pretend to be Firefox, not Chrome
15:58 <joepie91> :P
15:58 <joepie91> Chrome auto-updates in nearly every case so if the spoofed useragent you're using is an outdated Chrome version it's very very easy for a server admin to single you out
15:58 <joepie91> FF is much less rigorous with that
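For instance, a sketch with wget — the user agent string is a then-current Firefox ESR, and the URL is a placeholder:

    # spoof a Firefox ESR user agent rather than a quickly-dated Chrome one
    wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0" \
         http://example.com/some/page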
16:38 <omf_> Anyone remember which warrior project had the pipeline that required multiple ids to do the crawling
16:38 <omf_> was that puush or xanga? or something else?
16:38 <antomatic> I think Puush specifies a number of IDs within a range
16:39 <antomatic> Xanga was straightforward 1:1 items
16:40 <antomatic> Might have been Patch that issued multiple items-per-item ?
16:44 <joepie91> it was puush indeed
16:45 <joepie91> mmm, is there a particular reason for the MoveFiles task existing? cc omf_
16:45 <joepie91> can't immediately see the point of it, and considering rm'ing it from my script
16:53 <antomatic> It may not be vital but I think I'm right in saying that it moves the .warc.gz files from the location they're temporarily downloaded to, to a location where they can be assumed to be 'finished and ready to upload'.
16:54 <antomatic> Can be useful if a script crashes sometimes.
16:54 <antomatic> What's finished can be shown with ls data/*/*.gz, whereas partial downloads are left at data/*/*/*.gz
16:55 <joepie91> right.
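A rough shell equivalent of the move antomatic describes — the real MoveFiles is a Python task in the seesaw pipeline, and the paths here are placeholders:

    # move finished warcs out of the in-progress directory, only after
    # the download step has exited cleanly (paths are placeholders)
    item_dir="data/myitem/files"     # wget writes in-progress .warc.gz here
    done_dir="data/myitem"           # the upload step only considers this level
    mv "$item_dir"/*.warc.gz "$done_dir"/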
17:11 <yipdw> antomatic: yeah, patch.com did items-per-item
17:12 <yipdw> I don't recommend taking the patch.com pipeline as an example, though -- it's not that I think it's bad, but it's doing some really specialized stuff that requires substantial additional server support
17:20 <joepie91> in seesaw, what's the conceptual idea behind Task.process and Task.enqueue, and how do they differ/relate?
17:24 <joepie91> cc yipdw
17:54 <joepie91> also, what's the "realize" thing?
18:02 <joepie91> OH
18:02 <joepie91> realize does the item interpolation?
18:41 <yipdw> joepie91: item interpolation is best handled by the ItemInterpolation object
18:41 <joepie91> yipdw: what I meant was that realize appears to handle the processing of ItemInterpolation objects and such
18:41 <joepie91> in the actual Task code
18:42 <yipdw> oh
18:42 <yipdw> yeah, maybe -- I try to keep above that level in the seesaw code
18:42 <joepie91> I'm still trying to figure out what all this does
18:42 <joepie91> :P
18:42 <joepie91> right
18:42 <yipdw> I've not gone that far into it
18:42 <yipdw> luckily, you don't need to go that far to write custom tasks
18:42 <joepie91> yipdw: idk, I'm trying to do ranges
18:42 <joepie91> and the WgetDownloadMany thing in puush seemed faaaaar too complex in use for me
18:43 <yipdw> joepie91: you can write a SimpleTask subclass that expands the range and adds the expanded list to an item key
18:44 <yipdw> from there you can feed them in as URLs to wget (if they're URLs) or process them further into a wget-usable form
18:44 <yipdw> that's what the ExpandItem task in the patch pipeline does
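A conceptual sketch, in shell, of what such an expansion task does — the real ExpandItem/SimpleTask is Python inside the seesaw pipeline, and the range format and URL scheme here are made up:

    # expand a range item like "1000-1019" into one URL per id,
    # which a later task could hand to wget via --input-file
    item_name="1000-1019"
    start="${item_name%-*}"
    end="${item_name#*-}"
    for id in $(seq "$start" "$end"); do
        echo "http://example.com/i/${id}"
    done > item_urls.txt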
18:49 <joepie91> yipdw: right, my setup is a bit more complex though :P
18:49 <joepie91> it tries to separately download a .torrent file first and depending on whether that succeeds it attempts to do a recursive grab of other pages
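In shell terms, that torrent-first logic would look roughly like this (all URLs and names are placeholders; the actual script is a seesaw pipeline):

    # grab the .torrent first; only if it exists, recursively grab the pages
    if wget -q "http://example.com/torrents/${item}.torrent" -O "${item}.torrent"; then
        wget --recursive --level=2 --warc-file="${item}" \
             "http://example.com/details/${item}"
    fi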
20:01 <Lord_Nigh> http://www.dopplr.com/ shutting down
20:02 <Lord_Nigh> godane: that video can be downloaded as .mp4 here, 1.6gb
20:02 <Lord_Nigh> http://www.youtube.com/watch?v=hT_rY-Lk8nc <-i mean that one
20:21 <godane> i just tried and i'm only getting 128kb
20:21 <godane> also it will not play in browser
20:41 <ersi> joepie91: There's a #warrior channel that you could use for seesaw discussions
22:13 <lemonkey> http://www.dopplr.com/
22:13 <lemonkey> nm lord_nigh already posted
23:18 <SketchCow> Let's doooo it
23:35 <kyan> guys there's no way I'm going to be able to download all of fisheye.toolserver.org in a month… I'm only at 43k URLs (been downloading for a day or two now) and have over 3 million queued already. Is there a way to distribute it so multiple requests can be happening at once??
23:38 <joepie91> kyan: what exactly is the situation with fisheye.toolserver.org?
23:39 <joepie91> also, re: isohunt
23:39 <joepie91> <cayce>7 calendar days from the signing of the judgement
23:39 <joepie91> <cayce>Filed 10/17/13
23:39 <joepie91> <cayce>It's pretty much a legal cease and desist, but he's got 7 days to do it. Nothing stated that he can't do stuff in the interim, as long as he makes that deadline.
23:39 <joepie91> <cayce>better hurry the fuck up with that grab, you've got 7 days
23:39 <joepie91> <cayce>especially since the only applicable parties is him and his company
23:39 <joepie91> <cayce>joepie91:) someone should ask him. He's required to shut it down within 7 days and not continue operating it, but there's nothing in there about not making a backup or somesuch.
23:39 <joepie91> <cayce>yeah, okay
23:39 <kyan> joepie91: it's a website that's shutting down, but it's a) really slow and b) really big
23:40 <joepie91> kyan: it just looks like a repository viewer to me?
23:40 <kyan> joepie91: it is, but for some reason things can't be exported via SVN normally
23:41 <kyan> joepie91: apparently the history can only be obtained through the web diff interface
23:41 <joepie91> that makes no sense...
23:42 <joepie91> :|
23:42 <balrog> kyan: are these svn repos?
23:42 <balrog> did you try svnsync?
23:42 <balrog> if you can svn co -r <rev>, then svnsync will do the job
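For reference, mirroring a repository with full history via svnsync looks roughly like this (the source URL is a placeholder):

    # create a local mirror repo and let svnsync replay every revision into it
    svnadmin create mirror-repo
    # svnsync must be allowed to set revision properties on the mirror:
    echo '#!/bin/sh' > mirror-repo/hooks/pre-revprop-change
    chmod +x mirror-repo/hooks/pre-revprop-change
    svnsync init file://"$PWD"/mirror-repo http://svn.example.org/repo
    svnsync sync file://"$PWD"/mirror-repo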
23:42 <kyan> balrog: IDK, I just know someone else ran into issues with doing it the normal way and so they switched over to wget
23:42 <kyan> and then wget borked because the site was so big
23:43 <balrog> link me a repo that has issues
23:43 <balrog> ugh
23:43 <kyan> so I tried taking it on with Heritrix
23:43 <balrog> using wget for this...
23:43 <kyan> and it's going ok, but not fast enough with a deadline
23:43 <kyan> let me see if i can find the chatlogs about it
23:45 <kyan> here we go http://badcheese.com/~steve/atlogs/?chan=archiveteam&day=2013-10-16
23:45 <kyan> 10 or 15 lines down
23:46 <balrog> I suggested svnsync there...
23:46 <balrog> Nemo_bis: ping
23:47 <balrog> Nemo_bis: svnsync DOES give you history
23:47 <balrog> it works by using svn co -r to check out each rev and build a new svn repo from those checkouts
23:47 <balrog> with all metadata and such
23:49 <kyan> balrog: "at least one root refuses svn export"… not sure what that indicates
23:49 <balrog> kyan: he was using svn export
23:49 <balrog> I'd like to know which repo failed it
23:49 <balrog> kyan: are you good with terminal/command line?
23:50 <kyan> balrog: not really. I can do enough to get by usually
23:50 <balrog> ah :/ ok
23:50 * kyan is, however, an EXPERT at writing unusable spaghetti code in php