#archiveteam 2014-10-20,Mon


Time Nickname Message
00:11 🔗 aaaaaaaaa bebzol: you could download the tracker dev env and set up a network on virtualbox for it and just change the tracker_host in your pipeline.py file
00:14 🔗 aaaaaaaaa https://github.com/ArchiveTeam/archiveteam-dev-env or follow the directions here: http://archiveteam.org/index.php?title=Dev/Tracker
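The pipeline.py change aaaaaaaaa describes usually amounts to pointing the tracker constants at the local dev tracker. A minimal sketch — the TRACKER_ID/TRACKER_HOST names follow the common ArchiveTeam pipeline convention, and the address is a placeholder for a VirtualBox host-only network, not a real tracker:

```python
# Sketch of the top of a seesaw pipeline.py, pointed at a local dev tracker.
# TRACKER_ID and the 192.168.x address are placeholders for your own setup.
TRACKER_ID = 'testproject'
TRACKER_HOST = '192.168.56.101'  # local dev tracker instead of tracker.archiveteam.org

def tracker_url(endpoint):
    """Build a tracker URL the way the pipeline's tasks typically do."""
    return 'http://%s/%s/%s' % (TRACKER_HOST, TRACKER_ID, endpoint)
```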
00:20 🔗 bebzol yay, thanks :)
02:41 🔗 dashcloud any idea why wpull is telling me "ImportError: No module named 'sqlalchemy'" ? I used the pip install wpull method
03:00 🔗 garyrh try pip install -U wpull
05:11 🔗 amerrykan i'm trying to pull down every video from LA Podfest 2014
05:11 🔗 amerrykan i've got all but three videos, youtube-dl fails with a weird error
05:13 🔗 amerrykan Go Bayside! - player.vimeo.com/video/107790103
05:13 🔗 amerrykan Road Stories - player.vimeo.com/video/107786707
05:13 🔗 amerrykan The JV Club - player.vimeo.com/video/107793309
05:13 🔗 amerrykan i'm about to head to bed, but if anyone can suggest, i'd appreciate it
05:32 🔗 danneh_ amerrykan: I'm giving them a shot, will let you know how they go
05:36 🔗 amerrykan if they even start for you, that's further than I've got
05:38 🔗 danneh_ yep, first one finished and second one started, maybe try: pip install --upgrade youtube-dl
05:39 🔗 danneh_ or it could've blocked you for doing too many at one time? though I've downloaded about 30 at once and not had issues
05:40 🔗 amerrykan i'm on arch, so freshness shouldn't be my problem
05:41 🔗 amerrykan i'm getting 'unable to extract info' type errors
05:44 🔗 danneh_ fair enough, that's weird
05:46 🔗 danneh_ I'd probably still try either that pip or youtube-dl -U , since that issue is generally related to out-of-date extractor info and shouldn't hurt anything
05:47 🔗 danneh_ in any case, I can upload these if all else fails
08:05 🔗 danneh_ amerrykan: downloaded both of those, let me know if you want me to upload them somewhere
08:05 🔗 danneh_ all three of those*
08:32 🔗 sharpobje hi, any way to find a particular twitch vod on the internet archive?
08:32 🔗 sharpobje I am looking for the (only?) vod saved for the channel leveluplive2, with highlight ID 1854389
08:42 🔗 Atluxity sharpobje: lurk around, someone will be with you eventually
10:53 🔗 bebzol hello! I'm developing seesaw and lua scripts for archiving ownlog.com service - could you create an ownlog-grab github repository for it?
12:06 🔗 Muad-Dib arkiver, ivan` ^
15:18 🔗 stevenola SketchCow: Looking for your opinion on something
15:19 🔗 stevenola I run artpacks.org, and I've had someone contact me asking for full packs that they've participated in to be removed from the archive
15:19 🔗 stevenola Because their art is being indexed by google, and contains their real name, mailing address, phone number, exgirlfriend names, etc
15:20 🔗 stevenola I've already added their art to robots.txt as a quick fix for this issue
15:20 🔗 stevenola And I have no intention to remove full packs
15:21 🔗 stevenola But I'm curious what you think about this situation
15:22 🔗 DFJustin I'd suggest adding a robots.txt rule to whitelist ia_archiver
15:22 🔗 balrog stevenola: how were these packs produced?
15:22 🔗 DFJustin because what you really care about is google etc
15:22 🔗 stevenola I've written (but not yet sent) an email describing what I did with robots.txt, and offered to censor their phone numbers and personal details. I think I'm willing to remove the specific art, but I'm still curious to hear your thoughts
15:22 🔗 balrog were they collected from other places?
15:22 🔗 balrog I'd censor the personal details, since that's what they're worried about
15:23 🔗 stevenola balrog: it's old artscene artwork. It was produced by the artists, published by an "artgroup" and distributed to many sources by the group via BBS, FTP and web.
15:23 🔗 raylee what'd i miss?
15:23 🔗 raylee damn bnc
15:25 🔗 aaaaaaaaa A part of me can't help but think that it is available elsewhere and they put it on the internet and they knew it was publicly posted
15:25 🔗 DFJustin raylee: http://badcheese.com/~steve/atlogs/?chan=archiveteam
15:25 🔗 balrog DFJustin: whitelist ia_archiver globally
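The global whitelist balrog and DFJustin suggest would look roughly like this in robots.txt — the /packs/ path is hypothetical, and ia_archiver is the Internet Archive's crawler user-agent:

```
# Allow the Internet Archive's crawler everything
User-agent: ia_archiver
Disallow:

# Block everyone else (Google etc.) from the sensitive packs
User-agent: *
Disallow: /packs/
```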
15:25 🔗 balrog aaaaaaaaa: hah, yeah
15:26 🔗 stevenola aaaaaaaaa: yes, artpacks were basically distributed in a "hey, here's the file. pass it around!" way
15:26 🔗 stevenola i understand the artist's concern
15:27 🔗 stevenola just looking for other perspectives or thoughts i haven't considered
15:28 🔗 aaaaaaaaa I'd whitelist archive.org though. No use in potentially deleting it forever and it won't show up unless you specifically look for it.
15:29 🔗 stevenola have i done it correctly? http://artpacks.org/robots.txt
15:30 🔗 stevenola SketchCow: Since you're familiar with the artscene, your thoughts are greatly appreciated (when you get a chance)
15:30 🔗 schbirid looks correct to me
15:31 🔗 schbirid if you want, someone from here could initiate a full crawl for archive.org
15:31 🔗 schbirid (while ignoring robots.txt)
15:31 🔗 aaaaaaaaa did you delete them too, I'm getting 404s
15:31 🔗 DFJustin stevenola: you might try emailing him instead since he seems to be afk
15:31 🔗 aaaaaaaaa on some
15:31 🔗 stevenola No worries about that. The actual content is all over. I think most of the pre-2004 content is on archive.org already
15:33 🔗 stevenola aaaaaaaaa: ah, my script generated the urls to be blocked incofrrectly :0
15:33 🔗 stevenola :)
15:33 🔗 stevenola goddamn this new keyboard
15:33 🔗 stevenola DFJustin: do you have a contact email for him?
15:35 🔗 DFJustin jason@textfiles.com
15:37 🔗 stevenola thank you!
17:19 🔗 joepie91 stevenola: I don't think IA is indexed by Google
17:19 🔗 joepie91 so if the concern is name find-ability, that shouldn't be an issue
17:19 🔗 joepie91 err
17:19 🔗 joepie91 IA is indexed
17:19 🔗 joepie91 I meant I don't think the wayback is indexed by Google *
17:21 🔗 SketchCow Boop.
17:22 🔗 SketchCow I never respond to those.
17:58 🔗 stevenola Ah. Maybe that would have been a good strategy
17:58 🔗 stevenola :)
17:59 🔗 stevenola "strategy"
18:10 🔗 signius stevenola, it's that thing: the Internet is written in INK, not pencil
18:12 🔗 stevenola preaching to the choir
18:12 🔗 signius :D
18:44 🔗 namespace So why does the warrior lose all its data on shutdown?
18:45 🔗 aaaaaaaaa it reformats on startup
18:45 🔗 namespace Yes but why.
18:46 🔗 Jonimus to make sure it has a clean slate such that the next run doesn't run into issues with space or leftover data.
18:46 🔗 namespace Oh well, I have to shut down my computer sometimes and feel incredibly guilty losing you guys 1.2 gigs of data.
18:47 🔗 chronomex you can hit the "suspend" button in virtualbox
18:47 🔗 aaaaaaaaa you can pause the virtual machine, it can usually start right back up where it left off.
18:47 🔗 chronomex yeah
18:47 🔗 namespace chronomex: Oh so that is how you're supposed to do it?
18:47 🔗 namespace Okay.
18:47 🔗 Jonimus If you tell the warrior to shut down using the web interface it will shutdown once the data is sent.
18:47 🔗 chronomex ^
18:47 🔗 chronomex but that can take a little while, depending
18:47 🔗 namespace Well that'll take way too long. :P
18:47 🔗 Jonimus yeah
18:48 🔗 Jonimus also, they usually do a few release claims towards the end of a project to make sure data lost in that manner is grabbed by someone else.
18:49 🔗 namespace I mean this seems like sort of a 'gotcha' to me and I feel like there's probably some better solution.
18:50 🔗 aaaaaaaaa Then just save the state when you close it and want to shut off the computer; unless I am missing your point.
18:50 🔗 Jonimus there is like a vbox setting to have it suspend rather than shutdown boxes when you shutdown
18:51 🔗 namespace aaaaaaaaa: My point is that this isn't intuitive for a first time user to know to do.
18:52 🔗 Jonimus which is why we have the release claims methodology.
18:52 🔗 namespace K.
18:52 🔗 yipdw also why it's good practice for items to not be too big
18:52 🔗 chronomex yeah
18:52 🔗 chronomex ten minutes of downloading is a nice number
18:53 🔗 namespace Well it doesn't help that my upload is like a soda straw compared to my download.
18:53 🔗 namespace I'm not sure if that's on my end or Archive.org's end.
18:53 🔗 Jonimus as it is for most home users.
18:54 🔗 Jonimus and many projects are uploaded to a staging server rather than directly to Archive.org
18:54 🔗 namespace I mean obviously the warrior is rate limiting, and it's very plausible that the staging server/etc has trouble with receiving the data as fast as it's being grabbed.
18:55 🔗 chronomex maybe we should investigate overlapping the upload and download phases
18:55 🔗 Jonimus that depends on the thing being grabbed, I know that was an issue for twitch as there were a large number of VPS's being used.
18:55 🔗 Jonimus They kinda already do overlap.
18:55 🔗 chronomex hm, ok
18:55 🔗 chronomex i'm not very up on things
18:56 🔗 Jonimus It starts the next download as it uploads the previous task.
18:56 🔗 chronomex oh, yeah, i guess it does
18:56 🔗 chronomex my bad
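The overlap Jonimus describes — starting the next download while the previous item uploads — can be sketched with a background upload thread. The download/upload callables here are stand-ins for illustration, not the real seesaw tasks:

```python
import threading

def process_items(items, download, upload):
    """Download each item; upload the previous one in the background.

    While item N+1 downloads, the upload thread for item N is still
    running, so the two phases overlap the way a seesaw pipeline does.
    """
    pending = None  # thread uploading the previous item, if any
    for item in items:
        data = download(item)       # grab the next item...
        if pending is not None:
            pending.join()          # ...while the last upload finishes
        pending = threading.Thread(target=upload, args=(item, data))
        pending.start()
    if pending is not None:
        pending.join()              # drain the final upload
```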
18:58 🔗 bebzol hi! anyone here can create me a github repository?
18:59 🔗 bebzol i need ownlog-grab to start a rescue of a blogging platform
18:59 🔗 bebzol i'm almost done with scripts and lua
19:02 🔗 namespace bebzol: What blogging platform?
19:02 🔗 yipdw ownlog
19:03 🔗 * namespace still wants to hit ravearchive
19:03 🔗 sharpobje hi, any way to find a particular twitch vod on the internet archive?
19:03 🔗 sharpobje I am looking for the (only?) vod saved for the channel leveluplive2, with highlight ID 1854389
19:03 🔗 bebzol its ownlog.com - a platform for about 45 000 blogs in Poland
19:03 🔗 bebzol it's rotting away as its owners don't seem to care
19:03 🔗 namespace bebzol: How can I help?
19:05 🔗 bebzol I can prepare seesaw script and lua (almost done). I've created a list of all items to download - just don't know what to do next ;). I suppose I should put this on github repository and send someone an item list (about 45 000 items - each item is a particular subdomain)
19:06 🔗 garyrh you can also create your own repo then transfer ownership over to archiveteam
19:08 🔗 bebzol this may be an idea
19:09 🔗 bebzol whom should I contact to do it?
19:10 🔗 garyrh probably yipdw or chfoo
19:10 🔗 aaaaaaaaa I'd make the repo, test it with your own tracker and then let us know when it is done. Then the admins will take a look.
19:11 🔗 bebzol all right
19:15 🔗 aaaaaaaaa I think most of the admins are currently taking care of their day jobs.
19:24 🔗 yipdw ^
19:24 🔗 yipdw also if someone knows of a good way to trace leaks in Tomcat's fucking connection pool that would be awesome
19:24 🔗 yipdw logAbandoned property seems to do jack shit
19:25 🔗 midas best way to trace stuff in tomcat is to shoot it with a tank.
19:25 🔗 yipdw not an option
19:26 🔗 yipdw bebzol: shoot me your github username, I'll get the repo and permissions set up
19:27 🔗 bebzol it's "basement-labs"
19:27 🔗 bebzol thanks in advance
19:28 🔗 midas yipdw: are you using eclipse for tracing yet?
19:29 🔗 yipdw midas: IntelliJ, but I suspect I can do something similar
19:29 🔗 yipdw the production application isn't configured with remote debugging etc though
19:29 🔗 yipdw I guess I could turn that on
19:29 🔗 yipdw anyway #-bs
19:29 🔗 midas yeah lets move over there
19:29 🔗 yipdw bebzol: invitation emailed, repo online
19:29 🔗 bebzol thx :)
20:26 🔗 bebzol does anyone know how to debug a pipeline script? I get info that wget failed - but no further info
21:25 🔗 dserodio bebzol I've never tried, but I know Python. did you try adding a "-v" to wget_args (around line 216) ?
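The change dserodio suggests is just adding the flag to the wget_args list in pipeline.py. A sketch — the surrounding list contents and the wget-lua path are typical placeholders, not bebzol's actual script:

```python
# Typical wget_args fragment from an ArchiveTeam-style pipeline.py.
# Adding '--verbose' (or '-v') makes wget log what it is doing, which
# helps when the pipeline only reports that wget failed.
wget_args = [
    './wget-lua',               # placeholder path to the wget binary
    '--verbose',                # the debugging flag dserodio suggests
    '--output-file', 'wget.log',
    '--tries', '20',
]
```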
21:26 🔗 bebzol unfortunately - no wget output is printed - that is the problem
21:26 🔗 bebzol but I've already resolved my problem :)
21:28 🔗 ersi_ It's probably because the exit/return code wasn't as the pipeline wished
21:33 🔗 bebzol nah, I didn't set variables in python - item_type and item_value. I suppose this is important :P
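For reference, item_type and item_value in these pipelines are conventionally split out of the tracker's item_name in the prepare step — which is why leaving them unset breaks the later tasks, as bebzol found. A hedged sketch (the 'blog:' naming scheme is an assumption for illustration):

```python
def prepare_item(item_name):
    """Split a tracker item name like 'blog:example' into the
    item_type/item_value pair the rest of the pipeline expects."""
    item_type, item_value = item_name.split(':', 1)
    return {'item_type': item_type, 'item_value': item_value}
```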
