09:25 &lt;SketchCow&gt; http://instagram.com/p/gszkewiOJq/
09:45 &lt;midas1&gt; lol SketchCow
11:32 &lt;SketchCow&gt; The uploading of what we got from blip.tv so far has been going well.
11:32 &lt;SketchCow&gt; Thanks again to joepie for the metadata extraction routines.
11:32 &lt;SketchCow&gt; Over 2,500 videos added
12:05 &lt;Nemo_bis&gt; SketchCow: did that kill s3 or something? It's been almost idle for 12 hours.
12:32 &lt;Nemo_bis&gt; I'm pushing some stuff too https://archive.org/services/collection-rss.php?collection=wikiteam
14:45 &lt;balrog&gt; it would be nice to archive http://vetusware.com but all the downloads are severely limited
14:46 &lt;balrog&gt; even with the highest "membership" you're limited to 10 downloads per day
15:20 &lt;midas1&gt; balrog: and you need to have karma first before getting there
15:20 &lt;balrog&gt; yes...
15:21 &lt;midas1&gt; would be great to archive though, they have a lot of old software
16:06 &lt;odie5533&gt; How can I determine the boundary offsets for a warc.gz file in Python?
16:11 &lt;joepie91&gt; odie5533: there are some WARC modules for Python that can do that afaik
16:12 &lt;Lord_Nigh&gt; 10 downloads per day? with 5 or 6 members at that level the site could be mirrored in a few months
16:13 &lt;odie5533&gt; joepie91: There are. Sadly, I couldn't quite make sense of the code =/
16:13 &lt;joepie91&gt; odie5533: you could just use them :)
16:14 &lt;odie5533&gt; I am using them, but some are quite slow, and I was hoping to make a really lightweight, fast version.
16:15 &lt;odie5533&gt; It works, but it doesn't determine offsets, and now I'm worried that the others are slow simply because they determine offsets, and that that's the only reason mine is fast at all. I don't actually know this for sure, since I can't determine offsets in my code.
16:16 &lt;odie5533&gt; Initial testing showed mine was approximately 5x as fast as Hanzo warctools, but again, that could be 100% because of the offsets.
16:19 &lt;DFJustin&gt; Lord_Nigh: sounds like a job for a RECAP-type extension
16:20 &lt;anounyym1&gt; http://wdl2.winworldpc.com/Abandonware%20Applications/PC/ there's some abandonware stuff
16:41 &lt;joepie91&gt; odie5533: I'd imagine it's just looking for headers and storing the position in the file
16:41 &lt;joepie91&gt; other than having to read the full file, I can't see it being slow
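joepie91's suggestion here (scan for record headers, store their file positions) can be sketched for an uncompressed WARC. Everything below, the sample records and the `record_offsets` helper, is an illustrative assumption rather than code from the channel; a robust indexer would honor each record's Content-Length instead of searching naively, since `WARC/1.0` could in principle appear inside a payload too.

```python
# Hypothetical sketch of "look for headers and store the pos in the file".
# rec1/rec2 are minimal stand-ins for real WARC records: header lines,
# a blank line ending the header block, then the two CRLF pairs that
# terminate a record.
rec1 = (b"WARC/1.0\r\n"
        b"WARC-Type: warcinfo\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n"          # end of header block
        b"\r\n\r\n")     # record terminator
rec2 = (b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n"
        b"\r\n\r\n")
data = rec1 + rec2

def record_offsets(buf):
    """Return the byte offset of every WARC version line in buf."""
    offsets = []
    pos = buf.find(b"WARC/1.0")
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(b"WARC/1.0", pos + 1)
    return offsets
```

On `data` this returns `[0, len(rec1)]`; for a real multi-gigabyte WARC you would read in chunks rather than holding the whole file in memory.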
17:59 &lt;odie5533&gt; joepie91: finding the position in the file is not that easy; the compressed position is different from the uncompressed position
18:00 &lt;joepie91&gt; odie5533: right... WARC itself however isn't compressed
18:00 &lt;odie5533&gt; and I'm not sure how to find the compressed position
18:00 &lt;joepie91&gt; so your issue is with gzip indexing rather than warc indexing :)
18:01 &lt;odie5533&gt; joepie91: I am trying to find the offset in the compressed one, yes. gzip indexing.
18:02 &lt;joepie91&gt; odie5533: http://stackoverflow.com/questions/9317281/how-to-get-a-random-line-from-within-a-gzip-compressed-file-in-python-without-re
18:02 &lt;joepie91&gt; first answer may have relevant info
18:02 &lt;joepie91&gt; wrt seeking in gzip
18:04 &lt;odie5533&gt; Doesn't seem very helpful. I think there is a way to do it based on the fact that warc.gz files are concatenated .gz files.
18:06 &lt;joepie91&gt; odie5533: huh?
18:07 &lt;joepie91&gt; warc.gz is a gzipped WARC, not concatenated gzips?
18:07 &lt;odie5533&gt; Each WARC record is gzipped individually
18:26 &lt;odie5533&gt; joepie91: Seems like you can determine where an entry ended by using zlib's Decompress.unused_data.
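The `unused_data` idea can be sketched as follows; the in-memory two-record file and the `member_offsets` helper are assumptions for illustration. Each record is compressed as its own gzip member, so whatever bytes a `zlib` decompressor leaves in `unused_data` after one member ends must belong to the next:

```python
import gzip
import zlib

# Build a tiny in-memory "warc.gz": each record is its own gzip member,
# as record-at-time compressed WARCs are laid out. The record bodies are
# placeholders.
records = [b"WARC/1.0\r\nrecord one\r\n\r\n",
           b"WARC/1.0\r\nrecord two\r\n\r\n"]
parts = [gzip.compress(r) for r in records]
blob = b"".join(parts)

def member_offsets(buf):
    """Return the compressed byte offset at which each gzip member starts."""
    offsets = []
    pos = 0
    while pos < len(buf):
        offsets.append(pos)
        # wbits=16+MAX_WBITS tells zlib to expect a gzip header/trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        d.decompress(buf[pos:])
        # Whatever the decompressor did not consume starts the next member.
        pos = len(buf) - len(d.unused_data)
    return offsets
```

Feeding the whole remaining buffer to each decompressor keeps the sketch short but is quadratic; an indexer meant for large files would feed fixed-size chunks and watch `d.eof` instead.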
20:41 &lt;joepie91&gt; odie5533: where do I find the Qt warc viewer?
20:42 &lt;zenguy_pc&gt; does anyone have a torrent for the videos deleted from conservatives.com? just found out about this scandal
21:07 &lt;adam_arch&gt; List of URLs from Conan O'Brien 20th anniversary that will go away at the end of day today: http://archive.org/~adam/conan20.txt
21:08 &lt;SmileyG&gt; adam_arch: #archivebot - I'll be adding them
21:08 &lt;adam_arch&gt; I've used the wayback save feature to grab the pages, but the videos will be a bit more labor-intensive. They do seem to be mp4s eventually, so those could be piped through wayback /save as well, but as far as I can tell they need user interaction to get the video URL
21:09 &lt;SketchCow&gt; Hi, gang.
21:09 &lt;adam_arch&gt; I was planning on just using Firebug to grab the video URLs, but I'm not sure I have the time to get it done by myself
21:09 &lt;SketchCow&gt; adam_arch works for the Internet Archive with me and wants to grab these Conan O'Brien videos before they're all deleted today.
21:10 &lt;SketchCow&gt; Can someone get on that? He is able to grab the website but we want the videos as well.
21:11 &lt;SmileyG&gt; ahhh
21:11 * SmileyG goes to find some people to throw at it
21:11 &lt;adam_arch&gt; sweet
21:12 &lt;SmileyG&gt; flu has blown my mind away but I can at least try shouting at people to do it
21:12 &lt;SketchCow&gt; In other news, I've uploaded 3000 blip.tv videos to archive today.
21:12 &lt;DFJustin&gt; $ ~/youtube-dl -g http://teamcoco.com/video/first-ten-years-of-conan-package-1
21:12 &lt;DFJustin&gt; http://cdn.teamcococdn.com/storage/2013-11/72871-Conan20_102813_Conan20_First10Comedy_Pkg1-1080p.mp4
21:12 &lt;DFJustin&gt; you should be able to do the same for the rest
21:12 &lt;SmileyG&gt; DFJustin: leave it with you then?
21:13 &lt;SketchCow&gt; Really? OK, I can do that.
21:13 &lt;DFJustin&gt; no, I'm at work
21:13 &lt;SmileyG&gt; for x in $(cat ./file); do youtube-dl $x; done
21:13 &lt;SmileyG&gt; \o/ for bad use of cats
21:14 &lt;SketchCow&gt; SmileyG: I would have gotten that one
21:14 &lt;SketchCow&gt; Do we have a best youtube-dl these days?
21:14 &lt;SketchCow&gt; Is https://yt-dl.org/downloads/2013.11.15.1/youtube-dl the best?
21:14 &lt;DFJustin&gt; I use http://rg3.github.io/youtube-dl/
21:15 &lt;DFJustin&gt; which I guess is the same
21:19 &lt;SketchCow&gt; [Teamcoco] 72965: Downloading data webpage
21:19 &lt;SketchCow&gt; [Teamcoco] 72965: Extracting information
21:19 &lt;SketchCow&gt; [Teamcoco] last-5-ln-guest-pkg01: Downloading webpage
21:19 &lt;SketchCow&gt; [download] 18.4% of 141.96MiB at 1.87MiB/s ETA 01:01
21:19 &lt;SketchCow&gt; [download] Destination: The Best of 'Late Night' Guests, Vol. 2-72965.mp4
21:19 &lt;SketchCow&gt; etc.
21:19 &lt;SketchCow&gt; adam_arch: OK, I got this. I'm downloading the videos more than the websites.
21:19 &lt;DFJustin&gt; well, downloading them with youtube-dl doesn't get the warc goodness
21:20 &lt;SketchCow&gt; No. But we did scrape them into wayback today. Just the videos didn't go
21:20 &lt;SketchCow&gt; ERROR: unable to download video data: HTTP Error 400: Bad Request
21:20 &lt;SketchCow&gt; [Teamcoco] 72980: Downloading data webpage
21:20 &lt;SketchCow&gt; [Teamcoco] 72980: Extracting information
21:20 &lt;SketchCow&gt; [Teamcoco] in-the-year-2000: Downloading webpage
21:33 &lt;adam_arch&gt; found that one: cdn.teamcococdn.com/storage/2013-11/72980-IntheYear2000-XDCAM%20HD%201080i60%2050mbs-1000k.mp4
21:34 &lt;adam_arch&gt; wayback /save doesn't seem to like it though
22:17 &lt;odie5533&gt; joepie91: on my github
22:17 &lt;odie5533&gt; joepie91: https://github.com/odie5533/WarcQtViewer
22:19 &lt;odie5533&gt; joepie91: where'd you hear about it? :)
22:32 &lt;odie5533&gt; Sorta related to Archiveteam: how do you guys archive personal files?
22:36 &lt;nico_&gt; hi
22:36 &lt;odie5533&gt; hi
22:36 &lt;Marcelo&gt; hi
22:36 &lt;joepie91&gt; odie5533: you mentioned it in here
22:36 &lt;joepie91&gt; :P
22:36 &lt;joepie91&gt; thanks
22:36 &lt;odie5533&gt; So I did :)
22:36 &lt;odie5533&gt; Sadly, it doesn't work well with large files, which is what I'm trying to fix
22:37 * joepie91 stares at .exe
22:37 &lt;joepie91&gt; :P
22:37 &lt;joepie91&gt; what is this kind of file? how do I open it?
22:38 &lt;odie5533&gt; joepie91: just double click it!
22:38 &lt;nico_&gt; hum, a .exe
22:38 &lt;joepie91&gt; odie5533: doesn't work
22:38 &lt;joepie91&gt; :P
22:38 &lt;joepie91&gt; :P :P :P
22:38 &lt;nico_&gt; virt-sandbox su nobody -c 'wine /tmp/a.exe'
22:38 &lt;odie5533&gt; Try clicking harder
22:38 &lt;joepie91&gt; nico_: oh, that looks like it might work :D
22:38 &lt;odie5533&gt; really weigh down on the mouse
22:38 &lt;joepie91&gt; haha
22:38 &lt;nico_&gt; joepie91: safety first ;)
22:39 &lt;yipdw&gt; odie5533: I tarsnap them
22:39 &lt;odie5533&gt; yipdw: hmm?
22:39 &lt;odie5533&gt; oh, personal file archiving
22:40 &lt;joepie91&gt; my god, the pyside pypi package is noisy
22:40 &lt;odie5533&gt; joepie91: just so you know, it was actually a lot of work to get it all to compile into a single exe, which I think is important for creating a truly portable program that everyone can use.
22:40 &lt;yipdw&gt; that's Windows-ist
22:46 &lt;yipdw&gt; :P
22:46 &lt;odie5533&gt; Yes, it works on any Windows
22:47 &lt;odie5533&gt; I tried it in a sandbox running XP and it works fine
22:47 &lt;nico_&gt; try it under Windows 2000
22:47 &lt;nico_&gt; :)
22:47 &lt;odie5533&gt; probably will work
22:47 &lt;nico_&gt; probably not
22:47 &lt;nico_&gt; it lacks the APIs :(
22:48 &lt;nico_&gt; everything requires at least Windows XP SP2 nowadays
22:48 &lt;odie5533&gt; I don't think I have a 98 sandbox running to test that
22:50 &lt;odie5533&gt; so maybe it doesn't work on EVERY Windows system, but it works on all the major ones currently in use
22:50 &lt;odie5533&gt; joepie91: did you get it working?
22:50 &lt;nico_&gt; odie5533: I want to run it under 2k on Alpha
22:50 &lt;joepie91&gt; odie5533: have to do other things first
22:50 &lt;joepie91&gt; pyside is still building anyway
22:50 &lt;joepie91&gt; 29%
22:51 &lt;joepie91&gt; slow build is slow
22:51 &lt;odie5533&gt; joepie91: just apt-get install it?
22:52 &lt;nico_&gt; http://www.xanthos.se/~joachim/PWS-Info.GIF
22:53 &lt;nico_&gt; just for fun
22:53 &lt;xmc&gt; woop woop woop off-topic siren
22:55 &lt;xmc&gt; take it to #archiveteam-bs
23:17 * joepie91 woops xmc
23:17 &lt;odie5533&gt; Written in Python, as opposed to e.g. the zlib library, which I believe is just a wrapper around the C/C++ zlib library.
23:18 &lt;odie5533&gt; joepie91: I found a solution to getting the offsets in the gzip files. The regular Python gzip library is written in Python and reads "members" one at a time, so it basically determines the offsets as it goes: http://hg.python.org/cpython/file/2.7/Lib/gzip.py
23:18 &lt;joepie91&gt; The user of the file doesn't have to worry about the compression,
23:43 &lt;joepie91&gt; but random access is not allowed."""
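That docstring caveat applies to one long gzip stream; the per-record-gzipped layout discussed earlier is what restores random access. A hedged sketch (the sample data and the `read_member` helper are assumptions for illustration): once an index of compressed member offsets exists, a single record can be decompressed without touching the rest of the file.

```python
import gzip
import zlib

# Placeholder records, each compressed as a separate gzip member.
records = [b"first record", b"second record", b"third record"]
parts = [gzip.compress(r) for r in records]
blob = b"".join(parts)

# The index: member number -> offset into the compressed file.
offsets, pos = [], 0
for p in parts:
    offsets.append(pos)
    pos += len(p)

def read_member(buf, offset):
    """Decompress only the gzip member starting at `offset`."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect gzip framing
    # A decompressor stops at the end of the first gzip stream it sees,
    # so only this member is inflated; later members land in unused_data.
    return d.decompress(buf[offset:])
```

With this, `read_member(blob, offsets[1])` inflates just the second record, which is the whole point of indexing a warc.gz.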
23:43 &lt;nico_&gt; anyone know if there is a way to make the Wayback Machine take a snapshot of a website now?
23:43 &lt;odie5533_&gt; only of single pages
23:44 &lt;nico_&gt; is there anything else I need to make a WARC?
23:45 &lt;odie5533_&gt; What do you mean?
23:45 &lt;nico_&gt; I want to back up a whole domain
23:45 &lt;nico_&gt; because its owner is dead
23:46 &lt;nico_&gt; and the archive.org copy is way out of date
23:46 &lt;odie5533_&gt; definitely want to crawl it then
23:46 &lt;odie5533_&gt; If it's a small-ish site someone might run archivebot on it.
23:52 &lt;nico_&gt; anyone have a working wget line?
23:52 &lt;nico_&gt; I will try
23:53 &lt;nico_&gt; wget --warc-file="sid" --warc-cdx=sid --domains="domain.tld" -l inf -m -p -U "Mozilla/5.0 (Photon; U; QNX x86pc; en-US; rv:1.6) Gecko/20040429"
23:55 &lt;nico_&gt; --random-wait -w 5 --retry-connrefused
23:55 &lt;nico_&gt; -t 25
23:56 &lt;nico_&gt; ho
23:56 &lt;nico_&gt; http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
23:56 &lt;nico_&gt; it is already on the wiki
23:57 &lt;ivan`&gt; what's the domain?
23:58 &lt;nico_&gt; sid.rstack.org