#archiveteam 2013-11-15,Fri


Time Nickname Message
09:25 🔗 SketchCow http://instagram.com/p/gszkewiOJq/
09:45 🔗 midas1 lol SketchCow
11:32 🔗 SketchCow The uploading of what we got from blip.tv so far has been going well.
11:32 🔗 SketchCow Thanks again to joepie for the metadata extraction routines.
11:32 🔗 SketchCow Over 2,500 videos added
12:05 🔗 Nemo_bis SketchCow: did that kill s3 or something? It's been almost idle for the past 12 hours.
12:32 🔗 Nemo_bis I'm pushing some stuff too https://archive.org/services/collection-rss.php?collection=wikiteam
14:45 🔗 balrog it would be nice to archive http://vetusware.com but all the downloads are severely limited
14:46 🔗 balrog even with the highest "membership" you're limited to 10 downloads per day
15:20 🔗 midas1 balrog: and you need to have karma before getting there
15:20 🔗 balrog yes...
15:21 🔗 midas1 would be great to archive tho, they have a lot of old software
16:06 🔗 odie5533 How can I determine the boundary offsets for a warc.gz file in Python?
16:11 🔗 joepie91 odie5533: there are some WARC modules for Python that can do that afaik
16:12 🔗 Lord_Nigh 10 downloads per day? with 5 or 6 members at that level the site could be mirrored in a few months
16:13 🔗 odie5533 joepie91: There are. Sadly, I couldn't quite make sense of the code =/
16:13 🔗 joepie91 odie5533: you could just use them :)
16:14 🔗 odie5533 I am using them, but some are quite slow, and I was hoping to make a really lightweight fast version.
16:15 🔗 odie5533 it works, but it doesn't determine offsets, and now I'm worried that the others are slow simply because they determine offsets and that's the only reason mine's at all fast. But I don't actually know this for sure since I can't determine offsets in my code.
16:16 🔗 odie5533 initial testing showed mine was approximately 5x as fast as Hanzo warctools, but again, that could be 100% because of the offsets.
16:19 🔗 DFJustin Lord_Nigh: sounds like a job for a RECAP-type extension
16:20 🔗 anounyym1 http://wdl2.winworldpc.com/Abandonware%20Applications/PC/ there's some abandonware stuff
16:41 🔗 joepie91 odie5533: I'd imagine it's just looking for headers and storing the pos in the file
16:41 🔗 joepie91 other than having to read the full file, I can't see it being slow
17:59 🔗 odie5533 joepie91: finding the position in the file is not that easy; the compressed position is different from the uncompressed position
18:00 🔗 joepie91 odie5533: right... WARC itself however isn't compressed
18:00 🔗 odie5533 and I'm not sure how to find the compressed position
18:00 🔗 joepie91 so your issue is with gzip indexing rather than warc indexing :)
18:00 🔗 odie5533 joepie91: I am trying to find the offset in the compressed one. yes. gzip indexing.
18:01 🔗 joepie91 odie5533: http://stackoverflow.com/questions/9317281/how-to-get-a-random-line-from-within-a-gzip-compressed-file-in-python-without-re
18:02 🔗 joepie91 first answer may have relevant info
18:02 🔗 joepie91 wrt seeking in gzip
18:04 🔗 odie5533 Doesn't seem very helpful. I think there is a way to do it based on the fact that warc.gz files are concatenated .gz files.
18:06 🔗 joepie91 odie5533: huh?
18:07 🔗 joepie91 warc.gz is a gzipped WARC, not concatenated gzips?
18:07 🔗 odie5533 Each WARC record is gzipped individually
18:26 🔗 odie5533 joepie91: Seems like you can determine where an entry ended by using zlib's Decompress.unused_data.
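
A minimal sketch of the indexing approach odie5533 lands on, assuming a warc.gz made of independently gzipped records: feed compressed bytes to a zlib decompressor and treat Decompress.unused_data, which collects any input past the end of the current gzip member, as the start of the next record. The function name, filename handling, and chunk size are illustrative, not from the log.

    import zlib

    def gzip_member_offsets(path, chunk_size=64 * 1024):
        """Return the compressed byte offset of each gzip member in path."""
        offsets = [0]
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header/trailer
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)
        pos = 0  # total compressed bytes read from the file so far
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                d.decompress(chunk)  # output discarded; only boundaries matter
                pos += len(chunk)
                # once a member ends, leftover input lands in unused_data,
                # and that leftover is the start of the next member
                while d.unused_data:
                    leftover = d.unused_data
                    offsets.append(pos - len(leftover))
                    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
                    d.decompress(leftover)
        return offsets
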
20:41 🔗 joepie91 odie5533: where do I find the qt warc viewer?
20:42 🔗 zenguy_pc does anyone have a torrent for the videos deleted from conservatives.com ? just found out about this scandal
21:07 🔗 adam_arch List of URLs from Conan O'Brien 20th anniversary that will go away at the end of day today: http://archive.org/~adam/conan20.txt
21:08 🔗 SmileyG adam_arch: #archivebot - i'll be adding them
21:08 🔗 adam_arch I've used the wayback save feature to grab the pages, but the videos will be a bit more labor-intensive. They do seem to be mp4s eventually, so those could be piped through wayback /save as well, but as far as I can tell, they need user interaction to get the video URL
21:09 🔗 SketchCow Hi, gang.
21:09 🔗 adam_arch I was planning on just using firebug to grab the video URLs but I'm not sure I have the time to get it done by myself
21:09 🔗 SketchCow adam_arch works for the Internet Archive with me and wants to grab these Conan O'Brien videos before they're all deleted today.
21:10 🔗 SketchCow Can someone get on that? He is able to grab the website but we want the videos as well.
21:11 🔗 SmileyG ahhh
21:11 🔗 * SmileyG goes to find some people to throw at it
21:11 🔗 adam_arch sweet
21:12 🔗 SmileyG flu has blown my mind away but i can at least try shouting at people to do it
21:12 🔗 SketchCow In other news, I've uploaded 3000 blip.tv videos to archive today.
21:12 🔗 DFJustin $ ~/youtube-dl -g http://teamcoco.com/video/first-ten-years-of-conan-package-1
21:12 🔗 DFJustin http://cdn.teamcococdn.com/storage/2013-11/72871-Conan20_102813_Conan20_First10Comedy_Pkg1-1080p.mp4
21:12 🔗 DFJustin you should be able to do the same for the rest
21:12 🔗 SmileyG DFJustin: leave it with you then?
21:13 🔗 SketchCow Really? OK, I can do that.
21:13 🔗 DFJustin no I'm at work
21:13 🔗 SmileyG for x in $(cat ./file); do youtube-dl "$x"; done
21:13 🔗 SmileyG \o/ for bad use of cats
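
SmileyG's one-liner works for a simple URL list but is fragile under shell word-splitting. A hedged Python equivalent, assuming adam_arch's list has been saved locally as conan20.txt (the local filename is an assumption):

    import subprocess

    # Run youtube-dl on each URL in the Conan list, one URL per line.
    # Passing an argument list to subprocess avoids shell word-splitting.
    with open("conan20.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                subprocess.call(["youtube-dl", url])
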
21:14 🔗 SketchCow SmileyG: I would have gotten that one
21:14 🔗 SketchCow Do we have a best youtube-dl these days?
21:14 🔗 SketchCow Is https://yt-dl.org/downloads/2013.11.15.1/youtube-dl the best?
21:14 🔗 DFJustin I use http://rg3.github.io/youtube-dl/
21:15 🔗 DFJustin which I guess is the same
21:19 🔗 SketchCow [Teamcoco] 72965: Downloading data webpage
21:19 🔗 SketchCow [Teamcoco] 72965: Extracting information
21:19 🔗 SketchCow [Teamcoco] last-5-ln-guest-pkg01: Downloading webpage
21:19 🔗 SketchCow [download] 18.4% of 141.96MiB at 1.87MiB/s ETA 01:01
21:19 🔗 SketchCow [download] Destination: The Best of 'Late Night' Guests, Vol. 2-72965.mp4
21:19 🔗 SketchCow etc.
21:19 🔗 SketchCow adam_arch: OK, I got this. I'm downloading the videos more than the websites.
21:19 🔗 DFJustin well downloading them with youtube-dl doesn't get the warc goodness
21:20 🔗 SketchCow No. But we did scrape them into wayback today. Just the videos didn't go
21:20 🔗 SketchCow ERROR: unable to download video data: HTTP Error 400: Bad Request
21:20 🔗 SketchCow [Teamcoco] 72980: Downloading data webpage
21:20 🔗 SketchCow [Teamcoco] 72980: Extracting information
21:20 🔗 SketchCow [Teamcoco] in-the-year-2000: Downloading webpage
21:33 🔗 adam_arch found that one: cdn.teamcococdn.com/storage/2013-11/72980-IntheYear2000-XDCAM%20HD%201080i60%2050mbs-1000k.mp4
21:34 🔗 adam_arch wayback /save doesn't seem to like it though
22:17 🔗 odie5533 joepie91: on my github
22:17 🔗 odie5533 joepie91: https://github.com/odie5533/WarcQtViewer
22:19 🔗 odie5533 joepie91: where'd you hear about it? :)
22:32 🔗 odie5533 Sorta related to Archiveteam: How do you guys archive personal files?
22:36 🔗 nico_ hi
22:36 🔗 odie5533 hi
22:36 🔗 Marcelo hi
22:36 🔗 joepie91 odie5533: you mentioned it in here
22:36 🔗 joepie91 :P
22:36 🔗 joepie91 thanks
22:36 🔗 odie5533 So I did :)
22:36 🔗 odie5533 Sadly, it doesn't work well with large files, which is what I'm trying to fix
22:37 🔗 * joepie91 stares at .exe
22:37 🔗 joepie91 :P
22:37 🔗 joepie91 what is this kind of file? how do I open it?
22:37 🔗 odie5533 joepie91: just double click it!
22:38 🔗 nico_ hum a .exe
22:38 🔗 joepie91 odie5533: doesn't work
22:38 🔗 joepie91 :P
22:38 🔗 joepie91 :P :P :P
22:38 🔗 nico_ virt-sandbox su nobody -c 'wine /tmp/a.exe'
22:38 🔗 odie5533 Try clicking harder
22:38 🔗 joepie91 nico_: oh, that looks like it might work :D
22:38 🔗 odie5533 really weigh down on the mouse
22:38 🔗 joepie91 haha
22:38 🔗 nico_ joepie91: safety first ;)
22:38 🔗 yipdw odie5533: I tarsnap them
22:39 🔗 odie5533 yipdw: hmm?
22:39 🔗 odie5533 oh, personal file archiving
22:39 🔗 joepie91 my god the pyside pypi package is noisy
22:40 🔗 odie5533 joepie91: just so you know, it was actually a lot of work to get it all to compile into a single exe, which I think is important for creating a truly portable program that everyone can use.
22:40 🔗 yipdw that's Windows-ist
22:40 🔗 yipdw :P
22:46 🔗 odie5533 Yes, it works on any Windows
22:46 🔗 odie5533 I tried it in a sandbox running XP and it works fine
22:47 🔗 nico_ try it under windows 2000
22:47 🔗 nico_ :)
22:47 🔗 odie5533 probably will work
22:47 🔗 nico_ probably not
22:47 🔗 nico_ it lacks the APIs :(
22:47 🔗 nico_ everything requires at least windows xp sp2 nowadays
22:48 🔗 odie5533 I don't think I have a 98 sandbox running to test that
22:48 🔗 odie5533 so maybe it doesn't work on EVERY windows system, but all major operating systems currently in use
22:50 🔗 odie5533 joepie91: did you get it working?
22:50 🔗 nico_ odie5533: i want to run it under 2k on alpha
22:50 🔗 joepie91 odie5533: have to do other things first
22:50 🔗 joepie91 pyside is still building anyway
22:50 🔗 joepie91 29%
22:50 🔗 joepie91 slow build is slow
22:51 🔗 odie5533 joepie91: just apt-get install it?
22:51 🔗 nico_ http://www.xanthos.se/~joachim/PWS-Info.GIF
22:52 🔗 nico_ just for fun
22:53 🔗 xmc woop woop woop off-topic siren
22:53 🔗 xmc take it to #archiveteam-bs
22:55 🔗 * joepie91 woops xmc
23:17 🔗 odie5533 joepie91: I found a solution to getting the offsets in the gzip files. The regular Python Gzip library is written in Python and reads "members" one at a time, so it basically determines the offsets as it goes: http://hg.python.org/cpython/file/2.7/Lib/gzip.py
23:17 🔗 odie5533 Written in Python as opposed to e.g. the zlib module, which I believe is just a wrapper for the C/C++ zlib library.
23:18 🔗 joepie91 The user of the file doesn't have to worry about the compression,
23:18 🔗 joepie91 but random access is not allowed."""
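
For reference, the behavior odie5533 describes is easy to check directly: Python's gzip module steps through concatenated members transparently, so a per-record warc.gz decompresses as one continuous stream. The filename here is hypothetical:

    import gzip

    # GzipFile reads each gzip member in turn; a multi-member warc.gz
    # comes back as a single uncompressed byte stream.
    with gzip.open("file.warc.gz", "rb") as f:
        data = f.read()
    print(len(data))  # total uncompressed size across all members
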
23:43 🔗 nico_ anyone know if there's a way to make the wayback machine take a snapshot of a website now?
23:43 🔗 odie5533_ only of single pages
23:43 🔗 nico_ anything else i need to make a warc?
23:44 🔗 odie5533_ What do you mean?
23:45 🔗 nico_ i want to backup a whole domain
23:45 🔗 nico_ because its owner is dead
23:45 🔗 nico_ and the archive.org copy is way out of date
23:46 🔗 odie5533_ definitely want to crawl it then
23:46 🔗 odie5533_ If it's a small-ish site someone might run archivebot on it.
23:52 🔗 nico_ anyone have a working wget line?
23:52 🔗 nico_ i will try
23:53 🔗 nico_ wget --warc-file="sid" --warc-cdx --domains="domain.tld" -l inf -m -p -U "Mozilla/5.0 (Photon; U; QNX x86pc; en-US; rv:1.6) Gecko/20040429"
23:55 🔗 nico_ --random-wait -w 5 --retry-connrefused
23:55 🔗 nico_ -t 25
23:56 🔗 nico_ oh
23:56 🔗 nico_ http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
23:56 🔗 nico_ it is already on the wiki
23:57 🔗 ivan` what's the domain?
23:58 🔗 nico_ sid.rstack.org
