[09:25] http://instagram.com/p/gszkewiOJq/
[09:45] lol SketchCow
[11:32] The uploading of what we got from blip.tv so far has been going well now.
[11:32] Thanks again to joepie for the metadata extraction routines.
[11:32] Over 2,500 videos added
[12:05] SketchCow: did that kill s3 or something? It's almost idle since 12 hours ago.
[12:32] I'm pushing some stuff too https://archive.org/services/collection-rss.php?collection=wikiteam
[14:45] it would be nice to archive http://vetusware.com but all the downloads are severely limited
[14:46] even with the highest "membership" you're limited to 10 downloads per day
[15:20] balrog: and you need to have karma first before getting there
[15:20] yes...
[15:21] would be great to archive tho, they have a lot of old software
[16:06] How can I determine the boundary offsets for a warc.gz file in Python?
[16:11] odie5533: there's some WARC modules for Python that can do that afaik
[16:12] 10 downloads per day? with 5 or 6 members at that level the site could be mirrored in a few months
[16:13] joepie91: There are. Sadly, I couldn't quite make sense of the code =/
[16:13] odie5533: you could just use them :)
[16:14] I am using them, but some are quite slow, and I was hoping to make a really lightweight, fast version.
[16:15] it works, but it doesn't determine offsets, and now I'm worried that the others are slow simply because they determine offsets and that's the only reason mine's at all fast. But I don't actually know this for sure since I can't determine offsets in my code.
[16:16] initial testing showed mine was approximately 5x as fast as Hanzo warctools, but again, that could be 100% because of the offsets.
[16:19] Lord_Nigh: sounds like a job for a RECAP-type extension
[16:20] http://wdl2.winworldpc.com/Abandonware%20Applications/PC/ there's some abandonware stuff
[16:41] odie5533: I'd imagine it's just looking for headers and storing the pos in the file
[16:41] other than having to read the full file, I can't see it being slow
[17:59] joepie91: finding the position in the file is not that easy; the compressed position is different than the uncompressed position
[18:00] odie5533: right... WARC itself however isn't compressed
[18:00] and I'm not sure how to find the compressed position
[18:00] so your issue is with gzip indexing rather than warc indexing :)
[18:00] joepie91: I am trying to find the offset in the compressed one. yes. gzip indexing.
[18:01] odie5533: http://stackoverflow.com/questions/9317281/how-to-get-a-random-line-from-within-a-gzip-compressed-file-in-python-without-re
[18:02] first answer may have relevant info
[18:02] wrt seeking in gzip
[18:04] Doesn't seem very helpful. I think there is a way to do it based on the fact that warc.gz files are concatenated .gz files.
[18:06] odie5533: huh?
[18:07] warc.gz is a gzipped WARC, not concatenated gzips?
[18:07] Each WARC record is gzipped individually
[18:26] joepie91: Seems like you can determine where an entry ended by using zlib's Decompress.unused_data.
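A minimal sketch of the unused_data approach described above (not from the channel; the function name and the whole-file read are my own assumptions):

    # Index the start offset of every gzip member in a .warc.gz, where each
    # WARC record is its own gzip member. Assumes the file fits in memory;
    # a streaming version would feed decompressobj in chunks instead.
    import zlib

    def gzip_member_offsets(path):
        with open(path, 'rb') as f:
            data = f.read()
        offsets = []
        pos = 0
        while pos < len(data):
            offsets.append(pos)
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip header
            d.decompress(data[pos:])
            # When a member ends, the bytes after it land in unused_data,
            # so the member's compressed length is whatever was consumed.
            consumed = (len(data) - pos) - len(d.unused_data)
            if consumed <= 0:
                break
            pos += consumed
        return offsets

    # e.g. print(gzip_member_offsets('example.warc.gz'))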
[20:41] odie5533: where do I find the qt warc viewer?
[20:42] does anyone have a torrent for the videos deleted from conservatives.com? just found out about this scandal
[21:07] List of URLs from Conan O'Brien 20th anniversary that will go away at the end of day today: http://archive.org/~adam/conan20.txt
[21:08] adam_arch: #archivebot - i'll be adding them
[21:08] I've used the wayback save feature to grab the pages, but the videos will be a bit more labor intensive. They do seem to be mp4s eventually, so those could be piped through wayback /save as well, but as far as I can tell, they need user interaction to get the video URL
[21:09] Hi, gang.
[21:09] I was planning on just using firebug to grab the video URLs but I'm not sure I have the time to get it done by myself
[21:09] adam_arch works for the Internet Archive with me and wants to grab these Conan O'Brien videos before they're all deleted today.
[21:10] Can someone get on that? He is able to grab the website but we want the videos as well.
[21:11] ahhh
[21:11] * SmileyG goes to find some people to throw at it
[21:11] sweet
[21:12] flu has blown my mind away but i can at least try shouting at people to do it
[21:12] In other news, I've uploaded 3000 blip.tv videos to archive today.
[21:12] $ ~/youtube-dl -g http://teamcoco.com/video/first-ten-years-of-conan-package-1
[21:12] http://cdn.teamcococdn.com/storage/2013-11/72871-Conan20_102813_Conan20_First10Comedy_Pkg1-1080p.mp4
[21:12] you should be able to do the same for the rest
[21:12] DFJustin: leave it with you then?
[21:13] Really? OK, I can do that.
[21:13] no I'm at work
[21:13] for x in $(cat ./file); do youtube-dl $x; done
[21:13] \o/ for bad use of cats
[21:14] SmileyG: I would have gotten that one
[21:14] Do we have a best youtube-dl these days?
[21:14] Is https://yt-dl.org/downloads/2013.11.15.1/youtube-dl the best?
[21:14] I use http://rg3.github.io/youtube-dl/
[21:15] which I guess is the same
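A sketch of the same batch job done through youtube-dl's Python interface instead of the shell loop above (not run in the channel; having youtube_dl installed and the urls.txt filename are assumptions):

    # Download every video URL listed in urls.txt (one per line) using the
    # youtube_dl module from the project linked above.
    import youtube_dl

    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    # Default-style naming, matching the "Title-id.mp4" destinations seen in the log.
    ydl = youtube_dl.YoutubeDL({'outtmpl': '%(title)s-%(id)s.%(ext)s'})
    ydl.download(urls)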
[21:19] [Teamcoco] 72965: Downloading data webpage
[21:19] [Teamcoco] 72965: Extracting information
[21:19] [Teamcoco] last-5-ln-guest-pkg01: Downloading webpage
[21:19] [download] 18.4% of 141.96MiB at 1.87MiB/s ETA 01:01
[21:19] [download] Destination: The Best of 'Late Night' Guests, Vol. 2-72965.mp4
[21:19] etc.
[21:19] adam_arch: OK, I got this. I'm downloading the videos more than the websites.
[21:19] well downloading them with youtube-dl doesn't get the warc goodness
[21:20] No. But we did scrape them into wayback today. Just the videos didn't go
[21:20] ERROR: unable to download video data: HTTP Error 400: Bad Request
[21:20] [Teamcoco] 72980: Downloading data webpage
[21:20] [Teamcoco] 72980: Extracting information
[21:20] [Teamcoco] in-the-year-2000: Downloading webpage
[21:33] found that one: cdn.teamcococdn.com/storage/2013-11/72980-IntheYear2000-XDCAM%20HD%201080i60%2050mbs-1000k.mp4
[21:34] wayback /save doesn't seem to like it though
[22:17] joepie91: on my github
[22:17] joepie91: https://github.com/odie5533/WarcQtViewer
[22:19] joepie91: where'd you hear about it? :)
[22:32] Sorta related to Archiveteam: How do you guys archive personal files?
[22:36] hi
[22:36] hi
[22:36] hi
[22:36] odie5533: you mentioned it in here
[22:36] :P
[22:36] thanks
[22:36] So I did :)
[22:36] Sadly, it doesn't work well with large files, which is what I'm trying to fix
[22:37] * joepie91 stares at .exe
[22:37] :P
[22:37] what is this kind of file? how do I open it?
[22:37] joepie91: just double click it!
[22:38] hum a .exe
[22:38] odie5533: doesn't work
[22:38] :P
[22:38] :P :P :P
[22:38] virt-sandbox su nobody -c 'wine /tmp/a.exe'
[22:38] Try clicking harder
[22:38] nico_: oh, that looks like it might work :D
[22:38] really weigh down on the mouse
[22:38] haha
[22:38] joepie91: safety first ;)
[22:38] odie5533: I tarsnap them
[22:39] yipdw: hmm?
[22:39] oh, personal file archiving
[22:39] my god the pyside pypi package is noisy
[22:40] joepie91: just so you know it was actually a lot of work to get it all to compile into a single exe. Which I think is important to creating a truly portable program that everyone can use.
[22:40] that's Windows-ist
[22:40] :P
[22:46] Yes, it works on any Windows
[22:46] I tried it in a sandbox running XP and it works fine
[22:47] try it under windows 2000
[22:47] :)
[22:47] probably will work
[22:47] probably not
[22:47] lack of APIs :(
[22:47] everything requires at least windows xp sp2 nowadays
[22:48] I don't think I have a 98 sandbox running to test that
[22:48] so maybe it doesn't work on EVERY windows system, but all major operating systems currently in use
[22:50] joepie91: did you get it working?
[22:50] odie5533: i want to run it under 2k on alpha
[22:50] odie5533: have to do other things first
[22:50] pyside is still building anyway
[22:50] 29%
[22:50] slow build is slow
[22:51] joepie91: just apt-get install it?
[22:51] http://www.xanthos.se/~joachim/PWS-Info.GIF
[22:52] just for fun
[22:53] woop woop woop off-topic siren
[22:53] take it to #archiveteam-bs
[22:55] * joepie91 woops xmc
[23:17] Written in Python as opposed to e.g. the zlib library, which I believe is just a wrapper for the C/C++ zlib library.
[23:17] joepie91: I found a solution to getting the offsets in the gzip files. The regular Python gzip library is written in Python and reads "members" one at a time, so it basically determines the offsets as it goes: http://hg.python.org/cpython/file/2.7/Lib/gzip.py
[23:18] The user of the file doesn't have to worry about the compression,
[23:18] but random access is not allowed."""
[23:43] anyone know if there is a means to make the wayback machine take a snapshot of a website now ?
[23:43] only of single pages
[23:43] anything else i need to make a warc ?
[23:44] What do you mean?
[23:45] i want to backup a whole domain
[23:45] because its owner is dead
[23:45] and the archive.org is way out of date
[23:46] definitely want to crawl it then
[23:46] If it's a small-ish site someone might run archivebot on it.
[23:52] anyone have a working wget line?
[23:52] i will try
[23:53] wget --warc-file="sid" --warc-cdx=sid --domains="domain.tld" -l inf -m -p -U "Mozilla/5.0 (Photon; U; QNX x86pc; en-US; rv:1.6) Gecko/20040429"
[23:55] --random-wait -w 5 --retry-connrefused
[23:55] -t 25
[23:56] ho
[23:56] http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
[23:56] it is already on the wiki
[23:57] what's the domain?
[23:58] sid.rstack.org
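A sketch of driving the same WARC grab from Python (assumptions: a wget build with WARC support is on PATH, the helper name is mine, and the long-option spellings map one-to-one to the short flags quoted above):

    # Mirror a whole domain into a WARC by shelling out to wget.
    import subprocess

    def mirror_to_warc(domain, warc_name, user_agent='Mozilla/5.0'):
        cmd = [
            'wget',
            '--mirror', '--page-requisites', '--level=inf',
            '--warc-file=' + warc_name,   # produces warc_name.warc.gz
            '--warc-cdx',                 # also write a CDX index
            '--domains=' + domain,
            '--user-agent=' + user_agent,
            '--random-wait', '--wait=5',
            '--retry-connrefused', '--tries=25',
            'http://' + domain + '/',
        ]
        return subprocess.call(cmd)

    # e.g. mirror_to_warc('sid.rstack.org', 'sid')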