#archiveteam 2014-01-30,Thu

Time Nickname Message
00:01 🔗 midas first thing first. bed.
01:03 🔗 chfoo does anyone know how to save all the comments/metadata from a youtube channel? dascottjr's channel is at risk of being suspended due to copyright strikes
01:20 🔗 bauruine midas, i think downloading petabytes of data (a lot(?) from adsl lines) is pretty hard.
02:42 🔗 dashcloud comments no, but youtube-dl looks like it handles most of the other metadata- descriptions and thumbnails
03:16 🔗 SketchCow ------------------------------------------
03:17 🔗 SketchCow Internet Archive is Rewriting Wayback Machine from Scratch in Python.
03:17 🔗 SketchCow Invites testing, comments, archiveteam bastardry: https://github.com/ikreymer/pywb
03:17 🔗 SketchCow ------------------------------------------
03:19 🔗 dashcloud whoa!
03:20 🔗 godane i also hope that wayback machine will be made easier for home usage
03:20 🔗 SketchCow that is what they are going for.
03:20 🔗 SketchCow The ability to start it up on a machine and start banging through downloaded WARCs
03:33 🔗 ilya hi, i'm @ikreymer on github. this is a brand new project to try to make wayback machine easy to use, as well as extensible. there's a lot more to do, but welcome any comments you have so far
04:58 🔗 xmc coolio
05:08 🔗 godane so i found xml data for cnet and cbsnews videos
05:20 🔗 godane so my xml dump of api.cnet.com is moving very fast
05:21 🔗 godane its already 12k+ downloaded
05:30 🔗 godane i'm fucking awesome
05:30 🔗 godane i'm grabbing 2003 cbsnews videos
05:51 🔗 xmc rad
06:34 🔗 arkiver SketchCow: are they also going to add more features to the wayback machine?
06:34 🔗 arkiver like
06:34 🔗 arkiver will there ever be some kind of search function in the wayback machine?
06:36 🔗 godane search would be nice
06:36 🔗 godane i want to be able to search stuff like media files
06:37 🔗 DFJustin I think the issue with search is the huge amount of indexing infrastructure that would be required
06:37 🔗 DFJustin I would love even like a basic filename search though
06:37 🔗 godane basic filename search would be great too
06:38 🔗 godane anyways i'm starting to upload more episodes of wilkow
06:39 🔗 SketchCow I think that people need to get into their heads that the Internet Archive is not a business.
06:39 🔗 SketchCow This isn't amazon and being able to go "fuck, man, make my purchases show up in pinterest, yo"
06:40 🔗 SketchCow If people want a feature, they should ask for it by providing code or links to code
06:40 🔗 SketchCow Otherwise the team moves at their pace
06:40 🔗 SketchCow Keeping the ship together is a task in itself. I sit in the infrastructure channel.
06:40 🔗 SketchCow Blowups every week
06:40 🔗 SketchCow It's like the last third of the titanic movie
06:41 🔗 SketchCow People falling and bouncing off decks, old people holding each other inside flooding cabins and crying
06:41 🔗 SketchCow Leo falling in love with kate
06:43 🔗 arkiver Haha, ok, I see
09:55 🔗 ersi SketchCow: Awesome rad re new dev Wayback
09:58 🔗 SketchCow Ilya wants help and information
10:02 🔗 ersi ikreymer: Any specific things? :-) I'm def willing to shake trees as well as help out.
10:06 🔗 ikreymer thanks! well, right now, just basic testing.. especially if you have local warc files that you have crawled.. it would be interesting to see if you can get them to replay in a local install of the new wayback
10:08 🔗 ikreymer there are instructions in the readme on how to generate cdx files for them.. the cdx_writer tool is maintained by another engineer at this point, but hopefully the steps are clear enough that it's possible to create cdx index files and test your warcs. hoping to improve that process
10:08 🔗 ikreymer to start, see if the deployment steps in the readme work, to at least replay the sample data
10:12 🔗 ikreymer I'm planning to add a lot more documentation of the project in the upcoming days/weeks, so far it supports basic replay of a typical page. there will be additional customizations involving javascript, and certain domain-specific sites. stay tuned!
10:14 🔗 ikreymer but at this point, any testing feedback would be really helpful!
10:20 🔗 midas1 my fiber! she is back!
10:27 🔗 ersi ikreymer: Sure thing - that sounds completely reasonable. I'll give it a spin :-)
10:35 🔗 Nemo_bis I *love* when I upload at 14 MiB/s to archive.org from Europe
10:35 🔗 Smiley 652333333 452
10:35 🔗 Smiley CAT, SRY
16:53 🔗 joepie91 red alert
16:53 🔗 joepie91 http://wallbase.cc/
16:53 🔗 joepie91 in the process of dying
16:53 🔗 joepie91 warriors activate
16:53 🔗 joepie91 cc yipdw and chfoo
16:53 🔗 joepie91 (yipdw: every time I read your nick I mentally think "yip yip yip", heh)
16:55 🔗 joepie91 "owner abandoned it, forums and new uploads have been shut off until owner returns, methinks owner won't return whatsoever"
16:56 🔗 joepie91 good news: incrementally ordered IDs
16:57 🔗 joepie91 bad news: appears javascript-heavy
17:01 🔗 arkiver http://wallbase.cc/forum already down?
17:02 🔗 joepie91 appears so
17:04 🔗 arkiver http://wallbase.cc/wallpaper/3033009
17:04 🔗 arkiver this one is the highest number
17:04 🔗 arkiver they go up to 3033009
17:05 🔗 anounyymi looks like i should try save copy of finnish gaming site www.peliplaneetta.net/
17:05 🔗 arkiver and they seem to start from 1000000
17:05 🔗 godane http://wallpapers.wallbase.cc/rozne/wallpaper-2903212.jpg
17:05 🔗 anounyymi www.peliplaneetta.net/tietokonepelit/uutiset/15112/ILMOITUS-Peliplaneetta-sulkeutuu/ (if someone is interested look it with google translate)
17:06 🔗 arkiver no!!
17:06 🔗 godane i think its just http://wallpapers.wallbase.cc/rozne/wallpaper-${id}.jpg for images
17:06 🔗 arkiver they don't start from 1000000
17:06 🔗 arkiver godane: depends on the kind of images
17:06 🔗 balrog who asked about aol?
17:06 🔗 balrog someone did
17:06 🔗 arkiver those are the numbers for rozne images
17:06 🔗 arkiver other images have other urls
17:07 🔗 arkiver example:
17:07 🔗 arkiver http://wallpapers.wallbase.cc/high-resolution/wallpaper-2119864.jpg
17:07 🔗 godane okd
17:09 🔗 godane maybe i will do a dump of rozne images
17:10 🔗 arkiver would be great
17:10 🔗 arkiver going to start multiple crawls
17:10 🔗 arkiver I'll see if I get banned or not
17:15 🔗 arkiver going to start 30 simultaneous crawls
17:15 🔗 arkiver 100.000 wallpapers per crawl
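
The split arkiver describes can be sketched roughly as below. This is only an illustration, not what arkiver actually ran: the helper name `split_id_range` is hypothetical, and starting at ID 1 is an assumption (the log only establishes that IDs go up to 3033009 and do not start at 1000000).

```python
def split_id_range(first_id, last_id, chunk_size):
    """Yield inclusive (start, end) ID ranges holding at most chunk_size IDs each."""
    start = first_id
    while start <= last_id:
        end = min(start + chunk_size - 1, last_id)
        yield (start, end)
        start = end + 1


# Assumption: IDs begin at 1. The upper bound 3033009 is the highest ID
# mentioned in the channel.
ranges = list(split_id_range(1, 3033009, 100_000))

for start, end in ranges:
    print(f"crawl seeds: http://wallbase.cc/wallpaper/{start} .. {end}")
```

Under that assumption, 30 crawls of 100,000 IDs stop at 3,000,000, so a 31st, shorter range is needed to reach 3033009.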
17:16 🔗 arkiver testing http://wallbase.cc/wallpaper/3033000
17:16 🔗 arkiver done
17:17 🔗 arkiver will upload the warc of http://wallbase.cc/wallpaper/3033000 now
17:17 🔗 arkiver so people can test it
17:18 🔗 arkiver done.
17:18 🔗 arkiver https://www.filepicker.io/api/file/BhHu8SK4SM2XwZWS3dGn
17:19 🔗 arkiver can someone please download and open the warc in the wayback machine?
17:19 🔗 arkiver (not on IA server since robots.txt are still blocking access)
17:19 🔗 arkiver please tell me how it turned out
17:24 🔗 joepie91 wat
17:24 🔗 joepie91 done?
17:24 🔗 joepie91 already?
17:24 🔗 joepie91 oh
17:24 🔗 joepie91 for one ID
17:25 🔗 nyu Lol done already
17:27 🔗 arkiver yes
17:27 🔗 arkiver a test for one ID
17:27 🔗 arkiver joepie91 can you test it?
17:27 🔗 arkiver nyu: no not done yet
17:27 🔗 arkiver just one ID
17:27 🔗 arkiver a test
17:28 🔗 arkiver to see how it turns out with the javascript and stuff
17:28 🔗 arkiver running 2 crawls for first 200.000 urls now
17:28 🔗 joepie91 arkiver: just a moment
17:28 🔗 joepie91 I forgot where I put warcviewer
17:28 🔗 arkiver joepie91: thanks
17:28 🔗 joepie91 (I'm a terrible archivist, heh)
17:28 🔗 arkiver haha
17:28 🔗 arkiver lol
17:28 🔗 joepie91 found it
17:30 🔗 arkiver cool
17:30 🔗 arkiver I'll hear the result from you
17:30 🔗 arkiver it is crawled with heritrix
17:30 🔗 arkiver same crawler IA uses
17:30 🔗 arkiver and it can even unpack and find urls in swf files
17:30 🔗 arkiver heritrix 3.3.0 version of 2014-01-28
17:31 🔗 joepie91 arkiver: it contains several wallpapers
17:31 🔗 joepie91 what was the original URL
17:31 🔗 joepie91 it seems odd for one ID to contain multiple wallpapers
17:31 🔗 arkiver http://wallbase.cc/wallpaper/3033000
17:31 🔗 arkiver I know
17:31 🔗 arkiver heritrix also downloads the urls that are linked to from that page
17:31 🔗 arkiver so let's say
17:32 🔗 arkiver 3033000 contains a link to 1234567
17:32 🔗 joepie91 arkiver: where do the others come from?
17:32 🔗 arkiver then it also downloads 1234567
17:32 🔗 joepie91 also, I should point out that I don't have a full-fledged wayback machine here
17:32 🔗 arkiver with the wallpaper from that page
17:32 🔗 joepie91 so I can't really test beyond "does it have these and these files"
17:32 🔗 joepie91 ah
17:32 🔗 arkiver ah well
17:32 🔗 arkiver I can see that too
17:32 🔗 joepie91 but why is it only a few then
17:32 🔗 arkiver however
17:32 🔗 joepie91 it should infinitely recurse
17:32 🔗 joepie91 because of "related wallpapers"
17:32 🔗 arkiver stopping other projects of mine
17:33 🔗 arkiver and giving all power to wallbase.cc
17:33 🔗 arkiver n
17:33 🔗 arkiver no*
17:33 🔗 arkiver as I said it downloads the pages next to that page too
17:33 🔗 arkiver which are linked to from that page
17:33 🔗 arkiver and it doesn't go further
17:33 🔗 arkiver and well
17:33 🔗 arkiver the test crawl of that one ID proves it
17:33 🔗 arkiver otherwise it wouldn't have finished
17:34 🔗 joepie91 also I should note that my internet is shitty and laggy
17:35 🔗 arkiver brb
17:39 🔗 joepie91 arkhive: oh, only one deep?
17:40 🔗 joepie91 hmm
18:03 🔗 arkiver arkiver*
18:05 🔗 arkiver so
18:05 🔗 arkiver going totally fine
18:05 🔗 arkiver not banned o r anything
18:05 🔗 arkiver or*
18:06 🔗 Konata_ that's good
18:06 🔗 arkiver if someone here can actually view a warc in a small wayback machine, please test this warc file:
18:06 🔗 arkiver https://www.filepicker.io/api/file/BhHu8SK4SM2XwZWS3dGn
18:06 🔗 arkiver and tell me how it looks
18:07 🔗 arkiver since there is a lot of javascript
18:11 🔗 yipdw arkiver: https://github.com/ArchiveTeam/warc-proxy <-- you can test it yourself
18:11 🔗 yipdw highly encouraged to do so, as well
18:11 🔗 arkiver I have windows
18:11 🔗 arkiver not sure if it will work there?
18:11 🔗 yipdw VirtualBox
18:11 🔗 arkiver ...
18:11 🔗 yipdw it should work in Windows, though
18:11 🔗 arkiver well
18:12 🔗 arkiver I really have zero experience with virtualbox or anything
18:12 🔗 arkiver I should start learning it
18:12 🔗 yipdw if you have a Python installation with all required libraries
18:12 🔗 arkiver but maybe someone else can test it for me right now
18:12 🔗 arkiver I will learn it yipdw, promised! :)
18:12 🔗 Konata_ If you know how to set up a computer, running virtualbox/vmware player is for the most part like clockwork
18:13 🔗 yipdw I'd recommend learning it now; there's no real point in continuing a grab if what you're grabbing is grossly incomplete and/or unreadable
18:13 🔗 arkiver no
18:13 🔗 yipdw and warc-proxy is a very good tool to figure that out
18:13 🔗 arkiver I know it is readable
18:13 🔗 yipdw how do you know it's readable if you haven't tested it?
18:13 🔗 arkiver the only thing I want to know is how it is turning out because of all the javascript
18:13 🔗 arkiver well
18:14 🔗 arkiver still know that thing with jason? when my files didn't seem to work?
18:14 🔗 yipdw there is, I suppose, a trivial definition of "readability" which means "your shit isn't corrupt"
18:14 🔗 yipdw but there is a higher standard that is not only possible but indeed is now feasible thanks to alard etc.
18:14 🔗 yipdw and I'm just saying "here's a tool that makes it possible, please use it"
18:15 🔗 arkiver back then they didn't show in the wayback machine because, as I later found out, the torrent I uploaded them with had spaces " "
18:15 🔗 arkiver I later uploaded some by hand
18:15 🔗 arkiver and those worked actually
18:15 🔗 arkiver totally fine
18:15 🔗 arkiver but I'll use that warc-proxy... ;)
18:16 🔗 yipdw in any case, I have looked at that WARC in warc-proxy
18:16 🔗 yipdw the pages look okay
18:17 🔗 yipdw but the full-size wallpapers do not appear to be in there
18:17 🔗 yipdw the zoom feature on the wallpapers does not seem to work
18:17 🔗 arkiver hmm
18:17 🔗 arkiver I think that's the javascript...
18:17 🔗 arkiver :(
18:18 🔗 yipdw you're also not fetching wallpapers.wallbase.cc/wallpapers/, it seems
18:18 🔗 yipdw oh, wait
18:18 🔗 yipdw there it is
18:18 🔗 yipdw ok
18:19 🔗 yipdw it appears that each page references a file at URL http://static.wallbase.cc/js/jquery-1.10.2.min.map
18:29 🔗 arkiver so
18:30 🔗 arkiver they are working apart from the zoom not working?
18:30 🔗 arkiver wish we had a way to do javascript well....
18:31 🔗 yipdw if the appropriate files are fetched, it will work
18:31 🔗 yipdw the only issue here is that a file is missing
18:35 🔗 arkiver well
18:35 🔗 arkiver I don't think I can change that
18:35 🔗 arkiver you mean the http://static.wallbase.cc/js/jquery-1.10.2.min.map url right?
18:35 🔗 arkiver I think that url is dynamic and created by javascript
18:36 🔗 arkiver but I'll see if I can download those manually
18:36 🔗 arkiver well
18:36 🔗 arkiver create the links and download those
18:38 🔗 yipdw that URL actually 404s out when you try to access it
18:38 🔗 yipdw I suspect there's something else going wrong
18:38 🔗 yipdw please do investigate
18:39 🔗 arkiver hmm
18:39 🔗 arkiver maybe if I go deeper into the external urls
18:58 🔗 joepie91 arkhive: .map is just a dev tools thing
18:58 🔗 joepie91 ignore any .map files
18:58 🔗 joepie91 it's not actually referenced in the page
19:06 🔗 arkiver oke
19:06 🔗 arkiver so that's not the problem
19:06 🔗 arkiver joepie91: do you think the zoom issue is a javascript thing?
19:07 🔗 joepie91 arkhive: probably, but no idea
19:08 🔗 joepie91 ideally compare the list of URLs in the warc with those in your browser
19:08 🔗 joepie91 (except for .map)
19:08 🔗 joepie91 and see if anything is missing
19:08 🔗 joepie91 it's possible that one .js imports another .js
19:08 🔗 joepie91 in that case a browser would get it, but heritrix wouldn't
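
The comparison joepie91 suggests can be sketched as below, assuming both URL lists have already been exported as plain strings (for example from the browser's network panel and from the WARC's index). The function name `missing_from_warc` and the `zoom.js` URL are hypothetical and used only for illustration; the other URLs appear in the log.

```python
from urllib.parse import urlsplit


def missing_from_warc(browser_urls, warc_urls):
    """Return URLs the browser requested that are absent from the WARC.

    URLs whose path ends in .map are skipped, since source maps are only
    requested by developer tools.
    """
    archived = set(warc_urls)
    missing = []
    for url in browser_urls:
        if urlsplit(url).path.endswith(".map"):
            continue
        if url not in archived:
            missing.append(url)
    return missing


browser_urls = [
    "http://wallbase.cc/wallpaper/3033000",
    "http://static.wallbase.cc/js/jquery-1.10.2.min.map",
    "http://static.wallbase.cc/js/zoom.js",  # hypothetical script, for illustration
]
warc_urls = [
    "http://wallbase.cc/wallpaper/3033000",
]

result = missing_from_warc(browser_urls, warc_urls)
print(result)
```

Anything left in the result is a candidate for the missing piece, such as a script that is only loaded by another script.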
19:12 🔗 arkiver joepie91: it's ARKIVER
19:12 🔗 joepie91 lol sorry
19:12 🔗 joepie91 arkhive comes first in completion
19:14 🔗 arkiver is there any indication how long wallbase will still be online?
19:16 🔗 joepie91 arkiver: none
19:16 🔗 joepie91 according to their own claims it will stay online
19:16 🔗 joepie91 but that seems unlikely
19:16 🔗 arkiver hmm
19:16 🔗 arkiver will also use another way to do the website
19:16 🔗 arkiver which is probably faster
19:16 🔗 arkiver as long as it doesn't crash
19:21 🔗 Dud1 So http://don.na/ is going to be shut down by yahoo...
19:31 🔗 arkiver don.na will be done in a few minutes
19:37 🔗 arkiver is it possible to do a wide crawl with heritrix?
19:37 🔗 arkiver does someone know how to do that?
19:40 🔗 joepie91 wide crawl as in?
19:41 🔗 arkiver just the internet
19:41 🔗 arkiver like alexa is doing
19:41 🔗 arkiver when I get faster internet
19:41 🔗 arkiver it would be cool to do that I think
19:41 🔗 Konata_ Howdy :)
19:42 🔗 arkiver het Konata_
19:42 🔗 arkiver hey*
19:42 🔗 arkiver crawl going fine
19:42 🔗 arkiver but
19:42 🔗 arkiver for some reason they pause every few minutes for some minutes
19:42 🔗 arkiver but well
19:42 🔗 arkiver they are going
19:42 🔗 arkiver that's what counts
19:42 🔗 Konata_ That's a good thing
19:42 🔗 Konata_ Yeah, it's the fact that it's working that matters
19:42 🔗 arkiver doing around 1 percent every 1-2 hours
19:42 🔗 arkiver not extremely much
19:43 🔗 arkiver should be done in around 4-8 days
19:43 🔗 arkiver and the owners say it won't go offline
19:43 🔗 arkiver so we may have time enough
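
As a quick check of the estimate above, the arithmetic can be written out as below. This is a rough sketch that assumes the rate stays constant at 1 percent per 1 to 2 hours; the function name `days_to_finish` is hypothetical.

```python
def days_to_finish(percent_per_step, hours_per_step):
    """Days needed to reach 100 percent at a constant rate."""
    steps = 100 / percent_per_step
    return steps * hours_per_step / 24


fastest = days_to_finish(1, 1)  # 1 percent every hour
slowest = days_to_finish(1, 2)  # 1 percent every two hours

print(f"between {fastest:.1f} and {slowest:.1f} days")
```

That gives roughly 4.2 to 8.3 days, in line with the "4-8 days" figure in the channel.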
19:43 🔗 Konata_ So they've been notified that we're doing this?
19:43 🔗 arkiver hehe
19:43 🔗 arkiver nope
19:43 🔗 arkiver they won't allow it
19:43 🔗 arkiver I think
19:44 🔗 Konata_ Oh well lol
19:44 🔗 arkiver looking at their robots.txt
19:44 🔗 arkiver their robots.txt hides almost everything from web crawlers
19:44 🔗 joepie91 <arkiver>just the internet
19:44 🔗 Dud1 Wouldn't they notice it though?
19:44 🔗 joepie91 lol
19:44 🔗 arkiver so why would they then allow us to crawl the website and ignore the robots.txt?
19:44 🔗 joepie91 framing this one on my wall
19:44 🔗 joepie91 "what are you archiving?" "just the internet"
19:44 🔗 joepie91 haha
19:44 🔗 arkiver I mean like the alexa crawls
19:45 🔗 arkiver or the IA web wide crawls
19:45 🔗 Konata_ They'll probably notice at one point or another
19:45 🔗 arkiver haha yep
19:45 🔗 arkiver then I just change my ip and start again
19:45 🔗 joepie91 Konata_: yes, but that point is usually the point where there are so many warriors running that you can't just block a single IP
19:45 🔗 arkiver or maybe we have already saved the full site by then
19:45 🔗 arkiver but remember
19:45 🔗 Konata_ wait, is this a warrior project right now?
19:45 🔗 arkiver the website won't be visible in the wayback machine till the real website is gone
19:45 🔗 joepie91 Konata_: not yet
19:45 🔗 Konata_ Ah okay
19:46 🔗 arkiver because of their robots.txt blocking it from being viewed in the wayback machine
19:47 🔗 Konata_ Surely they must think that a sudden interest in just purely random wallpapers is suspicious
19:48 🔗 arkiver haha yes
19:48 🔗 arkiver think so
20:11 🔗 SadDM Does anybody know: If archive.org takes something down... can the uploader still access it?
20:11 🔗 SadDM or does the uploader get any warning?
20:12 🔗 arkiver well
20:12 🔗 arkiver getting a warning or not depends on the upload of course....
20:14 🔗 SadDM erm, not "a warning": You shouldn't upload mp3s of that album that isn't out yet
20:15 🔗 SadDM but rather "warning" in the sense of: hey, we've got to take this thing down... you've got 24 hours
20:27 🔗 Nemo_bis well, archive.org is not your personal storage you know
20:27 🔗 Nemo_bis under DMCA you can oppose and bring the downtaker to court ;)
20:28 🔗 Dud1 Wait it isn't? :( *goes and removes 2tb of files*
20:35 🔗 DFJustin uploader doesn't get a warning and you can't access it anymore
20:38 🔗 SadDM That all makes sense... thanks guys
20:41 🔗 DFJustin you just get a notice that it was taken down
20:46 🔗 Jonimus For some definitions of "Taken Down" depending on the material.
20:47 🔗 DFJustin they do retain all the files still it just can't be accessed from the outside
21:20 🔗 arkiver going to bed now
21:27 🔗 arkiver before other people are going to do wallbase.cc
21:27 🔗 arkiver please let me do it
21:27 🔗 arkiver I just want to do that to have my first really big website done
21:32 🔗 godane ok
21:32 🔗 godane i'm still going to put up the first 100k ids
21:32 🔗 godane only cause i'm half way there
21:33 🔗 godane note its only a image grab based on the fixed wallpapers.wallbase.cc/rozne/wallpaper-$id.jpg urls
21:35 🔗 Smiley you wanna ping me a list of the rest godane ?
21:36 🔗 Smiley are they just 100,001 - 200,000 ?
21:47 🔗 godane i'm brute forcing the grab
21:48 🔗 godane i'm going through every number between 1 to 100000
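
The brute-force approach godane describes amounts to generating one candidate URL per ID. A minimal sketch follows, using the URL pattern quoted in the channel; the names `ROZNE_TEMPLATE` and `rozne_urls` are hypothetical, and this is not godane's actual script.

```python
# URL pattern as posted in the channel, with {id} standing in for the numeric ID.
ROZNE_TEMPLATE = "http://wallpapers.wallbase.cc/rozne/wallpaper-{id}.jpg"


def rozne_urls(first_id, last_id):
    """Yield one candidate image URL per ID in the inclusive range."""
    for wallpaper_id in range(first_id, last_id + 1):
        yield ROZNE_TEMPLATE.format(id=wallpaper_id)


# Write the first 100,000 candidates to a text file, one URL per line,
# so a downloader of your choice can read it.
with open("rozne-urls.txt", "w") as handle:
    for url in rozne_urls(1, 100000):
        handle.write(url + "\n")
```

Not every ID will exist under this path: as arkiver points out earlier, other image categories use other URLs (for example the `high-resolution` path), so many of these requests are expected to miss.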
22:01 🔗 arkhive 4 notifications that my name has been mentioned on IRC! omg.. oh wait.. it's arkiver they wanted. not me
22:02 🔗 arkhive oh btw
22:03 🔗 arkhive http://arstechnica.com/gadgets/2014/01/intel-closes-appup-its-pc-app-store-intel-had-a-pc-app-store/
22:03 🔗 arkhive don't know if it has been mentioned. but there.
22:05 🔗 arkhive http://software.intel.com/sites/landingpage/intelappup/
22:05 🔗 arkhive March 11th 2014
22:08 🔗 RedType ha. appup gave me a free meego intel tablet
22:08 🔗 RedType it was a piece of shit
23:34 🔗 dashcloud balrog: I asked about AOL- ping me when you're available
