#archiveteam-bs 2016-08-23,Tue


Time Nickname Message
00:21 🔗 kristian_ has quit IRC (Leaving)
01:13 🔗 aschmitz Anyone have experience archiving Disqus fora? I count 55 for NPR after dropping those with "dev" or "stage" in the name.
01:24 🔗 r3c0d3x aschmitz: I was actually looking into this a bit already and I'll write up a GitHub gist about it in a minute. One note to preface all this: the comments will stay on Disqus for quite a while longer after NPR removes the embeds from their site. We probably don't need to rush on this.
01:26 🔗 aschmitz Yeah, it looked like that when I was digging into it a bit.
01:34 🔗 r3c0d3x aschmitz: Quickly threw this together, it has all the info I was able to gather: https://gist.github.com/r3c0d3x/ff33ff59bd2432a5a81a32669eb5a390
01:47 🔗 HCross has quit IRC (Ping timeout: 246 seconds)
01:47 🔗 HCross has joined #archiveteam-bs
01:59 🔗 aschmitz r3c0d3x: Cool, thanks. Added a bit in my fork: https://gist.github.com/aschmitz/19dfb67be5d0d71c74431074191062dc
02:10 🔗 tomwsmf has quit IRC (Read error: Operation timed out)
02:26 🔗 mr-b has left
03:14 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
03:15 🔗 BartoCH has joined #archiveteam-bs
04:09 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
04:12 🔗 BartoCH has joined #archiveteam-bs
04:17 🔗 JesseW has joined #archiveteam-bs
04:17 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:24 🔗 Sk1d has joined #archiveteam-bs
04:35 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
04:35 🔗 HCross has quit IRC (Ping timeout: 246 seconds)
04:35 🔗 HCross has joined #archiveteam-bs
04:43 🔗 DFJustin has quit IRC (Ping timeout: 260 seconds)
04:43 🔗 Meroje has quit IRC (Quit: bye!)
04:44 🔗 Meroje has joined #archiveteam-bs
04:53 🔗 DFJustin has joined #archiveteam-bs
04:53 🔗 swebb sets mode: +o DFJustin
05:05 🔗 DFJustin has quit IRC (Remote host closed the connection)
05:10 🔗 DFJustin has joined #archiveteam-bs
05:15 🔗 HCross has quit IRC (Read error: Operation timed out)
05:15 🔗 HCross has joined #archiveteam-bs
05:57 🔗 phuzion has quit IRC (Read error: Operation timed out)
05:58 🔗 phuzion has joined #archiveteam-bs
05:59 🔗 SketchCow Intense Floppy Grabs continue
05:59 🔗 JesseW That sounds like some kind of sex toy
06:00 🔗 JesseW Buy "Intense Floppy Grabs" today for deep, sensual pleasure!
06:05 🔗 phuzion has quit IRC (Read error: Operation timed out)
06:05 🔗 sep332 has quit IRC (Read error: Operation timed out)
06:05 🔗 midas1 has quit IRC (Read error: Operation timed out)
06:07 🔗 godane just know there are sex toys that are sending data back to the company
06:07 🔗 godane also this: http://www.dailydot.com/layer8/hackers-and-vibrators-oh-my/
06:07 🔗 midas1 has joined #archiveteam-bs
06:07 🔗 sep332 has joined #archiveteam-bs
06:09 🔗 JesseW thank you for that
06:10 🔗 godane you're welcome
06:11 🔗 godane i remember reading something about that and couldn't find the exact article
06:11 🔗 godane but that was close enough to it
06:13 🔗 BlueMaxim has joined #archiveteam-bs
06:13 🔗 phuzion has joined #archiveteam-bs
06:23 🔗 godane turns out the sploid.gizmodo.com sitemaps were big
06:23 🔗 godane i think about 10gb for all of it
06:24 🔗 godane maybe it's 9gb
06:24 🔗 godane but it's still big
07:03 🔗 Honno has joined #archiveteam-bs
07:11 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:31 🔗 REiN^ has quit IRC (Read error: Connection reset by peer)
07:33 🔗 phuzion has quit IRC (Read error: Operation timed out)
07:36 🔗 phuzion has joined #archiveteam-bs
07:52 🔗 schbirid has joined #archiveteam-bs
07:53 🔗 fie__ SketchCow, floppy grabs what?
08:03 🔗 Medowar ...floppy disks?
08:22 🔗 SketchCow Apple II floppies
08:27 🔗 SketchCow Work on http://fos.textfiles.com/pipeline.html began
08:27 🔗 SketchCow Lots to do
08:33 🔗 godane SketchCow: turns out we don't have all of gawker.com
08:33 🔗 SketchCow ?
08:33 🔗 godane or kotaku or lifehacker
08:33 🔗 SketchCow Really.
08:33 🔗 SketchCow Why?
08:33 🔗 godane dump sitemap
08:33 🔗 SketchCow Well, have they deleted it all now?
08:33 🔗 godane http://kotaku.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59
08:34 🔗 godane no they have not deleted it yet
08:35 🔗 godane i have noticed that the sitemap by date acts weird
08:35 🔗 godane but because i tested on maybe the 2005 or 2006 sitemaps, it looked like it had everything
08:37 🔗 REiN^ has joined #archiveteam-bs
08:41 🔗 godane ok the sitemap urls are funking with us
08:41 🔗 godane kotaku.com for 2008-11 (one above) has 3034 urls
08:42 🔗 godane but if you use gawker.com in its place you get 1971
08:43 🔗 godane so when i say the sitemap acts weird it does act weird
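One way to see what the two sitemaps disagree on, rather than only comparing the counts (3034 vs 1971), is to diff the URL lists. The sketch below is a hypothetical helper, not something used in the channel. It assumes that the same article keeps the same path and query string on both hosts, which the log does not confirm, so treat the result as a hint only.

```python
from urllib.parse import urlsplit


def path_key(url: str) -> str:
    """Drop scheme and host so URLs from different hosts can be compared.

    Assumption: an article has the same path on kotaku.com and gawker.com.
    """
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def compare_sitemaps(urls_a: list[str], urls_b: list[str]) -> dict:
    """Report which paths appear in only one of the two URL lists."""
    keys_a = {path_key(u) for u in urls_a}
    keys_b = {path_key(u) for u in urls_b}
    return {
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
        "both": len(keys_a & keys_b),
    }
```

If the "only" lists are mostly empty and the counts still differ, the path assumption is probably wrong and the hosts would need to be compared some other way.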
08:45 🔗 godane SketchCow: also i think archivebot went after gawker.com and other sites owned by gawker back in 2014 or 2015
08:45 🔗 godane so my sitemap grab may just be incomplete
08:48 🔗 godane even my sitemap grab of sploid.gizmodo.com is incomplete
08:48 🔗 godane :'(
08:48 🔗 SketchCow Dust yourself and go for it again
08:49 🔗 godane i'm doing that now
08:52 🔗 SketchCow I'm watching classic movies and ripping Apple II disks, and both are going swimmingly.
08:52 🔗 godane curl -s 'http://gawker.com/sitemap_bydate.xml?startTime=2008-11-01T00:00:00&endTime=2008-11-30T23:59:59' | sed 's|><|>\n<|g' | grep 'http' | sed 's|.*http://|http://|g' | sed 's|.*https://|http://|g' | sed "s|</image:loc>||g" | sed 's|]]>||g'
08:52 🔗 godane thats my code for grabbing the urls
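For comparison, here is a rough Python equivalent of the pipeline above. It is a sketch, not godane's script: it parses the XML instead of using sed, and it assumes the sitemap is well-formed and puts page and image URLs in `<loc>` and `<image:loc>` elements. The function name `extract_urls` is made up for illustration. Unlike the pipeline, it leaves `https://` URLs as they are instead of rewriting them to `http://`.

```python
import xml.etree.ElementTree as ET


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element (any namespace), in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            # The parser has already unwrapped CDATA and decoded &amp; etc.
            urls.append(element.text.strip())
    return urls
```

Fetching the sitemap (for example with `urllib.request.urlopen`) is left out so the parsing step can be checked on its own.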
08:54 🔗 godane after 2006-01 is done i will try to set up my script to attack each month of 2006 for gawker
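Walking every month of a year could look roughly like this. The sketch only builds the request URLs, following the `sitemap_bydate.xml?startTime=...&endTime=...` pattern pasted earlier in the log; the function name `monthly_sitemap_urls` is hypothetical, and whether the server honours every window is not something the log settles.

```python
import calendar


def monthly_sitemap_urls(host: str, year: int) -> list[str]:
    """Build one sitemap_bydate.xml URL per calendar month of the given year."""
    urls = []
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        urls.append(
            f"http://{host}/sitemap_bydate.xml"
            f"?startTime={year:04d}-{month:02d}-01T00:00:00"
            f"&endTime={year:04d}-{month:02d}-{last_day:02d}T23:59:59"
        )
    return urls
```

Using `calendar.monthrange` avoids hard-coding month lengths, so leap-year Februaries get the right end date.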
09:02 🔗 BartoCH has joined #archiveteam-bs
09:17 🔗 HCross has quit IRC (Ping timeout: 246 seconds)
09:17 🔗 HCross has joined #archiveteam-bs
09:34 🔗 GE has joined #archiveteam-bs
09:34 🔗 HCross2 has quit IRC (Quit: Connection closed for inactivity)
09:52 🔗 Selavi has quit IRC (Ping timeout: 260 seconds)
09:53 🔗 Kksmkrn has joined #archiveteam-bs
09:53 🔗 Kksmkrn has quit IRC (Connection closed)
09:53 🔗 Kksmkrn has joined #archiveteam-bs
10:00 🔗 Selavi has joined #archiveteam-bs
10:09 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
10:14 🔗 luckcolor someone may want to go after this https://twitter.com/antisec_ita/status/767856654486503424
10:16 🔗 BartoCH has joined #archiveteam-bs
10:23 🔗 divingk has quit IRC (ChatZilla 0.9.92 [Firefox 47.0/20160604131506])
10:31 🔗 Kksmkrn has quit IRC (Quit: leaving)
11:35 🔗 HCross has quit IRC (Ping timeout: 246 seconds)
11:35 🔗 HCross has joined #archiveteam-bs
11:44 🔗 atrocity https://www.reddit.com/r/Minecraft/comments/4z36un/mojangs_official_youtube_channel_was_suspended/
11:44 🔗 atrocity stay classy, youtube
11:47 🔗 joepie91 lol.
12:35 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
12:37 🔗 BartoCH has joined #archiveteam-bs
12:57 🔗 GE_ has joined #archiveteam-bs
12:59 🔗 GE has quit IRC (Ping timeout: 255 seconds)
12:59 🔗 GE_ is now known as GE
13:03 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:04 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:04 🔗 BartoCH has joined #archiveteam-bs
13:27 🔗 beardicus has quit IRC (bye)
13:28 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:31 🔗 beardicus has joined #archiveteam-bs
13:35 🔗 beardicus has quit IRC (Client Quit)
13:37 🔗 beardicus has joined #archiveteam-bs
13:45 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:45 🔗 BartoCH has joined #archiveteam-bs
13:46 🔗 wp494 has quit IRC (Read error: Operation timed out)
13:47 🔗 dashcloud has joined #archiveteam-bs
14:16 🔗 GE has quit IRC (Remote host closed the connection)
14:42 🔗 tomwsmf has joined #archiveteam-bs
14:47 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
15:15 🔗 wp494 has joined #archiveteam-bs
15:18 🔗 JesseW has joined #archiveteam-bs
15:25 🔗 JesseW has quit IRC (Read error: Operation timed out)
15:34 🔗 BartoCH has joined #archiveteam-bs
15:56 🔗 VADemon has joined #archiveteam-bs
16:03 🔗 GE has joined #archiveteam-bs
16:14 🔗 SketchCow http://fos.textfiles.com/pipeline.html is OK but needs another run! Which it will get shortly.
16:32 🔗 sep332 Does archive.org have a policy about which YouTube pages get saved?
16:32 🔗 sep332 "Gangnam Style" was getting saved like 6x per day https://web.archive.org/web/*/https://www.youtube.com/watch?v=9bZkp7q19f0
16:33 🔗 sep332 (it doesn't seem to have video data or any comments though)
16:33 🔗 SketchCow That's being worked on internally
16:33 🔗 sep332 by "policy" I mean for auto-crawling
16:33 🔗 sep332 ok
16:42 🔗 HCross2 has joined #archiveteam-bs
16:42 🔗 SketchCow The goal is in the future it will deduplicate these.
16:42 🔗 arkiver SketchCow: awesome!!
16:43 🔗 arkiver hmm
16:43 🔗 sep332 Is the video data being saved and just not rendered correctly, or plain not collected?
16:43 🔗 arkiver SketchCow: you can remove extratorrent from there
16:44 🔗 SketchCow I'll do additional work after I finish the script.
16:44 🔗 SketchCow A little time to go
16:44 🔗 arkiver ok
16:51 🔗 irl has joined #archiveteam-bs
16:51 🔗 irl SketchCow: hi
16:51 🔗 SketchCow Hiiiiiiiiii
16:51 🔗 irl hiiiiiiiiiiiiiiiiiii
16:51 🔗 xmc hi.
16:52 🔗 irl SketchCow: i hear you like manuals
16:52 🔗 SketchCow I do.
16:52 🔗 irl cool
16:52 🔗 SketchCow I heard you like scanning them
16:52 🔗 irl i have manuals
16:52 🔗 irl and a scanner coming
16:52 🔗 irl in the ebay post
16:52 🔗 SketchCow Try not to damage the originals too much and have a fantastic time.
16:53 🔗 SketchCow Scan at 600dpi TIFF files, put into either .ZIPs or into directories.
16:53 🔗 irl X.25 interface cards, network simulation software, and other things relevant to internet engineering
16:53 🔗 SketchCow I can give you an FTP drop
16:53 🔗 irl ok awesome
16:53 🔗 irl so i don't go directly to IA?
16:53 🔗 irl you'll help out with metadata maybe?
16:53 🔗 xmc you do your own metadata
16:53 🔗 xmc don't make SketchCow do it
16:53 🔗 irl hehe
16:53 🔗 xmc it's not that hard
16:54 🔗 irl got a link for how to do metadata in a nice format?
16:54 🔗 irl also, i have some reel-to-reel tapes and 8" floppies
16:54 🔗 SketchCow I can give general information.
16:54 🔗 xmc like, how to type in the title and date and author?
16:54 🔗 SketchCow In a best case, it's:
16:55 🔗 SketchCow Title, date of creation, creator (company or individual), and then a capsule description.
16:55 🔗 irl that seems reasonable
16:55 🔗 irl so i'm not understanding FTP drop then, because that sounds like i'm creating an IA collection
17:00 🔗 tomaspark has quit IRC (Ping timeout: 255 seconds)
17:03 🔗 HCross2 SketchCow: who should I contact if I need to change the payment method for an archive.org donation?
17:12 🔗 SketchCow mail info@archive.org
17:12 🔗 SketchCow irl: So there's two ways to upload
17:12 🔗 SketchCow You can upload yourself, or you can build up a pile of directories and I can give you an FTP drop and I shove them in.
17:12 🔗 SketchCow A collection can be made and you can work on it, but I can do that initial upload process using scripts. I find that helps for bulk uploaders.
17:13 🔗 VerifiedJ has joined #archiveteam-bs
17:21 🔗 irl SketchCow: ah ok cool
17:21 🔗 irl so how should the metadata be done within the pile of directories?
17:22 🔗 irl is there some json or yaml or something format?
17:24 🔗 SketchCow Whatever you're comfortable with, I can work with
17:28 🔗 irl SketchCow: i could do a csv like http://internetarchive.readthedocs.io/en/latest/cli.html#modifying-metadata-in-bulk
17:29 🔗 SketchCow Entirely up to you. However you want.
17:29 🔗 irl ok, just trying to find the easiest way for you
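A minimal sketch of what such a CSV could look like, using the fields SketchCow listed (title, date of creation, creator, capsule description) plus an `identifier` column, which is what the linked bulk-metadata docs appear to key rows on. The helper name `metadata_csv` and every value in the usage example are made up for illustration.

```python
import csv
import io

# Column order is an assumption; only "identifier" has a special role.
FIELDS = ["identifier", "title", "date", "creator", "description"]


def metadata_csv(items: list[dict]) -> str:
    """Render one CSV row per item; missing fields are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical item, not a real archive.org identifier.
    print(metadata_csv([{
        "identifier": "example-x25-card-manual",
        "title": "X.25 Interface Card Manual",
        "date": "1989",
        "creator": "Example Corp",
        "description": "Installation guide, scanned at 600dpi.",
    }]))
```

The `csv` module quotes any field that contains a comma, so free-text descriptions are safe to include.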
17:55 🔗 godane SketchCow: daily sitemaps of gawker.com is happening
17:55 🔗 SketchCow Great
17:56 🔗 godane i just hope the sitemap doesn't go crazy like before with the monthly ones
17:59 🔗 SketchCow Any amount of Gawker functioning right now is a gift.
17:59 🔗 SketchCow Or any of the properties.
17:59 🔗 SketchCow And when Univision steps in, it's going to be a bloodbath
18:00 🔗 SketchCow The Univision buy is so insane I'm assuming it's some corrupt reason we don't understand
18:00 🔗 SketchCow Or Denton invented some snow-job that Univision bought
18:59 🔗 bzc6p has joined #archiveteam-bs
18:59 🔗 swebb sets mode: +o bzc6p
19:00 🔗 bzc6p ErkDog: you reported that yahoo answers items get stuck. Do you use the new wget-lua?
19:01 🔗 bzc6p --version
19:01 🔗 bzc6p there is a new one from 20160530
19:19 🔗 VerifiedJ has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
19:52 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:58 🔗 BartoCH has joined #archiveteam-bs
21:04 🔗 HCross2 has quit IRC (Quit: Connection closed for inactivity)
21:12 🔗 hook54321 SketchCow: Is there a minimum education background requirement (other than experience) for jobs at the internet archive?
21:13 🔗 xmc you should apply
21:13 🔗 xmc http://archive.org/about/jobs.php
21:19 🔗 hook54321 I'm not located in California unfortunately. I would consider applying for many of them if I had more programming experience.
21:19 🔗 xmc then why are you asking?
21:29 🔗 hook54321 Wondering for possible jobs in the future, and not all of say that on-site presence is required.
21:31 🔗 hook54321 *all of them
21:40 🔗 Honno has quit IRC (Read error: Operation timed out)
22:07 🔗 schbirid2 has joined #archiveteam-bs
22:10 🔗 schbirid has quit IRC (Read error: Operation timed out)
22:13 🔗 whydomain has joined #archiveteam-bs
22:15 🔗 whydomain PurpleSym: what design DIY book scanner did you make? (I'm considering https://linearbookscanner.org/ )
22:25 🔗 * FalconK looks around
22:25 🔗 FalconK hey, look! https://archive.org/details/cbcnews201607-201608
22:26 🔗 xmc :)
22:26 🔗 xmc that's a lot of hourly news
22:33 🔗 RichardG has joined #archiveteam-bs
22:40 🔗 FalconK I've got a cronjob pulling it down every hour
22:42 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
22:45 🔗 schbirid2 has joined #archiveteam-bs
22:46 🔗 FalconK I started doing it mostly because I thought it provided an interesting perspective on Trump, and I noticed CBC didn't keep a public archive of them
22:46 🔗 xmc huh
22:46 🔗 FalconK they do *have* an archive of them
22:47 🔗 FalconK not sure how to access it. probably in person.
22:47 🔗 xmc :|
23:09 🔗 whydomain has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
23:16 🔗 JW_work1 has joined #archiveteam-bs
23:18 🔗 JW_work has quit IRC (Read error: Operation timed out)
23:23 🔗 RichardG has quit IRC (Read error: Operation timed out)
23:38 🔗 rchrch has joined #archiveteam-bs
23:45 🔗 kristian_ has joined #archiveteam-bs
23:48 🔗 RichardG has joined #archiveteam-bs
23:56 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
