#archiveteam 2016-08-03,Wed


Time Nickname Message
00:01 🔗 DoomTay has quit IRC (Quit: Page closed)
00:05 🔗 BlueMaxim has joined #archiveteam
00:26 🔗 nightpool has joined #archiveteam
00:34 🔗 tomwsmf has quit IRC (Read error: Operation timed out)
00:47 🔗 DiscantX has joined #archiveteam
01:06 🔗 Lord_Nigh SketchCow: it is in /0/cdrom/
01:14 🔗 JesseW has joined #archiveteam
01:19 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
01:38 🔗 atrocity has joined #archiveteam
01:45 🔗 pguth_ has quit IRC (Remote host closed the connection)
01:45 🔗 pguth_ has joined #archiveteam
01:45 🔗 pguth_ has quit IRC (Remote host closed the connection)
01:45 🔗 pguth_ has joined #archiveteam
01:46 🔗 jmad980 has quit IRC (Ping timeout: 250 seconds)
01:48 🔗 jmad980 has joined #archiveteam
01:54 🔗 tomwsmf has joined #archiveteam
02:11 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
02:25 🔗 pguth_ has quit IRC (Remote host closed the connection)
02:25 🔗 pguth_ has joined #archiveteam
02:48 🔗 pguth_ has quit IRC (Remote host closed the connection)
02:48 🔗 pguth_ has joined #archiveteam
02:48 🔗 ndiddy has quit IRC (Leaving)
02:49 🔗 ndiddy has joined #archiveteam
03:10 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
03:11 🔗 RichardG has joined #archiveteam
03:13 🔗 tomwsmf has quit IRC (Ping timeout: 258 seconds)
03:13 🔗 RichardG has quit IRC (Read error: Connection timed out)
03:14 🔗 RichardG has joined #archiveteam
03:26 🔗 zenguy has quit IRC (Ping timeout: 370 seconds)
03:28 🔗 robink has quit IRC (Ping timeout: 260 seconds)
03:29 🔗 zenguy has joined #archiveteam
03:35 🔗 zenguy has quit IRC (Read error: Operation timed out)
03:42 🔗 ndiddy has quit IRC (Leaving)
03:46 🔗 zenguy has joined #archiveteam
03:57 🔗 robink has joined #archiveteam
03:59 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
04:28 🔗 RichardG has quit IRC (Read error: Operation timed out)
04:28 🔗 RichardG has joined #archiveteam
04:32 🔗 JesseW has joined #archiveteam
04:33 🔗 zenguy has quit IRC (Read error: Operation timed out)
04:36 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:38 🔗 zenguy has joined #archiveteam
04:45 🔗 Sk1d has joined #archiveteam
04:49 🔗 pguth_ has quit IRC (Remote host closed the connection)
04:49 🔗 pguth_ has joined #archiveteam
05:06 🔗 zenguy has quit IRC (Ping timeout: 246 seconds)
06:03 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:12 🔗 Stiletto has quit IRC ()
06:27 🔗 Atom-- has quit IRC (Ping timeout: 190 seconds)
06:50 🔗 jmad980 has quit IRC (Remote host closed the connection)
07:37 🔗 Stiletto has joined #archiveteam
07:52 🔗 Discant has joined #archiveteam
07:54 🔗 Honno has joined #archiveteam
07:56 🔗 DiscantX has quit IRC (Read error: Operation timed out)
08:15 🔗 schbirid has joined #archiveteam
08:24 🔗 Rondom_ has quit IRC (Remote host closed the connection)
08:24 🔗 Rondom has joined #archiveteam
08:27 🔗 MMovie1 has joined #archiveteam
08:28 🔗 MMovie has quit IRC (Read error: Operation timed out)
08:29 🔗 DiscantX has joined #archiveteam
08:32 🔗 Discant has quit IRC (Read error: Operation timed out)
08:59 🔗 Discant has joined #archiveteam
09:04 🔗 DiscantX has quit IRC (Ping timeout: 501 seconds)
09:24 🔗 pguth_ has quit IRC (Remote host closed the connection)
09:25 🔗 pguth_ has joined #archiveteam
09:51 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
09:51 🔗 JW_work1 has quit IRC (Read error: Connection reset by peer)
09:52 🔗 espes__ has quit IRC (Read error: Operation timed out)
09:52 🔗 espes__ has joined #archiveteam
09:52 🔗 JW_work has joined #archiveteam
09:52 🔗 Lord_Nigh has joined #archiveteam
09:58 🔗 DiscantX has joined #archiveteam
10:00 🔗 pguth_ has quit IRC (Remote host closed the connection)
10:00 🔗 pguth_ has joined #archiveteam
10:01 🔗 Discant has quit IRC (Read error: Operation timed out)
10:03 🔗 Discant has joined #archiveteam
10:05 🔗 DiscantX has quit IRC (Ping timeout: 244 seconds)
10:06 🔗 JW_work has quit IRC (Read error: Connection reset by peer)
10:06 🔗 JW_work has joined #archiveteam
10:12 🔗 Simpbrain has joined #archiveteam
10:15 🔗 Medowar has quit IRC (Ping timeout: 244 seconds)
10:15 🔗 Rye has quit IRC (Ping timeout: 244 seconds)
10:16 🔗 PurpleSym has quit IRC (Ping timeout: 244 seconds)
10:16 🔗 PotcFdk has quit IRC (Ping timeout: 506 seconds)
10:17 🔗 toddf has quit IRC (Read error: Connection reset by peer)
10:17 🔗 toddf has joined #archiveteam
10:18 🔗 Medowar has joined #archiveteam
10:18 🔗 Rye has joined #archiveteam
10:19 🔗 PotcFdk has joined #archiveteam
10:19 🔗 PurpleSym has joined #archiveteam
10:28 🔗 jk[SVP] has quit IRC (zoop)
10:36 🔗 pguth_ has quit IRC (Remote host closed the connection)
10:36 🔗 pguth_ has joined #archiveteam
10:37 🔗 jk[SVP] has joined #archiveteam
10:44 🔗 pguth_ has quit IRC (Remote host closed the connection)
10:44 🔗 pguth_ has joined #archiveteam
11:42 🔗 Emcy has joined #archiveteam
11:57 🔗 SadDM has quit IRC (leaving)
11:57 🔗 SadDM has joined #archiveteam
11:57 🔗 swebb sets mode: +o SadDM
12:26 🔗 Meroje has quit IRC (Quit: bye!)
12:28 🔗 Meroje has joined #archiveteam
12:48 🔗 Meroje has quit IRC (Quit: bye!)
12:48 🔗 Meroje has joined #archiveteam
12:50 🔗 Meroje has quit IRC (Client Quit)
12:51 🔗 Meroje has joined #archiveteam
12:53 🔗 Meroje has quit IRC (Client Quit)
12:53 🔗 Meroje has joined #archiveteam
12:55 🔗 Meroje has quit IRC (Client Quit)
12:55 🔗 Meroje has joined #archiveteam
12:57 🔗 Meroje has quit IRC (Client Quit)
12:57 🔗 Meroje has joined #archiveteam
12:58 🔗 Meroje has quit IRC (Client Quit)
12:59 🔗 Meroje has joined #archiveteam
13:06 🔗 vitzli has joined #archiveteam
13:07 🔗 tomwsmf has joined #archiveteam
13:24 🔗 nightpool has quit IRC (Read error: Operation timed out)
13:33 🔗 midas https://tweakers.net/geek/114249/internet-archive-zet-dertien-jaargangen-nintendo-power-magazine-online.html
13:40 🔗 Discant has quit IRC (Read error: Operation timed out)
13:47 🔗 Simpbrain has quit IRC (Quit: Leaving)
14:38 🔗 nightpool has joined #archiveteam
14:41 🔗 pguth_ has quit IRC (Remote host closed the connection)
14:42 🔗 pguth_ has joined #archiveteam
15:03 🔗 Simpbrain has joined #archiveteam
15:36 🔗 BlueMaxim has quit IRC (Quit: Leaving)
16:22 🔗 ndiddy has joined #archiveteam
16:25 🔗 TC02 has joined #archiveteam
16:30 🔗 DoomTay has joined #archiveteam
16:35 🔗 anjacks0n has joined #archiveteam
16:36 🔗 JesseW has joined #archiveteam
16:40 🔗 Simpbrain has quit IRC (Quit: Leaving)
16:41 🔗 pguth_ has quit IRC (Remote host closed the connection)
16:41 🔗 pguth_ has joined #archiveteam
16:44 🔗 Phoen1x has joined #archiveteam
16:49 🔗 philpem has joined #archiveteam
17:08 🔗 DoomTay has quit IRC (Quit: Page closed)
17:11 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
17:13 🔗 zenguy has joined #archiveteam
17:18 🔗 mr-b has quit IRC (Read error: Operation timed out)
17:38 🔗 DoomTay has joined #archiveteam
17:45 🔗 JW_work has quit IRC (Quit: Leaving.)
17:45 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
17:46 🔗 RichardG has joined #archiveteam
17:46 🔗 zenguy has quit IRC (Read error: Operation timed out)
17:49 🔗 RichardG_ has joined #archiveteam
17:51 🔗 RichardG has quit IRC (Ping timeout: 244 seconds)
17:53 🔗 RichardG has joined #archiveteam
17:58 🔗 pguth_ has quit IRC (Remote host closed the connection)
17:58 🔗 pguth_ has joined #archiveteam
17:59 🔗 RichardG_ has quit IRC (Read error: Operation timed out)
18:07 🔗 PepsiMax has joined #archiveteam
18:12 🔗 JW_work has joined #archiveteam
18:13 🔗 JW_work https://www.getdatajoy.com/ <- shutting down Jan 2, 2017 ; unclear what public data is available
18:15 🔗 JW_work https://news.ycombinator.com/item?id=12216896
19:00 🔗 ErkDog Is there a way to maybe increase the items/hour on googlecode?
19:03 🔗 Phoen1x has quit IRC (Read error: Operation timed out)
19:04 🔗 Phoen1x has joined #archiveteam
19:11 🔗 vitzli has quit IRC (Read error: Operation timed out)
19:22 🔗 Start_ is now known as Start
19:40 🔗 Phoen1x has quit IRC (Quit: Leaving)
19:55 🔗 Hybrid_ has joined #archiveteam
19:56 🔗 Hybrid_ has quit IRC (Client Quit)
20:23 🔗 Aranje has joined #archiveteam
20:25 🔗 Discant has joined #archiveteam
20:34 🔗 khaoohs_ has quit IRC (Read error: Operation timed out)
20:52 🔗 Discant has quit IRC (Ping timeout: 633 seconds)
20:58 🔗 schbirid has quit IRC (Quit: Leaving)
21:12 🔗 Medowar Do we have a wiki page to dump all info regarding the Turkish newspaper crackdown?
21:19 🔗 JW_work1 has joined #archiveteam
21:19 🔗 JW_work1 Medowar: not that I know of — you could make one
21:26 🔗 JW_work has quit IRC (Ping timeout: 633 seconds)
21:29 🔗 Medowar lol: https://www.youtube.com/watch?v=HvXtPk8gjYE
21:29 🔗 Medowar Turkish News Channel Mistakes GTA Cheats for Coup Codes
21:30 🔗 Simpbrain has joined #archiveteam
21:31 🔗 Medowar current dump: http://archiveteam.org/index.php?title=Turkey_Media_Crackdown
21:52 🔗 kristian_ has joined #archiveteam
22:00 🔗 pguth_ has quit IRC (Remote host closed the connection)
22:00 🔗 pguth_ has joined #archiveteam
22:10 🔗 khaoohs has joined #archiveteam
22:30 🔗 khaoohs has quit IRC (Quit: Leaving)
22:30 🔗 khaoohs has joined #archiveteam
22:37 🔗 ErkDog Is there a way to maybe increase the items/hour on googlecode?
22:45 🔗 Honno has quit IRC (Read error: Operation timed out)
23:08 🔗 toddf probably old hat around here, but yahoo! I just succeeded in getting my genealogy scraping routine to use the https://web.archive.org/save/$url api and then verify it's available via the https://archive.org/wayback/available?url=$url api (though one must transform & into %26 for this api) and then retrieve for my own analysis via the https://web.archive.org/web/<timestamp>id_/$url api ..
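
The three-step flow toddf describes (save, check availability, fetch the raw capture) can be sketched as URL builders in Python. This is an illustrative sketch only, not toddf's code, which the log says is Perl. The function names and the example target URL are hypothetical, and the JSON shape read by `closest_timestamp` (`archived_snapshots` / `closest` / `timestamp`) is an assumption about the availability response rather than something stated in the log. No network requests are made here.

```python
import json

SAVE_PREFIX = "https://web.archive.org/save/"
AVAILABLE_PREFIX = "https://archive.org/wayback/available?url="
REPLAY_PREFIX = "https://web.archive.org/web/"


def save_request_url(url):
    """Build the URL that asks the Wayback Machine to capture `url`."""
    return SAVE_PREFIX + url


def availability_request_url(url):
    """Build the URL that asks whether a capture of `url` exists.

    '&' is escaped as '%26', as noted in the log, so the target's own
    query parameters are not read as parameters of the availability API.
    """
    return AVAILABLE_PREFIX + url.replace("&", "%26")


def closest_timestamp(response_text):
    """Return the timestamp of the closest capture, or None if there is none.

    Assumes a response shaped like
    {"archived_snapshots": {"closest": {"available": true, "timestamp": "..."}}}.
    """
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("timestamp")


def raw_capture_url(timestamp, url):
    """Build the URL for the unmodified capture (the `id_` suffix on the timestamp)."""
    return f"{REPLAY_PREFIX}{timestamp}id_/{url}"
```

A caller would fetch `save_request_url(target)`, poll `availability_request_url(target)` until `closest_timestamp` returns a value, then fetch `raw_capture_url(timestamp, target)`. Any delay between calls is left to the caller, since the log does not state a rate limit.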
23:11 🔗 xmc :)
23:11 🔗 toddf archive.org can hit them up better and faster than I can for some reason, and so instead of 6 urls / min I'm now doing 46 urls / min .. reducing my time to complete my current set of urls (before I scrape and find more) from 2y to 3mo and some change
23:11 🔗 toddf presuming they don't blacklist archive.org
23:12 🔗 xmc oh nice i didn't know archive.org had better thruput
23:13 🔗 toddf I got 50% syn syn syn ack delays so 42s / url and sometimes even can't connect issues at the tcp layer from my laptop directly to the target site, archive.org seems to have those issues licked, at least right now
23:13 🔗 toddf it is not a whole lot to recode to turn this into a generic site scraper ..
23:13 🔗 xmc very strange
23:14 🔗 toddf archive.org claims to have unlimited api use for v1 of its api, 'for now, may have to limit in the future'
23:14 🔗 toddf so I'll presume they don't mind me doing this in serial. if someone has a link to any limits I should impose to calling the above listed api urls please let me know, I'd rather not get blacklisted from archive.org ;-)
23:15 🔗 toddf I only have 6.4 million urls to scrape until I rinse and repeat and find more links inside there
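
The "2y to 3mo" estimate can be checked against the figures quoted in the log (6.4 million URLs, 6 versus 46 URLs per minute). A minimal sketch, assuming the scraper runs around the clock at a constant rate; the function name is hypothetical.

```python
MINUTES_PER_DAY = 60 * 24


def days_to_finish(total_urls, urls_per_minute):
    """Days needed to process `total_urls` at a constant rate, running nonstop."""
    return total_urls / urls_per_minute / MINUTES_PER_DAY


direct = days_to_finish(6_400_000, 6)        # scraping the site directly
via_archive = days_to_finish(6_400_000, 46)  # going through archive.org

print(f"direct: {direct:.0f} days, via archive.org: {via_archive:.0f} days")
```

This gives roughly 741 days (about 2 years) at 6 URLs per minute and roughly 97 days (a bit over 3 months) at 46, which matches toddf's figures.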
23:15 🔗 xmc not bad
23:15 🔗 xmc go for it
23:16 🔗 toddf pretty sure I won't blow my sqlite3 max db size for this, going on 8gb now and if my math is right 120gb or so is the max in my env
23:16 🔗 xmc sqlite might be the wrong choice for that but,
23:16 🔗 xmc if it works for you then do it
23:16 🔗 toddf I think I'll be stuffing it into a postgresql db before it's all said and done
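
One way to check the maximum database size in a given environment, rather than relying on mental math, is to multiply SQLite's page size by its maximum page count. A hedged sketch in Python using the standard `sqlite3` module and the `page_size` and `max_page_count` pragmas; the function name is hypothetical, and the result depends on the local SQLite build and on the database file being inspected, so it may not match the 120gb figure mentioned above.

```python
import sqlite3


def sqlite_size_limit_bytes(path=":memory:"):
    """Return page_size * max_page_count for the database at `path`.

    With the default ":memory:" path this reports the limit for a fresh
    database created with the local build's default settings.
    """
    conn = sqlite3.connect(path)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
    finally:
        conn.close()
    return page_size * max_pages


limit = sqlite_size_limit_bytes()
print(f"size limit: {limit / 1024**3:.0f} GiB")
```

Passing the path of an existing database file instead of `:memory:` reports the limit for that file's own page size.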
23:18 🔗 toddf it was an easy first pick, and I've been trying really hard to tune the network code and scientifically experiment with self imposed delays to see if it affected the affliction of syn syn syn ack delay and/or can't connect
23:18 🔗 xmc _nod_
23:19 🔗 toddf then I stumbled upon the archive.org api to save a page and retrieve the original content unadulterated with the id_ bit after the timestamp, and .. the day is lost for anything productive except running the scraper full tilt now ;-)
23:20 🔗 xmc i know THAT feeling
23:21 🔗 toddf perhaps I've simply reinvented a sqlite3+perl version of the team vm thingie that is serialized but hey, learned a lot in the process
23:21 🔗 toddf $ wc -l git/sw/genscripts/genwebscrape
23:21 🔗 toddf 3580 git/sw/genscripts/genwebscrape
23:21 🔗 toddf lots o learning in that bit o code ;-)
23:21 🔗 xmc o my
23:22 🔗 arkiver ErkDog: we got complaints from google for going too fast, it'll stay at this speed
23:22 🔗 toddf this way I don't even have to learn about warc's (I read a bit, format seems a bit overkill but anyway) .. archive.org can handle that bit for me in this scenario ;-)
23:22 🔗 arkiver the current batch of items is the last batch, after that we are done
23:36 🔗 nightpool has quit IRC (Ping timeout: 501 seconds)
23:47 🔗 DoomTay has quit IRC (Quit: Page closed)
23:47 🔗 lytv has quit IRC (Read error: Operation timed out)
23:53 🔗 DoomTay has joined #archiveteam
