[00:05] *** Sk1d has quit IRC (Read error: Operation timed out)
[00:08] *** twoTBHetz has joined #archiveteam-bs
[00:08] *** Sk1d has joined #archiveteam-bs
[00:13] *** m007a83 has joined #archiveteam-bs
[00:21] *** Sk1d has quit IRC (Read error: Operation timed out)
[00:23] *** Sk1d has joined #archiveteam-bs
[00:24] betamax: Are there any other URLs that need snscraping at the moment? Or candidate websites that need archiving?
[00:24] On a related note, I was looking around for other sites that might have candidate links, and found this: https://vote-usa.org/forresearch.aspx
[00:24] Seems comprehensive and well-structured, will probably try writing a scraper for it later.
[00:45] *** VerifiedJ has quit IRC (Quit: Leaving)
[00:51] *** Mateon1 has quit IRC (Ping timeout: 265 seconds)
[00:52] *** Mateon1 has joined #archiveteam-bs
[00:58] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:01] *** Sk1d has joined #archiveteam-bs
[01:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:19] *** Sk1d has joined #archiveteam-bs
[01:31] *** twoTBHetz has quit IRC (Ping timeout: 260 seconds)
[01:35] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:39] *** Sk1d has joined #archiveteam-bs
[01:56] *** Sk1d has quit IRC (Read error: Operation timed out)
[01:59] *** Sk1d has joined #archiveteam-bs
[02:05] *** Stilett0 has joined #archiveteam-bs
[02:10] *** Stiletto has quit IRC (Ping timeout: 633 seconds)
[02:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:20] *** Sk1d has joined #archiveteam-bs
[02:32] *** Ryz has quit IRC (west.us.hub irc.Prison.NET)
[02:32] *** achip has quit IRC (west.us.hub irc.Prison.NET)
[02:33] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:37] *** Sk1d has joined #archiveteam-bs
[02:50] *** Sk1d has quit IRC (Read error: Operation timed out)
[02:54] *** Sk1d has joined #archiveteam-bs
[03:05] *** achip has joined #archiveteam-bs
[03:07] *** Ryz has joined #archiveteam-bs
[03:32] *** Sk1d has
quit IRC (Read error: Operation timed out)
[03:36] *** Sk1d has joined #archiveteam-bs
[03:40] *** bakJAA_ has joined #archiveteam-bs
[03:40] *** swebb sets mode: +o bakJAA_
[03:40] *** JAA sets mode: +o bakJAA_
[03:40] *** bakJAA has quit IRC (Read error: Connection reset by peer)
[03:41] *** kyounko has quit IRC (Ping timeout: 492 seconds)
[03:43] *** mgrytbak_ has quit IRC (Ping timeout: 492 seconds)
[03:47] *** mgrytbak_ has joined #archiveteam-bs
[03:50] *** Sk1d has quit IRC (Read error: Operation timed out)
[03:54] *** Sk1d has joined #archiveteam-bs
[04:00] hi, is it possible to convert .warc to static html?
[04:06] ggus: If the WARC is a WARC of an HTML file, sure. You just need to extract the contents.
[04:06] I think there are a bunch of tools, but take a look at something like https://github.com/chfoo/warcat
[04:07] *** Sk1d has quit IRC (Read error: Operation timed out)
[04:12] *** Sk1d has joined #archiveteam-bs
[04:19] jodizzle: thanks! i'll take a look
[04:28] *** Martle__ has quit IRC (Ping timeout: 252 seconds)
[04:43] *** qw3rty114 has joined #archiveteam-bs
[04:50] *** qw3rty113 has quit IRC (Read error: Operation timed out)
[04:50] *** Sk1d has quit IRC (Read error: Operation timed out)
[04:54] *** Sk1d has joined #archiveteam-bs
[04:57] *** odemg has quit IRC (Read error: Operation timed out)
[05:06] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:10] *** Sk1d has joined #archiveteam-bs
[05:11] *** odemg has joined #archiveteam-bs
[05:23] *** Sk1d has quit IRC (Read error: Operation timed out)
[05:27] *** Sk1d has joined #archiveteam-bs
[06:01] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:05] *** Sk1d has joined #archiveteam-bs
[06:17] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:22] *** Sk1d has joined #archiveteam-bs
[06:36] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:41] *** Sk1d has joined #archiveteam-bs
[06:53] *** Sk1d has quit IRC (Read error: Operation timed out)
[06:58] ***
Sk1d has joined #archiveteam-bs
[07:43] *** Sk1d has quit IRC (Read error: Operation timed out)
[07:48] *** Sk1d has joined #archiveteam-bs
[08:00] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:04] *** Sk1d has joined #archiveteam-bs
[08:12] *** adinbied has quit IRC (Remote host closed the connection)
[08:12] *** adinbied has joined #archiveteam-bs
[08:15] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:18] *** adinbied_ has joined #archiveteam-bs
[08:19] *** adinbied has quit IRC (Ping timeout: 252 seconds)
[08:20] *** Sk1d has joined #archiveteam-bs
[08:32] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:36] *** Sk1d has joined #archiveteam-bs
[08:51] *** Sk1d has quit IRC (Read error: Operation timed out)
[08:56] *** Sk1d has joined #archiveteam-bs
[09:12] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:15] *** Sk1d has joined #archiveteam-bs
[09:18] *** BlueMax has quit IRC (Quit: Leaving)
[09:28] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:30] Awww yeah, Flashpoint - archiving Adobe Flash games and animations and making them work: https://old.reddit.com/r/emulation/comments/9jt2yc/flashpoint_50_brand_new_launcher_two_new_plugins/
[09:30] And a more recent topic: https://old.reddit.com/r/emulation/comments/9s0lz6/flashpoint_51_the_great_filter_playlists_a_new/
[09:32] *** Sk1d has joined #archiveteam-bs
[09:37] Bluemax you have a fan
[09:39] Oh, I've known about this for 2-4 months now, and even though I can't play Adobe Flash games right now, I casually keep getting informed about it
[09:41] I may plan to help out on archiving Adobe Flash games too or doing anything related to the project
[09:45] *** Sk1d has quit IRC (Read error: Operation timed out)
[09:48] *** Sk1d has joined #archiveteam-bs
[10:02] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:07] *** Sk1d has joined #archiveteam-bs
[10:08] *** godane has quit IRC (Ping timeout: 265 seconds)
[10:10] Bluemax hangs out around
here somewhere
[10:18] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:24] *** Sk1d has joined #archiveteam-bs
[10:54] *** Sk1d has quit IRC (Read error: Operation timed out)
[10:58] jodizzle: your approach of looking for sites that might have candidate links is a good one
[10:59] *** Sk1d has joined #archiveteam-bs
[11:00] basically, it will get to a point where the databases of candidate info have been exhausted and then the effort required to discover more urls becomes too great
[11:00] eg: googling candidate names and looking for their websites is a bit infeasible - at least in my opinion!
[11:07] Betamax give me a small list of candidates. Combining it with my FTP search I may come up with a few useful things to throw into archivebot
[11:11] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:14] *** Sk1d has joined #archiveteam-bs
[11:27] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:31] *** Sk1d has joined #archiveteam-bs
[11:44] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:49] *** Sk1d has joined #archiveteam-bs
[12:08] *** Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805])
[12:53] *** t2t2 has quit IRC (Read error: Operation timed out)
[12:56] *** t2t2 has joined #archiveteam-bs
[12:59] *** godane has joined #archiveteam-bs
[13:08] *** adinbied_ is now known as adinbied
[13:12] Free Music Archive is returning a 503 for me, did it go down a day early?
[13:13] NVM, working again - was just a temp issue
[13:34] Probably because AB is slamming it, I think
[13:39] still far from done too :/
[13:42] Currently there is: "622,025 in queue"
[13:44] *** Mateon1 has quit IRC (Remote host closed the connection)
[13:44] *** LFlare has joined #archiveteam-bs
[13:44] *** Mateon1 has joined #archiveteam-bs
[13:47] The AB log is showing a bunch of 503s at the moment
[13:51] With an 18 sec delay it seems to be doing better
[14:04] It was 500-700ms that was doing better
[14:21] Flashfire: when I did the discovery I just grabbed urls - so I don't actually have a list of candidate names
[14:31] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:34] *** Sk1d has joined #archiveteam-bs
[14:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:49] *** VerifiedJ has joined #archiveteam-bs
[14:50] *** Sk1d has joined #archiveteam-bs
[15:06] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:11] *** Sk1d has joined #archiveteam-bs
[15:16] *** SketchCo1 is now known as SketchCow
[15:16] *** LFlare has quit IRC (Read error: Operation timed out)
[15:23] *** Sk1d has quit IRC (Read error: Operation timed out)
[15:26] *** LFlare has joined #archiveteam-bs
[15:28] *** Sk1d has joined #archiveteam-bs
[15:43] *** Martle has joined #archiveteam-bs
[16:19] *** Sk1d has quit IRC (Read error: Operation timed out)
[16:23] *** Sk1d has joined #archiveteam-bs
[16:24] *** Somebody2 has quit IRC (Read error: Operation timed out)
[16:25] *** DFJustin has quit IRC (Read error: Connection reset by peer)
[16:26] *** DFJustin has joined #archiveteam-bs
[16:26] *** swebb sets mode: +o DFJustin
[16:27] *** _Verified has joined #archiveteam-bs
[16:29] *** _Verified has quit IRC (Client Quit)
[16:30] *** VerifiedJ has quit IRC (Quit: Leaving)
[16:34] *** VerifiedJ has joined #archiveteam-bs
[16:38] *** Sk1d has quit IRC (Read error: Operation timed out)
[16:41] *** Sk1d has joined #archiveteam-bs
[16:57] *** schbirid has
joined #archiveteam-bs
[17:07] *** LFlare has quit IRC (Ping timeout: 252 seconds)
[17:12] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[17:12] *** wp494 has joined #archiveteam-bs
[17:14] *** Somebody2 has joined #archiveteam-bs
[17:29] *** LFlare has joined #archiveteam-bs
[17:47] re gitorious
[17:48] i mean. the storage layer is real infrastructure.
[17:48] but the way we punched a hole thru the layers in the middle to present it to a virtual machine, is, a bit unusual
[17:57] *** Sk1d has quit IRC (Read error: Operation timed out)
[18:02] *** Sk1d has joined #archiveteam-bs
[18:06] *** Ryz has joined #archiveteam-bs
[18:07] *** Martle_ has joined #archiveteam-bs
[18:08] *** Martle_ has quit IRC (Client Quit)
[18:09] *** Martle has quit IRC (Ping timeout: 252 seconds)
[19:03] *** Sk1d has quit IRC (Read error: Operation timed out)
[19:06] *** Sk1d has joined #archiveteam-bs
[19:22] *** SimpBrain has quit IRC (Read error: Operation timed out)
[19:38] *** m007a83_ has joined #archiveteam-bs
[19:41] *** m007a83 has quit IRC (Read error: Operation timed out)
[19:43] *** Sk1d has quit IRC (Read error: Operation timed out)
[19:46] *** Sk1d has joined #archiveteam-bs
[19:48] *** m007a83_ is now known as m007a83
[19:59] *** Sk1d has quit IRC (Read error: Operation timed out)
[20:03] *** Sk1d has joined #archiveteam-bs
[20:16] *** Sk1d has quit IRC (Read error: Operation timed out)
[20:20] *** Sk1d has joined #archiveteam-bs
[20:26] *** icedice has joined #archiveteam-bs
[20:32] *** Sk1d has quit IRC (Read error: Operation timed out)
[20:34] *** Sk1d has joined #archiveteam-bs
[20:42] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[20:52] *** Sk1d has joined #archiveteam-bs
[20:53] *** BlueMax has joined #archiveteam-bs
[21:16] right, I just found a flaw in the regex I was using to extract URLs from the campaign sites I downloaded:
[21:16] it doesn't deal with 'src="' which is what
YouTube, etc. use for embedded content
[21:17] whatever I do will undoubtedly be wrong, does anyone have a good grep command for extracting urls from HTML?
[21:17] don't need to be inline URLs, just ones that are links or embedded content
[21:18] preferably one that uses grep rather than egrep, for speed
[21:22] *** BlueMaxim has joined #archiveteam-bs
[21:22] i like doing poop like: grep -Po 'src=".*?"' index.html
[21:23] or sed 's#src="#\nsrc=#g' index.html | grep src=
[21:23] :D
[21:23] first one aint poop actually, it's the best i know
[21:25] *** BlueMax has quit IRC (Ping timeout: 260 seconds)
[21:25] *** Sk1d has quit IRC (Read error: Operation timed out)
[21:26] coming from a regex idiot, is there any way to alter the first one so it doesn't have the 'src="' and '"' in it?
[21:27] betamax: At that point I would just look around for a command line utility that does processing similar to what jq does for json
[21:28] or look up some sed or awk command
[21:28] If it's slow, you can always try and use parallel to speed things up
[21:30] grep is low-memory, though (only have ~2GB to work with)
[21:30] *** Sk1d has joined #archiveteam-bs
[21:30] betamax: grep -Po 'src="\K.*?"' index.html
[21:30] via https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match
[21:31] oops, trailing "
[21:31] grep -Po 'src="\K.*?(?=")' index.html
[21:31] wicked cool
[21:31] i always just used a sed pipe for that before. thanks
[21:32] that's it! thanks
[21:33] np!
[21:38] wait, that one doesn't work for single-quotes
[21:38] is that an issue do you think?
[21:41] grep -Po 'src=(["'"'"'])\K.*?(?=\1)' index.html
[21:57] *** w0rmybak has quit IRC (Quit: Ping timeout (120 seconds))
[21:57] *** kiskabak has quit IRC (Quit: Ping timeout (120 seconds))
[21:57] *** Flashback has quit IRC (Quit: Ping timeout (120 seconds))
[21:58] *** w0rmybak has joined #archiveteam-bs
[22:00] *** kiskabak has joined #archiveteam-bs
[22:00] *** w0rmybak has quit IRC (Client Quit)
[22:00] JAA: How does that one work? Are the quotes escaping each other?
[22:00] *** w0rmybak has joined #archiveteam-bs
[22:01] jodizzle: It's actually three separate strings: 'src=(["' + "'" + '])...'
[22:01] That's because you can't have a single quote inside a single-quoted string. There is no backslash escaping or similar.
[22:02] This is on the shell level, i.e. the argument actually sent to grep is just src=(["'])\K...
[22:03] *** Flashback has joined #archiveteam-bs
[22:06] JAA: Ahh took me a second but I get it now.
[22:06] That's pretty cool, thanks.
[22:06] :-)
[22:10] *** schbirid has quit IRC (Remote host closed the connection)
[22:11] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:11] *** bakJAA_ is now known as bakJAA
[22:15] *** Sk1d has joined #archiveteam-bs
[22:30] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:33] *** Sk1d has joined #archiveteam-bs
[22:40] JAA: regex n00b again. Any way to extend that regex so it works with 'href' links as well?
[22:41] combining it with the one I was using before won't work: in my cluelessness it consisted of about 5 grep commands all chained together
[22:45] *** Sk1d has quit IRC (Read error: Operation timed out)
[22:48] betamax: Don't quote me, but I think if you did this it would work: grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)'
[22:49] thanks, will try
[22:51] *** Sk1d has joined #archiveteam-bs
[22:51] Seems to work on a random test html file for me.
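For anyone without grep -P, the href/src one-liner from 22:48 can be approximated in Python. This is only a sketch and not something posted in the channel. Python's re module has no \K, so the attribute value is captured in a named group instead. The function name extract_urls, the group name url, and the sample HTML (example.org, abc123) are made up for illustration.

```python
import re

# Rough Python counterpart of:
#   grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)'
# Python's re has no \K, so the value is captured in a named group.
# Group 1 remembers which quote opened the value; \1 requires the same
# quote to close it, so both "..." and '...' are handled.
ATTR_URL = re.compile(r"""(?:href|src)=(["'])(?P<url>.*?)\1""")


def extract_urls(html):
    """Return every quoted href/src attribute value in html, in order."""
    return [match.group("url") for match in ATTR_URL.finditer(html)]


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = (
        '<a href="https://example.org/about">About</a>\n'
        "<iframe src='https://www.youtube.com/embed/abc123'></iframe>\n"
        '<img src="/static/logo.png">\n'
    )
    for url in extract_urls(sample):
        print(url)
```

Like the grep version, this is a plain pattern match rather than an HTML parser. It misses unquoted attribute values and also picks up attributes whose names merely end in src or href (data-src, for example).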
[22:55] *** Mateon1 has quit IRC (Read error: Operation timed out)
[22:56] *** dashcloud has joined #archiveteam-bs
[22:56] *** Mateon1 has joined #archiveteam-bs
[23:18] *** BlueMaxim has quit IRC (Quit: Leaving)
[23:19] *** adinbied has quit IRC (Left Channel.)
[23:39] *** adinbied has joined #archiveteam-bs