#archiveteam-bs 2018-11-08,Thu

Time Nickname Message
00:05 🔗 Sk1d has quit IRC (Read error: Operation timed out)
00:08 🔗 twoTBHetz has joined #archiveteam-bs
00:08 🔗 Sk1d has joined #archiveteam-bs
00:13 🔗 m007a83 has joined #archiveteam-bs
00:21 🔗 Sk1d has quit IRC (Read error: Operation timed out)
00:23 🔗 Sk1d has joined #archiveteam-bs
00:24 🔗 jodizzle betamax: Are there any other URLs that need snscraping at the moment? Or candidate websites that need archiving?
00:24 🔗 jodizzle On a related note, I was looking around for other sites that might have candidate links, and found this: https://vote-usa.org/forresearch.aspx
00:24 🔗 jodizzle Seems comprehensive and well-structured, will probably try writing a scraper for it later.
00:45 🔗 VerifiedJ has quit IRC (Quit: Leaving)
00:51 🔗 Mateon1 has quit IRC (Ping timeout: 265 seconds)
00:52 🔗 Mateon1 has joined #archiveteam-bs
00:58 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:01 🔗 Sk1d has joined #archiveteam-bs
01:15 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:19 🔗 Sk1d has joined #archiveteam-bs
01:31 🔗 twoTBHetz has quit IRC (Ping timeout: 260 seconds)
01:35 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:39 🔗 Sk1d has joined #archiveteam-bs
01:56 🔗 Sk1d has quit IRC (Read error: Operation timed out)
01:59 🔗 Sk1d has joined #archiveteam-bs
02:05 🔗 Stilett0 has joined #archiveteam-bs
02:10 🔗 Stiletto has quit IRC (Ping timeout: 633 seconds)
02:15 🔗 Sk1d has quit IRC (Read error: Operation timed out)
02:20 🔗 Sk1d has joined #archiveteam-bs
02:32 🔗 Ryz has quit IRC (west.us.hub irc.Prison.NET)
02:32 🔗 achip has quit IRC (west.us.hub irc.Prison.NET)
02:33 🔗 Sk1d has quit IRC (Read error: Operation timed out)
02:37 🔗 Sk1d has joined #archiveteam-bs
02:50 🔗 Sk1d has quit IRC (Read error: Operation timed out)
02:54 🔗 Sk1d has joined #archiveteam-bs
03:05 🔗 achip has joined #archiveteam-bs
03:07 🔗 Ryz has joined #archiveteam-bs
03:32 🔗 Sk1d has quit IRC (Read error: Operation timed out)
03:36 🔗 Sk1d has joined #archiveteam-bs
03:40 🔗 bakJAA_ has joined #archiveteam-bs
03:40 🔗 swebb sets mode: +o bakJAA_
03:40 🔗 JAA sets mode: +o bakJAA_
03:40 🔗 bakJAA has quit IRC (Read error: Connection reset by peer)
03:41 🔗 kyounko has quit IRC (Ping timeout: 492 seconds)
03:43 🔗 mgrytbak_ has quit IRC (Ping timeout: 492 seconds)
03:47 🔗 mgrytbak_ has joined #archiveteam-bs
03:50 🔗 Sk1d has quit IRC (Read error: Operation timed out)
03:54 🔗 Sk1d has joined #archiveteam-bs
04:00 🔗 ggus hi, is it possible to convert .warc to static HTML?
04:06 🔗 jodizzle ggus: If the WARC is a WARC of an HTML file, sure. You just need to extract the contents.
04:06 🔗 jodizzle I think there are a bunch of tools, but take a look at something like https://github.com/chfoo/warcat
04:07 🔗 Sk1d has quit IRC (Read error: Operation timed out)
04:12 🔗 Sk1d has joined #archiveteam-bs
04:19 🔗 ggus jodizzle: thanks! i'll take a look
04:28 🔗 Martle__ has quit IRC (Ping timeout: 252 seconds)
04:43 🔗 qw3rty114 has joined #archiveteam-bs
04:50 🔗 qw3rty113 has quit IRC (Read error: Operation timed out)
04:50 🔗 Sk1d has quit IRC (Read error: Operation timed out)
04:54 🔗 Sk1d has joined #archiveteam-bs
04:57 🔗 odemg has quit IRC (Read error: Operation timed out)
05:06 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:10 🔗 Sk1d has joined #archiveteam-bs
05:11 🔗 odemg has joined #archiveteam-bs
05:23 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:27 🔗 Sk1d has joined #archiveteam-bs
06:01 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:05 🔗 Sk1d has joined #archiveteam-bs
06:17 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:22 🔗 Sk1d has joined #archiveteam-bs
06:36 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:41 🔗 Sk1d has joined #archiveteam-bs
06:53 🔗 Sk1d has quit IRC (Read error: Operation timed out)
06:58 🔗 Sk1d has joined #archiveteam-bs
07:43 🔗 Sk1d has quit IRC (Read error: Operation timed out)
07:48 🔗 Sk1d has joined #archiveteam-bs
08:00 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:04 🔗 Sk1d has joined #archiveteam-bs
08:12 🔗 adinbied has quit IRC (Remote host closed the connection)
08:12 🔗 adinbied has joined #archiveteam-bs
08:15 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:18 🔗 adinbied_ has joined #archiveteam-bs
08:19 🔗 adinbied has quit IRC (Ping timeout: 252 seconds)
08:20 🔗 Sk1d has joined #archiveteam-bs
08:32 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:36 🔗 Sk1d has joined #archiveteam-bs
08:51 🔗 Sk1d has quit IRC (Read error: Operation timed out)
08:56 🔗 Sk1d has joined #archiveteam-bs
09:12 🔗 Sk1d has quit IRC (Read error: Operation timed out)
09:15 🔗 Sk1d has joined #archiveteam-bs
09:18 🔗 BlueMax has quit IRC (Quit: Leaving)
09:28 🔗 Sk1d has quit IRC (Read error: Operation timed out)
09:30 🔗 Ryz Awww yeah, Flashpoint - archiving Adobe Flash games and animations and making it work: https://old.reddit.com/r/emulation/comments/9jt2yc/flashpoint_50_brand_new_launcher_two_new_plugins/
09:30 🔗 Ryz And a more recent topic: https://old.reddit.com/r/emulation/comments/9s0lz6/flashpoint_51_the_great_filter_playlists_a_new/
09:32 🔗 Sk1d has joined #archiveteam-bs
09:37 🔗 Flashfire Bluemax you have a fan
09:39 🔗 Ryz Oh, I've known about this for 2-4 months now, and even though I can't play Adobe Flash games right now, I casually keep up with it
09:41 🔗 Ryz I may plan to help out on archiving Adobe Flash games too or doing anything related to the project
09:45 🔗 Sk1d has quit IRC (Read error: Operation timed out)
09:48 🔗 Sk1d has joined #archiveteam-bs
10:02 🔗 Sk1d has quit IRC (Read error: Operation timed out)
10:07 🔗 Sk1d has joined #archiveteam-bs
10:08 🔗 godane has quit IRC (Ping timeout: 265 seconds)
10:10 🔗 Flashfire Bluemax hangs out around here somewhere
10:18 🔗 Sk1d has quit IRC (Read error: Operation timed out)
10:24 🔗 Sk1d has joined #archiveteam-bs
10:54 🔗 Sk1d has quit IRC (Read error: Operation timed out)
10:58 🔗 betamax jodizzle: your approach of looking for sites that might have candidate links is a good one
10:59 🔗 Sk1d has joined #archiveteam-bs
11:00 🔗 betamax basically, it will get to a point where the databases of candidate info have been exhausted and then the effort required to discover more urls becomes too great
11:00 🔗 betamax e.g. googling candidate names and looking for their websites is a bit infeasible - at least in my opinion!
11:07 🔗 Flashfire Betamax give me a small list of candidates. Combining it with my FTP search I may come up with a few useful things to throw into archivebot
11:11 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:14 🔗 Sk1d has joined #archiveteam-bs
11:27 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:31 🔗 Sk1d has joined #archiveteam-bs
11:44 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:49 🔗 Sk1d has joined #archiveteam-bs
12:08 🔗 Ryz has quit IRC (Quit: ChatZilla 0.9.92-rdmsoft [XULRunner 35.0.1/20150122214805])
12:53 🔗 t2t2 has quit IRC (Read error: Operation timed out)
12:56 🔗 t2t2 has joined #archiveteam-bs
12:59 🔗 godane has joined #archiveteam-bs
13:08 🔗 adinbied_ is now known as adinbied
13:12 🔗 adinbied Free Music Archive is returning a 503 for me, did it go down a day early?
13:13 🔗 adinbied NVM, working again - was just a temp issue
13:34 🔗 kiska Probably because AB is slamming it, I think
13:39 🔗 anarcat still far from done too :/
13:42 🔗 kiska Currently there is: "622,025 in queue"
13:44 🔗 Mateon1 has quit IRC (Remote host closed the connection)
13:44 🔗 LFlare has joined #archiveteam-bs
13:44 🔗 Mateon1 has joined #archiveteam-bs
13:47 🔗 adinbied The AB log is showing a bunch of 503s at the moment
13:51 🔗 adinbied With an 18 sec delay it seems to be doing better
14:04 🔗 kiska It was 500-700ms that was doing better
14:21 🔗 betamax Flashfire: when I did the discovery I just grabbed urls - so I don't actually have a list of candidate names
14:31 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:34 🔗 Sk1d has joined #archiveteam-bs
14:48 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:49 🔗 VerifiedJ has joined #archiveteam-bs
14:50 🔗 Sk1d has joined #archiveteam-bs
15:06 🔗 Sk1d has quit IRC (Read error: Operation timed out)
15:11 🔗 Sk1d has joined #archiveteam-bs
15:16 🔗 SketchCo1 is now known as SketchCow
15:16 🔗 LFlare has quit IRC (Read error: Operation timed out)
15:23 🔗 Sk1d has quit IRC (Read error: Operation timed out)
15:26 🔗 LFlare has joined #archiveteam-bs
15:28 🔗 Sk1d has joined #archiveteam-bs
15:43 🔗 Martle has joined #archiveteam-bs
16:19 🔗 Sk1d has quit IRC (Read error: Operation timed out)
16:23 🔗 Sk1d has joined #archiveteam-bs
16:24 🔗 Somebody2 has quit IRC (Read error: Operation timed out)
16:25 🔗 DFJustin has quit IRC (Read error: Connection reset by peer)
16:26 🔗 DFJustin has joined #archiveteam-bs
16:26 🔗 swebb sets mode: +o DFJustin
16:27 🔗 _Verified has joined #archiveteam-bs
16:29 🔗 _Verified has quit IRC (Client Quit)
16:30 🔗 VerifiedJ has quit IRC (Quit: Leaving)
16:34 🔗 VerifiedJ has joined #archiveteam-bs
16:38 🔗 Sk1d has quit IRC (Read error: Operation timed out)
16:41 🔗 Sk1d has joined #archiveteam-bs
16:57 🔗 schbirid has joined #archiveteam-bs
17:07 🔗 LFlare has quit IRC (Ping timeout: 252 seconds)
17:12 🔗 wp494 has quit IRC (Ping timeout: 260 seconds)
17:12 🔗 wp494 has joined #archiveteam-bs
17:14 🔗 Somebody2 has joined #archiveteam-bs
17:29 🔗 LFlare has joined #archiveteam-bs
17:47 🔗 astrid re gitorious
17:48 🔗 astrid i mean. the storage layer is real infrastructure.
17:48 🔗 astrid but the way we punched a hole thru the layers in the middle to present it to a virtual machine, is, a bit unusual
17:57 🔗 Sk1d has quit IRC (Read error: Operation timed out)
18:02 🔗 Sk1d has joined #archiveteam-bs
18:06 🔗 Ryz has joined #archiveteam-bs
18:07 🔗 Martle_ has joined #archiveteam-bs
18:08 🔗 Martle_ has quit IRC (Client Quit)
18:09 🔗 Martle has quit IRC (Ping timeout: 252 seconds)
19:03 🔗 Sk1d has quit IRC (Read error: Operation timed out)
19:06 🔗 Sk1d has joined #archiveteam-bs
19:22 🔗 SimpBrain has quit IRC (Read error: Operation timed out)
19:38 🔗 m007a83_ has joined #archiveteam-bs
19:41 🔗 m007a83 has quit IRC (Read error: Operation timed out)
19:43 🔗 Sk1d has quit IRC (Read error: Operation timed out)
19:46 🔗 Sk1d has joined #archiveteam-bs
19:48 🔗 m007a83_ is now known as m007a83
19:59 🔗 Sk1d has quit IRC (Read error: Operation timed out)
20:03 🔗 Sk1d has joined #archiveteam-bs
20:16 🔗 Sk1d has quit IRC (Read error: Operation timed out)
20:20 🔗 Sk1d has joined #archiveteam-bs
20:26 🔗 icedice has joined #archiveteam-bs
20:32 🔗 Sk1d has quit IRC (Read error: Operation timed out)
20:34 🔗 Sk1d has joined #archiveteam-bs
20:42 🔗 dashcloud has quit IRC (Read error: Operation timed out)
20:48 🔗 Sk1d has quit IRC (Read error: Operation timed out)
20:52 🔗 Sk1d has joined #archiveteam-bs
20:53 🔗 BlueMax has joined #archiveteam-bs
21:16 🔗 betamax right, I just found a flaw in the regex I was using to extract URLs from the campaign sites I downloaded:
21:16 🔗 betamax it doesn't deal with 'src="' which is what YouTube, etc. use for embedded content
21:17 🔗 betamax whatever I do will undoubtedly be wrong, does anyone have a good grep command for extracting urls from HTML?
21:17 🔗 betamax don't need to be inline URLs, just ones that are links or embedded content
21:18 🔗 betamax preferably one that uses grep rather than egrep, for speed
21:22 🔗 BlueMaxim has joined #archiveteam-bs
21:22 🔗 schbirid i like doing poop like: grep -Po 'src=".*?"' index.html
21:23 🔗 schbirid or sed 's#src="#\nsrc=#g' index.html | grep src=
21:23 🔗 schbirid :D
21:23 🔗 schbirid first one aint poop actually, it's the best i know
21:25 🔗 BlueMax has quit IRC (Ping timeout: 260 seconds)
21:25 🔗 Sk1d has quit IRC (Read error: Operation timed out)
21:26 🔗 betamax coming from a regex idiot, is there any way to alter the first one so it doesn't have the 'src="' and '"' in it?
21:27 🔗 jodizzle betamax: At that point I would just look around for a command line utility that does processing similar to what jq does for json
21:28 🔗 jodizzle or look up some sed or awk command
21:28 🔗 jodizzle If it's slow, you can always try and use parallel to speed things up
21:30 🔗 betamax grep is low-memory, though (only have ~2GB to work with)
21:30 🔗 Sk1d has joined #archiveteam-bs
21:30 🔗 schbirid betamax: grep -Po 'src="\K.*?"' index.html
21:30 🔗 schbirid via https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match
21:31 🔗 schbirid oops, trailing "
21:31 🔗 schbirid grep -Po 'src="\K.*?(?=")' index.html
21:31 🔗 schbirid wicked cool
21:31 🔗 schbirid i always just used a sed pipe for that before. thanks
21:32 🔗 betamax that's it! thanks
21:33 🔗 schbirid np!
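
Annotation: the difference between schbirid's first and final commands can be seen on a small sample file. This is only a sketch, assuming GNU grep built with PCRE support (the -P flag); the temporary sample.html and its contents are made up for illustration and are not from the channel.

```shell
# Hypothetical sample input; the file name and contents are illustrative only.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<img src="logo.png"><iframe src="https://www.youtube.com/embed/abc"></iframe>
EOF

# First variant: prints the whole match, including the attribute name and both quotes.
with_attr=$(grep -Po 'src=".*?"' "$tmp/sample.html")

# Final variant: \K discards everything matched so far, so 'src="' is dropped.
# The lookahead (?=") requires a closing quote without including it in the output.
urls_only=$(grep -Po 'src="\K.*?(?=")' "$tmp/sample.html")

printf '%s\n' "$with_attr" "$urls_only"
rm -r "$tmp"
```

With -o, grep prints each match on its own line, so a line containing several src attributes yields several output lines.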
21:38 🔗 betamax wait, that one doesn't work for single-quotes
21:38 🔗 betamax is that an issue do you think?
21:41 🔗 JAA grep -Po 'src=(["'"'"'])\K.*?(?=\1)' index.html
21:57 🔗 w0rmybak has quit IRC (Quit: Ping timeout (120 seconds))
21:57 🔗 kiskabak has quit IRC (Quit: Ping timeout (120 seconds))
21:57 🔗 Flashback has quit IRC (Quit: Ping timeout (120 seconds))
21:58 🔗 w0rmybak has joined #archiveteam-bs
22:00 🔗 kiskabak has joined #archiveteam-bs
22:00 🔗 w0rmybak has quit IRC (Client Quit)
22:00 🔗 jodizzle JAA: How does that one work? Are the quotes escaping each other?
22:00 🔗 w0rmybak has joined #archiveteam-bs
22:01 🔗 JAA jodizzle: It's actually three separate strings: 'src=(["' + "'" + '])...'
22:01 🔗 JAA That's because you can't have a single quote inside a single-quoted string. There is no backslash escaping or similar.
22:02 🔗 JAA This is on the shell level, i.e. the argument actually sent to grep is just src=(["'])\K...
22:03 🔗 Flashback has joined #archiveteam-bs
22:06 🔗 jodizzle JAA: Ahh took me a second but I get it now.
22:06 🔗 jodizzle That's pretty cool, thanks.
22:06 🔗 JAA :-)
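
Annotation: JAA's explanation of the quoting can be checked by printing the argument that the shell actually builds. This is a minimal sketch, assuming a POSIX shell and GNU grep with -P; the variable names and the sample HTML string are hypothetical.

```shell
# Three adjacent pieces: 'src=(["'  +  "'"  +  '])\K.*?(?=\1)'
# The shell concatenates them into a single argument before grep ever runs.
pattern='src=(["'"'"'])\K.*?(?=\1)'
printf '%s\n' "$pattern"

# The backreference \1 makes the closing quote match whichever quote opened the value.
matches=$(printf '%s\n' "<img src='a.png'><img src=\"b.png\">" | grep -Po "$pattern")
printf '%s\n' "$matches"
```

The first printf shows the pattern as grep receives it, with a plain single quote inside the character class.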
22:10 🔗 schbirid has quit IRC (Remote host closed the connection)
22:11 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:11 🔗 bakJAA_ is now known as bakJAA
22:15 🔗 Sk1d has joined #archiveteam-bs
22:30 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:33 🔗 Sk1d has joined #archiveteam-bs
22:40 🔗 betamax JAA: regex n00b again. Any way to add that regex to one that works with 'href' links as well?
22:41 🔗 betamax combining it with the one I was using before won't work: in my cluelessness it consisted of about 5 grep commands all chained together
22:45 🔗 Sk1d has quit IRC (Read error: Operation timed out)
22:48 🔗 jodizzle betamax: Don't quote me, but I think if you did this it would work: grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)'
22:49 🔗 betamax thanks, will try
22:51 🔗 Sk1d has joined #archiveteam-bs
22:51 🔗 jodizzle Seems to work on a random test html file for me.
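
Annotation: jodizzle's combined pattern can be wrapped in a small helper for trying it out. This is a sketch only, assuming GNU grep with -P; the function name extract_links and the sample HTML are hypothetical and not from the channel.

```shell
# Group 1 is now (href|src), so the quote character is captured by group 2,
# which is why the backreference changes from \1 to \2.
extract_links() {
    grep -Po '(href|src)=(["'"'"'])\K.*?(?=\2)' "$@"
}

# Reads stdin when no file is given; file names can be passed as arguments instead.
links=$(printf '%s\n' "<a href='https://example.org/'>x</a><img src=\"pic.jpg\">" | extract_links)
printf '%s\n' "$links"
```

Some limits of this approach: unquoted attribute values are skipped, attributes whose names merely end in href or src (such as data-src) are also matched, and relative URLs are printed as written rather than resolved.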
22:55 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
22:56 🔗 dashcloud has joined #archiveteam-bs
22:56 🔗 Mateon1 has joined #archiveteam-bs
23:18 🔗 BlueMaxim has quit IRC (Quit: Leaving)
23:19 🔗 adinbied has quit IRC (Left Channel.)
23:39 🔗 adinbied has joined #archiveteam-bs