| Time |
Nickname |
Message |
|
00:36
🔗
|
Despatche |
wait archive.org has winamp skins now? holy shittttt |
|
01:23
🔗
|
Flashfire |
https://motherboard.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites |
|
01:24
🔗
|
|
bithippo has joined #archiveteam-ot |
|
01:56
🔗
|
|
bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
|
02:46
🔗
|
|
Despatche has quit IRC (Ping timeout: 633 seconds) |
|
02:55
🔗
|
|
Despatche has joined #archiveteam-ot |
|
03:02
🔗
|
|
Despatche has quit IRC (Read error: Operation timed out) |
|
03:49
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
|
03:53
🔗
|
|
icedice has joined #archiveteam-ot |
|
04:57
🔗
|
|
Despatche has joined #archiveteam-ot |
|
04:58
🔗
|
|
Sanqui has quit IRC (Read error: Operation timed out) |
|
05:00
🔗
|
|
Sanqui has joined #archiveteam-ot |
|
05:01
🔗
|
|
Despatche has quit IRC (Remote host closed the connection) |
|
05:02
🔗
|
|
Despatche has joined #archiveteam-ot |
|
05:33
🔗
|
w0rmhole |
sorry ivan i lost it in the tsunami of messages here, what did i need to use to stop `grab-site' from falling into link loops like this?: http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24 ................ |
|
05:33
🔗
|
w0rmhole |
original command i ran: `grab-site --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://atarimusic.exxoshost.co.uk/"' |
|
05:34
🔗
|
|
Despatche has quit IRC (Remote host closed the connection) |
|
05:35
🔗
|
|
Despatche has joined #archiveteam-ot |
|
05:43
🔗
|
ivan |
w0rmhole: try an ignore that includes \?.+\? because well-formed URLs have at most one '?' |
|
05:43
🔗
|
ivan |
or even \?.*\? |
|
05:44
🔗
|
w0rmhole |
sorry im not sure how to do that |
|
05:44
🔗
|
w0rmhole |
do i literally type: |
|
05:44
🔗
|
w0rmhole |
\?.*\? |
|
05:44
🔗
|
ivan |
yep |
|
05:44
🔗
|
w0rmhole |
in ignores |
|
05:44
🔗
|
ivan |
into the ignores file |
|
05:44
🔗
|
ivan |
on a new line, hit save |
|
05:44
🔗
|
w0rmhole |
ok |
|
05:44
🔗
|
w0rmhole |
would you recommend one over the other? |
|
05:45
🔗
|
ivan |
\?.*\? |
|
05:45
🔗
|
w0rmhole |
okay, thank you for the help |
|
05:45
🔗
|
ivan |
.* matches zero or more characters |
|
05:45
🔗
|
ivan |
.+ matches one or more characters |
|
05:45
🔗
|
w0rmhole |
lets hope my comp. doesnt shit itself again |
|
05:46
🔗
|
w0rmhole |
i walked away for a day and came back to it choking on urls like that |
|
05:46
🔗
|
ivan |
\? is a literal '?' because an unescaped '?' means "make the last thing optional to match" |
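
The three regex pieces ivan describes can be checked directly. A minimal sketch, assuming the ignores are applied as an unanchored search over the whole URL (this log does not confirm that). The `is_ignored` helper and the `example.invalid` URL are made up for illustration.

```python
import re

# The two candidate ignore patterns from the conversation.
zero_or_more = re.compile(r"\?.*\?")  # '?', then anything (possibly nothing), then '?'
one_or_more = re.compile(r"\?.+\?")   # '?', then at least one character, then '?'

# A shortened copy of the looping URL quoted earlier, and an ordinary forum URL.
looping_url = (
    "http://atarimusic.exxoshost.co.uk/forum/topic"
    "?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24"
)
normal_url = "https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241"


def is_ignored(pattern, url):
    """Hypothetical helper: True if the pattern matches anywhere in the URL."""
    return pattern.search(url) is not None


# Both patterns catch the loop; neither touches a URL with a single '?'.
print(is_ignored(zero_or_more, looping_url), is_ignored(one_or_more, looping_url))
print(is_ignored(zero_or_more, normal_url), is_ignored(one_or_more, normal_url))

# The only difference: two adjacent '?' characters with nothing between them.
adjacent = "http://example.invalid/a??"
print(is_ignored(zero_or_more, adjacent), is_ignored(one_or_more, adjacent))

# An unescaped '?' makes the previous item optional instead of matching a literal '?'.
print(bool(re.fullmatch(r"colou?r", "color")), bool(re.fullmatch(r"colou?r", "colour")))
```

Since `.*` also covers the adjacent case, it is the slightly broader of the two, which fits the recommendation above.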
|
05:46
🔗
|
w0rmhole |
and kept expanding |
|
05:46
🔗
|
w0rmhole |
okay i think i get it |
|
05:47
🔗
|
w0rmhole |
i thought your original suggestion included brackets |
|
05:47
🔗
|
w0rmhole |
[ |
|
05:47
🔗
|
w0rmhole |
] |
|
05:47
🔗
|
ivan |
<ivan> |
|
05:47
🔗
|
ivan |
w0rmhole: \?topic.+\?topic ignore? |
|
05:47
🔗
|
ivan |
<ivan> |
|
05:47
🔗
|
ivan |
and/or [\?&]p=.+[\?&]p= ignore |
|
05:48
🔗
|
w0rmhole |
yeah yeah that |
|
05:48
🔗
|
ivan |
maybe you were seeing a different loop then |
|
05:48
🔗
|
w0rmhole |
possibly |
|
05:48
🔗
|
w0rmhole |
let me check |
|
05:48
🔗
|
w0rmhole |
my terminal windows |
|
05:48
🔗
|
w0rmhole |
oh yep |
|
05:48
🔗
|
w0rmhole |
http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108? |
|
05:48
🔗
|
w0rmhole |
....... |
|
05:49
🔗
|
ivan |
multiple ? in a URL will almost always be bad but I don't know if I want to make it a default ignore |
|
05:50
🔗
|
w0rmhole |
so would i apply both: |
|
05:50
🔗
|
w0rmhole |
\?.*\? |
|
05:50
🔗
|
w0rmhole |
and |
|
05:50
🔗
|
w0rmhole |
\?topic.+\?topic ignore? |
|
05:50
🔗
|
w0rmhole |
in the ignores file? sorry i am still learning |
|
05:50
🔗
|
ivan |
\?.*\? will ignore anything that \?topic.+\?topic would ignore |
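
Why the general pattern covers the topic-specific one: anything matching `\?topic.+\?topic` necessarily contains two `?` characters, so `\?.*\?` matches it too. A rough sketch using shortened copies of the two looping URLs from this log; the `example.invalid` URL is invented, and unanchored search semantics are assumed.

```python
import re

patterns = {
    "any_two_question_marks": re.compile(r"\?.*\?"),
    "topic_loop": re.compile(r"\?topic.+\?topic"),
    "p_loop": re.compile(r"[\?&]p=.+[\?&]p="),
}

urls = {
    "start_loop": (
        "http://atarimusic.exxoshost.co.uk/forum/topic"
        "?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24"
    ),
    "p_loop": "http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108",
    # Invented URL: one '?', but the p parameter repeated with '&'.
    "repeated_p": "http://example.invalid/x?a=1&p=2&p=3",
}

# Record which pattern matches which URL.
matches = {
    (pattern_name, url_name): pattern.search(url) is not None
    for pattern_name, pattern in patterns.items()
    for url_name, url in urls.items()
}

for key, matched in sorted(matches.items()):
    print(key, matched)
```

Note that the `[\?&]p=` pattern is not fully covered by `\?.*\?`: because it also accepts `&`, it can match a URL that has only one `?`, as the invented third URL shows.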
|
05:51
🔗
|
w0rmhole |
ok so just using \?.*\? i'll be set for both? |
|
05:51
🔗
|
ivan |
sure |
|
05:51
🔗
|
w0rmhole |
thanks |
|
05:52
🔗
|
w0rmhole |
hopefully that will work |
|
05:53
🔗
|
w0rmhole |
so with the whole phpsessid shit (since it's in the url) i dont need to worry about that? |
|
05:54
🔗
|
ivan |
I don't know what you mean |
|
05:55
🔗
|
w0rmhole |
sorry, bad english, if i visit https://www.exxoshost.co.uk/forum/ and click on one of the links i get https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241&sid=baa4aee6927b67f3d18a63e8f3e7edf8 note the sid=xxxxxxxxxxx...... in the url |
|
05:56
🔗
|
w0rmhole |
grab-site will prevent this automatically? |
|
05:56
🔗
|
ivan |
reload the page and you'll see it no longer has the ?sid= once the cookie is set |
|
05:56
🔗
|
ivan |
you can start the crawl with a cookie or pick some irrelevant page as the first url |
|
05:57
🔗
|
ivan |
grab-site takes multiple URLs if needed |
|
05:57
🔗
|
w0rmhole |
so something like `grab-site https://www.exxoshost.co.uk/forum?archiveteam --igset forums' ? |
|
05:57
🔗
|
ivan |
sure |
|
05:58
🔗
|
w0rmhole |
or do i need to do it like this: `grab-site https://www.exxoshost.co.uk/forum/?archiveteam --igset forums' or is no different? |
|
05:58
🔗
|
w0rmhole |
sorry for asking |
|
05:59
🔗
|
ivan |
you might need to quote the URL argument because shells expand ? to a matching character in a filename |
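
To illustrate the quoting point: in a shell glob, `?` stands for exactly one character, so an unquoted URL could in principle be rewritten if a matching filename exists. A small sketch using Python's `fnmatch` to mimic that glob rule and `shlex.quote` to produce a safely quoted argument. The filename `forumXarchiveteam` is invented for the demonstration.

```python
import fnmatch
import shlex

url = "https://www.exxoshost.co.uk/forum/?archiveteam"

# Glob rule: '?' matches any single character, so this invented
# filename would match the pattern 'forum?archiveteam'.
glob_hit = fnmatch.fnmatchcase("forumXarchiveteam", "forum?archiveteam")
print(glob_hit)

# shlex.quote wraps the URL in single quotes because '?' is not a shell-safe character.
quoted = shlex.quote(url)
print("grab-site", quoted, "--igsets=forums")

# Splitting the quoted form the way a POSIX shell would gives back the original URL.
round_trip = shlex.split(quoted)
print(round_trip)
```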
|
06:00
🔗
|
w0rmhole |
ok good idea |
|
06:13
🔗
|
|
Despatche has quit IRC (Quit: Error: Connection reset by peer) |
|
06:15
🔗
|
w0rmhole |
ivan: is `--wpull-args=--tries=1000' a bit extreme? |
|
06:16
🔗
|
ivan |
yes |
|
06:16
🔗
|
w0rmhole |
15 more reasonable? |
|
06:16
🔗
|
ivan |
even 5 should be plenty |
|
06:17
🔗
|
w0rmhole |
just the website im trying keeps erroring out but after 700 or so attempts (time out) it finally gets what it was trying to get |
|
06:17
🔗
|
ivan |
was the site down for a while? |
|
06:18
🔗
|
w0rmhole |
dont believe so |
|
06:18
🔗
|
w0rmhole |
wasn't for that site i said earlier, but i had it happen to another site and, well, that one too, but not THAT many retries |
|
06:18
🔗
|
ivan |
if something takes 700 tries to succeed that's not something grab-site is designed to archive |
|
06:19
🔗
|
ivan |
you can do whatever you feel you need to do but trying 1000 times might get you banned from some sites |
|
06:19
🔗
|
w0rmhole |
a .zip file |
|
06:19
🔗
|
w0rmhole |
good point |
|
06:19
🔗
|
w0rmhole |
3.2mb zip file |
|
06:43
🔗
|
w0rmhole |
ivan: this look like an acceptable cookies.txt for this to you? https://share.dmca.gripe/dQYlzLvcwDF8eWUP.txt |
|
06:43
🔗
|
w0rmhole |
generated it by visiting https://www.exxoshost.co.uk/forum then refreshing; and http://atarimusic.exxoshost.co.uk/ and refreshing. |
|
06:45
🔗
|
ivan |
you probably need to remove the #HttpOnly_ |
|
06:45
🔗
|
w0rmhole |
the entire entry or just that |
|
06:46
🔗
|
ivan |
just that string |
|
06:46
🔗
|
ivan |
start with the . on the domain |
|
06:46
🔗
|
ivan |
you can also try changing that 0 in the first cookie to 1570084624 |
|
06:46
🔗
|
w0rmhole |
whats that do? |
|
06:46
🔗
|
w0rmhole |
idk what TRUE and FALSE mean either in this |
|
06:47
🔗
|
ivan |
https://unix.stackexchange.com/questions/36531/format-of-cookies-when-using-wget |
|
06:47
🔗
|
w0rmhole |
so i do not touch TRUE/FALSE |
|
06:48
🔗
|
ivan |
yeah those should be fine |
|
06:50
🔗
|
w0rmhole |
okay but i am still confused as to why i replace 0 with 157...... |
|
06:50
🔗
|
w0rmhole |
i am sorry |
|
06:50
🔗
|
w0rmhole |
why is 0 bad |
|
06:50
🔗
|
ivan |
it might expire when the crawl starts |
|
06:50
🔗
|
ivan |
I haven't tested it |
|
06:50
🔗
|
w0rmhole |
oh ok |
|
06:51
🔗
|
w0rmhole |
and the number you suggested? |
|
06:51
🔗
|
w0rmhole |
is that a yr from now? |
|
06:53
🔗
|
w0rmhole |
yes |
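
The two cookies.txt tweaks discussed here (dropping the `#HttpOnly_` prefix and replacing an expiry of `0`) can be sketched as follows. This assumes the tab-separated Netscape layout described in the linked Stack Exchange answer: domain, subdomain flag, path, secure flag, expiry, name, value. The cookie line, its name and value, and the `normalize_cookie_line` helper are invented for illustration; the real file from the paste is not reproduced here.

```python
from datetime import datetime, timezone

HTTP_ONLY_PREFIX = "#HttpOnly_"


def normalize_cookie_line(line, fallback_expiry):
    """Hypothetical helper: strip '#HttpOnly_' and replace a 0 expiry field."""
    if line.startswith(HTTP_ONLY_PREFIX):
        # Without this, the leading '#' can make the line look like a comment.
        line = line[len(HTTP_ONLY_PREFIX):]
    if not line.strip() or line.startswith("#"):
        return line  # blank line or a real comment: leave untouched
    fields = line.split("\t")
    if len(fields) == 7 and fields[4] == "0":
        fields[4] = str(fallback_expiry)  # 0 marks a session cookie
    return "\t".join(fields)


# Invented example line in the assumed seven-field layout.
example = "#HttpOnly_.exxoshost.co.uk\tTRUE\t/\tFALSE\t0\texample_name\texample_value"
fixed = normalize_cookie_line(example, 1570084624)
print(fixed)

# The suggested expiry value, decoded as a Unix timestamp in UTC.
expiry_date = datetime.fromtimestamp(1570084624, tz=timezone.utc).date().isoformat()
print(expiry_date)
```

As ivan says, whether a `0` expiry actually causes the cookie to be dropped at crawl start was not tested, so the replacement is a precaution only.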
|
06:54
🔗
|
w0rmhole |
ivan: look good? https://share.dmca.gripe/CJ1Jm0dcCyCgmnAw.txt |
|
06:54
🔗
|
ivan |
yes |
|
06:55
🔗
|
w0rmhole |
thank you very much for the help |
|
06:55
🔗
|
w0rmhole |
one last thing, does grab-site auto copy cookies.txt to folder where warc is stored? |
|
06:57
🔗
|
ivan |
it tells wpull to --load-cookies from a file and --save-cookies to DIR/cookies.txt |
|
06:58
🔗
|
ivan |
a lot of stuff may happen inside wpull |
|
06:58
🔗
|
w0rmhole |
thanks |
|
06:59
🔗
|
ivan |
https://github.com/ludios/grab-site/blob/29b9825dc5f49c25f01d93746cfb0638c724c22a/libgrabsite/main.py#L240-L259 |
|
07:00
🔗
|
ivan |
and up to line 216 above |
|
07:00
🔗
|
w0rmhole |
ty |
|
07:15
🔗
|
|
djsundog has quit IRC (Read error: Operation timed out) |
|
07:15
🔗
|
|
mal_ has quit IRC (Read error: Operation timed out) |
|
07:16
🔗
|
|
ivan has quit IRC (Read error: Operation timed out) |
|
07:17
🔗
|
|
ivan has joined #archiveteam-ot |
|
07:17
🔗
|
|
svchfoo1 sets mode: +o ivan |
|
07:17
🔗
|
|
Albardin has quit IRC (Read error: Operation timed out) |
|
07:18
🔗
|
|
vectr0n_ has joined #archiveteam-ot |
|
07:19
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
|
07:21
🔗
|
|
dxrt_ has quit IRC (Read error: Operation timed out) |
|
07:22
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
|
07:22
🔗
|
|
Mateon1 has joined #archiveteam-ot |
|
07:23
🔗
|
|
Albardin has joined #archiveteam-ot |
|
07:24
🔗
|
|
kiska1 has quit IRC (Read error: Connection reset by peer) |
|
07:24
🔗
|
|
kiska1 has joined #archiveteam-ot |
|
07:25
🔗
|
|
vectr0n has quit IRC (Ping timeout: 600 seconds) |
|
07:25
🔗
|
|
vectr0n_ is now known as vectr0n |
|
07:26
🔗
|
|
godane has joined #archiveteam-ot |
|
07:27
🔗
|
|
svchfoo1 sets mode: +o godane |
|
07:27
🔗
|
|
dxrt_ has joined #archiveteam-ot |
|
07:27
🔗
|
|
dxrt sets mode: +o dxrt_ |
|
07:29
🔗
|
|
mal_ has joined #archiveteam-ot |
|
07:33
🔗
|
|
djsundog has joined #archiveteam-ot |
|
07:39
🔗
|
w0rmhole |
ivan: if i want to use both --tries and --load-cookies in grab-site do i do: `--wpull-args=--tries=5 --wpull-args=--load-cookies=/PATH\ TO/cookies.txt' |
|
07:39
🔗
|
w0rmhole |
? |
|
07:41
🔗
|
w0rmhole |
nvm the github page tells me to wrap in quotes |
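
A sketch of what "wrap in quotes" could look like: both wpull options go into one quoted string passed through a single `--wpull-args=`. It assumes the string is later split with shell-style rules, which this log does not confirm. `shlex.split` stands in for that step, and the `/PATH TO/cookies.txt` path is the placeholder from the question above.

```python
import shlex

# Both wpull options in one string; inner double quotes protect the space in the path.
wpull_args = '--tries=5 --load-cookies="/PATH TO/cookies.txt"'

# Shell-style splitting (assumed) yields two separate wpull arguments.
parsed = shlex.split(wpull_args)
print(parsed)

# Build the full command as an argument list, then render it as one shell-safe line.
command = [
    "grab-site",
    "http://atarimusic.exxoshost.co.uk/",
    "--wpull-args=" + wpull_args,
]
print(shlex.join(command))
```

Passing the command as a list (for example to `subprocess.run`) avoids a second layer of shell quoting entirely.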
|
07:44
🔗
|
w0rmhole |
one question, does running something like `grab-site https://www.reddit.com/r/oculus/ --igsets=reddit' grab all of that subreddit or only as much as reddits api allows? |
|
07:46
🔗
|
Flashfire |
Is everyone else having trouble with instagram |
|
07:51
🔗
|
|
m007a83_ has joined #archiveteam-ot |
|
07:54
🔗
|
|
m007a83 has quit IRC (Read error: Operation timed out) |
|
07:56
🔗
|
|
icedice has quit IRC (Ping timeout: 252 seconds) |
|
08:01
🔗
|
|
wp494 has quit IRC (west.us.hub irc.Prison.NET) |
|
08:07
🔗
|
|
wp494 has joined #archiveteam-ot |
|
08:48
🔗
|
w0rmhole |
does anybody here know who danooct1 is |
|
08:49
🔗
|
Flashfire |
the malware guy? |
|
08:49
🔗
|
w0rmhole |
yeah |
|
08:49
🔗
|
Flashfire |
I know of him why? |
|
08:50
🔗
|
w0rmhole |
i found a guilty-pleasure song he co-produced |
|
08:50
🔗
|
w0rmhole |
https://www.youtube.com/watch?v=UJRt41HNLJw |
|
08:50
🔗
|
Flashfire |
Tubeup it |
|
08:50
🔗
|
w0rmhole |
its so bad it' |
|
08:50
🔗
|
w0rmhole |
is godlike |
|
08:50
🔗
|
w0rmhole |
just did |
|
08:53
🔗
|
Flashfire |
I have programs designed for windows NT |
|
08:53
🔗
|
w0rmhole |
need link nao pls |
|
08:53
🔗
|
w0rmhole |
lol |
|
08:54
🔗
|
Flashfire |
Looks like I will need to rip it later |
|
08:58
🔗
|
ivan |
w0rmhole: reddit is terrible and doesn't link to everything |
|
08:58
🔗
|
ivan |
there's an issue on snscrape for reddit support |
|
08:58
🔗
|
w0rmhole |
oh that sucks -_- |
|
09:02
🔗
|
eientei95 |
Flashfire: https://puu.sh/BEWV1/0e44e80636.mp3 |
|
09:49
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
|
10:20
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
|
12:18
🔗
|
|
arkiver has quit IRC (Read error: Operation timed out) |
|
12:18
🔗
|
|
kiska1 has quit IRC (Read error: Operation timed out) |
|
12:19
🔗
|
|
mal_ has quit IRC (Write error: Broken pipe) |
|
12:19
🔗
|
|
djsundog has quit IRC (Read error: Operation timed out) |
|
12:19
🔗
|
|
dxrt_ has quit IRC (Read error: Operation timed out) |
|
12:19
🔗
|
|
Albardin has quit IRC (Write error: Broken pipe) |
|
12:20
🔗
|
|
wp494 has quit IRC (Ping timeout: 255 seconds) |
|
12:20
🔗
|
|
wp494 has joined #archiveteam-ot |
|
12:21
🔗
|
|
arkiver has joined #archiveteam-ot |
|
12:31
🔗
|
|
Albardin has joined #archiveteam-ot |
|
12:32
🔗
|
|
kiska1 has joined #archiveteam-ot |
|
12:36
🔗
|
|
dxrt_ has joined #archiveteam-ot |
|
12:36
🔗
|
|
dxrt sets mode: +o dxrt_ |
|
12:40
🔗
|
|
mal_ has joined #archiveteam-ot |
|
12:41
🔗
|
|
djsundog has joined #archiveteam-ot |
|
14:03
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
|
14:26
🔗
|
|
odemg has joined #archiveteam-ot |
|
14:57
🔗
|
|
bithippo has joined #archiveteam-ot |
|
15:13
🔗
|
|
bithippo has quit IRC (Textual IRC Client: www.textualapp.com) |
|
15:36
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
|
16:01
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
|
16:07
🔗
|
|
schbirid has joined #archiveteam-ot |
|
16:13
🔗
|
|
odemg has joined #archiveteam-ot |
|
16:14
🔗
|
ivan |
twitter search is down right now in case anyone is running a bunch of snscrape and wonders why no results |
|
16:20
🔗
|
|
astrid has joined #archiveteam-ot |
|
18:07
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
|
18:18
🔗
|
|
odemg has joined #archiveteam-ot |
|
20:08
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 268 seconds) |
|
20:08
🔗
|
|
Mateon1 has joined #archiveteam-ot |
|
20:12
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
|
21:01
🔗
|
|
schbirid has quit IRC (Read error: Connection reset by peer) |
|
21:42
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
|
21:54
🔗
|
|
odemg has joined #archiveteam-ot |
|
22:04
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
|
22:05
🔗
|
|
Jens has joined #archiveteam-ot |
|
22:39
🔗
|
|
Stiletto has joined #archiveteam-ot |
|
22:40
🔗
|
|
Stilett0 has quit IRC (Ping timeout: 252 seconds) |
|
23:06
🔗
|
|
m007a83_ is now known as m007a83 |
|
23:19
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
|
23:23
🔗
|
|
dashcloud has joined #archiveteam-ot |
|
23:36
🔗
|
|
BlueMax has joined #archiveteam-ot |
|
23:46
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
|
23:48
🔗
|
|
Stilett0 has joined #archiveteam-ot |
|
23:48
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
|
23:57
🔗
|
|
odemg has joined #archiveteam-ot |
|
23:59
🔗
|
|
Stiletto has joined #archiveteam-ot |